stark_qa.skb
stark_qa.skb.amazon
- class stark_qa.skb.amazon.AmazonSKB(root=None, categories=['Sports_and_Outdoors'], meta_link_types=['brand', 'category', 'color'], max_entries=25, download_processed=True, **kwargs)[source]
Bases:
SKB
- COMMON = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
- QA_CATEGORIES = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Baby', 'Beauty', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Health_and_Personal_Care', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
- REVIEW_CATEGORIES = {'All_Beauty', 'Amazon_Fashion', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Luxury_Beauty', 'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Prime_Pantry', 'Software', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
- candidate_types = ['product']
- construct_raw_node_info(df_meta, df_review, df_qa)[source]
Construct raw node information.
- Parameters:
df_meta (pd.DataFrame) – DataFrame containing meta information.
df_review (pd.DataFrame) – DataFrame containing review information.
df_qa (pd.DataFrame) – DataFrame containing QA information.
- Returns:
Dictionary containing node information.
- Return type:
dict
- create_raw_product_graph(df, columns)[source]
Create raw product graph.
- Parameters:
df (pd.DataFrame) – DataFrame containing meta information.
columns (list) – List of columns to create edges.
- Returns:
Tuple containing edge index and edge types.
- Return type:
tuple
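To illustrate what an "edge index and edge types" pair might look like, the sketch below builds both from link columns. It uses a list of plain dicts as a stand-in for the DataFrame. The row layout, the `asin`-to-index mapping, and the helper name `build_edges` are assumptions for illustration, not the library's implementation.

```python
def build_edges(rows, columns):
    """Hypothetical helper: turn link columns into (edge_index, edge_types)."""
    # Map each product identifier to a contiguous node index.
    asin2id = {row["asin"]: i for i, row in enumerate(rows)}
    sources, targets, edge_types = [], [], []
    for row in rows:
        for type_id, column in enumerate(columns):
            for linked_asin in row.get(column, []):
                # Skip links that point to products outside the table.
                if linked_asin not in asin2id:
                    continue
                sources.append(asin2id[row["asin"]])
                targets.append(asin2id[linked_asin])
                edge_types.append(type_id)
    return [sources, targets], edge_types


rows = [
    {"asin": "A1", "also_buy": ["A2"], "also_view": ["A3", "ZZ"]},
    {"asin": "A2", "also_buy": [], "also_view": ["A1"]},
    {"asin": "A3", "also_buy": ["A1"], "also_view": []},
]
edge_index, edge_types = build_edges(rows, ["also_buy", "also_view"])
```

Here each edge type ID is simply the position of its column in `columns`. The real method may encode types differently.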
- get_chunk_info(idx, attribute)[source]
Get chunk information for the specified attribute.
- Parameters:
idx (int) – Index of the node.
attribute (str) – Attribute to get chunk information for.
- Returns:
Chunk information.
- Return type:
str
- get_doc_info(idx, add_rel=True, compact=False)[source]
Get document information for the specified node.
- Parameters:
idx (int) – Index of the node.
add_rel (bool) – Whether to add relationship information.
compact (bool) – Whether to compact the text.
- Returns:
Document information.
- Return type:
str
- get_rel_info(idx, rel_types=None, n_rel=-1)[source]
Get relation information for the specified node.
- Parameters:
idx (int) – Index of the node.
rel_types (Union[list, None]) – List of relation types to include, or None to include all relation types.
n_rel (int) – Number of relations to include. Defaults to -1, which includes all relations.
- Returns:
Relation information.
- Return type:
str
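The `n_rel=-1` default acts as a "no limit" sentinel. A minimal sketch of that convention follows. The helper name `limit_relations` is hypothetical, and keeping the first `n_rel` items (rather than, say, sampling) is an assumption.

```python
def limit_relations(items, n_rel=-1):
    """Hypothetical helper: keep every item when n_rel is -1, else the first n_rel."""
    items = list(items)
    if n_rel == -1:
        return items
    return items[:n_rel]


all_items = limit_relations(["a", "b", "c"])
first_two = limit_relations(["a", "b", "c"], n_rel=2)
```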
- has_also_buy(idx, also_buy_item)[source]
Check if the node has the specified also_buy item.
- Parameters:
idx (int) – Index of the node.
also_buy_item (int) – Item to check.
- Returns:
Whether the node has the specified also_buy item.
- Return type:
bool
- has_also_view(idx, also_view_item)[source]
Check if the node has the specified also_view item.
- Parameters:
idx (int) – Index of the node.
also_view_item (int) – Item to check.
- Returns:
Whether the node has the specified also_view item.
- Return type:
bool
- has_brand(idx, brand)[source]
Check if the node has the specified brand.
- Parameters:
idx (int) – Index of the node.
brand (str) – Brand name.
- Returns:
Whether the node has the specified brand.
- Return type:
bool
- link_columns = ['also_buy', 'also_view']
- meta_columns = ['asin', 'title', 'global_category', 'category', 'price', 'brand', 'feature', 'rank', 'details', 'description']
- node_attr_dict = {'brand': ['brand_name'], 'category': ['category_name'], 'color': ['color_name'], 'product': ['title', 'dimensions', 'weight', 'description', 'features', 'reviews', 'Q&A']}
- post_process(raw_info, meta_link_types, cache_path=None)[source]
Post-process the raw information to add meta link types.
- Parameters:
raw_info (dict) – Raw information.
meta_link_types (list) – List of meta link types to add.
cache_path (str) – Path to cache the processed data.
- Returns:
Post-processed data.
- Return type:
dict
- qa_columns = ['questionType', 'answerType', 'question', 'answer', 'answerTime']
- review_columns = ['reviewerID', 'summary', 'style', 'reviewText', 'vote', 'overall', 'verified', 'reviewTime']
- stark_qa.skb.amazon.read_qa(path)[source]
Read and parse QA files.
- Parameters:
path (str) – Path to the QA file.
- Returns:
DataFrame containing the QA data.
- Return type:
pd.DataFrame
- stark_qa.skb.amazon.read_review(path)[source]
Read and parse review files.
- Parameters:
path (str) – Path to the review file.
- Returns:
DataFrame containing the reviews.
- Return type:
pd.DataFrame
stark_qa.skb.knowledge_base
- class stark_qa.skb.knowledge_base.SKB(node_info, edge_index, node_type_dict=None, edge_type_dict=None, node_types=None, edge_types=None, indirected=True, **kwargs)[source]
Bases:
object
- edge_type2id(edge_type)[source]
Get the edge type ID given the edge type.
- Return type:
int
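Conceptually this is a reverse lookup over an `{id: name}` dictionary such as `MagSKB.edge_type_dict`. The sketch below shows one way to do it. The standalone function is a stand-in written for illustration, not the method's actual body.

```python
edge_type_dict = {
    0: "author___affiliated_with___institution",
    1: "paper___cites___paper",
    2: "paper___has_topic___field_of_study",
    3: "author___writes___paper",
}


def edge_type2id(edge_type):
    """Hypothetical stand-in: find the ID whose name equals edge_type."""
    for type_id, name in edge_type_dict.items():
        if name == edge_type:
            return type_id
    raise KeyError(f"unknown edge type: {edge_type}")


cites_id = edge_type2id("paper___cites___paper")
```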
- get_all_paths(start_node_id, node_types, edge_types, max_num=None, direction='in-and-out')[source]
Get all paths given the node types and edge types. Use '*' to indicate any edge type.
- Return type:
list
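The idea of matching a path against alternating node-type and edge-type patterns can be sketched in plain Python. The helper `all_paths`, the triple-based edge list, and the short edge names are assumptions for illustration. Unlike the default `direction='in-and-out'`, this sketch follows outgoing edges only.

```python
def all_paths(node_types, edges, start, type_seq, edge_seq):
    """Hypothetical helper: enumerate paths that match the given patterns.

    type_seq has one more entry than edge_seq; "*" in edge_seq matches any edge type.
    """
    # Start with a single-node path if the start node has the right type.
    paths = [[start]] if node_types[start] == type_seq[0] else []
    for step, wanted_edge in enumerate(edge_seq):
        wanted_node = type_seq[step + 1]
        extended = []
        for path in paths:
            for source, target, etype in edges:
                if source != path[-1]:
                    continue
                if wanted_edge != "*" and etype != wanted_edge:
                    continue
                if node_types[target] == wanted_node:
                    extended.append(path + [target])
        paths = extended
    return paths


node_types = {0: "author", 1: "paper", 2: "paper", 3: "field_of_study"}
edges = [
    (0, 1, "writes"),
    (0, 2, "writes"),
    (1, 2, "cites"),
    (1, 3, "has_topic"),
]
cited = all_paths(node_types, edges, 0, ["author", "paper", "paper"], ["writes", "cites"])
topics = all_paths(node_types, edges, 0, ["author", "paper", "field_of_study"], ["*", "*"])
```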
- get_candidate_ids()[source]
Get the candidate IDs.
- Return type:
list
- get_doc_info(idx, add_rel=False, compact=False)[source]
Return a text document containing information about the node.
- Parameters:
idx (int) – Node index.
add_rel (bool) – Whether to add relational information explicitly.
compact (bool) – Whether to compact the text.
- Return type:
str
- get_edge_ids_by_type(edge_type)[source]
Get the edge IDs given the edge type.
- Return type:
list
- get_edge_type_by_id(edge_id)[source]
Get the edge type given the edge ID.
- Return type:
str
- get_neighbor_nodes(idx, edge_type='*')[source]
Get the neighbor nodes given the node ID and the edge type.
- Parameters:
idx (int) – Node index.
edge_type (str) – Edge type. Use '*' to indicate any edge type.
- Return type:
list
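The wildcard behavior can be sketched over a list of `(source, target, edge_type)` triples. The helper name and edge representation are assumptions. So is the undirected treatment of edges, which mirrors the `indirected=True` default but may not match the actual implementation.

```python
def neighbor_nodes(edges, idx, edge_type="*"):
    """Hypothetical helper: neighbors of idx, optionally filtered by edge type."""
    neighbors = set()
    for source, target, etype in edges:
        # "*" disables the edge-type filter.
        if edge_type != "*" and etype != edge_type:
            continue
        # Treat edges as undirected: either endpoint may match idx.
        if source == idx:
            neighbors.add(target)
        elif target == idx:
            neighbors.add(source)
    return sorted(neighbors)


edges = [
    (0, 1, "also_buy"),
    (0, 2, "also_view"),
    (3, 0, "also_buy"),
]
any_type = neighbor_nodes(edges, 0)
buy_only = neighbor_nodes(edges, 0, "also_buy")
```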
- get_node_ids_by_type(node_type)[source]
Get the node IDs given the node type.
- Return type:
list
- get_node_ids_by_value(node_type, key, value)[source]
Get the node IDs given the node type and the value of a specific attribute.
- Return type:
list
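A lookup of this kind amounts to filtering nodes by type and by one attribute. The sketch below assumes `node_info` is a dictionary from node ID to an attribute dictionary with a `type` key. That layout and the helper name are illustrative assumptions.

```python
def node_ids_by_value(node_info, node_type, key, value):
    """Hypothetical helper: IDs of nodes of one type whose attribute equals value."""
    return [
        node_id
        for node_id, attrs in sorted(node_info.items())
        if attrs.get("type") == node_type and attrs.get(key) == value
    ]


node_info = {
    0: {"type": "drug", "name": "aspirin"},
    1: {"type": "disease", "name": "migraine"},
    2: {"type": "drug", "name": "ibuprofen"},
}
matches = node_ids_by_value(node_info, "drug", "name", "aspirin")
```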
- get_node_type_by_id(node_id)[source]
Get the node type given the node ID.
- Return type:
str
- get_rel_info(idx, rel_type=None)[source]
Return a text document containing relation information about the node.
- Parameters:
idx (int) – Node index.
rel_type (str, optional) – Relation type.
- Return type:
str
- get_tuples()[source]
Get all possible tuples of node types and edge types.
- Return type:
list
- is_rel_type(edge_type)[source]
Check if the edge type is a relation type.
- k_hop_neighbor(node_idx, num_hops, **kwargs)[source]
Get the k-hop neighbor subgraph.
- Parameters:
node_idx (int) – Node index.
num_hops (int) – Number of hops.
**kwargs – Additional arguments.
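For intuition, a k-hop neighborhood can be collected with a breadth-first search. The sketch below returns only the set of reachable node IDs from a hypothetical adjacency dictionary. The actual method returns a subgraph and may rely on a graph library.

```python
from collections import deque


def k_hop_nodes(adjacency, node_idx, num_hops):
    """Hypothetical helper: nodes within num_hops of node_idx, itself included."""
    visited = {node_idx}
    frontier = deque([(node_idx, 0)])
    while frontier:
        node, depth = frontier.popleft()
        # Do not expand nodes that already sit at the hop limit.
        if depth == num_hops:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return sorted(visited)


# A simple chain 0 - 1 - 2 - 3.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
two_hops = k_hop_nodes(adjacency, 0, 2)
```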
- node_attr_dict()[source]
Return the node attribute dictionary.
- node_type2id(node_type)[source]
Get the node type ID given the node type.
- Return type:
int
- node_type_lst()[source]
Return the list of node types.
- num_edges(node_type_id=None)[source]
Return the number of edges.
- num_nodes(node_type_id=None)[source]
Return the number of nodes.
- rel_type_lst()[source]
Return the list of relation types.
- sample_paths(node_types, edge_types, start_node_id=None, size=1)[source]
Sample paths given the node types and edge types. Use '*' to indicate any edge type.
- Return type:
list
stark_qa.skb.mag
- class stark_qa.skb.mag.MagSKB(root=None, download_processed=True, **kwargs)[source]
Bases:
SKB
- candidate_types = ['paper']
- edge_type_dict = {0: 'author___affiliated_with___institution', 1: 'paper___cites___paper', 2: 'paper___has_topic___field_of_study', 3: 'author___writes___paper'}
- get_doc_info(idx, compact=False, add_rel=True, n_rel=-1)[source]
Get document information for the specified node.
- Parameters:
idx (int) – Index of the node.
compact (bool) – Whether to compact the text.
add_rel (bool) – Whether to add relation information.
n_rel (int) – Number of relations to add. Defaults to -1, which adds all relations.
- Returns:
Document information.
- Return type:
str
- get_map(df)[source]
Create mappings between MAG IDs and internal IDs.
- Parameters:
df (DataFrame) – DataFrame containing MAG IDs.
- Returns:
Mappings from MAG IDs to internal IDs and vice versa.
- Return type:
tuple
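A bidirectional ID mapping of this kind can be sketched with two dictionaries. The helper name, the list input (standing in for a DataFrame column), and the first-seen ordering of internal IDs are assumptions for illustration.

```python
def build_id_maps(mag_ids):
    """Hypothetical helper: map external IDs to contiguous internal IDs and back."""
    mag2internal = {}
    for mag_id in mag_ids:
        # Assign the next free index the first time an ID is seen.
        if mag_id not in mag2internal:
            mag2internal[mag_id] = len(mag2internal)
    internal2mag = {internal: mag for mag, internal in mag2internal.items()}
    return mag2internal, internal2mag


mag2internal, internal2mag = build_id_maps([2091, 77, 2091, 512])
```

Duplicate IDs keep their first assigned index, so the two dictionaries remain exact inverses of each other.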
- get_rel_info(idx, rel_types=None, n_rel=-1)[source]
Get relation information for the specified node.
- Parameters:
idx (int) – Index of the node.
rel_types (Union[list, None]) – List of relation types to include, or None to include all relation types.
n_rel (int) – Number of relations to include. Defaults to -1, which includes all relations.
- Returns:
Relation information.
- Return type:
str
- load_edge(edge_type)[source]
Load edge data for the specified edge type.
- Parameters:
edge_type (str) – Type of edge to load.
- Returns:
A tuple containing edge tensor and edge numbers.
- Return type:
tuple
- load_english_paper_text(mag_ids, download_cache=True)[source]
Load English text data for the papers.
- Parameters:
mag_ids (list) – List of MAG IDs for the papers.
download_cache (bool) – Whether to download cached data.
- Returns:
DataFrame containing English titles and abstracts.
- Return type:
DataFrame
- load_meta_data()[source]
Load metadata for the MAG dataset.
- Returns:
DataFrames for authors, fields of study, institutions, and papers.
- Return type:
tuple
- node_attr_dict = {'author': ['name'], 'field_of_study': ['name'], 'institution': ['name'], 'paper': ['title', 'abstract', 'publication date', 'venue']}
- node_type_dict = {0: 'author', 1: 'institution', 2: 'field_of_study', 3: 'paper'}
- test_columns = ['title', 'abstract', 'text']
stark_qa.skb.prime
- class stark_qa.skb.prime.PrimeSKB(root=None, download_processed=True, **kwargs)[source]
Bases:
SKB
- META_DATA = ['id', 'type', 'name', 'source', 'details']
- NODE_TYPES = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']
- RELATION_TYPES = ['ppi', 'carrier', 'enzyme', 'target', 'transporter', 'contraindication', 'indication', 'off-label use', 'synergistic interaction', 'associated with', 'parent-child', 'phenotype absent', 'phenotype present', 'side effect', 'interacts with', 'linked to', 'expression present', 'expression absent']
- candidate_types = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']
- get_doc_info(idx, add_rel=True, compact=False, n_rel=-1)[source]
Get document information for the specified node.
- Parameters:
idx (int) – Index of the node.
add_rel (bool) – Whether to add relationship information.
compact (bool) – Whether to compact the text.
n_rel (int) – Number of relationships to add. Defaults to -1.
- Returns:
Document information.
- Return type:
str
- get_rel_info(idx, rel_types=None, n_rel=-1)[source]
Get relation information for the specified node.
- Parameters:
idx (int) – Index of the node.
rel_types (Union[list, None]) – List of relation types to include, or None to include all relation types.
n_rel (int) – Number of relations to include. Defaults to -1, which includes all relations.
- Returns:
Relation information.
- Return type:
str