stark_qa.skb

stark_qa.skb.amazon

class stark_qa.skb.amazon.AmazonSKB(root=None, categories=['Sports_and_Outdoors'], meta_link_types=['brand', 'category', 'color'], max_entries=25, download_processed=True, **kwargs)

Bases: SKB

COMMON = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
QA_CATEGORIES = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Baby', 'Beauty', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Health_and_Personal_Care', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
REVIEW_CATEGORIES = {'All_Beauty', 'Amazon_Fashion', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Luxury_Beauty', 'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Prime_Pantry', 'Software', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
candidate_types = ['product']
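
The relationship between the three category sets above can be checked directly. The snippet below copies the listed values and suggests that COMMON is the intersection of QA_CATEGORIES and REVIEW_CATEGORIES. This is an observation drawn from the listed values, not documented behavior of the class.

```python
# Values copied from the class attributes listed above.
COMMON = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
QA_CATEGORIES = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Baby', 'Beauty', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Health_and_Personal_Care', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
REVIEW_CATEGORIES = {'All_Beauty', 'Amazon_Fashion', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Luxury_Beauty', 'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Prime_Pantry', 'Software', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}

# Categories that appear in both the QA set and the review set.
shared = QA_CATEGORIES & REVIEW_CATEGORIES
print(shared == COMMON)  # True

# QA categories that have no review counterpart under the same name.
qa_only = sorted(QA_CATEGORIES - REVIEW_CATEGORIES)
print(qa_only)  # ['Baby', 'Beauty', 'Health_and_Personal_Care']
```
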
construct_raw_node_info(df_meta, df_review, df_qa)

Construct raw node information.

Parameters:
  • df_meta (pd.DataFrame) – DataFrame containing meta information.

  • df_review (pd.DataFrame) – DataFrame containing review information.

  • df_qa (pd.DataFrame) – DataFrame containing QA information.

Returns:

Dictionary containing node information.

Return type:

dict

create_raw_product_graph(df, columns)

Create raw product graph.

Parameters:
  • df (pd.DataFrame) – DataFrame containing meta information.

  • columns (list) – List of columns to create edges.

Returns:

Tuple containing edge index and edge types.

Return type:

tuple
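
To make the returned pair concrete, here is a minimal sketch of how an edge index and a parallel list of edge types could be derived from link columns such as also_buy and also_view. The helper build_edges, the list-of-dicts input, and the choice to drop links to unknown products are all assumptions for illustration. The actual method takes a pd.DataFrame, and its implementation is not described in this reference.

```python
def build_edges(rows, columns):
    """Hypothetical helper, not the library's implementation.

    rows: list of dicts, each with an 'asin' key and one list-valued key
    per link column. Returns ([sources, targets], edge_types).
    """
    # Map each ASIN to a contiguous node index.
    asin_to_idx = {row["asin"]: i for i, row in enumerate(rows)}
    sources, targets, edge_types = [], [], []
    for type_id, column in enumerate(columns):
        for row in rows:
            for linked_asin in row.get(column, []):
                # Assumption: links to products outside the table are dropped.
                if linked_asin in asin_to_idx:
                    sources.append(asin_to_idx[row["asin"]])
                    targets.append(asin_to_idx[linked_asin])
                    edge_types.append(type_id)
    return [sources, targets], edge_types


# Toy data; "ZZ" is not in the table and is skipped.
rows = [
    {"asin": "A1", "also_buy": ["A2"], "also_view": ["A3", "ZZ"]},
    {"asin": "A2", "also_buy": [], "also_view": ["A1"]},
    {"asin": "A3", "also_buy": ["A1"], "also_view": []},
]
edge_index, edge_types = build_edges(rows, ["also_buy", "also_view"])
print(edge_index)  # [[0, 2, 0, 1], [1, 0, 2, 0]]
print(edge_types)  # [0, 0, 1, 1]
```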

get_chunk_info(idx, attribute)

Get chunk information for the specified attribute.

Parameters:
  • idx (int) – Index of the node.

  • attribute (str) – Attribute to get chunk information for.

Returns:

Chunk information.

Return type:

str

get_doc_info(idx, add_rel=True, compact=False)

Get document information for the specified node.

Parameters:
  • idx (int) – Index of the node.

  • add_rel (bool) – Whether to add relation information.

  • compact (bool) – Whether to compact the text.

Returns:

Document information.

Return type:

str

get_rel_info(idx, rel_types=None, n_rel=-1)

Get relation information for the specified node.

Parameters:
  • idx (int) – Index of the node.

  • rel_types (Union[list, None]) – Relation types to include, or None to include all relation types.

  • n_rel (int) – Maximum number of relations to include. The default, -1, includes all relations.

Returns:

Relation information.

Return type:

str
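
The defaults rel_types=None and n_rel=-1 both mean "no restriction". The sketch below illustrates that convention on a plain dictionary. The helper select_relations is hypothetical, and it assumes n_rel caps the neighbors listed per relation type. The reference does not say whether the cap is per type or overall.

```python
def select_relations(relations, rel_types=None, n_rel=-1):
    """Hypothetical filter over a {relation_type: [neighbor, ...]} mapping."""
    selected = {}
    for rel_type, neighbors in relations.items():
        # rel_types=None keeps every relation type.
        if rel_types is not None and rel_type not in rel_types:
            continue
        # n_rel=-1 keeps every neighbor; otherwise keep the first n_rel.
        kept = list(neighbors) if n_rel == -1 else list(neighbors[:n_rel])
        selected[rel_type] = kept
    return selected


relations = {"also_buy": [1, 2, 3], "also_view": [4, 5]}
print(select_relations(relations))                   # everything
print(select_relations(relations, ["also_buy"], 2))  # {'also_buy': [1, 2]}
```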

has_also_buy(idx, also_buy_item)

Check if the node has the specified also_buy item.

Parameters:
  • idx (int) – Index of the node.

  • also_buy_item (int) – Item to check.

Returns:

Whether the node has the specified also_buy item.

Return type:

bool

has_also_view(idx, also_view_item)

Check if the node has the specified also_view item.

Parameters:
  • idx (int) – Index of the node.

  • also_view_item (int) – Item to check.

Returns:

Whether the node has the specified also_view item.

Return type:

bool

has_brand(idx, brand)

Check if the node has the specified brand.

Parameters:
  • idx (int) – Index of the node.

  • brand (str) – Brand name.

Returns:

Whether the node has the specified brand.

Return type:

bool

link_columns = ['also_buy', 'also_view']
meta_columns = ['asin', 'title', 'global_category', 'category', 'price', 'brand', 'feature', 'rank', 'details', 'description']
node_attr_dict = {'brand': ['brand_name'], 'category': ['category_name'], 'color': ['color_name'], 'product': ['title', 'dimensions', 'weight', 'description', 'features', 'reviews', 'Q&A']}
post_process(raw_info, meta_link_types, cache_path=None)

Post-process the raw information to add meta link types.

Parameters:
  • raw_info (dict) – Raw information.

  • meta_link_types (list) – List of meta link types to add.

  • cache_path (str) – Path to cache the processed data.

Returns:

Post-processed data.

Return type:

dict

qa_columns = ['questionType', 'answerType', 'question', 'answer', 'answerTime']
review_columns = ['reviewerID', 'summary', 'style', 'reviewText', 'vote', 'overall', 'verified', 'reviewTime']
stark_qa.skb.amazon.read_qa(path)

Read and parse QA files.

Parameters:

path (str) – Path to the QA file.

Returns:

DataFrame containing the QA data.

Return type:

pd.DataFrame

stark_qa.skb.amazon.read_review(path)

Read and parse review files.

Parameters:

path (str) – Path to the review file.

Returns:

DataFrame containing the reviews.

Return type:

pd.DataFrame
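
As a rough illustration of what "read and parse" might involve, the sketch below reads a gzip-compressed file holding one JSON object per line, using only the standard library. The file format is an assumption (this reference does not state it), the helper read_json_lines_gz is hypothetical, and it returns a list of dicts rather than a pd.DataFrame. The field names in the sample come from review_columns above.

```python
import gzip
import json
import os
import tempfile


def read_json_lines_gz(path):
    """Hypothetical reader: one JSON object per line in a gzip file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Write a tiny sample file, then read it back.
sample = [
    {"reviewerID": "R1", "overall": 5.0, "reviewText": "Great"},
    {"reviewerID": "R2", "overall": 3.0, "reviewText": "Okay"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reviews.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    loaded = read_json_lines_gz(path)

print(len(loaded))  # 2
```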

stark_qa.skb.knowledge_base

class stark_qa.skb.knowledge_base.SKB(node_info, edge_index, node_type_dict=None, edge_type_dict=None, node_types=None, edge_types=None, indirected=True, **kwargs)

Bases: object

edge_type2id(edge_type)

Get the edge type ID given the edge type.

Return type:

int

get_all_paths(start_node_id, node_types, edge_types, max_num=None, direction='in-and-out')

Get all paths from start_node_id that match the given node types and edge types. Use "*" to match any edge type.

Return type:

list
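
The matching rule (a sequence of node types joined by edge types, with "*" as an edge-type wildcard) can be sketched on a toy graph. The function all_paths and its adjacency-list input are hypothetical, the direction argument is ignored, and the toy graph treats every edge as bidirectional. This is an illustration of the idea, not the library's algorithm.

```python
def all_paths(adj, node_type, start, node_types, edge_types, max_num=None):
    """Hypothetical enumeration of typed paths starting at `start`.

    adj maps node -> list of (edge_type, neighbor).
    node_type maps node -> type name.
    node_types has one more entry than edge_types.
    """
    if node_type[start] != node_types[0]:
        return []
    paths = [[start]]
    # Extend every partial path by one hop per (edge type, node type) pair.
    for wanted_edge, wanted_node in zip(edge_types, node_types[1:]):
        extended = []
        for path in paths:
            for edge_type, neighbor in adj[path[-1]]:
                edge_ok = wanted_edge == "*" or edge_type == wanted_edge
                if edge_ok and node_type[neighbor] == wanted_node:
                    extended.append(path + [neighbor])
        paths = extended
    return paths if max_num is None else paths[:max_num]


# Toy graph: author 0 writes papers 1 and 2; paper 1 cites paper 2
# and has topic 3.
node_type = {0: "author", 1: "paper", 2: "paper", 3: "field_of_study"}
adj = {
    0: [("writes", 1), ("writes", 2)],
    1: [("writes", 0), ("cites", 2), ("has_topic", 3)],
    2: [("writes", 0), ("cites", 1)],
    3: [("has_topic", 1)],
}
paths = all_paths(adj, node_type, 0, ["author", "paper", "paper"], ["writes", "*"])
print(paths)  # [[0, 1, 2], [0, 2, 1]]
```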

get_candidate_ids()

Get the candidate IDs.

Return type:

list

get_doc_info(idx, add_rel=False, compact=False)

Return a text document containing information about the node.

Parameters:
  • idx (int) – Node index.

  • add_rel (bool) – Whether to add relational information explicitly.

  • compact (bool) – Whether to compact the text.

Return type:

str

get_edge_ids_by_type(edge_type)

Get the edge IDs given the edge type.

Return type:

list

get_edge_type_by_id(edge_id)

Get the edge type given the edge ID.

Return type:

str

get_neighbor_nodes(idx, edge_type='*')

Get the neighbor nodes given the node ID and the edge type.

Parameters:
  • idx (int) – Node index.

  • edge_type (str) – Edge type. Use "*" to match any edge type.

Return type:

list

get_node_ids_by_type(node_type)

Get the node IDs given the node type.

Return type:

list

get_node_ids_by_value(node_type, key, value)

Get the node IDs given the node type and the value of a specific attribute.

Return type:

list
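
A lookup by attribute value can be pictured as a filter over per-node attribute dictionaries. In the sketch below, the helper node_ids_by_value, the shape of node_info and node_types, the exact-equality match, and the toy records are all assumptions made for illustration.

```python
def node_ids_by_value(node_info, node_types, node_type, key, value):
    """Hypothetical lookup: IDs of `node_type` nodes whose `key` equals `value`."""
    return [
        node_id
        for node_id, info in node_info.items()
        if node_types[node_id] == node_type and info.get(key) == value
    ]


# Toy records; the attribute names are illustrative only.
node_info = {
    0: {"title": "Trail Shoes", "brand": "Acme"},
    1: {"brand_name": "Acme"},
    2: {"title": "Rain Jacket", "brand": "Acme"},
    3: {"title": "Tent", "brand": "Summit"},
}
node_types = {0: "product", 1: "brand", 2: "product", 3: "product"}

matches = node_ids_by_value(node_info, node_types, "product", "brand", "Acme")
print(matches)  # [0, 2]
```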

get_node_type_by_id(node_id)

Get the node type given the node ID.

Return type:

str

get_rel_info(idx, rel_type=None)

Return a text document containing relation information about the node.

Parameters:
  • idx (int) – Node index.

  • rel_type (str, optional) – Relation type.

Return type:

str

get_tuples()

Get all possible tuples of node types and edge types.

Return type:

list

is_rel_type(edge_type)

Check if the edge type is a relation type.

k_hop_neighbor(node_idx, num_hops, **kwargs)

Get the k-hop neighbor subgraph.

Parameters:
  • node_idx (int) – Node index.

  • num_hops (int) – Number of hops.

  • **kwargs – Additional arguments.
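
For intuition about what "k-hop" covers, the sketch below runs a breadth-first search over a plain adjacency dictionary. The helper k_hop_nodes is hypothetical and returns only the set of reachable node indices (including the start node) rather than a subgraph. The return value of the actual method is not documented here.

```python
from collections import deque


def k_hop_nodes(adj, node_idx, num_hops):
    """Hypothetical BFS: nodes within `num_hops` edges of `node_idx`."""
    seen = {node_idx}
    frontier = deque([(node_idx, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_hops:
            continue  # do not expand past the hop limit
        for neighbor in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


# Toy chain graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(k_hop_nodes(adj, 0, 2)))  # [0, 1, 2]
```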

node_attr_dict()

Return the node attribute dictionary.

node_type2id(node_type)

Get the node type ID given the node type.

Return type:

int

node_type_lst()

Return the list of node types.

num_edges(node_type_id=None)

Return the number of edges.

num_nodes(node_type_id=None)

Return the number of nodes.

rel_type_lst()

Return the list of relation types.

sample_paths(node_types, edge_types, start_node_id=None, size=1)

Sample paths that match the given node types and edge types. Use "*" to match any edge type.

Return type:

list

stark_qa.skb.mag

class stark_qa.skb.mag.MagSKB(root=None, download_processed=True, **kwargs)

Bases: SKB

candidate_types = ['paper']
edge_type_dict = {0: 'author___affiliated_with___institution', 1: 'paper___cites___paper', 2: 'paper___has_topic___field_of_study', 3: 'author___writes___paper'}
get_doc_info(idx, compact=False, add_rel=True, n_rel=-1)

Get document information for the specified node.

Parameters:
  • idx (int) – Index of the node.

  • compact (bool) – Whether to compact the text.

  • add_rel (bool) – Whether to add relation information.

  • n_rel (int) – Number of relations to add. The default, -1, adds all relations.

Returns:

Document information.

Return type:

str

get_map(df)

Create mappings between MAG IDs and internal IDs.

Parameters:

df (DataFrame) – DataFrame containing MAG IDs.

Returns:

Mappings from MAG IDs to internal IDs and vice versa.

Return type:

tuple
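
A pair of forward and reverse ID mappings can be sketched with two dictionaries. The helper build_id_maps is hypothetical. It assumes internal IDs are contiguous integers assigned in first-seen order, and it takes a plain list rather than a DataFrame. The MAG IDs shown are made up.

```python
def build_id_maps(mag_ids):
    """Hypothetical: assign contiguous internal IDs in first-seen order."""
    mag_to_internal = {}
    for mag_id in mag_ids:
        # Repeated MAG IDs keep the internal ID they were first given.
        if mag_id not in mag_to_internal:
            mag_to_internal[mag_id] = len(mag_to_internal)
    internal_to_mag = {internal: mag for mag, internal in mag_to_internal.items()}
    return mag_to_internal, internal_to_mag


mag_to_internal, internal_to_mag = build_id_maps([2100, 37, 2100, 5])
print(mag_to_internal)  # {2100: 0, 37: 1, 5: 2}
print(internal_to_mag)  # {0: 2100, 1: 37, 2: 5}
```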

get_rel_info(idx, rel_types=None, n_rel=-1)

Get relation information for the specified node.

Parameters:
  • idx (int) – Index of the node.

  • rel_types (Union[list, None]) – Relation types to include, or None to include all relation types.

  • n_rel (int) – Maximum number of relations to include. The default, -1, includes all relations.

Returns:

Relation information.

Return type:

str

load_edge(edge_type)

Load edge data for the specified edge type.

Parameters:

edge_type (str) – Type of edge to load.

Returns:

Tuple containing the edge tensor and the edge numbers.

Return type:

tuple

load_english_paper_text(mag_ids, download_cache=True)

Load English text data for the papers.

Parameters:
  • mag_ids (list) – List of MAG IDs for the papers.

  • download_cache (bool) – Whether to download cached data.

Returns:

DataFrame containing English titles and abstracts.

Return type:

DataFrame

load_meta_data()

Load metadata for the MAG dataset.

Returns:

DataFrames for authors, fields of study, institutions, and papers.

Return type:

tuple

node_attr_dict = {'author': ['name'], 'field_of_study': ['name'], 'institution': ['name'], 'paper': ['title', 'abstract', 'publication date', 'venue']}
node_type_dict = {0: 'author', 1: 'institution', 2: 'field_of_study', 3: 'paper'}
test_columns = ['title', 'abstract', 'text']
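
The edge type names in edge_type_dict follow a head___relation___tail pattern with triple underscores as the separator. The snippet below splits the listed names into triples and checks that every head and tail is one of the listed node types. The helper split_edge_type is illustrative and not part of the class.

```python
# Values copied from the MagSKB attributes listed above.
edge_type_dict = {0: 'author___affiliated_with___institution', 1: 'paper___cites___paper', 2: 'paper___has_topic___field_of_study', 3: 'author___writes___paper'}
node_type_dict = {0: 'author', 1: 'institution', 2: 'field_of_study', 3: 'paper'}


def split_edge_type(name):
    """Split 'head___relation___tail' into a (head, relation, tail) triple."""
    head, relation, tail = name.split("___")
    return head, relation, tail


triples = {type_id: split_edge_type(name) for type_id, name in edge_type_dict.items()}
print(triples[0])  # ('author', 'affiliated_with', 'institution')

# Every endpoint named in an edge type is a known node type.
endpoints = {t[0] for t in triples.values()} | {t[2] for t in triples.values()}
print(endpoints == set(node_type_dict.values()))  # True
```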

stark_qa.skb.prime

class stark_qa.skb.prime.PrimeSKB(root=None, download_processed=True, **kwargs)

Bases: SKB

META_DATA = ['id', 'type', 'name', 'source', 'details']
NODE_TYPES = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']
RELATION_TYPES = ['ppi', 'carrier', 'enzyme', 'target', 'transporter', 'contraindication', 'indication', 'off-label use', 'synergistic interaction', 'associated with', 'parent-child', 'phenotype absent', 'phenotype present', 'side effect', 'interacts with', 'linked to', 'expression present', 'expression absent']
candidate_types = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']
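
Unlike AmazonSKB and MagSKB, which list a single candidate type, the values above show every PrimeSKB node type as a candidate type. The snippet copies the listed values and confirms this, along with the sizes of the two vocabularies.

```python
# Values copied from the PrimeSKB attributes listed above.
NODE_TYPES = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']
RELATION_TYPES = ['ppi', 'carrier', 'enzyme', 'target', 'transporter', 'contraindication', 'indication', 'off-label use', 'synergistic interaction', 'associated with', 'parent-child', 'phenotype absent', 'phenotype present', 'side effect', 'interacts with', 'linked to', 'expression present', 'expression absent']
candidate_types = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']

print(candidate_types == NODE_TYPES)          # True
print(len(NODE_TYPES), len(RELATION_TYPES))   # 10 18
```
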
get_doc_info(idx, add_rel=True, compact=False, n_rel=-1)

Get document information for the specified node.

Parameters:
  • idx (int) – Index of the node.

  • add_rel (bool) – Whether to add relation information.

  • compact (bool) – Whether to compact the text.

  • n_rel (int) – Number of relations to add. The default, -1, adds all relations.

Returns:

Document information.

Return type:

str

get_rel_info(idx, rel_types=None, n_rel=-1)

Get relation information for the specified node.

Parameters:
  • idx (int) – Index of the node.

  • rel_types (Union[list, None]) – Relation types to include, or None to include all relation types.

  • n_rel (int) – Maximum number of relations to include. The default, -1, includes all relations.

Returns:

Relation information.

Return type:

str