stark_qa.skb

stark_qa.skb.amazon

class stark_qa.skb.amazon.AmazonSKB(root=None, categories=['Sports_and_Outdoors'], meta_link_types=['brand', 'category', 'color'], max_entries=25, download_processed=True, **kwargs)[source]

Bases: SKB

COMMON = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}

QA_CATEGORIES = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Baby', 'Beauty', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Health_and_Personal_Care', 'Home_and_Kitchen', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}

REVIEW_CATEGORIES = {'All_Beauty', 'Amazon_Fashion', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Luxury_Beauty', 'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Prime_Pantry', 'Software', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
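As listed above, COMMON coincides with the intersection of QA_CATEGORIES and REVIEW_CATEGORIES. The short check below confirms this from the values on this page. It is only an observation about the listed sets; that the library derives COMMON this way is an assumption.

```python
# Category sets copied verbatim from the class attributes listed above.
COMMON = {'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Cell_Phones_and_Accessories',
          'Clothing_Shoes_and_Jewelry', 'Electronics', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen',
          'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies',
          'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}
QA_CATEGORIES = COMMON | {'Baby', 'Beauty', 'Health_and_Personal_Care'}
REVIEW_CATEGORIES = {'All_Beauty', 'Amazon_Fashion', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive',
                     'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry',
                     'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food',
                     'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Luxury_Beauty',
                     'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products',
                     'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Prime_Pantry', 'Software',
                     'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games'}

# Categories that have both QA data and review data.
shared = QA_CATEGORIES & REVIEW_CATEGORIES
# Categories that appear only in the QA list.
qa_only = sorted(QA_CATEGORIES - REVIEW_CATEGORIES)

print(shared == COMMON)  # True
print(qa_only)           # ['Baby', 'Beauty', 'Health_and_Personal_Care']
```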

candidate_types = ['product']

construct_raw_node_info(df_meta, df_review, df_qa)[source]

Construct raw node information.

Parameters:

df_meta (pd.DataFrame) – DataFrame containing meta information.
df_review (pd.DataFrame) – DataFrame containing review information.
df_qa (pd.DataFrame) – DataFrame containing QA information.

Returns:

Dictionary containing node information.

Return type:

dict

create_raw_product_graph(df, columns)[source]

Create raw product graph.

Parameters:

df (pd.DataFrame) – DataFrame containing meta information.
columns (list) – List of columns to create edges.

Returns:

Tuple containing edge index and edge types.

Return type:

tuple
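A minimal sketch of how an edge index and edge types could be built from link columns such as also_buy and also_view. It uses plain Python records in place of a pd.DataFrame. The helper name build_edges, the ASIN-to-index mapping, and the [sources, targets] layout are assumptions for illustration, not the library's implementation.

```python
def build_edges(records, columns):
    """Hypothetical helper: turn link columns into (edge_index, edge_types)."""
    # Assumption: a node's index is its row position, keyed by ASIN.
    asin2id = {rec["asin"]: i for i, rec in enumerate(records)}
    edge_index = [[], []]  # [source ids, target ids]
    edge_types = []        # one type id per edge; the id is the column's position
    for type_id, column in enumerate(columns):
        for rec in records:
            src = asin2id[rec["asin"]]
            for target_asin in rec.get(column) or []:
                dst = asin2id.get(target_asin)
                if dst is None:
                    continue  # skip links to products outside the table
                edge_index[0].append(src)
                edge_index[1].append(dst)
                edge_types.append(type_id)
    return edge_index, edge_types


# Toy records with made-up ASINs.
records = [
    {"asin": "A1", "also_buy": ["A2", "ZZ"], "also_view": ["A3"]},
    {"asin": "A2", "also_buy": [], "also_view": ["A1"]},
    {"asin": "A3", "also_buy": ["A1"], "also_view": []},
]
edge_index, edge_types = build_edges(records, ["also_buy", "also_view"])
print(edge_index)  # [[0, 2, 0, 1], [1, 0, 2, 0]]
print(edge_types)  # [0, 0, 1, 1]
```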

get_chunk_info(idx, attribute)[source]

Get chunk information for the specified attribute.

Parameters:

idx (int) – Index of the node.
attribute (str) – Attribute to get chunk information for.

Returns:

Chunk information.

Return type:

str

get_doc_info(idx, add_rel=True, compact=False)[source]

Get document information for the specified node.

Parameters:

idx (int) – Index of the node.
add_rel (bool) – Whether to add relationship information.
compact (bool) – Whether to compact the text.

Returns:

Document information.

Return type:

str

get_rel_info(idx, rel_types=None, n_rel=-1)[source]

Get relation information for the specified node.

Parameters:

idx (int) – Index of the node.
rel_types (Union[list, None]) – List of relation types to include, or None to include all relation types.
n_rel (int) – Number of relations to include. Defaults to -1, which includes all relations.

Returns:

Relation information.

Return type:

str
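The sketch below illustrates how the rel_types and n_rel parameters described above could interact. The function format_relations, the toy relation data, and the output format are hypothetical. Treating every negative n_rel as "include all" is also an assumption, since only -1 is documented.

```python
def format_relations(relations, rel_types=None, n_rel=-1):
    """Hypothetical stand-in for a get_rel_info-style formatter."""
    # None means every relation type is included.
    selected = list(relations) if rel_types is None else rel_types
    lines = []
    for rel in selected:
        names = relations.get(rel, [])
        if n_rel >= 0:
            names = names[:n_rel]  # keep at most n_rel entries per relation type
        if names:
            lines.append(f"{rel}: {', '.join(names)}")
    return "\n".join(lines)


# Toy data, not taken from the dataset.
relations = {"also_buy": ["Tent", "Stove", "Lantern"], "also_view": ["Backpack"]}

print(format_relations(relations))
print(format_relations(relations, rel_types=["also_buy"], n_rel=2))
```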

has_also_buy(idx, also_buy_item)[source]

Check if the node has the specified also_buy item.

Parameters:

idx (int) – Index of the node.
also_buy_item (int) – Item to check.

Returns:

Whether the node has the specified also_buy item.

Return type:

bool

has_also_view(idx, also_view_item)[source]

Check if the node has the specified also_view item.

Parameters:

idx (int) – Index of the node.
also_view_item (int) – Item to check.

Returns:

Whether the node has the specified also_view item.

Return type:

bool

has_brand(idx, brand)[source]

Check if the node has the specified brand.

Parameters:

idx (int) – Index of the node.
brand (str) – Brand name.

Returns:

Whether the node has the specified brand.

Return type:

bool

link_columns = ['also_buy', 'also_view']

meta_columns = ['asin', 'title', 'global_category', 'category', 'price', 'brand', 'feature', 'rank', 'details', 'description']

node_attr_dict = {'brand': ['brand_name'], 'category': ['category_name'], 'color': ['color_name'], 'product': ['title', 'dimensions', 'weight', 'description', 'features', 'reviews', 'Q&A']}

post_process(raw_info, meta_link_types, cache_path=None)[source]

Post-process the raw information to add meta link types.

Parameters:

raw_info (dict) – Raw information.
meta_link_types (list) – List of meta link types to add.
cache_path (str) – Path to cache the processed data.

Returns:

Post-processed data.

Return type:

dict

qa_columns = ['questionType', 'answerType', 'question', 'answer', 'answerTime']

review_columns = ['reviewerID', 'summary', 'style', 'reviewText', 'vote', 'overall', 'verified', 'reviewTime']

stark_qa.skb.amazon.read_qa(path)[source]

Read and parse QA files.

Parameters: path (str) – Path to the QA file.
Returns: DataFrame containing the QA data.
Return type: pd.DataFrame

stark_qa.skb.amazon.read_review(path)[source]

Read and parse review files.

Parameters: path (str) – Path to the review file.
Returns: DataFrame containing the reviews.
Return type: pd.DataFrame

stark_qa.skb.knowledge_base

class stark_qa.skb.knowledge_base.SKB(node_info, edge_index, node_type_dict=None, edge_type_dict=None, node_types=None, edge_types=None, indirected=True, **kwargs)[source]

Bases: object

edge_type2id(edge_type)[source]

Get the edge type ID given the edge type.

Return type: int

get_all_paths(start_node_id, node_types, edge_types, max_num=None, direction='in-and-out')[source]

Get all paths given the node types and edge types. Use '*' to indicate any edge type.

Return type: list
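One way to read the node_types and edge_types arguments is as a pattern that a path must match step by step, with '*' accepting any edge type. The sketch below enumerates such paths on a toy graph. The function all_paths, the toy nodes and edges, and the restriction to outgoing edges are assumptions; the documented method also takes max_num and direction arguments, which are not modeled here.

```python
# Toy graph: node id -> node type, and (source, target, edge type) triples.
node_type = {0: "author", 1: "paper", 2: "paper", 3: "field_of_study"}
edges = [(0, 1, "writes"), (0, 2, "writes"), (1, 2, "cites"), (1, 3, "has_topic")]


def all_paths(start, node_types, edge_types):
    """Hypothetical sketch: node_types has one more entry than edge_types."""
    if node_type[start] != node_types[0]:
        return []
    paths = [[start]]
    for step, wanted_edge in enumerate(edge_types):
        wanted_node = node_types[step + 1]
        extended = []
        for path in paths:
            for src, dst, label in edges:
                # Follow outgoing edges whose type and target node type match.
                if (src == path[-1]
                        and wanted_edge in ("*", label)
                        and node_type[dst] == wanted_node):
                    extended.append(path + [dst])
        paths = extended
    return paths


print(all_paths(0, ["author", "paper", "paper"], ["writes", "cites"]))  # [[0, 1, 2]]
print(all_paths(0, ["author", "paper", "field_of_study"], ["*", "*"]))  # [[0, 1, 3]]
```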

get_candidate_ids()[source]

Get the candidate IDs.

Return type: list

get_doc_info(idx, add_rel=False, compact=False)[source]

Return a text document containing information about the node.

Parameters:

idx (int) – Node index.
add_rel (bool) – Whether to add relational information explicitly.
compact (bool) – Whether to compact the text.

Return type:

str

get_edge_ids_by_type(edge_type)[source]

Get the edge IDs given the edge type.

Return type: list

get_edge_type_by_id(edge_id)[source]

Get the edge type given the edge ID.

Return type: str

get_neighbor_nodes(idx, edge_type='*')[source]

Get the neighbor nodes given the node ID and the edge type.

Parameters:

idx (int) – Node index.
edge_type (str) – Edge type. Use '*' to match any edge type.

Return type:

list
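The wildcard behaviour of edge_type can be illustrated with a small edge list. This is a sketch under stated assumptions, not the library's implementation: the function neighbors, the edge triples, and the undirected flag (loosely mirroring the constructor's indirected argument) are hypothetical.

```python
# Toy edges as (source, target, edge type) triples.
edges = [(0, 1, "also_buy"), (0, 2, "also_view"), (3, 0, "also_buy")]


def neighbors(idx, edge_type="*", undirected=True):
    """Hypothetical neighbor lookup; '*' matches every edge type."""
    found = set()
    for src, dst, label in edges:
        if edge_type != "*" and label != edge_type:
            continue
        if src == idx:
            found.add(dst)
        elif undirected and dst == idx:
            found.add(src)  # also follow incoming edges when undirected
    return sorted(found)


print(neighbors(0))                                # [1, 2, 3]
print(neighbors(0, "also_buy"))                    # [1, 3]
print(neighbors(0, "also_buy", undirected=False))  # [1]
```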

get_node_ids_by_type(node_type)[source]

Get the node IDs given the node type.

Return type: list

get_node_ids_by_value(node_type, key, value)[source]

Get the node IDs given the node type and the value of a specific attribute.

Return type: list
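Conceptually, a lookup by node type and attribute value is a filter over per-node records. The sketch below shows that idea on toy data. The node_info layout, the "type" field, and the helper node_ids_by_value are assumptions for illustration and may differ from how the library stores nodes.

```python
# Toy node records keyed by node id (made-up values).
node_info = {
    0: {"type": "product", "title": "Trail Tent", "brand": "Acme"},
    1: {"type": "product", "title": "Camp Stove", "brand": "Acme"},
    2: {"type": "brand", "brand_name": "Acme"},
    3: {"type": "product", "title": "Trail Tent", "brand": "Other"},
}


def node_ids_by_value(node_type, key, value):
    """Hypothetical filter: ids of nodes of node_type whose attribute key equals value."""
    return [
        node_id
        for node_id, info in node_info.items()
        if info.get("type") == node_type and info.get(key) == value
    ]


print(node_ids_by_value("product", "title", "Trail Tent"))  # [0, 3]
print(node_ids_by_value("brand", "brand_name", "Acme"))     # [2]
```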

get_node_type_by_id(node_id)[source]

Get the node type given the node ID.

Return type: str

get_rel_info(idx, rel_type=None)[source]

Return a text document containing relational information about the node.

Parameters:

idx (int) – Node index.
rel_type (str, optional) – Relation type.

Return type:

str

get_tuples()[source]

Get all possible tuples of node types and edge types.

Return type: list

is_rel_type(edge_type)[source]

Check if the edge type is a relation type.

k_hop_neighbor(node_idx, num_hops, **kwargs)[source]

Get the k-hop neighbor subgraph.

Parameters:

node_idx (int) – Node index.
num_hops (int) – Number of hops.
**kwargs – Additional arguments.
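The idea behind a k-hop neighborhood can be sketched with a breadth-first search that stops after num_hops steps. The function k_hop_nodes and the adjacency dictionary are hypothetical. The sketch returns only the set of reachable node ids, whereas the return format of k_hop_neighbor is not documented here.

```python
from collections import deque


def k_hop_nodes(adjacency, start, num_hops):
    """Hypothetical sketch: ids of nodes within num_hops edges of start."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


# Toy undirected chain 0 - 1 - 2 - 3.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(k_hop_nodes(adjacency, 0, 2)))  # [0, 1, 2]
```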

node_attr_dict()[source]

Return the node attribute dictionary.

node_type2id(node_type)[source]

Get the node type ID given the node type.

Return type: int

node_type_lst()[source]

Return the list of node types.

num_edges(node_type_id=None)[source]

Return the number of edges.

num_nodes(node_type_id=None)[source]

Return the number of nodes.

rel_type_lst()[source]

Return the list of relation types.

sample_paths(node_types, edge_types, start_node_id=None, size=1)[source]

Sample paths given the node types and edge types. Use '*' to indicate any edge type.

Return type: list

stark_qa.skb.mag

class stark_qa.skb.mag.MagSKB(root=None, download_processed=True, **kwargs)[source]

Bases: SKB

candidate_types = ['paper']

edge_type_dict = {0: 'author___affiliated_with___institution', 1: 'paper___cites___paper', 2: 'paper___has_topic___field_of_study', 3: 'author___writes___paper'}
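The edge type names above appear to follow a head___relation___tail pattern. Reading them that way is an assumption based on the names alone. Under that assumption, a small helper (split_edge_type, hypothetical) recovers the triples:

```python
# Copied from the MagSKB class attribute listed above.
edge_type_dict = {
    0: 'author___affiliated_with___institution',
    1: 'paper___cites___paper',
    2: 'paper___has_topic___field_of_study',
    3: 'author___writes___paper',
}


def split_edge_type(name):
    """Split 'head___relation___tail' on the triple-underscore separator."""
    head, relation, tail = name.split("___")
    return head, relation, tail


triples = {edge_id: split_edge_type(name) for edge_id, name in edge_type_dict.items()}
print(triples[2])  # ('paper', 'has_topic', 'field_of_study')

# Node types that occur as a head or tail of some edge type.
endpoint_types = {t for head, _, tail in triples.values() for t in (head, tail)}
print(sorted(endpoint_types))  # ['author', 'field_of_study', 'institution', 'paper']
```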

get_doc_info(idx, compact=False, add_rel=True, n_rel=-1)[source]

Get document information for the specified node.

Parameters:

idx (int) – Index of the node.
compact (bool) – Whether to compact the text.
add_rel (bool) – Whether to add relation information.
n_rel (int) – Number of relations to add. Defaults to -1, which includes all relations.

Returns:

Document information.

Return type:

str

get_map(df)[source]

Create mappings between MAG IDs and internal IDs.

Parameters: df (DataFrame) – DataFrame containing MAG IDs.
Returns: Mappings from MAG IDs to internal IDs and vice versa.
Return type: tuple
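A two-way ID mapping of this kind can be sketched with a pair of dictionaries. The helper build_id_maps is hypothetical and takes a plain list instead of a DataFrame. The consecutive zero-based internal IDs and the de-duplication step are assumptions, and the MAG IDs shown are made up.

```python
def build_id_maps(mag_ids):
    """Hypothetical sketch: map external MAG IDs to 0-based internal IDs and back."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_ids = list(dict.fromkeys(mag_ids))
    mag2id = {mag_id: internal for internal, mag_id in enumerate(unique_ids)}
    id2mag = {internal: mag_id for mag_id, internal in mag2id.items()}
    return mag2id, id2mag


mag2id, id2mag = build_id_maps([2100, 3500, 2100, 9000])
print(mag2id)  # {2100: 0, 3500: 1, 9000: 2}
print(id2mag)  # {0: 2100, 1: 3500, 2: 9000}
```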

get_rel_info(idx, rel_types=None, n_rel=-1)[source]

Get relation information for the specified node.

Parameters:

idx (int) – Index of the node.
rel_types (Union[list, None]) – List of relation types to include, or None to include all relation types.
n_rel (int) – Number of relations to include. Defaults to -1, which includes all relations.

Returns:

Relation information.

Return type:

str

load_edge(edge_type)[source]

Load edge data for the specified edge type.

Parameters: edge_type (str) – Type of edge to load.
Returns: A tuple containing edge tensor and edge numbers.
Return type: tuple

load_english_paper_text(mag_ids, download_cache=True)[source]

Load English text data for the papers.

Parameters:

mag_ids (list) – List of MAG IDs for the papers.
download_cache (bool) – Whether to download cached data.

Returns:

DataFrame containing English titles and abstracts.

Return type:

DataFrame

load_meta_data()[source]

Load metadata for the MAG dataset.

Returns: DataFrames for authors, fields of study, institutions, and papers.
Return type: tuple

node_attr_dict = {'author': ['name'], 'field_of_study': ['name'], 'institution': ['name'], 'paper': ['title', 'abstract', 'publication date', 'venue']}

node_type_dict = {0: 'author', 1: 'institution', 2: 'field_of_study', 3: 'paper'}

test_columns = ['title', 'abstract', 'text']

stark_qa.skb.prime

class stark_qa.skb.prime.PrimeSKB(root=None, download_processed=True, **kwargs)[source]

Bases: SKB

META_DATA = ['id', 'type', 'name', 'source', 'details']

NODE_TYPES = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']

RELATION_TYPES = ['ppi', 'carrier', 'enzyme', 'target', 'transporter', 'contraindication', 'indication', 'off-label use', 'synergistic interaction', 'associated with', 'parent-child', 'phenotype absent', 'phenotype present', 'side effect', 'interacts with', 'linked to', 'expression present', 'expression absent']

candidate_types = ['disease', 'gene/protein', 'molecular_function', 'drug', 'pathway', 'anatomy', 'effect/phenotype', 'biological_process', 'cellular_component', 'exposure']

get_doc_info(idx, add_rel=True, compact=False, n_rel=-1)[source]

Get document information for the specified node.

Parameters:

idx (int) – Index of the node.
add_rel (bool) – Whether to add relationship information.
compact (bool) – Whether to compact the text.
n_rel (int) – Number of relationships to add.

Returns:

Document information.

Return type:

str

get_rel_info(idx, rel_types=None, n_rel=-1)[source]

Get relation information for the specified node.

Parameters:

idx (int) – Index of the node.
rel_types (Union[list, None]) – List of relation types to include, or None to include all relation types.
n_rel (int) – Number of relations to include. Defaults to -1, which includes all relations.

Returns:

Relation information.

Return type:

str