Overview

The STaRK benchmark features three novel retrieval-based question-answering datasets, each containing a synthesized train/val/test split of 9k to 14k queries plus a high-quality, human-generated query set. The queries require integrating relational and textual knowledge and, with their natural-sounding language and flexible formats, closely resemble real-world queries.
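
To make the dataset layout concrete, here is a minimal sketch of loading one dataset and inspecting its splits. It assumes the `stark_qa` Python package with a `load_qa` helper and a `get_idx_split()` method; the dataset name `"amazon"` and these exact names/signatures are assumptions for illustration, not a definitive API reference.

```python
from stark_qa import load_qa  # assumed package and helper name

# Load one of the three QA datasets (dataset name assumed here).
qa_dataset = load_qa("amazon")

# The synthesized queries come with a train/val/test split; a separate
# human-generated query set is provided alongside it for evaluation.
idx_split = qa_dataset.get_idx_split()  # assumed accessor returning split -> query indices
for split, indices in idx_split.items():
    print(f"{split}: {len(indices)} queries")
```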

The datasets are built on three knowledge bases covering product search, academic paper search, and biomedical inquiries. Each knowledge base is semi-structured, combining large-scale relational data among entities with comprehensive textual information about each entity.
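
The sketch below is not the STaRK API; it is a toy illustration of what "semi-structured" means in this context: every entity carries free-text documentation as well as typed relational edges to other entities, and answering a query may require combining both views. All names and data in it are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    text: str  # textual documentation of the entity
    # Typed relational edges: (relation type, target entity id).
    relations: list[tuple[str, int]] = field(default_factory=list)

# A two-entity toy knowledge base (contents invented for illustration).
kb = {
    0: Entity("aspirin", "A drug used to reduce pain, fever, or inflammation.",
              [("interacts_with", 1)]),
    1: Entity("ibuprofen", "A nonsteroidal anti-inflammatory drug."),
}

# A query such as "Which anti-inflammatory drugs interact with aspirin?" needs both
# the relational edge and the neighbour's text to be answered.
for rel, target in kb[0].relations:
    print(rel, "->", kb[target].name, ":", kb[target].text)
```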