DatabaseAPIQueryToolSet#

The DatabaseAPIQueryToolSet provides LLM-enhanced natural language queries for 26+ biological databases with schema-driven API parameter generation.

Overview#

Key features:

  • Natural Language Queries: Convert natural language to database API calls

  • 26+ Databases: UniProt, PDB, Ensembl, ClinVar, KEGG, and more

  • Schema-Driven: Uses database schemas for accurate parameter generation

  • Formatted Results: Returns human-readable results for LLM consumption

Supported Databases#

Proteins:

  • UniProt, PDB, AlphaFold, InterPro, STRING, EMDB

Genomics:

  • Ensembl, ClinVar, dbSNP, GnomAD, GWAS Catalog, UCSC, RegulomeDB

Expression:

  • GEO, CCRE, OpenTargets, OpenTargets Genetics, ReMap

Pathways:

  • KEGG, Reactome, GtoPdb

Specialized:

  • BLAST, JASPAR, MPD, IUCN, PRIDE, cBioPortal, WoRMS, Paleobiology

Basic Usage#

from pantheon.agent import Agent
from pantheon.toolsets import DatabaseAPIQueryToolSet

# Create database API toolset
db_tools = DatabaseAPIQueryToolSet(
    name="database_api",
    workspace_path="/path/to/workspace"
)

# Create agent and add toolset at runtime
agent = Agent(
    name="bioinformatician",
    instructions="You can query biological databases using natural language."
)
await agent.toolset(db_tools)

await agent.chat()

Constructor Parameters#

Parameter

Type

Description

name

str

Name of the toolset (default: “database_api”)

workspace_path

str | None

Working directory. Defaults to current directory.

Tools Reference#

query#

Query a biological database using natural language.

result = db_tools.query(
    prompt="Find BRCA1 mutations in breast cancer",
    database="clinvar",
    max_results=5
)

Parameters:

  • prompt: Natural language query describing what to find

  • database: Target database name (e.g., ‘uniprot’, ‘ensembl’, ‘clinvar’)

  • max_results: Maximum number of results to return (default: 5)

  • llm_service_id: Optional LLM service ID for parameter generation

Returns:

{
    "success": True,
    "database": "clinvar",
    "prompt": "Find BRCA1 mutations in breast cancer",
    "api_parameters": {"query": "BRCA1[gene] AND breast cancer", ...},
    "raw_count": 150,
    "results": "Found 5 results from clinvar:\n\n1. Entry:\n   ID: ...",
    "strategy": "llm_enhanced"
}

list_databases#

List all available databases with their categories.

result = db_tools.list_databases()

Returns:

{
    "success": True,
    "databases": ["alphafold", "clinvar", "ensembl", ...],
    "categories": {
        "proteins": ["uniprot", "pdb", "alphafold", ...],
        "genomics": ["ensembl", "clinvar", "dbsnp", ...],
        "expression": ["geo", "ccre", "opentargets", ...],
        "pathways": ["kegg", "reactome", "gtopdb"],
        "specialized": ["blast", "jaspar", "mpd", ...]
    },
    "total": 26
}

database_info#

Get detailed information about a specific database.

result = db_tools.database_info(database="uniprot")

Returns:

{
    "success": True,
    "database": "uniprot",
    "base_url": "https://rest.uniprot.org/uniprotkb",
    "categories": ["search", "entry", "mapping"],
    "example_queries": [
        "Find human insulin protein",
        "Search for p53 tumor suppressor"
    ],
    "is_valid": True
}

Examples#

Protein Research#

# Search UniProt for a specific protein
result = db_tools.query(
    prompt="Find human insulin protein with full sequence",
    database="uniprot",
    max_results=3
)

# Get protein structure from PDB
result = db_tools.query(
    prompt="Find crystal structure of hemoglobin",
    database="pdb",
    max_results=5
)

Genomics Research#

# Search for gene variants
result = db_tools.query(
    prompt="Find pathogenic variants in CFTR gene",
    database="clinvar",
    max_results=10
)

# Get gene information
result = db_tools.query(
    prompt="Get information about TP53 gene in humans",
    database="ensembl"
)

Pathway Analysis#

# Search KEGG pathways
result = db_tools.query(
    prompt="Find pathways involved in glycolysis",
    database="kegg",
    max_results=5
)

# Search Reactome
result = db_tools.query(
    prompt="Find signaling pathways involving EGFR",
    database="reactome"
)

Agent-Driven Research#

from pantheon.agent import Agent
from pantheon.toolsets import DatabaseAPIQueryToolSet

db_tools = DatabaseAPIQueryToolSet(name="databases")

researcher = Agent(
    name="researcher",
    instructions="""You are a bioinformatics researcher. When researching:
    1. Use list_databases to find appropriate databases
    2. Use database_info to understand query options
    3. Use query to search for relevant data
    4. Synthesize findings into comprehensive answers"""
)
await researcher.toolset(db_tools)

result = await researcher.run(
    "Research the role of BRCA1 in cancer and find relevant mutations and pathways"
)

Configuration#

The toolset uses the PANTHEON_AGENT_SERVICE_ID environment variable for LLM-based parameter generation:

export PANTHEON_AGENT_SERVICE_ID="your-service-id"

Best Practices#

  1. Use specific queries: Include gene names, organisms, and conditions

  2. Check available databases: Use list_databases to find appropriate sources

  3. Limit results: Use max_results to manage output size

  4. Combine sources: Query multiple databases for comprehensive research

  5. Review API parameters: Check api_parameters in response for transparency

Error Handling#

result = db_tools.query(
    prompt="Find BRCA1 mutations",
    database="clinvar"
)

if result["success"]:
    print(result["results"])
else:
    print(f"Error: {result['error']}")
    # Check available_databases if database not found
    if "available_databases" in result:
        print(f"Available: {result['available_databases']}")