RAG System (Auto-Build)

RAG System (Auto-Build)#

The RAG System provides tools for automatically building vector databases from various sources including documentation websites, GitHub repositories, and PDF files. It features web crawling, document processing, and integration with Hugging Face for database distribution.

Overview#

The RAG auto-build system (pantheon.toolsets.utils.rag) provides:

Automated Web Crawling: Deep crawl documentation sites and extract content
Multi-Source Support: Build from websites, GitHub READMEs, PDFs, and local files
Vector Database Creation: Automatic chunking and embedding with LanceDB
Hugging Face Integration: Upload and download pre-built databases
Caching System: Intelligent caching for embeddings and build progress
YAML Configuration: Define sources and parameters in YAML files

YAML Configuration Format#

The YAML configuration file defines one or more vector databases with their sources:

Basic Structure#

database_name:
  description: Description of the database
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 4000
    chunk_overlap: 200
  items:
    item_name:
      type: source_type
      url: source_url
      description: Item description

Supported Source Types#

package documentation: Deep crawls documentation websites
tutorial: Processes tutorial websites with multi-page content
github readme: Fetches README files from GitHub repositories
pdf: Downloads and processes PDF documents

Example Configuration#

single-cell-python-packages:
  description: Vector database of single-cell python packages documentation
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 4000
    chunk_overlap: 200
  items:
    scanpy:
      type: package documentation
      url: https://scanpy.readthedocs.io/en/stable/
      description: Scanpy toolkit for single-cell analysis
    sc-best-practices:
      type: tutorial
      url: https://www.sc-best-practices.org/
      description: Best practices guide for single-cell analysis
    star:
      type: pdf
      url: https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf
      description: STAR RNA-seq aligner manual

Command Line Usage#

Build from YAML Configuration#

Build the database:

python -m pantheon.toolsets.utils.rag build config.yaml ./output_dir

Upload to Hugging Face#

Share your built database:

# Set your Hugging Face token
export HUGGINGFACE_TOKEN=your_token_here

# Upload to default repo (NaNg/pantheon_rag_db)
python -m pantheon.toolsets.utils.rag upload ./output_dir

# Or specify custom repo
python -m pantheon.toolsets.utils.rag upload ./output_dir --repo-id your-username/your-repo

Download Pre-built Database#

Download existing databases:

# Download from default repo
python -m pantheon.toolsets.utils.rag download ./local_dir

# Download from custom repo
python -m pantheon.toolsets.utils.rag download ./local_dir --repo-id username/repo --filename custom.zip

Build Process Details#

Directory Structure#

After building, the output directory contains:

output_dir/
└── database_name/
    ├── metadata.yaml        # Database configuration
    ├── info_cache.json      # Build progress tracking
    ├── raw/                 # Downloaded raw content
    │   └── item_name/
    │       ├── *.md        # Markdown files
    │       └── *.pdf       # PDF files
    └── lancedb/            # Vector database files
        └── database_name.lance/

Processing Pipeline#

Download Phase:
- For documentation/tutorials: Deep crawl with configurable depth
- For GitHub READMEs: Direct file download
- For PDFs: Binary file download
Content Extraction:
- HTML → Markdown conversion for web content
- PDF text extraction using PyMuPDF
- Duplicate removal via content hashing
Text Processing:
- Smart chunking with configurable size and overlap
- Metadata preservation (source, URL)
- Context maintenance across chunks
Vector Storage:
- Embedding generation via OpenAI API
- LanceDB storage with PyArrow schema
- Metadata indexing for filtering

Progress Tracking#

The system maintains build state in info_cache.json:

{
    "item_name": {
        "success": true,
        "created_at": "2024-01-01T12:00:00",
        "download_success": true
    }
}

Key behaviors:

Successfully cached items (success: true) are skipped on subsequent builds
Failed items can be retried without re-processing successful ones
To force re-download: Set download_success to false in info_cache.json
To force complete rebuild: Set success to false in info_cache.json
Then re-run the build command to process the modified items

Programmatic Usage#

Using VectorDB Class#

Direct database operations:

from pantheon.toolsets.utils.rag.vectordb import VectorDB

# Load existing database
db = VectorDB("./output_dir/database_name")

# Query the database
results = await db.query(
    query="How to perform clustering analysis?",
    top_k=5,
    source="scanpy"  # Optional: filter by source
)

# Insert new content
await db.insert(
    text="Your new content here",
    metadata={"source": "custom", "date": "2024-01-01"}
)

# Insert from file with automatic chunking
await db.insert_from_file(
    file_path="./new_doc.md",
    metadata={"source": "local_docs"}
)

Building Programmatically#

Build databases from Python code:

import asyncio
from pantheon.toolsets.utils.rag.build import build_vector_db

db_config = {
    "type": "vector_db",
    "parameters": {
        "embedding_model": "text-embedding-3-large",
        "chunk_size": 4000,
        "chunk_overlap": 200
    },
    "items": {
        "my_docs": {
            "type": "package documentation",
            "url": "https://mydocs.example.com",
            "description": "My documentation"
        },
        "manual": {
            "type": "pdf",
            "url": "https://example.com/manual.pdf",
            "description": "User manual"
        }
    }
}

asyncio.run(build_vector_db("my_knowledge_base", db_config, "./output"))

Full Build Example#

Complete workflow from YAML to usage:

import asyncio
from pantheon.toolsets.utils.rag.build import build_all

# Build all databases defined in YAML
asyncio.run(build_all("./config.yaml", "./output"))

Special Features#

PDF Support#

The system can process PDF documents:

items:
  research_paper:
    type: pdf
    url: https://arxiv.org/pdf/2301.00001.pdf
    description: Research paper on transformers

PDFs are:

Downloaded as binary files
Text extracted using PyMuPDF
Processed like other text documents
Stored with original URL metadata

Web Crawling Configuration#

For documentation and tutorial types:

max_depth: Controls crawling depth (default: 1)
include_external: Include external links (default: false)
Uses BFS (Breadth-First Search) strategy
Automatic markdown extraction from HTML
Removes duplicate content via SHA-256 hashing

Embedding Models#

Supported OpenAI embedding models:

text-embedding-3-large: Best quality, 3072 dimensions
text-embedding-3-small: Faster, 1536 dimensions
text-embedding-ada-002: Legacy model, 1536 dimensions

Caching System#

Two-level caching for efficiency:

Embedding Cache: Disk-based cache prevents redundant API calls
Build Cache: info_cache.json tracks processing status

Hugging Face Integration#

Upload and Distribution#

The system integrates with Hugging Face for sharing databases:

Packaging: Creates ZIP archive of entire database
Upload: Pushes to Hugging Face dataset repository
Versioning: Uploaded as latest.zip by default

# Upload with authentication
export HUGGINGFACE_TOKEN=hf_xxxxx
python -m pantheon.toolsets.utils.rag upload ./my_db --repo-id myorg/my-rag-db

Download and Use#

Download pre-built databases:

# Download and extract
python -m pantheon.toolsets.utils.rag download ./local_db --repo-id myorg/my-rag-db

# Use immediately
from pantheon.toolsets.utils.rag.vectordb import VectorDB
db = VectorDB("./local_db/database_name")

Real-World Examples#

Single-Cell Analysis Knowledge Base#

Build a comprehensive single-cell analysis database:

# single_cell.yaml
single-cell-tools:
  description: Comprehensive single-cell analysis tools documentation
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 4000
    chunk_overlap: 200
  items:
    scanpy:
      type: package documentation
      url: https://scanpy.readthedocs.io/en/stable/
      description: Core single-cell analysis toolkit
    scvi-tools:
      type: package documentation
      url: https://docs.scvi-tools.org/en/stable/
      description: Deep learning for single-cell
    cellranger:
      type: pdf
      url: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct
      description: 10x Genomics Cell Ranger manual

Machine Learning Documentation#

Create ML framework knowledge base:

# ml_docs.yaml
ml-frameworks:
  description: Machine learning frameworks documentation
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 3000
    chunk_overlap: 300
  items:
    pytorch:
      type: package documentation
      url: https://pytorch.org/docs/stable/
      description: PyTorch deep learning framework
    tensorflow:
      type: package documentation
      url: https://www.tensorflow.org/api_docs
      description: TensorFlow ML platform
    papers:
      type: pdf
      url: https://arxiv.org/pdf/1706.03762.pdf
      description: Attention Is All You Need paper

Best Practices#

Choose Appropriate Chunk Sizes: - Larger chunks (4000-5000) for narrative content - Smaller chunks (1000-2000) for API references - Balance between context and precision
Optimize Embedding Models: - Use text-embedding-3-large for best quality - Consider text-embedding-3-small for large datasets - Monitor API costs
Organize Sources Logically: - Group related documentation in same database - Use descriptive item names - Add clear descriptions for each source
Handle Build Failures: - Check info_cache.json for error details - Failed items can be retried individually - Network issues don’t affect completed items
Use Hugging Face for Distribution: - Share pre-built databases to save compute - Version control via repository tags - Collaborate on knowledge bases
Regular Updates: - Rebuild periodically for fresh content - Use caching to minimize redundant work - Track changes via git

Troubleshooting#

Common Issues#

Build Failures:

Check network connectivity for web crawling
Verify URLs are accessible (not behind authentication)
Review error messages in info_cache.json
Ensure sufficient disk space

PDF Processing Errors:

Verify PDF URL is directly accessible
Check if PDF requires authentication
Some PDFs may have text extraction restrictions
Install PyMuPDF: pip install pymupdf

Embedding Errors:

Ensure OPENAI_API_KEY is set correctly
Check API rate limits and quotas
Verify model name is correct
Monitor token usage

Storage Issues:

Ensure sufficient disk space (databases can be large)
Check write permissions on output directory
Clean old cache files if needed
Verify LanceDB installation

Performance Tips#

Use embedding cache to avoid redundant API calls
Enable progress tracking to resume interrupted builds
Process sources in parallel when system allows
Consider smaller embedding models for very large datasets
Use local caching for frequently accessed databases

Integration with VectorRAGToolSet#

The VectorRAGToolSet class provides a toolset interface for agents to interact with the built databases.

Basic Usage#

Create a VectorRAGToolSet with your built database:

from pantheon.toolsets.vector_rag import VectorRAGToolSet

# Initialize with database path
rag_toolset = VectorRAGToolSet(
    name="knowledge_assistant",
    db_path="./output/single-cell-tools"
)

# The toolset provides these tools:
# - query_vector_db: Query the database with optional source filtering
# - get_vector_db_info: Get metadata about the database

Configuration Parameters#

Full configuration options:

rag_toolset = VectorRAGToolSet(
    name="custom_rag",
    db_path="./output/ml-frameworks",
    worker_params=None,  # Optional worker configuration
    allow_insert=False,  # Enable insert_vector_db tool
    allow_delete=False,  # Enable delete_vector_db tool
    db_params={}        # Additional database parameters
)

Available Tools#

query_vector_db: Main query interface

query: Query string
top_k: Number of results (default: 3)
source: Optional source filter

get_vector_db_info: Returns database metadata including description

insert_vector_db (optional): Add new content - Enabled with allow_insert=True - Parameters: text, metadata

delete_vector_db (optional): Remove content - Enabled with allow_delete=True - Parameter: id (string or list)

Using with Agents#

Connect the toolset to an agent as a remote service:

from pantheon.agent import Agent
from pantheon.toolsets.vector_rag import VectorRAGToolSet
import asyncio

async def main():
    # Create RAG toolset
    rag_toolset = VectorRAGToolSet(
        name="bio_knowledge",
        db_path="./output/single-cell-tools",
        allow_insert=True  # Allow agent to add new knowledge
    )

    # Start toolset service
    await rag_toolset.start_service()

    # Create agent
    agent = Agent(
        name="bioinformatics_expert",
        instructions="You are a single-cell analysis expert. Use get_vector_db_info first to understand the database, then query_vector_db to find relevant information."
    )

    # Connect agent to toolset
    await agent.remote_toolset(rag_toolset.service_id)

    # Query through agent
    response = await agent.run("How do I perform trajectory inference?")

asyncio.run(main())

Programmatic Direct Usage#

Use the toolset directly without agents:

async def search_knowledge():
    rag = VectorRAGToolSet(
        name="ml_rag",
        db_path="./ml_db/ml-frameworks"
    )

    # Get database info
    info = await rag.get_vector_db_info()
    print(f"Database: {info['description']}")

    # Query the database
    results = await rag.query_vector_db(
        query="transformer architecture",
        top_k=5,
        source="pytorch"  # Filter by source
    )

    for result in results:
        print(f"Score: {result['score']}")
        print(f"Text: {result['text'][:200]}...")
        print(f"Source: {result['metadata']['source']}")
        print("---")