RAG System (Auto-Build)#
The RAG System provides tools for automatically building vector databases from various sources including documentation websites, GitHub repositories, and PDF files. It features web crawling, document processing, and integration with Hugging Face for database distribution.
Overview#
The RAG auto-build system (pantheon.toolsets.utils.rag) provides:
Automated Web Crawling: Deep crawl documentation sites and extract content
Multi-Source Support: Build from websites, GitHub READMEs, PDFs, and local files
Vector Database Creation: Automatic chunking and embedding with LanceDB
Hugging Face Integration: Upload and download pre-built databases
Caching System: Intelligent caching for embeddings and build progress
YAML Configuration: Define sources and parameters in YAML files
YAML Configuration Format#
The YAML configuration file defines one or more vector databases with their sources:
Basic Structure#
database_name:
  description: Description of the database
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 4000
    chunk_overlap: 200
  items:
    item_name:
      type: source_type
      url: source_url
      description: Item description
Supported Source Types#
package documentation: Deep crawls documentation websites
tutorial: Processes tutorial websites with multi-page content
github readme: Fetches README files from GitHub repositories
pdf: Downloads and processes PDF documents
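Before starting a long build, it can help to check a configuration against the structure and source types listed above. The following is a minimal sketch, not part of the pantheon package: `validate_db_config` is a hypothetical helper, and it assumes that `type`, `url`, and `description` are all required for each item and that `chunk_overlap` must be smaller than `chunk_size`.

```python
# Hypothetical pre-flight check for one database entry of the YAML config,
# passed in as an already-parsed dictionary.
SUPPORTED_TYPES = {"package documentation", "tutorial", "github readme", "pdf"}
REQUIRED_ITEM_KEYS = ("type", "url", "description")  # assumption: all three are required


def validate_db_config(name, config):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if config.get("type") != "vector_db":
        problems.append(f"{name}: type must be 'vector_db'")

    params = config.get("parameters", {})
    size = params.get("chunk_size", 0)
    overlap = params.get("chunk_overlap", 0)
    if size <= 0:
        problems.append(f"{name}: chunk_size must be positive")
    elif not 0 <= overlap < size:
        problems.append(f"{name}: chunk_overlap must be in [0, chunk_size)")

    for item_name, item in config.get("items", {}).items():
        for key in REQUIRED_ITEM_KEYS:
            if key not in item:
                problems.append(f"{name}.{item_name}: missing '{key}'")
        if "type" in item and item["type"] not in SUPPORTED_TYPES:
            problems.append(
                f"{name}.{item_name}: unsupported type {item['type']!r}"
            )
    return problems
```

A dictionary loaded from the YAML file can be passed one database at a time, for example `validate_db_config("single-cell-python-packages", cfg["single-cell-python-packages"])`.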
Example Configuration#
single-cell-python-packages:
  description: Vector database of single-cell python packages documentation
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 4000
    chunk_overlap: 200
  items:
    scanpy:
      type: package documentation
      url: https://scanpy.readthedocs.io/en/stable/
      description: Scanpy toolkit for single-cell analysis
    sc-best-practices:
      type: tutorial
      url: https://www.sc-best-practices.org/
      description: Best practices guide for single-cell analysis
    star:
      type: pdf
      url: https://raw.githubusercontent.com/alexdobin/STAR/master/doc/STARmanual.pdf
      description: STAR RNA-seq aligner manual
Command Line Usage#
Build from YAML Configuration#
Build the database:
python -m pantheon.toolsets.utils.rag build config.yaml ./output_dir
Upload to Hugging Face#
Share your built database:
# Set your Hugging Face token
export HUGGINGFACE_TOKEN=your_token_here
# Upload to default repo (NaNg/pantheon_rag_db)
python -m pantheon.toolsets.utils.rag upload ./output_dir
# Or specify custom repo
python -m pantheon.toolsets.utils.rag upload ./output_dir --repo-id your-username/your-repo
Download Pre-built Database#
Download existing databases:
# Download from default repo
python -m pantheon.toolsets.utils.rag download ./local_dir
# Download from custom repo
python -m pantheon.toolsets.utils.rag download ./local_dir --repo-id username/repo --filename custom.zip
Build Process Details#
Directory Structure#
After building, the output directory contains:
output_dir/
└── database_name/
    ├── metadata.yaml        # Database configuration
    ├── info_cache.json      # Build progress tracking
    ├── raw/                 # Downloaded raw content
    │   └── item_name/
    │       ├── *.md         # Markdown files
    │       └── *.pdf        # PDF files
    └── lancedb/             # Vector database files
        └── database_name.lance/
Processing Pipeline#
Download Phase:
- For documentation/tutorials: Deep crawl with configurable depth
- For GitHub READMEs: Direct file download
- For PDFs: Binary file download
Content Extraction:
- HTML → Markdown conversion for web content
- PDF text extraction using PyMuPDF
- Duplicate removal via content hashing
Text Processing:
- Smart chunking with configurable size and overlap
- Metadata preservation (source, URL)
- Context maintenance across chunks
Vector Storage:
- Embedding generation via OpenAI API
- LanceDB storage with PyArrow schema
- Metadata indexing for filtering
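The role of `chunk_size` and `chunk_overlap` in the text-processing step can be illustrated with a plain sliding window. This is a simplified sketch, not the package's chunker: `chunk_text` is a hypothetical function, it assumes sizes are measured in characters, and it ignores the sentence and heading boundaries that a "smart" chunker would likely respect.

```python
def chunk_text(text, chunk_size=4000, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the default values, each chunk starts 3800 characters after the previous one, which is why a larger overlap increases the number of chunks and therefore the number of embedding calls.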
Progress Tracking#
The system maintains build state in info_cache.json:
{
  "item_name": {
    "success": true,
    "created_at": "2024-01-01T12:00:00",
    "download_success": true
  }
}
Key behaviors:
Successfully cached items (success: true) are skipped on subsequent builds
Failed items can be retried without re-processing successful ones
To force re-download: Set download_success to false in info_cache.json
To force complete rebuild: Set success to false in info_cache.json
Then re-run the build command to process the modified items
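Editing info_cache.json by hand works, but the same change can be scripted. The sketch below assumes the file has exactly the layout shown above (one object per item with `success` and `download_success` flags); `reset_item` is a hypothetical helper and is not provided by the package.

```python
import json
from pathlib import Path

RESETTABLE_FIELDS = ("success", "download_success")


def reset_item(cache_path, item_name, field="success"):
    """Set one status flag of one item to False and save the cache file.

    field="download_success" requests a re-download;
    field="success" requests a complete rebuild of the item.
    """
    if field not in RESETTABLE_FIELDS:
        raise ValueError(f"field must be one of {RESETTABLE_FIELDS}")
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8"))
    if item_name not in cache:
        raise KeyError(f"{item_name!r} not found in {path}")
    cache[item_name][field] = False
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache[item_name]
```

For example, `reset_item("./output_dir/database_name/info_cache.json", "scanpy", field="download_success")` followed by the usual build command would reprocess only that item.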
Programmatic Usage#
Using VectorDB Class#
Direct database operations:
import asyncio
from pantheon.toolsets.utils.rag.vectordb import VectorDB

async def main():
    # Load existing database
    db = VectorDB("./output_dir/database_name")
    # Query the database
    results = await db.query(
        query="How to perform clustering analysis?",
        top_k=5,
        source="scanpy",  # Optional: filter by source
    )
    # Insert new content
    await db.insert(
        text="Your new content here",
        metadata={"source": "custom", "date": "2024-01-01"},
    )
    # Insert from file with automatic chunking
    await db.insert_from_file(
        file_path="./new_doc.md",
        metadata={"source": "local_docs"},
    )

# The methods are coroutines, so they must be awaited inside an event loop
asyncio.run(main())
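Conceptually, a query with `top_k` and `source` filters rows by metadata and then ranks the remaining vectors by similarity to the query embedding. The sketch below shows that idea in pure Python with cosine similarity. It is only an illustration: the real search is delegated to LanceDB, whose distance metric and indexing may differ, and `cosine` and `top_k_rows` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_rows(query_vec, rows, k=5, source=None):
    """Return the k rows most similar to query_vec, optionally limited to one source."""
    candidates = [
        row for row in rows
        if source is None or row["metadata"].get("source") == source
    ]
    ranked = sorted(
        candidates,
        key=lambda row: cosine(query_vec, row["vector"]),
        reverse=True,
    )
    return ranked[:k]
```

Filtering before ranking means a `source` filter can return fewer than `k` results when that source contributes only a few chunks.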
Building Programmatically#
Build databases from Python code:
import asyncio
from pantheon.toolsets.utils.rag.build import build_vector_db

db_config = {
    "type": "vector_db",
    "parameters": {
        "embedding_model": "text-embedding-3-large",
        "chunk_size": 4000,
        "chunk_overlap": 200
    },
    "items": {
        "my_docs": {
            "type": "package documentation",
            "url": "https://mydocs.example.com",
            "description": "My documentation"
        },
        "manual": {
            "type": "pdf",
            "url": "https://example.com/manual.pdf",
            "description": "User manual"
        }
    }
}

asyncio.run(build_vector_db("my_knowledge_base", db_config, "./output"))
Full Build Example#
Complete workflow from YAML to usage:
import asyncio
from pantheon.toolsets.utils.rag.build import build_all
# Build all databases defined in YAML
asyncio.run(build_all("./config.yaml", "./output"))
Special Features#
PDF Support#
The system can process PDF documents:
items:
  research_paper:
    type: pdf
    url: https://arxiv.org/pdf/2301.00001.pdf
    description: Research paper on transformers
PDFs are:
Downloaded as binary files
Text extracted using PyMuPDF
Processed like other text documents
Stored with original URL metadata
Web Crawling Configuration#
For documentation and tutorial types:
max_depth: Controls crawling depth (default: 1)
include_external: Include external links (default: false)
Uses BFS (Breadth-First Search) strategy
Automatic markdown extraction from HTML
Removes duplicate content via SHA-256 hashing
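The combination of breadth-first crawling, a depth limit, and hash-based duplicate removal can be sketched as follows. This is an illustration under stated assumptions, not the package's crawler: `bfs_crawl` and the `fetch` callable are hypothetical, the start page is counted as depth 0, and the `include_external` filter is left out for brevity.

```python
import hashlib
from collections import deque


def bfs_crawl(start_url, fetch, max_depth=1):
    """Crawl breadth-first from start_url, following links up to max_depth.

    fetch(url) is a caller-supplied function returning (content, links).
    Pages whose content hashes to an already-seen SHA-256 digest are dropped.
    """
    seen_urls = {start_url}
    seen_hashes = set()
    pages = []
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()  # FIFO order gives breadth-first traversal
        content, links = fetch(url)

        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            pages.append((url, content))

        if depth < max_depth:
            for link in links:
                if link not in seen_urls:
                    seen_urls.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Because `fetch` is injected, the traversal can be tried on an in-memory dictionary of pages before pointing it at a real site.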
Embedding Models#
Supported OpenAI embedding models:
text-embedding-3-large: Best quality, 3072 dimensions
text-embedding-3-small: Faster, 1536 dimensions
text-embedding-ada-002: Legacy model, 1536 dimensions
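The dimension counts above translate directly into storage requirements. As a rough back-of-the-envelope sketch, assuming each vector component is stored as a 4-byte float and ignoring text, metadata, and index overhead, the raw vector size can be estimated like this (`raw_vector_bytes` is a hypothetical helper):

```python
# Dimensions as listed for each supported model.
EMBEDDING_DIMS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}


def raw_vector_bytes(model, n_chunks, bytes_per_value=4):
    """Estimate bytes needed for the embedding vectors alone (float32 by default)."""
    return EMBEDDING_DIMS[model] * bytes_per_value * n_chunks
```

Under these assumptions, 1000 chunks embedded with text-embedding-3-large need about 12.3 MB for vectors, and half that with text-embedding-3-small.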
Caching System#
Two-level caching for efficiency:
Embedding Cache: Disk-based cache prevents redundant API calls
Build Cache: info_cache.json tracks processing status
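A disk-based embedding cache of the kind described above can be sketched as a lookup keyed by a hash of the model name and the text. The class below is a hypothetical illustration, not the package's cache: the file layout (one JSON file per entry) and the `embed_fn(model, text)` callable are assumptions.

```python
import hashlib
import json
from pathlib import Path


class DiskEmbeddingCache:
    """Store each embedding in its own JSON file, keyed by model and text."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.embed_fn = embed_fn  # called only on a cache miss

    def _path(self, model, text):
        # Including the model in the key keeps vectors from different models apart.
        key = hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get(self, model, text):
        path = self._path(model, text)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        vector = self.embed_fn(model, text)
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector
```

Because the key depends only on the model and the chunk text, rebuilding a database whose sources have not changed would hit the cache for every chunk and make no API calls.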
Hugging Face Integration#
Upload and Distribution#
The system integrates with Hugging Face for sharing databases:
Packaging: Creates ZIP archive of entire database
Upload: Pushes to Hugging Face dataset repository
Versioning: Uploaded as latest.zip by default
# Upload with authentication
export HUGGINGFACE_TOKEN=hf_xxxxx
python -m pantheon.toolsets.utils.rag upload ./my_db --repo-id myorg/my-rag-db
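The packaging step amounts to zipping the database directory before it is pushed to the repository. A minimal sketch using only the standard library is shown below. `package_database` is a hypothetical helper and may not match how the upload command builds its archive; only the default archive name latest.zip is taken from the description above.

```python
import shutil
from pathlib import Path


def package_database(db_dir, out_dir, archive_name="latest"):
    """Zip the contents of db_dir into out_dir/<archive_name>.zip and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # make_archive appends the ".zip" extension and returns the archive path.
    archive = shutil.make_archive(str(out / archive_name), "zip", root_dir=str(db_dir))
    return Path(archive)
```

Writing the archive to a directory outside `db_dir` avoids including a partially written ZIP file in itself.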
Download and Use#
Download pre-built databases:
# Download and extract
python -m pantheon.toolsets.utils.rag download ./local_db --repo-id myorg/my-rag-db
# Use immediately
from pantheon.toolsets.utils.rag.vectordb import VectorDB
db = VectorDB("./local_db/database_name")
Real-World Examples#
Single-Cell Analysis Knowledge Base#
Build a comprehensive single-cell analysis database:
# single_cell.yaml
single-cell-tools:
  description: Comprehensive single-cell analysis tools documentation
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 4000
    chunk_overlap: 200
  items:
    scanpy:
      type: package documentation
      url: https://scanpy.readthedocs.io/en/stable/
      description: Core single-cell analysis toolkit
    scvi-tools:
      type: package documentation
      url: https://docs.scvi-tools.org/en/stable/
      description: Deep learning for single-cell
    cellranger:
      type: tutorial
      url: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct
      description: 10x Genomics Cell Ranger tutorial
Machine Learning Documentation#
Create an ML framework knowledge base:
# ml_docs.yaml
ml-frameworks:
  description: Machine learning frameworks documentation
  type: vector_db
  parameters:
    embedding_model: text-embedding-3-large
    chunk_size: 3000
    chunk_overlap: 300
  items:
    pytorch:
      type: package documentation
      url: https://pytorch.org/docs/stable/
      description: PyTorch deep learning framework
    tensorflow:
      type: package documentation
      url: https://www.tensorflow.org/api_docs
      description: TensorFlow ML platform
    papers:
      type: pdf
      url: https://arxiv.org/pdf/1706.03762.pdf
      description: Attention Is All You Need paper
Best Practices#
Choose Appropriate Chunk Sizes:
- Larger chunks (4000-5000) for narrative content
- Smaller chunks (1000-2000) for API references
- Balance between context and precision
Optimize Embedding Models:
- Use text-embedding-3-large for best quality
- Consider text-embedding-3-small for large datasets
- Monitor API costs
Organize Sources Logically:
- Group related documentation in same database
- Use descriptive item names
- Add clear descriptions for each source
Handle Build Failures:
- Check info_cache.json for error details
- Failed items can be retried individually
- Network issues don't affect completed items
Use Hugging Face for Distribution:
- Share pre-built databases to save compute
- Version control via repository tags
- Collaborate on knowledge bases
Regular Updates:
- Rebuild periodically for fresh content
- Use caching to minimize redundant work
- Track changes via git
Troubleshooting#
Common Issues#
Build Failures:
Check network connectivity for web crawling
Verify URLs are accessible (not behind authentication)
Review error messages in info_cache.json
Ensure sufficient disk space
PDF Processing Errors:
Verify PDF URL is directly accessible
Check if PDF requires authentication
Some PDFs may have text extraction restrictions
Install PyMuPDF:
pip install pymupdf
Embedding Errors:
Ensure OPENAI_API_KEY is set correctly
Check API rate limits and quotas
Verify model name is correct
Monitor token usage
Storage Issues:
Ensure sufficient disk space (databases can be large)
Check write permissions on output directory
Clean old cache files if needed
Verify LanceDB installation
Performance Tips#
Use embedding cache to avoid redundant API calls
Enable progress tracking to resume interrupted builds
Process sources in parallel when system allows
Consider smaller embedding models for very large datasets
Use local caching for frequently accessed databases
Integration with VectorRAGToolSet#
The VectorRAGToolSet class provides a toolset interface for agents to interact with the built databases.
Basic Usage#
Create a VectorRAGToolSet with your built database:
from pantheon.toolsets.vector_rag import VectorRAGToolSet

# Initialize with database path
rag_toolset = VectorRAGToolSet(
    name="knowledge_assistant",
    db_path="./output/single-cell-tools"
)
# The toolset provides these tools:
# - query_vector_db: Query the database with optional source filtering
# - get_vector_db_info: Get metadata about the database
Configuration Parameters#
Full configuration options:
rag_toolset = VectorRAGToolSet(
    name="custom_rag",
    db_path="./output/ml-frameworks",
    worker_params=None,   # Optional worker configuration
    allow_insert=False,   # Enable insert_vector_db tool
    allow_delete=False,   # Enable delete_vector_db tool
    db_params={}          # Additional database parameters
)
Available Tools#
query_vector_db: Main query interface
- query: Query string
- top_k: Number of results (default: 3)
- source: Optional source filter
get_vector_db_info: Returns database metadata including description
insert_vector_db (optional): Add new content
- Enabled with allow_insert=True
- Parameters: text, metadata
delete_vector_db (optional): Remove content
- Enabled with allow_delete=True
- Parameter: id (string or list)
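The relationship between the `allow_insert` and `allow_delete` flags and the tools an agent can see can be summarized in a few lines. This is a conceptual sketch only: `exposed_tools` and `normalize_ids` are hypothetical helpers, and the toolset's real registration logic is not shown in this document.

```python
def exposed_tools(allow_insert=False, allow_delete=False):
    """List the tool names available for a given flag combination."""
    tools = ["query_vector_db", "get_vector_db_info"]  # always available
    if allow_insert:
        tools.append("insert_vector_db")
    if allow_delete:
        tools.append("delete_vector_db")
    return tools


def normalize_ids(id):
    """Accept a single id string or a list of ids and always return a list."""
    return [id] if isinstance(id, str) else list(id)
```

Keeping both flags off by default gives agents read-only access, which is the safer choice for shared or pre-built databases.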
Using with Agents#
Connect the toolset to an agent as a remote service:
from pantheon.agent import Agent
from pantheon.toolsets.vector_rag import VectorRAGToolSet
import asyncio

async def main():
    # Create RAG toolset
    rag_toolset = VectorRAGToolSet(
        name="bio_knowledge",
        db_path="./output/single-cell-tools",
        allow_insert=True  # Allow agent to add new knowledge
    )
    # Start toolset service
    await rag_toolset.start_service()
    # Create agent
    agent = Agent(
        name="bioinformatics_expert",
        instructions="You are a single-cell analysis expert. Use get_vector_db_info first to understand the database, then query_vector_db to find relevant information."
    )
    # Connect agent to toolset
    await agent.remote_toolset(rag_toolset.service_id)
    # Query through agent
    response = await agent.run("How do I perform trajectory inference?")

asyncio.run(main())
Programmatic Direct Usage#
Use the toolset directly without agents:
import asyncio
from pantheon.toolsets.vector_rag import VectorRAGToolSet

async def search_knowledge():
    rag = VectorRAGToolSet(
        name="ml_rag",
        db_path="./ml_db/ml-frameworks"
    )
    # Get database info
    info = await rag.get_vector_db_info()
    print(f"Database: {info['description']}")
    # Query the database
    results = await rag.query_vector_db(
        query="transformer architecture",
        top_k=5,
        source="pytorch"  # Filter by source
    )
    for result in results:
        print(f"Score: {result['score']}")
        print(f"Text: {result['text'][:200]}...")
        print(f"Source: {result['metadata']['source']}")
        print("---")

# Run the coroutine in an event loop
asyncio.run(search_knowledge())