Scraper API

Scraper API#

The Scraper API toolset provides web scraping capabilities through ScraperAPI, enabling Google search and web page fetching with built-in proxy rotation and anti-bot measures.

Overview#

The ScraperToolSet provides two main tools:

google_search: Perform Google searches via ScraperAPI
fetch_web_page: Fetch and convert web pages to various formats

Key features:

Proxy Rotation: Automatic IP rotation via ScraperAPI
Anti-Bot Protection: Built-in handling of anti-scraping measures
Format Conversion: Output as markdown, HTML, or plain text
Concurrent Fetching: Fetch multiple URLs in parallel
Rate Limit Handling: Automatic retry and rate limit management

Prerequisites#

To use this toolset, you need:

A ScraperAPI account and API key
Set the environment variable: SCRAPER_API_KEY

Basic Usage#

Setting Up Scraper Toolset#

from pantheon.toolsets.scraper import ScraperToolSet
from pantheon.toolsets.utils.toolset import run_toolsets
from pantheon.agent import Agent

# Create scraper toolset
scraper_tools = ScraperToolSet("scraper")

# Run as service
async with run_toolsets([scraper_tools]):
    # Create agent with scraping capabilities
    agent = Agent(
        name="web_scraper",
        instructions="Search the web and extract information from websites."
    )

    # Connect to scraper toolset
    await agent.remote_toolset(service_name="scraper")

Command Line Deployment#

Run the scraper toolset from command line:

# Set API key
export SCRAPER_API_KEY=your_api_key_here

# Run toolset
python -m pantheon.toolsets.scraper --service-name scraper

Google Search#

# Perform Google search
response = await agent.run([{
    "role": "user",
    "content": "Search Google for recent news about artificial intelligence"
}])

# Agent uses google_search tool with parameters:
# - query: "recent news about artificial intelligence"
# - max_results: 10
# - country: "us"
# - language: "en"

Web Page Fetching#

# Fetch single page
response = await agent.run([{
    "role": "user",
    "content": "Fetch the content from https://example.com/article"
}])

# Fetch multiple pages
response = await agent.run([{
    "role": "user",
    "content": "Fetch content from these 3 URLs and summarize each"
}])

Tool Parameters#

google_search#

query (str): The search query
max_results (int): Maximum number of results (default: 10)
country (str): Country code for search (default: “us”)
language (str): Language code for search (default: “en”)
timeout (float): Request timeout in seconds (default: 30.0)

fetch_web_page#

urls (List[str]): List of URLs to fetch
output_format (str): Output format - “markdown”, “html”, or “text” (default: “markdown”)
timeout (float): Timeout per request in seconds (default: 30.0)

Common Use Cases#

Research Assistant#

researcher = Agent(
    name="research_assistant",
    instructions="""Research topics by:
    1. Search Google for relevant information
    2. Fetch top results
    3. Extract and summarize key points
    4. Provide sources"""
)

response = await researcher.run([{
    "role": "user",
    "content": "Research the latest developments in quantum computing"
}])

News Aggregator#

news_agent = Agent(
    name="news_aggregator",
    instructions="""Aggregate news on specific topics:
    1. Search for recent news articles
    2. Fetch article content
    3. Summarize key stories
    4. Identify trends"""
)

# Daily news briefing
response = await news_agent.run([{
    "role": "user",
    "content": "Create a daily tech news briefing"
}])

Content Monitor#

monitor_agent = Agent(
    name="content_monitor",
    instructions="""Monitor web content:
    1. Fetch specified URLs
    2. Compare with previous versions
    3. Identify changes
    4. Alert on significant updates"""
)

# Monitor competitor websites
urls_to_monitor = [
    "https://competitor1.com/pricing",
    "https://competitor2.com/features"
]

Advanced Patterns#

Batch Processing#

batch_scraper = Agent(
    name="batch_processor",
    instructions="Process multiple URLs efficiently"
)

# Process URLs in batches
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
response = await batch_scraper.run([{
    "role": "user",
    "content": f"Fetch and analyze these URLs: {urls}"
}])

Search and Analyze#

analyst = Agent(
    name="search_analyst",
    instructions="""For each search:
    1. Search Google for the topic
    2. Fetch top 5 results
    3. Extract key information
    4. Synthesize findings
    5. Provide analysis"""
)

# Comprehensive analysis
response = await analyst.run([{
    "role": "user",
    "content": "Analyze market trends for electric vehicles in 2024"
}])

Multi-Language Support#

multilingual_agent = Agent(
    name="multilingual_scraper",
    instructions="Search and fetch content in multiple languages"
)

# Search in different languages/countries
searches = [
    {"query": "technologie IA", "country": "fr", "language": "fr"},
    {"query": "KI Technologie", "country": "de", "language": "de"},
    {"query": "AI technology", "country": "us", "language": "en"}
]

Error Handling#

The toolset includes built-in error handling:

Missing API Key: Raises ValueError if SCRAPER_API_KEY not set
HTTP Errors: Logs errors and returns empty content for failed URLs
Timeout Handling: Configurable timeout with graceful failure
Batch Resilience: Failed URLs don’t stop processing of others

Best Practices#

API Key Security: Store API key in environment variables, not code
Rate Limiting: Be mindful of ScraperAPI rate limits
Error Recovery: Check for empty results and handle gracefully
Batch Efficiency: Use batch fetching for multiple URLs
Format Selection: Choose appropriate output format for your needs
Timeout Configuration: Adjust timeout based on target site responsiveness

Integration Examples#

With Data Analysis#

# Combine with Python toolset for analysis
data_scraper = Agent(
    name="data_analyst",
    instructions="Scrape web data and analyze with Python"
)
await data_scraper.remote_toolset(service_name="scraper")
await data_scraper.remote_toolset(service_name="python")

With File Storage#

# Save scraped content to files
archiver = Agent(
    name="web_archiver",
    instructions="Scrape and archive web content",
    tools=[write_file]
)
await archiver.remote_toolset(service_name="scraper")

Performance Tips#

Use markdown format for better structure extraction
Batch URLs to reduce API calls
Set appropriate timeouts for slow sites
Cache results when appropriate
Use country/language parameters for localized results

Limitations#

Requires valid ScraperAPI subscription
Subject to ScraperAPI rate limits
Cannot handle sites requiring authentication
No JavaScript interaction capabilities
Limited to ScraperAPI supported features