Web Scraping for Price Monitoring: Technology and Best Practices

Web scraping is the fundamental technology that allows Kompara to monitor thousands of competitor prices 24/7 without manual intervention. In this technical article, we explore how it works, the challenges it faces, and the best practices for implementing it ethically.

What is Web Scraping for Prices?

Web scraping (also known as web harvesting or web data extraction) is the automated process of extracting structured information from websites. In the context of price monitoring, it means:

Automated visits to competitor product pages
Extraction of data such as price, availability, and descriptions
Normalization and storage of the data in databases
Analysis and comparison to generate actionable insights

Web Scraping Architecture

┌─────────────┐
│   Scheduler │  ← Schedules visits every hour
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────┐
│      Distributed Scraper            │
│  ┌────────┐  ┌────────┐  ┌────────┐│
│  │Worker 1│  │Worker 2│  │Worker 3││  ← Multiple workers
│  └────────┘  └────────┘  └────────┘│     in parallel
└──────────────┬──────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│       Parser & Normalizer            │  ← Cleans and structures
└──────────────┬───────────────────────┘     the data
               │
               ▼
┌──────────────────────────────────────┐
│        Database (PostgreSQL)         │  ← Stores history
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│      Analytics Engine + AI           │  ← Generates insights
└──────────────────────────────────────┘

Technologies and Tech Stack

🐍 Python + Scrapy

Robust framework for large-scale scraping. Handles concurrency, automatic retries, and customizable middleware.

🌐 Selenium / Playwright

For sites with heavy JavaScript. Simulates real browsers to access dynamic content.

☁️ Cloud Infrastructure

AWS Lambda or Google Cloud Functions to scale workers on demand.

🔄 Rotating Proxies

Network of distributed IPs to avoid blocks and respect rate limits.

💾 PostgreSQL + Redis

Primary database and cache for fast queries.

📊 Apache Airflow

Pipeline orchestration and complex task scheduling.

Technical Challenges of Web Scraping

1. Detection and Anti-Bot Blocking

Modern ecommerce sites implement multiple layers of protection:

🛡️ Rate Limiting

Limits requests per IP/minute. Solution: Rotating proxies + intelligent delays.

🤖 User-Agent Detection

Detects scrapers by headers. Solution: Rotate real user agents.

🧩 CAPTCHA

Visual challenges designed so that only humans can solve them. Solution: CAPTCHA-solving services + human-like behavior.

📱 Browser Fingerprinting

Identifies clients by unique browser and device characteristics. Solution: Full browser emulation.

2. Dynamic JavaScript Content

Many modern sites load prices via JavaScript after the initial render:

// Example: Dynamically loaded price
fetch('/api/product/12345/price')
  .then(res => res.json())
  .then(data => {
    document.getElementById('price').textContent = data.price;
  });

Solution at Kompara:

Use of headless browsers (Playwright/Puppeteer)
Intelligent waiting until critical elements are loaded
API request interception when possible
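
The interception approach could look roughly like the sketch below. It assumes the Playwright Python package is installed and that the target site serves the price from a JSON endpoint whose URL contains `/price` and whose body has a `price` field, as in the JavaScript example above. The function names and the URL fragment are illustrative assumptions, not Kompara's actual implementation.

```python
import json
from typing import Optional


def parse_price_payload(body: str) -> Optional[float]:
    """Pull a numeric price out of a JSON body shaped like {"price": ...}."""
    try:
        data = json.loads(body)
        return float(data["price"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing field, or a non-numeric value
        return None


def fetch_price_via_browser(url: str, api_fragment: str = "/price") -> Optional[float]:
    """Load the page in a headless browser and read the intercepted API response."""
    # Imported lazily so the parsing helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Block until a response whose URL contains the assumed fragment arrives.
            with page.expect_response(lambda r: api_fragment in r.url) as info:
                page.goto(url)
            return parse_price_payload(info.value.text())
        finally:
            browser.close()
```

Reading the JSON response directly avoids depending on the rendered HTML, so a change in page layout does not break extraction as long as the endpoint stays stable.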

3. Variable HTML Structure

Sites change their HTML frequently, breaking selectors:

💡 Solution: Smart Selectors with Fallbacks

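def extract_price(page):
    """Return the first price that parses, trying selectors from most to least specific."""
    selectors = [
        'span.price-now',                    # Main selector
        'div.product-price span',            # Fallback 1
        '[data-price]',                      # Fallback 2
        'meta[property="og:price:amount"]',  # Fallback 3
    ]

    for selector in selectors:
        element = page.query_selector(selector)  # Playwright page; None if nothing matches
        if element is None:
            continue
        # Attribute-based fallbacks keep the price in an attribute, not in the text
        raw = (element.get_attribute('data-price')
               or element.get_attribute('content')
               or element.text_content())
        price = normalize_price(raw) if raw else None
        if price is not None:
            return price
    return None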

Ethical and Legal Web Scraping

⚠️ Important Legal Considerations

Web scraping exists in a legal gray area. At Kompara, we follow these guidelines:

✓ We respect robots.txt and robots meta tags
✓ We implement conservative rate limiting
✓ We only extract publicly available data
✓ We do not access areas requiring login
✓ We monitor only public prices and availability
✓ We do not resell or republish scraped data
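
As an illustration of the first guideline, a robots.txt check can be done with Python's standard library. This is a minimal sketch: the crawler name `ExamplePriceBot`, the sample rules, and the example URLs are assumptions made for the demonstration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExamplePriceBot"  # hypothetical crawler name

# Sample rules for the demonstration; a real crawler would download
# https://<site>/robots.txt (for example with set_url() and read()).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /checkout/
"""


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a reusable checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """True if the rules permit our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


if __name__ == "__main__":
    checker = build_robots_checker(SAMPLE_ROBOTS_TXT)
    print(is_allowed(checker, "https://shop.example.com/product/123"))
    print(is_allowed(checker, "https://shop.example.com/account/orders"))
```

Running the check before a URL enters the scraping queue keeps disallowed paths out of the pipeline entirely.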

Principles of Ethical Scraping

✅ Best Practices We Implement

Respect for bandwidth: Delays of at least 2-5 seconds between requests
Clear identification: Descriptive User-Agent with contact info
Intelligent scheduling: Higher activity during server off-peak hours
Aggressive caching: Do not repeat unnecessary requests
Error detection: Exponential backoff on 5xx errors
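
The delay and backoff practices above can be sketched as a few small helpers. The base delay of 2 seconds, the 300-second cap, and the use of random jitter are assumptions chosen for illustration, not Kompara's published settings.

```python
import random


def backoff_ceiling(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Upper bound in seconds for retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Pick a random delay below the ceiling so that workers do not retry in lockstep."""
    return random.uniform(0.0, backoff_ceiling(attempt, base, cap))


def should_retry(status_code: int) -> bool:
    """Only server-side (5xx) errors are treated as retryable here."""
    return 500 <= status_code <= 599


def polite_delay(min_seconds: float = 2.0, max_seconds: float = 5.0) -> float:
    """Random pause between two ordinary requests to the same site."""
    return random.uniform(min_seconds, max_seconds)
```

With these defaults the retry ceiling grows as 2, 4, 8, 16 seconds and so on until it reaches the cap, which gives a struggling server progressively more room to recover.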

Data Normalization and Quality

Extracting the price is just the first step. Normalization is critical:

Normalization Challenges

Diverse formats: "$1,234.56", "1.234,56 €", "R$ 1.234,56"
Additional information: "Before: $100 Now: $80"
Discounts and promotions: "50% OFF", "2x1"
Availability: "Out of stock", "Call for price", "Pre-order"
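
Before a string reaches a numeric parser, it helps to decide what kind of text it is. The following is a rough sketch of such a triage step, using the examples above as test cases. The category names, the phrase list, and the regular expressions are illustrative assumptions and would need extending for real sites.

```python
import re
from typing import Dict, List, Union

AVAILABILITY_PHRASES = ("out of stock", "call for price", "pre-order")
# Matches promotions such as "50% OFF" or "2x1"
PROMO_PATTERN = re.compile(r"\d+\s*%\s*off|\b\d+x\d+\b", re.IGNORECASE)
# Matches numeric tokens such as "100", "1,234.56" or "1.234,56"
AMOUNT_PATTERN = re.compile(r"\d[\d.,]*\d|\d")


def classify_price_text(raw: str) -> Dict[str, Union[str, List[str]]]:
    """Label a scraped string as availability, promotion, or price text."""
    text = raw.strip()
    lowered = text.lower()
    for phrase in AVAILABILITY_PHRASES:
        if phrase in lowered:
            return {"kind": "availability", "value": phrase}
    if PROMO_PATTERN.search(text):
        return {"kind": "promotion", "value": text}
    # Anything else is treated as price text; keep every amount found, in order,
    # so that "Before: $100 Now: $80" yields both the old and the new price.
    return {"kind": "price", "amounts": AMOUNT_PATTERN.findall(text)}
```

Each extracted amount is still a raw string in its regional format, so it can be passed on to a numeric normalizer one at a time.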

💡 Normalization Pipeline at Kompara

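import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)


def normalize_price(raw_price: str) -> Optional[float]:
    # 1. Basic cleaning
    clean = raw_price.strip()
    clean = re.sub(r'[^\d.,]', '', clean)  # Remove currency symbols, letters, spaces
    clean = clean.strip('.,')              # Drop stray leading/trailing separators

    # 2. Regional format detection
    if ',' in clean and '.' in clean:
        # The separator that appears last is the decimal separator
        if clean.rfind(',') > clean.rfind('.'):
            clean = clean.replace('.', '').replace(',', '.')  # "1.234,56" -> "1234.56"
        else:
            clean = clean.replace(',', '')                    # "1,234.56" -> "1234.56"
    elif ',' in clean:
        # A single comma followed by exactly two digits is treated as a European decimal
        parts = clean.split(',')
        if len(parts) == 2 and len(parts[1]) == 2:
            clean = clean.replace(',', '.')   # "12,50" -> "12.50"
        else:
            clean = clean.replace(',', '')    # "1,234" or "1,234,567": thousands separators
    elif clean.count('.') > 1:
        clean = clean.replace('.', '')        # "1.234.567": dots are thousands separators

    # 3. Conversion
    try:
        return float(clean)
    except ValueError:
        logger.warning(f"Could not parse: {raw_price}")
        return None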

Scalability and Performance

Monitoring 10M+ Daily Prices

To operate at Kompara's scale, we need:

⚡ High Concurrency

100+ workers in parallel with async/await and event loops
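
A common way to bound concurrency with async/await is a semaphore around each fetch. The sketch below assumes a `fetch` coroutine is supplied by the caller; `fake_fetch` is a stand-in that performs no network traffic, and the limit of 100 simply mirrors the figure above.

```python
import asyncio
from typing import Awaitable, Callable, List


async def scrape_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_workers: int = 100,
) -> List[str]:
    """Fetch every URL with at most `max_workers` requests in flight at once."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:  # waits here while max_workers fetches are running
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP client call; returns a dummy page."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    demo_urls = [f"https://shop.example.com/p/{i}" for i in range(5)]
    print(asyncio.run(scrape_all(demo_urls, fake_fetch, max_workers=2)))
```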

🔄 Smart Scheduling

Dynamic prioritization: popular products are scraped more frequently
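
One possible shape for this kind of prioritization is a priority queue keyed on the next due time, where popularity shortens the revisit interval. In this sketch the popularity score in [0, 1], the 15-minute minimum, and the 24-hour maximum are all assumed values.

```python
import heapq
from typing import List, Tuple


def interval_for(popularity: float, min_s: int = 900, max_s: int = 86400) -> int:
    """Map popularity in [0, 1] to a revisit interval in seconds (popular = shorter)."""
    popularity = max(0.0, min(1.0, popularity))
    return int(max_s - popularity * (max_s - min_s))


class ScrapeScheduler:
    """Min-heap of (next_due_time, product_id, popularity)."""

    def __init__(self) -> None:
        self._queue: List[Tuple[float, str, float]] = []

    def add(self, product_id: str, popularity: float, now: float = 0.0) -> None:
        # New products are due immediately
        heapq.heappush(self._queue, (now, product_id, popularity))

    def pop_due(self, now: float) -> List[str]:
        """Return products due at `now` and reschedule each by its own interval."""
        due: List[str] = []
        while self._queue and self._queue[0][0] <= now:
            _, product_id, popularity = heapq.heappop(self._queue)
            due.append(product_id)
            next_due = now + interval_for(popularity)
            heapq.heappush(self._queue, (next_due, product_id, popularity))
        return due
```

A product with popularity 1.0 comes back every 15 minutes, while one with popularity 0.0 is revisited once a day.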

💾 Intelligent Caching

Redis to avoid rescraping unchanged pages
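
One way to recognize unchanged pages is to store a fingerprint of each page's content and compare it on the next visit. The sketch below uses an in-memory dictionary as a stand-in for Redis, and the class and method names are hypothetical. With a real Redis client, the dictionary reads and writes would become get and set calls on a key per URL.

```python
import hashlib
from typing import Dict


class PageChangeCache:
    """In-memory stand-in for a Redis mapping of URL -> content fingerprint."""

    def __init__(self) -> None:
        self._fingerprints: Dict[str, str] = {}

    def has_changed(self, url: str, html: str) -> bool:
        """Return True and remember the page if its content differs from last time."""
        fingerprint = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if self._fingerprints.get(url) == fingerprint:
            return False  # identical content: parsing and storage can be skipped
        self._fingerprints[url] = fingerprint
        return True
```

Hashing the whole page is the simplest option, but pages with rotating banners or timestamps will always look changed; hashing only the price block is a common refinement.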

📊 Real-Time Monitoring

Success/failure metrics per worker and site

Change Detection and Alerts

Scraping generates value when it detects significant changes:

class PriceChangeDetector:
    def analyze_change(self, old_price, new_price, product_id):
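        # Guard against a missing or zero baseline, which would divide by zero
        change_pct = ((new_price - old_price) / old_price) * 100 if old_price else 0.0

        # Drastic change (more than 10% in either direction)
        if abs(change_pct) > 10: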
            self.trigger_alert(
                type='drastic_change',
                product_id=product_id,
                old=old_price,
                new=new_price,
                change_pct=change_pct
            )
        
        # Leadership lost
        if self.was_cheapest(product_id) and not self.is_cheapest(product_id):
            self.trigger_alert(
                type='leadership_lost',
                product_id=product_id,
                competitor=self.get_new_leader(product_id)
            )

Future: Machine Learning in Scraping

The next evolution includes:

Auto-adaptation of selectors: ML to detect prices without manual selectors
Anomaly detection: Automatically identify incorrect data
Change prediction: Anticipate when a competitor will change prices
Intelligent prioritization: Decide which products to scrape more frequently
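
For comparison, anomaly detection does not have to start with machine learning. A simple statistical baseline, shown here as a sketch, flags a price that is far from the median of the product's recent history. The function name and the threshold ratio of 3 are assumptions, and this is not a description of Kompara's planned models.

```python
from statistics import median
from typing import Sequence


def is_price_anomaly(
    history: Sequence[float],
    new_price: float,
    max_ratio: float = 3.0,
) -> bool:
    """Flag prices that are non-positive or more than `max_ratio` times away from the median."""
    if new_price <= 0:
        return True   # a zero or negative price is almost certainly a parsing error
    if not history:
        return False  # no baseline yet, so nothing to compare against
    typical = median(history)
    if typical <= 0:
        return True   # a corrupted baseline cannot vouch for the new price
    ratio = new_price / typical
    return ratio > max_ratio or ratio < 1 / max_ratio
```

A check like this catches typical extraction errors, such as a misplaced decimal separator turning 99.00 into 9900, before they trigger false alerts.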

Want to see our scraping in action?

Request a demo and we'll show you how we monitor your competition in real time.


📚 Additional Resources

Open Source Tools:

Scrapy Framework - Python framework for scraping
Playwright - Browser automation
Beautiful Soup - HTML/XML parser
