
Web Scraping for Price Monitoring

How the technology behind automatic real-time competitor price monitoring works

📅 Updated: November 2025 ⏱️ Read time: 8 minutes 🔧 Level: Technical

Web scraping is the fundamental technology that allows Kompara to monitor thousands of competitor prices 24/7 without manual intervention. In this technical article, we explore how it works, the challenges it faces, and the best practices for implementing it ethically.

What is Web Scraping for Prices?

Web scraping (also known as web harvesting or web data extraction) is the automated process of extracting structured information from websites. In the context of price monitoring, it means automatically collecting competitor prices from their product pages.

Web Scraping Architecture

┌─────────────┐
│  Scheduler  │  ← Schedules visits every hour
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────────┐
│         Distributed Scraper          │
│  ┌────────┐  ┌────────┐  ┌────────┐  │
│  │Worker 1│  │Worker 2│  │Worker 3│  │  ← Multiple workers
│  └────────┘  └────────┘  └────────┘  │     in parallel
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│       Parser & Normalizer            │  ← Cleans and structures
└──────────────┬───────────────────────┘     the data
               │
               ▼
┌──────────────────────────────────────┐
│        Database (PostgreSQL)         │  ← Stores history
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│      Analytics Engine + AI           │  ← Generates insights
└──────────────────────────────────────┘

Technologies and Tech Stack

🐍 Python + Scrapy

Robust framework for large-scale scraping. Handles concurrency, automatic retries, and customizable middleware.

🌐 Selenium / Playwright

For JavaScript-heavy sites. Drives a real browser to access dynamically rendered content.

☁️ Cloud Infrastructure

AWS Lambda or Google Cloud Functions to scale workers on demand.

🔄 Rotating Proxies

Network of distributed IPs to avoid blocks and respect rate limits.

💾 PostgreSQL + Redis

Primary database and cache for fast queries.

📊 Apache Airflow

Pipeline orchestration and complex task scheduling.

Technical Challenges of Web Scraping

1. Detection and Anti-Bot Blocking

Modern ecommerce sites implement multiple layers of protection:

πŸ›‘οΈ Rate Limiting

Limits requests per IP/minute. Solution: Rotating proxies + intelligent delays.

🤖 User-Agent Detection

Detects scrapers by headers. Solution: Rotate real user agents.

🧩 CAPTCHA

Visual challenges designed to tell humans from bots. Solution: CAPTCHA-solving services + human-like behavior.

📱 Browser Fingerprinting

Identifies clients by unique browser and device characteristics. Solution: Full browser emulation.

2. Dynamic JavaScript Content

Many modern sites load prices via JavaScript after the initial render:

// Example: Dynamically loaded price
fetch('/api/product/12345/price')
  .then(res => res.json())
  .then(data => {
    document.getElementById('price').textContent = data.price;
  });
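
When a price arrives through a JSON endpoint like the one above, one common option is to request that endpoint directly instead of rendering the whole page. The following is a minimal sketch, not Kompara's implementation: `fetch_price`, `http_get`, and the `BASE_URL` value are hypothetical, and it assumes the response body has the `{"price": ...}` shape implied by the snippet above.

```python
import json
from typing import Callable, Optional

# Hypothetical base URL; the endpoint path mirrors the snippet above.
BASE_URL = "https://shop.example.com"


def fetch_price(product_id: str, http_get: Callable[[str], str]) -> Optional[float]:
    """Request the price endpoint directly and return the price, or None.

    `http_get` is any function that takes a URL and returns the response
    body as text. It is injected so the sketch does not depend on a
    particular HTTP library.
    """
    url = f"{BASE_URL}/api/product/{product_id}/price"
    try:
        payload = json.loads(http_get(url))
        return float(payload["price"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing "price" key, or a non-numeric value.
        return None
```

In practice `http_get` would wrap a real HTTP client. When a site exposes no such endpoint, rendering the page in a browser remains the alternative.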

Solution at Kompara:

3. Variable HTML Structure

Sites change their HTML frequently, breaking selectors:

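💡 Solution: Smart Selectors with Fallbacks

# Ordered from most to least specific; the first match wins.
SELECTORS = [
    'span.price-now',             # Main selector
    'div.product-price span',     # Fallback 1
    '[data-price]',               # Fallback 2
    'meta[property="og:price"]',  # Fallback 3
]

def extract_price(page):
    # Assumes `page` is a Playwright page object
    for selector in SELECTORS:
        element = page.query_selector(selector)
        if element is None:
            continue
        # Fallbacks 2 and 3 carry the price in an attribute, not in the text
        raw = (element.get_attribute('data-price')
               or element.get_attribute('content')
               or element.text_content())
        if raw:
            return normalize_price(raw)
    return None  # No selector matched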

Ethical and Legal Web Scraping

⚠️ Important Legal Considerations

Web scraping exists in a legal gray area. At Kompara, we follow these guidelines:

Principles of Ethical Scraping

✅ Best Practices We Implement

  1. Respect for bandwidth: Delays between requests (2-5 seconds minimum)
  2. Clear identification: Descriptive User-Agent with contact info
  3. Intelligent scheduling: Higher activity during server off-peak hours
  4. Aggressive caching: Do not repeat unnecessary requests
  5. Error detection: Exponential backoff on 5xx errors
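
Practices 1 and 5 can be sketched together. This is a minimal sketch under stated assumptions: the function names and the base, cap, and retry-count values are illustrative, not Kompara's actual settings. Only the 2-5 second delay range and the backoff on 5xx responses come from the list above.

```python
import random
import time
from typing import Callable, List


def polite_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Pick a random pause (in seconds) to wait between two requests."""
    return random.uniform(min_s, max_s)


def backoff_delays(retries: int, base_s: float = 1.0, cap_s: float = 60.0) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at cap_s."""
    return [min(cap_s, base_s * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_backoff(fetch: Callable[[], int], retries: int = 4,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `fetch` (which returns an HTTP status code) and retry only on 5xx."""
    status = fetch()
    for delay in backoff_delays(retries):
        if status < 500:
            break
        sleep(delay)  # wait longer after each consecutive server error
        status = fetch()
    return status
```

The `sleep` parameter is injectable so the retry logic can be exercised without real waiting. Production schedulers often add random jitter to each backoff delay so that many workers do not retry at the same instant.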

Data Normalization and Quality

Extracting the price is just the first step. Normalization is critical:

Normalization Challenges

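💡 Normalization Pipeline at Kompara

import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

def normalize_price(raw_price: str) -> Optional[float]:
    """Convert a scraped price string such as '€1.299,00' to a float."""
    # 1. Basic cleaning: keep only digits and separators
    clean = re.sub(r'[^\d.,]', '', raw_price.strip())

    # 2. Regional format detection
    if ',' in clean and '.' in clean:
        # The separator that appears last is the decimal mark
        if clean.rfind(',') > clean.rfind('.'):
            clean = clean.replace('.', '').replace(',', '.')  # 1.299,00 -> 1299.00
        else:
            clean = clean.replace(',', '')                    # 1,299.00 -> 1299.00
    elif ',' in clean:
        head, _, tail = clean.rpartition(',')
        if len(tail) == 2:
            # Two digits after the last comma: European decimal (12,99 -> 12.99)
            clean = head.replace(',', '') + '.' + tail
        else:
            # Otherwise treat commas as thousands separators (1,299 -> 1299)
            clean = clean.replace(',', '')

    # 3. Conversion
    try:
        return float(clean)
    except ValueError:
        logger.warning("Could not parse price: %r", raw_price)
        return None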

Scalability and Performance

Monitoring 10M+ Daily Prices

To operate at Kompara's scale, we need:

⚑ High Concurrency

100+ workers in parallel with async/await and event loops

🔄 Smart Scheduling

Dynamic prioritization: popular products more frequently

💾 Intelligent Caching

Redis to avoid rescraping unchanged pages

📊 Real-Time Monitoring

Success/failure metrics per worker and site
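
The high-concurrency point can be illustrated with `asyncio`. This is a minimal sketch, assuming a hypothetical `scrape_one` coroutine that stands in for a real fetch-and-parse step. The semaphore bounds how many requests are in flight at once, and the returned counts are a simplified form of the success/failure metrics mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def scrape_all(urls: List[str],
                     scrape_one: Callable[[str], Awaitable[float]],
                     max_workers: int = 100) -> Dict[str, int]:
    """Scrape all URLs with at most `max_workers` requests in flight."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(url: str) -> float:
        async with semaphore:  # waits while max_workers tasks are active
            return await scrape_one(url)

    # return_exceptions=True collects failures as results instead of
    # raising the first one, so every URL gets counted.
    results = await asyncio.gather(*(bounded(u) for u in urls),
                                   return_exceptions=True)
    failures = sum(isinstance(r, Exception) for r in results)
    return {"success": len(results) - failures, "failure": failures}


async def fake_scrape(url: str) -> float:
    """Hypothetical stand-in for a real page fetch."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise ValueError(f"no price found at {url}")
    return 9.99


if __name__ == "__main__":
    urls = ["https://a.example/p/1", "https://b.example/p/2",
            "https://c.example/broken"]
    print(asyncio.run(scrape_all(urls, fake_scrape, max_workers=2)))
```

A single semaphore limits total concurrency. Limiting requests per site would need one semaphore per domain, which this sketch leaves out.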

Change Detection and Alerts

Scraping generates value when it detects significant changes:

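class PriceChangeDetector:
    # trigger_alert, was_cheapest, is_cheapest and get_new_leader are not shown here

    def analyze_change(self, old_price, new_price, product_id):
        # Drastic change (>10% in either direction). Skipped when there is no
        # positive baseline price, because the percentage would be undefined.
        if old_price and old_price > 0:
            change_pct = ((new_price - old_price) / old_price) * 100
            if abs(change_pct) > 10:
                self.trigger_alert(
                    type='drastic_change',
                    product_id=product_id,
                    old=old_price,
                    new=new_price,
                    change_pct=change_pct
                )

        # Leadership lost: the product was the cheapest and no longer is
        if self.was_cheapest(product_id) and not self.is_cheapest(product_id):
            self.trigger_alert(
                type='leadership_lost',
                product_id=product_id,
                competitor=self.get_new_leader(product_id)
            )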

Future: Machine Learning in Scraping

The next evolution includes:

Want to see our scraping in action?

Request a demo and we'll show you how we monitor your competition in real time.
