Web scraping is the fundamental technology that allows Kompara to monitor thousands of competitor prices 24/7 without manual intervention. In this technical article, we explore how it works, the challenges it faces, and the best practices for implementing it ethically.
What is Web Scraping for Prices?
Web scraping (also known as web harvesting or web data extraction) is the automated process of extracting structured information from websites. In the context of price monitoring, it means:
- Automated visits to competitor product pages
- Extraction of data such as price, availability, and descriptions
- Normalization and storage of information in databases
- Analysis and comparison to generate actionable insights
Web Scraping Architecture
┌─────────────┐
│  Scheduler  │  ← Schedules visits every hour
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────────┐
│         Distributed Scraper          │
│ ┌────────┐ ┌────────┐ ┌────────┐     │
│ │Worker 1│ │Worker 2│ │Worker 3│     │  ← Multiple workers
│ └────────┘ └────────┘ └────────┘     │    in parallel
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│         Parser & Normalizer          │  ← Cleans and structures
└──────────────┬───────────────────────┘    the data
               │
               ▼
┌──────────────────────────────────────┐
│        Database (PostgreSQL)         │  ← Stores history
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│        Analytics Engine + AI         │  ← Generates insights
└──────────────────────────────────────┘
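To make the "Stores history" stage concrete, here is a minimal sketch of an append-only price history. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere. The table name price_history, its columns, and the helper functions are assumptions made for illustration, not Kompara's actual schema.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema: one row per observation, never updated in place,
# so the full price history of each product is preserved.
SCHEMA = """
CREATE TABLE IF NOT EXISTS price_history (
    product_id  TEXT NOT NULL,
    competitor  TEXT NOT NULL,
    price       REAL,
    observed_at TEXT NOT NULL
)
"""

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; ':memory:' keeps everything in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def record_price(conn, product_id, competitor, price, observed_at=None) -> None:
    """Append one observation, timestamped in UTC unless a time is given."""
    ts = (observed_at or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?, ?)",
        (product_id, competitor, price, ts),
    )
    conn.commit()

def latest_price(conn, product_id, competitor) -> Optional[float]:
    """Return the most recent price for a product at one competitor."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE product_id = ? AND competitor = ? "
        "ORDER BY observed_at DESC, rowid DESC LIMIT 1",
        (product_id, competitor),
    ).fetchone()
    return row[0] if row else None
```

Appending rows instead of overwriting them is what lets the analytics stage compare today's price with earlier observations.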
Technologies and Tech Stack
Python + Scrapy
Robust framework for large-scale scraping. Handles concurrency, automatic retries, and customizable middleware.
Selenium / Playwright
For sites with heavy JavaScript. Automates real browsers to access dynamic content.
Cloud Infrastructure
AWS Lambda or Google Cloud Functions to scale workers on demand.
Rotating Proxies
Network of distributed IPs to avoid blocks and respect rate limits.
PostgreSQL + Redis
Primary database and cache for fast queries.
Apache Airflow
Pipeline orchestration and complex task scheduling.
Technical Challenges of Web Scraping
1. Detection and Anti-Bot Blocking
Modern ecommerce sites implement multiple layers of protection:
Rate Limiting
Limits the number of requests per IP per minute. Solution: Rotating proxies + intelligent delays.
User-Agent Detection
Detects scrapers by their request headers. Solution: Rotate real user agents.
CAPTCHA
Visual challenges designed to be solved only by humans. Solution: CAPTCHA-solving services + human-like behavior.
Browser Fingerprinting
Identifies unique characteristics of the client. Solution: Full browser emulation.
2. Dynamic JavaScript Content
Many modern sites load prices via JavaScript after the initial render:
// Example: Dynamically loaded price
fetch('/api/product/12345/price')
  .then(res => res.json())
  .then(data => {
    document.getElementById('price').textContent = data.price;
  });
Solution at Kompara:
- Use of headless browsers (Playwright/Puppeteer)
- Intelligent waiting until critical elements are loaded
- API request interception when possible
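When the price arrives through a JSON endpoint like the one in the snippet above, the response body can often be parsed directly instead of rendering the whole page. The sketch below assumes a payload shaped like {"price": ...}, matching the data.price field in that example. The function name and the tolerance for string values are illustrative assumptions.

```python
import json
from typing import Optional, Union

def price_from_payload(body: Union[str, bytes]) -> Optional[float]:
    """Extract the 'price' field from a JSON response body, if present."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return None  # Not JSON (for example, an HTML block page)
    if not isinstance(data, dict):
        return None
    value = data.get("price")
    if value is None or isinstance(value, bool):
        return None
    try:
        return float(value)  # Accepts 19.99 as well as "19.99"
    except (TypeError, ValueError):
        return None
```

Returning None for anything unexpected lets the caller fall back to the headless-browser path rather than store a wrong value.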
3. Variable HTML Structure
Sites change their HTML frequently, breaking selectors:
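Solution: Smart Selectors with Fallbacks
PRICE_SELECTORS = [
    'span.price-now',             # Main selector
    'div.product-price span',     # Fallback 1
    '[data-price]',               # Fallback 2
    'meta[property="og:price"]',  # Fallback 3
]

def extract_price(page):
    # page is assumed to be a Playwright (sync API) Page object.
    # Try each selector in order and use the first one that yields a value.
    for selector in PRICE_SELECTORS:
        element = page.query_selector(selector)
        if element is None:
            continue
        # Attribute selectors keep the price in an attribute, not in the text
        raw = (element.get_attribute('data-price')
               or element.get_attribute('content')
               or element.text_content())
        if raw and raw.strip():
            return normalize_price(raw)
    return None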
Ethical and Legal Web Scraping
Important Legal Considerations
Web scraping exists in a legal gray area. At Kompara, we follow these guidelines:
- We respect robots.txt and meta tags
- We implement conservative rate limiting
- We only extract publicly available data
- We do not access areas requiring login
- We monitor only public prices and availability
- We do not resell or republish scraped data
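As an illustration of the first guideline, a robots.txt check can be done with Python's standard urllib.robotparser module. This is a minimal sketch: the user agent name ExamplePriceBot, the sample rules, and the shop.example URLs are all hypothetical, and in practice the file would first be downloaded from the target site.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExamplePriceBot"  # Hypothetical crawler name

def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def is_allowed(policy: RobotFileParser, url: str) -> bool:
    """True if the rules permit our user agent to fetch this URL."""
    return policy.can_fetch(USER_AGENT, url)

# Hypothetical rules for an imaginary shop
sample = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
Crawl-delay: 5
"""

policy = build_policy(sample)
print(is_allowed(policy, "https://shop.example/products/123"))    # True
print(is_allowed(policy, "https://shop.example/account/orders"))  # False
print(policy.crawl_delay(USER_AGENT))                             # 5
```

The Crawl-delay value, when a site declares one, is a natural lower bound for the delay between requests to that site.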
Principles of Ethical Scraping
Best Practices We Implement
- Respect for bandwidth: Delays between requests (2-5 seconds minimum)
- Clear identification: Descriptive User-Agent with contact info
- Intelligent scheduling: Higher activity during server off-peak hours
- Aggressive caching: Do not repeat unnecessary requests
- Error detection: Exponential backoff on 5xx errors
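Two of these practices, the delay between requests and the exponential backoff on 5xx errors, can be sketched as follows. The function names, the 2-second base, and the 60-second cap are assumptions for illustration. The fetch argument is any callable that returns a (status, body) pair, so the logic can be exercised without touching the network.

```python
import random
import time
from typing import Callable, Tuple

def polite_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Random pause between requests, within the 2-5 second range above."""
    return random.uniform(min_s, max_s)

def backoff_delay(attempt: int, base_s: float = 2.0, cap_s: float = 60.0) -> float:
    """Exponential backoff: base, 2x base, 4x base, ... capped at cap_s."""
    return min(cap_s, base_s * (2 ** attempt))

def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(url); retry only on 5xx responses, waiting longer each time."""
    status, body = fetch(url)
    for attempt in range(max_retries):
        if not 500 <= status < 600:
            break
        sleep(backoff_delay(attempt))
        status, body = fetch(url)
    return status, body
```

Retrying only on 5xx responses matters: a 4xx response usually means the request itself is wrong or unwelcome, and repeating it only adds load.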
Data Normalization and Quality
Extracting the price is just the first step. Normalization is critical:
Normalization Challenges
- Diverse formats: "$1,234.56", "1.234,56 €", "R$ 1.234,56"
- Additional information: "Before: $100 Now: $80"
- Discounts and promotions: "50% OFF", "2x1"
- Availability: "Out of stock", "Call for price", "Pre-order"
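Before any numeric parsing, the raw text has to be reduced to the one amount that represents the current price. The sketch below shows one possible pre-filter for the cases listed above. The phrase list, the rule that the last amount is the current price, and the function name are assumptions; real sites would need per-language and per-site rules.

```python
import re
from typing import Optional

# Hypothetical phrase list; real sites need per-language, per-site entries.
NO_PRICE_PHRASES = ("out of stock", "call for price", "sold out")

NUMBER = re.compile(r"\d[\d.,]*")
MULTI_BUY = re.compile(r"\b\d+\s*x\s*\d+\b", re.IGNORECASE)  # e.g. "2x1"

def current_price_text(raw: str) -> Optional[str]:
    """Return the amount text for the current price, or None if there is none."""
    text = raw.lower()
    if any(phrase in text for phrase in NO_PRICE_PHRASES):
        return None
    text = MULTI_BUY.sub(" ", text)  # Multi-buy offers are not prices
    amounts = []
    for match in NUMBER.finditer(text):
        if text[match.end():].lstrip().startswith("%"):
            continue  # "50% OFF" is a discount, not a price
        amounts.append(match.group().rstrip(".,"))
    # Assumption: when several amounts appear, the current price comes last.
    return amounts[-1] if amounts else None
```

For "Before: $100 Now: $80" this returns "80", which can then be passed to a numeric normalizer such as normalize_price.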
Normalization Pipeline at Kompara
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

def normalize_price(raw_price: str) -> Optional[float]:
    # 1. Basic cleaning: keep only digits and separators
    clean = re.sub(r'[^\d.,]', '', raw_price.strip())

    # 2. Regional format detection
    if ',' in clean and '.' in clean:
        # The separator that appears last is the decimal separator
        if clean.rfind(',') > clean.rfind('.'):
            clean = clean.replace('.', '').replace(',', '.')  # "1.234,56"
        else:
            clean = clean.replace(',', '')  # "1,234.56"
    elif ',' in clean:
        # A single comma followed by exactly two digits is treated as a European decimal
        if clean.count(',') == 1 and len(clean.split(',')[1]) == 2:
            clean = clean.replace(',', '.')  # "12,50"
        else:
            clean = clean.replace(',', '')  # "1,234": thousands separator

    # 3. Conversion
    try:
        return float(clean)
    except ValueError:
        logger.warning(f"Could not parse: {raw_price}")
        return None
Scalability and Performance
Monitoring 10M+ Daily Prices
To operate at Kompara's scale, we need:
High Concurrency
100+ workers in parallel with async/await and event loops
Smart Scheduling
Dynamic prioritization: popular products more frequently
Intelligent Caching
Redis to avoid rescraping unchanged pages
Real-Time Monitoring
Success/failure metrics per worker and site
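The high-concurrency point can be illustrated with asyncio alone. This is a minimal sketch, assuming fetch is any coroutine function that takes a URL and returns its body. The name scrape_all and the choice to return failures as exception objects are illustrative assumptions, not a description of Kompara's workers.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable

async def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 100,
) -> Dict[str, object]:
    """Fetch every URL with at most max_concurrency requests in flight.

    Failures are returned as exception objects instead of aborting the batch,
    so success and failure can be counted per site afterwards.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str):
        async with semaphore:  # Blocks while the limit is reached
            return await fetch(url)

    url_list = list(urls)
    results = await asyncio.gather(
        *(worker(url) for url in url_list), return_exceptions=True
    )
    return dict(zip(url_list, results))
```

The semaphore caps the number of simultaneous requests regardless of batch size, which keeps the concurrency limit and the politeness rules in a single, easily tuned parameter.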
Change Detection and Alerts
Scraping generates value when it detects significant changes:
class PriceChangeDetector:
    def analyze_change(self, old_price, new_price, product_id):
        # Drastic change (>10%); skipped when old_price is zero or missing,
        # because the percentage would be undefined
        if old_price:
            change_pct = ((new_price - old_price) / old_price) * 100
            if abs(change_pct) > 10:
                self.trigger_alert(
                    type='drastic_change',
                    product_id=product_id,
                    old=old_price,
                    new=new_price,
                    change_pct=change_pct
                )

        # Leadership lost: we were the cheapest seller and no longer are
        if self.was_cheapest(product_id) and not self.is_cheapest(product_id):
            self.trigger_alert(
                type='leadership_lost',
                product_id=product_id,
                competitor=self.get_new_leader(product_id)
            )
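The detector relies on was_cheapest, is_cheapest, and get_new_leader, which are not shown. One way such checks could be backed is by comparing two snapshots of prices per seller, as in this minimal sketch. The data shape, the function names, and the seller names are hypothetical.

```python
from typing import Dict, Optional

Prices = Dict[str, float]  # seller name -> current price for one product

def cheapest_seller(prices: Prices) -> Optional[str]:
    """Return the seller with the lowest price (ties broken alphabetically)."""
    if not prices:
        return None
    return min(sorted(prices), key=prices.__getitem__)

def leadership_lost(before: Prices, after: Prices, us: str) -> Optional[str]:
    """If `us` was cheapest before and no longer is, return the new leader."""
    if cheapest_seller(before) == us and cheapest_seller(after) != us:
        return cheapest_seller(after)
    return None

before = {"our-store": 79.0, "rival-a": 82.5, "rival-b": 85.0}
after = {"our-store": 79.0, "rival-a": 74.9, "rival-b": 85.0}
print(leadership_lost(before, after, "our-store"))  # rival-a
```

Here our own price did not change at all, yet leadership was lost, which is why this check is kept separate from the percentage-change alert.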
Future: Machine Learning in Scraping
The next evolution includes:
- Auto-adaptation of selectors: ML to detect prices without manual selectors
- Anomaly detection: Automatically identify incorrect data
- Change prediction: Anticipate when a competitor will change prices
- Intelligent prioritization: Decide which products to scrape more frequently
Want to see our scraping in action?
Request a demo and we'll show you how we monitor your competition in real time.
Additional Resources
Open Source Tools:
- Scrapy Framework - Python framework for scraping
- Playwright - Browser automation
- Beautiful Soup - HTML/XML parser