How to Scrape Amazon Product Data with Python (2026 Guide)

Step-by-step tutorial on scraping Amazon product data using Python. Covers requests, BeautifulSoup, Playwright, proxy rotation, and how to handle anti-bot measures at scale.

Amazon Scraping Team · 6 min read

Python is the go-to language for web scraping, and Amazon is one of the most frequently scraped sites on the web. In this hands-on tutorial, you'll learn how to extract Amazon product data with Python, from a simple first scraper to a production-ready extraction pipeline.

What You'll Learn

  • Setting up your scraping environment
  • Simple HTTP scraping with requests + BeautifulSoup
  • Handling JavaScript-rendered content with Playwright
  • Rotating proxies to avoid IP bans
  • Parsing all key product fields (title, price, rating, reviews, BSR)
  • Storing data as JSON or CSV
  • Common errors and how to fix them

Prerequisites

  • Python 3.9+
  • Basic Python knowledge
  • pip installed

Step 1 — Install Dependencies

pip install requests beautifulsoup4 lxml playwright pandas
playwright install chromium

Step 2 — Simple HTTP Scraper (Small Scale)

For low-volume scraping (a few dozen requests), a basic requests + BeautifulSoup scraper works:

import requests
from bs4 import BeautifulSoup
import json
import time
import random

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/124.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',  # add 'br' only if the brotli package is installed
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'DNT': '1',
    'Connection': 'keep-alive',
}

def scrape_product(asin: str, marketplace: str = 'amazon.com') -> dict:
    url = f'https://www.{marketplace}/dp/{asin}'
    
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'Request failed for {asin}: {e}')
        return {}
    
    soup = BeautifulSoup(response.content, 'lxml')
    
    # Parse fields
    title_el = soup.find('span', {'id': 'productTitle'})
    price_el = soup.find('span', {'class': 'a-price-whole'})
    rating_el = soup.find('span', {'class': 'a-icon-alt'})
    review_el = soup.find('span', {'id': 'acrCustomerReviewText'})
    
    return {
        'asin': asin,
        'url': url,
        'title': title_el.text.strip() if title_el else None,
        'price': price_el.text.strip() if price_el else None,
        'rating': rating_el.text.strip() if rating_el else None,
        'reviews': review_el.text.strip() if review_el else None,
    }

# Usage
asins = ['B09G3HRMVB', 'B08N5WRWNW', 'B07XJ8C8F5']

results = []
for asin in asins:
    data = scrape_product(asin)
    results.append(data)
    print(f'Scraped: {data.get("title") or "Failed"}')  # also covers a missing title
    time.sleep(random.uniform(2, 5))  # Random delay!

# Save to JSON
with open('amazon_products.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

Note: This basic approach works for testing, but Amazon blocks it heavily at scale. Without proxy rotation, expect CAPTCHA pages or empty responses after roughly 50–100 requests.
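
One way to notice the CAPTCHA pages and empty responses mentioned in the note is to inspect the response body before parsing it. The sketch below is a heuristic only: the marker strings and the 2,000-character threshold are assumptions that need tuning against the pages you actually receive, and `looks_blocked` is a hypothetical helper name.

```python
# Heuristic markers for a block/CAPTCHA page. These strings are assumptions
# based on commonly reported pages, not a documented contract.
BLOCK_MARKERS = (
    'enter the characters you see below',
    'validatecaptcha',
    'automated access',
)

def looks_blocked(html: str, min_length: int = 2000) -> bool:
    """Return True if the body is suspiciously short or contains a block marker."""
    if not html or len(html.strip()) < min_length:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

In `scrape_product`, a check such as `if looks_blocked(response.text): return {}` before building the soup would keep block pages out of the results.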

Step 3 — Handling JavaScript Content with Playwright

Many Amazon pages load pricing and availability via JavaScript after page load. For these, you need a real browser:

from playwright.sync_api import sync_playwright
import json

def scrape_with_playwright(asin: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox']
        )
        
        context = browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/124.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080},
            locale='en-US',
        )
        
        page = context.new_page()
        
        # Block unnecessary resources for speed
        page.route('**/*.{png,jpg,gif,svg,woff,woff2}', 
                   lambda route: route.abort())
        
        page.goto(f'https://www.amazon.com/dp/{asin}', 
                  wait_until='domcontentloaded')
        
        # Wait for price element
        try:
            page.wait_for_selector('.a-price-whole', timeout=5000)
        except Exception:  # typically Playwright's TimeoutError; a bare except would also swallow Ctrl+C
            pass  # Price might not exist
        
        title = page.query_selector('#productTitle')
        price = page.query_selector('.a-price-whole')
        rating = page.query_selector('.a-icon-alt')
        
        result = {
            'asin': asin,
            'title': title.inner_text().strip() if title else None,
            'price': price.inner_text().strip() if price else None,
            'rating': rating.inner_text().strip() if rating else None,
        }
        
        browser.close()
        return result
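
The function above launches a new browser for every product, so one option is to try the plain HTTP scraper first and fall back to Playwright only when key fields come back empty. The sketch below assumes `scrape_product` and `scrape_with_playwright` from this tutorial are passed in as callables; `scrape_with_fallback`, `missing_fields`, and the `required` field list are hypothetical.

```python
from typing import Callable, Dict, Iterable

def missing_fields(record: Dict, required: Iterable[str]) -> list:
    """List the required keys that are absent or empty in a scraped record."""
    return [key for key in required if not record.get(key)]

def scrape_with_fallback(
    asin: str,
    fast: Callable[[str], Dict],
    slow: Callable[[str], Dict],
    required: Iterable[str] = ('title', 'price'),
) -> Dict:
    """Use the cheap scraper first; call the browser-based one only if needed."""
    required = tuple(required)
    record = fast(asin) or {}
    if not missing_fields(record, required):
        return record
    fallback = slow(asin) or {}
    # Start from the fallback result, then keep any non-empty values
    # the fast pass already found.
    merged = dict(fallback)
    merged.update({key: value for key, value in record.items() if value})
    return merged
```

A call would look like `scrape_with_fallback('B09G3HRMVB', scrape_product, scrape_with_playwright)`.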

Step 4 — Proxy Rotation (Essential for Scale)

Without proxy rotation, Amazon blocks you after 50–100 requests. Here's a simple proxy rotator:

import requests
from itertools import cycle
from typing import Optional

# Residential proxies work best (datacenter proxies get blocked faster)
PROXIES = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]

proxy_pool = cycle(PROXIES)

# HEADERS is the dict defined in Step 2.
# Optional[...] keeps the annotation valid on Python 3.9 (the `X | None` form needs 3.10+).
def get_with_proxy(url: str, retries: int = 3) -> Optional[requests.Response]:
    """Try up to `retries` proxies in turn; return the first 200 response, else None."""
    for attempt in range(retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                headers=HEADERS,
                proxies={'http': proxy, 'https': proxy},
                timeout=15,
            )
            if response.status_code == 200:
                return response
            if response.status_code == 503:
                print(f'Got 503 (blocked or CAPTCHA) on attempt {attempt + 1}, rotating proxy...')
            else:
                print(f'Got HTTP {response.status_code} on attempt {attempt + 1}, rotating proxy...')
        except requests.RequestException as e:
            print(f'Proxy failed: {e}')
    return None
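
A plain `cycle` keeps handing out a proxy even after it has failed repeatedly. A possible refinement is to count failures per proxy and stop using one once it crosses a threshold. The class below is a hypothetical sketch (`ProxyPool`, `max_failures`, and the method names are not from any library); it only does the bookkeeping and leaves the actual request to code like `get_with_proxy`.

```python
import random
from typing import Dict, List, Optional

class ProxyPool:
    """Track failures per proxy and skip proxies that fail too often."""

    def __init__(self, proxies: List[str], max_failures: int = 3):
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {proxy: 0 for proxy in proxies}

    def healthy(self) -> List[str]:
        """Proxies whose failure count is still below the threshold."""
        return [p for p, n in self.failures.items() if n < self.max_failures]

    def get(self) -> Optional[str]:
        """Pick a random healthy proxy, or None if all are exhausted."""
        candidates = self.healthy()
        return random.choice(candidates) if candidates else None

    def report_failure(self, proxy: str) -> None:
        self.failures[proxy] = self.failures.get(proxy, 0) + 1

    def report_success(self, proxy: str) -> None:
        # A successful request resets the counter for that proxy.
        self.failures[proxy] = 0
```

The caller would ask `pool.get()` for a proxy before each request, then call `report_failure` or `report_success` depending on the outcome, and stop (or refill the pool) when `get()` returns `None`.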

Step 5 — Parsing All Key Fields

Here's a comprehensive parser for all the major product fields:

import re
from bs4 import BeautifulSoup

def parse_product_page(soup: BeautifulSoup, asin: str) -> dict:
    def text(attr_id=None, attr_class=None):
        """Return the stripped text of the first element matching the id or class, else None."""
        if attr_id:
            el = soup.find(attrs={'id': attr_id})
        else:
            el = soup.find(class_=attr_class)
        return el.get_text(strip=True) if el else None

    # Price: combine whole + fraction (fraction may be missing)
    price_whole = text(attr_class='a-price-whole')
    price_frac  = text(attr_class='a-price-fraction')
    price = f"{price_whole}{price_frac or ''}" if price_whole else None

    # Main image (guard against a wrapper without an <img> or without src)
    img_wrapper = soup.find('div', {'id': 'imgTagWrapperId'})
    img_el  = img_wrapper.find('img') if img_wrapper else None
    img_url = img_el.get('src') if img_el else None

    # BSR: take the text of the span that follows the "Best Sellers Rank" label
    bsr_el   = soup.find('span', string=re.compile(r'Best Sellers Rank'))
    bsr_next = bsr_el.find_next('span') if bsr_el else None
    bsr_txt  = bsr_next.get_text(strip=True) if bsr_next else None

    return {
        'asin':         asin,
        'title':        text(attr_id='productTitle'),
        'brand':        text(attr_id='bylineInfo'),
        'price':        price,
        'rating':       text(attr_class='a-icon-alt'),
        'review_count': text(attr_id='acrCustomerReviewText'),
        'bsr':          bsr_txt,
        'availability': text(attr_id='availability'),
        'main_image':   img_url,
    }
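
The parser returns raw strings, such as a rating sentence or a review count with thousands separators. Before analysis it is usually worth converting them to numbers. The helpers below are a sketch that assumes US-style formatting (`4.5 out of 5 stars`, `1,234 ratings`, `1,299.99`); other marketplaces format numbers differently and would need their own rules. The function names are hypothetical.

```python
import re
from typing import Optional

def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Extract the leading number from text like '4.5 out of 5 stars'."""
    match = re.search(r'(\d+(?:\.\d+)?)\s+out of', raw or '')
    return float(match.group(1)) if match else None

def parse_count(raw: Optional[str]) -> Optional[int]:
    """Extract an integer from text like '1,234 ratings'."""
    match = re.search(r'\d[\d,]*', raw or '')
    return int(match.group(0).replace(',', '')) if match else None

def parse_price(raw: Optional[str]) -> Optional[float]:
    """Extract a float from text like '$1,299.99' or '1,299.99'."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', raw or '')
    return float(match.group(0).replace(',', '')) if match else None
```

Each helper returns `None` for missing or unparseable input, so they can be applied directly to the dict returned by `parse_product_page`.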

Step 6 — Save as CSV

import pandas as pd

# Assuming results is a list of product dicts
df = pd.DataFrame(results)
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')
print(f'Saved {len(df)} products to amazon_products.csv')

Common Errors and Fixes

Error                   | Cause                     | Fix
503 Service Unavailable | IP blocked / CAPTCHA      | Rotate proxy, add delays
Empty title element     | JavaScript-rendered page  | Use Playwright instead of requests
ConnectionError         | Proxy failed              | Add retry logic with fallback proxies
Price returns None      | Different price selector  | Check for .a-offscreen as fallback
Inconsistent data       | Layout A/B test by Amazon | Use multiple selector fallbacks
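
The last two fixes in the table both come down to trying several selectors in order. A minimal sketch, assuming `soup` is a BeautifulSoup object (anything with a `select_one` method works) and treating the selector list as an example to adapt rather than a complete inventory of Amazon's layouts; `first_text` and `PRICE_SELECTORS` are hypothetical names:

```python
from typing import Iterable, Optional

# Example candidates only; '.a-offscreen' is the fallback suggested in the table.
PRICE_SELECTORS = (
    '.a-price .a-offscreen',
    '.a-price-whole',
)

def first_text(soup, selectors: Iterable[str]) -> Optional[str]:
    """Return the stripped text of the first selector that matches and is non-empty."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is None:
            continue
        value = element.get_text(strip=True)
        if value:
            return value
    return None
```

With this in place, `price = first_text(soup, PRICE_SELECTORS)` tries each layout in turn, and new selectors can be appended to the tuple when Amazon changes the page.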

Success Rate Expectations

Approach                  | Expected Success Rate | Good For
requests alone            | 20–40%                | Testing only
requests + headers        | 40–60%                | Very small scale
requests + proxy rotation | 70–85%                | Small–medium projects
Playwright + proxies      | 85–95%                | Medium projects
Professional service      | 98–99.5%              | Production / enterprise
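
These ranges are rough expectations, so it is worth measuring the rate on your own runs. One simple definition, assumed here, counts a record as successful when every required field is present and non-empty; `success_rate` and its `required` default are hypothetical.

```python
from typing import Dict, Iterable, List

def success_rate(results: List[Dict], required: Iterable[str] = ('title',)) -> float:
    """Fraction of records (0.0 to 1.0) in which all required fields are non-empty."""
    required = tuple(required)
    if not results:
        return 0.0
    ok = sum(1 for record in results if all(record.get(key) for key in required))
    return ok / len(results)
```

For the `results` list built in Step 2, `print(f'{success_rate(results):.0%}')` reports the share of products that came back with a title.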

When to Use a Professional Service Instead

Building and maintaining a Python scraper becomes impractical when:

  • You need consistent 98%+ success rates (Amazon changes layout frequently)
  • You're scraping millions of records per month
  • You need data from multiple marketplaces simultaneously
  • You want automatic maintenance when Amazon changes its structure
  • Your team doesn't have scraping infrastructure expertise

At that point, the engineering cost of maintaining your own scraper exceeds the cost of a managed service.

Get a free quote and we'll assess your requirements — including a sample extraction to demonstrate output quality.

Amazon Scraping Team · Data Extraction Specialists · 10+ Years Experience

Our team of senior data engineers and web scraping specialists has delivered over 500 million records across 12+ Amazon marketplaces. We write about scraping techniques, eCommerce data strategy, and Amazon market intelligence based on real-world project experience.