How to Scrape Amazon Product Data with Python (2026 Guide)

Python is a go-to language for web scraping, and Amazon is one of the most frequently scraped sites on the web. In this hands-on tutorial, you'll learn how to extract Amazon product data with Python, from a simple first scraper to the building blocks of a production-ready extraction pipeline.

What You'll Learn

Setting up your scraping environment
Simple HTTP scraping with requests + BeautifulSoup
Handling JavaScript-rendered content with Playwright
Rotating proxies to avoid IP bans
Parsing all key product fields (title, price, rating, reviews, Best Sellers Rank)
Storing data as JSON or CSV
Common errors and how to fix them

Prerequisites

Python 3.9+
Basic Python knowledge
pip installed

Step 1 — Install Dependencies

pip install requests beautifulsoup4 lxml playwright pandas
playwright install chromium

Step 2 — Simple HTTP Scraper (Small Scale)

For low-volume scraping (under a few hundred requests), a basic requests + BeautifulSoup scraper works:

import requests
from bs4 import BeautifulSoup
import json
import time
import random

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/124.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'DNT': '1',
    'Connection': 'keep-alive',
}

def scrape_product(asin: str, marketplace: str = 'amazon.com') -> dict:
    url = f'https://www.{marketplace}/dp/{asin}'
    
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'Request failed for {asin}: {e}')
        return {}
    
    soup = BeautifulSoup(response.content, 'lxml')
    
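    # Parse fields. Note: a-price-whole holds only the integer part of the price
    # (Step 5 adds the fraction), and the first a-icon-alt span on the page
    # may not always be the product's own rating.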
    title_el = soup.find('span', {'id': 'productTitle'})
    price_el = soup.find('span', {'class': 'a-price-whole'})
    rating_el = soup.find('span', {'class': 'a-icon-alt'})
    review_el = soup.find('span', {'id': 'acrCustomerReviewText'})
    
    return {
        'asin': asin,
        'url': url,
        'title': title_el.text.strip() if title_el else None,
        'price': price_el.text.strip() if price_el else None,
        'rating': rating_el.text.strip() if rating_el else None,
        'reviews': review_el.text.strip() if review_el else None,
    }

# Usage
asins = ['B09G3HRMVB', 'B08N5WRWNW', 'B07XJ8C8F5']

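results = []
for asin in asins:
    data = scrape_product(asin)
    if data:  # scrape_product returns {} when the request fails
        results.append(data)
    # The title can be None even on a successful response (e.g. a CAPTCHA page)
    print(f'Scraped: {data.get("title") or "Failed"}')
    time.sleep(random.uniform(2, 5))  # Random delay between requests

# Save to JSON (ensure_ascii=False keeps non-ASCII titles readable)
with open('amazon_products.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)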
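
Before sending any requests, it can help to reject malformed ASINs early so they don't waste a request. The sketch below assumes an ASIN is exactly 10 uppercase letters or digits, which matches the examples above but is an assumption rather than an official specification. `is_valid_asin` and `build_product_url` are hypothetical helper names, not part of any library.

```python
import re

# Assumption: an ASIN is exactly 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r'[A-Z0-9]{10}')

def is_valid_asin(asin: str) -> bool:
    """Return True if the string matches the assumed ASIN shape."""
    return bool(ASIN_PATTERN.fullmatch(asin))

def build_product_url(asin: str, marketplace: str = 'amazon.com') -> str:
    """Build the same /dp/ URL used by scrape_product, rejecting bad ASINs."""
    asin = asin.strip().upper()
    if not is_valid_asin(asin):
        raise ValueError(f'Not a valid ASIN: {asin!r}')
    return f'https://www.{marketplace}/dp/{asin}'
```

Calling `build_product_url` inside `scrape_product` would surface typos in the input list before any network traffic happens.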

Note: This basic approach works for testing, but Amazon blocks it heavily at scale. Without proxy rotation, expect CAPTCHA pages or empty responses after roughly 50–100 requests.
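
A block page may not always come with an error status code, so it is worth checking the HTML itself before parsing. The following is a minimal heuristic sketch: the marker strings are assumptions about what a robot-check page contains and should be verified against a blocked response you actually receive. `looks_blocked` is a hypothetical helper name.

```python
# Assumption: these substrings appear on Amazon's robot-check page.
# Verify them against a real blocked response before relying on them.
CAPTCHA_MARKERS = (
    'validatecaptcha',
    'enter the characters you see below',
    'api-services-support@amazon.com',
)

def looks_blocked(html: str) -> bool:
    """Heuristic: True if the HTML is empty or contains a known block marker."""
    if not html or not html.strip():
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

In `scrape_product`, a check such as `if looks_blocked(response.text): return {}` before building the soup would keep block pages out of the results.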

Step 3 — Handling JavaScript Content with Playwright

Many Amazon pages load pricing and availability via JavaScript after page load. For these, you need a real browser:

from playwright.sync_api import sync_playwright

def scrape_with_playwright(asin: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox']
        )
        
        context = browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/124.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080},
            locale='en-US',
        )
        
        page = context.new_page()
        
        # Block unnecessary resources for speed
        page.route('**/*.{png,jpg,gif,svg,woff,woff2}', 
                   lambda route: route.abort())
        
        page.goto(f'https://www.amazon.com/dp/{asin}', 
                  wait_until='domcontentloaded')
        
        # Wait for price element
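        try:
            page.wait_for_selector('.a-price-whole', timeout=5000)
        except Exception:  # typically Playwright's TimeoutError; avoids a bare except
            pass  # Price might not exist (e.g. the item is unavailable)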
        
        title = page.query_selector('#productTitle')
        price = page.query_selector('.a-price-whole')
        rating = page.query_selector('.a-icon-alt')
        
        result = {
            'asin': asin,
            'title': title.inner_text().strip() if title else None,
            'price': price.inner_text().strip() if price else None,
            'rating': rating.inner_text().strip() if rating else None,
        }
        
        browser.close()
        return result
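
The glob pattern above matches on the URL suffix, so an image whose URL ends in a query string may slip through. One alternative is to decide based on the request's resource type. This is a sketch, not the only way to do it: `BLOCKED_TYPES`, `should_block`, and `handle_route` are hypothetical names, and the set of blocked types is an assumption you can adjust.

```python
# Resource types to skip; names follow Playwright's request.resource_type values.
BLOCKED_TYPES = {'image', 'media', 'font'}

def should_block(resource_type: str) -> bool:
    """Return True for resource types we do not need for data extraction."""
    return resource_type in BLOCKED_TYPES

def handle_route(route) -> None:
    """Route handler: abort blocked resource types, let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()

# Usage inside scrape_with_playwright, in place of the glob-based route:
#     page.route('**/*', handle_route)
```

Stylesheets and scripts are left alone here because blocking them can change how the page renders.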

Step 4 — Proxy Rotation (Essential for Scale)

Without proxy rotation, Amazon blocks you after 50–100 requests. Here's a simple proxy rotator:

import requests
from itertools import cycle
from typing import Optional

# HEADERS is the dict defined in Step 2

# Residential proxies work best (datacenter proxies get blocked faster)
PROXIES = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]

proxy_pool = cycle(PROXIES)

def get_with_proxy(url: str, retries: int = 3) -> Optional[requests.Response]:
    # Optional[...] keeps this compatible with Python 3.9 ("X | None" needs 3.10+)
    for attempt in range(retries):
        proxy = next(proxy_pool)  # round-robin through the proxy list
        try:
            response = requests.get(
                url,
                headers=HEADERS,
                proxies={'http': proxy, 'https': proxy},
                timeout=15,
            )
        except requests.RequestException as e:
            print(f'Proxy failed: {e}')
            continue
        if response.status_code == 200:
            return response
        if response.status_code == 503:
            print(f'Got CAPTCHA on attempt {attempt + 1}, rotating proxy...')
        else:
            print(f'HTTP {response.status_code} on attempt {attempt + 1}, rotating proxy...')
    return None
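
A plain round-robin keeps handing out proxies that have already stopped working. One possible refinement, sketched below with only the standard library, is to count failures per proxy and stop using any proxy that fails too often. `ProxyPool` and its method names are hypothetical, and the `max_failures` threshold of 3 is an arbitrary assumption.

```python
import random
from typing import Dict, List, Optional

class ProxyPool:
    """Hypothetical helper: picks proxies at random and drops ones that keep failing."""

    def __init__(self, proxies: List[str], max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {p: 0 for p in proxies}

    def get(self) -> Optional[str]:
        """Return a random healthy proxy, or None if every proxy has been dropped."""
        healthy = [p for p, n in self.failures.items() if n < self.max_failures]
        return random.choice(healthy) if healthy else None

    def mark_failure(self, proxy: str) -> None:
        """Record one failed request for this proxy."""
        self.failures[proxy] = self.failures.get(proxy, 0) + 1

    def mark_success(self, proxy: str) -> None:
        """Reset the failure count after a successful request."""
        self.failures[proxy] = 0
```

In `get_with_proxy`, `pool.get()` would replace `next(proxy_pool)`, with `mark_failure` called on exceptions or 503 responses and `mark_success` on a 200.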

Step 5 — Parsing All Key Fields

Here's a comprehensive parser for all the major product fields:

import re
from typing import Optional

from bs4 import BeautifulSoup

def parse_product_page(soup: BeautifulSoup, asin: str) -> dict:
    def text(attr_id: Optional[str] = None,
             attr_class: Optional[str] = None) -> Optional[str]:
        """Safe text extractor: look up by id if given, otherwise by class."""
        if attr_id:
            el = soup.find(id=attr_id)
        elif attr_class:
            el = soup.find(class_=attr_class)
        else:
            return None
        return el.get_text(strip=True) if el else None

    # Price: combine whole + fraction (the whole part typically ends with the decimal point)
    price_whole = text(attr_class='a-price-whole')
    price_frac = text(attr_class='a-price-fraction')
    price = f"{price_whole}{price_frac or ''}" if price_whole else None

    # Main image: guard against a missing wrapper or a missing <img> tag
    img_wrapper = soup.find('div', {'id': 'imgTagWrapperId'})
    img_el = img_wrapper.find('img') if img_wrapper else None
    img_url = img_el.get('src') if img_el else None

    # BSR: text of the span that follows the "Best Sellers Rank" label, if present
    bsr_el = soup.find('span', string=re.compile(r'Best Sellers Rank'))
    bsr_next = bsr_el.find_next('span') if bsr_el else None
    bsr_txt = bsr_next.get_text(strip=True) if bsr_next else None

    return {
        'asin':         asin,
        'title':        text(attr_id='productTitle'),
        'brand':        text(attr_id='bylineInfo'),
        'price':        price,
        'rating':       text(attr_class='a-icon-alt'),
        'review_count': text(attr_id='acrCustomerReviewText'),
        'bsr':          bsr_txt,
        'availability': text(attr_id='availability'),
        'main_image':   img_url,
    }
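
The parser returns every field as a raw string, which makes sorting and filtering awkward later. The sketch below converts the most common fields to numbers. It assumes US-style formatting and input shapes such as `'4.5 out of 5 stars'` and `'12,345 ratings'`; other marketplaces and layouts may format these differently. The function names are hypothetical.

```python
import re
from typing import Optional

def parse_price(raw: Optional[str]) -> Optional[float]:
    """Extract a number from a price string. Assumes ',' thousands and '.' decimals."""
    if not raw:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', raw)
    return float(match.group().replace(',', '')) if match else None

def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Extract the leading number from text shaped like 'N out of 5 stars'."""
    if not raw:
        return None
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+out of', raw)
    return float(match.group(1)) if match else None

def parse_count(raw: Optional[str]) -> Optional[int]:
    """Extract the first integer, ignoring thousands separators."""
    if not raw:
        return None
    match = re.search(r'\d[\d,]*', raw)
    return int(match.group().replace(',', '')) if match else None
```

`parse_count` can also be applied to a rank string that starts with a number, such as the `bsr` field, as long as the rank is the first number in the text.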

Step 6 — Save as CSV

import pandas as pd

# Assuming results is a list of product dicts
df = pd.DataFrame(results)
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')
print(f'Saved {len(df)} products to amazon_products.csv')
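
If you would rather not pull in pandas just to write a file, the standard library's `csv` module can do the same job. The sketch below is one way to handle two things that come up with scraped data: empty dicts from failed scrapes and rows with differing keys. `write_products_csv` is a hypothetical helper name.

```python
import csv
from typing import Dict, Iterable, List, TextIO

def write_products_csv(rows: Iterable[Dict], out: TextIO) -> int:
    """Write non-empty product dicts to CSV; return the number of rows written."""
    kept = [r for r in rows if r]  # drop {} left behind by failed scrapes
    # Union of keys in first-seen order, so rows with differing fields still line up.
    fieldnames: List[str] = []
    for row in kept:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(kept)  # missing keys are written as empty cells
    return len(kept)

# Usage:
# with open('amazon_products.csv', 'w', newline='', encoding='utf-8') as f:
#     write_products_csv(results, f)
```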

Common Errors and Fixes

Error	Cause	Fix
`503 Service Unavailable`	IP blocked / CAPTCHA	Rotate proxy, add delays
Empty `title` element	JavaScript-rendered page	Use Playwright instead of requests
`ConnectionError`	Proxy failed	Add retry logic with fallback proxies
Price returns `None`	Different price selector	Check for `.a-offscreen` as fallback
Inconsistent data	Layout A/B test by Amazon	Use multiple selector fallbacks
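
The last two fixes in the table both come down to trying several selectors in order. A minimal sketch of that idea follows. It relies on BeautifulSoup's `select_one` and `get_text` methods; `first_text` and `PRICE_SELECTORS` are hypothetical names, and the selector list is an assumption to adapt to the layouts you actually encounter.

```python
from typing import Iterable, Optional

# Assumption: candidate selectors, tried in order. '.a-offscreen' is the
# fallback named in the table above.
PRICE_SELECTORS = ('.a-price-whole', '.a-offscreen')

def first_text(soup, selectors: Iterable[str]) -> Optional[str]:
    """Return the stripped text of the first selector that yields non-empty text."""
    for selector in selectors:
        el = soup.select_one(selector)
        if el is not None:
            value = el.get_text(strip=True)
            if value:
                return value
    return None
```

With a parsed page in `soup`, `first_text(soup, PRICE_SELECTORS)` returns the first price-like text it finds, or `None` if no selector matches.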

Success Rate Expectations

Approach	Expected Success Rate	Good For
`requests` alone	20–40%	Testing only
`requests` + headers	40–60%	Very small scale
`requests` + proxy rotation	70–85%	Small–medium projects
`Playwright` + proxies	85–95%	Medium projects
Professional service	98–99.5%	Production / enterprise
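
The ranges above are rough expectations, so it is worth measuring the rate you actually get. A minimal sketch, assuming a scrape counts as successful when a required field (here `title`) is present and non-empty; `success_rate` is a hypothetical helper name.

```python
from typing import Dict, List

def success_rate(results: List[Dict], required: str = 'title') -> float:
    """Fraction of results whose required field is present and non-empty."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if r.get(required))
    return ok / len(results)
```

For this to be meaningful, failed scrapes need to stay in the list (as `{}` or with `None` fields) rather than being filtered out before the calculation.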

When to Use a Professional Service Instead

Building and maintaining a Python scraper becomes impractical when:

You need consistent 98%+ success rates (Amazon changes layout frequently)
You're scraping millions of records per month
You need data from multiple marketplaces simultaneously
You want automatic maintenance when Amazon changes its structure
Your team doesn't have scraping infrastructure expertise

At that point, the engineering cost of maintaining your own scraper exceeds the cost of a managed service.
