Build a basic web scraper with pagination

To build a basic web scraper with pagination using Python and BeautifulSoup, you’ll need the following:

requests – to send HTTP requests
BeautifulSoup – to parse HTML
A target website with paginated content (e.g., blog posts, product listings, etc.)

Here’s a working example that scrapes article titles from a paginated blog-style site (you’ll need to update the URL and parsing logic for your specific use case):

python
import requests
from bs4 import BeautifulSoup
import time

def scrape_paginated_site(base_url, page_param='page', start_page=1, end_page=5, delay=1):
    all_items = []

    for page in range(start_page, end_page + 1):
        # Construct the paginated URL
        url = f"{base_url}?{page_param}={page}"
        print(f"Scraping: {url}")
        
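        try:
            # A timeout keeps one stalled request from hanging the whole run
            response = requests.get(url, timeout=10)
        except requests.RequestException as exc:
            print(f"Request for page {page} failed: {exc}")
            continue
        if response.status_code != 200:
            print(f"Failed to retrieve page {page}, status code: {response.status_code}")
            continue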
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Adjust this selector to your specific target HTML structure
        items = soup.select('h2.post-title a')  # Example: blog post titles
        for item in items:
            title = item.get_text(strip=True)
            link = item.get('href')
            all_items.append({'title': title, 'link': link})
        
        time.sleep(delay)  # Be polite, avoid hammering the server

    return all_items

# Example usage
base_url = 'https://example-blog.com/articles'  # Replace with real paginated URL
results = scrape_paginated_site(base_url, start_page=1, end_page=3)

for result in results:
    print(result)
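
Before pointing the scraper at a live site, it can help to check the selector against a small HTML snippet. The sketch below is an illustration only: the SAMPLE_HTML markup and the extract_items helper are hypothetical, and the selector is the same example one used above. It also resolves links with urljoin, on the assumption that some href values on the target site are relative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of the target site
SAMPLE_HTML = """
<div class="posts">
  <h2 class="post-title"><a href="/articles/first-post"> First post </a></h2>
  <h2 class="post-title"><a href="https://example-blog.com/articles/second-post">Second post</a></h2>
  <h2 class="sidebar-title"><a href="/about">About</a></h2>
</div>
"""


def extract_items(html, page_url, selector='h2.post-title a'):
    """Return title/link dicts for every element matching the selector."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for anchor in soup.select(selector):
        items.append({
            'title': anchor.get_text(strip=True),
            # urljoin leaves absolute URLs alone and resolves relative ones
            'link': urljoin(page_url, anchor.get('href', '')),
        })
    return items


items = extract_items(SAMPLE_HTML, 'https://example-blog.com/articles?page=1')
for item in items:
    print(item)
```

Only the two post-title links should be returned here; the sidebar link does not match the selector. If the list comes back empty for your real page, the selector is the first thing to revisit.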

Key Components

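Pagination: Controlled by page_param. Common names include page and p. Some sites use offset instead, which counts items rather than pages, so the loop would need to step by the page size instead of by 1.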
Selector: The CSS selector used in soup.select() must match the elements you want to scrape.
Politeness: A delay between requests reduces load on the server and lowers the chance of being rate-limited or blocked.
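
The pagination point can be sketched with the standard library alone. The helper names build_page_url and page_values are hypothetical, and the page size of 20 is an assumption for illustration. The idea is to set the pagination parameter without clobbering any query string already on the base URL, and to produce item offsets instead of page numbers when a site paginates by offset.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page_param, value):
    """Set page_param=value on base_url, keeping any other query parameters."""
    parts = urlsplit(base_url)
    # Drop any existing value for page_param, then append the new one
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != page_param]
    query.append((page_param, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_values(start_page, end_page, page_size=None):
    """Yield page numbers, or item offsets when page_size is given."""
    for page in range(start_page, end_page + 1):
        # Offset-style pagination counts items, so page 1 starts at offset 0
        yield page if page_size is None else (page - 1) * page_size


# Page-number style: page=1, page=2, page=3 (existing tag parameter is kept)
page_urls = [build_page_url('https://example-blog.com/articles?tag=python', 'page', v)
             for v in page_values(1, 3)]

# Offset style, assuming 20 items per page: offset=0, offset=20, offset=40
offset_urls = [build_page_url('https://example-blog.com/articles', 'offset', v)
               for v in page_values(1, 3, page_size=20)]

print(page_urls)
print(offset_urls)
```

Building the URL this way also avoids producing a second "?" when the base URL already carries a query string, which plain string concatenation would do.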

Notes

For websites that render their content with JavaScript, requests only sees the initial HTML, so you’ll need a browser automation tool such as Selenium or Playwright to load the page before parsing it.
Always check the site’s robots.txt and terms of service to ensure scraping is allowed.
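
The robots.txt check can be automated with Python's built-in urllib.robotparser. The sketch below parses a hypothetical robots.txt held in a string so that it runs without network access; the rules, the make_robot_parser helper, and the my-scraper user agent name are all assumptions for illustration. It does not cover the terms of service, which still need to be read by hand.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would fetch it from the site
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""


def make_robot_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_robot_parser(SAMPLE_ROBOTS_TXT)
user_agent = 'my-scraper'  # Hypothetical user agent name

# Check individual URLs before requesting them
print(parser.can_fetch(user_agent, 'https://example-blog.com/articles?page=1'))
print(parser.can_fetch(user_agent, 'https://example-blog.com/admin/login'))

# Crawl-delay, when present, is a reasonable lower bound for the delay argument
print(parser.crawl_delay(user_agent))
```

With these sample rules the article URL is allowed, the /admin/ URL is not, and the crawl delay is 2 seconds. Against a live site, calling set_url() with the site's robots.txt address followed by read() loads the real rules instead of the sample string.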

