Build a basic web scraper with pagination

To build a basic web scraper with pagination using Python and BeautifulSoup, you’ll need the following:

requests – to send HTTP requests
BeautifulSoup – to parse HTML
A target website with paginated content (e.g., blog posts, product listings, etc.)

Here’s a working example that scrapes article titles from a paginated blog-style site (you’ll need to update the URL and parsing logic for your specific use case):

python
import requests
from bs4 import BeautifulSoup
import time

def scrape_paginated_site(base_url, page_param='page', start_page=1, end_page=5, delay=1):
    all_items = []

    for page in range(start_page, end_page + 1):
        # Construct the paginated URL
        url = f"{base_url}?{page_param}={page}"
        print(f"Scraping: {url}")
        
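        try:
            # A timeout keeps one unresponsive page from hanging the whole run
            response = requests.get(url, timeout=10)
        except requests.RequestException as exc:
            print(f"Request for page {page} failed: {exc}")
            continue
        if response.status_code != 200:
            print(f"Failed to retrieve page {page}, status code: {response.status_code}")
            continue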
        
        soup = BeautifulSoup(response.content, 'html.parser')

        # Adjust this selector to your specific target HTML structure
        items = soup.select('h2.post-title a')  # Example: blog post titles
        for item in items:
            title = item.get_text(strip=True)
            link = item.get('href')
            all_items.append({'title': title, 'link': link})
        
        time.sleep(delay)  # Be polite, avoid hammering the server

    return all_items

# Example usage
base_url = 'https://example-blog.com/articles'  # Replace with real paginated URL
results = scrape_paginated_site(base_url, start_page=1, end_page=3)

for result in results:
    print(result)
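
The f-string in the function above assumes base_url has no query string of its own. Below is a minimal sketch of a more defensive way to build the paginated URL using only the standard library. The helper name build_page_url is hypothetical and not part of the script above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def build_page_url(base_url, page, page_param='page'):
    """Return base_url with page_param set to page, keeping other query parameters."""
    parts = urlsplit(base_url)
    # Keep existing parameters, dropping any earlier value of page_param
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != page_param]
    query.append((page_param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(build_page_url('https://example-blog.com/articles', 2))
print(build_page_url('https://example-blog.com/articles?tag=python', 3, page_param='p'))
```

Inside the loop, url = build_page_url(base_url, page, page_param) could stand in for the f-string if your target URLs already carry filters or sort options.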

Key Components

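Pagination: Set the query parameter name with page_param. Common names are page and p. Sites that use offset usually expect a record offset (for example 0, 20, 40) rather than a page number, so the loop would need to step by the page size instead of by 1.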
Selector: The CSS selector used in soup.select() must match the elements you want to scrape.
Politeness: A delay between requests reduces load on the server and lowers the chance of being rate-limited or blocked.
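
To check a selector before pointing the scraper at a live site, you can run it against a small HTML sample. The sketch below assumes markup shaped like the h2.post-title a example above. The SAMPLE_HTML string and the extract_items helper are made up for illustration, and urljoin is added so that relative links come out as absolute URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of a blog listing
SAMPLE_HTML = """
<div class="post"><h2 class="post-title"><a href="/articles/first-post"> First post </a></h2></div>
<div class="post"><h2 class="post-title"><a href="https://example-blog.com/articles/second-post">Second post</a></h2></div>
<div class="sidebar"><h2><a href="/about">About</a></h2></div>
"""

def extract_items(html, page_url, selector='h2.post-title a'):
    """Return a title/link dict for every element matching selector."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for anchor in soup.select(selector):
        href = anchor.get('href')
        if not href:
            continue  # skip anchors without a target
        items.append({
            'title': anchor.get_text(strip=True),
            # Resolve relative links against the page they came from
            'link': urljoin(page_url, href),
        })
    return items

items = extract_items(SAMPLE_HTML, 'https://example-blog.com/articles?page=1')
for item in items:
    print(item)
```

The sidebar heading is skipped because it lacks the post-title class, which is the kind of mismatch worth catching on a sample before running a full crawl.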

Notes

For websites that render content with JavaScript, requests only receives the initial HTML, so you’ll need a browser automation tool such as Selenium or Playwright instead of requests/BeautifulSoup.
Always check the site’s robots.txt and terms of service to ensure scraping is allowed.
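
The robots.txt check can be automated with Python's standard urllib.robotparser module. The sketch below parses a hypothetical robots.txt string so that it runs offline. The rules, the URLs, and the my-scraper user agent name are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would fetch the site's own file
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

def make_parser(robots_text):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = make_parser(SAMPLE_ROBOTS)
user_agent = 'my-scraper'  # assumed name; use whatever identifies your scraper

allowed = parser.can_fetch(user_agent, 'https://example-blog.com/articles?page=2')
blocked = parser.can_fetch(user_agent, 'https://example-blog.com/admin/login')
delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay is declared

print(allowed, blocked, delay)
```

Against a live site, you would instead create RobotFileParser with the URL of the site's robots.txt and call read() before can_fetch(). A declared crawl delay is a reasonable value to pass as delay to the scraper. This covers robots.txt only; the terms of service still need to be read by hand.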
