Categories We Write About

Build a basic web scraper with pagination

To build a basic web scraper with pagination using Python and BeautifulSoup, you’ll need the following tools:

  • requests – to send HTTP requests

  • BeautifulSoup – to parse HTML

  • A target website with paginated content (e.g., blog posts, product listings, etc.)

Here’s a working example that scrapes article titles from a paginated blog-style site (you’ll need to update the URL and parsing logic for your specific use case):

python
import time

import requests
from bs4 import BeautifulSoup


def scrape_paginated_site(base_url, page_param='page', start_page=1, end_page=5, delay=1):
    all_items = []

    for page in range(start_page, end_page + 1):
        # Construct the paginated URL
        url = f"{base_url}?{page_param}={page}"
        print(f"Scraping: {url}")

        # A timeout keeps the scraper from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            print(f"Failed to retrieve page {page}, status code: {response.status_code}")
            continue

        soup = BeautifulSoup(response.content, 'html.parser')

        # Adjust this selector to your specific target HTML structure
        items = soup.select('h2.post-title a')  # Example: blog post titles
        for item in items:
            title = item.get_text(strip=True)
            link = item.get('href')
            all_items.append({'title': title, 'link': link})

        time.sleep(delay)  # Be polite, avoid hammering the server

    return all_items


# Example usage
base_url = 'https://example-blog.com/articles'  # Replace with real paginated URL
results = scrape_paginated_site(base_url, start_page=1, end_page=3)

for result in results:
    print(result)
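The example above scrapes a fixed range of pages. When the last page number is not known in advance, one common alternative is to keep going until a page returns no items. The following is a minimal sketch of that idea, not part of the original script: `iterate_until_empty`, `fetch_page`, `max_pages`, and the fake data are hypothetical names introduced here for illustration, and `fetch_page` stands in for whatever function requests and parses a single page.

```python
from typing import Callable, Dict, Iterator, List

Item = Dict[str, str]


def iterate_until_empty(
    fetch_page: Callable[[int], List[Item]],
    start_page: int = 1,
    max_pages: int = 50,
) -> Iterator[Item]:
    """Yield items page by page until a page comes back empty.

    fetch_page is a hypothetical callable: it takes a page number and
    returns the items parsed from that page (an empty list if none).
    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    for page in range(start_page, start_page + max_pages):
        items = fetch_page(page)
        if not items:
            break  # an empty page is treated as the end of the listing
        yield from items


# Stand-in for a real fetcher: three pages of fake data, then nothing.
FAKE_SITE = {
    1: [{"title": "A", "link": "/a"}, {"title": "B", "link": "/b"}],
    2: [{"title": "C", "link": "/c"}],
    3: [{"title": "D", "link": "/d"}],
}


def fake_fetch(page: int) -> List[Item]:
    return FAKE_SITE.get(page, [])


titles = [item["title"] for item in iterate_until_empty(fake_fetch)]
print(titles)
```

In a real scraper, `fetch_page` would wrap the request, the parsing step, and the delay between requests. Some sites repeat the last page or return an error page instead of an empty one, so the stopping rule may need adjusting.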

Key Components

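  • Pagination: Controlled by page_param; common names include page, p, or offset. An offset parameter usually counts items rather than pages, so the value has to advance by the page size instead of by 1.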

  • Selector: The CSS selector used in soup.select() must match the elements you want to scrape.

  • Politeness: A delay between requests reduces load on the server and lowers the chance of getting blocked.
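To illustrate the pagination point, here is a hedged sketch of a URL builder that handles both page-number and offset-style parameters using only the standard library. `build_page_url` and its `page_size` argument are hypothetical names introduced for this example. Unlike plain string concatenation with `?`, it also preserves any query parameters already present on the base URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page, page_param="page", page_size=None):
    """Return base_url with the pagination parameter set for the given page.

    If page_size is given, the parameter is treated as an item offset
    (page 1 -> 0, page 2 -> page_size, ...); otherwise it is the page number.
    Existing query parameters on base_url are preserved.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    value = page if page_size is None else (page - 1) * page_size
    query[page_param] = str(value)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Page-number style
print(build_page_url("https://example-blog.com/articles", 2))
# Base URL that already has a query string, with a different parameter name
print(build_page_url("https://example-blog.com/articles?tag=python", 3, page_param="p"))
# Offset style, assuming 20 items per page
print(build_page_url("https://example-blog.com/articles", 3, page_param="offset", page_size=20))
```

Whether a given site counts offsets from 0 and how many items it shows per page are assumptions to verify against the real site.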

Notes

  • For websites that render content with JavaScript, requests only retrieves the initial HTML, so you'll need a browser automation tool such as Selenium or Playwright instead of requests/BeautifulSoup.

  • Always check the site’s robots.txt and terms of service to ensure scraping is allowed.
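The robots.txt check can be partly automated with Python's standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt from a string so it runs without network access; the sample rules and the `my-scraper` user agent are assumptions for illustration only. Against a real site you would typically call `set_url()` with the site's robots.txt address and then `read()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used here instead of fetching a real file
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # hypothetical user agent name

# Is the article listing allowed for this user agent?
allowed_listing = parser.can_fetch(USER_AGENT, "https://example-blog.com/articles?page=2")
# Is a path under the disallowed /admin/ prefix allowed?
allowed_admin = parser.can_fetch(USER_AGENT, "https://example-blog.com/admin/settings")
# Crawl-delay requested by the site, or None if not specified
requested_delay = parser.crawl_delay(USER_AGENT)

print(allowed_listing, allowed_admin, requested_delay)
```

If a crawl delay is present, it is a reasonable value to pass as the `delay` argument of the scraper. robots.txt does not replace reading the terms of service, which a script cannot check for you.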

