Web Scraping with Requests and lxml

Web scraping is a vital technique in modern data analysis and digital research, enabling the automatic extraction of information from websites. Among Python’s vast array of libraries for scraping, requests and lxml stand out due to their simplicity, speed, and power. This article explores how to use these tools effectively to gather data from the web in a structured and efficient way.

Understanding Web Scraping

Web scraping refers to the process of programmatically collecting information from web pages. Instead of manually copying data, scripts can access and parse content at scale, saving time and effort. Typical use cases include price monitoring, news aggregation, SEO tracking, and competitor research.

Why Use requests and lxml?

The combination of requests and lxml is a popular choice because:

  • requests: A user-friendly HTTP library for sending all kinds of HTTP requests. It simplifies handling responses, headers, cookies, and more.

  • lxml: A powerful library based on the libxml2 and libxslt libraries. It provides rich support for XPath, enabling fast and precise extraction of HTML and XML content.

Together, they form a lightweight yet robust toolkit for scraping structured data from static pages.

Installing the Required Libraries

To get started, install the required packages using pip:

bash
pip install requests lxml

Sending HTTP Requests with requests

The first step in web scraping is downloading the web page content. The requests library makes this straightforward:

python
import requests

url = 'https://example.com'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve content. Status code: {response.status_code}")

The response.text attribute contains the HTML content of the page, which is what lxml will parse next.

Parsing HTML with lxml

With the HTML content obtained, the next step is to parse it using lxml. This is typically done through its html module:

python
from lxml import html

tree = html.fromstring(html_content)

Once parsed, the tree object provides a complete DOM representation, allowing you to query elements using XPath expressions.

Extracting Data with XPath

XPath is a powerful query language for selecting nodes in XML and HTML documents. It enables precise targeting of elements, attributes, and text.

For example, to extract all article titles wrapped in <h2> tags:

python
titles = tree.xpath('//h2/text()')

Or to get the href attributes from all anchor tags:

python
links = tree.xpath('//a/@href')

XPath supports a wide range of filters and functions, making it ideal for complex extraction logic.
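
As a quick illustration of those filters (the class names and page structure below are hypothetical, not taken from the example site), predicates and functions such as contains() and positional indexes can narrow a selection:

python
# Anchors whose class attribute contains "external" (illustrative class name)
external_links = tree.xpath('//a[contains(@class, "external")]/@href')

# The first paragraph inside each div with class "summary" (illustrative)
first_paragraphs = tree.xpath('//div[@class="summary"]/p[1]/text()')

# Count list items on the page (lxml returns this as a float)
item_count = tree.xpath('count(//ul/li)')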

Example: Scraping a Blog Page

Consider scraping article titles and links from a blog listing page:

python
url = 'https://example-blog.com'
response = requests.get(url)
tree = html.fromstring(response.content)

titles = tree.xpath('//div[@class="post-title"]/a/text()')
urls = tree.xpath('//div[@class="post-title"]/a/@href')

for title, link in zip(titles, urls):
    print(f"{title} -> {link}")

This code targets <a> tags within a div of class post-title, extracting both the visible text and the link.

Handling Relative URLs

Web pages often use relative URLs. These need to be converted to absolute URLs using the base domain:

python
from urllib.parse import urljoin

absolute_urls = [urljoin(url, link) for link in urls]

This ensures that all links are valid and usable for further scraping or crawling.

Managing Headers and User-Agents

Some websites may block requests that don’t appear to come from a browser. To avoid this, set a custom User-Agent:

python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)

This mimics a real browser session, increasing the chances of successful responses.
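
When making several requests to the same site, a requests.Session can carry those headers (and any cookies the site sets) across calls. A minimal sketch:

python
# A Session reuses the connection and applies the same headers to every request
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})

response = session.get(url)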

Navigating Pagination

For multi-page content, pagination must be handled. This typically involves identifying the “Next” page URL and looping through all pages:

python
while True:
    response = requests.get(url)
    tree = html.fromstring(response.content)

    # Extract data here...

    next_page = tree.xpath('//a[@rel="next"]/@href')
    if not next_page:
        break
    url = urljoin(url, next_page[0])

This loop continues until no further “next” page is found.

Error Handling and Rate Limiting

Scrapers must be resilient to failures and respectful of server load:

  • Error handling: Check HTTP status codes, use try-except blocks.

  • Rate limiting: Use time.sleep() to pause between requests.

python
import time

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
else:
    html_content = response.text  # Process the successful response here

time.sleep(2)  # Pause to avoid overwhelming the server

Working with Dynamic Content

requests and lxml only work with static HTML. If the site content is dynamically generated with JavaScript, consider alternatives like Selenium or Playwright. However, many sites still expose their data through static HTML or accessible APIs.
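
When a page is rendered client-side, it is often worth checking the browser's network tab for a JSON endpoint that can be fetched directly. The endpoint and field names below are purely hypothetical, but the pattern with requests looks like this:

python
# Hypothetical JSON endpoint discovered via the browser's network tab
api_url = 'https://example.com/api/posts?page=1'
response = requests.get(api_url, headers=headers)
response.raise_for_status()

data = response.json()  # Parsed JSON instead of HTML
for post in data.get('posts', []):
    print(post.get('title'), '->', post.get('url'))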

Storing Scraped Data

Collected data can be stored in formats like CSV, JSON, or databases. Here’s a quick way to write to CSV:

python
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Link'])
    for title, link in zip(titles, absolute_urls):
        writer.writerow([title, link])

For more structured data or larger volumes, SQLite or Pandas might be more suitable.
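
For instance, if pandas is available, the same rows can be collected into a DataFrame and written out in one call (a sketch, assuming pandas is installed):

python
import pandas as pd

# Build a DataFrame from the scraped titles and resolved links
df = pd.DataFrame({'Title': titles, 'Link': absolute_urls})

# Write to CSV; for larger datasets, df.to_sql() with an SQLite connection also works
df.to_csv('data.csv', index=False, encoding='utf-8')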

Respecting Terms of Use

Always review a website’s robots.txt file and terms of service before scraping. Ethical scraping ensures compliance and avoids legal or reputational issues.

python
robots_url = 'https://example.com/robots.txt'
robots_response = requests.get(robots_url)
print(robots_response.text)

If a site disallows scraping of specific paths, those should be respected.
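
Python's standard library can also evaluate robots.txt rules programmatically. A minimal sketch using urllib.robotparser (the user-agent string and path here are hypothetical):

python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser('https://example.com/robots.txt')
parser.read()

# Check whether our (hypothetical) user agent may fetch a path before requesting it
if parser.can_fetch('MyScraperBot', 'https://example.com/some/path'):
    response = requests.get('https://example.com/some/path')
else:
    print('Disallowed by robots.txt; skipping.')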

Advantages of Using requests and lxml

  • Lightweight and fast

  • Full XPath support

  • Better performance than heavier tools like Selenium

  • Suitable for most static web scraping tasks

Conclusion

Combining requests and lxml provides a powerful and efficient method for web scraping in Python. With their straightforward APIs and robust capabilities, you can extract, transform, and store valuable data from a wide range of websites. By adhering to best practices, handling exceptions gracefully, and respecting site policies, you can build reliable and responsible scraping tools that unlock the full potential of online data.
