Web scraping is a vital technique in modern data analysis and digital research, enabling the automatic extraction of information from websites. Among Python’s vast array of libraries for scraping, requests and lxml stand out due to their simplicity, speed, and power. This article explores how to use these tools effectively to gather data from the web in a structured and efficient way.
Understanding Web Scraping
Web scraping refers to the process of programmatically collecting information from web pages. Instead of manually copying data, scripts can access and parse content at scale, saving time and effort. Typical use cases include price monitoring, news aggregation, SEO tracking, and competitor research.
Why Use requests and lxml?
The combination of requests and lxml is a popular choice because:
- requests: A user-friendly HTTP library for sending all kinds of HTTP requests. It simplifies handling responses, headers, cookies, and more.
- lxml: A powerful parsing library built on libxml2 and libxslt. It provides rich support for XPath, enabling fast and precise extraction from HTML and XML content.
Together, they form a lightweight yet robust toolkit for scraping structured data from static pages.
Installing the Required Libraries
To get started, install the required packages using pip:
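```
pip install requests lxml
```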
Sending HTTP Requests with requests
The first step in web scraping is downloading the web page content. The requests library makes this straightforward:
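A minimal example, using a placeholder URL:

```python
import requests

# Placeholder URL for illustration
url = "https://example.com/articles"

response = requests.get(url)
response.raise_for_status()  # fail fast on 4xx/5xx responses

print(response.status_code)
```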
The response.text attribute contains the HTML content of the page, which is what lxml will parse next.
Parsing HTML with lxml
With the HTML content obtained, the next step is to parse it using lxml. This is typically done through its html module:
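```python
from lxml import html

# Parse the downloaded HTML into an element tree
tree = html.fromstring(response.text)
```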
Once parsed, the tree object provides a complete DOM representation, allowing you to query elements using XPath expressions.
Extracting Data with XPath
XPath is a powerful query language for selecting nodes in XML and HTML documents. It enables precise targeting of elements, attributes, and text.
For example, to extract all article titles wrapped in <h2> tags:
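```python
# Extract the text of every <h2> element on the page
titles = tree.xpath("//h2/text()")
```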
Or to get the href attributes from all anchor tags:
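```python
# Collect the href attribute of every <a> element
links = tree.xpath("//a/@href")
```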
XPath supports a wide range of filters and functions, making it ideal for complex extraction logic.
Example: Scraping a Blog Page
Consider scraping article titles and links from a blog listing page:
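A sketch of such a scraper, assuming a hypothetical listing page whose titles sit in div elements with class post-title:

```python
import requests
from lxml import html

# Hypothetical blog listing page used for illustration
url = "https://example-blog.com/articles"

response = requests.get(url)
tree = html.fromstring(response.text)

# Select <a> tags inside <div class="post-title"> containers
for link in tree.xpath('//div[@class="post-title"]/a'):
    title = link.text_content().strip()
    href = link.get("href")
    print(title, href)
```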
This code targets <a> tags within a div of class post-title, extracting both the visible text and the link.
Handling Relative URLs
Web pages often use relative URLs. These need to be converted to absolute URLs using the base domain:
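Python's standard library handles this with urllib.parse.urljoin; for example:

```python
from urllib.parse import urljoin

base_url = "https://example-blog.com"   # the site's base domain
href = "/posts/intro-to-scraping"       # a relative link taken from the page

absolute_url = urljoin(base_url, href)
print(absolute_url)  # https://example-blog.com/posts/intro-to-scraping
```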
This ensures that all links are valid and usable for further scraping or crawling.
Managing Headers and User-Agents
Some websites may block requests that don’t appear to come from a browser. To avoid this, set a custom User-Agent:
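For example, pass a headers dictionary to requests.get() (the User-Agent string below is just a typical desktop browser value):

```python
import requests

url = "https://example.com/articles"  # placeholder URL

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    )
}

response = requests.get(url, headers=headers)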
This mimics a real browser session, increasing the chances of successful responses.
Navigating Pagination
For multi-page content, pagination must be handled. This typically involves identifying the “Next” page URL and looping through all pages:
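A sketch of such a loop, assuming a hypothetical site that marks its next-page link with rel="next" (adjust the XPath to the actual markup):

```python
import requests
from lxml import html
from urllib.parse import urljoin

# Hypothetical starting page
url = "https://example-blog.com/articles"

while url:
    response = requests.get(url)
    tree = html.fromstring(response.text)

    # ... extract data from the current page here ...

    next_links = tree.xpath('//a[@rel="next"]/@href')
    url = urljoin(url, next_links[0]) if next_links else None
```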
This loop continues until no further “next” page is found.
Error Handling and Rate Limiting
Scrapers must be resilient to failures and respectful of server load:
- Error handling: check HTTP status codes and wrap requests in try-except blocks.
- Rate limiting: use time.sleep() to pause between requests (see the sketch after this list).
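A minimal sketch combining both, assuming a hypothetical urls_to_scrape list of page URLs:

```python
import time
import requests

# Hypothetical list of page URLs to fetch
urls_to_scrape = [
    "https://example-blog.com/articles?page=1",
    "https://example-blog.com/articles?page=2",
]

for url in urls_to_scrape:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx status codes
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        continue

    # ... parse response.text here ...

    time.sleep(1)  # pause between requests to limit server load
```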
Working with Dynamic Content
requests and lxml only work with static HTML. If the site content is dynamically generated with JavaScript, consider alternatives like Selenium or Playwright. However, many sites still expose their data through static HTML or accessible APIs.
Storing Scraped Data
Collected data can be stored in formats like CSV, JSON, or databases. Here’s a quick way to write to CSV:
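For example, assuming the scraped results are collected as (title, url) tuples:

```python
import csv

# Hypothetical results collected by the scraper
articles = [("First Post", "https://example-blog.com/posts/1")]

with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])  # header row
    writer.writerows(articles)
```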
For more structured data or larger volumes, SQLite or Pandas might be more suitable.
Respecting Terms of Use
Always review a website’s robots.txt file and terms of service before scraping. Ethical scraping ensures compliance and avoids legal or reputational issues.
If a site disallows scraping of specific paths, those should be respected.
Advantages of Using requests and lxml
- Lightweight and fast
- Full XPath support
- Better performance than heavier tools like Selenium
- Suitable for most static web scraping tasks
Conclusion
Combining requests and lxml provides a powerful and efficient method for web scraping in Python. With their straightforward APIs and robust capabilities, you can extract, transform, and store valuable data from a wide range of websites. By adhering to best practices, handling exceptions gracefully, and respecting site policies, you can build reliable and responsible scraping tools that unlock the full potential of online data.