Monitoring web pages for updates is a common task for many applications, from tracking product price changes to watching news sites for breaking stories. Python, with its rich ecosystem of libraries, offers effective tools to automate this process. This article explores how to monitor web pages for updates using Python, covering key concepts, libraries, and a step-by-step guide to building your own monitoring script.
Why Monitor Web Pages for Updates?
Web page monitoring can be essential for several reasons:
- Price tracking on e-commerce sites to catch discounts or availability.
- Content monitoring to detect changes in news, blogs, or official announcements.
- Competitor analysis to stay updated with competitor offers or news.
- Data collection for research or analysis purposes.
Manual checking is tedious and inefficient, especially when you monitor multiple pages. Automated scripts let you receive updates in real time or at scheduled intervals.
Key Concepts in Web Page Monitoring
- Fetching the web page content: retrieve the current HTML content of the target page.
- Detecting changes: compare the newly fetched content with previously saved content to detect any changes.
- Notification: alert the user when changes are detected via email, SMS, or other means.
- Scheduling: run the monitoring script periodically, using tools like cron jobs or task schedulers.
Python Libraries for Web Page Monitoring
- requests: for sending HTTP requests to get the web page content.
- BeautifulSoup (bs4): for parsing and extracting relevant parts of HTML.
- difflib: for comparing differences between two versions of text.
- smtplib: for sending email notifications.
- schedule: for scheduling periodic checks inside your script.
Step-by-Step Guide to Monitor a Web Page with Python
Step 1: Install Required Libraries
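The third-party packages used in this guide can be installed with pip (note that BeautifulSoup is published as beautifulsoup4 and imported as bs4; difflib and smtplib ship with Python):

```shell
pip install requests beautifulsoup4 schedule
```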
Step 2: Fetch the Web Page Content
Use the requests library to fetch the HTML content.
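A minimal fetch helper might look like the following. The User-Agent string is an illustrative placeholder; sending a browser-like header and a timeout is a common courtesy to the server and to your own script:

```python
import requests

def fetch_page(url, timeout=10):
    """Fetch the page and return its HTML text, or None on failure."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; PageMonitor/1.0)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx responses
        return response.text
    except requests.RequestException as exc:
        print(f"Fetch failed: {exc}")
        return None
```

Returning None on failure (rather than raising) lets the monitoring loop skip a bad check and try again later.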
Step 3: Extract Relevant Content
Often, you don't need to monitor the entire HTML, just a specific section (such as a product price or an article body). Use BeautifulSoup to extract it.
Here, selector is a CSS selector string to target the content of interest.
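A sketch of such an extraction helper, taking the HTML and a CSS selector:

```python
from bs4 import BeautifulSoup

def extract_content(html, selector):
    """Return the text of the first element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else ""
```

For example, `extract_content(html, ".price")` would pull the text of the first element with class `price`.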
Step 4: Compare Old and New Content
Store the last known content in a file and compare with new content.
For more detailed comparison, you can use difflib.unified_diff to generate a diff.
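A small helper built on difflib.unified_diff; it returns an empty string when the two versions are identical:

```python
import difflib

def diff_content(old, new):
    """Return a unified diff of two content versions ('' if unchanged)."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    return "\n".join(lines)
```

The diff text is also handy as the body of a notification, so you can see exactly what changed.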
Step 5: Save Content Locally
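A pair of helpers for persisting the last known content to a local file (the filename last_content.txt is just an illustrative default):

```python
from pathlib import Path

def load_previous(state_file="last_content.txt"):
    """Read the previously saved content, or '' if none exists yet."""
    path = Path(state_file)
    return path.read_text(encoding="utf-8") if path.exists() else ""

def save_current(content, state_file="last_content.txt"):
    """Overwrite the saved copy with the latest content."""
    Path(state_file).write_text(content, encoding="utf-8")
```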
Step 6: Notify on Change
Here’s a simple example using email to notify changes. You need to replace placeholders with actual SMTP server details and credentials.
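One way to structure this is to build the message separately from sending it; the SMTP host, port, and credentials below are placeholders you must fill in for your provider:

```python
import smtplib
from email.message import EmailMessage

def build_alert(diff_text, sender, recipient):
    """Build a plain-text email describing the detected change."""
    msg = EmailMessage()
    msg["Subject"] = "Web page change detected"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"The monitored page has changed:\n\n{diff_text}")
    return msg

def send_alert(msg, host, port, user, password):
    """Send the message; replace host/port/credentials with real values."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```

Splitting construction from delivery also makes the message easy to test without a live SMTP connection.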
Step 7: Combine Everything into a Monitoring Function
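The pieces above can be combined into a single check. This sketch takes the fetch-and-extract step as a callable, which keeps the function easy to test and swap out:

```python
from pathlib import Path

def check_for_update(fetch_fn, state_file="last_content.txt"):
    """Fetch current content via fetch_fn, compare with the saved copy,
    update the saved copy, and return True if the content changed."""
    new_content = fetch_fn()
    if new_content is None:
        return False  # fetch failed; skip this check
    path = Path(state_file)
    old_content = path.read_text(encoding="utf-8") if path.exists() else ""
    path.write_text(new_content, encoding="utf-8")
    return new_content != old_content
```

When `check_for_update` returns True, you would generate a diff and send the notification email from the earlier steps.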
Step 8: Schedule Periodic Checks
You can use the schedule library to run this function at regular intervals.
Alternatively, you can run the script via a cron job or the Windows Task Scheduler if you prefer not to keep a Python process running constantly.
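On Linux or macOS, a crontab entry like the following would run the script periodically (add it with crontab -e; the interval and paths are illustrative):

```shell
# Run the monitor every 15 minutes, appending output to a log file
*/15 * * * * /usr/bin/python3 /home/user/monitor.py >> /home/user/monitor.log 2>&1
```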
Best Practices and Tips
- Respect website terms of service: avoid aggressive scraping that may harm the site or violate policies.
- Handle dynamic content: some websites load content via JavaScript; you may need tools like Selenium or Playwright for those.
- Use headers and delays: mimic browser headers and add delays between requests to avoid being blocked.
- Log errors and changes: keep a log file for monitoring the script's performance and issues.
- Secure sensitive data: store email credentials securely using environment variables or a secret manager.
Conclusion
Monitoring web pages for updates with Python is straightforward using libraries like requests and BeautifulSoup. By fetching content, extracting relevant data, comparing it over time, and sending notifications, you can automate tracking changes on any web page efficiently. Scheduling the script ensures continuous monitoring without manual intervention, making Python an ideal choice for this task.