
Create a simple web crawler

You can build a simple web crawler in Python using the requests library to fetch web pages and BeautifulSoup (from the bs4 package) to parse the HTML. Below is a basic example that crawls a given URL, extracts all links, and visits them up to a specified depth.

python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

class SimpleCrawler:
    def __init__(self, base_url, max_depth=2):
        self.base_url = base_url
        self.max_depth = max_depth
        self.visited = set()  # URLs already crawled, so we never revisit them

    def crawl(self, url, depth=0):
        # Stop at the depth limit and skip pages we have already seen
        if depth > self.max_depth or url in self.visited:
            return
        print(f'Crawling: {url} at depth {depth}')
        self.visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                return
        except requests.RequestException:
            return

        # Parse the page and follow every link that stays on the same domain
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a', href=True):
            full_url = urljoin(url, link['href'])
            if self.is_valid_url(full_url):
                self.crawl(full_url, depth + 1)

    def is_valid_url(self, url):
        # Accept only http(s) links on the same domain as the base URL
        parsed_base = urlparse(self.base_url)
        parsed_url = urlparse(url)
        return parsed_url.scheme in ('http', 'https') and parsed_base.netloc == parsed_url.netloc

if __name__ == '__main__':
    start_url = 'https://example.com'  # Change this to the site you want to crawl
    crawler = SimpleCrawler(start_url, max_depth=2)
    crawler.crawl(start_url)

How it works:

  • Starts crawling from the start_url.

  • Visits pages up to max_depth levels deep.

  • Only follows links within the same domain as the base URL.

  • Prints URLs as it visits them.

  • Avoids revisiting URLs.

You can expand this crawler by adding features like saving page content, handling robots.txt, or extracting specific data.
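For instance, robots.txt support can be layered on with the standard library's urllib.robotparser. The sketch below assumes the SimpleCrawler class from the example above is defined in the same script; the PoliteCrawler name and the user_agent value are illustrative choices, not part of the original example.

python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

class PoliteCrawler(SimpleCrawler):
    """Extends SimpleCrawler (defined above) to honor the site's robots.txt."""

    def __init__(self, base_url, max_depth=2, user_agent='SimpleCrawler'):
        super().__init__(base_url, max_depth)
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.set_url(urljoin(base_url, '/robots.txt'))
        try:
            self.robots.read()   # fetch and parse robots.txt for the base domain
        except OSError:
            self.robots = None   # robots.txt unreachable; proceed without it

    def crawl(self, url, depth=0):
        # Skip any URL that robots.txt disallows for our user agent
        if self.robots and not self.robots.can_fetch(self.user_agent, url):
            print(f'Skipping (disallowed by robots.txt): {url}')
            return
        super().crawl(url, depth)

With this in place, you would instantiate PoliteCrawler instead of SimpleCrawler, and any page the site disallows for your user agent is skipped before a request is made.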
