You can build a simple web crawler in Python using the requests library to fetch web pages and BeautifulSoup (from the bs4 package) to parse HTML. Below is a basic example that crawls a given URL, extracts all links, and follows them up to a configurable depth.
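A minimal sketch of such a crawler, assuming the requests and beautifulsoup4 packages are installed (the function and parameter names here, such as `crawl` and `same_domain`, are illustrative choices, not a fixed API):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def same_domain(url, base_url):
    """Return True if url is on the same domain as base_url."""
    return urlparse(url).netloc == urlparse(base_url).netloc


def crawl(start_url, max_depth=2):
    """Breadth-first crawl from start_url, printing each URL visited."""
    visited = set()
    queue = deque([(start_url, 0)])  # (url, depth) pairs

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        print(f"{'  ' * depth}Visiting: {url}")

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"  Failed to fetch {url}: {exc}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            # Resolve relative links and drop any fragment (#section)
            link = urljoin(url, anchor["href"]).split("#")[0]
            if same_domain(link, start_url) and link not in visited:
                queue.append((link, depth + 1))

    return visited


if __name__ == "__main__":
    crawl("https://example.com", max_depth=1)
```

This version uses a breadth-first queue rather than recursion, which makes the depth limit easy to enforce and avoids hitting Python's recursion limit on large sites.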
How it works:

- Starts crawling from the start_url.
- Visits pages up to max_depth levels deep.
- Only follows links within the same domain as the base URL.
- Prints URLs as it visits them.
- Avoids revisiting URLs.
You can expand this crawler by adding features like saving page content, handling robots.txt, or extracting specific data.