
Create a simple web crawler

You can build a simple web crawler in Python using the requests library to fetch web pages and BeautifulSoup (from the bs4 package) to parse the HTML. Below is a basic example that crawls a given URL, extracts all links, and visits them up to a specified depth.

python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

class SimpleCrawler:
    def __init__(self, base_url, max_depth=2):
        self.base_url = base_url
        self.max_depth = max_depth
        self.visited = set()  # URLs already crawled, so we never revisit them

    def crawl(self, url, depth=0):
        # Stop at the depth limit and skip pages we have already seen
        if depth > self.max_depth or url in self.visited:
            return
        print(f'Crawling: {url} at depth {depth}')
        self.visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                return
        except requests.RequestException:
            return

        # Parse the page and follow every link that stays on the same domain
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a', href=True):
            full_url = urljoin(url, link['href'])
            if self.is_valid_url(full_url):
                self.crawl(full_url, depth + 1)

    def is_valid_url(self, url):
        # Accept only http(s) links on the same domain as the base URL
        parsed_base = urlparse(self.base_url)
        parsed_url = urlparse(url)
        return parsed_url.scheme in ('http', 'https') and parsed_base.netloc == parsed_url.netloc

if __name__ == '__main__':
    start_url = 'https://example.com'  # Change this to the site you want to crawl
    crawler = SimpleCrawler(start_url, max_depth=2)
    crawler.crawl(start_url)

How it works:

  • Starts crawling from the start_url.

  • Visits pages up to max_depth levels deep.

  • Only follows links within the same domain as the base URL.

  • Prints URLs as it visits them.

  • Avoids revisiting URLs.

You can expand this crawler by adding features like saving page content, handling robots.txt, or extracting specific data.
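For instance, robots.txt support can be layered on with the standard library's urllib.robotparser. The sketch below assumes the SimpleCrawler class from the example above is defined in the same script; the PoliteCrawler name and the user_agent value are illustrative choices, not part of the original example.

python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

class PoliteCrawler(SimpleCrawler):
    """Extends SimpleCrawler (defined above) to honor the site's robots.txt."""

    def __init__(self, base_url, max_depth=2, user_agent='SimpleCrawler'):
        super().__init__(base_url, max_depth)
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.set_url(urljoin(base_url, '/robots.txt'))
        try:
            self.robots.read()   # fetch and parse robots.txt for the base domain
        except OSError:
            self.robots = None   # robots.txt unreachable; proceed without it

    def crawl(self, url, depth=0):
        # Skip any URL that robots.txt disallows for our user agent
        if self.robots and not self.robots.can_fetch(self.user_agent, url):
            print(f'Skipping (disallowed by robots.txt): {url}')
            return
        super().crawl(url, depth)

With this in place, you would instantiate PoliteCrawler instead of SimpleCrawler, and any page the site disallows for your user agent is skipped before a request is made.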
