The Palos Publishing Company


Create a Python-based site map generator

Here’s a Python-based sitemap generator that crawls a website breadth-first and writes a sitemap.xml file. The script uses requests to fetch pages, BeautifulSoup to extract links, and urllib.parse for URL handling.


✅ Python Sitemap Generator

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET
import time


class SitemapGenerator:
    def __init__(self, base_url, max_urls=1000, delay=1):
        self.base_url = base_url.rstrip('/')
        self.visited = set()
        self.to_visit = [self.base_url]
        self.max_urls = max_urls
        self.delay = delay
        self.domain = urlparse(self.base_url).netloc

    def is_valid_url(self, url):
        # Only crawl http(s) URLs on the same domain as the starting page.
        parsed = urlparse(url)
        return parsed.netloc == self.domain and parsed.scheme in ['http', 'https']

    def get_links(self, url):
        try:
            response = requests.get(url, timeout=5)
            if 'text/html' not in response.headers.get('Content-Type', ''):
                return []
            soup = BeautifulSoup(response.text, 'html.parser')
            links = set()
            for a in soup.find_all('a', href=True):
                full_url = urljoin(url, a['href'])
                # Drop fragments and trailing slashes so duplicates collapse.
                clean_url = full_url.split('#')[0].rstrip('/')
                if self.is_valid_url(clean_url):
                    links.add(clean_url)
            return links
        except Exception:
            return []

    def crawl(self):
        # Breadth-first crawl: the queue is consumed front-to-back.
        while self.to_visit and len(self.visited) < self.max_urls:
            current_url = self.to_visit.pop(0)
            if current_url in self.visited:
                continue
            print(f'Crawling: {current_url}')
            self.visited.add(current_url)
            new_links = self.get_links(current_url)
            for link in new_links:
                if link not in self.visited and link not in self.to_visit:
                    self.to_visit.append(link)
            time.sleep(self.delay)

    def generate_sitemap(self, output_file='sitemap.xml'):
        urlset = ET.Element('urlset', {
            'xmlns': 'http://www.sitemaps.org/schemas/sitemap/0.9'
        })
        for url in sorted(self.visited):
            url_tag = ET.SubElement(urlset, 'url')
            loc = ET.SubElement(url_tag, 'loc')
            loc.text = url
            lastmod = ET.SubElement(url_tag, 'lastmod')
            lastmod.text = time.strftime('%Y-%m-%d')
        tree = ET.ElementTree(urlset)
        tree.write(output_file, encoding='utf-8', xml_declaration=True)
        print(f'Sitemap generated: {output_file}')


if __name__ == '__main__':
    base_url = input("Enter the base URL (e.g. https://example.com): ").strip()
    generator = SitemapGenerator(base_url)
    generator.crawl()
    generator.generate_sitemap()
```
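Once the script has run, the output can be sanity-checked by parsing it back with the same xml.etree.ElementTree module. A minimal sketch (the helper name list_sitemap_urls is illustrative; the namespace string matches the urlset declaration above):

```python
import xml.etree.ElementTree as ET

# The sitemap namespace declared on <urlset> prefixes every tag when parsed.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def list_sitemap_urls(path='sitemap.xml'):
    """Parse a sitemap.xml and return the URLs listed in its <loc> tags."""
    root = ET.parse(path).getroot()
    # Each <url> element holds one <loc> child with the page address.
    return [url.find(f'{SITEMAP_NS}loc').text
            for url in root.findall(f'{SITEMAP_NS}url')]
```

Printing the returned list gives a quick view of every page the crawler found.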

🔧 Requirements

Install dependencies:

```bash
pip install requests beautifulsoup4
```

📝 Notes

  • The crawler stays on the starting domain; external links, fragment-only links, and non-HTML responses are skipped.

  • The crawl is capped at 1000 URLs by default (max_urls), with a configurable delay between requests.

  • The output is a standard sitemap.xml suitable for submission to Google Search Console.

  • It can be extended to include <priority>, <changefreq>, or a dynamically determined <lastmod> per URL (the script above stamps every URL with today's date).
