The Palos Publishing Company


Create a Python-based site map generator

Here’s a Python-based sitemap generator that crawls a website breadth-first and writes a sitemap.xml file. The script uses requests to fetch pages, BeautifulSoup to extract links, and urllib.parse for URL handling.


✅ Python Sitemap Generator

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET
import time


class SitemapGenerator:
    def __init__(self, base_url, max_urls=1000, delay=1):
        self.base_url = base_url.rstrip('/')
        self.visited = set()
        self.to_visit = [self.base_url]
        self.max_urls = max_urls
        self.delay = delay
        self.domain = urlparse(self.base_url).netloc

    def is_valid_url(self, url):
        # Only crawl http(s) URLs on the same domain as the starting page.
        parsed = urlparse(url)
        return parsed.netloc == self.domain and parsed.scheme in ['http', 'https']

    def get_links(self, url):
        try:
            response = requests.get(url, timeout=5)
            if 'text/html' not in response.headers.get('Content-Type', ''):
                return []
            soup = BeautifulSoup(response.text, 'html.parser')
            links = set()
            for a in soup.find_all('a', href=True):
                full_url = urljoin(url, a['href'])
                # Drop fragments and trailing slashes so duplicates collapse.
                clean_url = full_url.split('#')[0].rstrip('/')
                if self.is_valid_url(clean_url):
                    links.add(clean_url)
            return links
        except Exception:
            return []

    def crawl(self):
        # Breadth-first crawl: the queue is consumed front-to-back.
        while self.to_visit and len(self.visited) < self.max_urls:
            current_url = self.to_visit.pop(0)
            if current_url in self.visited:
                continue
            print(f'Crawling: {current_url}')
            self.visited.add(current_url)
            new_links = self.get_links(current_url)
            for link in new_links:
                if link not in self.visited and link not in self.to_visit:
                    self.to_visit.append(link)
            time.sleep(self.delay)

    def generate_sitemap(self, output_file='sitemap.xml'):
        urlset = ET.Element('urlset', {
            'xmlns': 'http://www.sitemaps.org/schemas/sitemap/0.9'
        })
        for url in sorted(self.visited):
            url_tag = ET.SubElement(urlset, 'url')
            loc = ET.SubElement(url_tag, 'loc')
            loc.text = url
            lastmod = ET.SubElement(url_tag, 'lastmod')
            lastmod.text = time.strftime('%Y-%m-%d')
        tree = ET.ElementTree(urlset)
        tree.write(output_file, encoding='utf-8', xml_declaration=True)
        print(f'Sitemap generated: {output_file}')


if __name__ == '__main__':
    base_url = input("Enter the base URL (e.g. https://example.com): ").strip()
    generator = SitemapGenerator(base_url)
    generator.crawl()
    generator.generate_sitemap()
```
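Once the script has run, the output can be sanity-checked by parsing it back with the same xml.etree.ElementTree module. A minimal sketch (the helper name list_sitemap_urls is illustrative; the namespace string matches the urlset declaration above):

```python
import xml.etree.ElementTree as ET

# The sitemap namespace declared on <urlset> prefixes every tag when parsed.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def list_sitemap_urls(path='sitemap.xml'):
    """Parse a sitemap.xml and return the URLs listed in its <loc> tags."""
    root = ET.parse(path).getroot()
    # Each <url> element holds one <loc> child with the page address.
    return [url.find(f'{SITEMAP_NS}loc').text
            for url in root.findall(f'{SITEMAP_NS}url')]
```

Printing the returned list gives a quick view of every page the crawler found.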

🔧 Requirements

Install dependencies:

```bash
pip install requests beautifulsoup4
```

📝 Notes

  • The crawler stays on the starting domain; external links, fragment-only links, and non-HTML responses are skipped.

  • The crawl is capped at 1000 URLs by default (max_urls), with a configurable delay between requests.

  • The output is a standard sitemap.xml suitable for submission to Google Search Console.

  • It can be extended to include <priority>, <changefreq>, or a dynamically determined <lastmod> per URL (the script above stamps every URL with today's date).
