The Palos Publishing Company

Scrape license info from public GitHub repos

License information for public GitHub repositories is available through GitHub’s REST API, so there is no need to scrape HTML pages. Here’s how to fetch it in Python using the requests library:


Requirements

  • A GitHub Personal Access Token (optional, but recommended because it raises the rate limit)

  • Python 3

  • requests module (pip install requests if not already installed)


Example Script to Scrape License Info

python
import requests

# Optional: use a personal access token to increase the rate limit.
GITHUB_TOKEN = ""  # e.g. "your_token_here"; leave empty if not using one
HEADERS = {"Authorization": f"token {GITHUB_TOKEN}"} if GITHUB_TOKEN else {}


def get_license_info(owner, repo):
    """Look up the license GitHub reports for owner/repo."""
    url = f"https://api.github.com/repos/{owner}/{repo}/license"
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code == 200:
        data = response.json()
        return {
            "repo": f"{owner}/{repo}",
            "license": data["license"]["name"] if data.get("license") else "No license",
        }
    elif response.status_code == 404:
        # Returned when the repo does not exist or no license was detected.
        return {"repo": f"{owner}/{repo}", "license": "Not Found"}
    else:
        return {"repo": f"{owner}/{repo}", "license": f"Error: {response.status_code}"}


# Example usage
repos = [
    ("torvalds", "linux"),
    ("numpy", "numpy"),
    ("psf", "requests"),
]

for owner, repo in repos:
    info = get_license_info(owner, repo)
    print(info)

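Output Example

The exact license strings depend on what GitHub detects for each repository, so treat the values below as illustrative:

bash
{'repo': 'torvalds/linux', 'license': 'GPL-2.0'}
{'repo': 'numpy/numpy', 'license': 'BSD 3-Clause'}
{'repo': 'psf/requests', 'license': 'Apache 2.0'}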
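
Short identifiers such as GPL-2.0 resemble SPDX identifiers, which the /license response carries in a separate spdx_id field next to name. The following is a minimal sketch of reading both fields from a decoded response body. The helper name license_fields is hypothetical, and the sample payload is hand-written for illustration rather than captured from the API.

```python
def license_fields(data):
    """Pick the license name and SPDX identifier out of a /license response body."""
    # "license" can be missing or null when GitHub detected nothing.
    lic = data.get("license") or {}
    return {
        "name": lic.get("name", "No license"),
        "spdx_id": lic.get("spdx_id"),
    }


# Hand-written sample shaped like the API response (not captured output).
sample = {"license": {"key": "mit", "name": "MIT License", "spdx_id": "MIT"}}
print(license_fields(sample))
```

Inside get_license_info, the same lookup could replace the data["license"]["name"] expression if the SPDX identifier is preferred over the human-readable name.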

Scaling for Many Repos

If you’re dealing with many repositories:

  1. Use GitHub’s search API to gather repos:
    https://api.github.com/search/repositories?q=topic:machine-learning+stars:>100

  2. Paginate through results by appending &page=X&per_page=100 to the search URL (the search API returns at most 1,000 results per query)

  3. Collect owner and repo from each item, then call the /license endpoint for each.
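
The three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: collect_repos, fetch_page, and max_pages are hypothetical names introduced here, and the search response is assumed to be a JSON object with an items list whose entries carry owner.login and name. The HTTP call is passed in as a function so the paging logic can be tried without touching the network.

```python
SEARCH_URL = "https://api.github.com/search/repositories"


def fetch_page(query, page, per_page=100, headers=None):
    """Fetch one page of search results and return the decoded JSON body."""
    # Imported here so collect_repos can be exercised with a stub
    # even where requests is not installed.
    import requests

    params = {"q": query, "page": page, "per_page": per_page}
    response = requests.get(SEARCH_URL, params=params, headers=headers or {}, timeout=10)
    response.raise_for_status()
    return response.json()


def collect_repos(query, fetch=fetch_page, max_pages=3, per_page=100):
    """Return (owner, repo) pairs from up to max_pages pages of search results."""
    pairs = []
    for page in range(1, max_pages + 1):
        items = fetch(query, page, per_page).get("items", [])
        pairs.extend((item["owner"]["login"], item["name"]) for item in items)
        if len(items) < per_page:
            break  # a short page means there are no further results
    return pairs
```

A call such as collect_repos("topic:machine-learning stars:>100") would then yield pairs that can be fed to get_license_info one at a time.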


Respect GitHub API Limits

  • Unauthenticated: 60 requests/hour

  • Authenticated: 5000 requests/hour
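
One way to stay within these limits is to read the rate-limit headers GitHub attaches to API responses. The sketch below assumes the X-RateLimit-Remaining and X-RateLimit-Reset headers (the latter a Unix timestamp in seconds). seconds_until_reset is a hypothetical helper name, not part of any library.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to pause before the next request.

    Returns 0 while requests remain in the current window, or when the
    rate-limit headers are missing.
    """
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0
    if int(remaining) > 0:
        return 0
    now = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0, int(reset) - int(now))
```

With the script above, this could be used as time.sleep(seconds_until_reset(response.headers)) after each call, so the loop pauses only once the quota is exhausted.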

