Scraping license information from public GitHub repositories can be done using GitHub’s REST API. Here’s how you can do it in Python using the requests library:
Requirements
- A GitHub Personal Access Token (optional but recommended to avoid rate limits)
- Python 3
- requests module (pip install requests if not already installed)
Example Script to Scrape License Info
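A minimal sketch of such a script, assuming the /repos/{owner}/{repo}/license endpoint of the REST API. The function name get_license, the optional session parameter, and the choice to treat a 404 as "no license" are illustrative assumptions, not a definitive implementation.

```python
import requests

API_ROOT = "https://api.github.com"


def get_license(owner, repo, token=None, session=None):
    """Return the license object for owner/repo, or None if none is found.

    `session` is an optional requests.Session (hypothetical parameter,
    added here so connections can be reused across many calls).
    """
    http = session or requests
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # A Personal Access Token raises the rate limit considerably.
        headers["Authorization"] = f"Bearer {token}"

    url = f"{API_ROOT}/repos/{owner}/{repo}/license"
    resp = http.get(url, headers=headers, timeout=10)

    if resp.status_code == 404:
        # GitHub found no license file, or the repository does not exist.
        return None
    resp.raise_for_status()

    # The "license" field holds keys such as "spdx_id" and "name".
    return resp.json().get("license")
```

For example, get_license("psf", "requests") should return a dictionary describing that repository's license, or None if GitHub detects none. Pass token=... to authenticate.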
Output Example
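As an illustration of what one printed result could look like, here is a small formatting sketch. The sample dictionary is hand-written to resemble the "license" object in the API response; its values are not fetched from GitHub, and format_license_line and the repository names are hypothetical.

```python
def format_license_line(full_name, license_info):
    """Render one 'owner/repo: SPDX (name)' line; license_info may be None."""
    if not license_info:
        return f"{full_name}: no license detected"
    spdx = license_info.get("spdx_id") or "unknown"
    name = license_info.get("name") or "unknown"
    return f"{full_name}: {spdx} ({name})"


# Hand-written sample, shaped like the "license" object (illustrative only).
sample = {"key": "mit", "name": "MIT License", "spdx_id": "MIT"}

print(format_license_line("octocat/example", sample))
# prints: octocat/example: MIT (MIT License)
print(format_license_line("octocat/unlicensed", None))
# prints: octocat/unlicensed: no license detected
```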
Scaling for Many Repos
If you’re dealing with many repositories:
- Use GitHub’s search API to gather repos: https://api.github.com/search/repositories?q=topic:machine-learning+stars:>100
- Paginate through results with ?page=X&per_page=100
- Collect owner and repo from each item, then call the /license endpoint for each.
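The steps above could be sketched roughly as follows. The function name search_repos, the max_pages default, and the optional session parameter are assumptions made for illustration; the query string is the one from the example URL.

```python
import requests

SEARCH_URL = "https://api.github.com/search/repositories"


def search_repos(query, token=None, max_pages=3, session=None):
    """Yield (owner, repo) pairs from the repository search API.

    Stops early when a page comes back empty. `max_pages` is an
    illustrative cap, not a GitHub setting.
    """
    http = session or requests
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    for page in range(1, max_pages + 1):
        # requests URL-encodes the query, so it can be written with spaces.
        params = {"q": query, "per_page": 100, "page": page}
        resp = http.get(SEARCH_URL, headers=headers, params=params, timeout=10)
        resp.raise_for_status()

        items = resp.json().get("items", [])
        if not items:
            break
        for item in items:
            yield item["owner"]["login"], item["name"]
```

Each yielded pair can then be passed to whatever function calls the /license endpoint, for example by looping over search_repos("topic:machine-learning stars:>100").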
Respect GitHub API Limits
- Unauthenticated: 60 requests/hour
- Authenticated: 5000 requests/hour
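One way to stay within these limits is to read the X-RateLimit-Remaining and X-RateLimit-Reset headers that GitHub returns with each response and pause when the quota is used up. The helper below is a sketch under that assumption; the name seconds_until_reset and the `now` parameter are hypothetical.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next request.

    Returns 0 when requests remain or the rate-limit headers are absent.
    `headers` can be resp.headers from requests (case-insensitive) or a
    plain dict that uses the exact header names below.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0

    # X-RateLimit-Reset is a Unix timestamp in seconds.
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    return max(0.0, reset_at - now)
```

After each API call, time.sleep(seconds_until_reset(resp.headers)) would then block only when the quota is exhausted.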
Let me know if you want this adapted into a CLI tool or expanded to read from a CSV or repo list file.