The Palos Publishing Company


Scrape GitHub commit stats by language

Scraping GitHub commit statistics by programming language means gathering data on how many commits are made to repositories grouped by their primary language. Since GitHub doesn’t provide a single public API endpoint for aggregated commit stats broken down by language across all repositories, the process requires a strategic approach:


How to Scrape GitHub Commit Stats by Language

1. Understand GitHub Data Structure

  • Each GitHub repository has a primary language.

  • Each repository’s commit history is accessible via the GitHub API.

  • GitHub API v3 (REST) or v4 (GraphQL) can be used to fetch commits.

  • Rate limits apply: unauthenticated requests are limited to 60/hour, authenticated requests up to 5,000/hour.
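Because those limits are easy to hit, it helps to watch GitHub's standard X-RateLimit-* response headers and back off when they run out. A minimal sketch of that logic, using a plain dict as a stand-in for a real response's headers (the helper name and sample values are illustrative):

```python
import time

def seconds_until_reset(headers, now=None):
    """Given GitHub rate-limit response headers, return how many seconds
    to wait before the next request (0 if requests remain)."""
    remaining = int(headers.get('X-RateLimit-Remaining', 1))
    if remaining > 0:
        return 0
    # X-RateLimit-Reset is a Unix timestamp for when the window resets
    reset_epoch = int(headers.get('X-RateLimit-Reset', 0))
    now = time.time() if now is None else now
    return max(0, reset_epoch - now)

# Stand-in header values: quota exhausted, window resets in 60 seconds
print(seconds_until_reset({'X-RateLimit-Remaining': '0',
                           'X-RateLimit-Reset': '1700000060'}, now=1700000000))  # 60
```

In a real scraper you would pass response.headers from a requests call and sleep for the returned number of seconds before retrying.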

2. Strategy to Get Commit Stats by Language

Option A: For Specific Repositories or Users

  • Query a known list of repos with their languages.

  • Use GitHub API to fetch commit count for each repo.

  • Aggregate commits grouped by language.

Option B: For Broad Language Stats (Aggregate across many repos)

  • Use GitHub Search API to find repositories by language.

  • For each repository found, fetch commit count.

  • Sum commits per language.

Limitations: This approach requires iterating through many repositories, which is time-consuming and prone to API rate limiting.


Step-by-Step Guide

Step 1: Get repositories by language

Use GitHub Search API to list repositories by language:

```http
GET https://api.github.com/search/repositories?q=language:Python&sort=stars&order=desc&per_page=100&page=1
```

This returns repositories whose primary language is Python, sorted by star count in descending order.
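To keep query construction out of the request logic, the search URL can be built with the standard library; a small sketch (the function name is illustrative):

```python
from urllib.parse import urlencode

def search_repos_url(language, page=1, per_page=100):
    """Build the Search API URL for repositories whose primary
    language matches the given one, sorted by stars descending."""
    params = {'q': f'language:{language}', 'sort': 'stars',
              'order': 'desc', 'per_page': per_page, 'page': page}
    return 'https://api.github.com/search/repositories?' + urlencode(params)

print(search_repos_url('Python', page=2))
```

urlencode percent-encodes the colon in the q parameter, which the Search API accepts.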

Step 2: For each repository, get the commit count

Use the commits endpoint:

```http
GET https://api.github.com/repos/{owner}/{repo}/commits?per_page=1
```

To get the total commit count, you can parse the Link response header (with per_page=1, the page number of the rel="last" link equals the total number of commits), or use the GitHub repository statistics API:

```http
GET https://api.github.com/repos/{owner}/{repo}/stats/commit_activity
```

This returns weekly commit activity for the past year (number of commits per week). Summing these gives the commits in the last year.
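The Link-header approach from the commits endpoint can be turned into a total count with a small parser. A sketch, assuming the header follows GitHub's usual rel="last" form (the helper name and sample header are illustrative):

```python
import re

def total_commits_from_link(link_header):
    """With per_page=1, the page number of the rel="last" link equals
    the total number of commits. No Link header means the repository
    fits on one page, i.e. it has a single commit."""
    if not link_header:
        return 1
    match = re.search(r'[?&]page=(\d+)>; rel="last"', link_header)
    return int(match.group(1)) if match else 1

# Sample header shaped like GitHub's pagination links
sample = ('<https://api.github.com/repositories/1/commits?per_page=1&page=2>; rel="next", '
          '<https://api.github.com/repositories/1/commits?per_page=1&page=1337>; rel="last"')
print(total_commits_from_link(sample))  # 1337
```

In practice you would pass response.headers.get('Link') from the per_page=1 commits request.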

Alternatively, you can inspect the repository’s default branch (main in this example):

```http
GET https://api.github.com/repos/{owner}/{repo}/branches/main
```

The commit object includes the latest commit SHA, but no direct commit count.

Step 3: Aggregate commits per language

Sum commits from all repos of a particular language to get total commits by that language.
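The aggregation step above is a straightforward grouped sum; a minimal sketch (the function name and sample numbers are illustrative):

```python
from collections import defaultdict

def aggregate_by_language(repo_stats):
    """Sum per-repository commit counts into per-language totals.
    repo_stats: iterable of (language, commit_count) pairs."""
    totals = defaultdict(int)
    for language, commits in repo_stats:
        totals[language] += commits
    return dict(totals)

print(aggregate_by_language([('Python', 120), ('Go', 40), ('Python', 30)]))
# {'Python': 150, 'Go': 40}
```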


Example Python Script Outline Using GitHub API

```python
import requests
import time

TOKEN = 'your_github_token'
HEADERS = {'Authorization': f'token {TOKEN}'}

def get_repositories(language, page=1):
    """Search for repositories whose primary language matches, sorted by stars."""
    url = (f'https://api.github.com/search/repositories'
           f'?q=language:{language}&sort=stars&order=desc&per_page=100&page={page}')
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()

def get_commit_activity(owner, repo, retries=5):
    """Fetch weekly commit activity for the past year, retrying while
    GitHub is still generating the statistics (HTTP 202)."""
    url = f'https://api.github.com/repos/{owner}/{repo}/stats/commit_activity'
    for _ in range(retries):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 202:
            break
        # GitHub is calculating statistics; wait and retry
        time.sleep(3)
    if response.status_code == 202:
        return []  # stats still not ready after all retries
    response.raise_for_status()
    return response.json()

def main():
    languages = ['Python', 'JavaScript', 'Go']
    commit_counts = {}
    for language in languages:
        commit_counts[language] = 0
        page = 1
        while True:
            data = get_repositories(language, page)
            repos = data.get('items', [])
            if not repos:
                break
            for repo in repos:
                owner = repo['owner']['login']
                repo_name = repo['name']
                weeks = get_commit_activity(owner, repo_name)
                commit_counts[language] += sum(week['total'] for week in weeks)
            page += 1
            if page > 2:  # limit pages to avoid excessive requests
                break
    print(commit_counts)

if __name__ == '__main__':
    main()
```

Notes and Best Practices

  • Rate limiting: Use an OAuth token for higher limits, and monitor the X-RateLimit-* response headers to know how many requests remain.

  • Caching: Cache repo lists and commit stats to avoid repeated requests.

  • Pagination: The Search API returns at most 1,000 results per query, so you cannot enumerate every repository for a popular language through search alone.

  • Commit stats generation delay: The /stats/commit_activity endpoint may return 202 status if stats are being generated. Retry with delay.

  • Scaling: Scraping many repos may require batching and delay management.

  • Alternatives: Use GitHub Archive or BigQuery public datasets for large-scale GitHub data analysis without API limits.


