Scraping GitHub issue tracker data can be done using the GitHub API, which is the recommended and compliant way to access repository issues, comments, and metadata. Here’s a step-by-step guide on how to retrieve GitHub issue tracker data:
1. Set Up a GitHub Personal Access Token (Optional but Recommended)
To avoid strict rate limits, generate a personal access token:
- Go to GitHub Settings > Developer settings > Personal access tokens
- Generate a token with at least the `repo` and `read:org` scopes (public data only needs minimal permissions)
2. Use GitHub REST API to Fetch Issues
The GitHub REST API endpoint for listing a repository's issues is `GET /repos/{owner}/{repo}/issues`.
Example with curl:
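A minimal sketch: `OWNER` and `REPO` are placeholders, and `GITHUB_TOKEN` is assumed to hold the token from step 1 (drop the `Authorization` header for unauthenticated access):

```shell
# List open issues, 100 per page.
curl -s \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/issues?state=open&per_page=100"
```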
Replace OWNER and REPO with the GitHub repo owner and repository name.
3. Using Python (Recommended for Automation)
Install dependencies (the sample script uses the third-party `requests` library): `pip install requests`
Sample Python script:
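A sketch using `requests`; the function and parameter names here are my own. Pagination uses the API's standard `per_page`/`page` query parameters, and pull requests (which GitHub's issues endpoint also returns) are filtered out:

```python
import requests  # third-party: pip install requests

API = "https://api.github.com/repos/{owner}/{repo}/issues"

def issues_url(owner, repo):
    """Build the issues endpoint URL for a repository."""
    return API.format(owner=owner, repo=repo)

def fetch_issues(owner, repo, token=None, state="all"):
    """Fetch all issues, following page-based pagination.

    GitHub's issues endpoint also returns pull requests; they carry a
    "pull_request" key and are skipped here.
    """
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    issues, page = [], 1
    while True:
        resp = requests.get(issues_url(owner, repo), headers=headers,
                            params={"state": state, "per_page": 100, "page": page},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page means we've read everything
            break
        issues.extend(i for i in batch if "pull_request" not in i)
        page += 1
    return issues
```

Call it as, e.g., `fetch_issues("octocat", "hello-world", token=my_token)`; passing `state="open"` or `state="closed"` narrows the result.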
4. Fields You Can Extract
Each issue object contains:
- `title`
- `body`
- `user.login` (creator)
- `state` (open/closed)
- `labels`
- `created_at`
- `updated_at`
- `closed_at`
- `comments` (count)
To fetch comments:
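Comments live at a per-issue endpoint, `GET /repos/{owner}/{repo}/issues/{issue_number}/comments`. A sketch in the same style as the script above (helper names are my own):

```python
import requests  # third-party: pip install requests

def comments_url(owner, repo, issue_number):
    """Per-issue comments endpoint."""
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/issues/{issue_number}/comments")

def fetch_comments(owner, repo, issue_number, token=None):
    """Fetch every comment on one issue, following pagination."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    comments, page = [], 1
    while True:
        resp = requests.get(comments_url(owner, repo, issue_number),
                            headers=headers,
                            params={"per_page": 100, "page": page},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        comments.extend(batch)
        page += 1
    return comments
```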
5. Respect GitHub’s Rate Limits
- Without authentication: 60 requests/hour
- With a token: 5,000 requests/hour
Check your usage:
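The `/rate_limit` endpoint reports your remaining quota and reset time, and calling it does not count against the limit (`GITHUB_TOKEN` is assumed to hold your token):

```shell
# Inspect current rate-limit status for the authenticated user.
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/rate_limit
```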
6. Store or Analyze the Data
Once fetched, you can:
- Save to CSV or JSON
- Use pandas for analysis
- Index into a search engine like Elasticsearch
Example (saving as CSV):
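A sketch using the standard-library `csv` module; `issues` is assumed to be the list of issue dicts fetched earlier, and the chosen columns mirror the fields listed in step 4:

```python
import csv

# Columns to keep in the CSV output.
FIELDS = ["number", "title", "state", "user", "labels",
          "created_at", "updated_at", "closed_at", "comments"]

def issues_to_csv(issues, path):
    """Write one CSV row per issue, flattening nested fields."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for issue in issues:
            writer.writerow({
                "number": issue["number"],
                "title": issue["title"],
                "state": issue["state"],
                "user": issue["user"]["login"],          # flatten creator
                "labels": ";".join(l["name"] for l in issue["labels"]),
                "created_at": issue["created_at"],
                "updated_at": issue["updated_at"],
                "closed_at": issue["closed_at"] or "",   # None while open
                "comments": issue["comments"],
            })
```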
Would you like a version of this script customized for a specific repository or feature (e.g., filter by label or keyword)?