Scraping GitHub issue tracker data can be done using the GitHub API, which is the recommended and compliant way to access repository issues, comments, and metadata. Here’s a step-by-step guide on how to retrieve GitHub issue tracker data:
1. Set Up a GitHub Personal Access Token (Optional but Recommended)
To avoid strict rate limits, generate a personal access token:
- Go to GitHub Settings > Developer settings > Personal access tokens
- Generate a token with at least the `repo` and `read:org` scopes (public data only needs minimal permissions)
2. Use GitHub REST API to Fetch Issues
The GitHub REST API endpoint for listing a repository's issues is `GET /repos/{owner}/{repo}/issues`.
Example with curl:
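A minimal sketch: `OWNER` and `REPO` are placeholders, and `GITHUB_TOKEN` is assumed to hold the token from step 1 (drop the `Authorization` header for unauthenticated access):

```shell
# List open issues, 100 per page.
curl -s \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/issues?state=open&per_page=100"
```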
Replace OWNER and REPO with the GitHub repo owner and repository name.
3. Using Python (Recommended for Automation)
Install dependencies (the sample script uses the third-party `requests` library): `pip install requests`
Sample Python script:
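A sketch using `requests`; the function and parameter names here are my own. Pagination uses the API's standard `per_page`/`page` query parameters, and pull requests (which GitHub's issues endpoint also returns) are filtered out:

```python
import requests  # third-party: pip install requests

API = "https://api.github.com/repos/{owner}/{repo}/issues"

def issues_url(owner, repo):
    """Build the issues endpoint URL for a repository."""
    return API.format(owner=owner, repo=repo)

def fetch_issues(owner, repo, token=None, state="all"):
    """Fetch all issues, following page-based pagination.

    GitHub's issues endpoint also returns pull requests; they carry a
    "pull_request" key and are skipped here.
    """
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    issues, page = [], 1
    while True:
        resp = requests.get(issues_url(owner, repo), headers=headers,
                            params={"state": state, "per_page": 100, "page": page},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page means we've read everything
            break
        issues.extend(i for i in batch if "pull_request" not in i)
        page += 1
    return issues
```

Call it as, e.g., `fetch_issues("octocat", "hello-world", token=my_token)`; passing `state="open"` or `state="closed"` narrows the result.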
4. Fields You Can Extract
Each issue object contains:
- `title`
- `body`
- `user.login` (creator)
- `state` (open/closed)
- `labels`
- `created_at`
- `updated_at`
- `closed_at`
- `comments` (count)
To fetch comments:
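Comments live at a per-issue endpoint, `GET /repos/{owner}/{repo}/issues/{issue_number}/comments`. A sketch in the same style as the script above (helper names are my own):

```python
import requests  # third-party: pip install requests

def comments_url(owner, repo, issue_number):
    """Per-issue comments endpoint."""
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/issues/{issue_number}/comments")

def fetch_comments(owner, repo, issue_number, token=None):
    """Fetch every comment on one issue, following pagination."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    comments, page = [], 1
    while True:
        resp = requests.get(comments_url(owner, repo, issue_number),
                            headers=headers,
                            params={"per_page": 100, "page": page},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        comments.extend(batch)
        page += 1
    return comments
```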
5. Respect GitHub’s Rate Limits
- Without authentication: 60 requests/hour
- With a token: 5,000 requests/hour
Check your usage:
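The `/rate_limit` endpoint reports your remaining quota and reset time, and calling it does not count against the limit (`GITHUB_TOKEN` is assumed to hold your token):

```shell
# Inspect current rate-limit status for the authenticated user.
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/rate_limit
```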
6. Store or Analyze the Data
Once fetched, you can:
- Save to CSV or JSON
- Use pandas for analysis
- Index into a search engine like Elasticsearch
Example (saving as CSV):
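A sketch using the standard-library `csv` module; `issues` is assumed to be the list of issue dicts fetched earlier, and the chosen columns mirror the fields listed in step 4:

```python
import csv

# Columns to keep in the CSV output.
FIELDS = ["number", "title", "state", "user", "labels",
          "created_at", "updated_at", "closed_at", "comments"]

def issues_to_csv(issues, path):
    """Write one CSV row per issue, flattening nested fields."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for issue in issues:
            writer.writerow({
                "number": issue["number"],
                "title": issue["title"],
                "state": issue["state"],
                "user": issue["user"]["login"],          # flatten creator
                "labels": ";".join(l["name"] for l in issue["labels"]),
                "created_at": issue["created_at"],
                "updated_at": issue["updated_at"],
                "closed_at": issue["closed_at"] or "",   # None while open
                "comments": issue["comments"],
            })
```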
Would you like a version of this script customized for a specific repository or feature (e.g., filter by label or keyword)?