The Palos Publishing Company


Scrape GitHub issue tracker data

Scraping GitHub issue tracker data can be done using the GitHub API, which is the recommended and compliant way to access repository issues, comments, and metadata. Here’s a step-by-step guide on how to retrieve GitHub issue tracker data:


1. Set Up a GitHub Personal Access Token (Optional but Recommended)

To avoid strict rate limits, generate a personal access token:

  • Go to GitHub Settings > Developer settings > Personal access tokens

  • Generate a token; reading issues in public repositories needs no scopes at all, while private repositories require the repo scope
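If you export the token as an environment variable (assumed here to be GITHUB_TOKEN, the same name the official gh CLI reads), a small helper can build the request headers without hard-coding the secret into your script — a minimal sketch:

```python
import os

def build_headers(token):
    """Build GitHub API request headers; add auth only when a token is set."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

# Read the token from the environment so it never lands in source control.
headers = build_headers(os.environ.get("GITHUB_TOKEN"))
```

Unauthenticated requests still work against public repositories; they just hit the lower rate limit described below.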


2. Use GitHub REST API to Fetch Issues

GitHub REST API endpoint for issues:

```bash
GET https://api.github.com/repos/{owner}/{repo}/issues
```

Example with curl:

```bash
curl -H "Authorization: token YOUR_TOKEN" \
     -H "Accept: application/vnd.github.v3+json" \
     "https://api.github.com/repos/OWNER/REPO/issues?state=all&per_page=100"
```

Replace OWNER and REPO with the GitHub repo owner and repository name.


3. Using Python (Recommended for Automation)

Install dependencies:

```bash
pip install requests
```

Sample Python script:

```python
import requests

TOKEN = 'your_github_token'  # Optional, but helps avoid rate limiting
OWNER = 'octocat'
REPO = 'Hello-World'

headers = {
    'Authorization': f'token {TOKEN}',
    'Accept': 'application/vnd.github.v3+json'
}
params = {
    'state': 'all',
    'per_page': 100,
    'page': 1
}

all_issues = []
while True:
    url = f'https://api.github.com/repos/{OWNER}/{REPO}/issues'
    response = requests.get(url, headers=headers, params=params)
    data = response.json()
    if not data:
        break
    all_issues.extend(data)
    params['page'] += 1

# Print issue titles for verification
for issue in all_issues:
    print(issue['title'])
```
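One caveat worth knowing: the /issues endpoint also returns pull requests, which the API represents as issue objects carrying an extra pull_request key. A small filter (demonstrated below on made-up sample data) keeps only true issues:

```python
def filter_out_pull_requests(items):
    """The /issues endpoint also lists pull requests; a PR object carries
    a 'pull_request' key, so drop anything that has one."""
    return [item for item in items if "pull_request" not in item]

# Illustrative sample data mimicking the API's shape:
sample = [
    {"title": "A real issue", "state": "open"},
    {"title": "A pull request", "state": "open", "pull_request": {"url": "..."}},
]
issues_only = filter_out_pull_requests(sample)
```

Apply this to all_issues after the fetch loop if you only want genuine issues in your results.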

4. Fields You Can Extract

Each issue object contains:

  • title

  • body

  • user.login (creator)

  • state (open/closed)

  • labels

  • created_at

  • updated_at

  • closed_at

  • comments (count)
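As a sketch of pulling these fields out of the raw issue objects, a small helper can flatten each one into a plain dictionary (the sample_issue below is made-up data mimicking the API's response shape):

```python
def extract_fields(issue):
    """Flatten the fields listed above from a raw GitHub issue object."""
    return {
        "title": issue["title"],
        "body": issue["body"],
        "state": issue["state"],
        "user": issue["user"]["login"],
        "labels": [label["name"] for label in issue.get("labels", [])],
        "created_at": issue["created_at"],
        "comments": issue["comments"],
    }

# Hypothetical sample in the shape the API returns:
sample_issue = {
    "title": "Bug: crash on startup",
    "body": "Steps to reproduce...",
    "state": "open",
    "user": {"login": "octocat"},
    "labels": [{"name": "bug"}, {"name": "priority-high"}],
    "created_at": "2024-01-15T10:00:00Z",
    "comments": 3,
}
row = extract_fields(sample_issue)
```

Mapping each issue through a helper like this gives you uniform rows that are easy to write to CSV or load into pandas later.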

To fetch comments:

```bash
GET /repos/{owner}/{repo}/issues/{issue_number}/comments
```
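Comments are paginated the same way as issues, so the fetch loop looks nearly identical. A sketch, assuming the requests library and the same headers as the issues script:

```python
import requests

def comments_url(owner, repo, issue_number):
    """Build the REST endpoint for a single issue's comments."""
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/issues/{issue_number}/comments")

def fetch_comments(owner, repo, issue_number, headers=None):
    """Fetch all comments on one issue, 100 per page until an empty page."""
    comments, page = [], 1
    while True:
        resp = requests.get(comments_url(owner, repo, issue_number),
                            headers=headers,
                            params={"per_page": 100, "page": page})
        resp.raise_for_status()
        data = resp.json()
        if not data:
            break
        comments.extend(data)
        page += 1
    return comments
```

Each comment object includes body, user.login, and created_at, so the same field-extraction approach applies.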

5. Respect GitHub’s Rate Limits

  • Without authentication: 60 requests/hour

  • With a token: 5,000 requests/hour

Check your usage:

```bash
GET https://api.github.com/rate_limit
```
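The response is JSON with a resources.core object holding limit, remaining, and reset fields. A small parser, shown here against an abbreviated sample payload rather than a live call (uncomment the last lines to query the real endpoint):

```python
def remaining_requests(rate_limit_payload):
    """Return (remaining, limit) for the core REST API
    from a /rate_limit response body."""
    core = rate_limit_payload["resources"]["core"]
    return core["remaining"], core["limit"]

# Abbreviated sample of the /rate_limit response shape:
sample = {"resources": {"core": {"limit": 5000,
                                 "remaining": 4990,
                                 "reset": 1700000000}}}
remaining, limit = remaining_requests(sample)

# Live check (requires the requests library):
# import requests
# resp = requests.get("https://api.github.com/rate_limit")
# print(remaining_requests(resp.json()))
```

Checking remaining before a long scrape lets you pause or slow down instead of getting hard 403 responses mid-run.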

6. Store or Analyze the Data

Once fetched, you can:

  • Save to CSV or JSON

  • Use pandas for analysis

  • Index into a search engine like Elasticsearch

Example (saving as CSV):

```python
import csv

with open('github_issues.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'State', 'Created At', 'User'])
    for issue in all_issues:
        writer.writerow([
            issue['title'],
            issue['state'],
            issue['created_at'],
            issue['user']['login'],
        ])
```
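Saving to JSON is even simpler than CSV, since the API already returns JSON-shaped objects. A minimal sketch using only the standard library (with a one-item sample_issues stand-in for the real fetched list):

```python
import json

def save_issues_json(issues, path):
    """Dump the raw issue objects to a JSON file, preserving all fields."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(issues, f, ensure_ascii=False, indent=2)

# Hypothetical stand-in for the all_issues list fetched earlier:
sample_issues = [{"title": "Example", "state": "open"}]
save_issues_json(sample_issues, "github_issues.json")
```

Unlike the CSV export, this keeps nested structures (labels, user objects) intact for later analysis.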

