Scraping comments from GitHub issues can be done efficiently using the GitHub API. Here’s a detailed guide on how to do it programmatically, along with sample Python code.
How to Scrape Comments from GitHub Issues
1. Understand GitHub API Endpoints
GitHub provides REST API endpoints for interacting with issues and their comments. The two main endpoints you need are:

- List issues for a repository: `GET /repos/{owner}/{repo}/issues`
- List comments on an issue: `GET /repos/{owner}/{repo}/issues/{issue_number}/comments`
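As an illustration, these paths can be assembled into full URLs against the `api.github.com` host (the helper names here are my own, not part of the API):

```python
API_BASE = "https://api.github.com"

def issues_url(owner: str, repo: str) -> str:
    # Endpoint that lists issues for a repository.
    return f"{API_BASE}/repos/{owner}/{repo}/issues"

def comments_url(owner: str, repo: str, issue_number: int) -> str:
    # Endpoint that lists comments on a single issue.
    return f"{API_BASE}/repos/{owner}/{repo}/issues/{issue_number}/comments"
```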
2. Authentication
To avoid API rate limits and access private repositories if needed, use a GitHub Personal Access Token (PAT).
3. Steps to Scrape Comments
- Get all issues for the repository (optionally filtered by state, such as `open` or `closed`).
- For each issue, fetch all of its comments.
- Store or process the comments as needed.
4. Example Python Script Using requests
Notes:
- Rate limits: Without authentication, you get 60 requests per hour. With a PAT, the limit increases to 5,000 requests per hour.
- Pagination: The GitHub API returns up to 100 items per page, so handling pagination is essential.
- Issue types: Pull requests also appear as issues. You can filter them out by checking whether the `pull_request` key exists in the issue object.
- Data storage: You can save the comments to a file or database as your needs require.
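For the storage step, one simple option is dumping the collected mapping to JSON (the function name and file layout here are just one possible choice):

```python
import json

def save_comments(comments: dict, path: str) -> None:
    # Persist the {issue_number: [comment bodies]} mapping as pretty-printed JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(comments, f, ensure_ascii=False, indent=2)
```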
This method is reliable, API-friendly, and keeps you within GitHub’s usage policies. If you want to scrape comments without API access, you’d have to resort to web scraping (HTML parsing), which is less stable and not recommended.