Scrape comments from GitHub issues

Scraping comments from GitHub issues can be done efficiently using the GitHub API. Here’s a detailed guide on how to do it programmatically, along with sample Python code.


How to Scrape Comments from GitHub Issues

1. Understand GitHub API Endpoints

GitHub provides REST API endpoints to interact with issues and their comments. The two main endpoints you need are:

  • List issues for a repository:
    GET /repos/{owner}/{repo}/issues

  • List comments on an issue:
    GET /repos/{owner}/{repo}/issues/{issue_number}/comments
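
To get a quick feel for what these endpoints return, here is a minimal sketch that fetches the first page of comments for a single issue. The repository (octocat/Hello-World) and issue number are placeholders:

python
import requests

# Placeholder repository and issue number; substitute your own
url = 'https://api.github.com/repos/octocat/Hello-World/issues/1/comments'

# Unauthenticated request: fine for a quick test, but limited to 60 requests/hour
response = requests.get(url, headers={'Accept': 'application/vnd.github.v3+json'})
response.raise_for_status()

# Each comment is a JSON object with the author, timestamps, and body
for comment in response.json():
    print(comment['user']['login'], '-', comment['body'][:80])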


2. Authentication

To raise the API rate limit (and to access private repositories if needed), authenticate with a GitHub Personal Access Token (PAT).
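
A quick way to confirm that your token is being accepted is to call the /rate_limit endpoint, which reports your remaining quota (and does not itself count against it). A minimal sketch, assuming the token value is replaced with your own PAT:

python
import requests

token = 'your_personal_access_token_here'  # placeholder PAT
headers = {
    'Authorization': f'token {token}',
    'Accept': 'application/vnd.github.v3+json'
}

# GET /rate_limit returns your current quota; 5,000/hour when authenticated
response = requests.get('https://api.github.com/rate_limit', headers=headers)
response.raise_for_status()

core = response.json()['resources']['core']
print(f"Remaining requests: {core['remaining']} of {core['limit']}")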


3. Steps to Scrape Comments

  1. Get all issues for a repository (optionally filtered by state like open or closed).

  2. For each issue, fetch all comments.

  3. Store or process the comments as needed.


4. Example Python Script Using requests

python
import requests

# GitHub repository details
owner = 'octocat'       # Replace with the repository owner
repo = 'Hello-World'    # Replace with the repository name

# GitHub personal access token (for authentication)
token = 'your_personal_access_token_here'

headers = {
    'Authorization': f'token {token}',
    'Accept': 'application/vnd.github.v3+json'
}

# Function to get all issues (including pagination)
def get_issues(owner, repo):
    issues = []
    page = 1
    while True:
        url = f'https://api.github.com/repos/{owner}/{repo}/issues?state=all&per_page=100&page={page}'
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        data = response.json()
        if not data:
            break
        issues.extend(data)
        page += 1
    return issues

# Function to get comments for a single issue
def get_comments(owner, repo, issue_number):
    comments = []
    page = 1
    while True:
        url = f'https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}/comments?per_page=100&page={page}'
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        data = response.json()
        if not data:
            break
        comments.extend(data)
        page += 1
    return comments

def main():
    issues = get_issues(owner, repo)
    all_comments = []
    for issue in issues:
        issue_number = issue['number']
        comments = get_comments(owner, repo, issue_number)
        for comment in comments:
            comment_data = {
                'issue_number': issue_number,
                'comment_id': comment['id'],
                'user': comment['user']['login'],
                'created_at': comment['created_at'],
                'body': comment['body']
            }
            all_comments.append(comment_data)

    # Example: print all comments
    for comment in all_comments:
        print(f"Issue #{comment['issue_number']} - Comment by {comment['user']} at {comment['created_at']}:")
        print(comment['body'])
        print('-' * 80)

if __name__ == '__main__':
    main()

Notes:

  • Rate Limits: Without authentication, you get 60 requests per hour. With a PAT, it increases to 5,000 requests per hour.

  • Pagination: The GitHub API returns up to 100 items per page, so handling pagination is essential.

  • Issue Types: Pull requests also appear as issues. You can filter them out by checking whether the pull_request key exists in the issue object (see the filtering sketch after this list).

  • Data Storage: You can save the comments to a file or database as needed (a CSV example follows after this list).
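
As mentioned in the Issue Types note, pull requests can be skipped with a small guard. A sketch of a hypothetical helper that works with the get_issues function from the example script:

python
def filter_out_pull_requests(issues):
    # Pull requests also come back from the issues endpoint,
    # but they carry a 'pull_request' key that plain issues lack
    return [issue for issue in issues if 'pull_request' not in issue]

# Usage with the example script:
# issues = filter_out_pull_requests(get_issues(owner, repo))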
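
For the Data Storage note, one simple option is writing the collected comments to a CSV file. A sketch of a hypothetical helper, assuming the all_comments list of dictionaries built in main() above:

python
import csv

def save_comments_csv(all_comments, path='comments.csv'):
    # Writes the comment dictionaries built in main() to a CSV file
    fieldnames = ['issue_number', 'comment_id', 'user', 'created_at', 'body']
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_comments)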


This method is reliable, API-friendly, and keeps you within GitHub’s usage policies. If you want to scrape comments without API access, you’d have to resort to web scraping (HTML parsing), which is less stable and not recommended.
