The Palos Publishing Company

Scrape public GitHub issues by topic

To scrape public GitHub issues by topic, you can use the GitHub REST API. Below is a Python script using requests to fetch issues related to a specific topic from public repositories.

⚙️ Prerequisites:

  • Python 3.x

  • requests library (pip install requests)

  • A GitHub Personal Access Token (optional but recommended to avoid rate limiting)

✅ Python Script to Scrape GitHub Issues by Topic

```python
import requests
import time

# Optional: Set your GitHub token here to increase rate limits
GITHUB_TOKEN = 'your_github_token_here'  # Replace with your token or set to None
HEADERS = {'Authorization': f'token {GITHUB_TOKEN}'} if GITHUB_TOKEN else {}

# Define the topic (GitHub "topics" relate to repositories, not issues)
TOPIC = "machine-learning"  # You can change this to any topic
MAX_REPOS = 10              # Number of repositories to fetch
MAX_ISSUES_PER_REPO = 10    # Number of issues to fetch per repository

def search_repositories_by_topic(topic, max_repos):
    url = (
        "https://api.github.com/search/repositories"
        f"?q=topic:{topic}&sort=stars&order=desc&per_page={max_repos}"
    )
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json().get("items", [])

def get_issues(repo_full_name, max_issues):
    issues_url = (
        f"https://api.github.com/repos/{repo_full_name}/issues"
        f"?state=open&per_page={max_issues}"
    )
    response = requests.get(issues_url, headers=HEADERS)
    response.raise_for_status()
    return response.json()

def scrape_github_issues(topic):
    repos = search_repositories_by_topic(topic, MAX_REPOS)
    all_issues = []
    for repo in repos:
        repo_name = repo['full_name']
        print(f"Fetching issues for: {repo_name}")
        try:
            issues = get_issues(repo_name, MAX_ISSUES_PER_REPO)
            for issue in issues:
                if 'pull_request' not in issue:  # Exclude pull requests
                    all_issues.append({
                        "repository": repo_name,
                        "issue_title": issue["title"],
                        "issue_url": issue["html_url"],
                        "created_at": issue["created_at"],
                        "user": issue["user"]["login"]
                    })
            time.sleep(1)  # To avoid hitting rate limits
        except Exception as e:
            print(f"Failed to fetch issues for {repo_name}: {e}")
    return all_issues

# Run the scraper
if __name__ == "__main__":
    issues = scrape_github_issues(TOPIC)
    for i, issue in enumerate(issues, start=1):
        print(f"{i}. [{issue['issue_title']}]({issue['issue_url']}) - "
              f"{issue['repository']} (by {issue['user']})")
```

🔍 Notes:

  • GitHub’s search API only supports repository-level topics. Issues themselves don’t have topics.

  • The script fetches repositories tagged with the specified topic, then scrapes open issues from each.

  • You can customize MAX_REPOS and MAX_ISSUES_PER_REPO as needed.
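Because issues don't carry topics, GitHub's issue-search endpoint (`/search/issues`) is worth knowing as an alternative: it matches free-text keywords against issue titles and bodies across all public repositories. A minimal sketch of that approach (the `build_issue_search_url` helper name and its parameters are illustrative, not part of the script above):

```python
import requests

def build_issue_search_url(keyword, per_page=10):
    """Build a GitHub issue-search URL for a free-text keyword.

    Unlike repositories, issues cannot be filtered by topic,
    so we search issue text instead."""
    return (
        "https://api.github.com/search/issues"
        f"?q={keyword}+type:issue+state:open&per_page={per_page}"
    )

def search_issues_by_keyword(keyword, headers=None, per_page=10):
    """Return open issues whose title or body matches the keyword."""
    url = build_issue_search_url(keyword, per_page)
    response = requests.get(url, headers=headers or {})
    response.raise_for_status()
    return response.json().get("items", [])
```

Note that the search API has a stricter rate limit than the regular REST endpoints, so an authenticated token matters even more here.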

Let me know if you want this data exported to CSV or filtered by keywords within the issues.
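For CSV export, here is a minimal sketch using Python's standard `csv` module. It assumes the list of dicts produced by `scrape_github_issues` above; the `export_issues_to_csv` name and the default output path are illustrative:

```python
import csv

def export_issues_to_csv(issues, path="github_issues.csv"):
    """Write scraped issue dicts to a CSV file.

    Assumes each dict has the keys built by scrape_github_issues:
    repository, issue_title, issue_url, created_at, user."""
    fieldnames = ["repository", "issue_title", "issue_url", "created_at", "user"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()   # Header row with the field names
        writer.writerows(issues)
```

Call it right after scraping, e.g. `export_issues_to_csv(issues)`, and the file opens cleanly in any spreadsheet tool.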
