The Palos Publishing Company

Scrape news articles on a specific topic

This guide shows three ways to scrape news articles on a specific topic with Python: reading the Google News RSS feed with feedparser, extracting full article text with newspaper3k, and scraping a known site directly with requests and BeautifulSoup.

Option 1: Scrape Google News RSS Feed

```python
from urllib.parse import quote_plus

import feedparser


def scrape_google_news(topic, max_results=10):
    # URL-encode the topic so spaces and characters like & or # survive in the query string
    query = quote_plus(topic)
    rss_url = f'https://news.google.com/rss/search?q={query}&hl=en-US&gl=US&ceid=US:en'
    feed = feedparser.parse(rss_url)

    articles = []
    for entry in feed.entries[:max_results]:
        # Not every entry carries every field, so fall back to an empty string
        articles.append({
            'title': entry.get('title', ''),
            'link': entry.get('link', ''),
            'published': entry.get('published', ''),
            'summary': entry.get('summary', '')
        })
    return articles


# Example usage
news = scrape_google_news("Artificial Intelligence")
for article in news:
    print(f"{article['title']} - {article['link']}")
```
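The notes at the end of this guide mention storing results from scheduled runs. A minimal sketch of one way to do that, assuming the article dictionaries have the same keys as those returned by `scrape_google_news` above: the helper `save_articles` and the `articles` table layout are hypothetical names chosen for illustration, and SQLite is only one possible store. The link is used as the primary key so that an article seen on a previous run is skipped rather than inserted twice.

```python
import sqlite3

# Assumed keys, matching the dictionaries built by scrape_google_news
FIELDS = ("link", "title", "published", "summary")


def save_articles(conn, articles):
    """Store article dicts in SQLite and return how many rows were newly inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT, summary TEXT)"
    )
    # Fill in missing keys so the named placeholders below always resolve
    rows = [{key: article.get(key, "") for key in FIELDS} for article in articles]
    before = conn.total_changes
    # INSERT OR IGNORE skips rows whose link is already in the table
    conn.executemany(
        "INSERT OR IGNORE INTO articles (link, title, published, summary) "
        "VALUES (:link, :title, :published, :summary)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    # In-memory database for demonstration; pass a file path to keep data between runs
    connection = sqlite3.connect(":memory:")
    sample = [{"title": "Example headline", "link": "https://example.com/a"}]
    print(save_articles(connection, sample), "new article(s) stored")
    connection.close()
```

In a real run, the list returned by `scrape_google_news` would be passed in place of `sample`.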

Option 2: Use `newspaper3k` to Extract Full Articles

```python
# Install with `pip install newspaper3k`; the import name is `newspaper`
from newspaper import Article


def extract_full_article(url):
    article = Article(url)
    article.download()  # fetch the page HTML
    article.parse()     # extract the title and body text from the HTML
    return {
        'title': article.title,
        'text': article.text
    }


# Example: extract full content of one URL (placeholder address)
url = 'https://example.com/article-about-ai'
content = extract_full_article(url)
print(content['title'])
print(content['text'][:500])  # Print first 500 characters
```
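When several URLs need to be processed, a single failed download should not stop the whole batch. The sketch below is one possible wrapper, not part of newspaper3k: `extract_many` and `delay_seconds` are hypothetical names, and the extractor is passed in as a parameter so that `extract_full_article` from above (or any function that takes a URL and returns a dictionary) can be used. It pauses between requests and records failures separately from successes.

```python
import time


def extract_many(urls, extractor, delay_seconds=1.0):
    """Call extractor(url) for each URL; return (results, failures)."""
    results, failures = [], []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to avoid hammering the site
        try:
            # Keep the source URL next to whatever fields the extractor returns
            results.append({"url": url, **extractor(url)})
        except Exception as exc:  # broad on purpose: one bad page should not end the batch
            failures.append((url, str(exc)))
    return results, failures


# Usage sketch (performs network requests, so it is left commented out):
# results, failures = extract_many(
#     ["https://example.com/article-about-ai"], extract_full_article
# )
```

Note that links taken from the Google News RSS feed usually point to a news.google.com redirect rather than the publisher's page, so extraction on those links may fail and show up in `failures`.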

Option 3: Scrape Specific Website with `requests` + `BeautifulSoup`

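```python
import requests
from bs4 import BeautifulSoup

# Identify the client; some sites reject the default python-requests User-Agent
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; news-scraper-example)'}


def scrape_website(url, article_selector):
    # A timeout keeps the script from hanging on an unresponsive server
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')
    articles = soup.select(article_selector)
    return [article.get_text(strip=True) for article in articles]


# Example usage for scraping headlines.
# The "h3" selector depends on the site's current markup; inspect the page and adjust it.
headlines = scrape_website("https://www.bbc.com/news", "h3")
for headline in headlines[:10]:
    print(headline)
```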

Notes:

Set a User-Agent header if your requests are blocked or throttled.
Check each site’s robots.txt before scraping it, and keep your request rate low.
Use selenium if content is loaded dynamically with JavaScript.
You can automate daily scraping and storage with cron jobs or apscheduler.
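The robots.txt note can be checked in code with the standard library's `urllib.robotparser`. The following is a minimal sketch: the helper names `build_parser` and `is_allowed` and the sample rules are made up for illustration. It parses robots.txt content supplied as a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Create a parser from robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up rules for demonstration only
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    rules = build_parser(sample_rules)
    print(is_allowed(rules, "https://example.com/news/story"))    # allowed by these rules
    print(is_allowed(rules, "https://example.com/private/page"))  # disallowed by these rules
```

To check a live site instead, create a `RobotFileParser`, call `set_url()` with the site's robots.txt address, then `read()` before calling `can_fetch()`.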

