Scrape trending keywords from tech blogs

To scrape trending keywords from tech blogs, combine a web scraping tool with basic natural language processing (NLP). The process below uses Python with requests and BeautifulSoup to fetch and parse pages, and nltk (or, optionally, spaCy) to filter the words. Scraping must comply with each website’s terms of service, so check the site’s robots.txt before fetching any pages.
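The robots.txt check can be automated with Python's standard library. The following is a minimal sketch rather than a complete compliance check: the helper names robots_url_for, is_allowed, and load_robots are illustrative, and the sample rules are hypothetical. It does not cover a site's terms of service, which still need to be read separately.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_lines, page_url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)


def load_robots(page_url):
    """Fetch and parse the live robots.txt for page_url's site (network call)."""
    parser = RobotFileParser()
    parser.set_url(robots_url_for(page_url))
    parser.read()
    return parser


# Hypothetical robots.txt content, so the rule check can be tried offline
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]
```

Against a live site, something like load_robots(url).can_fetch("*", url) could be called before each request, skipping any URL for which it returns False.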

Step-by-Step Python Script to Scrape Trending Keywords from Tech Blogs

1. Install Required Libraries

bash
pip install requests beautifulsoup4 nltk

Optional for advanced keyword extraction:

bash
pip install spacy
python -m spacy download en_core_web_sm

2. Basic Script to Scrape Articles and Extract Keywords

python
import requests
from bs4 import BeautifulSoup
from collections import Counter
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# List of tech blogs to scrape
urls = [
    'https://techcrunch.com/',
    'https://thenextweb.com/',
    'https://www.theverge.com/tech',
    'https://www.wired.com/category/tech/',
    'https://gizmodo.com/c/tech'
]

headers = {'User-Agent': 'Mozilla/5.0'}

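def clean_text(text):
    # Drop everything except letters and whitespace; whitespace must be
    # kept (\s) so the text can still be split into words afterwards
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.lower()
    return text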

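def get_article_text(url):
    try:
        # A timeout keeps one unresponsive site from hanging the whole run
        res = requests.get(url, headers=headers, timeout=10)
        res.raise_for_status()  # treat 4xx/5xx responses as failures
        soup = BeautifulSoup(res.text, 'html.parser')
        # Collect the text of every <p> element on the page
        paragraphs = soup.find_all('p')
        content = ' '.join(p.get_text() for p in paragraphs)
        return clean_text(content)
    except requests.RequestException as e:
        print(f"Failed to scrape {url}: {e}")
        return ""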

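def extract_keywords(text, stop_words):
    words = text.split()
    # Drop stop words and single letters; a higher length cutoff would
    # also drop short terms such as "ai"
    words = [word for word in words if word not in stop_words and len(word) > 1]
    return Counter(words).most_common(50)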

def scrape_all():
    full_text = ''
    for url in urls:
        print(f"Scraping {url}")
        text = get_article_text(url)
        full_text += ' ' + text

    stop_words = set(stopwords.words('english'))
    keywords = extract_keywords(full_text, stop_words)
    return keywords

# Run the scraper
trending_keywords = scrape_all()
for word, freq in trending_keywords:
    print(f"{word}: {freq}")

Output Example (Sample Keywords)

text
ai: 23
startup: 19
nvidia: 18
apple: 16
chip: 14
cloud: 13
openai: 12
blockchain: 11
quantum: 10
privacy: 9

Tips for Improvement

Use RSS Feeds: Most tech blogs offer RSS feeds, which are easier and faster to parse.
NER with spaCy: Extract proper nouns like company names or products.
Time Filtering: Scrape only articles from the last 7 days to ensure trending relevance.
Deduplication: Filter out repeated content from syndication.
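The RSS, time filtering, and deduplication tips can be combined in one pass over a feed. The sketch below uses only the standard library and assumes an RSS 2.0 feed whose items carry title, link, and pubDate elements; Atom feeds use different element names and would need separate handling. The function name recent_items and the sample feed are hypothetical, and treating two items with the same normalized title as duplicates is only one simple heuristic.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def recent_items(rss_xml, now=None, max_age_days=7):
    """Return (title, link) pairs for unique items newer than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    root = ET.fromstring(rss_xml)
    seen_titles = set()
    results = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        pub_date = item.findtext("pubDate")
        if not title or not pub_date:
            continue
        try:
            published = parsedate_to_datetime(pub_date)
        except (TypeError, ValueError):
            continue  # skip items whose date cannot be parsed
        if published.tzinfo is None:
            # Assume UTC when the feed gives no usable offset
            published = published.replace(tzinfo=timezone.utc)
        if published < cutoff:
            continue  # older than the trending window
        # Normalize case and spacing so syndicated copies collapse together
        key = " ".join(title.lower().split())
        if key in seen_titles:
            continue
        seen_titles.add(key)
        results.append((title, link))
    return results


# Hypothetical feed content, so the function can be tried offline
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Chip startup raises funding</title>
<link>https://example.com/a</link>
<pubDate>Mon, 06 Jan 2025 10:00:00 +0000</pubDate></item>
<item><title>Chip Startup  Raises Funding</title>
<link>https://example.org/a-syndicated</link>
<pubDate>Tue, 07 Jan 2025 09:00:00 +0000</pubDate></item>
<item><title>Old quantum story</title>
<link>https://example.com/b</link>
<pubDate>Sun, 01 Dec 2024 08:00:00 +0000</pubDate></item>
</channel></rss>"""
```

With a real feed, the body of a requests response (res.content) could be passed as rss_xml, and the returned titles fed into extract_keywords in place of full page text.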

Possible next steps include a version that uses spaCy for named entity recognition, or one that saves the results to a CSV file or database.
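The spaCy model installed in step 1 is not used by the basic script. A minimal sketch of entity-based extraction might look like the following, assuming en_core_web_sm is installed. The function names extract_entities and count_entities are illustrative, and keeping only the ORG and PRODUCT labels is an assumption about which entities count as keywords. The text passed in should be the raw page text rather than the output of clean_text, since lowercasing and stripping punctuation tends to hurt entity recognition.

```python
from collections import Counter

# Entity labels to keep (assumption): organizations and products
KEEP_LABELS = {"ORG", "PRODUCT"}


def extract_entities(raw_text, model="en_core_web_sm"):
    """Run spaCy NER over un-lowercased text; return (text, label) pairs."""
    import spacy  # imported lazily so count_entities works without spaCy

    nlp = spacy.load(model)
    doc = nlp(raw_text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def count_entities(entities, keep_labels=KEEP_LABELS, top_n=20):
    """Count entity names whose label is in keep_labels, most frequent first."""
    names = [text.strip() for text, label in entities if label in keep_labels]
    return Counter(names).most_common(top_n)
```

A call such as count_entities(extract_entities(page_text)) would then give a ranked list of company and product names, where page_text is a hypothetical variable holding the uncleaned text of one or more pages.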
