The Palos Publishing Company


Scrape patent databases for keywords

Scraping patent databases for keywords involves gathering and analyzing data from publicly accessible patent repositories. The method, and its legality, vary by database. Here is a guide to the main databases, the tools you can use, and the ethical and legal considerations:


1. Popular Patent Databases to Target

  • Google Patents (https://patents.google.com)
    Offers a simple interface and full-text search across global patent databases. Scraping is technically possible but requires careful request management to avoid being blocked.

  • USPTO (United States Patent and Trademark Office) (https://www.uspto.gov)
    Provides official APIs (e.g., the Patent Examination Data System) and bulk data downloads, which are generally preferable to scraping the web interface.

  • Espacenet (European Patent Office) (https://worldwide.espacenet.com/)
    Offers over 120 million patent documents. Scraping is more difficult due to aggressive bot protection.

  • WIPO PATENTSCOPE (https://patentscope.wipo.int/)
    Includes international (PCT) patent applications. Offers a search API after registration.


2. Ethical and Legal Guidelines

  • Check the robots.txt file for each site.

  • Prefer official APIs or bulk datasets over raw HTML scraping.

  • Do not overwhelm servers with excessive requests (respect rate limits).

  • Use scraped data only for permitted purposes (research, indexing, etc.).
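The guidelines above can be sketched in code. A minimal example, using the standard library's `robotparser` against an illustrative robots.txt policy (not any real site's) and a simple minimum-delay rate limiter:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS.splitlines())

def is_allowed(url, agent='*'):
    """Check a URL against the parsed robots.txt rules."""
    return parser.can_fetch(agent, url)

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

print(is_allowed('https://example.com/patent/US123/en'))  # allowed path
print(is_allowed('https://example.com/private/dump'))     # disallowed path
```

In practice you would fetch each site's real robots.txt with `RobotFileParser.set_url(...)` and `read()`, and call `RateLimiter.wait()` before every request.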


3. Tools and Libraries

a. Python Libraries

  • requests, httpx – for making HTTP requests.

  • BeautifulSoup, lxml – for parsing HTML.

  • Selenium, Playwright – for sites requiring JavaScript rendering.

  • pandas – for organizing data.

  • spaCy, nltk, or scikit-learn – for keyword extraction and text analysis.

b. APIs for Legal Access

  • USPTO Patent Examination Data System (PEDS)

  • Google Cloud BigQuery – Google Patents Dataset

  • WIPO PATENTSCOPE API
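As a sketch of the BigQuery route: the public `patents-public-data.patents.publications` table can be queried with standard SQL. Running the query requires the `google-cloud-bigquery` package and GCP credentials, so the client call is left commented out:

```python
# The table below is the public Google Patents dataset on BigQuery.
QUERY = """
SELECT publication_number, title_localized, abstract_localized
FROM `patents-public-data.patents.publications`
WHERE country_code = 'US'
LIMIT 10
"""

print(QUERY)

# Uncomment to run (requires `pip install google-cloud-bigquery` and credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# for row in client.query(QUERY):
#     print(row.publication_number)
```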


4. Basic Workflow for Keyword Extraction

Step 1: Fetch Patent Data

Use an API or scrape content (title, abstract, claims, description).

```python
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/US20210123456A1/en'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
# Google Patents exposes the abstract in a Dublin Core meta tag.
meta = soup.find('meta', {'name': 'DC.description'})
abstract = meta.get('content') if meta else ''
print(abstract)
```

Step 2: Preprocess Text

Tokenize, remove stopwords, and lemmatize.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(abstract.lower())
# Keep alphabetic, non-stopword tokens, reduced to their lemmas.
keywords = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
print(keywords)
```

Step 3: Extract Keywords

Use TF-IDF, RAKE, or KeyBERT.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF is most meaningful across a corpus; pass a list of many abstracts.
vectorizer = TfidfVectorizer(max_features=20)
X = vectorizer.fit_transform([abstract])
print(vectorizer.get_feature_names_out())
```
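RAKE, mentioned above, scores candidate phrases by splitting text at stopwords and rating each word by its degree-to-frequency ratio. A minimal stdlib-only sketch of the idea (the stopword list is deliberately simplified, not a production implementation):

```python
import re
from collections import defaultdict

# Simplified stopword list for illustration; real RAKE uses a larger one.
STOPWORDS = {'a', 'an', 'the', 'of', 'for', 'and', 'or', 'to', 'in', 'is', 'are', 'with'}

def rake_keywords(text, top_n=5):
    """Score candidate phrases by summed word degree/frequency (RAKE's core idea)."""
    words = re.findall(r'[a-z]+', text.lower())

    # Split the word stream at stopwords to form candidate phrases.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # degree(w) counts co-occurring words; freq(w) counts appearances.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)

    scored = [(' '.join(p), sum(degree[w] / freq[w] for w in p)) for p in phrases]
    scored.sort(key=lambda s: -s[1])
    return [kw for kw, _ in scored[:top_n]]

print(rake_keywords("A method for wireless power transfer in implantable medical devices"))
```

Multi-word phrases score higher than lone words here, which is why RAKE tends to surface technical terms like those in patent claims.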

5. Storing and Analyzing Data

  • Use SQLite, PostgreSQL, or MongoDB to store patent documents and metadata.

  • Visualize keyword trends with matplotlib, seaborn, or Plotly.
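A minimal sketch of the storage step using SQLite from the standard library (the schema and the inserted record are illustrative):

```python
import sqlite3

# In-memory database for illustration; pass a file path in practice.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE patents (
        pub_number TEXT PRIMARY KEY,
        title      TEXT,
        abstract   TEXT,
        keywords   TEXT   -- comma-separated for simplicity
    )
""")

# Hypothetical record, for illustration.
conn.execute(
    "INSERT INTO patents VALUES (?, ?, ?, ?)",
    ('US20210123456A1', 'Example title', 'Example abstract', 'wireless,power')
)
conn.commit()

rows = conn.execute(
    "SELECT pub_number, keywords FROM patents WHERE keywords LIKE ?",
    ('%wireless%',)
).fetchall()
print(rows)
```

For larger corpora or richer queries, the same schema translates directly to PostgreSQL, or to a document store like MongoDB if you prefer nested records.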


6. Scaling and Automation

  • Scrapy: A scalable Python framework for structured crawling.

  • Airflow or cron jobs for scheduled scraping.

  • Use proxies and rotating user agents for robustness.
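A simple sketch of the last point, rotating user agents and adding jittered delays between requests (the User-Agent strings are illustrative placeholders):

```python
import itertools
import random

# Small pool of User-Agent strings; illustrative values only.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

ua_cycle = itertools.cycle(USER_AGENTS)

def next_headers():
    """Headers for the next request, rotating through the User-Agent pool."""
    return {'User-Agent': next(ua_cycle)}

def jittered_delay(base=2.0, spread=1.0):
    """A randomized delay so request timing is less predictable."""
    return base + random.uniform(0, spread)

print(next_headers())
```

Pass `next_headers()` to each `requests.get(...)` call and sleep for `jittered_delay()` seconds between requests; proxy rotation works the same way with a cycled pool of proxy URLs.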


7. Recommended Strategy

  • Use bulk data or APIs whenever possible for reliability and legality.

  • Focus on titles, abstracts, and claims for keyword mining.

  • Automate periodic scraping and analysis for trend detection.

  • Build dashboards for visualization using tools like Tableau or Dash.


