The Palos Publishing Company


Scrape patent databases for keywords

Scraping patent databases for keywords involves gathering and analyzing data from publicly accessible patent repositories. The method, and its legality, vary by database. Here is a guide to the main databases, the tools you can use, and the ethical and legal considerations:


1. Popular Patent Databases to Target

  • Google Patents (https://patents.google.com)
    Offers a simple interface and full-text search across global patent databases. Scraping is technically possible but requires careful request management to avoid being blocked.

  • USPTO (United States Patent and Trademark Office) (https://www.uspto.gov)
    Provides official APIs (e.g., the Patent Examination Data System) and bulk data downloads, which are generally preferable to scraping the web interface.

  • Espacenet (European Patent Office) (https://worldwide.espacenet.com/)
    Offers over 120 million patent documents. Scraping is more difficult due to aggressive bot protection.

  • WIPO PATENTSCOPE (https://patentscope.wipo.int/)
    Includes international (PCT) patent applications. Offers a search API after registration.


2. Ethical and Legal Guidelines

  • Check the robots.txt file for each site.

  • Prefer official APIs or bulk datasets over raw HTML scraping.

  • Do not overwhelm servers with excessive requests (respect rate limits).

  • Use scraped data only for permitted purposes (research, indexing, etc.).
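The guidelines above can be sketched in code. A minimal example, using the standard library's `robotparser` against an illustrative robots.txt policy (not any real site's) and a simple minimum-delay rate limiter:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS.splitlines())

def is_allowed(url, agent='*'):
    """Check a URL against the parsed robots.txt rules."""
    return parser.can_fetch(agent, url)

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

print(is_allowed('https://example.com/patent/US123/en'))  # allowed path
print(is_allowed('https://example.com/private/dump'))     # disallowed path
```

In practice you would fetch each site's real robots.txt with `RobotFileParser.set_url(...)` and `read()`, and call `RateLimiter.wait()` before every request.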


3. Tools and Libraries

a. Python Libraries

  • requests, httpx – for making HTTP requests.

  • BeautifulSoup, lxml – for parsing HTML.

  • Selenium, Playwright – for sites requiring JavaScript rendering.

  • pandas – for organizing data.

  • spaCy, nltk, or scikit-learn – for keyword extraction and text analysis.

b. APIs for Legal Access

  • USPTO Patent Examination Data System (PEDS)

  • Google Cloud BigQuery – Google Patents Dataset

  • WIPO PATENTSCOPE API
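As a sketch of the BigQuery route: the public `patents-public-data.patents.publications` table can be queried with standard SQL. Running the query requires the `google-cloud-bigquery` package and GCP credentials, so the client call is left commented out:

```python
# The table below is the public Google Patents dataset on BigQuery.
QUERY = """
SELECT publication_number, title_localized, abstract_localized
FROM `patents-public-data.patents.publications`
WHERE country_code = 'US'
LIMIT 10
"""

print(QUERY)

# Uncomment to run (requires `pip install google-cloud-bigquery` and credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# for row in client.query(QUERY):
#     print(row.publication_number)
```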


4. Basic Workflow for Keyword Extraction

Step 1: Fetch Patent Data

Use an API or scrape content (title, abstract, claims, description).

```python
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/patent/US20210123456A1/en'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
# Google Patents exposes the abstract in a Dublin Core meta tag.
meta = soup.find('meta', {'name': 'DC.description'})
abstract = meta.get('content') if meta else ''
print(abstract)
```

Step 2: Preprocess Text

Tokenize, remove stopwords, and lemmatize.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(abstract.lower())
# Keep alphabetic, non-stopword tokens, reduced to their lemmas.
keywords = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
print(keywords)
```

Step 3: Extract Keywords

Use TF-IDF, RAKE, or KeyBERT.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF is most meaningful across a corpus; pass a list of many abstracts.
vectorizer = TfidfVectorizer(max_features=20)
X = vectorizer.fit_transform([abstract])
print(vectorizer.get_feature_names_out())
```
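RAKE, mentioned above, scores candidate phrases by splitting text at stopwords and rating each word by its degree-to-frequency ratio. A minimal stdlib-only sketch of the idea (the stopword list is deliberately simplified, not a production implementation):

```python
import re
from collections import defaultdict

# Simplified stopword list for illustration; real RAKE uses a larger one.
STOPWORDS = {'a', 'an', 'the', 'of', 'for', 'and', 'or', 'to', 'in', 'is', 'are', 'with'}

def rake_keywords(text, top_n=5):
    """Score candidate phrases by summed word degree/frequency (RAKE's core idea)."""
    words = re.findall(r'[a-z]+', text.lower())

    # Split the word stream at stopwords to form candidate phrases.
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # degree(w) counts co-occurring words; freq(w) counts appearances.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)

    scored = [(' '.join(p), sum(degree[w] / freq[w] for w in p)) for p in phrases]
    scored.sort(key=lambda s: -s[1])
    return [kw for kw, _ in scored[:top_n]]

print(rake_keywords("A method for wireless power transfer in implantable medical devices"))
```

Multi-word phrases score higher than lone words here, which is why RAKE tends to surface technical terms like those in patent claims.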

5. Storing and Analyzing Data

  • Use SQLite, PostgreSQL, or MongoDB to store patent documents and metadata.

  • Visualize keyword trends with matplotlib, seaborn, or Plotly.
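A minimal sketch of the storage step using SQLite from the standard library (the schema and the inserted record are illustrative):

```python
import sqlite3

# In-memory database for illustration; pass a file path in practice.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE patents (
        pub_number TEXT PRIMARY KEY,
        title      TEXT,
        abstract   TEXT,
        keywords   TEXT   -- comma-separated for simplicity
    )
""")

# Hypothetical record, for illustration.
conn.execute(
    "INSERT INTO patents VALUES (?, ?, ?, ?)",
    ('US20210123456A1', 'Example title', 'Example abstract', 'wireless,power')
)
conn.commit()

rows = conn.execute(
    "SELECT pub_number, keywords FROM patents WHERE keywords LIKE ?",
    ('%wireless%',)
).fetchall()
print(rows)
```

For larger corpora or richer queries, the same schema translates directly to PostgreSQL, or to a document store like MongoDB if you prefer nested records.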


6. Scaling and Automation

  • Scrapy: A scalable Python framework for structured crawling.

  • Airflow or cron jobs for scheduled scraping.

  • Use proxies and rotating user agents for robustness.
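A simple sketch of the last point, rotating user agents and adding jittered delays between requests (the User-Agent strings are illustrative placeholders):

```python
import itertools
import random

# Small pool of User-Agent strings; illustrative values only.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

ua_cycle = itertools.cycle(USER_AGENTS)

def next_headers():
    """Headers for the next request, rotating through the User-Agent pool."""
    return {'User-Agent': next(ua_cycle)}

def jittered_delay(base=2.0, spread=1.0):
    """A randomized delay so request timing is less predictable."""
    return base + random.uniform(0, spread)

print(next_headers())
```

Pass `next_headers()` to each `requests.get(...)` call and sleep for `jittered_delay()` seconds between requests; proxy rotation works the same way with a cycled pool of proxy URLs.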


7. Recommended Strategy

  • Use bulk data or APIs whenever possible for reliability and legality.

  • Focus on titles, abstracts, and claims for keyword mining.

  • Automate periodic scraping and analysis for trend detection.

  • Build dashboards for visualization using tools like Tableau or Dash.


