Scraping patent databases for keywords involves gathering and analyzing data from publicly accessible patent repositories. Depending on the database, the method and legality of scraping can vary. Here’s a guide outlining how to approach this, tools you can use, and ethical/legal considerations:
1. Popular Patent Databases to Target
- Google Patents (https://patents.google.com)
  Offers a simple interface and full-text search across global patent databases. Scraping is technically possible but requires careful request management to avoid being blocked.
- USPTO (United States Patent and Trademark Office)
  - Patent Full-Text and Image Database (https://patft.uspto.gov/)
  - Bulk data via USPTO’s open data portal (https://developer.uspto.gov/)
  Ideal for legal scraping through their APIs and bulk download options.
- Espacenet (European Patent Office) (https://worldwide.espacenet.com/)
  Offers over 120 million patent documents. Scraping is more difficult due to aggressive bot protection.
- WIPO PATENTSCOPE (https://patentscope.wipo.int/)
  Includes international (PCT) patent applications. Offers a search API after registration.
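Before fetching anything from these sites, queries can be assembled programmatically. As a minimal sketch, Google Patents accepts a free-text query through its `q` URL parameter; the keyword list below is illustrative, and actually fetching the URL still requires polite request handling (rate limits, a descriptive User-Agent):

```python
from urllib.parse import urlencode

def google_patents_search_url(keywords):
    """Build a Google Patents search URL from a list of keywords."""
    return "https://patents.google.com/?" + urlencode({"q": " ".join(keywords)})

print(google_patents_search_url(["lithium", "battery", "anode"]))
# https://patents.google.com/?q=lithium+battery+anode
```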
2. Ethical and Legal Guidelines
- Check the robots.txt file for each site.
- Prefer official APIs or bulk datasets over raw HTML scraping.
- Do not overwhelm servers with excessive requests (respect rate limits).
- Use scraped data only for permitted purposes (research, indexing, etc.).
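The robots.txt check can be automated with Python's standard-library `urllib.robotparser` before any crawl begins. The rules below are illustrative, not taken from any real patent site:

```python
from urllib.robotparser import RobotFileParser

def is_path_allowed(robots_txt_lines, user_agent, path):
    """Return True if the given robots.txt rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser.can_fetch(user_agent, path)

# Example rules of the kind a site might serve (illustrative only):
rules = [
    "User-agent: *",
    "Disallow: /search",
    "Crawl-delay: 10",
]
print(is_path_allowed(rules, "my-research-bot", "/patent/US1234567"))   # True
print(is_path_allowed(rules, "my-research-bot", "/search?q=battery"))   # False
```

In a real crawler you would download `https://<site>/robots.txt` once, cache it, and consult it before every request.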
3. Tools and Libraries
a. Python Libraries
- requests, httpx – for making HTTP requests.
- BeautifulSoup, lxml – for parsing HTML.
- Selenium, Playwright – for sites requiring JavaScript rendering.
- pandas – for organizing data.
- spaCy, nltk, or scikit-learn – for keyword extraction and text analysis.
b. APIs for Legal Access
- USPTO Patent Examination Data System (PEDS)
- Google Cloud BigQuery – Google Patents Dataset
- WIPO PATENTSCOPE API
4. Basic Workflow for Keyword Extraction
Step 1: Fetch Patent Data
Use an API or scrape content (title, abstract, claims, description).
Step 2: Preprocess Text
Tokenize, remove stopwords, and lemmatize.
Step 3: Extract Keywords
Use TF-IDF, RAKE, or KeyBERT.
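Steps 2 and 3 can be sketched end-to-end with a minimal, dependency-free TF-IDF scorer. A real pipeline would use scikit-learn's TfidfVectorizer or KeyBERT and a proper stopword list plus lemmatization; the stopwords and toy "abstracts" here are illustrative:

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "with"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords (Step 2, minus lemmatization)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_keywords(documents, top_n=3):
    """Score terms per document by TF-IDF and return the top_n for each (Step 3)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        results.append([t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]])
    return results

# Toy abstracts standing in for fetched patent text (Step 1 output):
docs = [
    "A lithium battery anode with improved charge capacity",
    "A solar panel with improved charge controller",
]
print(tfidf_keywords(docs))
```

Terms shared by every document ("improved", "charge") get an IDF of zero and drop out, which is exactly the behavior that makes TF-IDF useful for surfacing document-specific keywords.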
5. Storing and Analyzing Data
- Use SQLite, PostgreSQL, or MongoDB to store patent documents and metadata.
- Visualize keyword trends with matplotlib, seaborn, or Plotly.
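For the SQLite option, the standard-library `sqlite3` module is enough; the schema and sample data below are illustrative:

```python
import sqlite3

# Illustrative schema: one row per patent, plus a table of extracted keywords.
conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE patents (
        pub_number TEXT PRIMARY KEY,
        title      TEXT,
        abstract   TEXT
    )
""")
conn.execute("""
    CREATE TABLE keywords (
        pub_number TEXT REFERENCES patents(pub_number),
        keyword    TEXT
    )
""")

conn.execute(
    "INSERT INTO patents VALUES (?, ?, ?)",
    ("US-1234567-A1", "Lithium battery anode", "An anode composition ..."),
)
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?)",
    [("US-1234567-A1", "lithium"), ("US-1234567-A1", "anode")],
)

# Keyword frequency across the corpus -- the input for trend plots:
rows = conn.execute(
    "SELECT keyword, COUNT(*) FROM keywords GROUP BY keyword ORDER BY keyword"
).fetchall()
print(rows)  # [('anode', 1), ('lithium', 1)]
```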
6. Scaling and Automation
- Scrapy: a scalable Python framework for structured crawling.
- Airflow or cron jobs for scheduled scraping.
- Use proxies and rotating user agents for robustness.
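Rotation can be sketched with the standard library alone; the user-agent strings and proxy hosts below are placeholders, and the returned dict follows the `headers=`/`proxies=` keyword convention of the requests library:

```python
import itertools
import random

# Illustrative pool; real crawlers should identify themselves honestly
# and respect each site's terms of service.
USER_AGENTS = [
    "research-bot/1.0 (contact: you@example.org)",
    "research-bot/1.1 (contact: you@example.org)",
]
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

ua_cycle = itertools.cycle(USER_AGENTS)

def next_request_config():
    """Rotate user agents round-robin and pick a proxy at random per request."""
    return {
        "headers": {"User-Agent": next(ua_cycle)},
        "proxies": {"http": random.choice(PROXIES)},
    }

cfg = next_request_config()
# cfg can be passed straight through, e.g. requests.get(url, **cfg)
```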
7. Recommended Strategy
- Use bulk data or APIs whenever possible for reliability and legality.
- Focus on titles, abstracts, and claims for keyword mining.
- Automate periodic scraping and analysis for trend detection.
- Build dashboards for visualization using tools like Tableau or Dash.
Let me know if you want a custom scraper or keyword analysis tool built in Python for any of these databases.