
Extract URLs from documents

Extracting URLs from a document means scanning its text for patterns that match web addresses. The right approach depends on the format of the document, so here are a few common ways to do it:

1. Using Regular Expressions (Regex)

A regex pattern can find URLs in plain text.

Example in Python:

```python
import re

text = """
Here are some links:
https://www.example.com
Visit http://test.org or https://sub.domain.com/page?query=1
"""

# Match http(s) URLs as well as bare www. addresses
url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(url_pattern, text)
print(urls)
```

This will output:

```
['https://www.example.com', 'http://test.org', 'https://sub.domain.com/page?query=1']
```

2. Extracting URLs from a Document (Word, PDF)

  • For Word (.docx): Use Python’s python-docx to extract text and then run regex.

  • For PDF (.pdf): Use PyPDF2 or pdfplumber to extract the text, then run the same regex (both are shown below).

Example for Word:

```python
from docx import Document
import re

doc = Document('file.docx')

# Collect the visible text of every paragraph
full_text = []
for para in doc.paragraphs:
    full_text.append(para.text)
text = '\n'.join(full_text)

url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(url_pattern, text)
print(urls)
```
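
Note that doc.paragraphs only exposes the visible text, so a URL stored as a clickable hyperlink (rather than typed out) will not show up in the regex results. A minimal sketch for picking those up, assuming python-docx's relationship API behaves as shown and using the same placeholder file name:

```python
from docx import Document

doc = Document('file.docx')

# Hyperlinks are stored as external relationships on the document part;
# the substring check on reltype is an assumption about how hyperlink
# relationships are identified.
hyperlink_urls = [
    rel.target_ref
    for rel in doc.part.rels.values()
    if 'hyperlink' in rel.reltype
]
print(hyperlink_urls)
```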

Example for PDF:

```python
import pdfplumber
import re

text = ''
with pdfplumber.open('file.pdf') as pdf:
    for page in pdf.pages:
        # extract_text() can return None for pages without a text layer
        text += (page.extract_text() or '') + '\n'

url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(url_pattern, text)
print(urls)
```

