
Extract URLs from documents

Extracting URLs from a document means scanning its text for patterns that match web addresses. The right approach depends on the format of the document, so here are a few common ways to do it:

1. Using Regular Expressions (Regex)

A regex pattern can find URLs in plain text.

Example in Python:

```python
import re

text = """
Here are some links:
https://www.example.com
Visit http://test.org or https://sub.domain.com/page?query=1
"""

# Match http(s) URLs as well as bare www. addresses
url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(url_pattern, text)
print(urls)
```

This will output:

```
['https://www.example.com', 'http://test.org', 'https://sub.domain.com/page?query=1']
```

2. Extracting URLs from a Document (Word, PDF)

  • For Word (.docx): Use Python’s python-docx to extract text and then run regex.

  • For PDF (.pdf): Use PyPDF2 or pdfplumber to extract the text, then run the same regex (both are shown below).

Example for Word:

```python
from docx import Document
import re

doc = Document('file.docx')

# Collect the visible text of every paragraph
full_text = []
for para in doc.paragraphs:
    full_text.append(para.text)
text = '\n'.join(full_text)

url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(url_pattern, text)
print(urls)
```
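
Note that doc.paragraphs only exposes the visible text, so a URL stored as a clickable hyperlink (rather than typed out) will not show up in the regex results. A minimal sketch for picking those up, assuming python-docx's relationship API behaves as shown and using the same placeholder file name:

```python
from docx import Document

doc = Document('file.docx')

# Hyperlinks are stored as external relationships on the document part;
# the substring check on reltype is an assumption about how hyperlink
# relationships are identified.
hyperlink_urls = [
    rel.target_ref
    for rel in doc.part.rels.values()
    if 'hyperlink' in rel.reltype
]
print(hyperlink_urls)
```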

Example for PDF:

```python
import pdfplumber
import re

text = ''
with pdfplumber.open('file.pdf') as pdf:
    for page in pdf.pages:
        # extract_text() can return None for pages without a text layer
        text += (page.extract_text() or '') + '\n'

url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
urls = re.findall(url_pattern, text)
print(urls)
```

