The Palos Publishing Company

Extract metadata from files with Python

How you extract metadata in Python depends on the file type: images, PDFs, audio, video, office documents, and emails each have their own formats and libraries. This guide walks through the common cases, one file type at a time.


1. Extracting Metadata from Images

You can use the Pillow and exifread libraries to extract EXIF data.

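bash
pip install Pillow

python
from PIL import Image
from PIL.ExifTags import TAGS

def extract_image_metadata(file_path):
    # Use the public getexif() API rather than the private _getexif()
    with Image.open(file_path) as image:
        exif = image.getexif()
        # Merge the base IFD with the Exif sub-IFD (tag 0x8769), which
        # holds fields such as DateTimeOriginal
        raw_tags = dict(exif)
        raw_tags.update(exif.get_ifd(0x8769))
    metadata = {}
    for tag_id, value in raw_tags.items():
        # Map numeric tag IDs to readable names where Pillow knows them
        tag = TAGS.get(tag_id, tag_id)
        metadata[tag] = value
    return metadata

# Usage
metadata = extract_image_metadata('image.jpg')
for key, value in metadata.items():
    print(f"{key}: {value}")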

For more detailed EXIF data:

bash
pip install exifread

python
import exifread

def extract_exifread_metadata(file_path):
    # exifread expects a file object opened in binary mode
    with open(file_path, 'rb') as file:
        tags = exifread.process_file(file)
    return tags

tags = extract_exifread_metadata('image.jpg')
for tag in tags:
    print(f"{tag}: {tags[tag]}")

2. Extracting Metadata from PDFs

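Use the PyPDF2 or pdfplumber libraries. PyPDF2 development has moved back to the pypdf package, which offers the same PdfReader interface, so new projects may prefer pypdf.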

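bash
pip install PyPDF2

python
import PyPDF2

def extract_pdf_metadata(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        info = reader.metadata
        # metadata is None when the PDF has no document information
        # dictionary; copy the values while the file is still open
        return {key: info[key] for key in info} if info else {}

metadata = extract_pdf_metadata('document.pdf')
for key, value in metadata.items():
    print(f"{key}: {value}")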

3. Extracting Metadata from Audio Files

Use mutagen for MP3, FLAC, etc.

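bash
pip install mutagen

python
from mutagen import File

def extract_audio_metadata(file_path):
    # File() guesses the format and returns None if it cannot identify it
    audio = File(file_path)
    return audio

metadata = extract_audio_metadata('song.mp3')
if metadata is None:
    print("Unsupported or unrecognized audio file")
else:
    for key, value in metadata.items():
        print(f"{key}: {value}")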

4. Extracting Metadata from Video Files

Use pymediainfo, a wrapper around the MediaInfo library, or call ffprobe through ffmpeg-python.

bash
pip install pymediainfo

python
from pymediainfo import MediaInfo

def extract_video_metadata(file_path):
    media_info = MediaInfo.parse(file_path)
    # Each track (general, video, audio, ...) carries its own fields
    for track in media_info.tracks:
        for key, value in track.to_data().items():
            print(f"{key}: {value}")

extract_video_metadata('video.mp4')
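
The section also names ffprobe, so here is a rough sketch of calling it directly with the standard library instead of through a wrapper. It assumes the ffprobe executable is installed and on the PATH. The helper names (build_ffprobe_command, summarize_ffprobe, extract_ffprobe_metadata) are hypothetical, and the sample report at the end is hand-written for illustration, not real tool output.

```python
import json
import subprocess


def build_ffprobe_command(file_path):
    """Command line asking ffprobe for container and stream info as JSON."""
    return [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        str(file_path),
    ]


def summarize_ffprobe(report):
    """Pick a few common fields out of ffprobe's parsed JSON report."""
    fmt = report.get("format", {})
    return {
        "container": fmt.get("format_name"),
        # ffprobe reports duration as a string of seconds
        "duration_seconds": float(fmt["duration"]) if "duration" in fmt else None,
        "streams": [
            (stream.get("codec_type"), stream.get("codec_name"))
            for stream in report.get("streams", [])
        ],
    }


def extract_ffprobe_metadata(file_path):
    """Run ffprobe on a file (requires ffprobe on the PATH) and summarize it."""
    result = subprocess.run(
        build_ffprobe_command(file_path),
        capture_output=True, text=True, check=True,
    )
    return summarize_ffprobe(json.loads(result.stdout))


# Hand-written sample report (illustrative only) so the sketch runs
# without a video file or ffprobe being present
sample_report = {
    "format": {"format_name": "mov,mp4,m4a,3gp,3g2,mj2", "duration": "12.5"},
    "streams": [
        {"codec_type": "video", "codec_name": "h264"},
        {"codec_type": "audio", "codec_name": "aac"},
    ],
}
print(summarize_ffprobe(sample_report))
```

With a real file, extract_ffprobe_metadata('video.mp4') would run the command and return the same summary structure.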

5. Extracting Metadata from Office Documents

For Microsoft Word, Excel, and PowerPoint files, use python-docx, openpyxl, or python-pptx.

bash
pip install python-docx openpyxl python-pptx

Word documents (.docx):

python
from docx import Document

def extract_docx_metadata(file_path):
    doc = Document(file_path)
    # Core properties hold the document-level metadata
    props = doc.core_properties
    metadata = {
        'title': props.title,
        'author': props.author,
        'subject': props.subject,
        'created': props.created,
        'last_modified_by': props.last_modified_by
    }
    return metadata

metadata = extract_docx_metadata('file.docx')
for key, value in metadata.items():
    print(f"{key}: {value}")

Excel (.xlsx):

python
import openpyxl

def extract_excel_metadata(file_path):
    wb = openpyxl.load_workbook(file_path)
    # Workbook properties use 'creator' for the author field
    props = wb.properties
    metadata = {
        'title': props.title,
        'author': props.creator,
        'created': props.created,
        'modified': props.modified
    }
    return metadata

metadata = extract_excel_metadata('file.xlsx')
for key, value in metadata.items():
    print(f"{key}: {value}")
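
PowerPoint is mentioned above without an example. Since .docx, .xlsx, and .pptx files are all ZIP packages that conventionally store their core properties in a part named docProps/core.xml, one library-free sketch reads that part directly. The function name and field mapping below are hypothetical, the fixed part path is an assumption (a package may place it elsewhere or omit it), and timestamps are returned as the raw strings found in the XML.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the OOXML core-properties part
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> namespaced element name inside core.xml
FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "subject": "dc:subject",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def extract_ooxml_core_properties(file_or_path):
    """Read docProps/core.xml from a .docx, .xlsx, or .pptx package."""
    with zipfile.ZipFile(file_or_path) as package:
        # Assumption: the core properties live at the conventional path
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    metadata = {}
    for name, tag in FIELDS.items():
        element = root.find(tag, NS)
        metadata[name] = element.text if element is not None else None
    return metadata


# Build a tiny in-memory package so the sketch runs without a real file
core_xml = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:title>Demo deck</dc:title>'
    '<dc:creator>Alice</dc:creator>'
    '</cp:coreProperties>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("docProps/core.xml", core_xml)
buffer.seek(0)

for key, value in extract_ooxml_core_properties(buffer).items():
    print(f"{key}: {value}")
```

The same function accepts a path such as 'slides.pptx', because zipfile.ZipFile takes either a file name or a file-like object.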

6. Extracting File System Metadata

Use Python’s pathlib module, which wraps os.stat.

python
from pathlib import Path
import time

def extract_file_system_metadata(file_path):
    file = Path(file_path)
    stat = file.stat()  # query the file system once
    metadata = {
        "name": file.name,
        "size_bytes": stat.st_size,
        # st_ctime is the creation time on Windows, but the time of the
        # last metadata change on Unix-like systems
        "created_or_changed": time.ctime(stat.st_ctime),
        "modified": time.ctime(stat.st_mtime),
        "accessed": time.ctime(stat.st_atime)
    }
    return metadata

metadata = extract_file_system_metadata('example.txt')
for key, value in metadata.items():
    print(f"{key}: {value}")
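
Because st_ctime is not a creation time on Unix-like systems, a somewhat more portable sketch reports a birth time only where the platform exposes st_birthtime and returns None otherwise. It also returns timezone-aware datetime objects instead of formatted strings, which are easier to compare or serialize. The function name file_timestamps and the dictionary keys are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def file_timestamps(file_path):
    """Return size and timestamps for a file as timezone-aware UTC datetimes."""
    stat = Path(file_path).stat()

    def to_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    # st_birthtime exists only on some platforms; None means "not available"
    birth = getattr(stat, "st_birthtime", None)
    return {
        "size_bytes": stat.st_size,
        "created": to_utc(birth) if birth is not None else None,
        # Raw st_ctime, whose meaning depends on the platform
        "ctime": to_utc(stat.st_ctime),
        "modified": to_utc(stat.st_mtime),
        "accessed": to_utc(stat.st_atime),
    }
```

Calling file_timestamps('example.txt') then yields values that can be formatted with isoformat() or compared directly.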

7. Extracting Metadata from Emails

Use extract-msg for Outlook .msg files. Plain-text .eml files use a different format.

bash
pip install extract-msg

python
import extract_msg

def extract_email_metadata(file_path):
    msg = extract_msg.Message(file_path)
    try:
        metadata = {
            'subject': msg.subject,
            'sender': msg.sender,
            'date': msg.date,
            'to': msg.to,
            'cc': msg.cc
        }
    finally:
        # Release the underlying file handle
        msg.close()
    return metadata

metadata = extract_email_metadata('email.msg')
for key, value in metadata.items():
    print(f"{key}: {value}")
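
The extract-msg package targets Outlook .msg files. For .eml files, which are plain RFC 5322 messages, the standard library's email package is usually enough. The sketch below parses only the headers. The function name extract_eml_metadata is hypothetical, and the sample message is made up so the code runs without a file on disk.

```python
from email import policy
from email.parser import BytesParser


def extract_eml_metadata(raw_bytes):
    """Parse only the headers of an RFC 5322 message supplied as bytes."""
    parser = BytesParser(policy=policy.default)
    msg = parser.parsebytes(raw_bytes, headersonly=True)
    # Missing headers come back as None
    return {
        "subject": msg["Subject"],
        "sender": msg["From"],
        "date": msg["Date"],
        "to": msg["To"],
        "cc": msg["Cc"],
    }


# Made-up sample message used in place of a real .eml file
sample = (
    b"From: Alice <alice@example.com>\r\n"
    b"To: Bob <bob@example.com>\r\n"
    b"Subject: Quarterly report\r\n"
    b"Date: Mon, 01 Jan 2024 09:30:00 +0000\r\n"
    b"\r\n"
    b"Body text.\r\n"
)

for key, value in extract_eml_metadata(sample).items():
    print(f"{key}: {value}")
```

For a file on disk, read it in binary mode and pass the bytes, for example extract_eml_metadata(open('email.eml', 'rb').read()).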

Conclusion

Extracting metadata with Python is straightforward using the appropriate libraries for each file type. These tools provide access to everything from image EXIF data and video codecs to PDF properties and document authorship. By modularizing the approach based on file type, you can automate metadata extraction across various file formats efficiently.
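
The modular approach mentioned above can be sketched as a small registry that maps file extensions to extractor functions. Everything here is hypothetical: the names EXTRACTORS, extract_basic_metadata, extract_text_metadata, and extract_metadata are invented for illustration, and only standard-library extractors are registered so the example stays self-contained. In practice the registry would point at type-specific functions like the ones in the sections above.

```python
from pathlib import Path


def extract_basic_metadata(path):
    """Fallback extractor: file name and size from the file system."""
    path = Path(path)
    return {"name": path.name, "size_bytes": path.stat().st_size}


def extract_text_metadata(path):
    """Example type-specific extractor: basic metadata plus a line count."""
    metadata = extract_basic_metadata(path)
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        metadata["line_count"] = sum(1 for _ in handle)
    return metadata


# Hypothetical registry: lowercase extension -> extractor function
EXTRACTORS = {
    ".txt": extract_text_metadata,
    ".log": extract_text_metadata,
}


def extract_metadata(path):
    """Pick an extractor by extension, falling back to basic metadata."""
    extractor = EXTRACTORS.get(Path(path).suffix.lower(), extract_basic_metadata)
    return extractor(path)
```

Adding support for another format then means writing one function and registering its extensions, without touching the dispatch logic.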
