The Palos Publishing Company

Build a script to auto-categorize documents

Here’s a Python script that auto-categorizes documents using Natural Language Processing (NLP) with scikit-learn. It reads labeled .txt files from one folder per category, vectorizes them using TF-IDF, trains a Multinomial Naive Bayes classifier, prints its accuracy on a held-out 20% of the documents, and then predicts the category of a new document. You can expand this with more advanced models or larger datasets later.

python
import os
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

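# Step 1: Prepare data
def load_documents_from_folder(base_path):
    """Read every .txt file under base_path/<category>/ and label it with the folder name."""
    documents = []
    labels = []
    # os.listdir and glob return entries in arbitrary order; sorting keeps the
    # document order, and therefore the train/test split, reproducible.
    for category in sorted(os.listdir(base_path)):
        category_path = os.path.join(base_path, category)
        if os.path.isdir(category_path):
            for filepath in sorted(glob.glob(os.path.join(category_path, '*.txt'))):
                with open(filepath, 'r', encoding='utf-8') as file:
                    documents.append(file.read())
                    labels.append(category)
    return documents, labels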

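# Step 2: Train classifier
def train_model(documents, labels):
    # Hold out 20% of the documents to estimate accuracy on unseen text
    X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)
    # The pipeline turns raw text into TF-IDF features and feeds them to Naive Bayes
    model = Pipeline([
        ('vectorizer', TfidfVectorizer(stop_words='english')),
        ('classifier', MultinomialNB())
    ])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    # Note: the returned model has been fitted on the 80% training split only
    return model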

# Step 3: Predict category of new documents
def predict_category(model, file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    prediction = model.predict([content])
    return prediction[0]

# Main usage
if __name__ == "__main__":
    base_folder = "data/train"  # Folder structure: data/train/<category_name>/*.txt
    new_doc_path = "data/test/new_document.txt"

    docs, labels = load_documents_from_folder(base_folder)
    classifier_model = train_model(docs, labels)
    predicted_category = predict_category(classifier_model, new_doc_path)
    print(f"Predicted Category: {predicted_category}")

Folder Structure Example:

bash
data/
├── train/
│   ├── finance/
│   │   ├── doc1.txt
│   │   └── doc2.txt
│   └── tech/
│       ├── doc1.txt
│       └── doc2.txt
└── test/
    └── new_document.txt
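
If you don't have labeled documents yet, a throwaway copy of this layout can be generated for a first test run. The sketch below is one way to do it with the standard library only. The helper name create_sample_layout, the SAMPLE_DOCS texts, and the use of a temporary directory are all assumptions made for illustration, not part of the script above.

```python
import tempfile
from pathlib import Path

# Hypothetical placeholder texts, invented purely for illustration.
SAMPLE_DOCS = {
    "finance": [
        "Stocks and bonds rallied as interest rates fell.",
        "The bank reported higher quarterly earnings.",
    ],
    "tech": [
        "The new processor speeds up machine learning workloads.",
        "Developers released an update to the open source compiler.",
    ],
}


def create_sample_layout(root):
    """Write <root>/train/<category>/docN.txt and <root>/test/new_document.txt."""
    root = Path(root)
    for category, texts in SAMPLE_DOCS.items():
        category_dir = root / "train" / category
        category_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(texts, start=1):
            (category_dir / f"doc{i}.txt").write_text(text, encoding="utf-8")
    test_dir = root / "test"
    test_dir.mkdir(parents=True, exist_ok=True)
    (test_dir / "new_document.txt").write_text(
        "Investors moved money into bonds.", encoding="utf-8"
    )
    return root


# Build the layout in a temporary directory and list what was created.
sample_root = create_sample_layout(tempfile.mkdtemp())
created = sorted(
    p.relative_to(sample_root).as_posix() for p in sample_root.rglob("*.txt")
)
print(created)
```

To try the main script against this layout, point base_folder at the generated train directory and new_doc_path at the generated test/new_document.txt. With only four training documents the printed accuracy is not meaningful; the layout is just enough to check that the script runs end to end.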

Notes:

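You can replace MultinomialNB with LogisticRegression or another scikit-learn classifier that accepts sparse TF-IDF features.
For more complex text, a pre-trained transformer model (like BERT) may give higher accuracy, at the cost of more compute.
You can persist the trained pipeline using joblib (joblib.dump and joblib.load) so later predictions don't require retraining.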
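
Two of these notes, swapping the classifier and persisting the model, can be sketched together. This is a minimal example under stated assumptions: the toy texts and labels, the max_iter=1000 setting, and the file name doc_classifier.joblib are invented for illustration and are not part of the script above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy texts and labels, invented purely for illustration.
train_texts = [
    "Stocks and bonds rallied as interest rates fell.",
    "The bank reported higher quarterly earnings.",
    "The new processor speeds up machine learning workloads.",
    "Developers released an update to the open source compiler.",
]
train_labels = ["finance", "finance", "tech", "tech"]

# Same pipeline shape as in the main script, with only the classifier step swapped.
model = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

# Persist the whole fitted pipeline (vectorizer and classifier together).
# The file name is a hypothetical choice.
model_path = os.path.join(tempfile.mkdtemp(), "doc_classifier.joblib")
joblib.dump(model, model_path)

# Later, or in another process: reload and predict without retraining.
restored = joblib.load(model_path)
new_text = "Investors moved money into bonds."
print("Predicted Category:", restored.predict([new_text])[0])
```

Saving the full Pipeline rather than the classifier alone keeps the fitted TF-IDF vocabulary with the model, so new text is vectorized the same way it was during training. Because joblib files are pickle-based, only load files you trust, and ideally load them with the same scikit-learn version that saved them.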

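To adapt this for PDFs, Word documents, or a database, replace the file-reading steps in load_documents_from_folder and predict_category with code that extracts plain text from that source. The rest of the pipeline stays the same.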
