Categories We Write About

Build a script to auto-categorize documents

Here’s a Python script that auto-categorizes documents using Natural Language Processing (NLP) with scikit-learn. It reads documents (e.g., .txt files), vectorizes them using TF-IDF, and classifies them into predefined categories using a Naive Bayes classifier. You can expand this with more advanced models or datasets later.


python
import os import glob from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.naive_bayes import MultinomialNB from sklearn.pipeline import Pipeline from sklearn.model_selection import train_test_split from sklearn import metrics # Step 1: Prepare data def load_documents_from_folder(base_path): documents = [] labels = [] for category in os.listdir(base_path): category_path = os.path.join(base_path, category) if os.path.isdir(category_path): for filepath in glob.glob(os.path.join(category_path, '*.txt')): with open(filepath, 'r', encoding='utf-8') as file: documents.append(file.read()) labels.append(category) return documents, labels # Step 2: Train classifier def train_model(documents, labels): X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42) model = Pipeline([ ('vectorizer', TfidfVectorizer(stop_words='english')), ('classifier', MultinomialNB()) ]) model.fit(X_train, y_train) y_pred = model.predict(X_test) print("Accuracy:", metrics.accuracy_score(y_test, y_pred)) return model # Step 3: Predict category of new documents def predict_category(model, file_path): with open(file_path, 'r', encoding='utf-8') as file: content = file.read() prediction = model.predict([content]) return prediction[0] # Main usage if __name__ == "__main__": base_folder = "data/train" # Folder structure: data/train/<category_name>/*.txt new_doc_path = "data/test/new_document.txt" docs, labels = load_documents_from_folder(base_folder) classifier_model = train_model(docs, labels) predicted_category = predict_category(classifier_model, new_doc_path) print(f"Predicted Category: {predicted_category}")

Folder Structure Example:

bash
data/ ├── train/ │ ├── finance/ │ │ ├── doc1.txt │ │ └── doc2.txt │ ├── tech/ │ │ ├── doc1.txt │ │ └── doc2.txt ├── test/ │ └── new_document.txt

Notes:

  • You can replace MultinomialNB with LogisticRegression or any other classifier.

  • Add a pre-trained model (like BERT) for higher accuracy if you’re processing more complex text.

  • You can persist the model using joblib for later predictions.

Let me know if you want this adapted for PDF, Word docs, or a database integration.

Share This Page:

Enter your email below to join The Palos Publishing Company Email List

We respect your email privacy

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Categories We Write About