This tutorial builds a simple spam classifier using logistic regression and the scikit-learn (sklearn)
library. I recommend creating a folder named "ml", creating and activating a virtual environment inside it, installing Jupyter Notebook together with pandas and scikit-learn, and then starting the notebook server.
mkdir ml
cd ml
python -m venv env
.\env\Scripts\activate
# On Linux/macOS, use this instead: source env/bin/activate
pip install notebook pandas scikit-learn
jupyter-notebook
Let's break down the code step-by-step:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import joblib
These imports include tools for data handling (pandas), splitting datasets (train_test_split), converting text into vectors (CountVectorizer), our classification model (LogisticRegression), and a tool for saving/loading models (joblib).
df = pd.read_csv('spam.csv', encoding='latin-1')
print(df.head())
columns_to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
df.drop(columns=columns_to_drop, inplace=True)
Here, we load a CSV file named 'spam.csv' into a DataFrame df and print the first few rows. The encoding 'latin-1' is passed because this dataset may contain special characters that cannot be decoded as UTF-8, which is what pandas tries by default.
The dataset also contains some columns that are not relevant for our purpose ("Unnamed: 2", "Unnamed: 3", and "Unnamed: 4"), so we drop them.
X = df['v2']
y = df["v1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Here, we separate our data (the emails/texts in column v2) from the labels (spam or not spam, in column v1). After that, we split both into training and testing datasets, holding out 20% of the rows for testing. Fixing random_state makes the split, and therefore our results, reproducible.
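To make the effect of random_state concrete, here is a minimal sketch on a made-up dataset. The names toy_X and toy_y and the ten toy messages are assumptions for illustration only and are not part of spam.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten messages with made-up labels.
toy_X = pd.Series([f"message {i}" for i in range(10)])
toy_y = pd.Series(["ham"] * 8 + ["spam"] * 2)

# Splitting twice with the same random_state gives the same split.
first = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
second = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Index 1 of the returned list is the test portion of the features.
same_split = first[1].equals(second[1])
print(same_split)
print(len(first[0]), len(first[1]))  # training size, test size
```

train_test_split also accepts a stratify argument (for example stratify=y), which keeps the ratio of labels roughly the same in both parts. Whether that is worth using here depends on how balanced your labels are.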
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
Machine learning models don't understand raw text. Thus, we convert text data into a numerical format using CountVectorizer, resulting in a matrix representation of our text data.
Calling fit_transform(X_train) on the CountVectorizer() performs two steps:
fit: Learns the vocabulary of X_train, i.e., all unique words.
transform: Transforms our text data into a matrix where each row corresponds to a text and each column corresponds to a unique word in the data. The value in each cell of this matrix represents the count of that word in that text.
model = LogisticRegression(solver='liblinear')
model.fit(X_train_transformed, y_train)
We instantiate and train a logistic regression model. The 'liblinear' solver is generally a good choice for small datasets and binary classification, making it apt for our use-case.
X_test_transformed = vectorizer.transform(X_test)
accuracy = model.score(X_test_transformed, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
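Once the model is trained, we evaluate its performance on unseen/test data. Note that the test texts are passed through transform only, so they are encoded with the vocabulary learned from the training data. The model's accuracy is the fraction of test messages, spam and non-spam alike, that it labels correctly.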
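Accuracy on its own can look better than it is if, for instance, most messages are not spam. The sketch below uses made-up labels (not results from this tutorial's model) to show how a confusion matrix exposes a classifier that never predicts spam.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 ham and 2 spam messages.
true_labels = ["ham"] * 8 + ["spam"] * 2
# A useless "classifier" that calls everything ham.
predicted = ["ham"] * 10

acc = accuracy_score(true_labels, predicted)
print("Accuracy: {:.2f}%".format(acc * 100))

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(true_labels, predicted, labels=["ham", "spam"])
print(cm)
```

Here the accuracy is 80% even though both spam messages were missed, which shows up as the 2 in the bottom-left cell. On the real data, you could pass y_test and model.predict(X_test_transformed) to the same functions.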
example_text = pd.Series("Hello, Are you enjoying learning fastapi?")
example_text_transformed = vectorizer.transform(example_text)
prediction = model.predict(example_text_transformed)
print(prediction)
To see our model in action, we provide it with an example text and check its prediction. This example illustrates how you'd use the model in a real-world scenario.
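If you want to see how confident the model is rather than just the predicted label, LogisticRegression also offers predict_proba. The following is a sketch on a made-up four-message training set. The toy_ names and the messages are assumptions for illustration, and the probabilities from such a tiny model are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data, not taken from spam.csv.
toy_texts = ["win a free prize now", "free cash offer",
             "lunch at noon?", "see you tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression(solver="liblinear")
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

message = ["claim your free prize"]
# One row per message, one column per class in toy_model.classes_.
probabilities = toy_model.predict_proba(toy_vectorizer.transform(message))
for label, p in zip(toy_model.classes_, probabilities[0]):
    print(f"{label}: {p:.2f}")
```

With the tutorial's own objects, the equivalent call would be model.predict_proba(example_text_transformed).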
After training, we save the model and the vectorizer to disk using joblib. This is crucial for real-world applications as it removes the need to retrain the model every time we want to use it.
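A minimal sketch of saving and reloading with joblib.dump and joblib.load follows. The file names spam_model.joblib and vectorizer.joblib, the temporary directory, and the toy training data are assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model so the example is self-contained.
toy_texts = ["win a free prize now", "free cash offer",
             "lunch at noon?", "see you tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression(solver="liblinear")
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # File names are illustrative; pick whatever suits your project.
    model_path = Path(tmp_dir) / "spam_model.joblib"
    vectorizer_path = Path(tmp_dir) / "vectorizer.joblib"

    # Save both objects: the model is useless without its vocabulary.
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Later (e.g. in another process) load them back.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

new_text = ["free prize offer"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_text))
loaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_text))
print(original_pred, loaded_pred)
```

In the notebook you would pass the tutorial's model and vectorizer to joblib.dump and write to a regular path instead of a temporary directory. Only load joblib files from sources you trust, since loading them can execute arbitrary code.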
© Copyright 2022-23 Team FastAPITutorial