This tutorial builds a simple spam classifier using logistic regression and the scikit-learn
library. I recommend creating a folder named "ml", creating and activating a virtual environment inside it, installing Jupyter Notebook together with pandas and scikit-learn, and then starting the notebook server:
mkdir ml
cd ml
python -m venv env
.\env\Scripts\activate
# On Linux/macOS, run this instead of the line above: source env/bin/activate
pip install notebook pandas scikit-learn
jupyter-notebook
Let's break down the code step-by-step:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import joblib
These imports include tools for data handling (pandas), splitting datasets (train_test_split), converting text into vectors (CountVectorizer), our classification model (LogisticRegression), and a tool for saving/loading models (joblib).
df = pd.read_csv('spam.csv', encoding='latin-1')
print(df.head())
columns_to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
df.drop(columns=columns_to_drop, inplace=True)
Here, we load a CSV file named 'spam.csv' into a DataFrame df and print the first few rows. The 'latin-1' encoding is used for compatibility, since this dataset may contain special characters that fail to decode as UTF-8, the pandas default.
The dataset also contains some columns that are not relevant for our purpose, so we drop them.
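The loading and clean-up step can be tried without the real file. The sketch below uses a small in-memory CSV that is made up for illustration; it only mimics the layout described above (a label column v1, a text column v2, and empty trailing columns). It drops every column whose name starts with "Unnamed" rather than listing the names by hand, which is an alternative to the explicit list used in this tutorial, not a replacement for it.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the trailing commas create empty, unnamed columns.
csv_text = (
    "v1,v2,,,\n"
    "ham,See you at lunch,,,\n"
    "spam,Win a free prize now,,,\n"
    "ham,Call me later,,,\n"
)
demo_df = pd.read_csv(io.StringIO(csv_text))

# Drop every column pandas auto-named because its header was blank.
unnamed_columns = [name for name in demo_df.columns if name.startswith("Unnamed")]
demo_df = demo_df.drop(columns=unnamed_columns)

# Check how many messages carry each label before training anything.
label_counts = demo_df["v1"].value_counts()
print(list(demo_df.columns))
print(label_counts)
```

Looking at the label counts early is worthwhile, because a heavily imbalanced dataset changes how the accuracy figure later on should be read.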
X = df['v2']
y = df["v1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Here, we separate our data (the emails/texts in column v2) from the labels (spam or not spam, in column v1). We then split both into training and testing sets, holding out 20% of the rows for testing. Fixing random_state makes the split, and therefore our results, reproducible.
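To see what random_state buys us, here is a minimal sketch on made-up data (the messages and labels lists are placeholders, not the real dataset). Calling train_test_split twice with the same seed yields the same held-out items.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the text and label columns.
messages = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

# Two calls with the same random_state produce the same split.
first = train_test_split(messages, labels, test_size=0.2, random_state=42)
second = train_test_split(messages, labels, test_size=0.2, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = first
print(len(X_train_demo), len(X_test_demo))  # 8 training items, 2 test items
print(first[1] == second[1])                # identical test sets
```

If the classes are imbalanced, passing stratify=labels is one option for keeping the spam/ham ratio similar in both parts; the tutorial's own code does not do this.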
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
Machine learning models don't understand raw text. Thus, we convert text data into a numerical format using CountVectorizer, resulting in a matrix representation of our text data.
CountVectorizer():
- Purpose: This is used to convert a collection of text data into a matrix of token counts.
- How it works: It tokenizes the text data and gives an integer ID to each token. Then, it counts the occurrences of each of these tokens.
fit_transform(X_train):
- fit: Learns the vocabulary of X_train, i.e., all unique words.
- transform: Transforms our text data into a matrix where each row corresponds to a text and each column corresponds to a unique word in the data. The value in each cell of this matrix represents the count of that word in that text.
model = LogisticRegression(solver='liblinear')
model.fit(X_train_transformed, y_train)
We instantiate and train a logistic regression model. The 'liblinear' solver is generally a good choice for small datasets and binary classification, making it apt for our use-case.
X_test_transformed = vectorizer.transform(X_test)
accuracy = model.score(X_test_transformed, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
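Once the model is trained, we evaluate its performance on unseen test data. Note that the test messages go through transform, not fit_transform, so they are encoded with the vocabulary learned from the training set. The accuracy is the fraction of test messages the model labels correctly.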
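Accuracy alone can look better than the classifier really is when one class dominates. The sketch below uses hand-written true and predicted labels (made up for illustration, not output from the model in this tutorial) to show how a confusion matrix and the recall on the spam class add detail to a single accuracy number.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 ham and 2 spam messages, with one spam message missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

acc = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(f"Accuracy: {acc:.2f}")        # 0.90
print(matrix)                        # rows = true label, columns = predicted label
print(f"Spam recall: {spam_recall:.2f}")  # 0.50
```

Here the accuracy is 90%, yet only half of the spam messages were caught, which the accuracy figure by itself would not reveal.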
example_text = pd.Series("Hello, Are you enjoying learning fastapi?")
example_text_transformed = vectorizer.transform(example_text)
prediction = model.predict(example_text_transformed)
print(prediction)
To see our model in action, we provide it with an example text and check its prediction. This example illustrates how you'd use the model in a real-world scenario.
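A predicted label does not say how sure the model is. As a hedged sketch, the snippet below trains a throwaway model on four invented messages (far too few for a meaningful classifier) and uses predict_proba to print the estimated probability of each class. It also passes a plain Python list of strings to transform, which works just as well as a pandas Series.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data, for illustration only.
train_texts = [
    "win a free prize now",
    "free cash offer",
    "are we meeting today",
    "see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

demo_vectorizer = CountVectorizer()
demo_model = LogisticRegression(solver="liblinear")
demo_model.fit(demo_vectorizer.fit_transform(train_texts), train_labels)

# A plain list of strings is enough; each row of the result sums to 1.
new_texts = ["claim your free prize", "lunch today?"]
probabilities = demo_model.predict_proba(demo_vectorizer.transform(new_texts))

for text, row in zip(new_texts, probabilities):
    scores = dict(zip(demo_model.classes_, row.round(3)))
    print(text, scores)
```

The columns of the probability array follow the order of demo_model.classes_, so zipping the two together avoids guessing which column means spam.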
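After training, we save the model and the vectorizer to disk using joblib. This matters for real-world applications because it removes the need to retrain the model every time we want to use it. Both objects must be saved, since new text has to be encoded with the same vocabulary the model was trained on.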
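The saving step is described but its code is not shown, so here is a minimal sketch of one way to do it with joblib.dump and joblib.load. The file names spam_model.joblib and vectorizer.joblib are assumptions chosen for this example, and the tiny training set is invented so that the snippet runs on its own; in the notebook you would dump the model and vectorizer objects trained earlier instead.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the trained vectorizer and model.
texts = ["win a free prize now", "free cash offer", "are we meeting today", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]
vec = CountVectorizer()
clf = LogisticRegression(solver="liblinear")
clf.fit(vec.fit_transform(texts), labels)

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_model.joblib")      # assumed file name
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.joblib")  # assumed file name

    joblib.dump(clf, model_path)
    joblib.dump(vec, vectorizer_path)

    # Later, for example in another process, load both objects back.
    loaded_clf = joblib.load(model_path)
    loaded_vec = joblib.load(vectorizer_path)

# The restored pair should give the same prediction as the original pair.
sample = ["free prize today"]
original_pred = clf.predict(vec.transform(sample))
restored_pred = loaded_clf.predict(loaded_vec.transform(sample))
print(original_pred, restored_pred)
```

Files written this way are pickle-based, so they should only be loaded from sources you trust, and ideally with the same scikit-learn version that produced them.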