02 Building the Machine Learning Classifier

This tutorial builds a simple spam classifier using logistic regression and the scikit-learn (sklearn) library. I recommend creating a folder named "ml", creating a virtual environment inside it, activating it, installing Jupyter Notebook together with the required packages, and starting the notebook server.

mkdir ml
cd ml
python -m venv env
.\env\Scripts\activate
# On Linux/macOS, run this instead: source env/bin/activate
pip install notebook pandas scikit-learn
jupyter-notebook

Let's break down the code step-by-step:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

These imports include tools for data handling (pandas), splitting datasets (train_test_split), converting text into vectors (CountVectorizer), our classification model (LogisticRegression), and a tool for saving/loading models (joblib).

df = pd.read_csv('spam.csv', encoding='latin-1')
print(df.head())

columns_to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
df.drop(columns=columns_to_drop, inplace=True)

Here, we load a CSV file named 'spam.csv' into a DataFrame df and print the first few rows. The encoding 'latin-1' is passed explicitly because the dataset may contain special characters that the default UTF-8 decoding cannot read.

The dataset also contains three columns ("Unnamed: 2", "Unnamed: 3", and "Unnamed: 4") that are not relevant for our purpose, so we drop them.

X = df['v2']
y = df["v1"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

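Here, we separate our data (the message texts in column v2) from the labels (spam or not spam, in column v1). After that, we split them into training and testing datasets: test_size=0.2 holds out 20% of the rows for testing and leaves 80% for training. Fixing random_state makes the split, and therefore our results, reproducible.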
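The effect of test_size and random_state can be seen on a tiny stand-in dataset. This is a minimal sketch: the ten placeholder items and the variable names are made up for illustration and are not part of spam.csv.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder items standing in for messages and their labels.
items = list(range(10))
targets = [i % 2 for i in items]

# Two splits with the same seed.
a_train, a_test, _, _ = train_test_split(items, targets, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(items, targets, test_size=0.2, random_state=42)

print(len(a_train), len(a_test))  # 8 2
print(a_test == b_test)           # True: same seed, same split
```

Changing random_state (or omitting it) would generally select different rows for the test set, which is why accuracy can vary slightly between runs when it is not fixed.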

vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)

Machine learning models don't understand raw text. Thus, we convert text data into a numerical format using CountVectorizer, resulting in a matrix representation of our text data.

CountVectorizer():

Purpose: This is used to convert a collection of text data into a matrix of token counts.
How it works: It tokenizes the text data and gives an integer ID to each token. Then, it counts the occurrences of each of these tokens.

fit_transform(X_train):

fit: Learns the vocabulary of X_train, i.e., all unique words in the training texts.
transform: Transforms our text data into a matrix where each row corresponds to a text and each column corresponds to a unique word in the learned vocabulary. The value in each cell of this matrix represents the count of that word in that text.
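To make the matrix concrete, here is a small sketch on two made-up messages (they are not taken from spam.csv, and the toy_ names are only for illustration). With default settings, CountVectorizer lowercases the text and keeps tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only for illustration.
toy_texts = ["free prize now", "call me now now"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

# Vocabulary learned by fit, listed in column order.
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)               # ['call', 'free', 'me', 'now', 'prize']

# One row per message, one column per word; each cell is a count.
print(toy_matrix.toarray())  # [[0 1 0 1 1]
                             #  [1 0 1 2 0]]
```

The second row has a 2 in the 'now' column because that word appears twice in the second message.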

model = LogisticRegression(solver='liblinear')
model.fit(X_train_transformed, y_train)

We instantiate and train a logistic regression model. The 'liblinear' solver is generally a good choice for small datasets and binary classification, making it apt for our use-case.

X_test_transformed = vectorizer.transform(X_test)
accuracy = model.score(X_test_transformed, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Once the model is trained, we evaluate its performance on the test data, which it has not seen during training. The accuracy returned by model.score is the fraction of test messages that the model labels correctly.
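As a rough illustration of what the accuracy number means, the sketch below compares score with a hand-computed fraction of correct predictions. The messages, labels, and toy_ names are made up so the block runs on its own, and the model is scored on its own training data purely to compare the two numbers, not as a real evaluation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up messages and labels, only so the block runs on its own.
toy_texts = ["win a free prize", "free cash now", "lunch at noon", "call me later"]
toy_labels = np.array(["spam", "spam", "ham", "ham"])

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)
toy_model = LogisticRegression(solver="liblinear").fit(toy_X, toy_labels)

# score() and the hand-computed fraction of correct predictions agree.
reported = toy_model.score(toy_X, toy_labels)
by_hand = np.mean(toy_model.predict(toy_X) == toy_labels)
print(reported, by_hand)
```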

example_text = pd.Series("Hello, Are you enjoying learning fastapi?")
example_text_transformed = vectorizer.transform(example_text)
prediction = model.predict(example_text_transformed)
print(prediction)

To see our model in action, we provide it with an example text and check its prediction. This example illustrates how you'd use the model in a real-world scenario.
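Note that new text goes through transform, not fit_transform, so it is mapped onto the vocabulary learned during training. The sketch below, using made-up messages and a hypothetical demo_vectorizer, suggests what happens to a word that was never seen during fitting: it simply gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on two made-up training messages.
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["free prize now", "call me now"])

# "fastapi" was never seen during fit, so it is ignored.
new_matrix = demo_vectorizer.transform(["free fastapi now"])
print(new_matrix.shape)      # (1, 5)
print(new_matrix.toarray())  # [[0 1 0 1 0]]
```

The column count stays at five (call, free, me, now, prize), which is what the trained model expects as input.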

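After training, we save the model and the vectorizer to disk using joblib. This is crucial for real-world applications because it removes the need to retrain the model every time we want to use it. Both objects must be saved, since the model only understands input produced by the same fitted vectorizer.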
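A minimal sketch of the save-and-load step might look like the following. The file names spam_model.joblib and vectorizer.joblib are assumptions, as are the demo_ objects trained on made-up data so the block runs on its own; in the notebook you would pass the model and vectorizer trained above and would likely save into the project folder rather than a temporary directory.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in data so this block runs on its own (not from spam.csv).
texts = ["win a free prize now", "free cash offer", "see you at lunch", "call me tonight"]
labels = ["spam", "spam", "ham", "ham"]

demo_vectorizer = CountVectorizer()
demo_model = LogisticRegression(solver="liblinear")
demo_model.fit(demo_vectorizer.fit_transform(texts), labels)

# Hypothetical file names; a temp directory keeps the example tidy.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "spam_model.joblib"
vectorizer_path = out_dir / "vectorizer.joblib"
joblib.dump(demo_model, model_path)
joblib.dump(demo_vectorizer, vectorizer_path)

# Later (for example in another process): load both and predict.
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)
new_text = ["free prize offer"]
print(loaded_model.predict(loaded_vectorizer.transform(new_text)))
```

Files written by joblib are pickle-based, so they should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.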