Email is one of the most important communication tools and is widely used in almost every field, including business, corporations, educational institutions, and by individual users. Spam detection is an important feature for keeping that communication effective, because it improves both user experience and security. Spam mail, or junk mail, is email sent to a massive number of users at one time, frequently containing cryptic messages, scams, or, most dangerously, phishing content. A spam detector (shown in Figure 1) is a program that detects unsolicited, unwanted, and virus-infected emails and prevents those messages from reaching a user’s inbox.
We’re going to take the following approach:
1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation
1. Problem Definition:
Given the content of an email, can we predict whether it is spam or ham (legitimate)?
2. Data:
The data was downloaded from Kaggle: https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification
3. Evaluation:
If we can achieve above 97% accuracy at predicting whether an email is spam or ham, we’ll pursue the project.
4. Features:
This section describes each of the features in the data.
Data Dictionary
- result : whether an email is spam or ham
- emails : content of email
Preparing the tools
We’re going to use:
-> Pandas for analyzing, cleaning, exploring, and manipulating data.
-> NumPy for numerical operations on arrays, such as averaging cross-validation scores.
-> Matplotlib and Seaborn for creating statistical data visualizations.
# Import all the tools we need
# Regular EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Data Preprocessing
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from collections import Counter
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Model Building
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
Load Data
df = pd.read_csv('spam.csv', encoding= 'latin1')
df.head()
df.head() shows the first 5 emails in the dataset. Each email has two parts: a label (“ham” for regular emails or “spam” for unwanted ones) and the actual content of the email. The dataset also contains some extra columns without useful information; they are labeled “Unnamed” and hold missing values. Looking at this sample gives us an idea of what kinds of emails are in the dataset and how they’re organized.
Exploring our data
df.tail()
Data Cleaning
df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], inplace=True)
df.rename(columns={'v1': 'result', 'v2': 'emails'}, inplace=True)
df.head()
# Checking for null values
df.isnull().sum()
# Checking for duplicate values
df.duplicated().sum()
df = df.drop_duplicates(keep='first')
df.shape
Now that our data is clean, let’s do EDA.
Exploratory Data Analysis (EDA)
1. Distribution of Labels
df['result'].value_counts()
# Plotting
plt.figure(figsize=(10, 6))
plt.pie(df['result'].value_counts(), labels=df['result'].value_counts().index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Spam and Non-Spam Emails')
plt.axis('equal')
plt.show()
2. Average Length of Emails for Spam and Ham
# Length in characters, number of words, and number of sentences per email
df['Length'] = df['emails'].apply(len)
df['num_words'] = df['emails'].apply(word_tokenize).apply(len)
df['num_sentence'] = df['emails'].apply(sent_tokenize).apply(len)
df.head(2)
avg_length_spam = df[df['result'] == 'spam']['Length'].mean()
avg_length_ham = df[df['result'] == 'ham']['Length'].mean()
print("Average Length of Spam Emails:", avg_length_spam)
print("Average Length of Ham Emails:", avg_length_ham)
# Plotting
plt.bar(['Spam', 'Ham'], [avg_length_spam, avg_length_ham], color=['Blue', 'green'])
plt.title('Average Length of Emails for Spam and Ham')
plt.xlabel('Email Type')
plt.ylabel('Average Length')
plt.show()
3. Average Words of Emails for Spam and Ham
avg_word_spam = df[df['result'] == 'spam']['num_words'].mean()
avg_word_ham = df[df['result'] == 'ham']['num_words'].mean()
print("Average Words of Spam Emails:", avg_word_spam)
print("Average Words of Ham Emails:", avg_word_ham)
# Plotting the graph
plt.bar(['Spam', 'Ham'], [avg_word_spam, avg_word_ham], color=['Black', 'blue'])
plt.title('Average Words of Emails for Spam and Ham')
plt.xlabel('Email Type')
plt.ylabel('Average Words')
plt.show()
4. Average Sentence of Emails for Spam or Ham
avg_sentence_spam = df[df['result'] == 'spam']['num_sentence'].mean()
avg_sentence_ham = df[df['result'] == 'ham']['num_sentence'].mean()
print("Average Sentence of Spam Emails:", avg_sentence_spam)
print("Average Sentence of Ham Emails:", avg_sentence_ham)
# Plotting the graph
plt.bar(['Spam', 'Ham'], [avg_sentence_spam, avg_sentence_ham], color=['orange', 'grey'])
plt.title('Average Sentence of Emails for Spam and Ham')
plt.xlabel('Email Type')
plt.ylabel('Average Sentence')
plt.show()
5. Relationship between Length and Spam
correlation = df['Length'].corr((df['result'] == 'spam').astype(int))
print("Correlation coefficient between email length and spam classification:", correlation)
sns.violinplot(data=df, x='Length', y='result', hue='result')
plt.xlabel('Email Length')
plt.ylabel('Spam Classification')
plt.title('Relationship between Email Length and Spam Classification')
plt.show()
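To make the correlation coefficient above more concrete, here is a minimal sketch of how a Pearson correlation between a numeric feature and a 0/1 spam flag can be computed by hand. The `pearson` helper and the toy numbers are illustrative assumptions, not values from the dataset; pandas' `corr` uses the Pearson method by default.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: message lengths and 0/1 spam flags (not from the dataset)
lengths = [20, 35, 150, 160, 40, 155]
is_spam = [0, 0, 1, 1, 0, 1]

r = pearson(lengths, is_spam)
print(f"Correlation between length and spam flag: {r:.3f}")
```

A value close to 1 means longer messages tend to be spam in this toy sample; a value near 0 would mean length carries little linear signal.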
6. Relationship between features
cm = df[['Length', 'num_words', 'num_sentence']].corr()
print("The Relationship between Features are ",cm )
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='flare', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Features')
plt.xlabel('Features')
plt.ylabel('Features')
plt.show()
Data Preprocessing
Before building machine learning models, we preprocess the email data to convert it into a suitable format for analysis. This involves tasks such as lowercasing, tokenization, removing special characters, stopwords, and punctuation, as well as stemming to reduce words to their root forms.
df['transform_text'] = df['emails'].str.lower()
# Tokenization
df['transform_text'] = df['transform_text'].apply(word_tokenize)
# Removing special characters
df['transform_text'] = df['transform_text'].apply(lambda x: [re.sub(r'[^a-zA-Z0-9\s]', '', word) for word in x])
# Removing stop words and punctuation
stop_words = set(stopwords.words('english'))
df['transform_text'] = df['transform_text'].apply(lambda x: [word for word in x if word not in stop_words and word not in string.punctuation])
# Stemming
ps = PorterStemmer()
df['transform_text'] = df['transform_text'].apply(lambda x: [ps.stem(word) for word in x])
# Convert the preprocessed text back to string
df['transform_text'] = df['transform_text'].apply(lambda x: ' '.join(x))
# Display the preprocessed data
print(df[['emails', 'transform_text']].head())
In the data preprocessing step, we’re getting our email data ready for analysis. First, we convert all text to lowercase so that uppercase and lowercase letters don’t cause confusion. Then, we break down each email into smaller parts called tokens using a process called tokenization. After that, we remove any special characters like symbols or emojis that don’t add useful information. Next, we take out common words like “the” or “and,” as well as punctuation marks, because they’re not helpful for identifying spam. Finally, we reduce words to their base form using a process called stemming, which helps in simplifying the data for analysis.
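The steps described above can be traced on a single message. The following is a simplified sketch using only the standard library: the small stop-word set and the `naive_stem` helper are rough stand-ins (assumptions for illustration) for NLTK's English stop-word list and `PorterStemmer`, and whitespace splitting replaces `word_tokenize`.

```python
import re

# Tiny illustrative stop-word list; the article uses NLTK's full English list
STOP_WORDS = {"the", "and", "a", "to", "you", "is", "have"}

def naive_stem(word):
    """Very rough stand-in for PorterStemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Tokenize on whitespace (NLTK's word_tokenize is more careful)
    tokens = text.split()
    # 3. Strip characters that are not letters or digits
    tokens = [re.sub(r"[^a-z0-9]", "", tok) for tok in tokens]
    # 4. Drop empty tokens and stop words
    tokens = [tok for tok in tokens if tok and tok not in STOP_WORDS]
    # 5. Reduce each token to a crude root form and rejoin
    return " ".join(naive_stem(tok) for tok in tokens)

print(preprocess("You have WON a prize!!! Claiming is easy."))
# prints: won prize claim easy
```

A real stemmer handles far more cases than this suffix stripper, so the output of the article's pipeline will differ in detail.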
df.head()
7. Most Common Words in Spam Emails
spam_emails = df[df['result'] == 'spam']['transform_text']
# Tokenize the text in spam emails
spam_words = ' '.join(spam_emails).split()
# Count occurrences of each word
word_counts = Counter(spam_words)
# Find the most common words
most_common_words = word_counts.most_common(10)
print("Top 10 Most Common Words in Spam Emails:")
for word, count in most_common_words:
    print(f"{word}: {count} occurrences")
# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(most_common_words))
# Plot Word Cloud
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Most Common Words in Spam Emails')
plt.axis('off')
# Plot Bar Graph
plt.subplot(1, 2, 2)
words, counts = zip(*most_common_words)
plt.bar(words, counts, color='orange')
plt.title('Bar Graph for Most Common Words in Spam Emails')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
8. Most Common Words in Ham Emails
ham_emails = df[df['result'] == 'ham']['transform_text']
# Tokenize the text in ham emails
ham_words = ' '.join(ham_emails).split()
# Count occurrences of each word
word_counts = Counter(ham_words)
# Find the most common words
most_common_words = word_counts.most_common(10)
print("Top 10 Most Common Words in ham Emails:")
for word, count in most_common_words:
    print(f"{word}: {count} occurrences")
# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(most_common_words))
# Plot Word Cloud
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Most Common Words in ham Emails')
plt.axis('off')
# Plot Bar Graph
plt.subplot(1, 2, 2)
words, counts = zip(*most_common_words)
plt.bar(words, counts, color='orange')
plt.title('Bar Graph for Most Common Words in ham Emails')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Preparing data for Machine Learning using Label Encoder and Vectorization
First, we use a Label Encoder to convert the ‘result’ column, which contains the labels for spam and non-spam (ham) emails, into numerical values. This step is essential because machine learning algorithms typically work with numerical data. Next, we use TfidfVectorizer to convert the email text into numerical vectors, keeping the 3000 highest-frequency terms, so that the algorithms can analyze it. Finally, we split the data into training and testing sets using train_test_split: 80% of the data is used for training (X_train and y_train) and 20% for testing (X_test and y_test), which lets us evaluate the model on emails it has not seen.
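To show what the encoding and vectorization steps produce, here is a small hand-rolled sketch on a toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the `tfidf_vectors` helper, the whitespace tokenization, and the three toy documents are illustrative assumptions, not part of the article's pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF rows) for a list of whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize each row so document length does not dominate
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Hypothetical toy corpus (already cleaned) and labels
docs = ["free prize win", "meet tomorrow", "win free phone"]
labels = ["spam", "ham", "spam"]

# Label encoding: classes are sorted, so ham -> 0 and spam -> 1
classes = sorted(set(labels))
y_encoded = [classes.index(label) for label in labels]

vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(y_encoded)
print([round(v, 3) for v in vectors[0]])
```

In the first document, "prize" gets a higher weight than "free" or "win" because it appears in only one document, which is the intuition behind TF-IDF: rarer terms are treated as more informative.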
encoder = LabelEncoder()
df['result'] = encoder.fit_transform(df['result'])
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['transform_text']).toarray()
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Modelling
We’re going to try 3 different Machine Learning Models:
- SVC (Support Vector Classifier)
- Random Forest Classifier
- Naive Bayes Classifier
# Put models in a dictionary
models = {'SVC': SVC(),
          'Random Forest': RandomForestClassifier(),
          'Naive Bayes': MultinomialNB()}
# Let's create a function to fit and score our models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models: a dict. of different Scikit-Learn machine learning models
    X_train: training set (no labels)
    X_test: testing set (no labels)
    y_train: training labels
    y_test: testing labels
    """
    # Setup random seed
    np.random.seed(42)
    # Make a dict to save model score
    model_score = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and store its accuracy in model_score
        model_score[name] = model.score(X_test, y_test)
    return model_score
model_score = fit_and_score(models=models,
                            X_train=X_train,
                            X_test=X_test,
                            y_train=y_train,
                            y_test=y_test)
model_score
Let’s compare accuracy and precision using bar graphs.
# Predict on the test set with each fitted classifier
y_pred_svc = models['SVC'].predict(X_test)
y_pred_rf = models['Random Forest'].predict(X_test)
y_pred_nb = models['Naive Bayes'].predict(X_test)
# Calculate precision scores for each classifier
precision_svc = precision_score(y_test, y_pred_svc)
precision_rf = precision_score(y_test, y_pred_rf)
precision_nb = precision_score(y_test, y_pred_nb)
# Create lists to store accuracies and precision scores
classifiers = ['SVC', 'Random Forest', 'Naive Bayes']
accuracies = [model_score[name] for name in classifiers]
precision_scores = [precision_svc, precision_rf, precision_nb]
# Plot bar graph for accuracies and precision scores side by side
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Plot bar graph for accuracies
axes[0].bar(classifiers, accuracies, color='orange')
axes[0].set_xlabel('Classifier')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy Comparison of Different Classifiers')
axes[0].set_ylim(0, 1)
# Plot bar graph for precision scores
axes[1].bar(classifiers, precision_scores, color='yellow')
axes[1].set_xlabel('Classifier')
axes[1].set_ylabel('Precision Score')
axes[1].set_title('Precision Score Comparison of Different Classifiers')
axes[1].set_ylim(0, 1)
plt.tight_layout()
plt.show()
Note: Since our SVC model is performing well, we don’t have to tune its hyperparameters. If you do need to tune a model, you can do so:
- By Hand
- Randomly with RandomizedSearchCV
- Exhaustively with GridSearchCV
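As a rough illustration of the exhaustive option, the sketch below runs `GridSearchCV` over a small SVC grid. The synthetic data from `make_classification` stands in for the TF-IDF training matrix, and the grid values are hypothetical examples rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF training matrix and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical grid; these values are illustrative only
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Try every combination with 5-fold cross-validation, scored by accuracy
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

`RandomizedSearchCV` follows the same pattern but takes `param_distributions` and an `n_iter` budget, sampling combinations instead of trying all of them, which can be cheaper on larger grids.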
Evaluating our Machine Learning Classifier beyond Accuracy
- ROC(Receiver Operating Characteristics) Curve and AUC(Area Under Curve)Score
- Confusion Matrix
- Precision
- Recall
- F-1 Score
- Classification Report
… and it would be great if cross-validation is used where possible.
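The metrics listed above all derive from the four cells of the confusion matrix. Here is a minimal sketch that computes them by hand on made-up labels (1 = spam, 0 = ham); the helper names and the toy predictions are assumptions for illustration, and in practice scikit-learn's own metric functions do this work.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def summary_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    # Precision: of everything flagged as spam, how much really was spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much was caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions
y_true_demo = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 0, 0, 1, 0, 1, 0, 1]

print(confusion_counts(y_true_demo, y_pred_demo))
print(summary_metrics(y_true_demo, y_pred_demo))
```

For a spam filter, precision matters because a false positive sends a legitimate email to the junk folder, while recall measures how much spam still slips through.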
To make comparisons and evaluate our trained models, first we need to make predictions.
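# Use the SVC fitted earlier in the models dictionary
svc_classifier = models['SVC']
y_pred = svc_classifier.predict(X_test)
# Plot ROC Curve and calculate AUC Score (from decision scores rather than hard labels)
y_scores = svc_classifier.decision_function(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_scores)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
display.plot()
plt.show()
# Confusion Matrix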
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()
Now that we’ve got a ROC Curve, an AUC Score and a Confusion Matrix, let’s get a Classification Report as well as cross-validated precision, recall and F1-score.
print(classification_report(y_test, y_pred))
Calculate Evaluation Metrics using Cross-Validation
# Cross-validated Accuracy
cv_accuracy = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'accuracy')
cv_accuracy = np.mean(cv_accuracy)
cv_accuracy
# Cross-validated Precision
cv_precision = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'precision')
cv_precision = np.mean(cv_precision)
cv_precision
# Cross-validated Recall
cv_recall = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'recall')
cv_recall = np.mean(cv_recall)
cv_recall
# Cross-validated F1-Score
cv_f1 = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'f1')
cv_f1 = np.mean(cv_f1)
cv_f1
Feature Importance
Feature importance is another way of asking, “Which features contributed most to the outcomes of the model, and how did they contribute?”
How feature importance is found differs for each machine learning model.
Since our data has only 2 columns, we don’t have to look for feature importance.
Model Prediction
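new_emails = [
    "Get a free iPhone now!",
    "Hey, how's it going?",
    "Congratulations! You've won a prize!",
    "Reminder: Meeting at 2 PM tomorrow."
]
# Apply the same cleaning steps used to build 'transform_text' during training
def clean_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [re.sub(r'[^a-zA-Z0-9\s]', '', word) for word in tokens]
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    return ' '.join(ps.stem(word) for word in tokens)
# Convert new data into numerical vectors using the trained TF-IDF vectorizer
new_X = tfidf.transform([clean_text(email) for email in new_emails])
new_X_dense = new_X.toarray()
# Use the trained SVM model to make predictions
svm_predictions = svc_classifier.predict(new_X_dense)
# Print the predictions
for email, prediction in zip(new_emails, svm_predictions):
    if prediction == 1:
        print(f"'{email}' is predicted as spam.")
    else:
        print(f"'{email}' is predicted as ham.")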
User Input Data Prediction
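def predict_email(email):
    # Clean the email with the same steps used for the training text
    tokens = word_tokenize(email.lower())
    tokens = [re.sub(r'[^a-zA-Z0-9\s]', '', word) for word in tokens]
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    cleaned = ' '.join(ps.stem(word) for word in tokens)
    # Convert email into numerical vector using the trained TF-IDF vectorizer
    email_vector = tfidf.transform([cleaned])
    # Convert sparse matrix to dense array
    email_vector_dense = email_vector.toarray()
    # Use the trained SVM model to make predictions
    prediction = svc_classifier.predict(email_vector_dense)
    # Print the prediction
    if prediction[0] == 1:
        print("The email is predicted as spam.")
    else:
        print("The email is predicted as ham.")
# Get user input for email
user_email = input("Enter the email text: ")
# Predict whether the input email is spam or ham
predict_email(user_email)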
The predict_email function takes an email as input, converts it into a numerical vector using TF-IDF, and predicts whether it’s spam or ham using the trained SVM model. For instance, if you enter an email claiming you’ve won a prize, it will likely classify it as spam. By analyzing the email’s content and comparing it to patterns learned from the training data, the function helps users distinguish between legitimate and unwanted messages and keeps the inbox clean and secure.
6. Experimentation
If you haven’t reached your evaluation target, you need to improve your accuracy, precision, recall and F1 scores, for example by choosing another model or by tuning hyperparameters.
Conclusion:
Email spam detection using machine learning provides a strong solution to the annoying issue of unwanted messages. By cleaning up and organizing the data, creating useful features, and building smart models, we can make effective filters that keep our emails safe. Since email is so important for communication, it’s crucial to have good spam filters. These filters help us avoid clutter in our inboxes and make sure our digital conversations stay secure. With continued development, we can keep improving these systems to ensure our email experience is smooth and hassle-free.
