Email Spam Detection using Machine Learning

Email is one of the most important communication tools, widely used in business, education, and by individual users. Spam detection is therefore an important feature for both user experience and security. Spam mail, or junk mail, is email sent to a massive number of users at one time, frequently containing cryptic messages, scams, or, most dangerously, phishing content. A spam detector, shown in Figure 1, is a program that detects unsolicited, unwanted, and virus-infected emails and prevents them from reaching a user’s inbox.

Figure 1: Spam Detector

We’re going to take the following approach:
1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

1. Problem Definition:
Given the content of an email, can we predict whether it is spam or ham (legitimate)?

2. Data:
The data was downloaded from Kaggle: https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification

3. Evaluation:
If we can achieve above 97% accuracy at predicting whether an email is spam or ham, we’ll pursue the project.

4. Features:
This section describes each of the features in the data.

Data Dictionary

  1. result : whether an email is spam or ham
  2. emails : content of email

Preparing the tools

We’re going to use:
->Pandas for analyzing, cleaning, exploring, and manipulating data.
->NumPy for numerical operations such as setting the random seed and averaging scores.
->Matplotlib and Seaborn for creating statistical data visualizations.

# Import all the tools we need

# Regular EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preprocessing
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from collections import Counter
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Model Building
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

Load Data

df = pd.read_csv('spam.csv', encoding= 'latin1')
df.head()

The code df.head() shows the first 5 emails of the dataset. Each email has two parts: a label (“ham” for regular emails or “spam” for unwanted ones) and the actual content of the email. The dataset also contains some extra columns with no useful information; they are labeled “Unnamed” and hold missing values. Looking at this sample gives us an idea of what kinds of emails are in the dataset and how they are organized.

Exploring our data

df.tail()

Total number of emails in the data

Data Types of each column

Checking for missing values.

Data Cleaning

df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], inplace=True)
df.rename(columns={'v1': 'result', 'v2': 'emails'}, inplace=True)
df.head()
# Checking for null values
df.isnull().sum()
# There are no null values
# Checking for duplicate values
df.duplicated().sum()
df = df.drop_duplicates(keep='first')
df.shape

Now that our data is clean, let’s do EDA.

Exploratory Data Analysis (EDA)

  1. Distribution of Labels
df['result'].value_counts()
# Plotting
plt.figure(figsize=(10, 6))
plt.pie(df['result'].value_counts(), labels=df['result'].value_counts().index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Spam and Non-Spam Emails')
plt.axis('equal')
plt.show()

From the above graph we can see that most emails in the dataset (87.4%) are non-spam (ham), while only 12.6% are spam. This imbalance matters because a model could reach a high accuracy simply by favoring the majority class while still missing spam emails. To guard against this, we should keep the class proportions the same in the training and test sets and judge the model on precision and recall as well as accuracy. That way we can make sure our model is good at finding both spam and non-spam emails.
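One technique often used with imbalanced labels like these is a stratified train/test split. The sketch below uses made-up toy labels (10% spam) rather than the real dataset, and the `stratify=` argument is not used in this article's own code; treat it as one possible option, not the method followed here.

```python
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 90 ham (0) and 10 spam (1), roughly mirroring the imbalance
labels = [0] * 90 + [1] * 10
texts = [f"message {i}" for i in range(len(labels))]

# stratify=labels keeps the spam/ham ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print("spam share in train:", sum(y_tr) / len(y_tr))
print("spam share in test:", sum(y_te) / len(y_te))
```

With a stratified split, a small test set cannot end up with almost no spam by chance, which makes precision and recall on the test set more trustworthy.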

2. Average Length of Emails for Spam and Ham

# Length of each email in characters
df['Length'] = df['emails'].apply(len)
# Number of word tokens in each email
df['num_words'] = df['emails'].apply(word_tokenize).apply(len)
df['num_sentence'] = df['emails'].apply(sent_tokenize).apply(len)
df.head(2)
avg_length_spam = df[df['result'] == 'spam']['Length'].mean()
avg_length_ham = df[df['result'] == 'ham']['Length'].mean()
#plotting
print("Average Length of Spam Emails:", avg_length_spam)
print("Average Length of Ham Emails:", avg_length_ham)
plt.bar(['Spam', 'Ham'], [avg_length_spam, avg_length_ham], color=['Blue', 'green'])
plt.title('Average Length of Emails for Spam and Ham')
plt.xlabel('Email Type')
plt.ylabel('Average Length')
plt.show()

After looking at the lengths of spam and regular (ham) emails, we found that spam emails are much longer on average, around 137 characters. On the other hand, regular emails are much shorter, averaging about 70 characters. This means that spam emails tend to be more wordy and detailed, possibly because they’re trying to grab your attention with lots of information. Regular emails, like the ones you get from friends or for work, are usually shorter and to the point. Understanding this helps us make better tools to filter out spam and keep our inboxes organized.

3. Average Number of Words per Email for Spam and Ham

avg_word_spam = df[df['result'] == 'spam']['num_words'].mean()
avg_word_ham = df[df['result'] == 'ham']['num_words'].mean()
print("Average Words of Spam Emails:", avg_word_spam)
print("Average Words of Ham Emails:", avg_word_ham)

# Plotting the graph
plt.bar(['Spam', 'Ham'], [avg_word_spam, avg_word_ham], color=['Black', 'blue'])
plt.title('Average Words of Emails for Spam and Ham')
plt.xlabel('Email Type')
plt.ylabel('Average Words')
plt.show()

From the above graph, we can see that spam emails are longer, with an average of about 27 words per email. On the other hand, regular ham emails are shorter, averaging around 17 words per email. This means spam emails tend to be more wordy, maybe because they contain advertisements or misleading information. Meanwhile, regular emails are more straightforward and direct. Understanding this helps us create better filters to catch spam and keep our inboxes clean of unwanted messages, making it easier to find the emails that matter to us.

4. Average Number of Sentences per Email for Spam and Ham

avg_sentence_spam = df[df['result'] == 'spam']['num_sentence'].mean()
avg_sentence_ham = df[df['result'] == 'ham']['num_sentence'].mean()
print("Average Sentence of Spam Emails:", avg_sentence_spam)
print("Average Sentence of Ham Emails:", avg_sentence_ham)

# Plotting the graph
plt.bar(['Spam', 'Ham'], [avg_sentence_spam, avg_sentence_ham], color=['orange', 'grey'])
plt.title('Average Sentence of Emails for Spam and Ham')
plt.xlabel('Email Type')
plt.ylabel('Average Sentence')
plt.show()

From the above graph, we can see that spam emails tend to contain more sentences than regular emails. On average, spam emails have about 3 sentences, while regular emails have about 2. This suggests that spam emails might be trying to say more or convince you of something, while regular emails are usually shorter and more straightforward. Understanding this difference helps us build better systems to detect and filter out spam emails.
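The per-class averages in this part of the EDA can also be computed in a single step with a pandas `groupby`. This is a small sketch on a made-up four-row frame, not the real dataset; it counts words with a plain whitespace split instead of NLTK's `word_tokenize`, which is an assumption made to keep the example dependency-light.

```python
import pandas as pd

# Toy frame (assumption): same column names as the article, invented messages
toy = pd.DataFrame({
    "result": ["ham", "spam", "ham", "spam"],
    "emails": ["ok see you", "free prize call now", "come home", "txt to win"],
})

# Character length and whitespace-based word count per message
toy["Length"] = toy["emails"].str.len()
toy["num_words"] = toy["emails"].str.split().str.len()

# One row per class, one column per averaged feature
summary = toy.groupby("result")[["Length", "num_words"]].mean()
print(summary)
```

The resulting table has one row per label, so the spam and ham averages can be read side by side or passed straight to a plotting call.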

5. Relationship between Length and Spam

correlation = df['Length'].corr((df['result'] == 'spam').astype(int))
print("Correlation coefficient between email length and spam classification:", correlation)

sns.violinplot(data=df, x='Length', y='result', hue='result')
plt.xlabel('Email Length')
plt.ylabel('Spam Classification')
plt.title('Relationship between Email Length and Spam Classification')
plt.show()

We found a positive correlation (correlation coefficient: 0.38) between email length and spam classification. This means that, on average, spam emails tend to be longer than non-spam emails. However, the correlation is not very strong, which indicates that other factors also influence whether an email is classified as spam. Nonetheless, this relationship suggests that email length can be considered as one of the features in the classification process, alongside other relevant factors.

6. Relationship between features

cm = df[['Length', 'num_words', 'num_sentence']].corr()
print("The Relationship between Features are ",cm )
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='flare', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Features')
plt.xlabel('Features')
plt.ylabel('Features')
plt.show()

The correlation matrix indicates strong positive correlations between email length and the number of words (0.97) as well as between the number of words and the number of sentences (0.68). This suggests that longer emails tend to have more words, and emails with more words tend to have more sentences. However, the correlation between email length and the number of sentences is weaker (0.62). This implies that while longer emails may have more sentences, the relationship is not as strong as with the number of words. Understanding these relationships helps us grasp how different features contribute to the overall structure and content of emails, aiding in spam classification.

Data Preprocessing

Before building machine learning models, we preprocess the email data to convert it into a suitable format for analysis. This involves tasks such as lowercasing, tokenization, removing special characters, stopwords, and punctuation, as well as stemming to reduce words to their root forms.

df['transform_text'] = df['emails'].str.lower()
# Tokenization
df['transform_text'] = df['transform_text'].apply(word_tokenize)

# Removing special characters
df['transform_text'] = df['transform_text'].apply(lambda x: [re.sub(r'[^a-zA-Z0-9\s]', '', word) for word in x])

# Removing stop words and punctuation
stop_words = set(stopwords.words('english'))
df['transform_text'] = df['transform_text'].apply(lambda x: [word for word in x if word not in stop_words and word not in string.punctuation])

# Stemming
ps = PorterStemmer()
df['transform_text'] = df['transform_text'].apply(lambda x: [ps.stem(word) for word in x])

# Convert the preprocessed text back to string
df['transform_text'] = df['transform_text'].apply(lambda x: ' '.join(x))

# Display the preprocessed data
print(df[['emails', 'transform_text']].head())

In the data preprocessing step, we’re getting our email data ready for analysis. First, we convert all text to lowercase so that uppercase and lowercase letters don’t cause confusion. Then, we break down each email into smaller parts called tokens using a process called tokenization. After that, we remove any special characters like symbols or emojis that don’t add useful information. Next, we take out common words like “the” or “and,” as well as punctuation marks, because they’re not helpful for identifying spam. Finally, we reduce words to their base form using a process called stemming, which helps in simplifying the data for analysis.
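The steps described above can also be wrapped in one reusable function. The following is a dependency-free sketch, not the article's NLTK pipeline: `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny illustrative stand-in for NLTK's English list, tokens are split on whitespace, and stemming is left out.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK's English list is much longer
STOP_WORDS = {"the", "and", "a", "an", "is", "to", "you", "have", "in", "of"}

def clean_text(text):
    """Lowercase, tokenize on whitespace, strip non-alphanumerics, drop stop words."""
    tokens = re.findall(r"\S+", text.lower())
    # Remove every character that is not a lowercase letter or a digit
    tokens = [re.sub(r"[^a-z0-9]", "", tok) for tok in tokens]
    # Drop tokens that became empty and tokens that are stop words
    tokens = [tok for tok in tokens if tok and tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text("WINNER!! You have won a FREE prize, call 0800-123 now.")
print(cleaned)
```

Keeping the cleaning in a function makes it possible to apply exactly the same steps to new emails at prediction time as were applied to the training data.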

df.head()

7. Most Common Words in Spam Emails

spam_emails = df[df['result'] == 'spam']['transform_text']

# Tokenize the text in spam emails
spam_words = ' '.join(spam_emails).split()

# Count occurrences of each word
word_counts = Counter(spam_words)

# Find the most common words
most_common_words = word_counts.most_common(10)

print("Top 10 Most Common Words in Spam Emails:")
for word, count in most_common_words:
    print(f"{word}: {count} occurrences")

# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(most_common_words))

# Plot Word Cloud
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Most Common Words in Spam Emails')
plt.axis('off')

# Plot Bar Graph
plt.subplot(1, 2, 2)
words, counts = zip(*most_common_words)
plt.bar(words, counts, color='orange')
plt.title('Bar Graph for Most Common Words in Spam Emails')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Looking at the most common words found in spam emails, we see patterns that spammers often use to catch our attention or convince us to act. Words like “call,” “free,” and “txt” show up frequently, suggesting offers or requests for action. This helps us understand what to watch out for in our emails to avoid falling for spam. By knowing these common tricks, we can be more careful about which emails we open or respond to, keeping our inboxes safer. Email filters also use this information to better recognize and block spam messages, making our email experience more secure.

8. Most Common Words in Ham Emails

ham_emails = df[df['result'] == 'ham']['transform_text']

# Tokenize the text in ham emails
ham_words = ' '.join(ham_emails).split()

# Count occurrences of each word
word_counts = Counter(ham_words)

# Find the most common words
most_common_words = word_counts.most_common(10)

print("Top 10 Most Common Words in ham Emails:")
for word, count in most_common_words:
    print(f"{word}: {count} occurrences")

# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(most_common_words))

# Plot Word Cloud
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Most Common Words in ham Emails')
plt.axis('off')

# Plot Bar Graph
plt.subplot(1, 2, 2)
words, counts = zip(*most_common_words)
plt.bar(words, counts, color='orange')
plt.title('Bar Graph for Most Common Words in ham Emails')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

The top 10 most common words in non-spam emails are “u,” “go,” “nt,” “get,” “2,” “gt,” “lt,” “come,” “ok,” and “got.” These words show how people talk in emails, with shortcuts like “u” instead of “you” and “nt” for “not.” They also reveal common topics like going somewhere or confirming things with “ok.” Understanding these words helps in spotting normal emails. It tells us what to expect in regular messages, making it easier to spot unusual or suspicious ones, like spam.

Preparing data for Machine Learning using Label Encoder and Vectorization

First, we use a Label Encoder to convert the ‘result’ column, which contains the labels for spam and non-spam (ham) emails, into numerical values. This step is needed because machine learning algorithms typically work with numerical data. Next, we use TfidfVectorizer to convert the preprocessed email text into numerical vectors, keeping the 3000 most frequent terms. Finally, we split the data into training and testing sets using train_test_split: 80% of the data is used for training (X_train and y_train) and 20% is held out for testing (X_test and y_test), which lets us evaluate the model on emails it has not seen.
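An alternative arrangement, shown here only as a sketch, is to put the vectorizer and the classifier in a scikit-learn `Pipeline` so that TF-IDF is fitted on the training texts alone. The messages below are invented toy data, and `MultinomialNB` is chosen just to keep the example fast; neither is taken from the article's experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy, already-preprocessed messages (assumption: not from the real dataset)
train_texts = ["free prize call now", "win free txt", "ok see you later", "come home soon"]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
test_texts = ["free lottery call", "see you at home"]

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=3000)),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(test_texts)
vocabulary = pipeline.named_steps["tfidf"].vocabulary_
print(predictions)
print("lottery" in vocabulary)  # this word occurs only in the test texts
```

Because the vocabulary is learned from the training texts only, words that appear solely in the test set cannot influence the features, which keeps the test score an honest estimate.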

encoder = LabelEncoder()
df['result'] = encoder.fit_transform(df['result'])
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['transform_text']).toarray()
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Modelling

We’re going to try 3 different Machine Learning Models:

  1. SVC (Support Vector Classifier)
  2. Random Forest Classifier
  3. Naive Bayes Classifier
# Put models in a dictionary
models = {'SVC': SVC(),
'Random Forest': RandomForestClassifier(),
'Naive Bayes' : MultinomialNB()}

# Let's create a function to fit and score our models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models: a dict of different Scikit-Learn machine learning models
    X_train: training set (no labels)
    X_test: testing set (no labels)
    y_train: training labels
    y_test: testing labels
    """
    # Setup random seed
    np.random.seed(42)
    # Make a dict to save model scores
    model_score = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and store its score in model_score
        model_score[name] = model.score(X_test, y_test)
    return model_score

model_score = fit_and_score(models= models,
X_train= X_train,
X_test= X_test,
y_train= y_train,
y_test= y_test)
model_score

SVC is performing very well.

Let’s compare accuracy and precision using bar graphs.

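# Predict on the test set with each fitted classifier
y_pred_svc = models['SVC'].predict(X_test)
y_pred_rf = models['Random Forest'].predict(X_test)
y_pred_nb = models['Naive Bayes'].predict(X_test)

# Calculate accuracy scores for each classifier
accuracy_svc = accuracy_score(y_test, y_pred_svc)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_nb = accuracy_score(y_test, y_pred_nb)

# Calculate precision scores for each classifier
precision_svc = precision_score(y_test, y_pred_svc)
precision_rf = precision_score(y_test, y_pred_rf)
precision_nb = precision_score(y_test, y_pred_nb)

# Create lists to store accuracies and precision scores
classifiers = ['SVC', 'Random Forest', 'Naive Bayes']
accuracies = [accuracy_svc, accuracy_rf, accuracy_nb]
precision_scores = [precision_svc, precision_rf, precision_nb]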

# Plot bar graph for accuracies and precision scores side by side
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot bar graph for accuracies
axes[0].bar(classifiers, accuracies, color='orange')
axes[0].set_xlabel('Classifier')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy Comparison of Different Classifiers')
axes[0].set_ylim(0, 1)

# Plot bar graph for precision scores
axes[1].bar(classifiers, precision_scores, color='yellow')
axes[1].set_xlabel('Classifier')
axes[1].set_ylabel('Precision Score')
axes[1].set_title('Precision Score Comparison of Different Classifiers')
axes[1].set_ylim(0, 1)
plt.tight_layout()
plt.show()

After comparing the models, we find that SVC performs the best, so we’ll use SVC to make spam predictions with its predict() method. This process helps us choose the most accurate model for prediction.

Note: Since our SVC model is performing well, we don’t have to tune its hyperparameters. If you do need to tune a model, you can do so:

  • By Hand
  • Randomly with RandomizedSearchCV
  • Exhaustively with GridSearchCV
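As an illustration of the exhaustive option, here is a minimal `GridSearchCV` sketch for an SVC. It runs on synthetic data from `make_classification` rather than the email vectors, and the parameter grid is a hypothetical example, not a set of recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); in this article it would be X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical search space: illustrative values only
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Try every combination with 5-fold cross-validation, scored by F1
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` accepts a similar setup but samples a fixed number of combinations, which can be cheaper when the grid is large.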

Evaluating our Machine Learning Classifier beyond Accuracy

  • ROC (Receiver Operating Characteristic) Curve and AUC (Area Under the Curve) Score
  • Confusion Matrix
  • Precision
  • Recall
  • F-1 Score
  • Classification Report

… and it would be great if cross-validation is used where possible.
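To make the metrics in the list above concrete, the sketch below computes accuracy, precision, recall, and F1 by hand from the four confusion-matrix counts. The label vectors are invented for illustration, and `classification_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Precision: of the emails flagged as spam, how many really are spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam emails, how many were flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented example: 3 spam (1) and 7 ham (0) emails
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
scores = classification_metrics(y_true_demo, y_pred_demo)
print(scores)
```

In this toy case the accuracy is 0.8 while precision and recall are both 2/3, which shows why accuracy alone can look better than the spam-catching ability really is on imbalanced data.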

To make comparisons and evaluate our trained models, first we need to make predictions.

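# Reuse the SVC that was fitted in fit_and_score
svc_classifier = models['SVC']
y_pred = svc_classifier.predict(X_test)
# Plot ROC Curve and calculate AUC Score
# (use the continuous decision scores so the curve covers many thresholds)
y_score = svc_classifier.decision_function(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
display.plot()
plt.show()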
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot()
plt.show()

Now we’ve got a ROC Curve, AUC Score and a Confusion Matrix, let’s get a Classification Report as well as cross-validated precision, recall and f1-score.

print(classification_report(y_test, y_pred))

Calculate Evaluation Metrics using Cross-Validation

# Cross-validated Accuracy
cv_accuracy = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'accuracy')
cv_accuracy = np.mean(cv_accuracy)
cv_accuracy

# Cross-validated Precision
cv_precision = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'precision')
cv_precision = np.mean(cv_precision)
cv_precision

# Cross-validated Recall
cv_recall = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'recall')
cv_recall = np.mean(cv_recall)
cv_recall

# Cross-validated F1-Score
cv_f1 = cross_val_score(svc_classifier, X, y, cv= 5, scoring= 'f1')
cv_f1 = np.mean(cv_f1)
cv_f1
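
The four cross-validated scores above can also be gathered in one call with scikit-learn's `cross_validate`, which avoids refitting the model separately for each metric. The sketch uses synthetic data from `make_classification` as a stand-in for the email vectors, so the numbers it prints are not the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); in this article it would be X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Score every fold with all four metrics in a single cross-validation run
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(SVC(), X_demo, y_demo, cv=5, scoring=scoring)

# cross_validate stores each metric under the key "test_<name>"
cv_means = {name: cv_results[f"test_{name}"].mean() for name in scoring}
print(cv_means)
```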

Feature Importance

Feature importance is another way of asking “which features contributed most to the outcomes of the model, and how did they contribute?”

How feature importance is found differs for each machine learning model.

Since our data has only 2 columns (the label and the email text), we don’t have to look for feature importance.

Model Prediction

new_emails = [
"Get a free iPhone now!",
"Hey, how's it going?",
"Congratulations! You've won a prize!",
"Reminder: Meeting at 2 PM tomorrow."
]

# Convert new data into numerical vectors using the trained tfidf_vectorizer
new_X = tfidf.transform(new_emails)
new_X_dense = new_X.toarray()

# Use the trained SVM model to make predictions
svm_predictions = svc_classifier.predict(new_X_dense)

# Print the predictions
for email, prediction in zip(new_emails, svm_predictions):
    if prediction == 1:
        print(f"'{email}' is predicted as spam.")
    else:
        print(f"'{email}' is predicted as ham.")

User Input Data Prediction

def predict_email(email):
    # Convert email into numerical vector using the trained TF-IDF vectorizer
    email_vector = tfidf.transform([email])

    # Convert sparse matrix to dense array
    email_vector_dense = email_vector.toarray()

    # Use the trained SVM model to make predictions
    prediction = svc_classifier.predict(email_vector_dense)

    # Print the prediction
    if prediction[0] == 1:
        print("The email is predicted as spam.")
    else:
        print("The email is predicted as ham.")

# Get user input for email
user_email = input("Enter the email text: ")

# Predict whether the input email is spam or ham
predict_email(user_email)

The predict_email function takes an email as input, converts it into a numerical vector using TF-IDF, and predicts whether it’s spam or ham using the trained SVM model. For instance, if you enter an email claiming you’ve won a prize, it will likely classify it as spam. By comparing the email’s content to patterns learned from the training data, the function helps users distinguish between legitimate and unwanted messages, and it wraps the spam detection steps in a single, easy-to-use call.

6. Experimentation

If you haven’t met your evaluation target, you have to improve your accuracy, precision, recall, and F1 scores, for example by choosing another model or by tuning hyperparameters.

Conclusion:

Email spam detection using machine learning provides a strong solution to the annoying issue of unwanted messages. By cleaning up and organizing the data, creating useful features, and building smart models, we can make effective filters that keep our emails safe. Since email is so important for communication, it’s crucial to have good spam filters. These filters help us avoid clutter in our inboxes and make sure our digital conversations stay secure. With continued development, we can keep improving these systems to ensure our email experience is smooth and hassle-free.
