
Day 7 – Mini Project: Predicting Sales with Linear Regression (Retail Dataset)

Step 1: Framing the Problem

The question I wanted to answer was simple:

“Can we predict weekly retail sales using available business metrics?”

I imagined I was working for a retail chain that wanted to plan inventory, staffing, and promotions more effectively. The data included:

  • Advertising Spend: how much marketing budget was used each week.
  • Discount %: average discount applied to products.
  • Store Size: because large stores usually have higher base sales.
  • Month/Season: to capture seasonality effects.
  • Holiday Flag: a simple binary variable showing whether the week contained a holiday or not.

Right away, this sounded like a business question any analyst could relate to, with just one extra word: predict.

Step 2: Understanding the Data

I downloaded a dataset from Kaggle: a retail sales CSV with a few thousand rows. It wasn’t large, but it was enough to practice building a baseline model. Here’s what I did first (the usual cleaning tasks):

  • Checked for missing values and replaced a few NA values using the median.
  • Dropped redundant columns (like IDs or duplicate features).
  • Converted categorical variables like “Month” and “Holiday_Flag” into proper categorical types, ready for encoding.
  • Scaled numeric values so the coefficients would end up on a comparable scale.

A few lines of Python later, my data felt “clean enough” to move forward, though not perfect. And that’s something I’ve learned: perfect data is a myth. At some point, you just move forward and let the model do its job.
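For reference, those few lines looked roughly like this. It’s a sketch, not my exact code, and Store_ID is a hypothetical column name standing in for whatever ID column your CSV actually has:

import pandas as pd

data = pd.read_csv('retail_sales.csv')

# Fill the few missing numeric values with each column's median
for col in ['Advertising_Spend', 'Discount', 'Store_Size']:
    data[col] = data[col].fillna(data[col].median())

# Drop redundant columns (Store_ID is a hypothetical ID column here)
data = data.drop(columns=['Store_ID'], errors='ignore')

# Treat Month and Holiday_Flag as categories rather than plain numbers
data['Month'] = data['Month'].astype('category')
data['Holiday_Flag'] = data['Holiday_Flag'].astype('category')

# Scaling happens later, inside the pipeline's StandardScaler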

Step 3: Building the Model. My First “Hello World” of ML!

Instead of going straight into complex algorithms, I chose Linear Regression, the simplest and most transparent model.

Why?
Because linear regression doesn’t just predict; it teaches. It shows you how each feature contributes to the target variable, and that’s exactly what analysts need when they start learning machine learning.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd

# Load data
data = pd.read_csv('retail_sales.csv')

# Select features and target
X = data[['Advertising_Spend', 'Discount', 'Store_Size', 'Month', 'Holiday_Flag']]
y = data['Weekly_Sales']

# Separate categorical and numeric features
categorical = ['Month', 'Holiday_Flag']
numeric = ['Advertising_Spend', 'Discount', 'Store_Size']

# Define preprocessing
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), categorical),
    ('num', StandardScaler(), numeric)
])

# Create the pipeline
model = Pipeline([
    ('prep', preprocessor),
    ('lr', LinearRegression())
])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print("R-squared:", round(r2, 3))

When I first ran this, the output was:

R-squared: 0.78

That number meant my model explained 78% of the variance in sales. Not bad at all for a first attempt.
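For anyone wondering what that score actually computes: R² compares the model’s squared errors against a naive baseline that always predicts the mean. Here’s the same number by hand, a small sketch reusing the variables from the code above:

import numpy as np

# R² = 1 - (sum of squared residuals) / (total sum of squares)
y_pred = model.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print("Manual R-squared:", round(1 - ss_res / ss_tot, 3))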

Step 4: Reading the Coefficients. Where the Analyst in Me Came Alive

Linear regression is beautiful because it doesn’t just give you predictions; it gives you insight. You can look at the coefficients and understand what’s driving your target variable. Here’s what stood out when I interpreted my model:

  • Advertising Spend had a strong positive relationship with sales. Every extra dollar in marketing spend brought a measurable increase in weekly sales.
  • Discount % was interesting… small discounts boosted sales, but very large discounts started eating into profit.
  • Store Size was naturally influential. Larger stores simply sold more.
  • Holidays created spikes that the model could now detect automatically.
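A minimal sketch of how to pull those coefficients out of the pipeline (get_feature_names_out assumes a reasonably recent scikit-learn, 1.0 or later):

# Map each coefficient to its (possibly one-hot encoded) feature name
feature_names = model.named_steps['prep'].get_feature_names_out()
coefficients = model.named_steps['lr'].coef_

# The standardized numeric features have roughly comparable magnitudes;
# the one-hot columns are unscaled, so compare those with more care
for name, coef in sorted(zip(feature_names, coefficients), key=lambda pair: -abs(pair[1])):
    print(f"{name:30} {coef:12.2f}")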

It felt like rediscovering everything I already knew as an analyst, now through the lens of mathematics. This is where machine learning stopped feeling like “black box magic” and started feeling like augmented analytics.

Step 5: Evaluating the Model

After training the model, I tested it on unseen data and evaluated the results using:

  • R² (R-squared) – 0.78, meaning my model captured most of the pattern.
  • MAE (Mean Absolute Error) – around 1,500, meaning on average my predictions were off by about ₹1,500 per week.
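Both metrics are one import away in scikit-learn; a quick sketch:

from sklearn.metrics import mean_absolute_error, r2_score

y_pred = model.predict(X_test)
print("R-squared:", round(r2_score(y_test, y_pred), 3))
print("MAE:", round(mean_absolute_error(y_test, y_pred), 2))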

These weren’t perfect, but they were interpretable. I could now explain, “The model predicts within a reasonable margin for most weeks, though it struggles around seasonal spikes.”

That line right there is what separates a good analyst from a good ML practitioner: interpretability.

Step 6: What Went Wrong (and Why It’s a Good Thing)

Not everything worked perfectly. Here’s what my model struggled with:

  • It didn’t capture sudden spikes in sales during festivals or promotions.
  • It underpredicted sales for smaller stores, likely because I had fewer data points for them.
  • It over-relied on discount percentage (which, in real life, isn’t linear).

But I didn’t see these as failures. Each miss was a clue about what the model didn’t know yet. And that’s the beauty of machine learning: your mistakes guide your next step.
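Case in point: if I revisit the discount problem, one simple fix (hypothetical, untested here) is to add a squared discount term so the straight line can bend:

# Hypothetical tweak: a squared term lets the model capture the point
# where bigger discounts stop helping
X = X.copy()
X['Discount_Sq'] = X['Discount'] ** 2
numeric = ['Advertising_Spend', 'Discount', 'Discount_Sq', 'Store_Size']

# Rebuild the same preprocessor and pipeline with the extra column,
# then re-split and refit exactly as before
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), categorical),
    ('num', StandardScaler(), numeric)
])
model = Pipeline([('prep', preprocessor), ('lr', LinearRegression())])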

Step 7: From Predicting to Understanding

Here’s what I took away from this project:

  1. Simple models are the best teachers.
    Linear regression taught me how features influence outcomes before I got lost in tuning and complexity.
  2. Data storytelling doesn’t stop in ML.
    When I explained my model’s results to a friend, I didn’t talk about algorithms. I said, “Sales go up when we advertise more, but too much discounting doesn’t help.” That’s storytelling, not statistics.
  3. Models aren’t oracles.
    They don’t predict the future; they predict patterns. And sometimes, the future doesn’t follow patterns.