We like shiny things. In machine learning, that often means grabbing a deep model before asking: do I actually need it? I recently ran a small, fair test on AAPL closing prices using two very different tools: a classic ARIMA model and an LSTM neural network. To my surprise (and slight disappointment as a DL fan), ARIMA won.
This post isn’t a victory lap for statistics or a takedown of neural nets. It’s a reminder that method should follow problem, not trend. Here’s what I did, what went right/wrong, and a simple checklist you can reuse before you reach for the biggest hammer in your toolbox.
The Task (and Why It’s Tricky)
Goal: Predict the next day’s closing price for AAPL.
Data: Business-day closes from 2020–2025.
Setup: Train on the first ~80% of the timeline, test on the last ~20%. No shuffling, no cheating.
Time series forecasting has two booby traps:
- Leakage: letting future info “sneak” into training.
- Unfair evaluation: training or tuning on the test window (even accidentally), or using a look-ahead that wouldn’t exist in real life.
I kept things honest: all scaling fit on train only, and evaluation was rolling one-step-ahead (predict day t+1 using data up to day t, then move forward).
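In code, that rolling loop is small. Here’s a generic sketch (forecast_next is a hypothetical stand-in for whichever model makes the one-step call; train and test are the time-ordered frames from the split described above):

def forecast_next(history):
    # Placeholder one-step forecaster (naive: repeat the last value);
    # swap in ARIMA or the LSTM here
    return history[-1]

history = list(train["Close"])            # everything known at the start of the test window
preds = []
for actual in test["Close"]:              # walk the test window in time order
    preds.append(forecast_next(history))  # forecast for day t+1 using data up to day t
    history.append(actual)                # only now reveal the actual close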
The Two Models
1) ARIMA (AutoRegressive Integrated Moving Average)
- I ran a small grid search over (p, d, q) on the train set and picked the order with the lowest AIC.
- Fit once on the train data.
- Rolled through the test period: forecast one day, then update the model state with the actual; no parameter refits.
Why this works: for short-horizon, single-variable scenarios, ARIMA captures short-term autocorrelation and local drift efficiently.
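The order-selection step can be as small as this sketch (the (p, d, q) ranges are illustrative, not my exact grid; it assumes the train frame from the split described above):

import itertools
from statsmodels.tsa.arima.model import ARIMA

best_aic, best_order = float("inf"), None
for p, d, q in itertools.product(range(4), range(2), range(4)):
    try:
        aic = ARIMA(train["Close"], order=(p, d, q),
                    enforce_stationarity=False, enforce_invertibility=False).fit().aic
    except Exception:
        continue  # some orders fail to converge; skip them
    if aic < best_aic:
        best_aic, best_order = aic, (p, d, q)  # keep the lowest-AIC order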
2) LSTM (Long Short-Term Memory)
- Scaling: MinMaxScaler fitted on train only.
- Lookback: 40 days → predict next day.
- Architecture: LSTM(128) → Dropout → LSTM(64) → Dense → 1.
- Training: time-ordered validation, EarlyStopping, ReduceLROnPlateau.
- Evaluation: rolling next-day predictions, inverse-scaled back to dollars.
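In Keras terms, that setup looks roughly like this (a sketch: the layer sizes and lookback match the description above, while the dropout rate, patience values, and training settings are illustrative):

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

LOOKBACK = 40

def make_sequences(scaled, lookback=LOOKBACK):
    # Slide a 40-day window over the scaled series; the target is the next day's value
    X, y = [], []
    for i in range(lookback, len(scaled)):
        X.append(scaled[i - lookback:i])
        y.append(scaled[i])
    return np.array(X).reshape(-1, lookback, 1), np.array(y)

lstm_model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(LOOKBACK, 1)),
    Dropout(0.2),
    LSTM(64),
    Dense(1),
])
lstm_model.compile(optimizer="adam", loss="mse")
callbacks = [EarlyStopping(patience=10, restore_best_weights=True),
             ReduceLROnPlateau(factor=0.5, patience=5)]
# lstm_model.fit(X_train, y_train, validation_data=(X_val, y_val),
#                epochs=200, batch_size=32, shuffle=False, callbacks=callbacks)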
Why this can struggle: with only one feature (Close), an LSTM often learns a smooth, conservative mapping. When the market rips upward fast, it tends to lag.
Results (Short Version)
- ARIMA delivered lower RMSE/MAE/MAPE on the held-out test window.
- LSTM under-predicted during the 2024 rally and carried a bias (errors skewed positive: actual − prediction > 0).
I’m not claiming “ARIMA is better than LSTM.” I’m saying for this exact task (univariate, one-day horizon, business-day frequency), a well-set ARIMA with honest evaluation was hard to beat.
Why “Old School” Won Here
- Horizon matters. For one-step-ahead forecasts, you’re mainly riding short-term autocorrelation and drift. That’s ARIMA’s wheelhouse.
- Data richness matters. LSTMs shine when you feed them richer context: multiple features, exogenous signals, or longer horizons. Here we asked it to learn next-day levels from a single series.
- Adaptation speed. Updating ARIMA’s state each day helps it track fresh regimes without retraining. A fixed LSTM trained on long history may smooth away regime changes.
What I’d Try Next (If Deep Learning Must Win)
- Add features: volume, market/sector indices, macro, news sentiment, technical indicators. LSTM needs signal to flex.
- Switch targets: predict returns instead of price level; standardize; add loss penalties for underestimation.
- Walk-forward updates: periodically fine-tune or retrain the LSTM as new data arrives.
- Probabilistic outputs: instead of point forecasts, estimate intervals or quantiles (use pinball loss).
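To make the last two ideas concrete, here’s a minimal sketch of a returns target and a pinball (quantile) loss; the quantile value and names are illustrative, and it assumes the Keras model sketched earlier:

import tensorflow as tf

# Target becomes next-day returns instead of price levels
returns = df["Close"].pct_change().dropna()

def pinball_loss(q):
    # Quantile (pinball) loss: asymmetric penalty around the q-th quantile
    def loss(y_true, y_pred):
        e = y_true - y_pred
        return tf.reduce_mean(tf.maximum(q * e, (q - 1.0) * e))
    return loss

# lstm_model.compile(optimizer="adam", loss=pinball_loss(0.9))  # q > 0.5 penalizes under-prediction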
Deep models aren’t wrong here; they’re just underfed.
A Simple, Reusable Evaluation Recipe
Want an honest read on your forecaster? Bookmark this:
- Pin the window. Fix your date range before modeling.
- Split by time. First ~80% train, last ~20% test. No shuffle.
- Fit scalers on train only. Apply to test; never refit on test.
- Roll forward. Predict one step ahead across the test set, always using info that would have existed at the time.
- Compare apples-to-apples. Use the same evaluation window and metrics (RMSE, MAE, MAPE) for all models.
- Plot everything. Actual vs predictions, plus error histograms. Bias shows up fast.
If your fancy model can’t beat a tight baseline under this setup, iterate on features and problem framing, not just the architecture.
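Concretely, the comparison and plotting steps can stay this small (a sketch, assuming aligned arrays of actuals and predictions over the same test window):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report(name, actual, preds):
    actual, preds = np.asarray(actual, float), np.asarray(preds, float)
    rmse = np.sqrt(mean_squared_error(actual, preds))
    mae = mean_absolute_error(actual, preds)
    mape = np.mean(np.abs((actual - preds) / actual)) * 100
    print(f"{name}: RMSE={rmse:.2f}  MAE={mae:.2f}  MAPE={mape:.2f}%")
    plt.hist(actual - preds, bins=30)   # an off-center histogram exposes bias quickly
    plt.title(f"{name}: actual - prediction")
    plt.show()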
What This Means for Your Projects
- Start with a baseline that respects time. ARIMA/ETS/Prophet or even a seasonal naive model can save you days (the naive version is a one-liner; see the sketch after this list).
- Earn your complexity. If a baseline is already good, you’ll need more signal (features) or a different objective to justify deep learning.
- Own your evaluation. Most “wow” demos fall apart when you switch to a fair, rolling test.
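For scale: the naive end of that spectrum really is a one-liner (assuming the business-day frame and 80/20 split index from the code below):

# Naive one-step baseline: tomorrow's prediction is simply today's close
naive_preds = df["Close"].shift(1).iloc[split:]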
Code Shape (High Level)
The core steps looked like this:
# 0) Imports
import yfinance as yf
from statsmodels.tsa.arima.model import ARIMA
from sklearn.preprocessing import MinMaxScaler

# 1) Data
df = yf.download("AAPL", start="2020-10-03", end="2025-10-03")[["Close"]]
df = df.asfreq("B")                  # business-day frequency
df["Close"] = df["Close"].ffill()    # forward-fill holidays/gaps

# 2) Split
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
# 3) ARIMA (grid on train → best order → single fit → roll)
best_order = (p, d, q) # from AIC on train
model = ARIMA(train["Close"], order=best_order,
              enforce_stationarity=False, enforce_invertibility=False).fit()
arima_preds = []
res = model
for i in range(len(test)):
    arima_preds.append(res.forecast(1).iloc[0])  # one-step-ahead forecast in dollars
    # Update state with the actual close; no parameter refit
    try:
        res = res.append(test["Close"].iloc[i:i+1])
    except Exception:
        pass  # fallback omitted for brevity
# 4) LSTM (train-only scaling → lookback sequences → rolling next-day)
scaler = MinMaxScaler().fit(train[["Close"]])
# ... create sequences with lookback=40 ...
# ... train LSTM with EarlyStopping ...
# ... predict rolling one-day ahead and inverse-transform ...
(I’m omitting full code here to keep things readable, but the structure is this straightforward.)
Key Takeaways
- Pick the model that matches the question. For one-step-ahead levels on a single series, ARIMA is a strong first move.
- Evaluation honesty beats model size. If you can’t win a fair rolling test, change the features or the objective, not just the architecture.
- Deep learning shines with context. Give it richer inputs, longer horizons, or tasks where nonlinearities really matter.
If You Want to Reproduce
- Pull AAPL data with yfinance for 2020–2025 (business days, forward-fill).
- Split by time (80/20).
- Fit ARIMA with a small grid on train; evaluate rolling 1-step on test.
- Scale train-only for LSTM; use a 40-day lookback; rolling 1-step predictions; inverse scale (see the sketch after this list).
- Compare RMSE/MAE/MAPE, and plot.
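The rolling LSTM step, which the snippets above gloss over, could look like this (a minimal sketch assuming the scaler, LOOKBACK, split, and trained lstm_model from the earlier sketches):

lstm_preds = []
scaled_all = scaler.transform(df[["Close"]]).flatten()   # scaler was fit on train only
for i in range(split, len(df)):
    window = scaled_all[i - LOOKBACK:i].reshape(1, LOOKBACK, 1)  # history up to day t only
    p_scaled = lstm_model.predict(window, verbose=0)[0, 0]
    lstm_preds.append(scaler.inverse_transform([[p_scaled]])[0, 0])  # back to dollars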
If your results mirror mine and ARIMA wins, congrats: you just saved compute and mental bandwidth. If the LSTM overtakes it after you add more features, even better; you earned it.
Thanks for reading! If you’re into practical, reproducible ML (with fewer surprises and more signal), I post more hands-on notes like this. Feel free to drop questions or ideas you want tested next.
