Phase 3: EXTRACT Non-Functional Requirements
The goal of this phase is to define how well the system must perform. This is also where software engineering system design and ML system design differ the most.
Software engineers are expected to cover concepts such as scale estimation (e.g. video consumption volume, peak vs. average load, session assumptions, interaction events), performance requirements (e.g. end-to-end latency, network RTT, CDN video fetch, app processing, API latency budgets, buffers, etc.), and reliability (e.g. a degradation strategy covering serving modes and user experience).
As an ML engineer, you must focus on the ML side of the system. Demonstrating broader software engineering literacy is good practice, but it is not necessary; do it only if you are within the interview's time limits and can afford to spend a few minutes on engineering concepts.
While conducting system design interviews, I have often seen ML engineers fail at this stage. ML system design differs from software engineering system design, there is little guidance online on how to prepare for it, and as a result ML-specific requirements often get missed.
Candidates design a beautiful distributed system but forget the ML fundamentals. Let me walk through each category and explain why it matters.
Before diving into specifics, you should think about ML requirements in five dimensions: model performance, data, model training/fine-tuning, inference + deployment, and API. Each dimension has cascading effects on our architecture.
Start with the model performance requirements. For our use case, for example:
- Relevance Metric: D1 Retention (users return next day)
- Ranking Quality: NDCG@10 > 0.7
- CTR Prediction: AUC > 0.75
You may choose other metrics depending on your case and its specifics, but again you have to THINK OUT LOUD and explain why these metrics matter, e.g.:
“Notice I didn’t just say ‘high accuracy’ — that’s meaningless. Let me explain each choice:
D1 Retention as North Star: This directly ties to business value. A user who returns tomorrow is more valuable than one who watches for 2 hours today but never comes back. This metric will push our model to:
– Balance immediate engagement with long-term satisfaction
– Avoid clickbait that burns users out
– Promote content diversity to keep things fresh
NDCG@10 > 0.7: I choose NDCG (Normalized Discounted Cumulative Gain) because position matters in a feed. The first video matters more than the 10th.
P.S. The threshold 0.7 means we’re better than random but realistic — Netflix achieves about 0.75–0.8.
CTR AUC > 0.75: This is for our click-through rate predictor. AUC of 0.75 means we can reasonably distinguish engaging from non-engaging content. Why not higher? Because in practice, user behavior is noisy — even 0.8 is exceptional for real-world systems.”
Choosing appropriate ML metrics also gives you immediate architectural insights, since the metrics shape the design, and you can call them out when you present the architecture:
“These metrics tell us that we need:
– Multi-objective optimization (D1 retention vs immediate CTR)
– Position-aware training (for NDCG optimization)
– Proper train/test splitting by time (not random) to measure true performance”
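To make these targets concrete, here is a minimal sketch of how you might check NDCG@10 and CTR AUC offline with scikit-learn; the array shapes and values are illustrative assumptions, not a full evaluation pipeline:

```python
import numpy as np
from sklearn.metrics import ndcg_score, roc_auc_score

# Illustrative data: relevance labels and model scores for a batch of feeds.
# Shapes are (n_feeds, n_candidate_videos); the values are made up for the sketch.
true_relevance = np.array([[3, 2, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0]] * 4)
predicted_scores = np.random.rand(4, 12)

# Position-aware ranking quality, cut off at the first 10 slots of the feed.
ndcg_at_10 = ndcg_score(true_relevance, predicted_scores, k=10)

# CTR prediction quality on flattened (clicked / not clicked) labels.
clicked = (true_relevance > 0).astype(int).ravel()
ctr_auc = roc_auc_score(clicked, predicted_scores.ravel())

print(f"NDCG@10 = {ndcg_at_10:.3f} (target > 0.7)")
print(f"CTR AUC = {ctr_auc:.3f} (target > 0.75)")
```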
ML metrics are one side of the coin; you must also cover the data requirements, because without data there is no ML. Depending on your use case, think carefully about what the data requirements are (freshness, storage, updates, user features, item features, etc.). For our case:
Data Freshness requirements:
- User Features: < 5 minutes stale
- Item Features: < 1 hour stale
- Model Updates: Daily retraining
- Embeddings: Refresh every 6 hours
And of course, THINK OUT LOUD and explain why you made these choices and not different ones:
Freshness requirements are about the speed of learning. Let me justify each choice:
User Features < 5 minutes: User interests change rapidly in short-form video. If someone just watched 5 cat videos, the 6th should reflect that. This means:
— Real-time feature computation pipeline needed
— Streaming infrastructure (Flink/Spark Streaming)
— Online feature store with sub-5-minute write propagation
Item Features < 1 hour: Videos accumulate signals — views, likes, completion rates. One hour staleness is acceptable because:
— Video popularity stabilizes after initial spike
— Batch computation is cheaper than streaming
— We can use hourly Spark jobs instead of real-time
Daily Model Retraining: Why daily, not hourly or weekly?
— Daily captures trending topics without overfitting to noise
— Training cost vs improvement tradeoff
— Allows for human verification before deployment
Embeddings Every 6 Hours: User and content embeddings are expensive to compute but critical for similarity matching. 6 hours because:
— Balances computational cost with freshness
— Aligns with user session patterns (morning/afternoon/evening/night)
— Gives new content time to accumulate signals
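One way to make these freshness targets explicit is to encode them as SLAs and check staleness at serving time. The sketch below is a simplified assumption of how that check might look: the SLA values mirror the section above, and the check function is illustrative, not a real feature-store API:

```python
from datetime import datetime, timedelta
from typing import Optional

# Freshness SLAs taken from the requirements above (illustrative, not a standard).
FRESHNESS_SLA = {
    "user_features": timedelta(minutes=5),
    "item_features": timedelta(hours=1),
    "embeddings": timedelta(hours=6),
}

def is_stale(feature_group: str, last_updated: datetime,
             now: Optional[datetime] = None) -> bool:
    """Return True if a feature group violates its freshness SLA."""
    now = now or datetime.utcnow()
    return now - last_updated > FRESHNESS_SLA[feature_group]

# At serving time, stale user features could trigger a fallback,
# e.g. relying on longer-term profile features instead of recent behavior.
if is_stale("user_features", last_updated=datetime.utcnow() - timedelta(minutes=7)):
    print("User features stale: fall back to profile-level personalization")
```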
The other nuances you should cover in this case (I won't go into much detail here) are content policies and fairness, where you should mention relevance optimization, diversity, fairness, and safety filters. Let that be an exercise for you!
After this, to show that you are capable not only of ML-related work but also of infrastructure and engineering, you should mention how you will handle the cold start problem (at least for this case; it depends on your situation, but 90% of cases have this problem).
For new users we should decide how to form user cohorts; for new videos, how to build the exploration pool logic; and we should define a fallback strategy and backup policies, e.g.:
Cold start is where many ML systems fail. Here’s my strategy:
New User Cold Start: No history = no personalization? Wrong approach. We can:
— Use registration demographics (age, location, language) to find similar cohorts
— Start with region-specific trending content
— Apply aggressive exploration (Thompson sampling or ε-greedy with high ε)
— Learn fast from first 10 interactions
New Content Cold Start: Great content with no views gets buried. We solve this with:
— Exploration pool: Reserve 10% of recommendations for new content (see the sampling sketch after this list)
— Creator features: Use creator’s historical performance as prior
— Content understanding: Use video embeddings to find similar successful content
— Graduated exposure: 100 views → 1000 → 10000 based on performance
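Here is a minimal Beta-Bernoulli Thompson sampling sketch for filling that exploration slice; the candidate tuple format and the 10% budget are illustrative assumptions:

```python
import numpy as np

def sample_exploration_slots(new_videos, n_slots):
    """Pick new videos for the exploration slice via Beta-Bernoulli Thompson sampling.

    Each candidate is (video_id, clicks, impressions). We sample a CTR estimate
    from a Beta(clicks + 1, impressions - clicks + 1) posterior and give the
    slots to the highest draws. Videos with no data get a flat prior, so they
    still receive a fair chance of exposure.
    """
    draws = []
    for video_id, clicks, impressions in new_videos:
        theta = np.random.beta(clicks + 1, impressions - clicks + 1)
        draws.append((theta, video_id))
    draws.sort(reverse=True)
    return [video_id for _, video_id in draws[:n_slots]]

# Example: a 100-slot feed with a 10% exploration budget.
candidates = [("v1", 3, 40), ("v2", 0, 0), ("v3", 12, 300)]
print(sample_exploration_slots(candidates, n_slots=int(100 * 0.10)))
```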
Fallback Hierarchy: When personalization fails, gracefully degrade (a cascade sketch follows this list):
1. Personalized recommendations (normal path)
2. Collaborative filtering from similar users
3. Recent interactions + popular content
4. Regional trending only
5. Global top content (last resort)
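A simple way to express this hierarchy in serving code is a cascade of candidate sources tried in order; the source callables below are hypothetical placeholders for the actual retrieval calls:

```python
def recommend_with_fallback(user_id, sources, n_items=10):
    """Try candidate sources in priority order and return the first non-empty result.

    `sources` is an ordered list of callables, e.g.
    [personalized, collaborative_similar_users, recent_plus_popular,
     regional_trending, global_top]; each returns a (possibly empty) list of items.
    """
    for source in sources:
        try:
            items = source(user_id, n_items)
        except Exception:
            items = []  # a failing tier should never break the feed
        if items:
            return items
    return []  # should be unreachable if global_top always has content
```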
And finally, the crucial part for any ML engineer: model training and deployment requirements:
- Training pipeline: data volume, data storage, training frequency, training time budget, experimentation, optimization and metrics.
- Model architecture: model type, architecture, embeddings, model size, ensemble.
- Training infrastructure: distributed training, feature store, data versioning, hyperparameter tuning.
- Model deployment: type (e.g. blue-green with gradual rollout), testing (e.g. canary testing), rollback time, model versioning and metrics monitoring.
- Serving infrastructure: model loading infra, batch predictions process, model caching, fallback models.
- Online learning: incremental updates, bandit algorithms, feature updates, feedback loop.
- Online metrics: prediction latency, model drift, feature drift, A/B testing metrics.
- Offline validation: backtesting, time-based splits, slice analysis, fairness metrics.
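For the offline validation point, the key detail is splitting by time rather than randomly, so the model is always evaluated on interactions that happened after everything it was trained on. A minimal pandas sketch (column names are assumptions):

```python
import pandas as pd

def time_based_split(interactions: pd.DataFrame, cutoff: str):
    """Split an interaction log into train/validation sets by timestamp.

    Everything before `cutoff` is training data; everything at or after it is
    validation data. This avoids leaking future behavior into training, which
    a random split would do.
    """
    interactions = interactions.sort_values("timestamp")
    train = interactions[interactions["timestamp"] < cutoff]
    valid = interactions[interactions["timestamp"] >= cutoff]
    return train, valid

# Example: train on everything before the last day, validate on the last day.
# train_df, valid_df = time_based_split(events_df, cutoff="2024-06-30")
```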
And of course, you should explain your choices, e.g.:
Let me explain the key ML trade-offs I’m making and why:
Real-time vs Batch Learning:
— Real-time for user features (they change quickly)
— Batch for model training (stability and validation)
— Hybrid for embeddings (6-hour batches)
Model Complexity vs Serving Cost:
— Two-tower allows independent scaling
— 256-dim embeddings balance quality and memory
— Gradient boosting for tabular features (faster than DNN)
Exploration vs Exploitation:
— 10% exploration budget for new content
— Thompson sampling (better than ε-greedy for many arms)
— Graduated exposure based on early signals
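To illustrate the "two-tower allows independent scaling" point: the user and item towers are separate networks whose 256-dim embeddings meet only at a dot product, so item embeddings can be precomputed and indexed while the user tower runs per request. A minimal PyTorch sketch, with feature dimensions chosen purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """A small MLP that maps raw features to a normalized 256-dim embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

# Separate towers: the item tower runs offline in batch, the user tower per request.
user_tower = Tower(in_dim=128)   # assumed user feature width
item_tower = Tower(in_dim=96)    # assumed item feature width

user_emb = user_tower(torch.randn(1, 128))      # one request
item_embs = item_tower(torch.randn(1000, 96))   # precomputed candidate pool

# Relevance scores are dot products, so retrieval can use an ANN index.
scores = user_emb @ item_embs.T                 # shape: (1, 1000)
top10 = scores.topk(10, dim=-1).indices
```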
Once you have a full picture of the ML requirements, you have almost everything you need to drive the technical architecture directly. This is where your hands-on experience is really needed, so you can quickly sketch a prototype of the architecture and show your strengths!
