#Introduction
The technical screening for data science roles in FAANG companies is very thorough. However, even they can’t come up with an endless stream of unique interview questions. Once you’ve gone through the grind enough times, you start to notice that some SQL patterns keep showing up.
Here are the top 5, with examples and code (PostgreSQL) for practice.
Master these and you’ll be ready for most SQL interviews.
#Pattern #1: Aggregating Data with GROUP BY
Using aggregate functions with GROUP BY allows you to aggregate metrics across categories.
This pattern is often combined with data filtering, which means using one of the two clauses:
- WHERE: Filters data before the aggregation.
- HAVING: Filters data after the aggregation.
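Since the Meta example that follows only uses WHERE, here is a minimal sketch of the HAVING side against the same fb_comments_count table (the threshold of 10 is arbitrary):

```sql
-- Keep only users whose total comment count exceeds 10.
-- The condition uses an aggregate, so it must go in HAVING, not WHERE.
SELECT user_id,
       SUM(number_of_comments) AS number_of_comments
FROM fb_comments_count
GROUP BY user_id
HAVING SUM(number_of_comments) > 10;
```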
Example: This Meta interview question asks you to find the total number of comments made 30 or fewer days before 2020-02-10 per user. Users with no comments should be excluded from the output.
We use the SUM() function with a GROUP BY clause to sum the number of comments per user. Outputting only the comments within the specified period is achieved by filtering the data before aggregation, i.e., using WHERE. There’s no need to calculate which date is “30 days before 2020-02-10” by hand; we simply subtract 30 days from that date using an INTERVAL literal.
SELECT user_id,
SUM(number_of_comments) AS number_of_comments
FROM fb_comments_count
WHERE created_at BETWEEN '2020-02-10'::DATE - 30 * INTERVAL '1 day' AND '2020-02-10'::DATE
GROUP BY user_id;
Here’s the output.
| user_id | number_of_comments |
|---|---|
| 5 | 1 |
| 8 | 4 |
| 9 | 2 |
| … | … |
| 99 | 2 |
Business Use:
- User activity metrics: DAU & MAU, churn rate.
- Revenue metrics: revenue per region/product/time period.
- User engagement: average session length, average clicks per user.
#Pattern #2: Filtering with Subqueries
When using subqueries for filtering, you create a data subset, then filter the main query against it.
The two main subquery types are:
- Scalar subqueries: Return a single value, e.g., maximum amount.
- Correlated subqueries: Reference and depend on the result of the outer query to return the values.
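For contrast with the correlated subquery used below, a scalar subquery might look like the following sketch; the orders table and its columns are taken from the ranking example later in the article:

```sql
-- The inner query returns a single value (the maximum order cost),
-- which the outer WHERE compares every row against.
SELECT cust_id,
       total_order_cost
FROM orders
WHERE total_order_cost = (SELECT MAX(total_order_cost) FROM orders);
```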
Example: This interview question from Meta asks you to create a recommendation system for Facebook. For each user, you should find pages that this user doesn’t follow, but at least one of their friends does. The output should consist of the user ID and the ID of the page that should be recommended to this user.
The outer query returns all the user-page pairs where the page is followed by at least one friend.
Then, we use a subquery in the WHERE clause to remove the pages the user already follows. The subquery has two conditions: pg.user_id = f.user_id restricts it to the pages followed by this specific user, and pg.page_id = p.page_id checks whether the candidate page is among them.
As the subquery returns all the pages followed by the user, using NOT EXISTS in WHERE excludes all those pages from the recommendation.
SELECT DISTINCT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
WHERE NOT EXISTS
(SELECT *
FROM users_pages pg
WHERE pg.user_id = f.user_id
AND pg.page_id = p.page_id);
Here’s the output.
| user_id | page_id |
|---|---|
| 1 | 23 |
| 1 | 24 |
| 1 | 28 |
| … | … |
| 5 | 25 |
Business Use:
- Customer activity: most recent login per user, latest subscription change.
- Sales: highest order per customer, top revenue order per region.
- Product performance: most purchased product in each category, highest-revenue product per month.
- User behaviour: Longest session per user, first purchase per customer.
- Reviews & feedback: top reviewer, latest review for each product.
- Operations: Latest shipment status per order, fastest delivery time per region.
#Pattern #3: Ranking with Window Functions
Using window functions such as ROW_NUMBER(), RANK(), and DENSE_RANK() allows you to order rows within data partitions, and then identify the first, second, or nth record.
Here is what each of these ranking window functions does:
- ROW_NUMBER(): Assigns a unique sequential number within each partition; tied values get different row numbers.
- RANK(): Assigns the same rank to tied values and skips the following ranks, so the next non-tied value’s rank reflects the number of rows before it.
- DENSE_RANK(): Same as RANK(), only it doesn’t skip ranks after ties.
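A quick way to see the difference is to run all three over the same tied values; this self-contained sketch uses an inline VALUES list:

```sql
-- Two tied scores of 90: ROW_NUMBER gives 1 and 2, RANK gives 1 and 1
-- (then skips to 3), DENSE_RANK gives 1 and 1 (then continues with 2).
WITH exam_scores(score) AS (VALUES (90), (90), (80))
SELECT score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM exam_scores;
```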
Example: In an Amazon interview question, we need to find the highest daily order cost between 2019-02-01 and 2019-05-01. If a customer has more than one order on a certain day, sum the order costs on a daily basis. The output should contain the customer’s first name, the total cost of their order(s), and the date of the order.
In the first common table expression (CTE), we find the orders between the specified dates and sum the customer’s daily totals for each date.
In the second CTE, we use RANK() to rank customers by order cost in descending order within each date.
Now, we join two CTEs to output the required columns and filter only the orders with the first rank assigned to them, i.e., the highest order.
WITH customer_daily_totals AS (
SELECT o.cust_id,
o.order_date,
SUM(o.total_order_cost) AS total_daily_cost
FROM orders o
WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
SELECT cust_id,
order_date,
total_daily_cost,
RANK() OVER (PARTITION BY order_date ORDER BY total_daily_cost DESC) AS rnk
FROM customer_daily_totals
)
SELECT c.first_name,
rdt.order_date,
rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN customers c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
Here’s the output.
| first_name | order_date | max_cost |
|---|---|---|
| Mia | 2019-02-01 | 100 |
| Farida | 2019-03-01 | 80 |
| Mia | 2019-03-01 | 80 |
| … | … | … |
| Farida | 2019-04-23 | 120 |
Business Use:
- User activity: “Top 5 most active users last month”.
- Revenue: “The second-highest revenue region”.
- Product popularity: “Top 10 best-selling products”.
- Purchases: “The first purchase of each customer”.
#Pattern #4: Calculating Moving Averages & Cumulative Sums
The moving (rolling) average calculates the average over the last N rows, typically months or days. It’s calculated using the AVG() window function and defining the window as ROWS BETWEEN N PRECEDING AND CURRENT ROW.
The cumulative sum (running total) is the sum from the first row up to the current row, which is reflected in defining the window as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in the SUM() window function.
Example: The interview question from Amazon wants us to find the 3-month rolling average of total revenue from purchases. We should output the year-month (YYYY-MM) and the 3-month rolling average, sorted from the earliest to the latest month.
Also, the returns (negative purchase values) should not be included.
We use a subquery to calculate monthly revenue using SUM() and convert the purchase date to a YYYY-MM format with the TO_CHAR() function.
Then, we use AVG() to calculate the moving average. In the OVER() clause, we order the rows by month and define the window as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, so the 3-month moving average covers the current month and the two preceding it.
SELECT t.month,
AVG(t.monthly_revenue) OVER(ORDER BY t.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_revenue
FROM
(SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
SUM(purchase_amt) AS monthly_revenue
FROM amazon_purchases
WHERE purchase_amt > 0
GROUP BY 1
ORDER BY 1) AS t
ORDER BY t.month ASC;
Here’s the output.
| month | avg_revenue |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 23493.5 |
| 2020-03 | 25535.666666666668 |
| … | … |
| 2020-10 | 21211 |
To calculate a cumulative sum, we’d do it like this.
SELECT t.month,
SUM(t.monthly_revenue) OVER(ORDER BY t.month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_sum
FROM
(SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
SUM(purchase_amt) AS monthly_revenue
FROM amazon_purchases
WHERE purchase_amt > 0
GROUP BY 1
ORDER BY 1) AS t
ORDER BY t.month ASC;
Here’s the output.
| month | cum_sum |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 46987 |
| 2020-03 | 76607 |
| … | … |
| 2020-10 | 239869 |
Business Use:
- Engagement metrics: 7-day moving average of DAU or messages sent, cumulative cancellations.
- Financial KPIs: 30-day moving average of costs/conversions/stock prices, revenue reporting (cumulative YTD).
- Product performance: logins per user moving average, cumulative app installs.
- Operations: cumulative orders shipped, tickets resolved, bugs closed.
#Pattern #5: Applying Conditional Aggregations
Conditional aggregation lets you compute multiple segmented metrics in one pass by putting the CASE WHEN statement inside aggregate functions.
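As a minimal sketch of the idea (the subscriptions table and plan column are assumptions, not part of the interview question below): each CASE WHEN turns a segment into a per-row value that the aggregate then sums, so one scan of the table produces several metrics at once.

```sql
-- One pass over the table yields a total plus two segmented counts.
SELECT region,
       COUNT(*) AS total_users,
       SUM(CASE WHEN plan = 'paid' THEN 1 ELSE 0 END) AS paid_users,
       SUM(CASE WHEN plan = 'free' THEN 1 ELSE 0 END) AS free_users
FROM subscriptions
GROUP BY region;
```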
Example: A question from an Amazon interview asks you to identify returning active users by finding users who made a second purchase within 1 to 7 days after their first purchase. The output should consist only of these users’ IDs. The same-day purchases should be ignored.
The first CTE keeps each user’s distinct purchase dates: DISTINCT collapses multiple purchases on the same day into a single row, which is how same-day purchases are ignored.
The second CTE ranks each user’s purchase dates from the oldest to the newest.
The last CTE finds the first and second purchases for each user by using conditional aggregation. We use MAX() to pick the single non-NULL value for the first and second purchase dates.
Finally, we use the result of the last CTE and retain only users who made a second purchase (non-NULL) within 7 days of their first purchase.
WITH daily AS (
SELECT DISTINCT user_id,
created_at::DATE AS purchase_date
FROM amazon_transactions
),
ranked AS (
SELECT user_id,
purchase_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS rn
FROM daily
),
first_two AS (
SELECT user_id,
MAX(CASE WHEN rn = 1 THEN purchase_date END) AS first_date,
MAX(CASE WHEN rn = 2 THEN purchase_date END) AS second_date
FROM ranked
WHERE rn <= 2
GROUP BY user_id
)
SELECT user_id
FROM first_two
WHERE second_date IS NOT NULL AND (second_date - first_date) BETWEEN 1 AND 7
ORDER BY user_id;
Here’s the output.
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| … |
| 143 |
Business Use:
- Subscription reporting: paid vs. free users, active vs. churned users by plan tier.
- Marketing funnel dashboards: signed up vs. purchased users by traffic source, emails opened vs. clicked vs. converted.
- E-commerce: completed vs. refunded vs. cancelled orders by region, new vs. returning buyers.
- Product analysis: iOS vs. Android vs. Web usage, feature adopted vs. not adopted counts per cohort.
- Finance: revenue from new vs. existing customers, gross vs. net revenue.
- A/B testing & experiments: control vs. treatment metrics.
#Conclusion
If you want a job at a FAANG company (or elsewhere), focus on these five SQL patterns for the interviews. Of course, they’re not the only SQL concepts tested, but they are the most commonly tested ones. By focusing on them, you make your interview preparation as efficient as possible for most SQL interviews at FAANG companies.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
