Designing Video Chat Infra (Like a PM at Microsoft Would)
Microsoft asked this real PM interview question. Here's how to crack it, step by step.
Designing a Scalable Video Chat Infrastructure
(Asked at Microsoft – PM Interview | Technical Systems Design)
***
What’s the Question About?
You're asked: “How would you design a system for video chat that can scale to millions of users, and keep costs low?”
This is a technical product design question, but you don’t need to be an engineer to answer it well. You just need to show structured thinking and awareness of how real-time products work.
Step 1: What Problem Are We Solving?
Video calls are used by millions of people every day on Microsoft Teams. Our job is to design a system that:
Keeps video and audio smooth
Works well even when usage spikes
Doesn’t cost too much to run
Step 2: Who Are the Users?
We’ll focus on three main types:
1:1 users like two people chatting
Group users like teams, classes, or meetings
Large audience users for webinars and online events
Each type needs a slightly different setup.
Step 3: What Are the Main Building Blocks?
You don’t need to go deep into code. Just explain the pieces that make video calls work.
1. App Interface
The Teams app that users open on phone or laptop
It adjusts video quality based on internet speed
It handles things like background noise and echo
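To make the "adjusts video quality" idea concrete, here is a minimal sketch of how an app might pick a video quality from its bandwidth estimate. The quality ladder, the bitrate numbers, and the `pick_quality` function are all illustrative assumptions, not how Teams actually does it.

```python
# Hypothetical quality ladder: (label, minimum bandwidth in kbps).
# The numbers are illustrative assumptions, not values from any real product.
QUALITY_LADDER = [
    ("1080p", 3000),
    ("720p", 1500),
    ("360p", 600),
    ("180p", 200),
]


def pick_quality(estimated_kbps: float, headroom: float = 0.8) -> str:
    """Return the highest quality whose bitrate fits the usable bandwidth.

    `headroom` keeps a safety margin so audio and network jitter have room.
    """
    usable = estimated_kbps * headroom
    for label, required_kbps in QUALITY_LADDER:
        if usable >= required_kbps:
            return label
    return "audio-only"  # too little bandwidth for any video rung


print(pick_quality(5000))  # 1080p
print(pick_quality(1000))  # 360p
print(pick_quality(100))   # audio-only
```

In an interview, the point to land is that the app keeps re-checking the connection and steps quality down before the call starts to stutter.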
2. Servers That Manage Video Streams
There are two common types of media servers:
SFU (Selective Forwarding Unit) forwards each person’s video to the others without combining them. It's cheaper and easier to scale because the server does not process the video
MCU (Multipoint Control Unit) mixes all video feeds into one stream. It uses more server resources but works better on weaker devices, since each device only has to receive and decode a single stream
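A quick way to see the difference is to count streams. The sketch below uses a simplified model where everyone sends video and the MCU sends the same mixed layout to all participants. The function name and dictionary keys are made up for illustration.

```python
def stream_counts(participants: int) -> dict:
    """Count video streams in one call where every participant sends video.

    Simplified model: the MCU sends everyone the same single mixed layout.
    """
    n = participants
    return {
        "sfu": {
            "client_uploads": 1,
            "client_downloads": n - 1,       # one stream per other participant
            "server_forwards": n * (n - 1),  # each feed copied to everyone else
            "server_encodes": 0,             # forwarding only, no mixing
        },
        "mcu": {
            "client_uploads": 1,
            "client_downloads": 1,  # a single pre-mixed stream
            "server_decodes": n,    # every incoming feed is decoded
            "server_encodes": 1,    # one mixed layout is encoded
        },
    }


counts = stream_counts(6)
print(counts["sfu"]["client_downloads"])  # 5
print(counts["sfu"]["server_forwards"])   # 30
print(counts["mcu"]["client_downloads"])  # 1
```

With an SFU the server's work is mostly network bandwidth, while with an MCU it is mostly CPU for decoding and encoding, which is why the MCU tends to cost more per call.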
3. Call Setup (Signaling Server)
This is the part that helps people start and end calls. It tells devices how to connect but doesn’t carry any video or audio.
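Here is a toy sketch of what a signaling server does, assuming a simple in-memory design. The `SignalingRoom` class and the message kinds are hypothetical. A real service would run over something like WebSockets, but the idea is the same: it passes small setup messages between devices and never touches video or audio.

```python
from collections import defaultdict, deque


class SignalingRoom:
    """Toy in-memory signaling relay: passes setup messages, never media."""

    def __init__(self):
        self.members = set()
        self.inboxes = defaultdict(deque)  # user -> pending setup messages

    def join(self, user):
        self.members.add(user)

    def send(self, sender, recipient, kind, payload):
        """Queue a setup message (e.g. 'offer', 'answer', 'hangup')."""
        if sender not in self.members or recipient not in self.members:
            return False  # both sides must be in the room
        self.inboxes[recipient].append(
            {"from": sender, "kind": kind, "payload": payload}
        )
        return True

    def receive(self, user):
        """Pop the next pending message for a user, or None if there is none."""
        inbox = self.inboxes[user]
        return inbox.popleft() if inbox else None


room = SignalingRoom()
room.join("alice")
room.join("bob")
room.send("alice", "bob", "offer", "alice's connection details")
message = room.receive("bob")
print(message["kind"], "from", message["from"])  # offer from alice
room.send("bob", "alice", "answer", "bob's connection details")
```

Because these messages are tiny compared with video, signaling is usually a small part of the infrastructure bill.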
4. Help for Bad Networks (TURN and STUN Servers)
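Sometimes devices can't connect directly because they sit behind firewalls or behind home and office routers that hide their real address (NAT)
STUN helps a device discover its public address so the two sides can try to connect directly
TURN relays the audio and video between them when a direct connection fails. It is more expensive because all the media passes through the server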
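The fallback order can be sketched as below. The candidate type names `host`, `srflx` (an address discovered through STUN), and `relay` (through TURN) come from the ICE standard that WebRTC uses. The fixed preference table and the `choose_path` function are a simplification for illustration, since real ICE computes numeric priorities for each candidate pair.

```python
from typing import Optional

# Lower number = preferred, because it is cheaper and usually faster.
# Simplified ordering; real ICE computes numeric priorities instead.
PREFERENCE = {"host": 0, "srflx": 1, "relay": 2}


def choose_path(check_results: dict) -> Optional[str]:
    """Pick the cheapest candidate type whose connectivity check succeeded.

    `check_results` maps a candidate type to True/False.
    Returns None when nothing works and the call cannot connect.
    """
    working = [kind for kind, ok in check_results.items() if ok]
    if not working:
        return None
    return min(working, key=PREFERENCE.__getitem__)


# Direct local path fails, STUN-discovered path works: no relay needed.
print(choose_path({"host": False, "srflx": True, "relay": True}))   # srflx
# Strict firewall: only the TURN relay gets through.
print(choose_path({"host": False, "srflx": False, "relay": True}))  # relay
```

Every call that ends up on `relay` is one whose media you pay to carry, which is why the TURN fallback rate shows up again as a cost metric.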
5. Broadcast Tech for Big Events
For webinars or town halls, a Content Delivery Network (CDN) helps stream one video to many people efficiently
Step 4: How Do We Keep It Scalable and Cheap?
Here’s how we balance user experience and cost:
Use SFU for most group calls since it’s cheaper and performs well
Only use MCU when needed for users with weak devices
Add or remove servers based on traffic patterns to avoid overpaying for unused resources
Let users connect directly for 1:1 calls when possible to reduce server usage
Use flexible server options like temporary cloud servers for busy times
For large events, send a single video feed out to many viewers through a CDN
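The rules above can be sketched as two small functions: one that picks a setup per call, and one that estimates how many media servers to run. The function names, the 20% buffer, and the capacity of 500 streams per server are assumptions for illustration only.

```python
def choose_topology(participants: int, weak_devices: bool = False,
                    broadcast: bool = False) -> str:
    """Pick the cheapest setup that still fits the call type."""
    if broadcast:
        return "cdn"   # one feed fanned out to a large audience
    if participants <= 2:
        return "p2p"   # direct connection, no media server needed
    if weak_devices:
        return "mcu"   # server mixes feeds so devices decode one stream
    return "sfu"       # default for group calls


def servers_needed(concurrent_streams: int, streams_per_server: int,
                   buffer_percent: int = 20) -> int:
    """Servers to run right now, with a safety buffer for sudden spikes."""
    if concurrent_streams <= 0:
        return 0
    demand = concurrent_streams * (100 + buffer_percent)
    capacity = 100 * streams_per_server
    return -(-demand // capacity)  # integer ceiling division


print(choose_topology(2))                     # p2p
print(choose_topology(8))                     # sfu
print(choose_topology(8, weak_devices=True))  # mcu
print(servers_needed(10_000, 500))            # 24
print(servers_needed(2_000, 500))             # 5
```

An autoscaler would rerun `servers_needed` as traffic changes, adding servers ahead of the morning peak and removing them at night.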
Step 5: What Are the Trade-offs?
No solution is perfect. We need to consider:
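SFU is cheaper to run but asks more of the user’s device and connection, since each participant receives a separate stream from everyone else
MCU is easier on weak devices but costs more to run, since the server has to decode and re-mix every feed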
TURN makes connections more reliable but adds to latency and server cost
A system that is easier for users might be harder to manage or scale
Step 6: How Do You Measure Success?
Here’s what we track to know if the system is working well:
Cost per 1,000 minutes of video streamed
Call setup success rate
Video and audio quality (like resolution, frame rate, and how often the picture freezes)
How often the system falls back to TURN servers
How quickly new servers are added when user load increases
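Three of these metrics are simple ratios. The sketch below shows the arithmetic with made-up numbers. The function name and every input value are hypothetical, not real Teams data.

```python
def summarize(total_cost: float, total_minutes: int,
              setup_attempts: int, setup_successes: int,
              turn_calls: int) -> dict:
    """Compute three headline metrics from raw totals for one period."""
    return {
        "cost_per_1000_min": round(total_cost / total_minutes * 1000, 2),
        "setup_success_rate": round(setup_successes / setup_attempts, 3),
        # Share of connected calls that needed the TURN relay.
        "turn_fallback_rate": round(turn_calls / setup_successes, 3),
    }


# Hypothetical month: $4,200 spent, 12M minutes, 10,000 call attempts.
report = summarize(
    total_cost=4200,
    total_minutes=12_000_000,
    setup_attempts=10_000,
    setup_successes=9_850,
    turn_calls=985,
)
print(report["cost_per_1000_min"])   # 0.35
print(report["setup_success_rate"])  # 0.985
print(report["turn_fallback_rate"])  # 0.1
```

Watching these together matters: a drop in TURN fallback rate should show up as a lower cost per 1,000 minutes, as long as setup success does not fall with it.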
Final Tip
You don’t need to be highly technical to answer well. Just show that you:
Understand the user experience
Know the basic components of how video calls work
Can balance performance, cost, and reliability
Keep PMing
I will see you with yet another PM Interview question. ( ❤️ Drop a like or comment if you liked the explanation! )
Balaji Rao