Designing Scalable Instagram Architecture - System Design
A useful prerequisite for learning the Instagram architecture design is knowing how to design a URL shortener service such as TinyURL. Please read it here…
1. What is Instagram?
Instagram is a free photo & video-sharing and social networking service that is very popular. In this tutorial, we will learn to design a simple architecture of Instagram that enables users to share photos, follow other users and create a unique "NewsFeed" for each user consisting of top photos from accounts they follow.
2. Design Goals
Our service should provide the following features to a user:
- Mandatory Features
  - Users should be able to upload and view photos.
  - Users can perform a search based on photo titles.
  - A user can follow other users.
  - Create a unique NewsFeed for each user, consisting of top photos from all people/accounts the user follows.
- Optional Features
  - The service should be highly scalable.
  - High consistency.
  - High reliability, i.e. any uploaded data should never be lost.
3. Capacity Estimation
The important thing to note here is that read requests will be about 100 times more frequent than upload (write) requests. Suppose we have 500 million users registered on our platform, with 1 million daily active users. Let's assume 5M (million) photos are uploaded every day; then the number of photos uploaded per second is:
5M / (24 * 60 * 60) ≈ 58 photos per second
If the average photo size is 150 KB, then storage usage in a day:
5M * 150KB = 716 GB
Assuming our service runs continuously for 10 years, the space required will be:
716 GB * 365 * 10 ≈ 2553 TB ≈ 2.6 PB
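The arithmetic above can be reproduced with a short script. This is a minimal sketch: the constant names are illustrative, and it assumes binary units (1 GB = 1024 * 1024 KB), which the text does not state. The results land close to the figures quoted above; small differences come from rounding and from the choice of unit convention.

```python
# Back-of-the-envelope capacity estimate using the figures from the text.
PHOTOS_PER_DAY = 5_000_000        # 5M uploads per day
AVG_PHOTO_KB = 150                # average photo size in KB
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds
YEARS = 10

uploads_per_second = PHOTOS_PER_DAY / SECONDS_PER_DAY

# Assumption: binary units, i.e. 1 GB = 1024 * 1024 KB and 1 TB = 1024 GB.
daily_storage_gb = PHOTOS_PER_DAY * AVG_PHOTO_KB / (1024 * 1024)
total_storage_tb = daily_storage_gb * 365 * YEARS / 1024

print(f"uploads per second: {uploads_per_second:.1f}")
print(f"storage per day: {daily_storage_gb:.0f} GB")
print(f"storage in {YEARS} years: {total_storage_tb:.0f} TB")
```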
4. Database Design Choice
The database design will help us understand the data flow among the various components of the Instagram architecture. We need to store data about users, their uploaded photos, and the people they follow. Note: counting the likes and comments on a photo uploaded by a user is out of the scope of this tutorial.
Data related to user
- User ID (Primary key): A unique user id to make users globally distinguishable.
- Name: The name of the user.
- Email: The email id of the user
- Password: Password of the user to facilitate the login feature.
- Creation Date: The date on which the user was registered.
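The user record above could be modeled roughly as follows. This is only a sketch: the class and function names are hypothetical, and it assumes the password is stored as a salted hash rather than as plain text, a detail the field list does not specify.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field
from datetime import datetime, timezone


def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256; the iteration count is an illustrative choice.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


@dataclass
class User:
    user_id: int            # primary key
    name: str
    email: str
    password_salt: bytes    # assumption: salt stored next to the hash
    password_hash: bytes
    creation_date: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def check_password(self, candidate: str) -> bool:
        expected = hash_password(candidate, self.password_salt)
        # Constant-time comparison of the two hashes.
        return hmac.compare_digest(expected, self.password_hash)


def new_user(user_id: int, name: str, email: str, password: str) -> User:
    salt = os.urandom(16)
    return User(user_id, name, email, salt, hash_password(password, salt))
```

For example, `new_user(1, "Alice", "alice@example.com", "s3cret").check_password("s3cret")` returns `True`, while any other candidate password returns `False`.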
Data related to photos
Here we are storing the actual photos, and the size limit for each photo is 150 KB. As calculated above, if the service runs for 10 years we will end up accumulating 2.6 PB of data, which is too much to store in a database. Hence we will use an object storage service such as AWS S3.
So, the flow will be as follows: we will store the actual photo in an S3 bucket and store the photo path (URL) in the database.
- PhotoId (Primary Key): A unique 10-byte photo ID that identifies each photo.
- UserId: The id of a user who uploaded the photo.
- Path: The path/URL of object storage where the photos are stored.
- Latitude & Longitude: We will store this information to find the location of the photo.
- Date & time: The date & timestamp at which the photo was uploaded.
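The split between object storage and the metadata table can be sketched as below. The two dictionaries are in-memory stand-ins for an S3 bucket and the database, and the key layout `photos/<user_id>/<photo_id>.jpg` is a hypothetical choice, not something the design prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class PhotoRecord:
    photo_id: int          # primary key
    user_id: int           # uploader
    path: str              # object-storage path/URL
    latitude: float
    longitude: float
    uploaded_at: datetime


# Hypothetical in-memory stand-ins: a real system would use an object
# store such as S3 and a metadata database instead of these dictionaries.
object_store: Dict[str, bytes] = {}
photo_table: Dict[int, PhotoRecord] = {}


def upload_photo(photo_id: int, user_id: int, data: bytes,
                 latitude: float, longitude: float) -> PhotoRecord:
    path = f"photos/{user_id}/{photo_id}.jpg"   # hypothetical key layout
    object_store[path] = data                   # 1) store the bytes
    record = PhotoRecord(photo_id, user_id, path, latitude, longitude,
                         datetime.now(timezone.utc))
    photo_table[photo_id] = record              # 2) store path and metadata only
    return record


def view_photo(photo_id: int) -> bytes:
    record = photo_table[photo_id]      # look up the path in the database
    return object_store[record.path]    # fetch the bytes from object storage
```

The database row stays small because it only holds the path and a few attributes, while the photo bytes live in the object store.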
Data related to users following & followers
- Following - UserId of all the users followed by the current user.
- Followers - UserId of all the people following the current user.
If we look closely at the above table, the followers and following columns together form the primary key.
To find all the people followed by user 1, we query the following column. To find all followers of user 1, we query the followers column with a condition stating that they are following user 1.
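The two lookups described above can be sketched with a set of pairs. The `(follower_id, followee_id)` naming and the sample rows are assumptions made for illustration; the pair as a whole plays the role of the composite primary key.

```python
from typing import Set, Tuple

# Each row is (follower_id, followee_id); the pair acts as the composite key.
follows: Set[Tuple[int, int]] = {
    (1, 2), (1, 3), (1, 4),   # user 1 follows users 2, 3 and 4
    (2, 1), (3, 1),           # users 2 and 3 follow user 1
}


def following_of(user_id: int) -> Set[int]:
    """Accounts that user_id follows."""
    return {followee for follower, followee in follows if follower == user_id}


def followers_of(user_id: int) -> Set[int]:
    """Accounts that follow user_id."""
    return {follower for follower, followee in follows if followee == user_id}
```

In a real database each lookup would be served by an index on the corresponding column rather than by a scan over all rows.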
Followers of user 1 -> user 2, user 3
User 1 following -> user 2, user 3, user 4
We have two different choices of databases:
1) Relational databases (MySQL)
2) NoSQL databases (Cassandra)
In general, relational databases are a good fit when we have lots of complex queries involving joins, but they tend to be slower at this scale. NoSQL databases handle relationship queries poorly, but they offer faster reads and writes. We don't really need lots of relationships among the data, but we do need fast read and write speed. Hence we will choose a NoSQL database. The key for each row in the photo table can be the PhotoId, because it is going to be globally unique.
5. Component Design
Our system supports photo upload (write) and view (read) features. Uploading is a slow process, so a few users uploading photos at the same time can consume all the available bandwidth, leaving none for reading or viewing photos. Even if we assume our web server can support 1000 active connections simultaneously, with a 20:80 write-read ratio 200 connections will be occupied for writing, and each write (upload) keeps its connection open for a long time.
To overcome this problem, we will use dedicated servers for reads and separate dedicated servers for writes. Also, separating photo read and write requests will allow us to scale and optimize each of these operations independently. The following diagram shows how the read-write flow will work.
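One way to picture the read/write separation is a small router that sends uploads to one pool and views to another. This is a sketch only: the server names are hypothetical, and treating POST and PUT as uploads is an assumption about the API.

```python
from itertools import cycle

# Hypothetical server names; in practice these would be real hosts.
READ_SERVERS = ["read-1", "read-2", "read-3"]
WRITE_SERVERS = ["write-1", "write-2"]

# Round-robin iterators, one per pool, so each pool scales independently.
_read_pool = cycle(READ_SERVERS)
_write_pool = cycle(WRITE_SERVERS)


def route(method: str) -> str:
    """Send uploads (POST/PUT) to the write pool, everything else to the read pool."""
    if method.upper() in {"POST", "PUT"}:
        return next(_write_pool)
    return next(_read_pool)
```

Because slow uploads only ever occupy connections on the write pool, view requests keep their own capacity.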
6. News Feed Generation
1. Generating News Feed
One of the most critical requirements of an Instagram-type service is building a unique newsfeed for every user, containing the latest posts from each account he/she follows. For simplicity, let's assume that all the accounts a user follows together upload 200 new unique photos every day. Hence a user's newsfeed will be a combination of these 200 unique photos, followed by a repetition of past uploads. So, to generate a news feed for a user, we will first fetch the metadata (likes, comments, time, location, etc.) of the latest 200 photos and pass it to the ranking algorithm, which will determine how the photos should be arranged in the newsfeed based on that metadata.
The major challenge with this approach is that it requires querying lots of tables simultaneously and then ranking the results using predefined parameters, so it results in higher latency, i.e. it takes a lot of time to generate the newsfeed.
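The on-demand generation described above can be sketched as follows. The scoring formula in `rank_score` is a hypothetical placeholder (likes divided by age), since the text does not define the ranking algorithm, and only likes and upload time are used out of the metadata it lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Photo:
    photo_id: int
    user_id: int
    uploaded_at: float   # Unix timestamp
    likes: int = 0


def rank_score(photo: Photo, now: float) -> float:
    # Hypothetical ranking: more likes raise the score, age lowers it.
    age_hours = (now - photo.uploaded_at) / 3600
    return photo.likes / (1.0 + age_hours)


def generate_feed(user_id: int, following: Dict[int, Set[int]],
                  photos: List[Photo], now: float,
                  limit: int = 200) -> List[Photo]:
    followed = following.get(user_id, set())
    # 1) Fetch the latest `limit` photos from the followed accounts.
    candidates = sorted((p for p in photos if p.user_id in followed),
                        key=lambda p: p.uploaded_at, reverse=True)[:limit]
    # 2) Hand them to the ranking function.
    return sorted(candidates, key=lambda p: rank_score(p, now), reverse=True)
```

Every call repeats the fetch and the ranking, which is where the latency of this approach comes from.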
Pregenerating News Feed - To overcome the challenges with the above approach, we will create a server dedicated to generating each user's unique newsfeed beforehand and storing it in a separate newsfeed table. With this approach, whenever the user needs to see the updated newsfeed, we simply query this table.
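A minimal sketch of the pregeneration idea: a background job fills a feed table, and the serving path is a single lookup. The dictionary `newsfeed_table` stands in for the separate newsfeed table, and `build_feed` is a hypothetical callback representing whatever fetch-and-rank logic the dedicated server runs.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for the separate newsfeed table: user_id -> photo ids.
newsfeed_table: Dict[int, List[int]] = {}


def pregenerate_feeds(user_ids: Iterable[int],
                      build_feed: Callable[[int], List[int]]) -> None:
    """Run by the dedicated feed server, for example on a schedule."""
    for user_id in user_ids:
        newsfeed_table[user_id] = build_feed(user_id)


def get_feed(user_id: int) -> List[int]:
    """Serving path: a single lookup, with no ranking at request time."""
    return newsfeed_table.get(user_id, [])
```

The trade-off is freshness: a stored feed is only as recent as the last run of the generator.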
2. Serving the News Feed
Now we have seen how to create a news feed. The next major challenge in Instagram architecture design is how to serve the resulting newsfeed to the user.
Push - One way is to notify all of a user's followers whenever he/she uploads a new photo. For this, we can use long polling. A possible problem with this approach is that for a user who follows a lot of people or celebrities, the server has to push updates/send notifications quite frequently.
Pull - Users refresh their newsfeeds (make a pull request to the server) whenever they want to see fresh content. The problem with this approach is that new posts will not be visible until users refresh, and most refreshes will return empty results.
Hybrid Approach - The hybrid approach applies the pull-based approach to users with lots of followers (celebrities) and the push-based approach to normal users.
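One reading of the hybrid approach is sketched below: posts from accounts with many followers are pulled at read time, while posts from everyone else are pushed to followers at upload time. The threshold value, the in-memory dictionaries and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real service would use a much larger value

followers: DefaultDict[int, Set[int]] = defaultdict(set)   # author -> accounts following them
following: DefaultDict[int, Set[int]] = defaultdict(set)   # user -> authors they follow
posts_by_author: DefaultDict[int, List[int]] = defaultdict(list)
pushed_feed: DefaultDict[int, List[int]] = defaultdict(list)


def follow(user_id: int, author_id: int) -> None:
    followers[author_id].add(user_id)
    following[user_id].add(author_id)


def is_celebrity(author_id: int) -> bool:
    return len(followers[author_id]) >= CELEBRITY_THRESHOLD


def publish(author_id: int, photo_id: int) -> None:
    posts_by_author[author_id].append(photo_id)
    if is_celebrity(author_id):
        return  # pull model: followers fetch this post when they refresh
    for follower_id in followers[author_id]:
        pushed_feed[follower_id].append(photo_id)  # push model: fan out on write


def read_feed(user_id: int) -> List[int]:
    feed = list(pushed_feed[user_id])
    for author_id in following[user_id]:
        if is_celebrity(author_id):
            feed.extend(posts_by_author[author_id])  # pulled at read time
    return list(dict.fromkeys(feed))  # drop duplicates, keep first-seen order
```

This keeps the write cost of a celebrity upload constant, at the price of a slightly more expensive read for their followers.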
7. Load Balancing
To overcome the problems of limited network bandwidth and a single point of failure in an Instagram-type service architecture, we will use load balancers. A load balancer divides the traffic among a group of servers, resulting in improved response time and availability of a website or application. Read more here…
To distribute the load among servers we will use the Least Bandwidth Method. This algorithm will choose the server currently serving the least amount of traffic, measured in megabits per second (Mbps). We can place the Load Balancers between:
- The client and the server.
- The server and the database.
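The Least Bandwidth Method described above can be sketched as a small class. The class and method names are hypothetical, and the sketch assumes that each server's current traffic in Mbps is reported to the balancer by some monitoring component.

```python
from typing import Dict, Iterable


class LeastBandwidthBalancer:
    def __init__(self, servers: Iterable[str]) -> None:
        # Current traffic per server in Mbps, starting at zero.
        self.mbps: Dict[str, float] = {name: 0.0 for name in servers}

    def report(self, server: str, mbps: float) -> None:
        """Record the latest traffic measurement for a server."""
        self.mbps[server] = mbps

    def pick(self) -> str:
        """Choose the server currently serving the least traffic."""
        return min(self.mbps, key=self.mbps.get)
```

After reporting 120.0, 35.5 and 80.0 Mbps for three servers, `pick()` selects the one at 35.5 Mbps; the choice changes as new measurements arrive.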
This is how we will design the Instagram architecture that will scale in realtime.