Designing a URL Shortening service like TinyURL
# Requirements
# Functional Features
- Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications.
- When users access a short link, our service should redirect them to the original link.
- Users should optionally be able to pick a custom short link for their URL.
- Links will expire after a standard default timespan. Users should be able to specify the expiration time.
# Non-Functional Features
- The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
- URL redirection should happen in real-time with minimal latency.
- Shortened links should not be guessable (not predictable).
# Estimations
Assumptions 100:1 Read-Write ratio
Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between read and write.
# Traffic estimates
Assuming, we will have 500M new URL shortenings per month, with 100:1 read/write ratio, we can expect 50B redirections during the same period:
What would be Queries Per Second (QPS) for our system? New URLs shortenings per second:
Considering 100:1 read/write ratio, URLs redirections per second will be:
# Storage estimates
Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion:
Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate–we will dig into it later). We will need 15TB of total storage:
# Bandwidth estimates
For write requests, since we expect 200 new URLs every second, total incoming data for our service will be 100KB per second:
For read requests, since every second we expect ~20K URLs redirections, total outgoing data for our service would be 10MB per second:
# Memory estimates
If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs.
Since we have 20K requests per second, we will be getting 1.7 billion requests per day:
To cache 20% of these requests, we will need 170GB of memory.
One thing to note here is that since there will be many duplicate requests (of the same URL), our actual memory usage will be less than 170GB.
# API
1 | |
Parameters:
- api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
- original_url (string): Original URL to be shortened.
- custom_alias (string): Optional custom key for the URL.
- user_name (string): Optional user name to be used in the encoding.
- expire_date (string): Optional expiration date for the shortened URL.
1 | |
How do we detect and prevent abuse? A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key).
# Database
A few observations about the nature of the data we will store:
- We need to store billions of records.
- Each object we store is small (less than 1K).
- There are no relationships between records—other than storing which user created a URL.
- Our service is read-heavy.
TABLE URL
PK Hash: varchar(16)
OriginalURL: varchar
CreationDate: datetime
ExpirationDate: datetime
UserID: int
TABLE USER
PK UserID:int
Name: varchar
Email: varchar
CreationDate: datetime
LastLogin: datetime
What kind of database should we use? Since we anticipate storing billions of rows, and we don’t need to use relationships between objects – a NoSQL store like DynamoDB, Cassandra or Riak is a better choice. A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details.
# Design & Algorithm
# Encoding
# Generating Keys Offline
# Partition
# Cache
# LB
# Purging or DB cleanup
# Practical