Airzero Cloud


One of the most common system design interview questions in software engineering is how to create a URL shortener like TinyURL or Bitly.

While tinkering with Cloudflare Worker to sync the Daily LeetCode Challenge to my to-do list, I had the idea to create an actual URL shortener that anyone could use.

What follows is my thought process, along with code examples, for creating a URL shortener using Cloudflare Worker. If you want to follow along, you'll need a Cloudflare account and the Wrangler CLI.

Requirements

Let's begin, as with any System Design interview, by defining some functional and non-functional requirements.

Functional

  • When given a URL, our service should return a unique, short URL for it. For instance, https://betterprogramming.pub/how-to-write-clean-code-in-python-5d67746133f2 → s.jerrynsh.com/FpS0a2LU
  • When a user visits s.jerrynsh.com/FpS0a2LU, they are redirected to the original URL.
  • The UUID should be encoded using the Base62 encoding scheme (26 + 26 + 10):
      • Lowercase letters from 'a' to 'z', 26 characters in total
      • Uppercase letters from 'A' to 'Z', 26 characters in total
      • Digits from '0' to '9', 10 characters in total

We will not support custom short links in this POC. Our UUID should be 8 characters long because 62^8 would give us approximately 218 trillion possibilities.
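As a quick sanity check on that figure, the size of the keyspace can be computed directly. This is only a sketch; the constant names are mine and not part of the service.

```javascript
// Base62 alphabet: 26 lowercase + 26 uppercase + 10 digits.
const ALPHABET_SIZE = 26n + 26n + 10n
const KEY_LENGTH = 8n

// Number of distinct 8-character Base62 keys.
// BigInt is used so the result is exact.
const keyspace = ALPHABET_SIZE ** KEY_LENGTH

console.log(String(keyspace)) // 218340105584896 (about 218 trillion)
```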

The generated short URL should never expire.

Non-Functional

  • Low latency
  • High availability

Budget, Capacity, and Restrictions Planning

The goal is straightforward: I want to be able to host this service for free. As a result, our constraints are heavily reliant on Cloudflare Worker pricing and platform limitations.

At the time of writing, the constraints per account for hosting our service for free are as follows:

  • 100k requests per day at 1k requests per minute

  • CPU runtime of no more than 10ms

Our application, like most URL shorteners, is expected to have a high read rate but a low write rate. Cloudflare KV, a key-value data store that supports high read with low latency — ideal for our use case — will be used to store our data.

Continuing from the limits above, the free tier of KV allows us to have:

  • 1k writes/day
  • 100k reads/day
  • 1 GB of data storage

What is the maximum number of short URLs that we can store?

With a free maximum stored data limit of 1 GB in mind, let's try to estimate how many URLs we can store. I'm using this tool to estimate the byte size of the URL in this case:

  • One character equals one byte.
  • We have no problem with the key size limit because our UUID should only be a maximum of 8 characters.
  • The value size limit, on the other hand, needs an estimate: I'm guessing that the maximum URL size should be around 200 characters. As a result, I believe it is safe to assume that each stored object should be an average of 400 bytes, which is significantly less than the 25 MiB value size limit.
  • Finally, with 1 GB available, our URL shortener can support up to 2,500,000 short URLs (1 GB / 400 bytes).

I know, 2.5 million URLs is not a large number.

In retrospect, we could have made our UUID 4 characters long instead of 8, as 62^4 (about 14.8 million) possibilities are far more than 2.5 million. Having said that, let's stick with an 8-character UUID.
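The arithmetic behind these estimates can be reproduced in a few lines. This is a rough sketch that assumes 1 GB means 10^9 bytes and uses the 400-byte average guessed above; the constant names are illustrative.

```javascript
// Assumptions taken from the estimate above.
const STORAGE_LIMIT_BYTES = 1_000_000_000 // 1 GB of free-tier KV storage
const AVG_ENTRY_BYTES = 400               // assumed average size of one stored entry

// How many entries fit within the storage limit.
const maxUrls = Math.floor(STORAGE_LIMIT_BYTES / AVG_ENTRY_BYTES)

// Size of a 4-character Base62 keyspace, for comparison.
const keyspace4 = 62 * 62 * 62 * 62

console.log(maxUrls)             // 2500000
console.log(keyspace4)           // 14776336
console.log(keyspace4 > maxUrls) // true
```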

Overall, I would say that the free tier for Cloudflare Worker and KV is quite generous and more than adequate for our proof of concept. Please keep in mind that the limits are applied per account.

Storage

As I previously stated, we will use Cloudflare KV as the database to store our shortened URLs because we anticipate more reads than writes.

Eventually Consistent

One thing to keep in mind is that, while KV can support extremely high read rates globally, it is an eventually consistent storage solution. In other words, any writes may take up to 60 seconds to propagate globally, a drawback we can live with.

In my experiments, I've yet to see propagation take more than a couple of seconds.

Atomic Operation

Based on what I've learned about how KV works, it's clear that it's not ideal for situations that necessitate atomic operations. Fortunately for us, this is not an issue.

For our proof of concept, the KV key would be the UUID that follows our domain name in the short URL, and the value would be the long URL provided by the user.

Creating a KV

Simply run the following commands in Wrangler CLI to create a KV.

# Production namespace:
wrangler kv:namespace create "URL_DB"
# This namespace is used for `wrangler dev` local testing:
wrangler kv:namespace create "URL_DB" --preview

After creating these KV namespaces, we must also update our wrangler.toml file to include the appropriate namespace bindings. To access your KV's dashboard, go to https://dash.cloudflare.com/<your Cloudflare account id>/workers/kv/namespaces.
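As a rough illustration, the binding in wrangler.toml might look like the fragment below. The two id values are placeholders for the ids printed by the commands above, and the exact layout may differ between Wrangler versions, so treat this as a sketch rather than a definitive configuration.

```toml
# wrangler.toml (fragment)
# "URL_DB" is the global variable name the Worker code uses to reach the namespace.
kv_namespaces = [
  { binding = "URL_DB", id = "<production namespace id>", preview_id = "<preview namespace id>" }
]
```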

Short URL UUID Generation Logic

This is most likely the most crucial aspect of our entire application.

  • The goal, based on our requirements, is to generate an alphanumeric UUID for each URL, with the length of our key being no more than 8 characters.
  • In an ideal world, the UUID of the generated short link should be unique. Another critical factor to consider is what happens if multiple users shorten the same URL. We should ideally also check for duplicates.

Consider the following alternatives:

  • Using a UUID generator

     https://betterprogramming.pub/stop-using-exceptions-like-this-in-python-2bd8ba7d8841 → UUID Generator → Yyf6AJ39 → s.jerrynsh.com/Yyf6AJ39

This solution is relatively simple to implement. We simply call our UUID generator to generate a new UUID for each new URL we encounter. We'd then store the new URL in KV under the generated UUID as its key.

If the UUID already exists (collision) in our KV, we can continue retrying. However, we should be cautious about retrying because it can be costly.

Furthermore, using a UUID generator would not help us deal with duplicates in our KV: KV can only be queried by key, so checking whether a long URL is already stored as a value would be relatively slow.

  • Hashing the URL

    https://betterprogramming.pub/how-to-write-clean-code-in-python-5d67746133f2 → MD5 Hash → 99d641e9923e135bd5f3a19dca1afbfa → 99d641e9 → s.jerrynsh.com/99d641e9

Hashing a URL, on the other hand, allows us to check for duplicated URLs because passing a string (URL) through a hashing function always produces the same result. We can then use the result (key) to check for duplication in our KV.

Assuming that we use MD5, we would end up with a 32-character hexadecimal digest, far longer than the 8 characters we want for our key. So, what if we just took the first 8 characters of the generated MD5 hash? Isn't the problem solved?

No, not exactly. Any hash function can produce collisions, and truncating the hash makes them more likely. We could use a longer portion of the hash to reduce the likelihood of a collision. However, it would be inconvenient for users. In addition, we want to keep our UUID at 8 characters.

  • Using an incremental counter

    https://betterprogramming.pub/3-useful-python-f-string-tricks-you-probably-dont-know-f908f7ed6cf5 → Counter → s.jerrynsh.com/12345678

In my opinion, this is the simplest yet most scalable solution. We will not have any collision problems if we use this solution. We can simply increment the number of characters in our UUID whenever we consume the entire set.

However, I do not want users to be able to guess a short URL at random by visiting s.jerrynsh.com/12345678. As a result, this option is out of the question.
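To make the counter approach concrete, here is one possible sketch that converts a running counter into a Base62 string. The encodeBase62 helper and the alphabet order are assumptions for illustration; the example above simply uses the decimal counter itself.

```javascript
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

// Hypothetical helper: convert a non-negative integer counter into Base62.
const encodeBase62 = (counter) => {
    if (!Number.isSafeInteger(counter) || counter < 0) {
        throw new RangeError('counter must be a non-negative safe integer')
    }
    if (counter === 0) return ALPHABET[0]
    let encoded = ''
    let remaining = counter
    while (remaining > 0) {
        // Prepend the digit for the lowest-order Base62 position.
        encoded = ALPHABET[remaining % ALPHABET.length] + encoded
        remaining = Math.floor(remaining / ALPHABET.length)
    }
    return encoded
}

console.log(encodeBase62(61)) // "z"
console.log(encodeBase62(62)) // "10"
```

Consecutive counters still map to consecutive keys, so the guessability concern applies to this encoding just as it does to the plain decimal counter.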

There are numerous other solutions (for example, pre-generate a list of keys and assign an unused key when a new request comes in) available, each with its own set of advantages and disadvantages.

We're going with solution 1 for our proof of concept because it's simple to implement and I'm fine with duplicates. To reduce duplicates, we could cache our users' requests to shorten URLs.

Nano ID

The nanoid package is used to generate a UUID. We can use the Nano ID collision calculator to estimate our collision rate.
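As a rough stand-in for that estimate, the birthday-bound approximation can be coded in a few lines. This is not necessarily how the Nano ID calculator works; the collisionProbability helper is hypothetical and assumes keys are drawn uniformly at random with no retry.

```javascript
// Birthday-bound approximation: probability that at least two of `count`
// uniformly random keys collide in a keyspace of size `keyspace`.
// p ≈ 1 - exp(-count * (count - 1) / (2 * keyspace))
const collisionProbability = (count, keyspace) =>
    -Math.expm1((-count * (count - 1)) / (2 * keyspace))

const KEYSPACE = Math.pow(62, 8) // 8-character Base62 keys

// Chance of at least one collision among a single day's 1,000 free-tier writes.
console.log(collisionProbability(1_000, KEYSPACE))

// Chance of at least one collision by the time all 2.5 million URLs are stored.
console.log(collisionProbability(2_500_000, KEYSPACE)) // about 0.014
```

Even at full capacity the chance of ever seeing a collision stays in the low percent range, which is why a simple retry is enough for this proof of concept.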

Okay, enough chit-chat; let's get to work on some code!

To deal with the possibility of collision, we simply keep retrying:

// utils/urlKey.js
import { customAlphabet } from 'nanoid'
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
/*
Generate a unique `urlKey` using the `nanoid` package.
Keep retrying until the generated urlKey does not already exist in URL_DB.
*/
export const generateUniqueUrlKey = async () => {
   const nanoId = customAlphabet(ALPHABET, 8)
   let urlKey = nanoId()
   while ((await URL_DB.get(urlKey)) !== null) {
       urlKey = nanoId()
   }
   return urlKey
}
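Since every retry costs a KV read, one option is to cap the number of attempts. The variant below is a sketch under my own assumptions: generateUniqueUrlKeyBounded, its parameters, and the default of 5 attempts are hypothetical, and the store and key generator are passed in so the function can be exercised without a real KV namespace.

```javascript
// Hypothetical variant of generateUniqueUrlKey with a retry cap.
// `store` is anything with an async get(key) that resolves to null when the
// key is unused (as a KV namespace does); `generateKey` returns a candidate key.
const generateUniqueUrlKeyBounded = async (store, generateKey, maxAttempts = 5) => {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const urlKey = generateKey()
        if ((await store.get(urlKey)) === null) {
            return urlKey
        }
    }
    // Give up instead of looping forever; the caller decides how to respond.
    throw new Error(`No unused key found after ${maxAttempts} attempts`)
}
```

Inside the Worker this could be called as generateUniqueUrlKeyBounded(URL_DB, customAlphabet(ALPHABET, 8)), with the existing try/catch in the handler turning the thrown error into a 500 response.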

API

We will define the API endpoints that we want to support in this section. This project is started with the itty-router worker template, which assists us with all of the routing logic:

wrangler generate <project-name> https://github.com/cloudflare/worker-template-router

The entry point of our project is index.js:

// index.js
import { Router } from 'itty-router'
import { createShortUrl } from './src/handlers/createShortUrl'
import { redirectShortUrl } from './src/handlers/redirectShortUrl'
import { LANDING_PAGE_HTML } from './src/utils/constants'
const router = Router()
// GET landing page html
router.get('/', () => {
   return new Response(LANDING_PAGE_HTML, {
       headers: {
           'content-type': 'text/html;charset=UTF-8',
       },
   })
})
// GET redirects the short URL to its original URL.
router.get('/:text', redirectShortUrl)
// POST creates a short URL that is associated with its original URL.
router.post('/api/url', createShortUrl)
// 404 for everything else.
router.all('*', () => new Response('Not Found', { status: 404 }))
// All incoming requests are passed to the router where your routes are called and the response is sent.
// The fetch event is passed along as well so that handlers can call event.waitUntil().
addEventListener('fetch', (e) => {
   e.respondWith(router.handle(e.request, e))
})

In the interest of providing a better user experience, I created a simple HTML landing page that anyone can use.

Creating short URL

To begin, we'll need a POST endpoint (/api/url) that calls createShortUrl, which parses the originalUrl from the body and returns a short URL.

Example Code:

// handlers/createShortUrl.js
import { generateUniqueUrlKey } from '../utils/urlKey'
export const createShortUrl = async (request, event) => {
   try {
       const urlKey = await generateUniqueUrlKey()
       const { host } = new URL(request.url)
       const shortUrl = `https://${host}/${urlKey}`
       const { originalUrl } = await request.json()
       const response = new Response(
           JSON.stringify({
               urlKey,
               shortUrl,
               originalUrl,
           }),
           { headers: { 'Content-Type': 'application/json' } }
       )
       event.waitUntil(URL_DB.put(urlKey, originalUrl))
       return response
   } catch (error) {
       console.error(error, error.stack)
       return new Response('Unexpected Error', { status: 500 })
   }
}

To test this locally, run the following Curl command:

curl --request POST \
 --url http://127.0.0.1:8787/api/url \
 --header 'Content-Type: application/json' \
 --data '{
    "originalUrl": "https://www.google.com/"
}'
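The handler above stores whatever originalUrl it receives. One possible safeguard is to check that the value is an absolute http(s) URL before writing it to KV. The isValidHttpUrl helper below is hypothetical and not part of the project; rejecting invalid input with a 400 response would be one way to use it.

```javascript
// Hypothetical helper: accept only absolute http(s) URLs.
const isValidHttpUrl = (value) => {
    if (typeof value !== 'string') return false
    try {
        const { protocol } = new URL(value)
        return protocol === 'http:' || protocol === 'https:'
    } catch {
        // new URL() throws a TypeError for strings it cannot parse.
        return false
    }
}

console.log(isValidHttpUrl('https://www.google.com/')) // true
console.log(isValidHttpUrl('javascript:alert(1)'))     // false
console.log(isValidHttpUrl('not a url'))               // false
```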

Redirecting short URL

As a URL shortening service, we want users to be able to visit a short URL and be redirected to their original URL:

// handlers/redirectShortUrl.js
export const redirectShortUrl = async ({ params }) => {
   const urlKey = decodeURIComponent(params.text)
   const originalUrl = await URL_DB.get(urlKey)
   if (originalUrl) {
       return Response.redirect(originalUrl, 301)
   }
   return new Response('Invalid Short URL', { status: 404 })
}

What about deletion? Because users do not need to be authorized to shorten a URL, I decided to proceed without a deletion API: it makes no sense to let any user delete another user's short URL.

Simply run wrangler dev to test our URL shortener locally.

What happens if a user decides to shorten the same URL multiple times? We don't want our KV to end up with duplicated URLs with unique UUIDs, do we? To address this, we could use a cache middleware that caches the originalUrl submitted by users via the Cache API:

import { URL_CACHE } from '../utils/constants'
export const shortUrlCacheMiddleware = async (request) => {
   const { originalUrl } = await request.clone().json()
   if (!originalUrl) {
       return new Response('Invalid Request Body', {
           status: 400,
       })
   }
   const cache = await caches.open(URL_CACHE)
   const response = await cache.match(originalUrl)
   if (response) {
       console.log('Serving response from cache.')
       return response
   }
}

Update our index.js accordingly:

// index.js
...
router.post('/api/url', shortUrlCacheMiddleware, createShortUrl)
...

Finally, after shortening the URL, we must ensure that our cache instance is updated with the original URL:

// handlers/createShortUrl.js
import { URL_CACHE } from '../utils/constants'
import { generateUniqueUrlKey } from '../utils/urlKey'
export const createShortUrl = async (request, event) => {
   try {
       const urlKey = await generateUniqueUrlKey()
       const { host } = new URL(request.url)
       const shortUrl = `https://${host}/${urlKey}`
       const { originalUrl } = await request.json()
       const response = new Response(
           JSON.stringify({
               urlKey,
               shortUrl,
               originalUrl,
           }),
           { headers: { 'Content-Type': 'application/json' } }
       )
       const cache = await caches.open(URL_CACHE) // Access our API cache instance
       event.waitUntil(URL_DB.put(urlKey, originalUrl))
       event.waitUntil(cache.put(originalUrl, response.clone())) // Update our cache here
       return response
   } catch (error) {
       console.error(error, error.stack)
       return new Response('Unexpected Error', { status: 500 })
   }
}

During my testing with wrangler dev, I discovered that the Worker cache does not function locally or on any workers.dev domain.

To test this, run wrangler publish to publish the application on a custom domain. You can test the changes by sending a request to the /api/url endpoint while using wrangler tail to view the log.

Deployment

Isn't it true that no side project is ever completed without hosting?

Before you publish your code, you must edit the wrangler.toml file and include your Cloudflare account id. More information about configuring and publishing your code is available in the official documentation. Simply run wrangler publish to deploy and publish any new changes to your Cloudflare Worker.

Conclusion

To be honest, researching, writing, and building this POC at the same time is the most fun I've had in a long time. There are numerous other things that come to mind that we could have done for our URL shortener; to name a few:

  • Storing metadata such as the creation date and the number of visits
  • Adding authentication
  • Handling the deletion and expiration of short URLs
  • User analytics
  • Custom short links

One issue that most URL shortening services face is that short URLs are frequently used to redirect users to malicious websites. I believe it would be an interesting topic to investigate further. If you have any questions about the preceding topic, please do not hesitate to contact us. Airzero Cloud will be your digital partner.

Email id: [email protected]

Author - Johnson Augustine
Cloud Architect, Ethical hacker
Founder: Airo Global Software Inc
LinkedIn Profile: www.linkedin.com/in/johnsontaugustine/