Thursday, July 10, 2014

Design of a tinyurl service

A tinyurl service compresses a long url into a short one. This saves space, and is useful for scenarios such as twitter or weibo, where each character counts.  The down side is possible spam use.  Concerns include longevity of a short url.

== Single Server ==

What a tinyurl service does is essentially: long-url <==> short_url. The conversion is double-sided, i.e., a two-way function. A site registers its url at the service, and then provides the short url to visitors. The visitors visit the short url, which then redirects them to the original (mostly longer) url.

A common way of thinking is to hash the long url into a short one. However this is not correct, because hash is a one-way function.

Many ways can be used to do the compression. One of the most simple but also effective one, is to have a database table set up this way:

Table T_Url_Conversion (
    ID : int PRIMARY_KEY AUTO_INC,
    Original_url : varchar,
    Short_url : varchar
)

Then the auto-incremental primary key ID is used to do the conversion: (ID, 10) <==> (short_url, BASE). Whenever you insert a new original_url, the query can return the new inserted ID, and use it to derive the short_url, save this short_url and send it to cilent.

Here the BASE can be 62 for a-zA-Z0-9, or 36 for a-z0-9. So this is a conversion of numbers between base-10 and base-62 or base-36. If use base-62, and with a short_url length of 6 characters, that's a space of 62^6 =~ 57 billion short urls to use.

One concern is some short url strings will be reserved, say for at least 2 reasons: 1) urls reserved by certain clients for branding purpose, 2) dirty words that are avoided by clients. Such reserved strings can be stored in another table, and do a check upon new short_url generation.

Follow this design, one can easily setup a website providing short url service. This idea is not new, other people already came up with it [1][2][3].

== Multiple Servers ==

[1] also discussed the case of distributed design. However the scheme to do the assignment is not clear: when receiving a long url it can be hashed to a value and choose a target storage server according to this value.  This may work if the fictitious universal hashing function can evenly dispatch the requests to different servers.  But a more serious issue is, when receiving a short url, how can you know which server it belongs to?

I think one solution can be like this:

-- Solution I --
Say there are 3 servers, and a load-balancing machine that does the assignment in a round-robin fashion: for insertion requests 3k, 3k+1, 3k+2, assign them to servers 1, 2 and 3 respectively. On each server, the ID starts at 1, 2 and 3, and all increment by 3. This way, the first server stores ID 1, 4, 7 .., the second server stores ID 2, 5, 8 ..., the third server stores ID 3, 6, 9.. So when you receive a short url and converts it back to N, then you can go to server N%3.

The potential shortcoming of this method, is when you need to add a fourth server, then the split method is broken. A possible solution can be dividing 3k to 6k' and 6k' + 3. Here is another solution that can be better:

-- Solution II --
Since we already know that by only using 6 characters, the short_url space is about 57 billion, so this distributed scenario comes up only because of heavy load, not for lack of short_url space. So we may have the luxury to use an extra character to specify which server to go: say the short_url is abc, then we append a last digit to specify the server to go. For example, if it goes to server 2, then we make the short_url abc2. This way, the load-balancer continues with the round-robin modulo assignment algorithm when new servers are added, and short_urls can easily find their way. Also, now each server does not have to increment their ID by server_count now, and can use continuous ID.  Seems now everyone can live happily ever after.

It is further easy to see that, using this method, with one appended digit, one can have up to 62 servers on a base-62 system. This should be enough for such a simple service.

It is also easy to see that the initial long_url assignment approach of load-balancer is independent from the short_url assignment method.The long_url assignment now can actually use a universal-hashing method, but a round-robin method still works well and is indeed more simple with just a counter, and is fault-tolerant: losing the counter for a while does not actually matter.

References:
[1] System Design for Big Data [tinyurl]
[2] URL Shortening: Hashes In Practice, 21 Aug 2007
[3] How to code a URL shortener?


1 comment:

Blogger said...

Did you know that you can shorten your long urls with Shortest and get $$$ for every visitor to your short urls.

Blog Archive

Followers