Twitter’s Blob store and Libcrunch – how it works

You may have read one of my previous posts, “Arming the cloud”, where I talked about why and how large cloud providers are using commodity hardware with intelligent APIs to separate the dumb data from the intelligent data and give us a better service. Well, in the world of distributed computing and networking, you will probably not find a larger example than Twitter.

To you and me, when we upload a photo to the cloud, it is simply in the “cloud”. We do not care much about what goes on in the background; all we care about is how long it takes to upload or download. This has been Twitter’s challenge: how do they keep all this data synchronized around the world to meet our immediate demands? It is a common problem: how do large-scale web and cloud environments let users anywhere in the world use a photo sharing service while overcoming latency, which ultimately boils down to you and me waiting for the service to work?

So Twitter announced a new photo sharing platform, but what I am going to look at is how the company manages its software and infrastructure to enable this service. Here is what Twitter released yesterday:

“When a user tweets a photo, we send the photo off to one of a set of Blobstore front-end servers. The front-end understands where a given photo needs to be written, and forwards it on to the servers responsible for actually storing the data. These storage servers, which we call storage nodes, write the photo to a disk and then inform a Metadata store that the image has been written and instruct it to record the information required to retrieve the photo. This Metadata store, which is a non-relational key-value store cluster with automatic multi-DC synchronization capabilities, spans across all of Twitter’s data centers providing a consistent view of the data that is in Blob store.”
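To make that write path a bit more concrete, here is a minimal Python sketch of the three roles in the quote: a front-end, storage nodes and a metadata store. This is only my illustration and not Twitter's code. The class names, the content-hash blob id, the "next node along" replica choice and the replica count of 2 are all assumptions of mine.

```python
import hashlib


class MetadataStore:
    """Stand-in for the key-value metadata cluster: blob id -> node names."""

    def __init__(self):
        self._locations = {}

    def record(self, blob_id, node_name):
        self._locations.setdefault(blob_id, []).append(node_name)

    def locate(self, blob_id):
        return list(self._locations.get(blob_id, []))


class StorageNode:
    """Writes the bytes to its 'disk' (a dict here), then informs the metadata store."""

    def __init__(self, name, metadata):
        self.name = name
        self._disk = {}
        self._metadata = metadata

    def write(self, blob_id, data):
        self._disk[blob_id] = data
        self._metadata.record(blob_id, self.name)

    def read(self, blob_id):
        return self._disk[blob_id]


class FrontEnd:
    """Decides where a photo goes and forwards it to the storage nodes."""

    def __init__(self, nodes, metadata, replicas=2):
        self._nodes = nodes
        self._by_name = {node.name: node for node in nodes}
        self._metadata = metadata
        self._replicas = replicas

    def upload(self, data):
        # Assumption: the blob id is a hash of the content.
        blob_id = hashlib.sha256(data).hexdigest()
        start = int(blob_id, 16) % len(self._nodes)
        for i in range(self._replicas):
            node = self._nodes[(start + i) % len(self._nodes)]
            node.write(blob_id, data)
        return blob_id

    def download(self, blob_id):
        # Ask the metadata store where the blob lives, then read one copy.
        for name in self._metadata.locate(blob_id):
            return self._by_name[name].read(blob_id)
        raise KeyError(blob_id)
```

The point to notice is that a reader never has to ask the storage nodes "who has this photo?". The metadata store answers that, and the storage nodes only serve bytes.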

Does that sound familiar from what I was discussing in my previous posts? Of course it does. This is a classic example of commoditizing storage/compute/network hardware and having the software API intelligently manage the data.

So what you have to consider with a platform like Twitter is speed and cost. They want users to be able to see the tweet with the picture as soon as possible, but they have to be conscious of the cost of delivering this service. Twitter has many data centers with many resources, but the trade-off is always going to be cost.

The next element of this is reliability. How does Twitter ensure that your photos exist in multiple locations, but not in so many that it costs Twitter too much? It also has to think about how and where it stores the information that indicates where the actual file exists (the metadata). Think about how many photos are uploaded to Twitter each day; that is a lot of metadata to store. What if one of the servers holding it fails? If that server held the only copy, you would lose the metadata and the photos it points to would be unreachable. The original way of thinking is to remedy this by replicating the data, but that is costly and time-consuming to keep synchronized and, let’s not forget, uses some serious space.
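To see why the number of copies matters so much, here is a rough back-of-the-envelope helper. The function name and every figure in the usage line are made up by me purely for illustration; they are not Twitter's numbers.

```python
def raw_storage_bytes(photos_per_day, avg_photo_bytes, replicas, days=1):
    """Raw capacity consumed when every photo is stored `replicas` times."""
    return photos_per_day * avg_photo_bytes * replicas * days


# Hypothetical inputs: 1 million photos a day, 500 KB each, 3 copies.
daily = raw_storage_bytes(1_000_000, 500_000, replicas=3)
```

Each extra replica adds another full copy of everything uploaded that day, which is why you want just enough copies to be safe and no more.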

So Twitter introduced a library called “libcrunch”, and here is what they had to say about it:

“Libcrunch understands the various data placement rules such as rack-awareness, understands how to replicate the data in way that minimizes risk of data loss while also maximizing the throughput of data recovery, and attempts to minimize the amount of data that needs to be moved upon any change in the cluster topology (such as when nodes are added or removed).”
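Twitter does not spell out the algorithm inside libcrunch, so the sketch below is not it. It is a small Python illustration of two of the ideas in the quote, rack-awareness and moving as little data as possible when the topology changes, using rendezvous (highest-random-weight) hashing as a stand-in technique. The function names and the topology layout are my own assumptions.

```python
import hashlib


def _score(blob_id, node):
    """Deterministic pseudo-random score for a (blob, node) pair."""
    digest = hashlib.sha256(f"{blob_id}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(blob_id, topology, replicas=3):
    """Pick nodes for a blob, at most one per rack.

    topology maps node name -> rack name. Nodes are ranked by their score
    for this blob, and the best-ranked node from each new rack is taken
    until enough replicas are chosen. If there are fewer racks than
    replicas, fewer nodes are returned.
    """
    ranked = sorted(topology, key=lambda node: _score(blob_id, node), reverse=True)
    chosen, used_racks = [], set()
    for node in ranked:
        rack = topology[node]
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == replicas:
            break
    return chosen


# Hypothetical cluster: 12 nodes spread over 4 racks.
topology = {f"node{i}": f"rack{i % 4}" for i in range(12)}
print(place("photo-123", topology))
```

Because each blob ranks the nodes independently, removing a node that a blob was not using leaves that blob's placement untouched. Only blobs that actually lived on the changed node have to move, which is the "minimize the amount of data that needs to be moved" property in miniature.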

Does that sound familiar again? This is the same play as Atmos from EMC, which uses intelligent APIs to manage all aspects of a piece of data. I referred to this last time as an “Object Store”, and the point is that the API itself understands what to do with a particular piece of data in terms of replication, security, encryption and protection. So we are no longer administering pools of storage; the API manages the data itself, and in the case of Twitter you have to admit that this would be the only way of doing it.

So what does the infrastructure look like? They use cheap hard drives to store the actual files, and the metadata is served from EFDs (enterprise flash drives) for increased speed. Think of metadata as a search engine: it allows you to find articles related to a query very quickly rather than looking through the entire web.
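The search engine comparison can be sketched in a few lines of Python. The two lookup functions below are hypothetical and only contrast asking every storage node in turn with consulting a small index kept on the fast tier; plain dicts and sets stand in for the real drives.

```python
def find_by_scan(blob_id, nodes):
    """Ask every node in turn. Returns (node name or None, nodes checked)."""
    checked = 0
    for name, blobs in nodes.items():
        checked += 1
        if blob_id in blobs:
            return name, checked
    return None, checked


def build_index(nodes):
    """Metadata index: blob id -> node name (the part kept on fast drives)."""
    return {blob_id: name for name, blobs in nodes.items() for blob_id in blobs}


def find_by_index(blob_id, index):
    """A single lookup, however many storage nodes there are."""
    return index.get(blob_id)


# Hypothetical layout: node name -> set of blob ids it holds.
nodes = {"node0": {"a"}, "node1": {"b"}, "node2": {"c"}}
index = build_index(nodes)
print(find_by_scan("c", nodes))   # has to walk through the nodes
print(find_by_index("c", index))  # goes straight to the answer
```

The scan gets slower as you add nodes, while the index lookup stays roughly constant, which is why it is worth spending money on fast drives for the small metadata tier and not for the photos themselves.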

So to sum this up: as we place more and more information into the cloud, which is a blend of distributed compute and network, locating that information is becoming more difficult and slow. Having APIs control the data according to policies is the right direction to take when using large cloud services.

If you are interested in looking at a cloud platform delivering intelligence like this, go to EMC Atmos.

 

 
