
Big Data DC #3

Two enterprise big data consulting companies presented the architectures they use for processing and storing data at the third Big Data DC meetup. Much like the first and second meetups, the common thread seemed to be the decisions the engineers made to optimize certain aspects over others.

First up was Joey Echeverria of Cloudera, who talked about using HBase in the real world. Joey’s presentation covered the basics of Hadoop and then dove into HBase, the database for Hadoop. He talked about the benefits of HBase, including a schema that can vary from record to record and operations that are atomic at the row level. He then gave a few examples of real-life applications, including Lily, an open source content repository; OpenTSDB, a distributed, scalable time series database from StumbleUpon; and Socorro, the crash report database used by Mozilla. Peruse Joey’s slides for more information on HBase.
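To make the two HBase benefits above a bit more concrete, here is a minimal sketch in plain Python. It is not the HBase client API: `ToyTable`, its methods, and the row and column names are all made up for illustration. It only shows the idea that each row can carry its own set of columns and that all writes to a single row are applied as one step.

```python
import threading


class ToyTable:
    """In-memory stand-in for a wide-column table (hypothetical, not HBase)."""

    def __init__(self):
        self._rows = {}                 # row key -> {column name: value}
        self._locks = {}                # row key -> lock guarding that row
        self._guard = threading.Lock()  # protects the two dicts above

    def _row_and_lock(self, row_key):
        # Create the row and its lock on first use.
        with self._guard:
            row = self._rows.setdefault(row_key, {})
            lock = self._locks.setdefault(row_key, threading.Lock())
        return row, lock

    def put(self, row_key, columns):
        """Apply every column write for one row while holding that row's lock."""
        row, lock = self._row_and_lock(row_key)
        with lock:
            row.update(columns)

    def get(self, row_key):
        """Return a copy of the row, so readers never see a half-applied put."""
        row, lock = self._row_and_lock(row_key)
        with lock:
            return dict(row)


table = ToyTable()
# Two rows in the same table with different columns: no fixed schema.
table.put("row-1", {"info:name": "Ada", "info:email": "ada@example.com"})
table.put("row-2", {"info:name": "Grace", "stats:visits": 3})
```

Reading the rows back with `table.get("row-1")` and `table.get("row-2")` returns dictionaries with different keys, which is the "variable schema" idea in miniature. Atomicity here only covers a single row, never a change that spans several rows.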

Next up, Ted Dunning from MapR spoke about the Hadoop distribution his company sells. Ted described the bottlenecks in Hadoop that MapR tries to solve with its implementation: read-only files, many copies in the I/O path, a shuffle based on HTTP, and spills that go to local file space. Ted spent a large part of his talk on maprfs, the file system they built to address these bottlenecks.

This meetup had the largest turnout of all the Big Data DC meetups so far. I can’t wait for the fourth meetup.


Big Data DC meetup #2

For the second time, a group of bright and talented developers gathered at Clearspring to discuss Big Data. The first Big Data DC meetup had a great turnout. Rather than write up a summary, I decided to check out Storify and build my first one for the meetup. Once again all of the talks were great. If you have an interest in big data and the technical ways to work with it, you should check it out.

View “Big Data DC #2” on Storify

What do you think of this format? Is this something you would like to see for future meetups or would you prefer a more traditional summary post?


The first Big Data DC meetup

At work I deal with a lot of data (i.e., we deal with as much data in a day as the Library of Congress collects in a month). Part of my job is making that data presentable to publishers. I don’t deal with the actual storing of the data (for that we have some brilliant engineers), but I think that in order to do my job correctly, I need to understand the full stack that we have. The first Big Data DC meetup was an opportunity for me to learn more about the bottom half of the stack.

Matt Abrams kicked things off with an overview of how Clearspring deals with big data. Big data might be a bit of an understatement. We are talking about 4-5 TB of data a day from 2.5 billion view events that need to be processed. How much data is this? Well, if it took one millisecond to process each event, it would take us about 29 days to process each full day of data. To accomplish this, Matt and the team follow four main design philosophies:

  • Speed over Safety
  • Simplicity over Complexity
  • At scale, small performance deltas matter
  • Close is good enough in many cases.
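The 29-day figure quoted before the list is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the event count comes from the talk, the one-millisecond cost is the same hypothetical used above, and the function names are mine.

```python
import math

EVENTS_PER_DAY = 2_500_000_000  # view events per day, as quoted in the talk
MS_PER_EVENT = 1                # hypothetical cost: one millisecond per event
SECONDS_PER_DAY = 86_400


def sequential_days(events, ms_per_event):
    """Days needed to process `events` one at a time on a single worker."""
    total_seconds = events * ms_per_event / 1000
    return total_seconds / SECONDS_PER_DAY


def workers_needed(events, ms_per_event):
    """Fewest perfectly parallel workers that finish a day's events in a day."""
    return math.ceil(sequential_days(events, ms_per_event))


days = sequential_days(EVENTS_PER_DAY, MS_PER_EVENT)
print(f"{days:.1f} days on one worker")
print(f"{workers_needed(EVENTS_PER_DAY, MS_PER_EVENT)} workers to keep up")
```

The same arithmetic shows why small performance deltas matter at this scale: under these assumptions, shaving a tenth of a millisecond off each event saves almost three worker-days of processing for every day of data.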

Take a look at Matt’s slides, which go through these philosophies in detail along with the stack that Clearspring uses to accomplish this big data task.

(Matt is also doing a series of blog posts about this topic. Check out the first one.)

Next up was Dave (whose last name I didn’t catch) from FoundationDB. FoundationDB is creating a distributed key-value store with transactions. Dave presented his philosophy that “The easiest way to build a scalable high performance fault tolerant application is on top of a scalable high performance fault tolerant foundation”. To do this, they have created Flow. Flow adds futures, promises and actors to C++. FoundationDB is entering beta soon. I look forward to seeing where they go with it.
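For anyone who hasn’t run into futures and promises before, here is a rough analogy using Python’s asyncio. It is not Flow and says nothing about how Flow is implemented; the function names are my own. In the usual terminology a promise is the write side of a value that doesn’t exist yet and a future is the read side. asyncio folds both roles into a single `Future` object, so the sketch passes the same object to both ends.

```python
import asyncio


async def producer(promise, value):
    """Write side: fulfil the result once the (simulated) work completes."""
    await asyncio.sleep(0.01)  # stand-in for a network or disk round trip
    promise.set_result(value)


async def consumer(future):
    """Read side: suspend until the value exists, then use it."""
    value = await future
    return value * 2


async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()  # one object plays both roles in asyncio
    # Run both coroutines concurrently; the consumer waits on the producer.
    results = await asyncio.gather(consumer(fut), producer(fut, 21))
    return results[0]


def run_demo():
    return asyncio.run(main())
```

Calling `run_demo()` returns the consumer’s result once the producer has set the value. Each coroutine reads a little like an actor here: it runs on its own and communicates only through the value it is waiting for.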

I’m looking forward to future Big Data DC events. It has been a few days since the meetup and I’m already anticipating learning how more companies are dealing with Big Data.

Want to work with Big Data?  Come work with me at Clearspring.