Wednesday, June 22, 2011

Buzzwords Keynote - Conclusion

In the end, it is up to us to make things better.  We need a way for non-Apache entities to interact with Apache productively.  If we can't do that, then it is quite possible that all of the momentum and excitement that Hadoop now has will be lost.  


Conclusion

The key is that we now have an eco-system, not just a community.  We can make it work.  Or we can elect to let it not work.  Not working is the default state.  We have to take positive action to avoid the default.

Apache can stay a strong voice for business friendly open source by remaining Apache.  Trying to make Apache broad enough to include all of the players in Hadoop and Hadoop derivatives will simply debase the voice of Apache into the average of opposing viewpoints, i.e. into nothing.

There are, however, many other players who are not part of Apache and who should probably not be part of Apache.  There needs to be a way for these others to engage with the Apache viewpoint.  It can't just be on the level of individuals from Apache trying to informally spread the Apache way even though that is critical to have.  It is likely to require a venue in which corporate entities can deal with something comparable to themselves.  A good analogy is how Mozilla participation in W3C has made the web a better place.

But we can make our eco-system work.   It isn’t what it was and it never will be again.

But it can be astonishing.

Let's make it so.

Buzzwords Keynote - Part 3

In the third part of my talk, I talked a bit about where Hadoop has come from and where it is going.  Importantly, this involves a choice about where Hadoop and the related companies, products and individuals might be able to take things.


Where we are and how we got here

My second section described the rough state of the Hadoop eco-system in a slightly provocative way.  In particular, I described a time when I was on a British train and in partial compensation for delays the operators announced that "free beer would be on sale in the galley car".  Free beer for sale is a wonderful analogy for the recent state of Hadoop and related software.

That said, there are serious problems brewing.  The current world of Hadoop is largely based on the assumption that the current community is all that there is.  This is a problem, however, because the current (Apache-based) community presumes interaction by individuals with a relatively common agenda.  More and more, however, the presence of a fundable business opportunity means that this happy world of individuals building software for the greater good has been invaded by non-human, non-individual corporations.  Corporations can't share the same agenda as the individuals involved in Apache and Apache is constitutively unable to allow corporate entities as members.

This means that the current community can no longer be the current world.  What we now have is not just a community with shared values but is now an eco-system with different kinds of entities, multiple agendas, direct competition and conflicting goals.  The Apache community is one piece of this eco-system.

Our choice of roads

Much as Dante once described his own situation, Hadoop now finds itself in the middle of the road of its life in a dark wood.  The members of the Apache community have a large voice in the future of Hadoop and related software.

As a darker option, the community can pretend that the eco-system that now exists of human and corporate participants is really a community.  If so, it is likely that the recent problems in moving Hadoop forward will continue and even get worse.  Commit wars and factionalization are likely to increase as corporate entities, denied a direct voice in Apache affairs, will tend to gain influence indirectly.  Paralysis in development will stall forward progress of Hadoop itself, leading to death by a thousand forks.  Such a dark world would let alternative frameworks such as Azure gain footholds and possibly dominate.

In a brighter alternative future, I think that there are ways to create a larger forum in which corporate voices can be heard in their true form rather than via conflicts of interest.  In this scenario, Apache would be stronger because it really can be a strong voice of the open source community.  Rather than being the average of conflicting views, Apache would be free to express the shared values of open source developers.  Corporations would be able to express their goals, some shared, some not, in a more direct form and would not need so much to pull the strings of Apache committers.  Importantly, I would hope that Hadoop could become something analogous to a reference implementation and that commercial products derived from Hadoop would have a good way to honor their lineage without finding it difficult to differentiate themselves from the original.  Hopefully in this world innovation would be welcomed, but users would be able to get a more predictable experience because they would be able to pick products offering whatever trade-off between innovation rate and stability they desire.  Importantly, there would be many winners in such a world since different players would measure success in different terms.

We have a key task ahead of us to define just what kind of eco-system we want.  It can be mercenary and driven entirely by corporate goals.  This could easily happen if Apache doesn't somehow facilitate the creation of a forum for eco-system discussion.  In such an eco-system, it is to be expected that the companies that have shown a strong talent at dominating standards processes and competing in often unethical ways will dominate.  My first thought when I imagine such a company is Microsoft, but that is largely based on having been on the receiving end of their business practices.  I have no illusions that talent for that kind of work is exclusively found in Redmond.

In my talk, I proposed some colorful cosmological metaphors for possible worlds, but the key question is how we can build a way for different kinds of entities to talk.  It is important to recognize different values and viewpoints.  Apache members need to understand that not everything is based on individual action, nor do corporations hold the same values.  Companies need to take a strong stance to recognize the incredible debt owed to the Apache community for creating the opportunities we all see.

If we can do this, then Hadoop (and off-spring) really does have a potential to dominate business computing.


Buzzwords Keynote - Part 2

In the first part of the talk, I made the case that Apache Hadoop has lots of head-room in terms of performance.  This translates into lots of opportunity both for open source developers to make Hadoop itself better and for companies to build products that derive from Hadoop but improve it in various ways.

The $S$ score

In honor of Steve Jobs whose highest praise is reputedly to say "that doesn't suck", I proposed an $S$ score whose highest possible value is zero, but which is always negative for real systems.  For a batch data processing system like Hadoop, I proposed that a good definition of $S$ was the log base 10 of the ratio of the actual performance to the performance implied by hardware limits.

Not surprisingly, the overall score for Hadoop comes out to be somewhere between $-5$ and $-2$ depending on the workload (i.e. Hadoop runs programs somewhere between 100 and 100,000 times slower than the hardware would allow).  For some aspects, Hadoop's $S$ score can be as good as $-0.5$, but generally there are multiple choke-points and some of these are additive.  This is hardly news and isn't even a mark of discredit to Hadoop, since the developers of Hadoop have always prized getting things to work and to work at scale above getting things to work within an iota of the best the hardware can do at a particular scale.  Another factor that drives $S$ down for Hadoop is the fact that the hardware we use has changed dramatically over the 6-7 year life of Hadoop.
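As a rough illustration of the arithmetic behind $S$, here is a minimal Python sketch.  The function names and the throughput figures are hypothetical and chosen only for illustration; they are not measurements of Hadoop.  The last function assumes that the choke-points act in series, each scaling throughput by its own ratio, which is one way to read the remark that some of them are additive.

```python
import math

def s_score(actual, hardware_limit):
    """S = log10(actual performance / performance implied by hardware limits)."""
    if actual <= 0 or hardware_limit <= 0:
        raise ValueError("performance figures must be positive")
    return math.log10(actual / hardware_limit)

def slowdown(s):
    """How many times slower than the hardware limit a given S implies."""
    return 10 ** (-s)

def combined_score(stage_scores):
    """Assumption: choke-points in series multiply as ratios, so their S scores add."""
    return sum(stage_scores)

# Hypothetical figures: a job that moves 2 MB/s on hardware that could sustain 200 MB/s.
example = s_score(2.0, 200.0)

# Hypothetical per-stage scores for three choke-points in series.
total = combined_score([-0.5, -1.0, -1.5])
```

With these made-up numbers, `example` is $-2$ (a 100-fold slowdown), and three modest choke-points in series already add up to $-3$.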

In defining the value of $S$ for current Hadoop versions, I don't mean to include algorithm changes.  Michael Stonebraker has become a bit famous for running down Hadoop for not doing database-like things with database-like algorithms, but I would like to stick to the question of how fast Hadoop could do what Hadoop is normally and currently used to do.

The key conclusion is that having such a low $S$ combined with high demand for Hadoop-like computation represents a lot of opportunity.  Part of that opportunity is for the open source community to make things better.  Part of it is for commercial companies to make money.  The latter kind of opportunity is what is going to shake up the currently cozy Hadoop community the most.

Buzzwords Keynote ... blog edition

There has been a bit of demand for an expanded version of my Buzzwords keynote from a few weeks ago.  This demand has been increased by a particularly unfortunate mis-quote in a tweet that suggested that I thought that there was a need for a new organization "to supersede Apache".  Of course, I suggested nothing of the sort so it is a good idea to walk through the ideas that I presented.  The Buzzwords site has the video of my talk and a PDF of my slides in case you want to follow along.

The talk was divided into several sections.  The first one proposed the uncontroversial thesis that Hadoop performs at a level far below the potential offered by modern hardware.  A second section pointed out difficulties with the current social structure surrounding the development of Hadoop and related software.  I then examined what I see as possible futures while describing how I think we will be choosing between these alternative futures.  I will post each section in a separate blog entry.

As I spoke, I encouraged the audience to tweet using the hash-tag #bbuzz and to keep communal notes on a shared Google document.  The tweets are a bit hard to find as befits ephemeral media, but the shared notes are still accessible.

Sections:

The S Score
Possible Futures
Conclusion


Monday, June 20, 2011

Buzzwords Wrapup

Well, Buzzwords is over and my primary conclusion is that I wish I had come last year as well as this year.  Isabel and Simon are really making Buzzwords into a major open source conference and with the demise of the European ApacheCon, Buzzwords is probably the first or second most important open source conference in Europe.  If you can only choose one, I would strongly recommend geeking in Berlin.  It isn't just the conference; there are bunches of related events such as informal dinners, bar camps and hackathons.  Since Buzzwords makes such a strong effort to include North American participants, you may even have a better chance of connecting globally by going to Europe than by going to a conference in the US or Canada.

The conference itself consisted of two days of scheduled events anchored by keynotes each day.  Doug Cutting gave the first keynote and covered a lot of the history and current state of Hadoop.  As always, his talk was very well done and contained quite a bit of technical information, which is refreshing in a keynote.  I gave the second keynote and talked a bit about the state and future of Hadoop, related Apache projects and the burgeoning commercial marketplace.  Some of what I said stirred up a bit of talk, which is good since my primary thesis is that we aren't talking enough about how the world of Hadoop and related software is rapidly changing in ways that aren't well recognized.  Stay tuned here for a blog edition of my talk.

There were quite a few excellent technical talks as well.  Among the scheduled talks, Jonathan Gray gave a talk with his usual and customary dose of excellent technical information about how Facebook is using HBase.  A notable moment came when he was asked about the state of Cassandra at Facebook.  Check out the upcoming video for details on his answer.

Dawid Weiss gave an excellent talk on finite state automata and the difference between deterministic and non-deterministic variants.  The only defect I could see in his presentation was that we couldn't see the eagles on the coins.  Based on the fact that the room was packed (I sat in the aisle on the floor) and the very eager audience questions, I would say that there is a surprisingly strong market place for information on foundational algorithms like finite state transformers.

The lightning talks at the end of day two also had some gems.  Thomas Hall's northern accent blended charmingly with the frank assessment of some of his experiences with certain technical approaches.  I can't possibly convey the tone and content so, yet again, you will need to refer to his slides and the video on the conference web-site.

Frank Scholten also had a lightning talk that contained a very nice walk-through of Mahout document clustering.  What he showed is a work in progress, but already what he has provides a highly requested set of recipes to illustrate a lot of the software in Mahout.

Outside of the conference there was an (excellent) barcamp run by Nick Burch.  I think I learned as much about how to run a barcamp by watching him as anybody did from any of the technical discussions and the technical discussions were pretty excellent.

I have to say that if you want to see me next year in early June, there is a high likelihood that you will have to be in Berlin to do it.

See http://berlinbuzzwords.de/wiki/linkstoslides to get slides from talks.

Wednesday, June 8, 2011

The Best Illustration of a Probability Distribution

Describing a probability distribution in the abstract to a novice is often difficult.

Here it is in concrete form.  Note how some keys are more worn than others.  This is in Germany, so the ground floor is labeled "0".  Note the wear on the "door close" button!

Visit to DIMA at Technische Universitaet

I had a great visit today at the DIMA laboratory at TU in Berlin.  They are working on a system called Stratosphere which provides an interesting generalization of map-reduce.  Of particular interest is the run-time flexibility for adapting how the flow partitions or transfers data.

They accomplish this by having a lower level abstraction layer that supports a larger repertoire of basic options beyond just map and reduce.  These operations include match, cross product and co-group.  Having a wider range of operations and retaining some additional flow information at that level allows them to do on-the-fly selection of the detailed algorithm for different operations based on the statistics of the  data and the properties of the user-supplied functions.
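To make the extra operations concrete, here is a minimal in-memory Python sketch of what match, cross product and co-group could look like over lists of (key, value) pairs.  This is not Stratosphere's API.  The function names, signatures and exact semantics are assumptions inferred from the operation names, and a real system would run these in parallel and pick the detailed algorithm from data statistics.

```python
from collections import defaultdict
from itertools import product

def group_by_key(pairs):
    """Collect the values of (key, value) pairs into a list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def cross(left, right, fn):
    """Cross product: call fn on every combination of a left and a right record."""
    return [fn(l, r) for l, r in product(left, right)]

def match(left, right, fn):
    """Match: call fn once for every pair of values that share a key (an equi-join)."""
    right_groups = group_by_key(right)
    return [fn(key, lv, rv) for key, lv in left for rv in right_groups.get(key, [])]

def cogroup(left, right, fn):
    """Co-group: call fn once per key with all left values and all right values."""
    lg, rg = group_by_key(left), group_by_key(right)
    return [fn(key, lg.get(key, []), rg.get(key, [])) for key in sorted(set(lg) | set(rg))]

# Hypothetical toy data.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]

joined = match(users, orders, lambda key, user, item: (user, item))
counts = cogroup(users, orders, lambda key, us, items: (key, len(us), len(items)))
```

In this sketch, match only sees keys present on both sides, while co-group also sees keys that appear on just one side, which is the kind of difference an optimizer could exploit when choosing how to partition or ship the data.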

Here's a pic of me answering questions about startups and log-likelihood ratio tests.