<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4453805095942812863</id><updated>2011-11-13T15:18:55.510-08:00</updated><title type='text'>Surprise and Coincidence - musings from the long tail</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>47</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-7335920488725420253</id><published>2011-06-22T14:06:00.000-07:00</published><updated>2011-06-22T14:06:28.390-07:00</updated><title type='text'>Buzzwords Keynote - Conclusion</title><content type='html'>&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In the end, it is up to us to make things better. &amp;nbsp;We need a way for non-Apache entities to interact with Apache productively. &amp;nbsp;If we can't do that, then it is quite possible that all of the momentum and excitement that Hadoop now has will be lost. &amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;Conclusion&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The key is that we now have an eco-system, not just a community. &amp;nbsp;We can make it work. &amp;nbsp;Or we can elect to let it not work. &amp;nbsp;Not working is the default state. &amp;nbsp;We have to take positive action to avoid the default.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Apache can stay a strong voice for business friendly open source by remaining Apache. &amp;nbsp;Trying to make Apache broad enough to include all of the players in Hadoop and Hadoop derivatives will simply debase the voice of Apache into the average of opposing viewpoints, i.e. into nothing.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;There&amp;nbsp;&lt;i&gt;are,&lt;/i&gt;&amp;nbsp;however, many other players who are not part of Apache and who should probably not be part of Apache. &amp;nbsp;There needs to be a way for these others to engage with the Apache viewpoint. &amp;nbsp;It can't just be on the level of individuals from Apache trying to informally spread the Apache way even though that is critical to have. &amp;nbsp;It is likely to require a venue in which corporate entities can deal with something comparable to themselves. &amp;nbsp;A good analogy is how Mozilla participation in W3C has made the web a better place.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;But we can make our eco-system work. &amp;nbsp;&amp;nbsp;It isn’t what it was and&amp;nbsp;it never will be again.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;But it can be astonishing&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Let's make it so.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-7335920488725420253?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/7335920488725420253/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=7335920488725420253' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7335920488725420253'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7335920488725420253'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/buzzwords-keynote-conclusion.html' title='Buzzwords Keynote - Conclusion'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-4722329546196284399</id><published>2011-06-22T14:04:00.000-07:00</published><updated>2011-06-22T14:04:46.587-07:00</updated><title type='text'>Buzzwords Keynote - Part 3</title><content type='html'>&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;In the third part of my talk, I talked a bit about where Hadoop has come from and where it is going. &amp;nbsp;Importantly, this involves a choice about where Hadoop and the related companies products and individuals might be able to take things.&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br class="Apple-interchange-newline" /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;Where we are and how we got here&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;My second section described the rough state of the Hadoop eco-system is a slightly provocative way. &amp;nbsp;In particular, I described a time when I was on a British train and in partial compensation for delays the operators announced that "free beer would be on sale in the galley car". &amp;nbsp;Free beer for sale is a wonderful analogy for the recent state of Hadoop and related software.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;That said, there are serious problems brewing. &amp;nbsp;The current world of Hadoop is largely based on the assumption that the current community is all that there is. &amp;nbsp;This is a problem, however, because the current (Apache-based) community presumes interaction by individuals with a relatively common agenda. &amp;nbsp;More and more, however, the presence of a fundable business opportunity means that this happy world of individuals building software for the greater good has been invaded by non-human, non-individual corporations. &amp;nbsp;Corporations can't share the same agenda as the individuals involved in Apache and Apache is constitutively unable to allow corporate entities as members.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;This means that the current community can no longer be the current world. &amp;nbsp;What we now have is not just a community with shared values but is now an eco-system with different kinds of entities, multiple agendas, direct competition and conflicting goals. &amp;nbsp;The Apache community is one piece of this eco-system.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;Our choice of roads&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Much as Dante once described his own situation, Hadoop now finds itself in the middle of the road of its life in a dark wood. &amp;nbsp;The members of the Apache community have a large voice in the future of Hadoop and related software.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;As a darker option, the community can pretend that the eco-system that now exists of human and corporate participants is really a community. &amp;nbsp;If so, it is likely that the recent problems in moving Hadoop forward will continue and even get worse. &amp;nbsp;Commit wars and factionalization are likely to increase as corporate entities, denied a direct voice in Apache affairs, will tend to gain influence indirectly. &amp;nbsp;Paralysis in development will stall forward progress of Hadoop itself leading to death by a thousand forks. &amp;nbsp;Such a dark world would let alternative frameworks such as Azure to gain footholds and possibly to dominate.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In this brighter alternative future, I think that there are ways to create a larger forum in which corporate voices can be heard in their true form rather than via conflicts of interest. &amp;nbsp;In this scenario, Apache would be stronger because it really can be a strong voice of the open source community. &amp;nbsp;Rather than being the average of conflicting views, Apache would be free to express the shared values of open source developers. &amp;nbsp;Corporations would be able to express their goals, some shared, some not in a more direct form and would not need so much to pull the strings of Apache committers. &amp;nbsp;Importantly, I would hope that Hadoop could become something analogous to a reference implementation and that commercial products derived from Hadoop would have a good way to honor their lineage without finding it difficult to differentiate themselves from the original. &amp;nbsp;Hopefully in this world innovation would be welcomed, but users would be able to get a more predictable experience because they would be able to pick products offering whatever innovation rate/stability trade-off that they desire. &amp;nbsp;Importantly, there would be many winners in such a world since different players would measure success in different terms.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;We have a key task ahead of us to define just what kind of eco-system we want. &amp;nbsp;It can be mercenary and driven entirely be corporate goals. &amp;nbsp;This could easily happen if Apache doesn't somehow facilitate the creation of a forum for eco-system discussion. &amp;nbsp;In such an eco-system, it is to be expected that the companies that have shown a strong talent at dominating standards processes and competing in often unethical ways will dominate. &amp;nbsp;My first thought when I imagine such a company is Microsoft, but that is largely based on having been on the receiving end of their business practices. &amp;nbsp;I have no illusions that talent for that kind of work is exclusively found in Redmond.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In my talk, I proposed some colorful cosmological metaphors for possible worlds, but the key question is how we can build a way for different kinds of entities to talk. &amp;nbsp;It is important to recognize different values and viewpoints. &amp;nbsp;Apache members need to understand that not everything is based on individual action, nor do corporation hold the same values. &amp;nbsp;Companies need to take a strong stance to recognize the incredible debt owed to the Apache community for creating the opportunities we all see.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;If we can do this, then Hadoop (and off-spring) really does have a potential to dominate business computing.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-4722329546196284399?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/4722329546196284399/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=4722329546196284399' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4722329546196284399'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4722329546196284399'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/buzzwords-keynote-part-3.html' title='Buzzwords Keynote - Part 3'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-1097007648347287665</id><published>2011-06-22T14:02:00.000-07:00</published><updated>2011-06-22T14:02:31.078-07:00</updated><title type='text'>Buzzwords Keynote - Part 2</title><content type='html'>&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;In the first part of the talk, I made the case that Apache Hadoop has lots of head-room in terms of performance. &amp;nbsp;This translates into lots of opportunity both for open source developers to make Hadoop itself better, but also for companies to build products that derive from Hadoop but improve it in various ways.&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;br class="Apple-interchange-newline" /&gt;The $S$ score&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In honor of Steve Jobs whose highest praise is reputedly to say "that doesn't suck", I proposed an $S$ score whose highest score is zero, but for all real systems is always negative. &amp;nbsp;For a batch, data processing system like Hadoop, I proposed that a good definition of $S$ was the log base 10 of the ratio of the actual performance to the performance implied by hardware limits.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Not suprisingly, the overall score for Hadoop comes out to be somewhere between -5 to -2 depending on desired workload (i.e. Hadoop runs programs somewhere between 100 and 100,000 times slower than the hardware would allow). &amp;nbsp;For some aspects, Hadoop's $S$ score can be as good as $-0.5$ but generally there are multiple choke-points and some of these are additive. &amp;nbsp;This is hardly news and isn't even a mark of discredit to Hadoop since the developers of Hadoop have always prized getting things to work and to work at scale above getting things to work within an iota of the best the hardware can do at a particular scale. &amp;nbsp;Another factor that drives $S$ down for Hadoop is the fact that the hardware we use has changed dramatically over the 6-7 year life of Hadoop.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In defining the value of $S$ for current Hadoop versions, I don't mean to include algorithm changes. &amp;nbsp;Michael Stonebraker has become a bit famous for running down Hadoop for not doing database-like things with database-like algorithms, but I would like to stick to the question of how fast Hadoop could do what Hadoop is normally and currently used to do.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The key conclusion is that having such a low $S$ combined with high demand for Hadoop-like computation represents a lot of opportunity. &amp;nbsp;This opportunity involves opportunities for the open source community to make things better. &amp;nbsp;It also represents opportunities for commercial companies to make money. &amp;nbsp;The latter kind of opportunity is what is going to shake up the currently cozy Hadoop community the most.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-1097007648347287665?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/1097007648347287665/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=1097007648347287665' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1097007648347287665'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1097007648347287665'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/buzzwords-keynote-part-2.html' title='Buzzwords Keynote - Part 2'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-1761542489815861873</id><published>2011-06-22T14:00:00.000-07:00</published><updated>2011-06-22T14:09:08.069-07:00</updated><title type='text'>Buzzwords Keynote ... blog edition</title><content type='html'>There has been a bit of demand for an expanded version of my Buzzwords keynote from a few weeks ago. &amp;nbsp;This demand has been increased by a particular unfortunate mis-quote in a tweet that suggested that I thought that there was a need for a new organization "to supersede Apache". &amp;nbsp;Of course, I suggested nothing of the sort so it is a good idea to walk through the ideas that I presented. &amp;nbsp;The buzz words site has the &lt;a href="http://berlinbuzzwords.de/content/keynotes-online"&gt;video of my talk&lt;/a&gt;&amp;nbsp;and &lt;a href="http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/buzz-words-ted-dunning.pdf"&gt;pdf of my slides&lt;/a&gt; in case you want to follow along.&lt;br /&gt;&lt;br /&gt;The talk was divided into several sections. &amp;nbsp;The first one proposed the uncontroversial thesis that Hadoop performs at a level far below the potential offered by modern hardware. &amp;nbsp;A second section pointed out difficulties with the current social structure surrounding the development of Hadoop and related software. &amp;nbsp;I then examined what I see as possible futures while describing how I think we will be choosing between these alternative futures. &amp;nbsp;I will post each section in a separate blog entry.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;As I spoke, I encouraged the audience to tweet using the hash-tag #bbuzz and to keep communal notes on a &lt;a href="http://tinyurl.com/buzzwords-ted-dunning"&gt;shared Google document&lt;/a&gt;. &amp;nbsp;The tweets are a bit hard to find as befits ephemeral media, but the shared notes are still accessible.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;Sections:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://tdunning.blogspot.com/2011/06/buzzwords-keynote-part-2.html"&gt;The S Score&lt;/a&gt;&lt;br /&gt;&lt;a href="http://tdunning.blogspot.com/2011/06/buzzwords-keynote-part-3.html"&gt;Possible Futures&lt;/a&gt;&lt;br /&gt;&lt;a href="http://tdunning.blogspot.com/2011/06/buzzwords-keynote-conclusion.html"&gt;Conclusion&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;div style="direction: ltr; margin-bottom: 0pt; margin-left: .38in; margin-top: 7.68pt; mso-line-break-override: none; punctuation-wrap: hanging; text-align: left; text-indent: -.38in; unicode-bidi: embed; word-break: normal;"&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-1761542489815861873?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/1761542489815861873/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=1761542489815861873' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1761542489815861873'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1761542489815861873'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/buzzwords-keynote-blog-edition.html' title='Buzzwords Keynote ... blog edition'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2825905620751037920</id><published>2011-06-20T21:06:00.000-07:00</published><updated>2011-06-20T21:06:13.913-07:00</updated><title type='text'>Buzzwords Wrapup</title><content type='html'>Well, Buzzwords is over and my primary conclusion is that I wish I had come last year as well as this year. Isabel and Simon are really making Buzzwords into a major open source conference and with the demise of the European ApacheCon, Buzzwords is probably the the first or second most important open source conference in Europe. &amp;nbsp;If you only can choose one, I would strongly recommend geeking in Berlin. &amp;nbsp;It isn't just the conference; there are bunches of related events such as informal dinners, bar camps and hackathons. &amp;nbsp;Since Buzzwords makes such a strong effort to include North American participants you may even have a better chance of connecting globally by going to Europe than going to a conference in the US or Canada.&lt;br /&gt;&lt;br /&gt;The conference itself consisted of two days of scheduled events anchored by keynotes each day. &amp;nbsp;Doug Cutting gave the first keynote and covered a lot of the history and current state of Hadoop. &amp;nbsp;As always, his talk was very well done and contained quite a bit of technical information which is refreshing in a keynote. &amp;nbsp;I gave the second keynote and talked a bit about the state and future of Hadoop, related Apache projects and the burgeoning commercial marketplace. &amp;nbsp;Some of what I said stirred up a bit of talk, which is good since my primary thesis that we aren't talking enough about how the world of Hadoop and related software is rapidly changing in ways that aren't well recognized. &amp;nbsp;Stay tuned here for a blog edition of my talk.&lt;br /&gt;&lt;br /&gt;There were quite a few excellent technical talks as well. &amp;nbsp;Among the scheduled talks, Jonathan Gray gave a talk which his usual and customary dose of excellent technical information about how Facebook is using Hbase. &amp;nbsp;A notable moment came when he was asked about the state of Cassandra at Facebook. &amp;nbsp;Check out the upcoming video for details on his answer.&lt;br /&gt;&lt;br /&gt;Dawid Weiss gave an excellent talk on finite state automata and the difference between deterministic and non-deterministic variants. &amp;nbsp;The only defect I could see in his presentation was that we couldn't see the eagles on the coins. &amp;nbsp;Based on the fact that the room was packed (I sat in the aisle on the floor) and the very eager audience questions, I would say that there is a surprisingly strong market place for information on foundational algorithms like finite state transformers. &lt;br /&gt;&lt;br /&gt;The lightning talks at the end of day two also had some gems. &amp;nbsp;Thomas Hall's northern accent blended charmingly with the frank assessment of some of his experiences with certain technical approaches. &amp;nbsp;I can't possibly convey the tone and content so, yet again, you will need to refer to his slides and the video on the conference web-site.&lt;br /&gt;&lt;br /&gt;Frank Scholten also had a lightning talk that contained a very nice walk-through of Mahout document clustering. &amp;nbsp;What he showed is a work in progress, but already what he has provides a highly requested set of recipes to illustrate a lot of the software in Mahout.&lt;br /&gt;&lt;br /&gt;Outside of the conference there was an (excellent) barcamp run by Nick Burch. &amp;nbsp;I think I learned as much about how to run a barcamp by watching him as anybody did from any of the technical discussions and the technical discussions were pretty excellent. &lt;br /&gt;&lt;br /&gt;I have to say that if you want to see me next year in early June, there is a high likelihood that you will have to be in Berlin to do it.&lt;br /&gt;&lt;br /&gt;See&amp;nbsp;http://berlinbuzzwords.de/wiki/linkstoslides to get slides from talks.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2825905620751037920?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2825905620751037920/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2825905620751037920' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2825905620751037920'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2825905620751037920'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/buzzwords-wrapup.html' title='Buzzwords Wrapup'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3180467391210502058</id><published>2011-06-08T16:04:00.000-07:00</published><updated>2011-06-08T16:04:25.799-07:00</updated><title type='text'>The Best Illustration of a probability Distribution</title><content type='html'>Describing a probability distribution in the abstract to a novice is often difficult.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-wc8qEOoyL1s/Te__Z13wf6I/AAAAAAAAAHc/DT4JjT9IKsU/s1600/pdf-small.jpg" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-wc8qEOoyL1s/Te__Z13wf6I/AAAAAAAAAHc/DT4JjT9IKsU/s1600/pdf-small.jpg" /&gt;&lt;/a&gt;Here it is in concrete form. &amp;nbsp;Note how some keys are more worn than others. &amp;nbsp;This is in Germany so that the ground floor is labeled "0". &amp;nbsp;Note the wear on the "door close" button!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3180467391210502058?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3180467391210502058/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3180467391210502058' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3180467391210502058'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3180467391210502058'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/best-illustration-of-probability.html' title='The Best Illustration of a probability Distribution'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-wc8qEOoyL1s/Te__Z13wf6I/AAAAAAAAAHc/DT4JjT9IKsU/s72-c/pdf-small.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3246347620055734268</id><published>2011-06-08T16:00:00.000-07:00</published><updated>2011-06-08T23:17:47.086-07:00</updated><title type='text'>Visit to DIMA at Technische Universitaet</title><content type='html'>I had a great visit today at the &lt;a href="http://www.stratosphere.eu/"&gt;DIMA laboratory at TU in Berlin&lt;/a&gt;. &amp;nbsp;They are working on an interesting system called Stratosphere which provides an interesting generalization generalization of map-reduce. &amp;nbsp;Of particular interest is the run-time flexibility for adapting how the flow partitions or transfers data.&lt;br /&gt;&lt;br /&gt;They accomplish this by having a lower level abstraction layer that supports a larger&amp;nbsp;repertoire of basic options beyond just map and reduce. &amp;nbsp;These operations include match, cross product and co-group. &amp;nbsp;Having a wider range of operations and retaining some additional flow information at that level allows them to do on-the-fly selection of the detailed algorithm for different operations based on the statistics of the &amp;nbsp;data and the properties of the user-supplied functions.&lt;br /&gt;&lt;br /&gt;Here's a pic of me answering questions about startups and log-likelihood ratio tests.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-xDvcKsIjVEE/Te_--oK2ndI/AAAAAAAAAHY/uNPMuLYOTLw/s1600/dima-visit-small.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="191" src="http://4.bp.blogspot.com/-xDvcKsIjVEE/Te_--oK2ndI/AAAAAAAAAHY/uNPMuLYOTLw/s320/dima-visit-small.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3246347620055734268?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3246347620055734268/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3246347620055734268' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3246347620055734268'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3246347620055734268'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/visit-to-dima-at-technische.html' title='Visit to DIMA at Technische Universitaet'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-xDvcKsIjVEE/Te_--oK2ndI/AAAAAAAAAHY/uNPMuLYOTLw/s72-c/dima-visit-small.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3619541449336287245</id><published>2011-06-01T18:02:00.000-07:00</published><updated>2011-06-01T22:09:55.635-07:00</updated><title type='text'>A Tour of the Multi-update For Zookeeper</title><content type='html'>I have recently been working on some new features for Apache Zookeeper that represent probably the largest change in the Zookeeper client API in several years. &amp;nbsp;Other recent changes such as observers and read-only clusters have changed the server-side semantics, but the client API is nearly unchanged since the original public release of Zookeeper several years ago. &amp;nbsp;The JIRA covering the overall issue is&amp;nbsp;&lt;a href="https://issues.apache.org/jira/browse/ZOOKEEPER-965"&gt;Zookeeper-965&lt;/a&gt;&amp;nbsp;and you can look at the code as it is being refined into committable state in my fork of Zookeeper that I keep&amp;nbsp;&lt;a href="https://github.com/tdunning/zookeeper"&gt;on github&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;(&lt;a href="http://mapr.com/?p=83&amp;amp;option=com_wordpress&amp;amp;Itemid=134"&gt;related announcement&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;The Problem&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;Almost from the beginning of the Zookeeper project, there have been repeated questions on the mailing about how to update several nodes at once. &amp;nbsp;The answer has always been to consolidate the structures that you need to update atomically into the contents of a single znode and then update those contents atomically using standard methods. &amp;nbsp;This update problem is often exacerbated by a desire to use ephemeral nodes so that state disappears correctly when a service crashes.&lt;br /&gt;&lt;br /&gt;This is a pretty reasonable answer and it handle many important cases fairly well. &amp;nbsp;For instance, in the common case of servers that are assigned certain roles and then in turn advertise which roles they have successfully taken on, this pattern can be used by giving each server a single file of assignments and a single file of advertised capabilities. &amp;nbsp;Each of these files can be ephemerally created by the server so they will both vanish when the server disappears. &amp;nbsp;Assignments can be added atomically and only the server ever modifies the advertised roles. &amp;nbsp;Zookeeper works quite well in these cases.&lt;br /&gt;&lt;br /&gt;Keeping track of a graph full of nodes with directional connections to other nodes is a good example of where Zookeeper's update model is not so nice. &amp;nbsp; In this case, nodes have a list of connections to other nodes and a list of connections from other nodes. &amp;nbsp;If any node has a list of outgoing connections that are not matched by the corresponding list of incoming connections on other nodes, then the graph is referentially inconsistent. &amp;nbsp;Keeping the graph consistent under changes is not possible with normal Zookeeper updates unless you store the entire graph in a single file.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;Lazy Operations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The new &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;multi() &lt;/span&gt;method allows such a data structure to be updated with strong guarantees of consistency. &amp;nbsp;This is done by currying the normal database mutation operators and then passing all of the resulting closures to Zookeeper at once for execution. &amp;nbsp;Obviously this requires a bit of syntactic sugar in languages like Java and C which don't like to express closures directly.&lt;br /&gt;&lt;br /&gt;The way that this works is that there is a new class call &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op&lt;/span&gt;. &amp;nbsp;There are sub-classes of &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op&lt;/span&gt; called &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op.Create&lt;/span&gt;, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op.SetData&lt;/span&gt;, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op.Check&lt;/span&gt; and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op.Delete&lt;/span&gt; that correspond to the operations &amp;nbsp;normally invoked by calling the similarly named methods on a ZooKeeper object. &amp;nbsp;In essence, these sub-classes of Op represent reified methods or closures that can be frozen at one point in time and then executed later. &amp;nbsp;These sub-classes are instantiated using factory methods on&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op&amp;nbsp;&lt;/span&gt;class. &amp;nbsp;The names and signatures of these factory methods mimic the corresponding methods on ZooKeeper.&lt;br /&gt;&lt;br /&gt;Once you have all of the operations you would like to perform, you can execute them all in a single transaction using &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;ZooKeeper.multi().&lt;/span&gt; &amp;nbsp;This will either execute all of the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Op&lt;/span&gt;'s or none of them.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;An Example&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;a href="http://1.bp.blogspot.com/-Zfjultl0Hxs/TeRllzvRoxI/AAAAAAAAAHQ/3aYjZhX1P-8/s1600/hyper-cube.png" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="312" src="http://1.bp.blogspot.com/-Zfjultl0Hxs/TeRllzvRoxI/AAAAAAAAAHQ/3aYjZhX1P-8/s320/hyper-cube.png" width="320" /&gt;&lt;/a&gt;I have placed an example program for doing a &lt;a href="https://github.com/tdunning/graph-demo"&gt;graph-based computation&lt;/a&gt;&amp;nbsp;over on github. &amp;nbsp;This program builds a graph consisting of 32 nodes in the form of a 5-dimensional hyper-cube and then uses a numerical technique called over-relaxation to solve for the voltages that would result if all the links in the hyper-cube were replaced by unit resistors. &amp;nbsp;A picture of the graph is just to the right. In this graph, the voltage for node 0x1F is labeled as $V_5$ and the voltage for node 0x0 is labeled as $V_0$. &amp;nbsp;There are lots of connections and solving this problem analytically is difficult unless you twig to the trick of using symmetry. &amp;nbsp;Real-world networks generally don't exhibit such congenial properties and can be nearly impossible to solve analytically.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Normally, of course, we wouldn't actually use Zookeeper for this computation, but the point here isn't so much a realistic computation as a test-bed for the new multi() method and a demonstration of how the closure based style associated with multi-ops makes code easier to read and understand.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;The Code&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="font-family: Times; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times; font-size: small;"&gt;To read the code, start with the class&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;ZGraphTest&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Times; font-size: small;"&gt;. &amp;nbsp;This is a JUnit test that drives all the rest of the code. &amp;nbsp;In this test, a graph is constructed (named &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: small;"&gt;zg&lt;/span&gt; in the code), nodes are connected in the pattern of a hyper-cube which means that nodes are labeled with integers from 0 to 31 and a node $n_1$ is connected to another node $n_2$ if $n_1$ and $n_2$ differ in exactly one bit.&lt;/div&gt;&lt;div style="font-family: Times; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt;In each iteration of the code, the &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;ZGraph.reweight()&lt;/span&gt; method of the graph is called on a randomly selected node. &amp;nbsp;This sets the value at that node to be the average of the values at the neighbors of that node. &amp;nbsp;This computation converges geometrically to the correct value with errors decreasing by a factor of 5-10 with every 1000 iterations. &amp;nbsp;As the actual computation proceeds the error versus the theoretically known correct values is shown every 1000 iterations and &amp;nbsp;you can see the errors go to zero with 6 significant figures at about 7000 iterations.&lt;/div&gt;&lt;div style="font-family: Times; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt;Internally, a &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;ZGraph&lt;/span&gt; keeps a connection to Zookeeper and the name of the root directory for all of the nodes in the graph. &amp;nbsp;Methods like &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;connect()&lt;/span&gt;, &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;reweight()&lt;/span&gt; and &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;update()&lt;/span&gt; all work by defining lazy update or version check operations on &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Node&lt;/span&gt; objects and then executing all of the update or version checks at one time in a transaction. &amp;nbsp;For instance, in the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;reweight()&lt;/span&gt; method, this code gets the weight of all of the neighbors of node g1:&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; double mean = 0;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Set&lt;integer&gt; neighbors = g1.getIn();&lt;/integer&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; List&lt;op&gt; ops = Lists.newArrayList();&lt;/op&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; for (Integer neighbor : neighbors) {&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Node n = getNode(neighbor);&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ops.add(n.checkOp(root));&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; mean += n.getWeight();&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Times;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Times;"&gt;As a side effect, we also collect a list of version check operations into the list ops. &amp;nbsp;Then in this code we add one more operation to set the weight on g1 and add the update operation to the list ops:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; // set up the update to the node itself&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; g1.setWeight(mean / neighbors.size());&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ops.add(g1.updateOp(root));&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Times;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt;Finally, &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;multi()&lt;/span&gt; is called to actually do all of the version checks and updates that we have collected,&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; zk.multi(ops);&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Times;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt;The essential point here is how the list of operations was collected a bit at a time and then executed all at once. &amp;nbsp;The versions of the nodes that were inputs into the operation were collected in the form of check operations. &amp;nbsp;When those check operations are executed along with the update of &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;g1&lt;/span&gt;, we guarantee that &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;g1&lt;/span&gt;'s new weight will accurately reflect the average of its neighbors and if somebody changes the value of any of the neighbors in between the time that we read the value and the time we actually do the update, we will run the update again.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Similarly, and closer to the idea of maintaining referential integrity, the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;ZGraph.connect()&lt;/span&gt; collects update operations for the two nodes being connected and executes both updates together to avoid the possibility that anyone would see a situation where one node connects to a second but the second doesn't have the reverse connection. &amp;nbsp;This code does the job,&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Node g1 = Node.readNode(zk, root, id1);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Node g2 = Node.readNode(zk, root, id2);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; g1.connectTo(id2);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; g2.connectFrom(id1);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; zk.multi(Arrays.asList(g1.updateOp(root), g2.updateOp(root)), results);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="font-family: Times; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;Again, the closure passing style makes this operation very easy.&lt;/div&gt;&lt;div style="font-family: Times; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt;One additional point to be noted is that having each &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Node&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt; return closures for updates that can be executed in any context the caller would like has the effect of removing all direct Zookeeper operations from the &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Node&lt;/span&gt; other than for reading or creating a node. &amp;nbsp;It also removes all references to serialization of a &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Node&lt;/span&gt; from any outside code. &amp;nbsp;This simplifies both the caller and the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Node&lt;/span&gt; because the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Node&lt;/span&gt; doesn't have to implement all plausible combinations of multiple updates and the caller doesn't have to worry about any aspect of serialization and can focus on the appropriate task of arranging the transactional semantics for updates.&lt;/div&gt;&lt;div style="font-family: Times; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: Times; font-size: medium; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Helvetica Neue', Arial, Helvetica, sans-serif; font-size: large;"&gt;Credit Where Credit is Due&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: Times; font-size: medium; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Of course, this new Zookeeper capability wouldn't be possible if the Apache Zookeeper project team hadn't built Zookeeper in a very flexible way in the first place. &amp;nbsp;Kudos to Ben Reed and Mahadev and the other early committers on the project.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Also, the actual &lt;a href="https://issues.apache.org/jira/browse/ZOOKEEPER-965"&gt;Zookeeper-965&lt;/a&gt; project would have stalled out if&amp;nbsp;Marshall McMullen and&amp;nbsp;Camille Fournier hadn't stepped in with a bit of extra momentum. &amp;nbsp;I had completed the wire protocols and all of the client side work, but Marshall did all of server side work and Camille provided really excellent advice about how the changes should be done.&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The final credit goes to MapR Technologies for supporting open source by allowing capabilities like multi to be contributed back.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3619541449336287245?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3619541449336287245/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3619541449336287245' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3619541449336287245'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3619541449336287245'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html' title='A Tour of the Multi-update For Zookeeper'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-Zfjultl0Hxs/TeRllzvRoxI/AAAAAAAAAHQ/3aYjZhX1P-8/s72-c/hyper-cube.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-5156214065309491100</id><published>2011-05-29T11:51:00.000-07:00</published><updated>2011-05-31T15:30:35.804-07:00</updated><title type='text'>Online algorithms for boxcar-ish moving averages</title><content type='html'>One problem with exponentially weighted moving averages is that the weight that older samples decays sharply even for very recent samples.  &lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-hL8j1kS5toU/TY0kbm7e8RI/AAAAAAAAAG4/R_8_VdbTaT8/s1600/graph1.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-hL8j1kS5toU/TY0kbm7e8RI/AAAAAAAAAG4/R_8_VdbTaT8/s320/graph1.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;The impulse response of such an average shows this clearly.  In the graph to the right, the red line is the impulse response of the exponential weighted average is shown by the red line.  The impulse response of a different kind of moving average derived by John Canny in the early 80's is shown by the black line.&lt;br /&gt;&lt;br /&gt;The Canny filter puts higher weight on the events in the recent past which makes it preferable when you don't want to forget things right away, but do want to forget them before long and also want an on-line algorithm.&lt;br /&gt;&lt;br /&gt;The cool thing about the Canny filter is that it is only twice as much work as a simple exponential moving average.  The idea is that if you take the difference between two exponentially weighted averages with different time constants, you can engineer things to give you an impulse response of the sort that you would like.  &lt;br /&gt;&lt;br /&gt;The formula for such a difference looks like this&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;w(x) &amp;amp;=&amp;amp; k_1 e^{-x/\alpha} - k_2 e^{-x/\beta}&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;Here $k_1$ and $k_2$ scale the two component filters in magnitude and $\alpha$ and $\beta$ are the time constants for the filters.&lt;br /&gt;&lt;br /&gt;It is nice to set the magnitude of the filter at delay $0$ to be exactly 1.  We can use this to get a value for $k_2$ in terms of $k_1$&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;w(0) &amp;amp;=&amp;amp; k_1 - k_2 = 1 \\&lt;br /&gt;k_2 &amp;amp;=&amp;amp; k_1 -1 &lt;br /&gt;\end{eqnarray*}&lt;br /&gt;Similarly, we can constrain the slope of the impulse response to be $0$ at delay $0$.  This gives us $\beta$ in terms $\alpha$&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;w'(0) &amp;amp;=&amp;amp; {k_1 \over \alpha} - {k_2 \over \beta} = 0\\&lt;br /&gt;{k_1 \over \alpha} &amp;amp;=&amp;amp; {k_1-1 \over \beta} \\&lt;br /&gt;\beta &amp;amp;=&amp;amp; \alpha \frac{ k_1-1} { k_1}&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;The final result is this impulse response&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;w(x) = k \exp \left(-{x \over \alpha}\right)-(k-1) \exp\left(-{k x\over \alpha (k-1)}\right)&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;We can do a bit better if we note that as $k \to \infty$, the shape of the impulse quickly converges to a constant shape with mean of $w(x) \to \frac{3a}{2}$ and a total volume of $2a$.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-5156214065309491100?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/5156214065309491100/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=5156214065309491100' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5156214065309491100'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5156214065309491100'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/05/online-algorithms-for-boxcar-ish-moving.html' title='Online algorithms for boxcar-ish moving averages'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-hL8j1kS5toU/TY0kbm7e8RI/AAAAAAAAAG4/R_8_VdbTaT8/s72-c/graph1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-7338300417369656049</id><published>2011-03-24T12:24:00.000-07:00</published><updated>2011-04-08T09:11:41.411-07:00</updated><title type='text'>Exponentially weighted averaging for rates</title><content type='html'>In a previous post, I talked about how to produce exponentially weighted averages of time-embedded values. &amp;nbsp;This is fine for cases where you really want averages, but it doesn't work for rates. &amp;nbsp;Happily, the technique I presented for simple averages can be easily extended to work for rates as well.&lt;br /&gt;&lt;br /&gt;To recap a bit, an on-line algorithm for computing a simple average works by accumulating the number of observations and the sum of the observations. There is no mention of time. &amp;nbsp;For each observation $(x, t)$ we update our state $(s, w)$ according to&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;s_{n} &amp;amp;=&amp;amp; s_{n-1} + x \\&lt;br /&gt;w_{n} &amp;amp;=&amp;amp; w_{n-1} + 1&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;The exponentially weighted average approach that I mentioned before uses the time of each observation to discount the previous state based on the time of the previous observation. &amp;nbsp;Thus, the state $(s,w,t)$ is updated according to&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;\pi &amp;=&amp; e^{-(t-t_{n-1})/\alpha} \\&lt;br /&gt;s_{n} &amp;=&amp; \pi s_{n-1} + x \\&lt;br /&gt;w_{n} &amp;=&amp; \pi w_{n-1} + 1 \\&lt;br /&gt;t_{n} &amp;=&amp; t&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;If we were to compute simple rates without discounting, then instead of interpreting $w$ as a count, we would interpret it as a time interval.  Thus we would update state $(s,w,t)$ according to with an observation $(x, t)$ according to &lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;s_{n} &amp;=&amp; s_{n-1} + x \\&lt;br /&gt;w_{n} &amp;=&amp; w_{n-1} + (t - t_n) \\&lt;br /&gt;t_n &amp;=&amp; t&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;By analogy with the discounted version of the simple average, we can produce an exponentially weighted rate average by updating the state according to this&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;\pi &amp;=&amp; e^{-(t-t_{n-1})/\alpha} \\&lt;br /&gt;s_{n} &amp;=&amp; \pi s_{n-1} + x \\&lt;br /&gt;w_{n} &amp;=&amp; \pi w_{n-1} + (t-t_n) \\&lt;br /&gt;t_{n} &amp;=&amp; t&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;This approach has a defect in that each data pont should really be considered to be a report of events spread out in an uncertain way over the time since the last report.  As such, the interval and the events should probably both be discounted as if they had been reported uniformly over the period in question.  In practice, this correction does not matter much since the averaging time constant $\alpha$ is typically large compared to the average reporting interval.  Another consideration comes up when multiple sources are reporting concurrently.  If we want to do proper discounting of each observed number, then we have to keep track of the time since last report for each of the sources.  Since this probably won't matter and since it considerably complicates the implementation, I would rather not do it.&lt;br /&gt;&lt;br /&gt;See https://issues.apache.org/jira/browse/MAHOUT-634 for code.  This will be in Mahout before long.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-7338300417369656049?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/7338300417369656049/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=7338300417369656049' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7338300417369656049'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7338300417369656049'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/03/exponentially-weighted-averaging-for.html' title='Exponentially weighted averaging for rates'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3622566315859707542</id><published>2011-03-24T10:25:00.000-07:00</published><updated>2011-03-24T10:25:19.507-07:00</updated><title type='text'>Update on exponential time-embedded averages</title><content type='html'>I will be adding this code to Mahout shortly.&lt;br /&gt;&lt;br /&gt;See&amp;nbsp;https://issues.apache.org/jira/browse/MAHOUT-634 for status.&lt;br /&gt;&lt;br /&gt;Also, if you are measuring rates, then it is common for rates to be reported from multiple sources independently. &amp;nbsp;Such an average can be computed pretty easily using this same framework if the sources report often relative to the averaging time constant. &amp;nbsp;This simple implementation just attributes each reported count as if they occurred in the interval since the most recent report from any reporter. &amp;nbsp;If the time constant is relatively long, this can work out reasonably well as long as we are careful.&lt;br /&gt;&lt;br /&gt;If reporting intervals are longer, then the averaging is a bit trickier because we really would like to attribute the reported counts over the entire interval from the last report from the same source. &amp;nbsp;This means that we have to discount some of the counts because they are effectively kind of old.&lt;br /&gt;&lt;br /&gt;More details shortly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3622566315859707542?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3622566315859707542/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3622566315859707542' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3622566315859707542'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3622566315859707542'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/03/update-on-exponential-time-embedded.html' title='Update on exponential time-embedded averages'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-8426891890885835065</id><published>2011-03-22T17:15:00.000-07:00</published><updated>2011-03-22T19:23:34.926-07:00</updated><title type='text'>Exponential weighted averages with irregular sampling</title><content type='html'>Ted Yu on the hbase list wants to compute an exponential weighted average for the number of queries per second or other statistics that characterize the performance or state of an hbase system.&lt;br /&gt;&lt;br /&gt;The trick is that the measurements are only available at irregular intervals.  If they were sampled regularly, then the standard mixing trick would work:&lt;br /&gt;\[&lt;br /&gt;m_{n+1} = \mu m_n + (1-\mu) x_{n+1}&lt;br /&gt;\]&lt;br /&gt;where $m$ is our current estimate of the mean, $x_n$ is the $n$-th sample and $\mu$ determines how much history to use.&lt;br /&gt;&lt;br /&gt;With unequal sample times, things become a bit more complicated.  If we get lots of measurements all at once, we want to give them nearly equal weight but if we have a long gap, we want to weight the very old samples much less.&lt;br /&gt;&lt;br /&gt;In fact, we want to weight old samples according to how old they are with exponentially decreasing weight.  If we sample values $\left \lbrace x_1 \ldots x_n \right \rbrace$ at times $t_1 \ldots t_n$ then we want the weighted mean defined by&lt;br /&gt;\[&lt;br /&gt;m_n = {\sum_{i=1}^n x_i e^{-(t_n - t_i)/\alpha} \over \sum_{i=1}^n e^{-(t_n - t_i)/\alpha} }&lt;br /&gt;\]&lt;br /&gt;Here $\alpha$ plays the same role as $\mu$ did before, but on a different scale.  If the evenly sampled data comes at time intervals $\Delta t$ then $\mu = e^{\Delta t / \alpha}$.&lt;br /&gt;&lt;br /&gt;Happily, there is a very simple recurrence relationship that allows us to keep only two intermediate values while computing the value of $m_1 \ldots m_n$ in an entirely on-line fashion as the $x_i$ values arrive.&lt;br /&gt;&lt;br /&gt;To see this, define&lt;br /&gt;&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;\pi_n &amp;amp;=&amp;amp; e^{(t_{n+1}-t_n)/\alpha} \\&lt;br /&gt;w_{n+1} &amp;amp;=&amp;amp;&lt;br /&gt;\sum_{i=1}^{n+1} e^{-(t_{n+1} - t_i)/\alpha} =&lt;br /&gt;1+e^{-(t_{n+1}-t_n)/\alpha} \sum_{i=1}^{n} e^{-(t_{n} - t_i)/\alpha} \\&lt;br /&gt;&amp;amp; =&amp;amp; 1 + \pi w_n\\&lt;br /&gt;s_{n+1} &amp;amp;=&amp;amp;&lt;br /&gt;\sum_{i=1}^{n+1} x_i e^{-(t_{n+1} - t_i)/\alpha} =&lt;br /&gt;x_{n+1}+e^{-(t_{n+1}-t_n)/\alpha} \sum_{i=1}^{n} x_i e^{-(t_{n} - t_i)/\alpha} \\&lt;br /&gt;&amp;amp;=&amp;amp; x_{n+1} + \pi_n s_n&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Then note that&lt;br /&gt;\[&lt;br /&gt;m_{n+1} = {s_{n+1} \over w_{n+1}}&lt;br /&gt;\]&lt;br /&gt;&lt;br /&gt;This leads naturally to a procedure that has state consisting of $t, w, m$ which are updated with using new values of $t_n, x_n$ according to&lt;br /&gt;\begin{eqnarray*}&lt;br /&gt;\pi &amp;amp;=&amp;amp; e^{t_{n}-t} \\&lt;br /&gt;w &amp;amp;=&amp;amp; 1 + \pi w \\&lt;br /&gt;s &amp;amp;=&amp;amp; x_n + \pi s \\&lt;br /&gt;m &amp;amp;=&amp;amp; {s \over w} \\&lt;br /&gt;t &amp;amp;=&amp;amp; t_{n}&lt;br /&gt;\end{eqnarray*}&lt;br /&gt;Isn't that a kick!&lt;br /&gt;&lt;br /&gt;To do this right, however, we need a test.  Here are some data vectors computed for $\alpha=5$:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;x &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pi &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;w &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;s &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;m&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;1 11.35718 &amp;nbsp;1.5992071 1.0000000 1.000000 &amp;nbsp;1.5992071 &amp;nbsp;1.5992071&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;2 21.54637 -1.3577032 0.1303100 1.130310 -1.1493105 -1.0168100&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;3 28.91061 -0.3405638 0.2292718 1.259148 -0.6040683 -0.4797436&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;4 33.03586 &amp;nbsp;0.7048632 0.4382129 1.551775 &amp;nbsp;0.4401527 &amp;nbsp;0.2836447&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;5 39.57767 &amp;nbsp;0.3020558 0.2702621 1.419386 &amp;nbsp;0.4210124 &amp;nbsp;0.2966159&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;Writing a proper unit test is an exercise left to the reader. (but I will be happy to help)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-8426891890885835065?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/8426891890885835065/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=8426891890885835065' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8426891890885835065'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8426891890885835065'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2011/03/exponential-weighted-averages-with.html' title='Exponential weighted averages with irregular sampling'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3244692402363960102</id><published>2010-11-04T19:59:00.000-07:00</published><updated>2010-11-04T19:59:55.430-07:00</updated><title type='text'>Recent lecture on Mahout for SDForum</title><content type='html'>I gave a lecture last night on recent additions to Mahout.&lt;br /&gt;&lt;br /&gt;The slides are here:&amp;nbsp;&lt;a href="http://www.slideshare.net/tdunning/sdforum-11042010"&gt;http://www.slideshare.net/tdunning/sdforum-11042010&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I can amplify this post with answers to any questions that anybody puts in the comments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3244692402363960102?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3244692402363960102/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3244692402363960102' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3244692402363960102'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3244692402363960102'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/11/recent-lecture-on-mahout-for-sdforum.html' title='Recent lecture on Mahout for SDForum'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-4918256684464438324</id><published>2010-10-31T12:23:00.000-07:00</published><updated>2010-10-31T12:23:29.297-07:00</updated><title type='text'>Mahout 0.4 released!</title><content type='html'>Go to the &lt;a href="http://mahout.apache.org/"&gt;Apache Mahout&lt;/a&gt; site for more info. &amp;nbsp;Here is the official announcement:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;We are pleased to announce release 0.4 of Mahout. Virtually every corner of&amp;nbsp;the project has changed, and significantly, since 0.3. Developers are&amp;nbsp;invited to use and depend on version 0.4 even as yet more change is to be&amp;nbsp;expected before the next release. Highlights include:&lt;/blockquote&gt;&lt;blockquote&gt;- Model refactoring and CLI changes to improve integration and&amp;nbsp;consistency&lt;/blockquote&gt;&lt;blockquote&gt;- New ClusterEvaluator and CDbwClusterEvaluator offer new ways to&amp;nbsp;evaluate clustering effectiveness&lt;/blockquote&gt;&lt;blockquote&gt;- New Spectral Clustering and MinHash Clustering (still experimental)&lt;/blockquote&gt;&lt;blockquote&gt;- New VectorModelClassifier allows any set of clusters to be used for&amp;nbsp;classification&lt;/blockquote&gt;&lt;blockquote&gt;- Map/Reduce job to compute the pairwise similarities of the rows of a&amp;nbsp;matrix using a customizable similarity measure&lt;/blockquote&gt;&lt;blockquote&gt;- Map/Reduce job to compute the item-item-similarities for item-based&amp;nbsp;collaborative filtering&lt;/blockquote&gt;&lt;blockquote&gt;- RecommenderJob has been evolved to a fully distributed item-based&amp;nbsp;recommender&lt;/blockquote&gt;&lt;blockquote&gt;- Distributed Lanczos SVD implementation&lt;/blockquote&gt;&lt;blockquote&gt;- More support for distributed operations on very large matrices&lt;/blockquote&gt;&lt;blockquote&gt;- Easier access to Mahout operations via the command line&lt;/blockquote&gt;&lt;blockquote&gt;- New HMM based sequence classification from GSoC (currently as&amp;nbsp;sequential version only and still experimental)&lt;/blockquote&gt;&lt;blockquote&gt;- Sequential logistic regression training framework&lt;/blockquote&gt;&lt;blockquote&gt;- New SGD classifier&lt;/blockquote&gt;&lt;blockquote&gt;- Experimental new type of NB classifier, and feature reduction options&amp;nbsp;for existing one&lt;/blockquote&gt;&lt;blockquote&gt;- New vector encoding framework for high speed vectorization without a&amp;nbsp;pre-built dictionary&lt;/blockquote&gt;&lt;blockquote&gt;- Additional elements of supervised model evaluation framework&lt;/blockquote&gt;&lt;blockquote&gt;- Promoted several pieces of old Colt framework to tested status (QR&amp;nbsp;decomposition, in particular)&lt;/blockquote&gt;&lt;blockquote&gt;- Can now save random forests and use it to classify new data&lt;/blockquote&gt;&lt;blockquote&gt;- Many, many small fixes, improvements, refactorings and cleanup&lt;/blockquote&gt;&lt;blockquote&gt;Details on what's included can be found in the release&amp;nbsp;&lt;a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&amp;amp;styleName=Html&amp;amp;version=12314281"&gt;notes&lt;/a&gt;.&lt;/blockquote&gt;&lt;blockquote&gt;Downloads are available from the &lt;a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout/"&gt;Apache&amp;nbsp;Mirrors&lt;/a&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-4918256684464438324?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/4918256684464438324/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=4918256684464438324' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4918256684464438324'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4918256684464438324'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/10/mahout-04-released.html' title='Mahout 0.4 released!'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-1426893485863638501</id><published>2010-10-25T11:53:00.000-07:00</published><updated>2010-10-25T11:53:31.477-07:00</updated><title type='text'>New Mahout release coming</title><content type='html'>The vote has started for the 0.4 Mahout release. &amp;nbsp;Lots of new stuff, but the part that I am excited about is a fairly comprehensive implementation of logistic regression suitable for large scale training and high speed classification, but there is a whole lot more.&lt;br /&gt;&lt;br /&gt;With the 0.4 release, Mahout is moving along strongly towards the fabled 1.0 release. &amp;nbsp;At that point, we will start paying lots of attention to backwards compatibility. &amp;nbsp;That will be good, but the current wild and wooly policy is pretty handy if you have something in mind that Mahout really, really needs because we can get new things in pretty readily right now.&lt;br /&gt;&lt;br /&gt;See&amp;nbsp;http://mahout.apache.org/ for the release when it arrives and watch my twitter feed for an announcement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-1426893485863638501?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/1426893485863638501/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=1426893485863638501' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1426893485863638501'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1426893485863638501'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/10/new-mahout-release-coming.html' title='New Mahout release coming'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3878608003744796631</id><published>2010-10-13T07:09:00.000-07:00</published><updated>2010-10-13T10:16:05.553-07:00</updated><title type='text'>Why is the sum of two uniform randoms not uniform?</title><content type='html'>Lance Norkog asks on the Mahout mailing list why adding two uniformly distributed random variables gives a pyramidal distributed value.  I would normally answer on the mailing list, but here I can use lovely math notation.  As I mentioned on-list, this is a very basic result that is related to the law of large numbers.&lt;br /&gt;&lt;br /&gt;If we were to draw a picture of the joint distribution of these variables \(x\) and \(y\), we would get something that is 1 inside the \([0,1] \times [0,1]\) square and 0 outside that region.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_c4sz5uEKsbI/TLXaSGfWt6I/AAAAAAAAAFk/96oUNz0i1OI/s1600/sum-uniforms.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_c4sz5uEKsbI/TLXaSGfWt6I/AAAAAAAAAFk/96oUNz0i1OI/s1600/sum-uniforms.png" /&gt;&lt;/a&gt;For a given value \(\alpha\) of the sum \(x + y\), there is a diagonal line segment where \(x+y=\alpha\) and \(x\) and \(y\) are in the square.  Where \(z \le 0\) or \(z \ge 2\) that intersection vanishes and for \(0 \lt z \lt 2\), that intersection varies in length.  The probability&amp;nbsp;of the sum having some particular value z is proportional to the length of that intersection.  As you can imagine, the intersection&amp;nbsp;varies in size linearly and it reaches a maximum where z = 1.&lt;br /&gt;&lt;br /&gt;For the sum of three random variables, the situation is more complex to reason about geometrically because we need to worry about the intersection of a plane and a cube. &amp;nbsp;For more variables, the geometry is not worth the trouble.&lt;br /&gt;&lt;br /&gt;If we tackle the problem a bit more rigorously, then the easiest way to approach the problem is to compute the cumulative distribution of various values of sums.  That leads to a convolution integral over the density functions involved.  Since the densities are all 1, the integration limits are the key to the value and those limits have to broken down into cases.  Actually doing these integrals is a pretty rare activity since the limit is approximated so well after just a few iterations.&lt;br /&gt;&lt;br /&gt;Just how quickly that convergence happens is can be seen by looking at the empirical distribution of the sum of three uniform deviates. &amp;nbsp;I used something very like this R code to produce the graph below:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&lt;b&gt;&amp;nbsp; hist(runif(100000)+runif(100000)+runif(100000),&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;breaks=50, prob=T)&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;lines(seq(-1,4,by=0.01), dnorm(seq(-1,4,by=0.01),&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;b&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;1.5, sqrt(1/4)))&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;In this graph, the red curve is the normal distribution with the same mean and standard deviation. &amp;nbsp;As you can see, the peak is a tiny bit too high and the tails are just a &lt;i&gt;skoshi&lt;/i&gt;&amp;nbsp;too long for the normal to be a good description of the samples of the sum. &amp;nbsp;&amp;nbsp;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_c4sz5uEKsbI/TLXbGbqVHiI/AAAAAAAAAFo/h3Ono2hqh5Q/s1600/sum_3.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="284" src="http://3.bp.blogspot.com/_c4sz5uEKsbI/TLXbGbqVHiI/AAAAAAAAAFo/h3Ono2hqh5Q/s320/sum_3.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;This is, however, just the sum of three random values. &amp;nbsp;If you sum more values, the convergence to the normal distribution is very strong and by the time you are adding six uniform random values together, the difference between the distributions is no longer visible in a graph like this and can only be detected numerically using lots of data and clever things like a Kolmogorov-Smirnov test.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;The moral here is that there isn't much way to avoid this regression to the normal distribution and distorting the data to avoid it is probably pointless.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But if you are like me, being just a little more normal always made it easier to get along.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3878608003744796631?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3878608003744796631/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3878608003744796631' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3878608003744796631'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3878608003744796631'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/10/why-is-sum-of-two-uniform-randoms-not.html' title='Why is the sum of two uniform randoms not uniform?'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_c4sz5uEKsbI/TLXaSGfWt6I/AAAAAAAAAFk/96oUNz0i1OI/s72-c/sum-uniforms.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6953183628250335073</id><published>2010-08-11T11:48:00.000-07:00</published><updated>2010-08-11T11:48:03.848-07:00</updated><title type='text'>Sadder things</title><content type='html'>I spent this last weekend holding a hand that grew cold in the early hours of Sunday morning. &lt;br /&gt;&lt;br /&gt;That hand helped me through much of my life. &amp;nbsp;No &lt;a href="http://www.legacy.com/obituaries/lvrj/obituary.aspx?n=hal-dunning&amp;amp;pid=144607613"&gt;longer&lt;/a&gt;. &amp;nbsp;At least not in the flesh.&lt;br /&gt;&lt;br /&gt;Nobody who reads this blog is likely to have known my father and given how little he talked about things he had done, few who knew him would know much of the many things he did. &amp;nbsp;He lived a long life and a full one. &amp;nbsp;Along the way he saw things few will ever see. &lt;br /&gt;&lt;br /&gt;In his prime, he was simply extraordinary. &amp;nbsp;He could see and he could hear better than anyone I have ever known. That could be torture, as it was the time when a cat walking in the next room woke him from a deep sleep but it was what let him fly the way he did. &amp;nbsp;And fly he did in planes large and small. &amp;nbsp;He checked out Gary Powers in the U-2, flew P-47's and P-38 in combat and flew with me in small aircraft. &amp;nbsp;We fished and walked and camped across the western US and we lived in many places. &lt;br /&gt;&lt;br /&gt;He didn't show off his mental abilities, but there too he could do things few others could match. &amp;nbsp;He passed a graduate reading exam in French without ever studying any romance language. &amp;nbsp;I saw him on several occasions understand spoken German also without having studied the language. &amp;nbsp;He spoke of the shape and rate of physical processes in ways that only somebody with innate ability in math could possibly do. &lt;br /&gt;&lt;br /&gt;These faculties declined in age as they must with all of us, but even thus dimmed his candle burned bright.&lt;br /&gt;&lt;br /&gt;But it did not last. &amp;nbsp;I saw it sputter and fade. &amp;nbsp;Then, between one instant and the next, it was gone.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6953183628250335073?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6953183628250335073/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6953183628250335073' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6953183628250335073'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6953183628250335073'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/08/sadder-things.html' title='Sadder things'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-5164848947152987583</id><published>2010-08-04T11:29:00.000-07:00</published><updated>2010-08-04T11:29:11.181-07:00</updated><title type='text'>Word Count using Plume</title><content type='html'>Plume is working for toy programs! &lt;br /&gt;&lt;br /&gt;You can get the source code at&amp;nbsp;&lt;a href="http://github.com/tdunning/Plume"&gt;http://github.com/tdunning/Plume&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here is a quick description of how to code up the perennial map-reduce demo program for counting words. &amp;nbsp;The idea is that we have lines of text that we have to tokenize and then count the words. &amp;nbsp;This example is to be found in the class WordCountTest in Plume.&lt;br /&gt;&lt;br /&gt;So we start with&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;PCollection&lt;string&gt; lines&lt;/string&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; for input. &amp;nbsp;For each line, we split the line&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;into words and emit them as a separate record:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;PCollection&lt;string&gt; words = lines&lt;/string&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;.map(new DoFn&lt;string, string=""&gt;() {&lt;/string,&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;@Override&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;public void process(String x, EmitFn&lt;string&gt; emitter) {&lt;/string&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;for (String word : onNonWordChar.split(x)) {&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;emitter.emit(word);&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;}, collectionOf(strings()));&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Then we emit each word as a key for a PTable with a count of 1. &amp;nbsp;This is just the same as most word-count implementations except that we have separated the tokenization from the emission of the original counts. &amp;nbsp;We could have put them together into a single map operation, but the optimizer will do that for us (when it exists) so keeping the functions modular is probably better.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;PTable&lt;string, integer=""&gt; wc = words&lt;/string,&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;.map(new DoFn&lt;string, pair=""&gt;&lt;string, integer=""&gt;&amp;gt;() {&lt;/string,&gt;&lt;/string,&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;@Override&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;public void process(String x,&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; EmitFn&lt;pair&gt;&lt;string, integer=""&gt;&amp;gt; emitter) {&lt;/string,&gt;&lt;/pair&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; emitter.emit(Pair.create(x, 1));&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;}, tableOf(strings(), integers()))&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Then we group by word&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;.groupByKey()&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;And do the counting. &amp;nbsp;Note how we don't have to worry about the details of using a combiner or a reducer.&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;.combine(new CombinerFn&lt;integer&gt;() {&lt;/integer&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; @Override&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; public Integer combine(Iterable&lt;integer&gt; counts) {&lt;/integer&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; int sum = 0;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; for (Integer k : counts) {&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sum += k;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; return sum;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; }&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; });&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In all, it takes 27 lines to implement a slightly more general word-count than the one in the Hadoop tutorial. &amp;nbsp;If we were compare apples to apples, this could would probably be a few lines shorter. &amp;nbsp;The original word-count demo was 210 lines to do the same thing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-5164848947152987583?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/5164848947152987583/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=5164848947152987583' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5164848947152987583'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5164848947152987583'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/08/word-count-using-plume.html' title='Word Count using Plume'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-7812847383456021674</id><published>2010-07-30T13:41:00.000-07:00</published><updated>2010-07-30T14:06:02.228-07:00</updated><title type='text'>The new grool</title><content type='html'>A few years ago, I built a prototype system I called grool in Groovy to simplify map-reduce programming. &amp;nbsp;My goal was to use Groovy to handle control flow and define operations but to execute large programs using Hadoop.&lt;br /&gt;&lt;br /&gt;Grool foundered on the difficulty in transporting closures over the network. &amp;nbsp;I used a clever trick to avoid the problem, but it depended on the ability to execute a script multiple times with different meanings each time. &amp;nbsp;That is a difficult concept to convey to users and the result was that grool really didn't work that well for ordinary folks to use.&lt;br /&gt;&lt;br /&gt;A bit later, the guys at Google developed FlumeJava which has many of the same goals and many of the same benefits as grool intended to provide. &amp;nbsp;In Java, however, transporting functional objects it paradoxically much simpler than in Groovy. &amp;nbsp;The difference is entirely because Java is statically compiled and thus state-less functional classes can be referred to by name in different JVM's with access to the same jar.&lt;br /&gt;&lt;br /&gt;Flume also provide an optimizer which is made possible because Flume uses lazy evaluation. &amp;nbsp; This makes FlumeJava programs nearly as efficient as well-optimized hand-written programs. &amp;nbsp;Other systems like Pig and Cascading are able to re-write their logical plan, but Pig especially has problems because it has no real access to a Turing complete language.&lt;br /&gt;&lt;br /&gt;In addition, Flume has some interesting choices in terms of API.&lt;br /&gt;&lt;br /&gt;All in all, Flume-like systems are definitely worth playing with. &amp;nbsp;In order to make that easier, I just implemented an eager, sequential version of an approximate clone of FlumeJava that I call Plume. &amp;nbsp;The name is a presumptuous one, anticipating that if all goes well, we would be able to build a community and bring Plume into Apache where it would be Apache Plume. &amp;nbsp;There it would provide some redress for the clear fauna bias in software names.&lt;br /&gt;&lt;br /&gt;Check it out at&amp;nbsp;&lt;a href="http://wiki.github.com/tdunning/Plume/"&gt;http://wiki.github.com/tdunning/Plume/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-7812847383456021674?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/7812847383456021674/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=7812847383456021674' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7812847383456021674'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7812847383456021674'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/07/new-grool.html' title='The new grool'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-4281078256674860328</id><published>2010-04-22T14:47:00.000-07:00</published><updated>2010-04-22T14:47:11.507-07:00</updated><title type='text'>Hadoop user group AKA Mahout Users Anonymous</title><content type='html'>At the Hadoop User Group meeting last evening it was quite the Mahout love fest.&amp;nbsp; First the Yahoo team described their spam fighting efforts which apparently are partially anchored by frequent item set mining using Mahout (thanks to Robin Anil's hard work as a GSoC student).&amp;nbsp; Then later Ken Krugler demonstrated some of what you can do with web-mining and Mahout.&lt;br /&gt;&lt;br /&gt;These are just two of a growing number of real-world uses of Mahout which is really beginning to grow up.&amp;nbsp; Related to that growing up, the Apache board just decided to make Mahout a top level project.&amp;nbsp; That will take a little while to make real, but the first steps are already happening.&lt;br /&gt;&lt;br /&gt;You can find out more about how Mahout is being used by perusing the &lt;a href="https://cwiki.apache.org/MAHOUT/bookstutorialstalks.html"&gt;Mahout wiki&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.blogger.com/%20http://developer.yahoo.net/blogs/hadoop/2010/04/hundreds_of_hadoop_fans_at_the.html?utm_source=feedburner&amp;amp;utm_medium=feed&amp;amp;utm_campaign=Feed%3A+YDNHadoop+%28Hadoop+and+Distributed+Computing+at+Yahoo%21%29"&gt;Slide decks available here&lt;/a&gt; for the HUG presentations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-4281078256674860328?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/4281078256674860328/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=4281078256674860328' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4281078256674860328'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4281078256674860328'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/04/hadoop-user-group-aka-mahout-users.html' title='Hadoop user group AKA Mahout Users Anonymous'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-9003963256014357563</id><published>2010-04-20T12:51:00.000-07:00</published><updated>2010-04-20T12:51:46.895-07:00</updated><title type='text'>Interview on Mahout available</title><content type='html'>I just had an interchange with&lt;span style="font-family: inherit; font-size: small;"&gt; &lt;strong&gt;&lt;/strong&gt;&lt;strong&gt;&lt;a class="editorlink" href="http://www.infoq.com/author/Gilad-Manor"&gt;Gilad Manor&lt;/a&gt;&lt;/strong&gt; of&lt;/span&gt; InfoQ (as did Grant).&amp;nbsp; The results are to be found in the &lt;a href="http://www.infoq.com/news/2010/04/mahout-03"&gt;article on Mahout status&lt;/a&gt; that resulted.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-9003963256014357563?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/9003963256014357563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=9003963256014357563' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/9003963256014357563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/9003963256014357563'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/04/interview-on-mahout-available.html' title='Interview on Mahout available'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-7886897289819178424</id><published>2010-04-10T11:35:00.000-07:00</published><updated>2011-02-23T11:41:07.449-08:00</updated><title type='text'>Sampling a Dirichlet Distribution (revised)</title><content type='html'>Last year, I casually wrote a post about how to &lt;a href="http://tdunning.blogspot.com/2009/04/sampling-dirichlet-distributions.html"&gt;sample from a prior for Dirichlet distributions&lt;/a&gt;.&amp;nbsp; There were recently a few comments on this blog and it now occurs to me that there is a massive simplification that is possible for the problem.&lt;br /&gt;&lt;br /&gt;What I suggested back then was to sample the parameters of a Dirichlet distribution by sampling &lt;i&gt;from&lt;/i&gt; a Dirichlet distribution and then multiplying that by a magnitude sampled from an exponential distribution.&amp;nbsp; As it turns out, this is a special case of a much nicer general method.&lt;br /&gt;&lt;br /&gt;The new method is to sample each parameter independently from a gamma distribution&lt;br /&gt;\[\gamma_i \sim \mathrm{Gamma}(\alpha_i, \beta_i) \]&lt;br /&gt;This can be related to my previous method where we had the parameters expressed as a magnitude \(\alpha\) multiplied by a vector \(\vec m\) whose components sum to 1.  Expressed in terms of the new method&lt;br /&gt;\[\alpha = \sum_i \gamma_i \]&lt;br /&gt;and&lt;br /&gt;\[ m_i = \gamma_i / \alpha \]&lt;br /&gt;Moreover, if all of the \(\beta_i\) are equal, then \(\alpha\) is gamma distributed with shape parameter \(\sum_i \alpha_i\). If this sum is 1, then we have an exponential distribution for \(\alpha\) as in the original method.&lt;br /&gt;&lt;br /&gt;I am pretty sure that this formulation also makes MCMC sampling from the posterior distribution easier as well because the products inside the expression for the joint probability will get glued together in a propitious fashion.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-7886897289819178424?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/7886897289819178424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=7886897289819178424' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7886897289819178424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7886897289819178424'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/04/sampling-dirichlet-distribution-revised.html' title='Sampling a Dirichlet Distribution (revised)'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-1102755452320938312</id><published>2010-04-06T16:28:00.000-07:00</published><updated>2010-04-06T16:33:30.771-07:00</updated><title type='text'>Sampling Dirichlet Distributions (chapter 2-b)</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;In the previous post, I talked about how sampling in soft-max space can make Metropolis algorithms work better for Dirichlet distributions.  Here are some pictures that show why. These pictures are of three parameter Dirichlet distributions with the simplex projected into two dimensions.  Dirichlet distributions with different parameters can look like this:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_c4sz5uEKsbI/S7us8xNIbOI/AAAAAAAAADk/xYqE_KoQSMs/s1600/Picture+4.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="200" src="http://1.bp.blogspot.com/_c4sz5uEKsbI/S7us8xNIbOI/AAAAAAAAADk/xYqE_KoQSMs/s200/Picture+4.png" width="188" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;or this&lt;br /&gt;&lt;div class="separator" style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_c4sz5uEKsbI/S7utGOdQwfI/AAAAAAAAADs/0M-VmTWEFV8/s1600/Picture+5.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="200" src="http://2.bp.blogspot.com/_c4sz5uEKsbI/S7utGOdQwfI/AAAAAAAAADs/0M-VmTWEFV8/s200/Picture+5.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;or this&lt;br /&gt;&lt;div class="separator" style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_c4sz5uEKsbI/S7utUORP26I/AAAAAAAAAD0/own5TjnGp4o/s1600/Picture+6.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="200" src="http://1.bp.blogspot.com/_c4sz5uEKsbI/S7utUORP26I/AAAAAAAAAD0/own5TjnGp4o/s200/Picture+6.png" width="188" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;These were all generated using the standard algorithm for sampling from a Dirichlet distribution.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In R, that algorithm is this&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span style="font-size: x-small;"&gt;&lt;b&gt;rdirichlet = function(n, alpha) {&lt;br /&gt;&amp;nbsp; k = length(alpha)&lt;br /&gt;&amp;nbsp; r = matrix(0, nrow=n, ncol=k) &lt;br /&gt;&amp;nbsp; for (i in 1:k) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; r[,i] = rgamma(n, alpha[i], 1)&lt;br /&gt;&amp;nbsp; }&lt;br /&gt;&amp;nbsp; r = matrix(mapply(function(r, s) {return (r/s)}, r, rowSums(r)), ncol=k)&lt;br /&gt;&amp;nbsp; return (r)&lt;br /&gt;}&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;That is, first sample &lt;img src="http://latex.codecogs.com/gif.latex?%28y_1%20%5Cldots%20y_n%29" title="(y_1 \ldots y_n)" /&gt; using a gamma distribution&lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?y_i%20%5Csim%20%5Cmathrm%7BGamma%7D%28%5Calpha_i,%201%29" title="y_i \sim \mathrm{Gamma}(\alpha_i, 1)" /&gt;&lt;br /&gt;&lt;br /&gt;Then normalize to get the desired sample&lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%5Cpi_i%20=%20%5Cfrac%20%7By_i%7D%20%7B%5Csum_j%20y_j%7D" title="\pi_i = \frac {y_i} {\sum_j y_j}" /&gt;&lt;br /&gt;&lt;br /&gt;In contrast, here are 30,000 samples computed using a simple Metropolis algorithm to draw samples from a Dirichlet distribution with &lt;img src="http://latex.codecogs.com/gif.latex?%5Calpha%20=%20%281,1,1%29" title="\alpha = (1,1,1)" /&gt;.&amp;nbsp; This should give a completely uniform impression but there is noticeable thinning in the corners where the acceptance rate of the sampling process drops.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_c4sz5uEKsbI/S7u-DKmJOvI/AAAAAAAAAD8/IF9d_yLf4Qs/s1600/Picture+7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_c4sz5uEKsbI/S7u-DKmJOvI/AAAAAAAAAD8/IF9d_yLf4Qs/s320/Picture+7.png" /&gt;&lt;/a&gt;&lt;/div&gt;The thinning is only apparent.&amp;nbsp; In fact, what happens is that the samples in the corners are repeated, but that doesn't show up in this visual presentation.&amp;nbsp; Thus the algorithm is correct, but it exhibits very different mixing behavior depending on where you look.&lt;br /&gt;&lt;br /&gt;A particularly unfortunate aspect of this problem is that decreasing the size of the step distribution in order to increase the acceptance rate in the corners makes the sampler take much longer to traverse the region of interest and thus only partially helps the problem.&amp;nbsp; Much of the problem is that when we have reached a point near the edge of the simplex or especially in a corner, many candidate steps will be off the simplex entirely and will thus be rejected.&amp;nbsp; When many parameters are small, this is particularly a problem because steps that are not rejected due to being off the simplex are likely to be rejected because the probability density drops dramatically as we leave the edge of the simplex.&amp;nbsp; Both of these problems become much, much worse in higher dimensions.&lt;br /&gt;&lt;br /&gt;A much better alternative is to generate candidate points in soft-max space using a Gaussian step distribution.&amp;nbsp; This gives us a symmetric candidate distribution that takes small steps near the edges and corners of the simplex and larger steps in the middle of the range.&amp;nbsp; The sensitivity to the average step size (in soft-max space) is noticeably decreased and the chance of stepping out of the simplex is eliminated entirely.&lt;br /&gt;&lt;br /&gt;This small cleverness leads to the solution of the question posed in a comment some time ago by Andrei about how to sample from the posterior distribution where we observe counts sampled from a multinomial&lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%5Cvec%20k%20%5Csim%20%5Cmathrm%7BMulti%7D%28%5Cvec%20x%29" title="\vec k \sim \mathrm{Multi}(\vec x)" /&gt;&lt;br /&gt;&lt;br /&gt;where that multinomial has a prior distribution defined by&lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%5Cvec%20x%20=%20%5Calpha%20%5Cvec%20m" title="\vec x = \alpha \vec m" /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%5Calpha%20%5Csim%20%5Cmathrm%7BExp%7D%28%5Cbeta_0%29" title="\alpha \sim \mathrm{Exp}(\beta_0)" /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%5Cvec%20m%20%5Csim%20%5Cmathrm%7BDir%7D%28%5Cbeta_1%20%5Cldots%20%5Cbeta_n%29" title="\vec m \sim \mathrm{Dir}(\beta_1 \ldots \beta_n)" /&gt;&lt;br /&gt;&lt;br /&gt;More about that in my next post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-1102755452320938312?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/1102755452320938312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=1102755452320938312' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1102755452320938312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1102755452320938312'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/04/sampling-dirichlet-distributions_06.html' title='Sampling Dirichlet Distributions (chapter 2-b)'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_c4sz5uEKsbI/S7us8xNIbOI/AAAAAAAAADk/xYqE_KoQSMs/s72-c/Picture+4.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6545841732715669218</id><published>2010-04-06T13:52:00.000-07:00</published><updated>2010-04-06T13:52:50.791-07:00</updated><title type='text'>Sampling Dirichlet Distributions (chapter 2-a)</title><content type='html'>In &lt;a href="http://tdunning.blogspot.com/2009/04/sampling-dirichlet-distributions.html"&gt;a previous post on sampling from Dirichlet distributions&lt;/a&gt; that related to a &lt;a href="http://www.stat.columbia.edu/%7Ecook/movabletype/archives/2009/04/conjugate_prior.html"&gt;comment by Andrew Gelman&lt;/a&gt; on the same topic I talked a bit about how to sample the parameters of a Dirichlet (as opposed to sampling from a Dirichlet distribution with known parameters).&lt;br /&gt;&lt;br /&gt;In responding to some questions and comments on my post, I went back and reread Andrew's thoughts and think that they should be amplified a bit.&lt;br /&gt;&lt;br /&gt;Basically, what Andrew suggested is that when sampling from a Dirichlet, it is much easier to not sample from Dirichlet at all, but rather to sample from some unbounded distribution and then reduce that sample back to the unit simplex.  The transformation that Andrew suggested is the same as the so-called soft-max basis that David Mackay also advocates for several purposes.  This method is not well know, however, so it deserves a few more pixels.&lt;br /&gt;&lt;br /&gt;The idea is that to sample &lt;img src="http://latex.codecogs.com/gif.latex?%28%5Cpi_1%20%5Cldots%20%5Cpi_n%29" title="(\pi_1 \ldots \pi_n)" /&gt; from some distribution you instead sample&lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%28x_1%20%5Cldots%20x_n%29%20%5Csim%20%5Cmathrm%7BDir%7D%20%28%5Cbeta_1%20%5Cldots%20%5Cbeta_n%29" title="(x_1 \ldots x_n) \sim \mathrm{Dir} (\beta_1 \ldots \beta_n)" /&gt;&lt;br /&gt;&lt;br /&gt;and then reduce to the sample you want &lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%5Cpi_i%20=%20%5Cfrac%20%7Be%5E%7Bx_i%7D%7D%20%7B%5Csum_j%20e%5E%7Bx_j%7D%7D" title="\pi_i = \frac {e^{x_i}} {\sum_j e^{x_j}}" /&gt;&lt;br /&gt;&lt;br /&gt;Clearly this gives you values on the unit simplex.  What is not clear is that this distribution is anything that you want.  In fact, if your original sample is from a normal distribution&lt;br /&gt;&lt;br /&gt;&lt;img src="http://latex.codecogs.com/gif.latex?%28x_1%20%5Cldots%20x_n%29%20%5Csim%20%5Cmathcal%20N%20%28%5Cvec%20%5Cmu%20,%5CSigma%29" title="(x_1 \ldots x_n) \sim \mathcal N (\vec \mu ,\Sigma)" /&gt;&lt;br /&gt;&lt;br /&gt;then you can come pretty close to whatever desired Dirichlet distribution you like.  More importantly in practice, this idea of normally sampling &lt;img src="http://latex.codecogs.com/gif.latex?%28x_1%20%5Cldots%20x_n%29" title="(x_1 \ldots x_n)" /&gt; and then transforming gives a very nice prior serves as well as a real Dirichlet distribution in many applications.  &lt;br /&gt;&lt;br /&gt;Where this comes in wonderfully handy is in MCMC sampling where you don't really want to sample from the prior, but instead want to sample from the posterior.  It isn't all that hard to use the Dirichlet distribution directly in these cases, but the results are likely to surprise you.  For instance, if you take the trivial case of picking a uniformly distributed point on the simplex and then jumping around using Metropolis updates based on a Dirichlet distribution, you will definitely have the right distribution for your samples after just a bit of sampling.  But if you plot those points, they won't look at all right if you are sampling from a distribution that has any significant density next to the boundaries of the simplex.  What is happening is that the high probability of points near the boundaries is accounted for in the Metropolis algorithm by having a high probability of rejecting any jump if you are near the boundary.  Thus, the samples near the boundary are actually repeated many times leading to the visual impression of low density where there should be high density.&lt;br /&gt;&lt;br /&gt;On the other hand, if you were to do the Metropolis jumps in the soft-max basis, the problem is entirely avoided because there are no boundaries in soft-max space.  You can even use the soft-max basis for the sole purpose of generating candidate points and do all probability calculations in terms on the simplex.  &lt;br /&gt;&lt;br /&gt;In my next post, I will provide a concrete example and even include some pictures.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6545841732715669218?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6545841732715669218/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6545841732715669218' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6545841732715669218'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6545841732715669218'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/04/sampling-dirichlet-distributions.html' title='Sampling Dirichlet Distributions (chapter 2-a)'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2738962549043258981</id><published>2010-04-04T13:53:00.000-07:00</published><updated>2010-04-04T18:36:24.597-07:00</updated><title type='text'>Big Data and Big Problems</title><content type='html'>In talks recently, I have mentioned an ACM paper that showed just how bad random access to various storage devices can be.&amp;nbsp; About half the time I mention this paper, somebody asks for the reference which I never know off the top of my head.&amp;nbsp; So here it is&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.deepdyve.com/lp/association-for-computing-machinery/the-pathologies-of-big-data-PkgZDJLn9w"&gt;The Pathologies of Big Data&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Of course, the big deal is not that disks are faster when read sequentially.&amp;nbsp; We all know that and most people I talk to know that the problem is worse than you would think.&amp;nbsp; The news is that random access to memory is much, much worse than most people would imagine.&amp;nbsp; In the slightly contrived example in this paper, for instance, sequential reads from a hard disk deliver data faster than randomized reads from RAM.&amp;nbsp; That is truly horrifying.&lt;br /&gt;&lt;br /&gt;Even though you can pick holes in the example, it shows by mere existence that the problem of random access is much worse than almost anybody has actually internalized.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2738962549043258981?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2738962549043258981/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2738962549043258981' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2738962549043258981'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2738962549043258981'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/04/big-data-and-big-problems.html' title='Big Data and Big Problems'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2236870032873393079</id><published>2010-02-07T09:36:00.000-08:00</published><updated>2010-02-07T09:36:45.507-08:00</updated><title type='text'>Nice package of software an tutorials for corpus analysis</title><content type='html'>I was about to try to provide some help in using the &lt;img src="http://www.codecogs.com/eq.latex?G^2" /&gt; implementations that I use and give away for a graduate student trying to do some corpus analysis and I found this lovely site describing how to do &lt;a href="http://www.cogsci.uni-osnabrueck.de/%7Esevert/SIGIL/sigil_R/"&gt;corpus analysis using R&lt;/a&gt;.&amp;nbsp;&amp;nbsp;  &lt;a class="external" href="http://www.form.unitn.it/%7Ebaroni/" target="_blank"&gt;&lt;strong&gt;Marco Baroni&lt;/strong&gt;&lt;/a&gt; and &lt;a class="external" href="http://purl.org/stefan.evert/" target="_blank"&gt;&lt;strong&gt;Stefan Evert&lt;/strong&gt;&lt;/a&gt; have done a really lovely job there of describing how to do a large number of simple tasks.&amp;nbsp; Kudos to them!&lt;br /&gt;&lt;br /&gt;For not necessarily the most noble reasons, I was happy to see from their slide show on &lt;a href="http://www.cogsci.uni-osnabrueck.de/%7Esevert/SIGIL/sigil_R/materials/collocations_part_2.4up.pdf"&gt;computing association measures&lt;/a&gt; that &lt;img src="http://www.codecogs.com/eq.latex?G^2" /&gt; is still essentially on top as a measure of association after all of these years.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2236870032873393079?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2236870032873393079/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2236870032873393079' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2236870032873393079'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2236870032873393079'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2010/02/nice-package-of-software-tutorials-for.html' title='Nice package of software an tutorials for corpus analysis'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-243417651126623310</id><published>2009-12-08T18:11:00.000-08:00</published><updated>2009-12-08T18:11:20.957-08:00</updated><title type='text'>A call to action</title><content type='html'>The &lt;a href="http://www.guardian.co.uk/commentisfree/2009/dec/06/copenhagen-editorial"&gt;Guardian&lt;/a&gt; and numerous other newspapers around the world published a simultaneous editorial about climate change and warming.&lt;br /&gt;&lt;br /&gt;This call to action is urgent.  We should heed it.  This is both the greatest crisis and the greatest opportunity our species has ever faced.  Sadly, failure &lt;i&gt;is&lt;/i&gt; an option and many people are working to guarantee it by spreading rumors and lies.&lt;br /&gt;&lt;br /&gt;The rest of us have to succeed in spite of that.  For everyone's sake.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-243417651126623310?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/243417651126623310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=243417651126623310' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/243417651126623310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/243417651126623310'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/12/call-to-action.html' title='A call to action'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-4919804390061677460</id><published>2009-11-23T11:10:00.000-08:00</published><updated>2009-11-23T11:10:18.389-08:00</updated><title type='text'>How to smooth rate estimates</title><content type='html'>I mentioned in a previous posting that I recommend smoothing rate estimates using offsets on numerator (number of events) and denominator (time period).  While it is possible to build a principled prior distribution and do very detailed rate estimation based on this, I find that for "top-100" lists and for "hot-100" lists, it is more useful to adjust the prior with larger strategic goals in mind.&lt;br /&gt;&lt;br /&gt;In particular, what I like to do is set the ratio of the offsets to position an item with no data whatsoever (event count = 0, time = 0) so be somewhere in the top half of all items, but generally not in the top 10%.  This expresses a general optimism about new items that allows them to achieve high rankings with a very modest burst of enthusiasm from the audience and forces them to provide some proof that they are dogs. &lt;br /&gt;&lt;br /&gt;Once this ratio is set, it remains to set the actual magnitudes.  This is done by deciding how much many events that you want an item to have or how long you want it to languish before the data overcome the prior.  If you want the first 10 events to have equal weight to the prior, set the numerator offset to 10.  Alternately, if you want the first day of data to have equal weight to the prior, set the denominator offset to 1 day.&lt;br /&gt;&lt;br /&gt;Done.  Works.  Simple.&lt;br /&gt;&lt;br /&gt;I have done the much more complex effort of building detailed prior models and actually estimating rates in a completely principled fashion but I have found two things:&lt;br /&gt;&lt;br /&gt;a) business people have other agendas than pure estimation and they want a vote&lt;br /&gt;&lt;br /&gt;b) the early estimates are pretty unreliable until the data dominate the prior (duh!).  Thus, you may as well set up the prior to make (a) work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-4919804390061677460?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/4919804390061677460/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=4919804390061677460' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4919804390061677460'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/4919804390061677460'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/11/how-to-smooth-rate-estimates.html' title='How to smooth rate estimates'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6233801937863239767</id><published>2009-11-22T20:03:00.000-08:00</published><updated>2009-11-22T20:03:20.832-08:00</updated><title type='text'>Twitter and small counts</title><content type='html'>I just got a very nice email from Slideshare that says&lt;br /&gt;&lt;blockquote&gt;"Transactional Data Mining" is being tweeted more than any other document on SlideShare right now. So we've put it on the homepage of SlideShare.net (in the "Hot on Twitter" section).&lt;/blockquote&gt;&lt;br /&gt;That sounds sooo exciting.  People magazine, here I come!&lt;br /&gt;&lt;br /&gt;Of course, when you look into the matter, I really don't think that I am going to have to worry about dodging paparazzi any time soon.  What they mean by "tweeted more" seems to be 3 times in a day.  Sounds like they should have read my paper about surprise and coincidence!&lt;br /&gt;&lt;br /&gt;In fact, for applications like popularity ranking it is often really important to have a robust idea about what is hot and what is not.  There are two very common problems to solve, one is to find out which items are rapidly rising and the other is to put a reasonable rank onto things.  &lt;br /&gt;&lt;br /&gt;The problem of what is rising is surprisingly well well solved by ranking items by the log of the ratio of the number of hits/views/tweets in a recent period of time to the number of hits in a somewhat older (and typically longer) period of time.  Well, not quite the ratio.  In practice, it is better to offset both numerator and denominator by a constants which are well chosen by looking at the results, but which are actually well-founded in terms of MAP estimators for rates.  The reason that this works so well is that popularities are typically distributed in a roughly Zipfian way so taking log of the rate is linearly related to the log of the rank.  &lt;br /&gt;&lt;br /&gt;A large change in the log-rank of an item is a great indicator of dramatically rising popularity, but rank itself is a pretty fuzzy measurement with small counts (the guys at SlideShare need to hear this).  Log of the rate, however, is linearly related to the log of rank, so change in log-rank is proportional to change in log-rate.  Moreover, the offset trick when computing the rate ratio largely deals with the small count problems that you have when something goes from no hits to one hit in a time period.&lt;br /&gt;&lt;br /&gt;The second problem is computing what the rank of an item really is given bursty nasty data like hit counts.  The problem is that you only have a nice tight estimate of the hit rate for the most popular items.  Since you want recent rankings, you want to use a shorter period of time for your estimates so the problem of small counts is even worse.  Conceptually, a very principled way to do deal with the problem is to embrace the uncertainty and sample the rates associated with each item given the data you have and rank the items.  Do this a bunch of times and you have the posterior distribution of the ranks for each item.  Moreover, you also have a lookup table that tells you what any particular observed rate means relative to what rank the item might have.&lt;br /&gt;&lt;br /&gt;This sampling can be done in a few seconds in R.  I think it is cooler to add dithering do the ranked list of items so that each time you look at the list, you get another sample from this distribution.  This means that items that might plausibly be in the top 100 or 1000 actually get to be there some of the time and items that are definitely in the top few places are always where they should be.  It also adds dynamism to your top-n list which is always good on the web.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6233801937863239767?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6233801937863239767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6233801937863239767' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6233801937863239767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6233801937863239767'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/11/twitter-and-small-counts.html' title='Twitter and small counts'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2127878813049144481</id><published>2009-11-01T22:59:00.000-08:00</published><updated>2009-11-05T09:59:17.481-08:00</updated><title type='text'>Video of talk on transactional data</title><content type='html'>If you want to see more about some the techniques I have used in projects take a look at this talk I gave to the Bay Area ACM chapter.&lt;br /&gt;&lt;br /&gt;See "&lt;a href="http://fora.tv/2009/10/14/ACM_Data_Mining_SIG_Ted_Dunning"&gt;Transactional Data Mining&lt;/a&gt;"&lt;br /&gt;&lt;br /&gt;Slides are available at &lt;a href="http://www.slideshare.net/tdunning/transactional-data-mining"&gt;here&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2127878813049144481?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2127878813049144481/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2127878813049144481' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2127878813049144481'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2127878813049144481'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/11/video-of-talk-on-transactional-data.html' title='Video of talk on transactional data'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-8189704708619604694</id><published>2009-04-29T12:20:00.000-07:00</published><updated>2011-02-23T11:31:31.935-08:00</updated><title type='text'>Sampling Dirichlet Distributions</title><content type='html'>[I just realized that this post from last year was only half the story.&amp;nbsp; See this post about using the &lt;a href="http://tdunning.blogspot.com/2010/04/sampling-dirichlet-distribution-revised.html"&gt;gamma distribution directly&lt;/a&gt; to sample Dirchlet distributions]&lt;br /&gt;&lt;br /&gt;I just &lt;a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2009/04/conjugate_prior.html"&gt;commented on a post by Andrew Gelman&lt;/a&gt; about methods for sampling Dirichlet distributions.  Those comments were pretty non specific and deserve a bit of amplification.&lt;br /&gt;&lt;br /&gt;First off, a Dirichlet distribution is a distribution of real-valued tuples, &lt;br /&gt;\[(x_1 \ldots x_n) \sim \mathrm&amp;nbsp;{Dir}(\pi_1 \ldots \pi_n) \]&lt;br /&gt;such that \(x_i \ge 1\) and \(\sum_i x_i = 1\)&lt;br /&gt;&lt;br /&gt;The parameters \(\pi_i\) are all non-negative.&lt;br /&gt;&lt;br /&gt;The original question had to do with sampling the Dirichlet parameters, especially from a conjugate distribution.  The one and true answer in mathematical terms is that there is, indeed, a continuous distribution which is the conjugate of a Dirichlet.  In practical terms, however, that isn't the answer that you really want.&lt;br /&gt;&lt;br /&gt;A much more practical answer is that the Dirichlet can be sampled from a prior that is characterized by \(n+1\) non-negative real parameters using the following procedure&lt;br /&gt;\[\begin{aligned}&lt;br /&gt;\left(m_1 \ldots m_n \right) &amp;amp; \sim \mathrm {Dir} (\beta_1 \ldots \beta_n ) \\&lt;br /&gt;\alpha &amp;amp; \sim \exp (\beta_0) \\&lt;br /&gt;\pi_i &amp;amp; \sim \alpha m_i&lt;br /&gt;\end{aligned}&lt;br /&gt;\]&lt;br /&gt;Alternative distributions for \(\alpha\) include the gamma distribution and exponential normal, such as&lt;br /&gt;\[ \log \alpha \sim \mathcal N (0, 2)\]&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-8189704708619604694?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/8189704708619604694/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=8189704708619604694' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8189704708619604694'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8189704708619604694'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/04/sampling-dirichlet-distributions.html' title='Sampling Dirichlet Distributions'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6052878329757502091</id><published>2009-03-24T14:51:00.000-07:00</published><updated>2009-03-24T15:08:24.398-07:00</updated><title type='text'>What would really help map-reduce programs?</title><content type='html'>David Hall is touting a Scala package that he calls &lt;a href="http://scala-blogs.org/2008/09/scalable-language-and-scalable.html"&gt;SMR (Scala for Map Reduce)&lt;/a&gt;.  &lt;br /&gt;&lt;br /&gt;I built something like this quite some time ago using groovy.  I called the resulting Groovy map-reduce facade &lt;a href="http://tdunning.blogspot.com/2008/03/hello-world-for-map-reduce.html"&gt;grool&lt;/a&gt; because it was pretty thin.  The results were even more impressive for toy programs than what SMR achieved.  For instance, here is a fully hadoop compatible map-reduce version of word count:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  wc = mapReduce(&lt;br /&gt;  {key, line, out, report -&gt; line.split().each {w -&gt; out.collect(w,1)}},&lt;br /&gt;  {w, counts, out, report -&gt; out.collect(w, counts.sum())}&lt;br /&gt; )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In this example, wc is a function that can then be applied to various inputs as in&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  wc(["this is input text", "for the word-counter"])&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;or &lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  wc(new File("/local/input/file")&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;or even &lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  wc(functionUsingMapReduce(new HdfsFile("/existing/file")))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In spite of how pretty these tiny examples are, I eventually quit using it for day-to-day map-reduce coding.  &lt;br /&gt;&lt;br /&gt;The problem with things like grool that attempt to make writing map-reduce programs easier is that it doesn't really solve the problems that you face in writing map-reduce programs.  It does make word-count much shorter (5-7 lines of code versus &gt;200).  It does involve a very cool hack so that the closures passed as arguments are available as map and reduce objects on the right machines.  But I found in using grool for production code that it didn't solve the right problem.&lt;br /&gt;&lt;br /&gt;The big problems involved in map-reduce programming are&lt;br /&gt;&lt;br /&gt;&lt;li&gt; map-reduce is just a tad too low-level for most problems.  &lt;br /&gt;&lt;br /&gt;&lt;li&gt; debugging programs written using a facade like this is harder than for programs written using raw java + map-reduce.  &lt;br /&gt;&lt;br /&gt;&lt;li&gt; the boilerplate of map-reduce isn't proportional to problem size so large programs exhibit much less improvement in grool than small programs do.&lt;br /&gt;&lt;br /&gt;What I would much rather have is something that provides &lt;i&gt;much&lt;/i&gt; higher-level semantics such as those provided by Pig, but which provides for better integration into real programming languages.  The better integration that I need goes two directions.  First, I want to be able to, for instance, compute every date in a date range, use those dates to form input paths and then do a computation on those input paths.  Pig can't do that except via a very ugly argument substitution hack.&lt;br /&gt;&lt;br /&gt;A second kind of better integration is what Chris Wenzel has been pushing for some time with Cascading.  I would like it if I could write programs in what might be called PL/Pig (by analogy with PL/SQL) that would then be tightly integrated into a Cascading flow.  Ideally, the PL/Pig program would expose its execution plan to Cascading for global optimization and scheduling.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6052878329757502091?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6052878329757502091/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6052878329757502091' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6052878329757502091'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6052878329757502091'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/03/what-would-really-help-map-reduce.html' title='What would really help map-reduce programs?'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-389642480903586916</id><published>2009-01-15T00:07:00.000-08:00</published><updated>2009-01-15T00:27:52.074-08:00</updated><title type='text'>Real-time decision making using map-reduce</title><content type='html'>Recently Tim Bass described his happiness that the Mahout project has taken on some important tasks that are often applied to &lt;a href="http://www.thecepblog.com/2009/01/14/apache-mahout-real-time-decisioning-in-the-mapreduce-framework/"&gt;real-time decision making&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I think that Tim's joy is justified, although Mahout is still a very new project that is definitely not a finished work by any means.  There are a number of important algorithms that are being worked on that could provide some very important capabilities for real-time decision makers.&lt;br /&gt;&lt;br /&gt;The fact is, though, that map-reduce is no silver bullet.  It won't make problems go away.  It is an important technology for large-scale computing that lends it self to the batch training of real-time models if not quite to high availability real time decisioning.  For that, I tend to add systems like zookeeper and to build systems in the style of the Katta framework.&lt;br /&gt;&lt;br /&gt;In my mind the really important change that needs to happen is that designers of real-time decisioning systems need to embrace what I call scale-free computation.&lt;br /&gt;&lt;br /&gt;Scale free computation is what you get when you don't let the software engineers include the scale of the process as a design parameter.  Without that knowledge, they have to build systems that will not require changes when the scale changes.  Map-reduce is a great example of this because the map and reduce functions are pure functions.  The person who writes those functions has no idea how many machines will be used to compute them ... moreover, the framework is free to apply these functions more than once to data if it find a need.&lt;br /&gt;&lt;br /&gt;Katta does something similar by allowing the manager of a search system to specify what should happen without specifying quite how.   The master system uses Zookeeper to maintain state reliably and it examines the specification of what is desired and allocates tasks to search engines.  The search engines detect changes in their to-do list and download index shards as necessary.  Clients examine the state in Zookeeper and autonomously dispatch requests to search engines and combine results.&lt;br /&gt;&lt;br /&gt;The overall effect again is that the person implementing a new katta search system has very little idea how many machines will be used to run the search.  If the corpus is large, then many shards will be used.  If throughput is required, then shards will be replicated many times.  Neither requires change.&lt;br /&gt;&lt;br /&gt;Scale free design patterns like this are required for modern, highly reliable decision systems.  The change in mind-set required is non-trivial and dealing with legacy code will be prodigiously difficult.  As usual with major changes in technology, the best results will come after the old code is retired.  Hopefully that can happen before us old coders retire!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-389642480903586916?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/389642480903586916/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=389642480903586916' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/389642480903586916'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/389642480903586916'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2009/01/real-time-decision-making-using-map.html' title='Real-time decision making using map-reduce'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2096321076908602501</id><published>2008-12-17T12:59:00.000-08:00</published><updated>2008-12-17T13:12:35.862-08:00</updated><title type='text'>Students learn what they need, not what is assigned</title><content type='html'>&lt;div class="gmail_quote"&gt;A friend recently wrote about her experience with the currently fashionable style of teaching where students are expected to reinvent mathematics.&lt;br /&gt;&lt;br /&gt;She wrote:&lt;br /&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;&lt;div id=":3b" class="ArwC7c ckChnd"&gt;So much attention is given as to why - and to having students figure out algorithms themselves. Some teachers feel this is a waste of time - time that could be used for the students to be practicing the algorithms and becoming more fluent.&lt;/div&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;I think that this is largely a waste as well.  We cannot expect students of all abilities to recapitulate in 6 years the history of more than 2000 thousand years of thought by the best minds the earth has ever seen.  To imagine that they can is simply ludicrous.  To imagine that what they can invent in this short time and that they don't need the benefit of those great thoughts is similarly ludicrous.&lt;br /&gt;&lt;br /&gt;At the risk of sounding self-contradictory, I also think that when the student owns the task of learning, they will learn enormously better than when the task is imposed.  This does not imply that the student discover everything, however.  It merely implies that they need to discover the &lt;i&gt;need&lt;/i&gt; for the things that they learn.  Students who need to learn something can learn from almost any source, even from something as currently unfashionable as, say, sitting quietly thorugh traditional lecture.&lt;br /&gt;&lt;br /&gt;The last time I taught in the classroom was as a member of a two-person teaching team teaching a software engineering class on machine level programming.  In the past, this had been done by lecture and assignment and was truly a stunningly boring class.  On the first day, I turned the structure of the class upside-down and assigned the entire final exam.  This consisted of a single question in the form of a task (to build a robot that would drive around as fast as possible following a line on the floor).  I then passed out soldering irons, computer components and kits of lego parts and told them to get to work.  For the record, I had never tried to build such a 'bot myself.&lt;br /&gt;&lt;br /&gt;This tactic resulted, as you would expect, in panic.  The students complained that they didn't know how to solder, that they didn't know anything about the computer I had given them and that they didn't know how to build robots.  I told them that they would have to learn all this and much more and that I would try to help them find out all this information, but they would have to tell me what they wanted to learn.  Early on, the questions were about soldering.  Over time, the questions became more and more sophisticated.  At the beginning of the class, we had a list of lectures that we wanted to give, but we held them back until somebody asked a related question.  At that point we would have a vote among the class whether they would like to have a lecture on the subject or would rather continue work on the robots.&lt;br /&gt;&lt;br /&gt;By the end of the semester, I was getting complaints from the department because my students were (voluntarily) spending so much time on my class that they were neglecting their other classes.  Some were spending 40 hours or more in the computer lab and many had built remarkable contraptions little related to the impending exam.  This enthusiasm translated into perfect line-following performance on sample lines.&lt;br /&gt;&lt;br /&gt;But ...&lt;br /&gt;&lt;br /&gt;What I hadn't told the students was that the final would (for the first time) involve a line that crossed itself.  Essentially all of the robots would fail on this line because they hadn't been designed or programmed to deal with that case.  The &lt;i&gt;real&lt;/i&gt; exam was whether they could deal with the unexpected and redesign and reprogram their robots during the 3 hour exam period to succeed on the harder problem. &lt;br /&gt;&lt;br /&gt;All students succeeded.&lt;br /&gt;&lt;br /&gt;Moreover two thirds of the class came back a week after the official final to repeat it in front of a friend who flew down to see how the class had worked out.  I think, but do not have much data to support the assertion that it is rare for students to ask to be allowed to put off summer vacation so that they can repeat a final exam.&lt;br /&gt;&lt;br /&gt;The point of it all was that these students could learn vastly more than was expected of them if they just wanted to.  They could learn this material almost effortlessly in a traditional setting where somebody (me) wrote illegibly on the blackboard on difficult and abstract topics.  But previous classes with exactly the same lectures given by better teachers than me had failed abysmally.  I think the difference was that my students felt that they absolutely needed to learn what was in the lecture ... indeed, they had to ask for the lecture before I would give it.  They also felt very much the owners of their task.&lt;br /&gt;&lt;br /&gt;In the end, the students felt that they had invented almost everything they needed.  In fact, I had spoon-fed fed them almost all of the key material.  I told them how to make motors turn, how to make lights turn on, how to assemble and load software and how to design a software project.  That didn't matter because they didn't remember that.  What they remembered was that they initiated their learning of all of the material.  They owned the class.  I was their assistant, not their master.&lt;br /&gt;&lt;br /&gt;And they learned.  A lot.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2096321076908602501?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2096321076908602501/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2096321076908602501' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2096321076908602501'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2096321076908602501'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/12/students-learn-what-they-need-not-what.html' title='Students learn what they need, not what is assigned'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2275061558556604408</id><published>2008-12-16T11:02:00.000-08:00</published><updated>2008-12-16T11:16:04.431-08:00</updated><title type='text'></title><content type='html'>So what about the question of why two negatives multiplied give a positive?&lt;br /&gt;&lt;br /&gt;The short answer is that if this were not so, the world would not exist.&lt;br /&gt;&lt;br /&gt;That is, the fundamental reason that multiplication works that way is because otherwise the definition would be broken.  By broken, I mean that it would lead you to nonsensical conclusions if you followed it as far as you could.  Any definition multiplication in a system with negative numbers &lt;span style="font-style: italic;"&gt;has&lt;/span&gt; to work pretty much that way.&lt;br /&gt;&lt;br /&gt;&lt;div class="gmail_quote"&gt;On Tue, Dec 16, 2008 at 6:42 AM,  &lt;span dir="ltr"&gt;&lt;bonniestarrysky@gmail.com&gt;&lt;/span&gt; wrote:&lt;br /&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;&lt;div id=":4q" class="ArwC7c ckChnd"&gt;Why does a negative number times a negative number equal a positive number?&lt;br /&gt;How can you show that using a manipulative? &lt;/div&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Let's take the concrete (but abstract) example of arithmetic on a 7 hour clock.  Here is the addition table:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;+   &lt;b&gt;0 1 2 3 4 5 6&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;0&lt;/b&gt;   0 1 2 3 4 5 6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;1&lt;/b&gt;   1 2 3 4 5 6 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;2&lt;/b&gt;   2 3 4 5 6 0 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;3&lt;/b&gt;   3 4 5 6 0 1 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;4&lt;/b&gt;   4 5 6 0 1 2 3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;5&lt;/b&gt;   5 6 0 1 2 3 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;6&lt;/b&gt;   6 0 1 2 3 4 5&lt;/span&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;OK.  Now we can build multiplication on top of that.  I will use * instead of x to indicate multiplication because I don't have really good fonts.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new,monospace;"&gt;*   &lt;b&gt;0 1 2 3 4 5 6&lt;/b&gt;&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;0&lt;/b&gt;   0 0 0 0 0 0 0&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;1&lt;/b&gt;   0 1 2 3 4 5 6&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;2&lt;/b&gt;   0 2 4 6 1 3 5&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;3&lt;/b&gt;   0 3 6 2 5 1 4&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;4&lt;/b&gt;   0 4 1 5 2 6 3&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;5&lt;/b&gt;   0 5 3 1 6 4 2&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new,monospace;"&gt;&lt;b&gt;6&lt;/b&gt;   0 6 5 4 3 2 1&lt;/span&gt;&lt;span style="font-family: courier new,monospace;"&gt;&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;OK.  Not much to see here except that in the multiplication every row has all of the integers.  Moreover, every row has these integers in a different order.  This is a consequence of the fact that 7 is a prime number.  It doesn't happen that way on the 12 hour clock.&lt;br /&gt;&lt;br /&gt;But let's look again at the addition table.  Note that 1 + 6 = 0 and that 2 + 5 = 0.  Notice also that 0 only appears once in each row.  That means that we can consider 6 to be a way of writing -1.  Or perhaps -1 is a way to write 6.  This lets us solve addition problems such as 3 + x = 5 by adding 4 (which is -3) to both sides.&lt;br /&gt;&lt;br /&gt;But what happens when we multiply 3 * (-2)?  Do we get -6?&lt;br /&gt;&lt;br /&gt;Well, -2 = 5 so this should be the same as 3 * 5 which is 1.  But 1 is also -6 !&lt;br /&gt;&lt;br /&gt;And all of our properties like the distributive law should still work. &lt;br /&gt;&lt;br /&gt;Thus 3 * (-2) + 3 * 4 = 1 + 5 = 6 should be 3 * (-2 + 4) which is 3 * 2 = 6.  So that works.&lt;br /&gt;&lt;br /&gt;What about (-3) * (-2) = 6?&lt;br /&gt;&lt;br /&gt;Well, this should be (and is) the same as 4 * 5 = 6. &lt;br /&gt;&lt;br /&gt;That's cool.&lt;br /&gt;&lt;br /&gt;In a very limited and geeky dull kind of way.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What is more cools is that this is all just a very concrete example of how the laws or arithmetic imply a certain kind of order.  We could test this out for all the different kinds of arithmetic that have the properties that we like about integers.  I don't mind starting infinite tasks, but I do mind having to finish them.&lt;br /&gt;&lt;br /&gt;SO&lt;br /&gt;&lt;br /&gt;Let's do much better by reasoning from those properties directly.&lt;br /&gt;&lt;br /&gt;For instance, assume that we have the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;an additive unit, 0 such that &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; + 0 = 0 + &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; = &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; and thus 0 * &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; = 0&lt;/li&gt;&lt;li&gt;an additive inverse, -&lt;span style="font-style: italic;"&gt;x&lt;/span&gt; = 0 - &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;&lt;/li&gt;&lt;li&gt;left distributive law, (&lt;span style="font-style: italic;"&gt;a&lt;/span&gt; + &lt;span style="font-style: italic;"&gt;b&lt;/span&gt;) * &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; = &lt;span style="font-style: italic;"&gt;a&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; + &lt;span style="font-style: italic;"&gt;b&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;&lt;/li&gt;&lt;li&gt;right distributive law, &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * (&lt;span style="font-style: italic;"&gt;a&lt;/span&gt; + &lt;span style="font-style: italic;"&gt;b&lt;/span&gt;) = &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;a&lt;/span&gt; + &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;b&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Now take &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * (0 - &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;).  This has to be equal to (&lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * 0) - (&lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) = 0 - (&lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) = - (&lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;Or (0 - &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;) * (0 - &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) = 0 - &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * (0 - &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) = 0 - (0 - &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) = &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This means that ANY kind of arithmetic where multiplication and addition have an additive inverse and left and right distributive laws will follow the pattern that (-&lt;span style="font-style: italic;"&gt;x&lt;/span&gt;) * (-&lt;span style="font-style: italic;"&gt;y&lt;/span&gt;) = &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; * &lt;span style="font-style: italic;"&gt;y&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;OR&lt;br /&gt;&lt;br /&gt;if that system doesn't have that, then it won't have one or all of the properties mentioned.&lt;br /&gt;&lt;br /&gt;This works for a spinning globe, for quantum mechanics, for clocks and rubik's cubes.&lt;br /&gt;&lt;br /&gt;And THAT is why abstraction is cool.  But not why it is useful.  That comes next.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2275061558556604408?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2275061558556604408/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2275061558556604408' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2275061558556604408'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2275061558556604408'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/12/so-what-about-question-of-why-two.html' title=''/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6166078413131827629</id><published>2008-12-16T10:31:00.000-08:00</published><updated>2008-12-16T10:35:10.076-08:00</updated><title type='text'>Mathematics as an article of faith</title><content type='html'>A friend asked me a wonderful question recently.&lt;br /&gt;&lt;br /&gt;She asked:&lt;br /&gt;&gt; I feel like math is so wonderful and useful to a certain point. (Like sixth grade.)&lt;br /&gt;&gt; Then it seems abstract... I can't see the purpose.&lt;br /&gt;&lt;br /&gt;She was very much right that mathematics becomes abstract at that point.&lt;br /&gt;&lt;br /&gt;And that is very much the point of it.  It allows us to think abstractly.  It allows us to find patterns in many different things that work the same way.  That allows us to think a problem through in mathematically form and then recognize that form again and again.&lt;br /&gt;&lt;br /&gt;One good example is modular arithmetic.  We could call it clock arithmetic and talk about integral hours that wrap around the 12 hour clock.  We could build an addition table that tells us everything there is to know about that kind of arithmetic.  We could extend that addition table to be multiplication by repeated multiplication.&lt;br /&gt;&lt;br /&gt;That is all well and good.  But it only tells us about clock arithmetic on a 12 hour clock.  It doesn't tell us as it stands about clock arithmetic on an 11 or 13 or 24 hour clock.  It doesn't tell us about arithmetic on a clock where the time isn't just an integer, but can be between the hours.  Maybe we don't care about those things because we don't have to tell time on a 17 hour clock very often.&lt;br /&gt;&lt;br /&gt;But what about the fact that the original arithmetic that we started with on our 12 hour clock also works just fine for .... music on a western scale.&lt;br /&gt;&lt;br /&gt;Should we spend the same amount of time reinventing all of what we learned about the 12 hour clock when it is &lt;span style="font-style: italic;"&gt;exactly&lt;/span&gt; the same as for musical scales?  Or should we re-use that knowledge?&lt;br /&gt;&lt;br /&gt;Well, to re-use that knowledge, we have to stop a moment and erase all the places where we originally said "clock" or "hour" or "one day later" and replace them with the integers from 0 to 11 inclusive and replace "move clockwise one hour" with "add one, reduce by removing 12's".  This abstracts the original system which can make it harder to learn, but it also makes it much more useful because we don't have to learn it again and again.&lt;br /&gt;&lt;br /&gt;But what if somebody asks about pentatonic scales?  Or about microtones?  Or the 35 note scales that were talked about in the sixties? &lt;br /&gt;&lt;br /&gt;Well, if we had started by abstracting away the number of hours on the clock and abstracting away whether the numbers were integers or real numbers, then we would be able to answer those questions instantly.&lt;br /&gt;&lt;br /&gt;And if we had abstracted the idea of rotation a bit more, we would see that rotation around a circle can be generalized into rotation around a sphere or even more complex things.  Now our clock arithmetic can solve navigational problems and get us home on a dark night.&lt;br /&gt;&lt;br /&gt;But with just a bit more extension, that same clock arithmetic (now with 4 hours and 9 dimensions) could solve a Rubik's cube with about 5 minutes of thought.&lt;br /&gt;&lt;br /&gt;None of the details here matter.  For instance, the fact that 12 is not prime plays a big role in the nature of the arithmetic that we get on clocks, but that isn't the point of all of this. &lt;br /&gt;&lt;br /&gt;Now, this abstraction is really frustrating because you can't always point to something and say that it is what you are talking about.  In fact, the &lt;span style="font-style: italic;"&gt;point&lt;/span&gt; of abstraction is that you &lt;span style="font-style: italic;"&gt;aren't&lt;/span&gt; talking about something specific.  This is exactly what makes it hard to teach abstraction using manipulatives.  Manipulatives are all about grounding intuition in naive physics.  Abstraction is all about UNgrounding your intuitions.&lt;br /&gt;&lt;br /&gt;The point is that mathematics is all about abstraction.  It is just a mechanism so that we can repeat what once appeared the work of genius without having ourselves to be genii. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So is it useful for most people to understand these wheels within wheels that turn behind the world we see?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To me, yes, it is useful and beautiful and wondrous.  But I can't speak for others. &lt;br /&gt;&lt;br /&gt;I do think that anybody with a spiritual tendency is turning away from the hand of god if they choose to not see all the kinds of order in the world.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6166078413131827629?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6166078413131827629/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6166078413131827629' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6166078413131827629'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6166078413131827629'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/12/mathematics-as-article-of-faith.html' title='Mathematics as an article of faith'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-5378328851053224717</id><published>2008-11-04T22:29:00.000-08:00</published><updated>2008-11-04T22:29:29.285-08:00</updated><title type='text'>natural language processing blog: Using machine learning to answer scientific questions</title><content type='html'>Try this paper for some pretty interesting results on regression in a large variable space with interactions:&lt;br /&gt;&lt;br /&gt;   http://ideas.repec.org/p/wop/pennin/01-05.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-5378328851053224717?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://nlpers.blogspot.com/2008/11/using-machine-learning-to-answer.html' title='natural language processing blog: Using machine learning to answer scientific questions'/><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/5378328851053224717/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=5378328851053224717' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5378328851053224717'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5378328851053224717'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/11/natural-language-processing-blog-using.html' title='natural language processing blog: Using machine learning to answer scientific questions'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-5349662498690470207</id><published>2008-07-05T22:03:00.000-07:00</published><updated>2008-07-05T22:25:31.349-07:00</updated><title type='text'>Death penalty and data modeling</title><content type='html'>This &lt;a href="http://ijlit.oxfordjournals.org/cgi/content/full/16/1/1"&gt;article&lt;/a&gt; just appeared in the "International Journal of  Law and Information Technology".  It purports to show that the death penalty is arbitrary because whether or not somebody who is sentenced to death can be predicted without considering any substantive characteristic of the crime in question.  If true, the conclusion certainly would indicate that the death penalty really is applied based more on who was sentenced rather than what they did.&lt;br /&gt;&lt;br /&gt;Unfortunately, there are a number of defects in the analysis given in that paper that me somewhat leery of taking the results at face value.  Some of these problems represent insufficient detail in the article, others indicate a lack of rigor in the data analysis itself.&lt;br /&gt;&lt;br /&gt;First, the authors include without definition the following three variables:&lt;br /&gt;&lt;br /&gt;  7.    Third most serious capital offence&lt;br /&gt;  8.    Second most serious capital offence&lt;br /&gt;  9.    First most serious capital offence&lt;br /&gt;&lt;br /&gt;These variables sound related to the capital crime for which the prisoner was sentenced.  Without definition, we cannot judge.  Moreover, without knowing how these variables were encoded, we cannot tell whether the model was able to use this input or not.&lt;br /&gt;&lt;br /&gt;Secondly, the authors include the following four time based variables:&lt;br /&gt;&lt;br /&gt;  15.    Month of conviction for capital offense&lt;br /&gt;  16.    Year of conviction for capital offense&lt;br /&gt;  17.    Month of sentence for capital offense&lt;br /&gt;  18.    Year of sentence for capital offense&lt;br /&gt;&lt;br /&gt;I understand what these variables represent, but am curious how the authors encoded them.  Normally in a model like this, time would be encoded as a continuous linear variable from some relatively recent epoch.  Separating a time variable like this often leads to problems if you are looking for recency effects.  The other common use of separated month variables is to look for seasonality affects.  Typically, though, this would be done using 1 of (n-1) encoding which you clearly did not use given how few inputs you have.&lt;br /&gt;&lt;br /&gt;While the encoding of these variables doesn't really bear much on the validity of the results, it does make it unlikely that the model could have used these variables for the purpose intended.&lt;br /&gt;&lt;br /&gt;Thirdly, the issue of encoding also comes up the geographical variable:&lt;br /&gt;&lt;br /&gt; 2.    State&lt;br /&gt;&lt;br /&gt;Encoded as a single variable, this is almost certainly an integer code.  This is a very poor encoding of state (1 of 50 encoding would be much better, several variables with different resolution would be even better).&lt;br /&gt;&lt;br /&gt;Regarding the learning results themselves, the authors apparently did not assess which input variables were responsible for their results.  They should have used any of the standard techniques such as step-wise variable selection or input randomization.  They should have also measured how much value the non-linear classifier that they used mattered to the results.  Typically, this would be done by comparing the results available from a linear classifier such as a ridged logistic regression.  With the encodings in use by these authors, such a straight-forward comparison is probably not viable.&lt;br /&gt;&lt;br /&gt;The authors also do not appear to have done anything to determine whether there is a target leak in the data.  This can occur when there is some apparently independent input which is accidentally highly correlated with the desired output.  In this case, it is possible that one state is responsible for a large majority of the executions.  This might make the output more predictable without providing much ammunition for the original argument.&lt;br /&gt;&lt;br /&gt;Finally, the authors do not have any reference in their article about public availability of their data.  Without making their data easily available and given how incredibly easy it is for novices to screw up a modeling effort, these results should be considered no more than slightly provocative.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-5349662498690470207?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/5349662498690470207/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=5349662498690470207' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5349662498690470207'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5349662498690470207'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/07/death-penalty-and-data-modeling.html' title='Death penalty and data modeling'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-8680856399453352885</id><published>2008-07-03T13:26:00.000-07:00</published><updated>2008-07-03T14:01:28.065-07:00</updated><title type='text'>Why the long tail isn't as long as expected</title><content type='html'>Several bloggers and authors are finding out (belatedly relative to people involved in the industry) that the economic returns in systems that are supposedly governed by long-tail distributions are unexpectedly concentrated in the highly popular items.  See &lt;a href="http://conversationstarter.hbsp.com/2008/06/challenging_the_long_tail.html"&gt;here&lt;/a&gt; for a blogger's commentary on &lt;a href="http://harvardbusinessonline.hbsp.harvard.edu/hbsp/hbr/articles/article.jsp?ml_action=get-article&amp;amp;articleID=R0807H&amp;amp;ml_issueid=BR0807&amp;amp;ml_subscriber=true&amp;amp;pageNumber=1&amp;amp;_requestid=241246"&gt;this&lt;/a&gt; article.&lt;br /&gt;&lt;br /&gt;In practice, the long tail model does predict certain kinds of consumption very well.  I speak from experience analyzing view data at Veoh where, except for the very few top titles, a unit power law described number of views pretty well.  This means that the number of views from sparsely watched videos is surprisingly large, adding to the woes of anybody trying to review submissions.&lt;br /&gt;&lt;br /&gt;Economically speaking, though, you have to factor in a few kinds of friction.  These are reasonably modeled as per unit sale costs and per unit of stock costs.  The per unit sale costs don't affect the distribution of revenue and profit, but the per stock unit costs definitely do.  If you draw the classic Zipf curve and assume that revenue is proportional to consumption for all items, then a per stock unit cost offsets the curve vertically.  The resulting profit curve tells you pretty much immediately how things will fall out.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_c4sz5uEKsbI/SG04dDnqASI/AAAAAAAAAAM/XEw_a2SdeTI/s1600-h/Picture+1.png"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_c4sz5uEKsbI/SG04dDnqASI/AAAAAAAAAAM/XEw_a2SdeTI/s320/Picture+1.png" alt="" id="BLOGGER_PHOTO_ID_5218889615031271714" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The graph to the right illustrates.  The solid line is the classic long-tail distribution and the dashed lines represent different per stock unit costs.  Wherever the solid line is above the dashed line, the model implies positive returns; where it is below, it implies loss.  Clearly, the lower the stock unit costs, the deeper into the distribution you can go and still make money.&lt;br /&gt;&lt;br /&gt;If we assume that you will only be stocking items that have non-negative return, then the percentage of profit that comes from a particular part of the distribution will vary with the threshold.  For high thresholds, almost all of the profit will be from mega-hits but for very low thresholds, a much larger percentage will come from low consumption items.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_c4sz5uEKsbI/SG09QrkOGsI/AAAAAAAAAAU/b1Tvv_Qc9RE/s1600-h/Picture+2.png"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="http://4.bp.blogspot.com/_c4sz5uEKsbI/SG09QrkOGsI/AAAAAAAAAAU/b1Tvv_Qc9RE/s320/Picture+2.png" alt="" id="BLOGGER_PHOTO_ID_5218894899974118082" border="0" /&gt;&lt;/a&gt;This sort of situation is shown in this second graph which shows the percentage of total profit achieved for different thresholds and ranks.  The concentration of profit in the low rank items for high thresholds is quite clear.&lt;br /&gt;&lt;br /&gt;A good example of a per stock unit cost is given in p2p download schemes.  The first few downloads of each item have to be paid for by the original source while  additional downloads make use of sunk network resources that apply minimal marginal cost back to the source.  For systems like bit-torrent, you need more than a hundred downloads before you get to high p2p efficiency which makes the system useful for mainstream content and less useful for body and tail content.  The Veoh p2p system has a much lower threshold and is thus useful far down into the tail.  Either system, though, leads to an economic return different from the theoretical zero-cost long-tail model.&lt;br /&gt;&lt;br /&gt;None of these observations is earth-shaking and, as far as I know these are all common wisdom among anybody trying to make money out of the long tail.  Why this is news to the Harvard Business Review is the real mystery to me.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-8680856399453352885?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/8680856399453352885/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=8680856399453352885' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8680856399453352885'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8680856399453352885'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/07/why-long-tail-isnt-as-long-as-expected.html' title='Why the long tail isn&apos;t as long as expected'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_c4sz5uEKsbI/SG04dDnqASI/AAAAAAAAAAM/XEw_a2SdeTI/s72-c/Picture+1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-1403698364631918639</id><published>2008-04-16T15:37:00.000-07:00</published><updated>2008-04-16T15:40:34.208-07:00</updated><title type='text'>Reading email</title><content type='html'>And you thought that would be easy.&lt;br /&gt;&lt;br /&gt;If you have ever written a program to receive email (not just pick it up from a POP server), you know just how painful that can be.  If you wrote the program in Java, you know this even better.&lt;br /&gt;&lt;br /&gt;The world just got better:&lt;br /&gt;&lt;br /&gt;   http://subethasmtp.tigris.org/&lt;br /&gt;&lt;br /&gt;This package is just as simple as reading email should be.  Your program decides whether to receive the email and then it gets it.  The interface is tiny and it looks like they handle enough of the standards to really work.&lt;br /&gt;&lt;br /&gt;Whew.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-1403698364631918639?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/1403698364631918639/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=1403698364631918639' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1403698364631918639'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/1403698364631918639'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/04/reading-email.html' title='Reading email'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-7767455609478388698</id><published>2008-04-15T11:35:00.001-07:00</published><updated>2008-04-15T11:58:35.569-07:00</updated><title type='text'>A random walk from eigenvectors to parallel page rank</title><content type='html'>Power law algorithms are ideal for computing things like page rank.  That isn't obvious, however.&lt;br /&gt;&lt;br /&gt;The basic outline idea is that hub and authority style algorithms are intimately related to eigenvector or singular value decompositions (depending on whether the links are symmetrical).  This also means that there is a close relationship to asymptotic beahavior of random walks on the graph.  That probably still isn't obvious, so let's dig in.&lt;br /&gt;&lt;br /&gt;If you represent the linkage in the web by a matrix that has columns representing source page and rows representing the target page and with a 1 where-ever the source page has a link pointing to the target page, then if you start with a vector with a single non-zero element equal to 1 as a representation of a single page, then multiplying by the linkage matrix will give you a vector with 1 in the positions corresponding to the pages the original page linked to.  If you multiply again, you get all the pages that you can get to in two steps from the original page.&lt;br /&gt;&lt;br /&gt;Mathematically, if we call the original vector x and the linkage matrix A, the pages that x links to are just Ax.   The pages that are two steps from x are A(Ax) = A&lt;sup&gt;2&lt;/sup&gt; x.&lt;br /&gt;&lt;br /&gt;The eigenvector decomposition of A is just a way of writing A as a product of three matrices:&lt;br /&gt;&lt;br /&gt;A = U S U'&lt;br /&gt;&lt;br /&gt;U' is the transpose of U, and U has the special property that U'U = I (it is called ortho-normal because of this).&lt;br /&gt;&lt;br /&gt;S is a diagonal matrix.&lt;br /&gt;&lt;br /&gt;There is lots of deep mathematical machinery and beautiful symmetry available here, but for now we can just take this as given.&lt;br /&gt;&lt;br /&gt;The set of pages &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; steps from x are&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;x&lt;sub&gt;n&lt;/sub&gt;&lt;/span&gt; = A&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; = (U S U')&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; x = (U S U')&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;-2&lt;/sup&gt; (U S U') (U S U') &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;&lt;br /&gt;  = (U S U')&lt;sup&gt;n-2&lt;/sup&gt; (U S (U'U) S U') &lt;span style="font-style: italic;"&gt;x&lt;/span&gt; = (U S U')&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;-2&lt;/sup&gt; (U S&lt;sup&gt;2&lt;/sup&gt; U') &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;&lt;br /&gt;  = U S&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; U' &lt;span style="font-style: italic;"&gt;x&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is really cool because S&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; can be computed by just taking each diagonal element and raising it to a power.&lt;br /&gt;&lt;br /&gt;Eigenvector decompositions have other, really deep connections.  For instance,  if you take the elements of S (call the i-th one &lt;span style="font-style: italic;"&gt;s&lt;/span&gt;&lt;sub style="font-style: italic;"&gt;i&lt;/sub&gt;) then&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:Arial,Helvetica,sans-serif;font-size:100%;"  &gt; &lt;img src="http://alt1.artofproblemsolving.com/Forum/latexrender/pictures/7/8/2/782cb89fbe7443d4734af5f2bb9f8d15ac2ace56.gif" alt="\displaystyle\sum" align="absmiddle" /&gt;&lt;/span&gt;&lt;sub&gt;&lt;span style="font-style: italic;"&gt;i&lt;/span&gt;&lt;/sub&gt; s&lt;sub&gt;i&lt;/sub&gt;&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;is the number of paths that are &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; steps long.&lt;br /&gt;&lt;br /&gt;Connected (or nearly connected) clusters of pages can also be derived from the eigenvector decomposition.  This is the basis of so-called spectral clustering.  For some very impressive examples of spectral clustering see this &lt;a href="http://citeseer.ist.psu.edu/ng01spectral.html"&gt;paper.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So eigenvectors are cool.  But how can we compute them?  And how can we compute them on BIG graphs in parallel?&lt;br /&gt;&lt;br /&gt;First, note that if A&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; = U S&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; U' and if some of the &lt;span style="font-style: italic;"&gt;s&lt;sub&gt;i&lt;/sub&gt; &lt;/span&gt;are bigger than others, the big ones will quickly dominate the others.  That is pretty quickly, A&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; &lt;span style=";font-family:Arial,Helvetica,sans-serif;font-size:100%;"  &gt;&lt;img src="http://alt2.artofproblemsolving.com/Forum/latexrender/pictures/b/a/f/baf1f8e194a46566b6abaecc626a5338eb89b6c4.gif" alt="\approx" align="absmiddle" /&gt;&lt;/span&gt; &lt;span style="font-style: italic;"&gt;u&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt; &lt;span style="font-style: italic;"&gt;s&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;sup&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/sup&gt; &lt;span style="font-style: italic;"&gt;u&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt;'.  This means that we can compute an approximation of u&lt;sub&gt;1&lt;/sub&gt; by just doing A&lt;sup&gt;n&lt;/sup&gt; x where x is some random vector.  Moreover, we can compute u&lt;sub&gt;2&lt;/sub&gt; by starting with a different random vector and iterating the same way, but with an additional step where we forbid the result from going towards u&lt;sub&gt;1&lt;/sub&gt;.  With just a few additional wrinkles, this gives us what is called the Lanczos algorithm.  Golub and van Loan's excellent book Matrix Computations gives a lot of information on these algorithms.&lt;br /&gt;&lt;br /&gt;The cool thing here is that our random vector can represent a single page and we can approximate the final result by following links.  Following links is just a (human-readable) way of saying sparse matrix multiplication.  If we do this multiplication against lots of different random starting points, we can quickly build parallel algorithms to compute things like page rank.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-7767455609478388698?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/7767455609478388698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=7767455609478388698' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7767455609478388698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/7767455609478388698'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/04/random-walk-from-eigenvectors-to.html' title='A random walk from eigenvectors to parallel page rank'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-2305786958804182019</id><published>2008-04-15T08:13:00.000-07:00</published><updated>2008-04-15T08:18:20.125-07:00</updated><title type='text'>Words at random, carefully chosen</title><content type='html'>On comp.ai, Dmitry Kazakov reiterated the lonely cry of a frequentist against statistical natural language.  This cry has been repeated many times over the years by many people who cannot abide the treatment of documents and language as if they were random.&lt;br /&gt;&lt;br /&gt;Let's examine the situation more carefully.&lt;br /&gt;&lt;br /&gt;On Apr 14, 5:28 am, "Dmitry A. Kazakov" &lt;mail...@dmitry-kazakov.de&gt; wrote:&lt;br /&gt;&gt; ... It cannot be probability because the document is obviously not random ...&lt;br /&gt;&lt;br /&gt;The statement "It cannot be probability ..." is essentially a tautology.  It should read, "We cannot use the word probability to describe our state of knowledge because we have implicitly accepted the assumption that probability cannot be used to describe our state of knowledge".&lt;br /&gt;&lt;br /&gt;The fact that an object has been constructed in its present state by non-random processes outside our ken is no different as far as we can tell than if the object were constructed at random (note that random does not equal uniform).  What if the document were, in fact, written using the I Ching (as Philip K Dick is reputed to have written "The Man in the High Castle")?   Is it reasonable to describe the text as having been randomly generated now that we know that?&lt;br /&gt;&lt;br /&gt;Take the canonical and over-worked example of the coin being flipped.  Before the coin is flipped a reasonable observer who knows the physics of the situation and who trusts the flipper would declare the probability of heads to be 100%.  After the coin is flipped, but before it is revealed, the situation is actually no different.  Yes, the coin now has a state whereas before the coin was only going to have a state, but, in fact, the only real difference is that the physics has become somewhat simpler, the most important factor in our answering the question of the probability has not changed.  We still do not know the outcome.&lt;br /&gt;&lt;br /&gt;Moreover, if the person flipping the coin looks at the coin, that does not and cannot change our answer.&lt;br /&gt;&lt;br /&gt;When WE look at the coin, however, we now suddenly, miraculously declare that the probability is now 100% that the coin has come up heads.  Nothing has changed physically, but our estimate has changed dramatically.&lt;br /&gt;&lt;br /&gt;Moreover, if we now examine the coin and find that it has two heads, our previous answer of 50% is still valid in the original context.  If we were to repeat the experiment, our correct interpretation is to give 100% as the probability before the flip.  The only difference is our state of knowledge.&lt;br /&gt;&lt;br /&gt;So philosophically speaking, probability is a statement of knowledge.&lt;br /&gt;&lt;br /&gt;Moreover, by de Finetti's famous theorem, even if this philosophical argument is bogus, the mathematics all works our AS IF there were an underlying distribution on the parameters of the system.  That means that we can profitably use this philosophical argument AS IF it were true.&lt;br /&gt;&lt;br /&gt;The upshot is that even if you are a frequentist in your heart of hearts, it will still pay to behave as if you were a Bayesian.  And I, as a Bayesian, will be able to behave as if you were rational because I will not know your secret.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-2305786958804182019?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/2305786958804182019/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=2305786958804182019' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2305786958804182019'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/2305786958804182019'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/04/words-at-random-carefully-chosen.html' title='Words at random, carefully chosen'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6291734095028932128</id><published>2008-03-28T10:04:00.000-07:00</published><updated>2008-04-15T08:13:14.683-07:00</updated><title type='text'>How Not to Recommend</title><content type='html'>A friend of mine pointed out &lt;a href="http://whimsley.typepad.com/whimsley/2007/07/the-limitations.html"&gt;&lt;span style="font-style: italic;"&gt;The Netflix Prize: 300 Days Later&lt;/span&gt;&lt;/a&gt; this morning.  The article points out how difficult it has been to make significant improvements on the Netflix rating prediction problem.&lt;br /&gt;&lt;br /&gt;This article reinforces a lot of my own feelings.  I continue to believe that recommendations based on ratings are inherently flawed.  Among other things&lt;br /&gt;&lt;ul&gt;&lt;li&gt;predicting ratings from ratings is hard&lt;/li&gt;&lt;li&gt;predicting real user preferences from ratings is even harder&lt;/li&gt;&lt;li&gt;most people don’t rate things very much, if at all (makes #1 even harder)&lt;/li&gt;&lt;/ul&gt;All of these factors are eased considerably if you just get rid of the idea that you have to predict ratings.  Observing real behavior and then predicting that same behavior is vastly easier.&lt;br /&gt;&lt;br /&gt;&lt;jkorman@veoh.com&gt;&lt;/jkorman@veoh.com&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6291734095028932128?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6291734095028932128/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6291734095028932128' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6291734095028932128'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6291734095028932128'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/03/how-not-to-recommend.html' title='How Not to Recommend'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-3007903168405439520</id><published>2008-03-26T12:06:00.000-07:00</published><updated>2008-03-28T09:17:31.442-07:00</updated><title type='text'>The Power of Meta</title><content type='html'>Evolutionary algorithms are oohhh so trendy, but they often provide very poor convergence rates.  This occurs particularly when the population is in the neighborhood of a good solution, but the mutations being produced are far from that neighborhood because the mutation rate is too high.&lt;br /&gt;&lt;br /&gt;Decreasing the mutation rate often doesn't help because then it takes forever to find the right neighborhood.&lt;br /&gt;&lt;br /&gt;Changing the mutation rate externally can work, but you will usually get the annealing schedule wrong and have too much or too little mutation at some point.&lt;br /&gt;&lt;br /&gt;One of the most effective ways to fix these problems is the technique known as meta-mutation.  With meta-mutation, each member of the population has data about how dramatically it should be mutated.  Then, when you are far from a good solution, members that take big steps can be big winners and their descendants will also take big steps.  When you get close to a good solution, members with large mutation rates will step far from the best solution and will be mostly unable to improve the solution, but some descendants will have smaller mutation rates and these will able to find improved solutions.  These calmer descendants will quickly dominate the population because they will be able to produce significant improvements.&lt;br /&gt;&lt;br /&gt;Unfortunately, meta-mutation introduces the new problem of how you should mutate the mutation rate.  (danger ... recursion alert)&lt;br /&gt;&lt;br /&gt;I describe some methods that avoid all of these problems in my 1998 paper &lt;span style="font-style: italic;"&gt;Recorded Step Mutation for Faster Convergence.&lt;/span&gt;  The techniques are very simple and very effective.  The paper is available from &lt;a href="http://arxiv.org/abs/0803.3838"&gt;arxiv&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The real trick is that you can have the mutation rate encode the meta-mutation rate as well.  One approach using a fixed distribution that is scaled by the current mutation rate.  Another is to use the last step determine a directional component of the new mutation rate.  Together, these approaches can improve convergence rate and final accuracy by many orders of magnitude relative to other approaches.  Self-similar meta-mutation alone gives results that are nearly identical in symmetric Gaussian bowl problems to analytically determined optimal annealing schedules.  When combined with directional mutation, results are near optimal for Gaussian bowls with low rank cross correlations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-3007903168405439520?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='enclosure' type='text/html' href='http://arxiv.org/abs/0803.3838' length='0'/><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/3007903168405439520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=3007903168405439520' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3007903168405439520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/3007903168405439520'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/03/power-of-meta.html' title='The Power of Meta'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-8932254133045032592</id><published>2008-03-25T09:20:00.000-07:00</published><updated>2008-03-25T09:56:43.620-07:00</updated><title type='text'>Hello world for map-reduce</title><content type='html'>There has been a surge of interest lately in map-reduce based parallel from many quarters.  The public, non-google side of the that interest is largely focussed around an open source system of software known as Hadoop.&lt;br /&gt;&lt;br /&gt;Lots of people are doing some very cool things with Hadoop.  Yahoo is now running key parts of their web indexing infrastructure using Hadoop.  Academics have described very impressive results in large scale machine translation developments.  Facebook uses Hadoop for key parts of their metrics pipeline.  IBM has developed a functional programming system on Hadoop.  Imitators have been spawned.&lt;br /&gt;&lt;br /&gt;But one key ingredient of any new computing technology is still missing.&lt;br /&gt;&lt;br /&gt;What Hadoop has been missing to date is a good way to write a hello-world application in just a few lines of code.  It has often been said that the ability to do simple things simply is a critical capability for a new technology.  Hadoop, and map-reduce programming in general, does not lack a good hello-world example; the word-counting example provides that.  What is missing is the ability to write this program in just a few lines of code.  The word-count example that comes with the Hadoop distribution is over a hundred lines of code, most of which are impossible to motivate for a new user.&lt;br /&gt;&lt;br /&gt;The basic idea is simple, the map phase tokenizes input lines and emits words as keys with (initial) counts of 1.  The reduce phase takes all of the counts for a single word and sums them up.   As a refinement, we can use a combiner to pre-reduce counts within the output of a single mapper.&lt;br /&gt;&lt;br /&gt;So why is this more than 5 lines of code?  The answer is that running a real world map-reduce cluster has some real complexity if you want the best possible results.  Another reason is that in Java hiding complex implementations requires lots of noise tokens.  It isn't that Java is bad, it is just that it isn't good for hiding complexity from naive users.&lt;br /&gt;&lt;br /&gt;My answer to this problem is to move to Groovy, a language whose raison d'etre is specifically to hide the cruft in Java programs.  Groovy isn't Java or an extension of Java, but Groovy programs inter-call with Java programs completely transparently and compile to the same byte codes.&lt;br /&gt;&lt;br /&gt;So what should word-count look like?  I think that it should look like functional programming.  Our mapper should look like a function as should our reducer.  They should be combined to produce a map-reduce program which should itself be a function that takes inputs and produces  outputs.   Those inputs should be any kind of input that we might like to use including lists of strings (for testing), local files and files stored in a distributed data store.&lt;br /&gt;&lt;br /&gt;So our program really ought to look like this:&lt;br /&gt;&lt;pre&gt;  wc = mapReduce(&lt;br /&gt;  {key, line, out, report -&gt; line.split().each {w -&gt; out.collect(w,1)}},&lt;br /&gt;  {w, counts, out, report -&gt; out.collect(w, counts.sum())}&lt;br /&gt; )&lt;br /&gt;&lt;/pre&gt;and we should be able to use this program in a lot of different ways:&lt;br /&gt;&lt;pre&gt;  wc(["this is input text", "for the word-counter"])&lt;br /&gt; wc(new File("/local/input/file")&lt;br /&gt; wc(findInterestingTextUsingMapReduce(new HdfsFile("/existing/distributed/file")&lt;br /&gt;&lt;/pre&gt;and we should be able to access the output very simply:&lt;br /&gt;&lt;pre&gt;  wc(["this is input text"]).eachLine {&lt;br /&gt;     println it&lt;br /&gt; }&lt;br /&gt;&lt;/pre&gt;Subject to some limitations in Groovy 1.5.4, this is exactly what my new system does.  I call this new system &lt;span style="font-family: courier new;"&gt;grool&lt;/span&gt; because it is pretty thin porridge right now.&lt;br /&gt;&lt;br /&gt;Send me email for a copy.  I will get it onto a standard kind of open-source repository soon(ish), but you can try it sooner.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-8932254133045032592?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/8932254133045032592/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=8932254133045032592' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8932254133045032592'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/8932254133045032592'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/03/hello-world-for-map-reduce.html' title='Hello world for map-reduce'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-6655800277795930385</id><published>2008-03-21T10:10:00.000-07:00</published><updated>2010-07-30T14:59:05.610-07:00</updated><title type='text'>Surprise and Coincidence</title><content type='html'>Some years ago, I wrote a simple paper, &lt;span class="m"&gt;&lt;span class="l"&gt;&lt;b&gt;&lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962"&gt;Accurate Methods for the Statistics of Surprise and Coincidence&lt;/a&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;/b&gt;that has since seen quite a history of citation and use by others.  The basic method was applied to get highly accurate &lt;a href="http://citeseer.ist.psu.edu/50120.html"&gt;language identification&lt;/a&gt;, species identification, author identification, document classification, &lt;a href="http://web.archive.org/web/20030423113844/http://www.musicmatch.com/"&gt;musical recommendations&lt;/a&gt;, insurance risk assessment and &lt;a href="http://www.veoh.com/"&gt;video recommendations&lt;/a&gt;.   It was eventually the core of my doctoral dissertation (ask me for a copy) and has been cited hundreds of times.&lt;br /&gt;&lt;br /&gt;What hasn't been well described in the past is just how simple the method really is.  In most of the applications above, it is important to find useful features in large amounts of data.   The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together.  The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.&lt;br /&gt;&lt;br /&gt;To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred  &lt;span style="font-style: italic;"&gt;without&lt;/span&gt; the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22).  In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.&lt;br /&gt;&lt;br /&gt;Here is the table we need&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="m"&gt;&lt;span class="l"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="m"&gt;&lt;span class="l"&gt;&lt;table border="1" cellpadding="5" cellspacing="0" style="height: 150px; width: 499px;"&gt;&lt;tbody&gt;&lt;tr&gt; &lt;td&gt;&lt;br /&gt;&lt;/td&gt; &lt;td&gt;Event A&lt;/td&gt; &lt;td&gt;Everything but A&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Event B&lt;/td&gt; &lt;td&gt;A and B together (k_11)&lt;/td&gt; &lt;td&gt;B, but not A (k_12)&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td&gt;Everything but B&lt;/td&gt; &lt;td&gt;A without B (k_21)&lt;/td&gt; &lt;td&gt;Neither A nor B (k_22)&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt; &lt;/table&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="m"&gt;&lt;span class="l"&gt;&lt;br /&gt;Once you have these counts, the hard part is over.  Computing the log-likelihood ratio score (also known as G&lt;sup&gt;2&lt;/sup&gt;) is very simple,&lt;br /&gt;&lt;br /&gt;LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))&lt;br /&gt;&lt;br /&gt;where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)).  In R, this function is defined as&lt;br /&gt;&lt;br /&gt;H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0)))}&lt;br /&gt;&lt;br /&gt;The trick with adding k==0 handles the case where some element of k is zero.  Since we are multiplying by that same zero, we can drop those terms but it helps not to try to compute the log(0).  The java or C versions of this (ask me for a copy) is almost as simple, but not quite as pretty.&lt;br /&gt;&lt;br /&gt;So there you have it.  Two lines of code and the world is your oyster.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-6655800277795930385?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/6655800277795930385/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=6655800277795930385' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6655800277795930385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/6655800277795930385'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html' title='Surprise and Coincidence'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4453805095942812863.post-5247928081199539248</id><published>2008-03-21T10:07:00.000-07:00</published><updated>2008-03-21T10:10:38.395-07:00</updated><title type='text'>Why the title?</title><content type='html'>What kind of title, you may ask, is "Parallel and Uncertain".&lt;br /&gt;&lt;br /&gt;If you were to ask, I would answer, "descriptive".&lt;br /&gt;&lt;br /&gt;My main interests lately have to do with parallel processing and (automated) reasoning in the fact of uncertainty.  Thus, these blog posts will be largely about these topics.&lt;br /&gt;&lt;br /&gt;What more can I say?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4453805095942812863-5247928081199539248?l=tdunning.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tdunning.blogspot.com/feeds/5247928081199539248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4453805095942812863&amp;postID=5247928081199539248' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5247928081199539248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4453805095942812863/posts/default/5247928081199539248'/><link rel='alternate' type='text/html' href='http://tdunning.blogspot.com/2008/03/why-title.html' title='Why the title?'/><author><name>Ted Dunning ... apparently Bayesian</name><uri>http://www.blogger.com/profile/02498665124454933570</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
