Saturday, July 5, 2008

Death penalty and data modeling

This article just appeared in the "International Journal of Law and Information Technology". It purports to show that the death penalty is arbitrary because whether or not somebody who is sentenced to death will actually be executed can be predicted without considering any substantive characteristic of the crime in question. If true, the conclusion certainly would indicate that the death penalty really is applied based more on who was sentenced than on what they did.

Unfortunately, there are a number of defects in the analysis given in that paper that make me somewhat leery of taking the results at face value. Some of these problems reflect insufficient detail in the article; others indicate a lack of rigor in the data analysis itself.

First, the authors include without definition the following three variables:

7. Third most serious capital offence
8. Second most serious capital offence
9. First most serious capital offence

These variables sound as though they relate to the capital crime for which the prisoner was sentenced, but without definitions we cannot judge. Moreover, without knowing how these variables were encoded, we cannot tell whether the model was able to make use of this input or not.

Secondly, the authors include the following four time-based variables:

15. Month of conviction for capital offense
16. Year of conviction for capital offense
17. Month of sentence for capital offense
18. Year of sentence for capital offense

I understand what these variables represent, but I am curious how the authors encoded them. Normally in a model like this, time would be encoded as a single continuous variable measured from some relatively recent epoch. Splitting a time variable into separate month and year fields like this often leads to problems if you are looking for recency effects. The other common use of a separate month variable is to look for seasonality effects, but that would typically be done with a 1-of-(n-1) encoding, which the authors clearly did not use given how few inputs they have.

While the encoding of these variables doesn't really bear much on the validity of the results, it does make it unlikely that the model could have used these variables for the purpose intended.
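
To make the distinction concrete, here is a quick sketch of the two encodings I have in mind. The variable names and rows are made up for illustration; nothing here comes from the paper's data.

import pandas as pd

# Made-up rows standing in for variables like 15-18 in the paper.
df = pd.DataFrame({
    "conviction_month": [3, 11, 7],
    "conviction_year": [1984, 1991, 1979],
})

# Recency: a single continuous variable, months elapsed since a recent epoch.
EPOCH_YEAR = 1972
df["conviction_time"] = (df["conviction_year"] - EPOCH_YEAR) * 12 + df["conviction_month"]

# Seasonality: 1-of-(n-1) dummies for month (11 indicator columns, one month dropped).
months = pd.Categorical(df["conviction_month"], categories=range(1, 13))
month_dummies = pd.get_dummies(months, prefix="conv_month", drop_first=True)
df = pd.concat([df, month_dummies], axis=1)
print(df.head())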

Thirdly, the issue of encoding also comes up with the geographical variable:

2. State

Encoded as a single variable, this is almost certainly an integer code, which is a very poor encoding of state (a 1-of-50 encoding would be much better; several variables at different geographic resolutions would be better still).
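
For what it's worth, the better encoding is nearly a one-liner in most toolkits. Here is a sketch with made-up data, not the authors' column names:

import pandas as pd

df = pd.DataFrame({"state": ["TX", "FL", "TX", "VA"]})

# Integer code: imposes a meaningless ordering (FL < TX < VA) on the model.
df["state_code"] = df["state"].astype("category").cat.codes

# 1-of-n indicator columns: no artificial ordering for the model to misread.
df = pd.concat([df, pd.get_dummies(df["state"], prefix="state")], axis=1)
print(df)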

Regarding the learning results themselves, the authors apparently did not assess which input variables were responsible for their results. They should have used one of the standard techniques such as step-wise variable selection or input randomization. They should also have measured how much the non-linear classifier they used actually mattered to the results. Typically, this would be done by comparing against a linear classifier such as a ridge-regularized logistic regression. With the encodings these authors used, such a straightforward comparison is probably not viable.
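
To illustrate what I mean by input randomization and a linear baseline, here is a rough sketch on synthetic data using scikit-learn. It is not a reconstruction of the authors' model, just the kind of check I would expect to see:

from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paper's inputs and the executed / not-executed outcome.
X, y = make_classification(n_samples=1000, n_features=18, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Ridge-penalized (L2) logistic regression as the linear baseline.
linear = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_train, y_train)
print("linear baseline accuracy:", linear.score(X_test, y_test))

# Input randomization: permute one input at a time on held-out data and record
# how much the accuracy drops; big drops mark the inputs doing the work.
result = permutation_importance(linear, X_test, y_test, n_repeats=20, random_state=0)
for i, drop in enumerate(result.importances_mean):
    print(f"input {i:2d}: accuracy drop {drop:.3f}")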

The authors also do not appear to have done anything to determine whether there is a target leak in the data. This can occur when there is some apparently independent input which is accidentally highly correlated with the desired output. In this case, it is possible that one state is responsible for a large majority of the executions. This might make the output more predictable without providing much ammunition for the original argument.
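
Checking for this kind of leak doesn't require any modeling at all; a simple cross-tabulation of each input against the outcome is enough to raise the flag. A sketch with made-up data:

import pandas as pd

# Made-up data: does one value of 'state' account for most of the executions?
df = pd.DataFrame({
    "state":    ["TX", "TX", "FL", "VA", "TX", "GA", "TX", "FL"],
    "executed": [1,    1,    0,    0,    1,    0,    1,    0],
})

# Share of each outcome coming from each state; one state dominating the
# executed column is a red flag for a target leak.
print(pd.crosstab(df["state"], df["executed"], normalize="columns"))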

Finally, the authors make no reference in their article to the public availability of their data. Without the data being easily available, and given how incredibly easy it is for novices to screw up a modeling effort, these results should be considered no more than slightly provocative.

Thursday, July 3, 2008

Why the long tail isn't as long as expected

Several bloggers and authors are finding out (belatedly relative to people involved in the industry) that the economic returns in systems that are supposedly governed by long-tail distributions are unexpectedly concentrated in the highly popular items. See here for a blogger's commentary on this article.

In practice, the long tail model does predict certain kinds of consumption very well. I speak from experience analyzing view data at Veoh where, except for the very few top titles, a unit power law described the number of views pretty well. This means that the number of views coming from sparsely watched videos is surprisingly large, adding to the woes of anybody trying to review submissions.
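
By a unit power law I mean that the title at popularity rank i gets views roughly proportional to 1/i. A couple of lines with invented numbers (not Veoh's actual data) show why the tail adds up:

import numpy as np

ranks = np.arange(1, 100_001)
views = 1_000_000 / ranks          # unit power law: views at rank i proportional to 1/i

top_100 = views[:100].sum()
the_rest = views[100:].sum()
print(f"top 100 titles: {top_100:,.0f} views; everything else: {the_rest:,.0f} views")
# The sparsely watched tail collectively out-draws the top 100 titles.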

Economically speaking, though, you have to factor in a few kinds of friction. These are reasonably modeled as per-unit-sale costs and per-stock-unit costs. The per-unit-sale costs don't affect the distribution of revenue and profit, but the per-stock-unit costs definitely do. If you draw the classic Zipf curve and assume that revenue is proportional to consumption for all items, then a per-stock-unit cost offsets the curve vertically. The resulting profit curve tells you pretty much immediately how things will fall out.
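
A toy calculation makes the point. Assuming Zipf-distributed revenue and a flat cost per stocked item (the numbers are invented for illustration, not anyone's real cost structure), the break-even rank falls straight out:

import numpy as np

ranks = np.arange(1, 100_001)
revenue = 1_000_000 / ranks        # revenue proportional to Zipf-distributed consumption
stocking_cost = 50.0               # flat per-stock-unit cost for every item carried

profit = revenue - stocking_cost   # subtracting a constant shifts the curve down
breakeven_rank = ranks[profit > 0][-1]
print("deepest profitable rank:", breakeven_rank)
# Halve the stocking cost and the break-even rank doubles.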

The graph to the right illustrates this. The solid line is the classic long-tail distribution and the dashed lines represent different per-stock-unit costs. Wherever the solid line is above a dashed line, the model implies positive returns; where it is below, it implies a loss. Clearly, the lower the per-stock-unit cost, the deeper into the distribution you can go and still make money.

If we assume that you will only be stocking items that have non-negative return, then the percentage of profit that comes from a particular part of the distribution will vary with the threshold. For high thresholds, almost all of the profit will be from mega-hits but for very low thresholds, a much larger percentage will come from low consumption items.

This situation is shown in the second graph, which plots the percentage of total profit achieved for different thresholds and ranks. The concentration of profit in the low-rank items for high thresholds is quite clear.
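
The same toy model reproduces the effect: for each stocking cost, keep only the items worth stocking and ask what share of the profit the top 100 items contribute (again, invented numbers):

import numpy as np

ranks = np.arange(1, 100_001)
revenue = 1_000_000 / ranks

for stocking_cost in (1.0, 10.0, 100.0, 1000.0):
    profit = revenue - stocking_cost
    stocked = profit[profit > 0]            # keep only items worth stocking
    top_100_share = stocked[:100].sum() / stocked.sum()
    print(f"stocking cost {stocking_cost:7.1f}: top-100 share of profit {top_100_share:.0%}")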

A good example of a per-stock-unit cost arises in p2p download schemes. The first few downloads of each item have to be paid for by the original source, while additional downloads make use of sunk network resources that impose minimal marginal cost back on the source. For systems like BitTorrent, you need more than a hundred downloads before you reach high p2p efficiency, which makes the system useful for mainstream content and less useful for body and tail content. The Veoh p2p system has a much lower threshold and is thus useful far down into the tail. Either system, though, leads to an economic return different from the theoretical zero-cost long-tail model.
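
To put very rough numbers on this (invented, and not BitTorrent's or Veoh's actual cost structure): if the source pays full bandwidth for roughly the first batch of downloads of an item and next to nothing afterwards, the per-item cost to the source looks something like this:

def source_cost(downloads, seed_threshold, cost_per_download, marginal_cost=0.001):
    """Rough bandwidth cost borne by the original source for one item."""
    seeded = min(downloads, seed_threshold)            # served directly by the source
    offloaded = max(0, downloads - seed_threshold)     # served by the p2p swarm
    return seeded * cost_per_download + offloaded * marginal_cost

# With a high seeding threshold, tail items (few downloads) cost the source nearly
# full price per download, while popular items amortize the cost across the swarm.
for downloads in (10, 100, 10_000):
    cost = source_cost(downloads, seed_threshold=100, cost_per_download=0.05)
    print(f"{downloads:6d} downloads -> source pays {cost:.2f} total, "
          f"{cost / downloads:.4f} per download")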

None of these observations is earth-shaking and, as far as I know, they are all common wisdom among anybody trying to make money out of the long tail. Why this is news to the Harvard Business Review is the real mystery to me.