Saturday, July 5, 2008

Death penalty and data modeling

This article just appeared in the "International Journal of Law and Information Technology". It purports to show that the death penalty is arbitrary because whether or not somebody who is sentenced to death is actually executed can be predicted without considering any substantive characteristic of the crime in question. If true, the conclusion certainly would indicate that the death penalty really is applied based more on who was sentenced than on what they did.

Unfortunately, there are a number of defects in the analysis given in that paper that make me somewhat leery of taking the results at face value. Some of these problems represent insufficient detail in the article; others indicate a lack of rigor in the data analysis itself.

First, the authors include without definition the following three variables:

7. Third most serious capital offence
8. Second most serious capital offence
9. First most serious capital offence

These variables sound related to the capital crime for which the prisoner was sentenced. Without definition, we cannot judge. Moreover, without knowing how these variables were encoded, we cannot tell whether the model was able to use this input or not.

Secondly, the authors include the following four time-based variables:

15. Month of conviction for capital offense
16. Year of conviction for capital offense
17. Month of sentence for capital offense
18. Year of sentence for capital offense

I understand what these variables represent, but am curious how the authors encoded them. Normally in a model like this, time would be encoded as a continuous linear variable measured from some relatively recent epoch. Separating a time variable into month and year like this often leads to problems if you are looking for recency effects. The other common use of separated month variables is to look for seasonality effects. Typically, though, this would be done using 1 of (n-1) encoding, which the authors clearly did not use given how few inputs they have.

While the encoding of these variables doesn't really bear much on the validity of the results, it does make it unlikely that the model could have used these variables for the purpose intended.
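
To make the two encodings mentioned above concrete, here is a minimal sketch in Python. The function names and the 1970 epoch are my own assumptions for illustration; nothing here is taken from the paper.

```python
def months_since_epoch(year, month, epoch_year=1970):
    """Continuous linear time: whole months elapsed since January of epoch_year.

    epoch_year=1970 is an arbitrary illustrative choice, not from the paper.
    """
    return (year - epoch_year) * 12 + (month - 1)


def month_dummies(month):
    """1 of (n-1) encoding for month: 11 indicator inputs.

    December is the reference level and is encoded as all zeros.
    """
    return [1 if month == m else 0 for m in range(1, 12)]


# A conviction in March 1985 becomes one recency input plus 11 seasonal inputs.
recency = months_since_epoch(1985, 3)
seasonal = month_dummies(3)
```

The point of the sketch is the input count: month alone needs 11 inputs under this encoding, which is far more than a single "month" variable.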

Thirdly, the issue of encoding also comes up with the geographical variable:

2. State

Encoded as a single variable, this is almost certainly an integer code. This is a very poor encoding of state (1 of 50 encoding would be much better, several variables with different resolution would be even better).
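
As a rough sketch of what I mean, the following Python encodes state as indicator variables at two resolutions. The four-state list and the region mapping are a hypothetical subset chosen for illustration; a real model would list all 50 states, and the grouping into regions is my assumption, not something from the paper.

```python
# Illustrative subset only; a real encoding would list all 50 states.
STATES = ["CA", "NY", "OH", "TX"]

# Hypothetical coarser grouping, used here as a second resolution.
REGION_OF = {"CA": "West", "NY": "Northeast", "OH": "Midwest", "TX": "South"}
REGIONS = sorted(set(REGION_OF.values()))


def one_hot(value, categories):
    """Return one indicator per category, 1 where value matches."""
    return [1 if value == c else 0 for c in categories]


def encode_state(state):
    """Fine-grained state indicators followed by coarse region indicators."""
    return one_hot(state, STATES) + one_hot(REGION_OF[state], REGIONS)


texas = encode_state("TX")
```

With an integer code, the model is invited to treat state 12 as "between" states 11 and 13, which means nothing. Indicator variables avoid that, and the coarser region inputs give the model something to work with for states that have few cases.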

Regarding the learning results themselves, the authors apparently did not assess which input variables were responsible for their results. They should have used any of the standard techniques such as step-wise variable selection or input randomization. They should also have measured how much the non-linear classifier that they used actually contributed to the results. Typically, this would be done by comparing against the results obtained from a linear classifier such as a ridge-regularized logistic regression. With the encodings in use by these authors, such a straightforward comparison is probably not viable.
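
Input randomization is simple enough to sketch in a few lines of Python. This is a minimal illustration under my own assumptions: the model is any function from a row to a predicted label, the rows are plain lists, and the toy data and toy model are invented for the example. It is not the authors' procedure.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy when one input column is shuffled across rows.

    Shuffling breaks any link between that column and the label while
    keeping the column's overall distribution unchanged.
    """
    rng = random.Random(seed)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + [v] + row[column + 1:]
                for row, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


# Toy data: the label equals column 0, and column 1 is noise.
toy_rows = [[0, 5], [1, 3], [0, 7], [1, 1], [0, 2], [1, 9]]
toy_labels = [0, 1, 0, 1, 0, 1]


def toy_model(row):
    """Stand-in for a trained classifier; it only looks at column 0."""
    return row[0]


noise_importance = permutation_importance(toy_model, toy_rows, toy_labels, column=1)
signal_importance = permutation_importance(toy_model, toy_rows, toy_labels, column=0)
```

An input the model ignores shows an importance of exactly zero, while an input the model depends on shows a drop in accuracy. Running this over every input would show which variables actually drive the predictions.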

The authors also do not appear to have done anything to determine whether there is a target leak in the data. This can occur when there is some apparently independent input which is accidentally highly correlated with the desired output. In this case, it is possible that one state is responsible for a large majority of the executions. This might make the output more predictable without providing much ammunition for the original argument.
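
A first check for this kind of leak is cheap. The sketch below, in Python, computes each state's share of all executions and the accuracy of a rule that looks at nothing but state. The record layout of (state, executed) pairs and the toy records are hypothetical; they are not drawn from the actual data set.

```python
from collections import Counter


def execution_share_by_state(records):
    """Map each state to its share of all executions.

    records is an iterable of (state, executed) pairs.
    """
    executed_counts = Counter(state for state, executed in records if executed)
    total = sum(executed_counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in executed_counts.items()}


def state_only_accuracy(records, state):
    """Accuracy of the rule: predict 'executed' exactly when the state matches."""
    hits = sum(1 for s, executed in records if (s == state) == executed)
    return hits / len(records)


# Invented toy records: state "A" accounts for most executions.
toy_records = [("A", True), ("A", True), ("A", True),
               ("B", False), ("B", True), ("C", False)]
shares = execution_share_by_state(toy_records)
baseline = state_only_accuracy(toy_records, "A")
```

If a one-variable rule like this gets close to the accuracy reported for the full model, then the result says more about where prisoners are held than about any arbitrariness in how they are selected.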

Finally, the authors make no reference in their article to the public availability of their data. Without the data being easily available, and given how incredibly easy it is for novices to screw up a modeling effort, these results should be considered no more than slightly provocative.

3 comments:

Eduardo said...

Data is available from the ICPSR:

http://www.icpsr.umich.edu/cgi-bin/bob/newark?study=3667

madmetrics said...

I also find it somewhat odd that they did not include the results of at least one preliminary simple model (i.e. a logit, probit, etc.). Such results would go a long way towards boosting confidence in their results, especially with the use of such an opaque method for their primary model.

Ted Dunning ... apparently Bayesian said...

The good news is that, as Eduardo says, the data is available.