Just before Christmas, the Lord Chief Justice gave his senior brothers in the legal profession a bit of a Christmas present. In a speech to legal practitioners in his homeland of Wales, he warned them of their impending redundancy, saying:
It is probably correct to say that as soon as we have better statistical information, artificial intelligence using that statistical information will be better at predicting the outcome of cases than the most learned Queen’s Counsel.
Not just any old Queen’s Counsel, but the most learned. Even, shock horror, some who perhaps practice in London might be the implication. Even, EVEN, and here my friends in the De Keyser Massive may start humming Say it ain’t so, David Pannick QC.
So, as I happened to be teaching my Future of Legal Practice Students about Quantitative Legal Prediction (they start by reading this) now seemed like a good time to look quite hard at what the research actually says about quantitative legal prediction. This is a long post as a result. You may need beverages and biscuits. I should also say, I am not a machine learning expert. I am feeling my way, learning as I go, reading and listening to the likes of Jan Van Hoecke at RAVN and Noah Waisberg at Kira. I may well have got things wrong or misdirected the reader. I would, even more than usual, warmly welcome dialogue on these subjects from those who understand them more fully and from those who, like me, want to learn. So…
Dan Katz, a leading legal tech academic and founder of a predictions business, opines that QLP will “define much of the coming innovation in the legal services industry” and, with more circumspection, that, “a nontrivial subset of tasks undertaken by lawyers is subject to automation.” As we will see, I think, those last words provide an important coda to those drinking unreflectively from the innovation KoolAid. Yet the developments are exciting too: a recent study suggests text scraping and machine learning enabled impressive ‘predictions’ of ECtHR decisions (79% accuracy) and Katz and his colleagues have this week updated their paper claiming a machine learning approach can predict US Supreme Court Cases in a robust and reliable way (getting decisions right about 70% of the time) over decades and decades. And Ruger et al, as we will see, claim machine accuracy higher than experts (perhaps we can lobby them to call their algorithm GoveBot). So there may be something in the Lord Chief’s predictions. Let’s take a look at what that something may be.
We will start with Theodore Ruger, Andrew Martin, Kevin Quinn, and Pauline Kim’s study of the Supreme Court. This is a study which is fascinating because a computer model predicted the outcome of US Supreme Court cases for a year using only six variables with good accuracy. These variables are simple things like which Circuit the case came from, and a basic coding of the subject matter of the case. The variables are the kind of variables which one would not need to be a highly qualified lawyer to code. Feed those variables into an algorithm and the algorithm predicts whether the case is likely to be unanimously decided or not, and if it is not, how each of the Justices is likely to vote. The algorithm uses simple decision trees like the one below for each judge (each judge has his or her own algorithm – paying attention to different variables and in different ways). The decision tree provides a pretty good rule of thumb as to how the particular judge will decide a case. If you answer each of these questions about a case before it was decided, the likelihood is you could predict accurately the decision of that particular judge. Simples. And no need to worry about what the law said. Or detailed facts. Or what time the judge would have their lunch.
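For readers who like to see the idea in code, a per-judge decision tree of this kind is just a short chain of questions. What follows is a minimal sketch only: the justice, the third variable (`lower_court_liberal`), the questions and their order are all invented for illustration and are not the trees Ruger et al report.

```python
from dataclasses import dataclass


@dataclass
class Case:
    circuit: int               # circuit of origin, e.g. 9 for the Ninth Circuit
    issue_area: str            # coarse subject-matter code, e.g. "civil_rights"
    lower_court_liberal: bool  # hypothetical extra variable, assumed for this sketch


def predict_justice_x(case: Case) -> str:
    """Hypothetical decision tree for one invented justice.

    Every question below is made up; the point is only that a handful
    of simple, codable variables can drive the whole prediction.
    """
    if case.lower_court_liberal:
        if case.circuit == 9:
            return "reverse"
        if case.issue_area == "civil_rights":
            return "affirm"
        return "reverse"
    if case.issue_area == "economic":
        return "reverse"
    return "affirm"


example = Case(circuit=9, issue_area="civil_rights", lower_court_liberal=True)
print(predict_justice_x(example))
```

Each justice would get a function like this of their own, asking different questions in a different order.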
Ruger and his colleagues built these decision trees having analysed just over 600 cases between 1994 and 2002; a very modest data set that produces impressive results. Having built the decision trees, they then tested them. They used their algorithms to predict the outcome of each new case appearing before the Supreme Court (using them there six variables). And they got a panel of experts (a mixture of top academics and appellate practitioners; some with experience of clerking for Supreme Court justices) to predict the outcome of cases in areas where they had expertise (they usually managed to get three experts to predict a decision). This real test enabled them to compare the judgment of the machine and the judgment of experts on 68 cases. Who would win? Machine or Expert?
In broad terms, the machine did. It predicted 75% of Supreme Court decisions accurately. The experts got it right only 59% of the time. Even where they had three experts predict the outcome of a particular case and could take the majority view of the three experts, accuracy climbed to the still not as good 66%. Only at the level of individual judicial votes did the experts do better than the machine (experts got votes right 68% of the time compared with the machine’s 67% – let’s call that a tie). The rub was that the machine was much better at predicting the moderate, swing voters; precisely the task you might think the experts would be better at.
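The difference between individual expert accuracy and the accuracy of a three-expert majority is easy to lose in prose, so here is a small sketch of the two calculations. The four cases and the expert predictions are toy data I have invented; they are not Ruger et al’s figures.

```python
from collections import Counter


def majority(predictions):
    """Most common prediction in an odd-sized panel with two possible outcomes."""
    return Counter(predictions).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Share of predictions that match the actual outcomes."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Invented outcomes and expert panels, one panel of three per case.
actual = ["reverse", "affirm", "reverse", "reverse"]
panels = [
    ["reverse", "reverse", "affirm"],
    ["affirm", "reverse", "affirm"],
    ["affirm", "affirm", "reverse"],
    ["reverse", "affirm", "reverse"],
]

# Accuracy counting every individual expert prediction separately.
individual_hits = sum(
    vote == outcome for panel, outcome in zip(panels, actual) for vote in panel
)
individual_accuracy = individual_hits / sum(len(panel) for panel in panels)

# Accuracy when each case takes the majority view of its panel.
panel_accuracy = accuracy([majority(panel) for panel in panels], actual)

print(round(individual_accuracy, 2), round(panel_accuracy, 2))
```

In this toy set the panel does better than its members taken one by one, which is the same direction of travel as the move from 59% to 66% in the study.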
Yet all is not quite what it seems. There is, at the very least, a glimmer of hope for our professional experts. If we look separately at the predictions of appellate attorneys in Ruger’s sample of experts, their accuracy of prediction was a mighty 90% plus. Because there were only 12 such appellate attorneys in the panel used for the study, Ruger et al were not able to separately test (in a statistically robust way) whether the difference between appellate attorneys and other experts was significant. Nonetheless, we should be very wary of accepting the study as proving machines are better than experts at predicting the outcome of legal cases, or more particularly the kind of experts the Lord Chief has in mind, because the kind of experts we might expect to do very well at predictions really did seem to do very well, even if Ruger et al can’t prove so with high levels of confidence.
As Ruger et al also indicate there are other limitations on their, nonetheless very impressive, experiment. In particular, their algorithms are built to reflect a stable court: the same justices sat on the court through the entire period. They know that this is important because some of their Justices’ decision trees showed inter-judicial dependencies: Judge A would be influenced in their decision by the decision of Judge B. Take away Judge B, and the decision trees would not work so well. In the time period of their experiment Judge A, B, C and so on all sat so this was not a problem.
It is at this point that Dan Katz, Michael Bommarito and Josh Blackman pick up the story. Their model uses a far greater number of variables, across more than two centuries of data, and seeks to show that it is possible to predict judicial decision making using information on cases that could be known before the case is heard with high levels of reliability and consistency. In this way, they seek to go further than Ruger et al. They show that it may very well be possible to develop an approach which can adapt to changes in the personnel on courts, the kinds of cases before them, the evolving state of the law, and so on. In this way they hope to develop a model which is, “general – that is, a model that can learn online [as more decisions are taken]” that has a, “consistent performance across time, case issues, and Justices” and has “future applicability”, that is it could be used in the real world, that is, “it should consistently outperform a baseline comparison.”
Their approach is complicated to those, like me, not expert in data science and machine learning – reading it is a great way to start to peer beneath the machine learning hood. They rely on a pre-existing political science database of Supreme Court decisions which has coded the characteristics and outcomes of Supreme Court cases on dozens and dozens of variables across more than two centuries. To simplify a bit, from that database they set about, through machine learning, developing algorithms which can predict the outcomes of a ‘training set’ of cases in the early years of the Supreme Court; testing whether the algorithms can predict the outcome of cases held back from this training set; then – through machine learning again – adapting those algorithms to take account of a training set for the next year, testing them again and moving forward. Ingeniously, then, they can build models, test those models, and move forward through a very large slice of the US Supreme Court’s history. Because the approach does not rely on simple decision trees, but random forests (or multiple variations) of decision trees, which develop and change over the life of the court, it is not possible for the authors to generalise very clearly about how the machine takes its decisions: but can they get decisions right?
They test their predictions against the actual outcomes on the cases and find about 70% accuracy overall across decades. They build a way of looking at the cases, in (say) years 1-50, and then predicting cases in year 51, and then again Year 52 and so on. The computer programme does not have the data for Year 52 until after it has made the predictions. It is, thus far, an extraordinary achievement – a way of analysing cases, from quite basic (if detailed) data, which enables them to predict the outcome of cases at many, many points in the Supreme Court’s history, with high accuracy – even though the law, judges, and nature of the cases have – I would surmise – changed beyond all recognition over that period.
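The train-on-the-past, predict-the-next-year routine can be sketched in a few lines. This is my own minimal illustration, not Katz et al’s code: the learner here (`fit_majority`) is a deliberately dumb stand-in that always predicts the most common past outcome, where they use random forests, and the three years of cases are invented.

```python
from collections import Counter


def fit_majority(training_cases):
    """Stand-in learner: always predict the most common outcome seen so far."""
    counts = Counter(outcome for _, outcome in training_cases)
    most_common = counts.most_common(1)[0][0]
    return lambda features: most_common


def walk_forward(cases_by_year, fit):
    """For each year after the first, train on all earlier years, then score that year."""
    years = sorted(cases_by_year)
    results = {}
    for i, year in enumerate(years[1:], start=1):
        # Only cases from strictly earlier years are visible to the model.
        history = [case for y in years[:i] for case in cases_by_year[y]]
        model = fit(history)
        test = cases_by_year[year]
        hits = sum(model(features) == outcome for features, outcome in test)
        results[year] = hits / len(test)
    return results


# Invented (features, outcome) pairs; features are left empty in this toy.
cases_by_year = {
    1: [({}, "affirm"), ({}, "affirm"), ({}, "reverse")],
    2: [({}, "affirm"), ({}, "reverse")],
    3: [({}, "reverse"), ({}, "reverse"), ({}, "reverse")],
}
print(walk_forward(cases_by_year, fit_majority))
```

The important design point is in the loop: the model for a given year is fitted before that year’s outcomes are looked at, so it can never peek at the answers it is being marked on.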
The next question for them is what they should compare that accuracy level against. They note for instance that if one was to predict the outcome of a Supreme Court decision now – without any algorithm or legal insight – one would predict a reversal (because about 70% of cases are now reversals, but in 1990 a reversal would be the counter-intuitive prediction because only about 25% of cases were reversals). They need a null hypothesis against which to compare their 70% accuracy. If the computer can’t beat the null hypothesis it is not that smart. They opt for the best ‘rule of thumb’ prediction rule as being a rolling average of the reversal rate for the last ten years (if I have understood it right). This is actually quite a good prediction rule (it enables an accurate prediction of cases or votes in 66-68% or so of instances – one can be pretty smart at predicting, it turns out, with data from one variable – reversal rates over time).
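Here is one reading of that rule of thumb in code, with the same caveat as above: it is my understanding of the baseline, not necessarily exactly how Katz et al implement it. The function name, the 0.5 cut-off and the toy data are my assumptions.

```python
def rolling_baseline(outcomes_by_year, target_year, window=10):
    """Predict the commoner outcome over the `window` years before `target_year`.

    Returns None when there is no history to go on.
    """
    recent = [
        outcome
        for year in range(target_year - window, target_year)
        for outcome in outcomes_by_year.get(year, [])
    ]
    if not recent:
        return None
    reversal_rate = recent.count("reverse") / len(recent)
    # Assumed cut-off: guess "reverse" when at least half of recent cases were reversals.
    return "reverse" if reversal_rate >= 0.5 else "affirm"


# Invented history: ten years in which 7 of every 10 cases were reversals.
mostly_reversed = {year: ["reverse"] * 7 + ["affirm"] * 3 for year in range(1, 11)}
# Invented history: ten years in which 1 of every 4 cases was a reversal.
mostly_affirmed = {year: ["reverse"] + ["affirm"] * 3 for year in range(1, 11)}

print(rolling_baseline(mostly_reversed, 11))
print(rolling_baseline(mostly_affirmed, 11))
```

A single variable, the recent reversal rate, does all the work here, which is exactly why it makes a demanding benchmark for anything cleverer.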
On this basis the difference between the machine learning approach and the 10 year rule of thumb approach is 5% or thereabouts. The machine learning approach gets one more case right out of twenty than guessing on the basis of the 10 year reversal rate. This is still really interesting and suggestive of the potential for predictive analytics, but it seems to be some way off the sort of system that one would base case decisions on. Perhaps more worrying for Katz et al, especially given the need to be consistent over time and for different justices, is that the model seems to work much better in the middle years of the data range, but for reasons which – as yet – remain a mystery the prediction of decisions under the Roberts Court (2005-) is poorer than the rule of thumb. You get a nice sense of this from Katz et al’s graphs in their article, especially this one which looks at the decisions of judges.
The clusters of pink indicate where the model is less successful at predicting than the ‘null model’ (the rule of thumb). Perhaps understandably, the model struggles in the early years when it has learned from less data. Also, as one of my students pointed out, Oliver Wendell Holmes was being Oliver Wendell Holmes. But the really interesting thing is that the model is not working so well in recent years either. One thought is that this is something to do with the ‘newness’ of the Roberts Court. It is anyone’s guess here as to what is going on, but one thought I had is that the variables in the Supreme Court database might no longer be as fit for purpose as they once were. Remember the data is based on a political science dataset, where cases are coded retrospectively and where, I am assuming, the major intellectual legwork of thinking what kinds of data should be coded was probably generally done some time ago. The analytical framework it provides may be degrading in utility. The judges may be deciding new kinds of cases or in ways which somehow escape the variables used to date.
This leads me onto the third, and last, study I want to concentrate on. This study does not rely on the same kind of coding of cases. Instead of using someone else’s variables it seeks to predict cases based on the text associated with those cases. It is by Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro and Vasileios Lampos. This study sought to employ Natural Language Processing and Machine Learning to analyse decisions of the European Court of Human Rights and see if there were patterns of text in those cases which helped predict whether the Court would find, or not find, a violation of the Convention. What did they do? I am going to try and speed up a bit here for readers who have patiently made it this far, and hope I do not do the study a disservice in the meantime.
In broad terms, they took a sample of cases from the European Court of Human Rights related to Articles 3, 6, and 8 of the Convention. This is because they get most data on cases dealing with potential violations in these areas. They only use cases in English and they randomly selected equal numbers of cases of violation and non-violation for each of the three articles. This creates a fairly clear null hypothesis within their dataset: there is a 50:50 chance of being right if you pick a case at random from the database.
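Balancing a dataset like this is a simple sampling step. The sketch below is my own illustration of the idea, not the authors’ code; the `label` field, the fixed seed and the ten toy cases are assumptions.

```python
import random


def balanced_sample(cases, seed=0):
    """Randomly keep the same number of cases for every outcome label."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_label = {}
    for case in cases:
        by_label.setdefault(case["label"], []).append(case)
    # The rarest label sets how many cases each label may contribute.
    n = min(len(group) for group in by_label.values())
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, n))
    rng.shuffle(sample)
    return sample


# Invented cases: seven violations and three non-violations.
cases = [{"id": i, "label": "violation"} for i in range(7)]
cases += [{"id": i, "label": "no_violation"} for i in range(7, 10)]

sample = balanced_sample(cases)
print(len(sample))
```

Once the two outcomes are equally common, always guessing one of them is right half the time, which is what makes 50% the natural benchmark.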
After some cleaning up of the data, they use machine learning and natural language processing techniques to count the frequency of words, word sequences, and clusters of words/word sequences that appear to be semantically similar (the machine learning approach seems to work on the assumption that similar words appear in similar contexts, and defines semantically similar in this way). They do this analysis across different sections of the cases (Procedure, Circumstances, Facts, Relevant Law) and calculate some amalgams of these categories (such as one which looks at the frequency of words and clusters across the full case, law and facts together). That analytical task on one level is as basic as identifying the 2,000 most common words, and short word sequences, but the clustering of topics is more complicated. Then – in the broadest, somewhat simplified terms – they look for relationships between the words, word sequences, and clusters of words that predict the outcome of a sample of the cases, a learning set, build a model of those cases, and test it on the remaining cases to give an indication of the predictive power of the model. It is very similar, in some ways I think, to the way spam filters work.
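To make the spam filter comparison concrete, here is a minimal sketch of counting words and two-word sequences and using those counts to classify a new text. It is not the study’s pipeline: I have swapped in a naive Bayes classifier (the classic spam filter technique) because it fits in a few lines of plain Python, I have left out the topic clustering entirely, and the four training snippets are invented.

```python
import math
import re
from collections import Counter


def ngrams(text, max_n=2):
    """Lower-cased words and two-word sequences from a text."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def train(docs, vocab_size=2000):
    """docs is a list of (text, label) pairs; returns a naive Bayes model."""
    totals = Counter(g for text, _ in docs for g in ngrams(text))
    # Keep only the most frequent n-grams as the vocabulary.
    vocab = {g for g, _ in totals.most_common(vocab_size)}
    counts, doc_counts = {}, Counter()
    for text, label in docs:
        doc_counts[label] += 1
        counts.setdefault(label, Counter()).update(
            g for g in ngrams(text) if g in vocab
        )
    return vocab, counts, doc_counts


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    vocab, counts, doc_counts = model
    grams = [g for g in ngrams(text) if g in vocab]
    scores = {}
    for label, label_counts in counts.items():
        total = sum(label_counts.values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for g in grams:
            # Add-one smoothing so an unseen n-gram does not zero out a label.
            score += math.log((label_counts[g] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training snippets, not text from real judgments.
docs = [
    ("the applicant was detained without access to a lawyer", "violation"),
    ("the applicant was held in detention in poor conditions", "violation"),
    ("the domestic courts examined the complaint thoroughly", "no_violation"),
    ("the domestic courts gave reasons for the decision", "no_violation"),
]
model = train(docs)
print(predict(model, "the applicant was detained in poor conditions"))
```

Nothing in this knows any law. It only knows which words and word pairs tended to turn up alongside which outcome in the learning set.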
The predictions from this modelling are 79% accurate. A good deal better than Ruger’s ordinary experts, and not very far from his best experts. In numerical terms that is a little bit better than Katz’s model (but looking at more similar kinds of cases and over a narrower period of time).
There are some wrinkles in the study design. In particular, the textual analysis of ‘facts’ and ‘outcomes’ cannot be kept very separate. It would be unsurprising if the judges described facts a certain way when they were going to find a violation. The algorithms might be picking up a shift in judicial tone, rather than being able to pick up whether certain fact patterns are likely to lead to certain outcomes. As the researchers make clear, it would be better to test the analysis of facts from documents other than the judgements themselves (such as the appeal pleadings and submissions – something which has been done on US Supreme Court Briefs but in a different context). Nonetheless it is a study which suggests the power of machine learning to glean information from legal documents, rather than read data from the kind of database Katz et al were using. Perhaps this suggests a way for Katz et al’s general model to develop? It would be interesting to know whether certain word clusters predict which cases Dan Katz’s model gets wrong, for instance. And if they can find a way of the machine learning new variables over time, from judgments say, well…
As things stand, then, we seem some way off, but also quite close to, artificial intelligence being better at predicting the outcome of cases than the most learned Queen’s Counsel. No one gets close to Ruger et al’s 90%+ performance of the appellate attorneys. We don’t know if that is the fairest benchmark, but we certainly need to hold the possibility that it is in mind. The likelihood is, as Aletras et al note, that machine learning will be used in the lower reaches of the legal system. If I was a personal injury firm, or an insurer, for instance, I would be thinking about engaging with these kinds of projects to improve my risk analysis by scraping data from files on which cases are accepted under no win no fee agreements.
And even then predictions contain only modest information. If you tried to look at Katz et al’s code (you can btw: it is on Github) you’d be hard pushed to find an explanation as to why the case was likely to win that was useful to you as a lawyer or a client (although this might be less so if you were looking at baskets of cases, like say an insurance company). Similarly, if you go and read Aletras et al’s paper, you will see clusters of words which seem to predict judicial outcomes, but they are not to my mind very informative of what is going on. The problem of opacity of algorithms is a big one, sometimes inhibiting their legitimacy, or raising regulatory issues. That said, utility will be what makes or breaks them.
And whilst we should worry about opacity, and power, and systematic malfunctions, and the like, we should perhaps not think they are new problems. After all, we think we know what influences judges (the law and facts), and we try and work out what persuasive arguments are, and we know what judges say influences them, but if we compare what we think with what Ruger et al found (six variables!), or Roger Hood’s work on sentencing (judges say they do not discriminate but…), things start to look simpler, or the explanations of judges look more unstable or tenuous. If our complex, fact and law-based thinking turns out to be poorer than forests of difficult to understand decision trees at predicting winners, is human or quantitative thinking more opaque in these circumstances? That is not all that matters, but it is very important.