Lawyers learning about prediction

Just before Christmas, the Lord Chief Justice gave his senior brothers in the legal profession a bit of a Christmas present. In a speech to legal practitioners in his homeland of Wales, he warned them of their impending redundancy, saying:

It is probably correct to say that as soon as we have better statistical information, artificial intelligence using that statistical information will be better at predicting the outcome of cases than the most learned Queen’s Counsel.

Not just any old Queen’s Counsel, but the most learned. Even, shock horror, some who perhaps practise in London, might be the implication. Even, EVEN, and here my friends in the De Keyser Massive may start humming Say it ain’t so, David Pannick QC.

So, as I happened to be teaching my Future of Legal Practice students about Quantitative Legal Prediction, or QLP (they start by reading this), now seemed like a good time to look quite hard at what the research actually says about quantitative legal prediction. This is a long post as a result. You may need beverages and biscuits. I should also say, I am not a machine learning expert. I am feeling my way, learning as I go, reading and listening to the likes of Jan Van Hoecke at RAVN and Noah Waisberg at Kira. I may well have got things wrong or misdirected the reader. I would, even more than usual, warmly welcome dialogue on these subjects from those who understand them more fully and from those who, like me, want to learn. So…

Dan Katz, a leading legal tech academic and founder of a predictions business, opines that QLP will “define much of the coming innovation in the legal services industry” and, with more circumspection, that, “a nontrivial subset of tasks undertaken by lawyers is subject to automation.” As we will see, I think, those last words provide an important caveat for those drinking unreflectively from the innovation Kool-Aid. Yet the developments are exciting too: a recent study suggests text scraping and machine learning enabled impressive ‘predictions’ of ECtHR decisions (79% accuracy), and Katz and his colleagues have this week updated their paper claiming a machine learning approach can predict US Supreme Court cases in a robust and reliable way (getting decisions right about 70% of the time) over decades and decades. And Ruger et al, as we will see, claim machine accuracy higher than experts (perhaps we can lobby them to call their algorithm GoveBot). So there may be something in the Lord Chief’s predictions. Let’s take a look at what that something may be.

We will start with Theodore Ruger, Andrew Martin, Kevin Quinn, and Pauline Kim’s study of the Supreme Court. The study is fascinating because a computer model predicted the outcome of US Supreme Court cases for a year, with good accuracy, using only six variables. These variables are simple things like which Circuit the case came from and a basic coding of the subject matter of the case – the kind of variables one would not need to be a highly qualified lawyer to code. Feed those variables into an algorithm and the algorithm predicts whether the case is likely to be unanimously decided or not, and if it is not, how each of the Justices is likely to vote. The algorithm uses a simple decision tree for each judge (each judge has his or her own algorithm – paying attention to different variables and in different ways). The decision tree provides a pretty good rule of thumb as to how the particular judge will decide a case. If you answer each of the tree’s questions about a case before it is decided, the likelihood is you could accurately predict the decision of that particular judge. Simples. And no need to worry about what the law said. Or detailed facts. Or what time the judge would have their lunch.
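To make that concrete, here is a minimal sketch of a per-judge decision tree of the sort Ruger et al describe, written in Python with scikit-learn. The variable names, codings and toy data are my own illustrative assumptions, not the study’s actual coding scheme; the point is simply that a handful of crude, pre-hearing variables and a shallow tree yield a readable rule of thumb for one Justice’s vote.

```python
# Illustrative sketch only: hypothetical variable names and made-up codings,
# not Ruger et al's actual data. One shallow decision tree per Justice.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

cases = pd.DataFrame({
    "circuit_of_origin":     [9, 5, 2, 9, 11, 4],   # which lower court the case came from
    "issue_area":            [1, 3, 2, 1, 4, 3],    # crude subject-matter code
    "petitioner_type":       [0, 1, 2, 0, 1, 2],    # e.g. individual / business / government
    "respondent_type":       [1, 0, 2, 2, 0, 1],
    "lower_court_direction": [0, 1, 1, 0, 1, 0],    # 'liberal' or 'conservative' ruling below
    "constitutional_claim":  [1, 0, 1, 1, 0, 0],
})
justice_votes = [1, 0, 1, 1, 0, 0]                  # 1 = this Justice votes to reverse, 0 = affirm

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(cases, justice_votes)
print(export_text(tree, feature_names=list(cases.columns)))  # the judge's "rule of thumb"
print(tree.predict(cases.iloc[[0]]))                         # predicted vote on a new case
```

The `export_text` call prints the fitted tree as a nest of if/else questions – roughly the form of the rule of thumb described above.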

Ruger and his colleagues built these decision trees having analysed just over 600 cases between 1994 and 2002; a very modest data set that produces impressive results. Having built the decision trees, they then tested them. They used their algorithms to predict the outcome of each new case appearing before the Supreme Court (using them there six variables). And they got a panel of experts (a mixture of top academics and appellate practitioners; some with experience of clerking for Supreme Court justices) to predict the outcome of cases in areas where they had expertise (they usually managed to get three experts to predict a decision). This real test enabled them to compare the judgment of the machine and the judgment of experts on 68 cases. Who would win? Machine or Expert?

In broad terms, the machine did. It predicted 75% of Supreme Court decisions accurately. The experts got it right only 59% of the time. Even where three experts predicted the outcome of a particular case and the majority view was taken, accuracy climbed only to 66% – still short of the machine. Only at the level of individual judicial votes did the experts do better than the machine (experts got votes right 68% of the time compared with the machine’s 67% – let’s call that a tie). The rub was that the machine was much better at predicting the moderate, swing voters; precisely the task you might think the experts would be better at.

Yet all is not quite what it seems. There is, at the very least, a glimmer of hope for our professional experts. If we look separately at the predictions of the appellate attorneys in Ruger’s sample of experts, their accuracy of prediction was a mighty 90% plus. Because there were only 12 such appellate attorneys on the panel used for the study, Ruger et al were not able to test separately (in a statistically robust way) whether the difference between appellate attorneys and other experts was significant. Nonetheless, we should be very wary of accepting the study as proving machines are better than experts at predicting the outcome of legal cases – or, more particularly, better than the kind of experts the Lord Chief has in mind – because the kind of experts we might expect to do very well at predictions really did seem to do very well, even if Ruger et al can’t prove so with high levels of confidence.

As Ruger et al also indicate, there are other limitations on their, nonetheless very impressive, experiment. In particular, their algorithms are built to reflect a stable court: the same justices sat on the court through the entire period. They know that this is important because some of the Justices’ decision trees showed inter-judicial dependencies: Judge A would be influenced in their decision by the decision of Judge B. Take away Judge B, and the decision trees would not work so well. In the time period of their experiment Judges A, B, C and so on all sat, so this was not a problem.

It is at this point that Dan Katz, Michael Bommarito and Josh Blackman pick up the story. Their model uses a far greater number of variables, across more than two centuries of data, and seeks to show that judicial decision making can be predicted, with high levels of reliability and consistency, using information on cases that could be known before the case is heard. In this way, they seek to go further than Ruger et al. They show that it may very well be possible to develop an approach which can adapt to changes in the personnel on courts, the kinds of cases before them, the evolving state of the law, and so on. In this way they hope to develop a model which is “general – that is, a model that can learn online [as more decisions are taken]”, which has “consistent performance across time, case issues, and Justices”, and which has “future applicability” – that is, it could be used in the real world, in that “it should consistently outperform a baseline comparison.”

Their approach is complicated to those, like me, not expert in data science and machine learning – reading it is a great way to start to peer beneath the machine learning hood. They rely on a pre-existing political science database of Supreme Court decisions which has coded the characteristics and outcomes of Supreme Court cases on dozens and dozens of variables across more than two centuries. To simplify a bit, from that database they set about, through machine learning, developing algorithms which can predict the outcomes of a ‘training set’ of cases in the early years of the Supreme Court; testing whether the algorithms can predict the outcome of cases held back from this training set; then – through machine learning again – adapting those algorithms to take account of a training set for the next year, testing them again and moving forward. Ingeniously, then, they can build models, test those models, and move forward through a very large slice of the US Supreme Court’s history. Because the approach does not rely on simple decision trees, but on random forests (large ensembles of varied decision trees) which develop and change over the life of the court, it is not possible for the authors to generalise very clearly about how the machine takes its decisions: but can they get decisions right?
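For readers who want to see the shape of that procedure – rather than Katz et al’s actual code, which is on GitHub – here is a rough sketch of a growing-window evaluation in Python: train an ensemble on every case decided before year T, predict the cases decided in year T, and roll forward. The column names (“year”, “reversed”, the feature columns) are placeholders of my own, not the Supreme Court Database’s variable names, and a stock random forest stands in for their more elaborate ensemble.

```python
# Rough sketch of a growing-window, out-of-time evaluation. Placeholder column
# names; a stock random forest stands in for Katz et al's ensemble.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rolling_accuracy(df, feature_cols):
    """Train on all years before each year, predict that year, record accuracy."""
    scores = {}
    for year in sorted(df["year"].unique())[1:]:
        train = df[df["year"] < year]          # only information available beforehand
        test = df[df["year"] == year]
        model = RandomForestClassifier(n_estimators=300, random_state=0)
        model.fit(train[feature_cols], train["reversed"])
        scores[year] = model.score(test[feature_cols], test["reversed"])
    return pd.Series(scores, name="out_of_time_accuracy")
```

Averaging that accuracy series over the whole period is, in spirit, where the headline figure discussed next comes from.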

They test their predictions against the actual outcomes of the cases and find about 70% accuracy overall across decades. They build a way of looking at the cases in (say) years 1–50, then predicting cases in year 51, then again year 52, and so on. The computer programme does not have the data for year 52 until after it has made the predictions. It is, thus far, an extraordinary achievement – a way of analysing cases, from quite basic (if detailed) data, which enables them to predict the outcome of cases at many, many points in the Supreme Court’s history, with high accuracy – even though the law, the judges, and the nature of the cases have – I would surmise – changed beyond all recognition over that period.

The next question for them is what to compare that accuracy level against. They note, for instance, that if one was to predict the outcome of a Supreme Court decision now – without any algorithm or legal insight – one would predict a reversal (because about 70% of cases are now reversals; but in 1990 a reversal would have been the counter-intuitive prediction because only about 25% of cases were reversals). They need a null hypothesis against which to compare their 70% accuracy. If the computer can’t beat the null hypothesis it is not that smart. They opt for the best ‘rule of thumb’ prediction rule as being a rolling average of the reversal rate for the last ten years (if I have understood it right). This is actually quite a good prediction rule (it enables an accurate prediction of cases or votes in 66–68% or so of instances – one can be pretty smart at predicting, it turns out, with data on one variable: reversal rates over time).
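If I have understood the baseline correctly, it can be sketched in a few lines: look at the reversal rate over the previous ten years and always predict the more common outcome. Again the column names are placeholders of my own, not the dataset’s.

```python
# Sketch of the ten-year "rule of thumb" baseline: predict whichever outcome
# (reverse or affirm) was more common over the previous ten years.
import pandas as pd

def ten_year_baseline_accuracy(df):
    yearly_rate = df.groupby("year")["reversed"].mean()
    # trailing ten-year reversal rate, excluding the year being predicted
    trailing = yearly_rate.rolling(10, min_periods=1).mean().shift(1)
    predict_reverse = df["year"].map(trailing >= 0.5)   # True = predict a reversal
    return (predict_reverse.astype(int) == df["reversed"]).mean()
```

Run over the real outcome data, a rule of this sort is what gives the 66–68% figure the model has to beat.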

On this basis the difference between the machine learning approach and the ten-year rule of thumb is 5% or thereabouts. The machine learning approach gets one more case right out of twenty than guessing on the basis of the ten-year reversal rate. This is still really interesting and suggestive of the potential for predictive analytics, but it seems some way off the sort of system one would base case decisions on. Perhaps more worrying for Katz et al, especially given the need to be consistent over time and across different justices, is that the model seems to work much better in the middle years of the data range, while for reasons which – as yet – remain a mystery its prediction of decisions under the Roberts Court (2005–) is poorer than the rule of thumb. You get a nice sense of this from Katz et al’s graphs in their article, especially the one which looks at the decisions of individual judges.

The clusters of pink in that graph indicate where the model is less successful at predicting than the ‘null model’, the rule of thumb. Perhaps understandably, the model struggles in the early years, when it has less data to learn from. Also, as one of my students pointed out, Oliver Wendell Holmes was being Oliver Wendell Holmes. But the really interesting thing is that the model is not working so well in recent years either. One thought is that this is something to do with the ‘newness’ of the Roberts Court. It is anyone’s guess as to what is going on, but one thought I had is that the variables in the Supreme Court database might no longer be as fit for purpose as they once were. Remember the data is based on a political science dataset, where cases are coded retrospectively and where, I am assuming, the major intellectual legwork of deciding what kinds of data should be coded was probably done some time ago. The analytical framework it provides may be degrading in utility. The judges may be deciding new kinds of cases, or deciding them in ways which somehow escape the variables used to date.

This leads me onto the third, and last, study I want to concentrate on. This study does not rely on the same kind of coding of cases. Instead of using someone else’s variables it seeks to predict cases based on the text associated with those cases. It is by Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro and Vasileios Lampos. This study sought to employ Natural Language Processing and Machine Learning to analyse decisions of the European Court of Human Rights and see if there were patterns of text in those cases which helped predict whether the Court would find, or not find, a violation of the Convention. What did they do? I am going to try and speed up a bit here for readers who have patiently made it this far, and hope I do not do the study a disservice in the meantime.

In broad terms, they took a sample of cases from the European Court of Human Rights dealing with Articles 3, 6, and 8 of the Convention. This is because these articles provide the most cases – and so the most data – on potential violations. They only use cases in English, and they randomly selected equal numbers of violation and non-violation cases for each of the three articles. This creates a fairly clear null hypothesis within their dataset: there is a 50:50 chance of being right if you pick a case at random from the database.

After some cleaning up of the data, they use machine learning and natural language processing techniques to count the frequency of words, word sequences, and clusters of words/word sequences that appear to be semantically similar (the machine learning approach seems to work on the assumption that similar words appear in similar contexts, and defines semantic similarity in this way). They do this analysis across different sections of the cases – Procedure, Circumstances, Facts, Relevant Law – and calculate some amalgams of these categories (such as one which looks at the frequency of words and clusters across the full case, law and facts together). That analytical task on one level is as basic as identifying the 2,000 most common words and short word sequences, but the clustering of topics is more complicated. Then – in the broadest, somewhat simplified terms – they look for relationships between the words, word sequences, and clusters of words that predict the outcome of a sample of the cases (a learning set), build a model of those cases, and test it on the remaining cases to give an indication of the predictive power of the model. It is very similar, in some ways I think, to the way spam filters work.
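The essence of that pipeline, stripped of the topic clustering and simplified considerably, can be sketched as n-gram counts feeding a linear classifier (the paper uses a support vector machine). The snippet below is a toy illustration with made-up fragments of text, not the authors’ code or data.

```python
# Toy sketch: count words and short word sequences in each case's text, then
# fit a linear classifier to separate violation from non-violation cases.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [                                   # placeholder fragments, one per case
    "the applicant was detained without judicial review for months",
    "the proceedings lasted eleven years before a final decision",
    "the search of the home was authorised and proportionate",
    "the applicant received a fair and public hearing without delay",
]
labels = [1, 1, 0, 0]                       # 1 = violation found, 0 = no violation

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 4), max_features=2000),  # words and short word sequences
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["the applicant was held for years without a hearing"]))
```

With a real, balanced set of cases, the figure to beat is the 50:50 baseline – which is where the accuracy discussed next comes in.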

The predictions from this modelling are 79% accurate – a good deal better than Ruger’s ordinary experts, and not very far from his best experts. In numerical terms that is a little better than Katz’s model (though looking at a more homogeneous set of cases and over a narrower period of time).

There are some wrinkles in the study design. In particular, the textual analysis of ‘facts’ and ‘outcomes’ cannot be kept very separate. It would be unsurprising if the judges described the facts in a particular way when they were going to find a violation. The algorithms might be picking up a shift in judicial tone, rather than picking up whether certain fact patterns are likely to lead to certain outcomes. As the researchers make clear, it would be better to test the analysis of facts on documents other than the judgments themselves (such as the appeal pleadings and submissions – something which has been done on US Supreme Court briefs, but in a different context). Nonetheless it is a study which suggests the power of machine learning to glean information from legal documents, rather than read data from the kind of database Katz et al were using. Perhaps this suggests a way for Katz et al’s general model to develop? It would be interesting to know whether certain word clusters predict which cases Dan Katz’s model gets wrong, for instance. And if they can find a way for the machine to learn new variables over time, from judgments say, well…

As things stand, then, we seem some way off – but also quite close to – artificial intelligence being better at predicting the outcome of cases than the most learned Queen’s Counsel. No one gets close to the 90%+ performance of Ruger et al’s appellate attorneys. We don’t know if that is the fairest benchmark, but we certainly need to hold the possibility that it is in mind. The likelihood is, as Aletras et al note, that machine learning will be used in the lower reaches of the legal system. If I were a personal injury firm, or an insurer, for instance, I would be thinking about engaging with these kinds of projects to sharpen my risk analysis by scraping data from files on which cases are accepted under no win, no fee agreements.

And even then, predictions contain only modest information. If you tried to look at Katz et al’s code (you can btw: it is on GitHub) you’d be hard pushed to find an explanation of why a case was likely to win that was useful to you as a lawyer or a client (although this might be less of a problem if you were looking at baskets of cases, like say an insurance company). Similarly, if you go and read Aletras et al’s paper, you will see clusters of words which seem to predict judicial outcomes, but they are not, to my mind, very informative about what is going on. The problem of the opacity of algorithms is a big one, sometimes inhibiting their legitimacy, or raising regulatory issues. That said, utility will be what makes or breaks them.

And whilst we should worry about opacity, and power, and systematic malfunctions, and the like, we should perhaps not think they are new problems. After all, we think we know what influences judges (the law and facts), and we try and work out what persuasive arguments are, and we know what judges say influences them, but if we compare what we think with what Ruger et al found (six variables!), or Roger Hood’s work on sentencing (judges say they do not discriminate but…), things start to look simpler, or the explanations of judges look more unstable or tenuous. If our complex, fact and law-based thinking turns out to be poorer than forests of difficult to understand decision trees at predicting winners, is human or quantitative thinking more opaque in these circumstances? That is not all that matters, but it is very important.


Women QCs: a quick look at the data

The MoJ’s press release on recent QC appointments says this, “More female and black and minority ethnic candidates have been appointed Queen’s Counsel than ever before.” And the QC appointments panel data says this, “We are pleased that the number of women applying and being successful continues to rise, and that the proportion of women amongst those appointed is at its highest level ever.” (see the press release on its site).

So it is worth pointing out the following.

The year in which the most women were appointed as QCs in absolute terms was 2006 (there were 68, compared with this year’s 56). You can see the graph of the data below.

[Graph: number of women appointed QC, by year]

And in terms of the success rate of women applicants, 66% of women applicants succeeded in 2011/12 whereas this year it was 55%. Another picture…

[Graph: success rate of women applicants, by year]

And if we turn to the data that the press departments want us to focus on, we do indeed see that this year there was a higher proportion of women appointed.

[Graph: proportion of women among those appointed QC, by year]

That’s an increase from 23.4% last year to 27.4% this year, but it was 26.9% the year before that. So this year is 0.5 percentage points higher than the previous best for that statistic.*

Of course, the most important thing about those two lines is how far apart they are; and the most important lessons from the data are probably learnt from the number of applicants and their success rates. Oh, and the length of time this is taking.

*n.b. a previous draft of this post used the wrong data. The graph and the data have been corrected here.


Law: It’s all a game

Happy New Year! I am tempted out of my accidental blogging purdah by a genuinely fascinating story in Legal Week on Taylor Wessing’s use of Cosmic Cadet. Cosmic Cadet is not a replacement term for trainee solicitors indicating the uber-global commercial awareness of the modern-day law student. No. It is a possibly cringe-worthy test designed to measure (per the maker’s website):

  • Cognition. How an individual processes and uses information to perform mental operations.
  • Thinking Style. How an individual tends to approach and appraise problems and make decisions.
  • Interpersonal Style. An individual’s preferred approach to interacting with other people.
  • Delivering Results. An individual’s drive to cope with challenges and finish a task through to completion.
With such an awful title, there must be something in it, no? Arctic Shores claim strong levels of scientific support for their approach, including that all of their ‘research’ (not all of their testing or application or interpretation, I note in passing [NB, I am reassured since posting this that, “our testing, interpretation and general validity has been independently reviewed” – see below in comments section]) is reviewed (with what results we know not) by “independent subject matter experts”.

If I am sounding sceptical, in fact, I am more interested than sceptical. The attributes that Legal Week highlighted as measured by the test are particularly worthy of scrutiny:

Thinking style
Risk appetite
Managing uncertainty
Potential to innovate
Learning agility

Interpersonal style
Social confidence
Affiliativeness

Aptitudes 
Processing capacity
Executive function
Processing speed
Attention control

Delivering results
Persistence
Resilience
Performance under pressure

No mention of ethics was my first reaction – and remains my strongest one. Risk appetite is likely to be related to ethical inclination and some of the other measures may be too. It would be especially interesting to know what kinds of risk appetite users of the test want. The rather weakly evidenced assumption in the industry is that lawyers are risk averse, in the same way as lawyers are seen as both show-offs and introverts. An interesting part of the test will be the capacity of Arctic and the firms to learn more about the truth of such claims.

Fascinating too would be an explanation of how would-be trainees are supposed to manage uncertainty. There is an uncomfortable impression given by this list of the trainee as a machine, a resilient robot, a chip in the supercomputer that is big law. That’s an unfair impression, I am sure, but it is one which I hope the firms who are thinking along these lines think carefully about. Taylor Wessing, to be clear, seem to be thinking carefully about how the tests integrate with their wider processes of assessment.

Resilient, high performing people are one thing; systems that break them or lead them astray are another. I would not say law firms are broken, but there is plenty of evidence that they can and do lead some people astray. And it is absolutely vital that if firms are thinking along these lines they pay more than lip service to the moral capacities of their candidates and the ethical resilience of their systems and culture. I don’t see that in these tests. Perhaps it is to be found elsewhere.

……………..

Postscript: there’s another excellent story on this here http://www.legaltechnology.com/latest-news/gamification-taylor-wessing-using-video-game-to-assess-trainee-aptitude/


Keyser, So…

Yeah, alright, enough with the De Keyser gags.

I am very pleased to announce that David Pannick QC has agreed to adorn the latest Billable Hour tee shirts celebrating the end of Miller hearings, Pannick’s advocacy (you know how prone I am to sycophancy, so don’t make me say more), and Sean Jones QC’s awesomeness in setting up the Billable Hour appeal.

You can buy them here. All profits go to the Billable Hour appeal.

Anyone know Lord Sumption’s size?

 


Wrath of Khan: the Search for…

It has come to my attention that proceedings related to the ones discussed in my recent Wrath of Khan post are subject to reporting restrictions. I am very grateful to the person who alerted me to this. In the circumstances, I am taking down the blog for now.


Quality and cost post SQE

The SRA continues to proselytise about its SQE proposals. I confess I have still not had a chance to digest the detail fully, but I get a little bit more anxious with each bit of detail that surfaces. One point struck me whilst reading this rather good story on Legal Cheek. The SRA education director (Julie Brannan) says it would be “hard to devise an exam that could possibly cost as much as £15,000, even including training”. Tempting as it is to deconstruct the sentence with more vigour, or to simply chortle, Mwahahaha, I will simply say this: if the SRA is right – as it claims – that the exam will significantly drive up standards, then there is at least a plausible case that both the exam and the training necessary to deliver those standards will be more expensive than currently. There are other possibilities: perhaps some of the training can be done away from classrooms, on the job, without the students/trainees being charged for it, and perhaps some of the training will be rolled up into LLBs and that will reduce cost – but I am not sure how much I would bet on it unless we suddenly magic up a whole lot of price competition where there has been little to date.

An interesting further point is made about price and quality. Julie notes in the same story that purchasers often treat price as a proxy for quality. This, she thinks, is one of the reasons behind the driving up of LPC fees. I do not know if this is true or not, but it is a plausible problem. Relatedly, the SRA are putting quite a lot of eggs in a basket marked publication of SQE results. This, they seem to be hoping, will help contribute to a better market for SQE-related training. It is not at all clear why, where prices have raced to the top in the past, they will now race to the bottom. But anyway, they want, it seems, to publish individual institutions’ SQE averages for their students. This they will do, perhaps, whether or not the institutions conduct SQE training, and in situations where the SQE training may be very extensive or very lightly geared towards the SQE assessment. There are various problems with this, but a big one is that the link between the intervention (the training) and the outcome (the SQE result) may be really rather tenuous. Imagine that Oxford changes its LLB not one bit and ignores the SQE; that Keele changes its degree programme to make its students part-SQE ready; and that Northumbria preps its students for all the assessments that it possibly can. And then imagine comparing their SQE pass rates. What will they mean, and whose behaviour will they influence?

But even putting this to one side, I found myself thinking back to when I chose to do the Law Society Finals. Then, the Law Society had a central assessment and league tables were published of success rates for each LSF provider. I remember because I chose my institution, Birmingham Poly as it then was, because it had a high success rate. This seemed the obvious, rational thing to do. I did, however, have to swim against a certain “you should choose the College of Law” tide because “law firms prefer the College of Law”, even though the outcomes the College of Law was achieving were poorer then. That tide was dominant even though there was a plausible case for saying that the College of Law was a rather poor institution then – indeed it was about to be given a good shake-up by the erstwhile head of Nottingham Law School, Nigel Savage.

Readers with better memories may be able to remind me whether there were price differentials between the College of Law and the other institutions, and whether that might have influenced decisions or suggested a healthy market in quality and cost. My suspicion is there were not, but that may have been because prices were regulated? Anyway, my basic point is that in spite of a very clear link between outcomes (the exam results) and the interventions (the training provided by those institutions) – a link that, for all it was flawed, is significantly clearer than under the SRA’s current proposals – we nevertheless had a reputational market which still (I think) favoured the College of Law. Why, if that were true then, would we get a more responsive market for quality and cost now?


The Political Ideology of Lawyers

An interesting paper has been published (open access version here) on the political ideologies of US lawyers. The research has linked “the largest database of political ideology with the largest database of lawyers’ identities to complete the most extensive analysis of the political ideology of American lawyers ever conducted.” Data on ideological leanings is derived from a database of federal campaign contributions made by individuals; this is linked to the lawyer directory Martindale Hubbell using an algorithm that matches contributors who identified as lawyers to the appropriate directory entries. Ingenious, even if not perfect. A person’s ideological commitment is calculated based on the nature and size of such contributions. Various testing was done to try to be sure this was a reasonably robust measure. It wasn’t immediately clear to me how it would deal with lawyers in the centre who did not tend to contribute. And interestingly, in their sample – which was very large – over 40% of lawyers had contributed.
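As a toy illustration of those two steps – and emphatically not the paper’s method or data – the sketch below matches contribution records whose occupation field says “attorney” or “lawyer” to a stand-in directory by name and state, then scores each matched lawyer by the partisan lean of their donations, weighted by amount.

```python
# Toy illustration only: invented records, crude name-and-state matching, and a
# simple donation-weighted score in [-1, 1] (-1 = all Democratic, +1 = all Republican).
import pandas as pd

contributions = pd.DataFrame({
    "name":       ["JANE DOE", "JOHN ROE", "JANE DOE"],
    "state":      ["TX", "NY", "TX"],
    "occupation": ["ATTORNEY", "LAWYER", "ATTORNEY"],
    "party":      ["R", "D", "R"],          # party of the recipient committee
    "amount":     [500, 250, 1000],
})
directory = pd.DataFrame({                  # stand-in for a lawyers' directory
    "name":  ["JANE DOE", "JOHN ROE"],
    "state": ["TX", "NY"],
})

lawyers = contributions[contributions["occupation"].str.contains("ATTORNEY|LAWYER")]
linked = lawyers.merge(directory, on=["name", "state"], how="inner")

sign = linked["party"].map({"D": -1, "R": 1})
ideology = (sign * linked["amount"]).groupby(linked["name"]).sum() \
           / linked.groupby("name")["amount"].sum()
print(ideology)
```

The real study of course works at vastly greater scale, with more careful matching and validation; the point here is only the shape of the linkage and scoring steps.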

They find that:

American lawyers lean to the left, [and] there is a (slight) bimodality to the distribution. Although there is certainly a peak of observations located around the center-left, there is also a second, smaller peak in the center-right. In other words, the ideology of American lawyers peaks around Bill Clinton on the left and around Mitt Romney on the right.

Indeed, lawyers fell in the middle of seven professions: journalists and academics to their left; accountants, bankers and financial workers, and medical doctors to their right. It got me speculating, with prejudice and no significant knowledge, about what the situation would be in the UK, especially that bit about doctors. They also found that women are more liberal than men; government lawyers are more liberal than non-government lawyers; and, “law professors are more liberal than the attorney population. [Although, t]his effect is slightly smaller in magnitude than gender or government service.” Now, if you’re a practitioner in the UK, I’m betting you’re speculating alongside me.

In many ways, the findings are interesting if unremarkable: much of the impact on attorney ideology relates back to where lawyers are from, their age, and so on. Turns out lawyers from Texas are, well… you work it out. Elite firms, interestingly, and elite law schools, tend to lean slightly more towards the left relative to others. Firms that are less inclined in this direction are most often in firmly Republican states. Enron’s old lawyers are one of the few firms identified as having conservative partners and associates. Couldn’t resist putting that little factoid in.

Generally, then, lawyers seem to be (in the US, on this data) somewhat less extreme versions of their local fellow electorates. The authors are also, interestingly, able to compare – depending on how accurate the specialisations in the Martindale Hubbell directory are – the relative political ideologies of different kinds of lawyers. The results are not very counterintuitive. The graph clipped from the paper reports regression results. As such, the results are not saying that all oil and gas lawyers are very right wing, just that they are significantly more right wing than other lawyers when other important factors (such as age and gender) are controlled for.

One thing might be worth noting in the context of today’s big announcement on personal injury reform. In the United States at least, those involved in personal injury defence appear likely to be more right wing than lawyers on average, whereas claimant lawyers appear to be more left wing than lawyers on average. That is unsurprising, but assuming it is replicated in the UK, it is a timely reminder of the likely personal political preferences of those involved in the compensation culture debate.
