The SRA has set great store by the reliability of the assessment mechanisms for the SQE. So it was with interest that I read the two reports on their Stage 1 pilot.
Relative to many academic colleagues and practitioners, I am not particularly sceptical of MCTs, and am willing to be persuaded they can be more sophisticated than is generally thought. As part of a diet of assessment, they have significant merit. I have reservations about whether they are sufficiently sophisticated to deal with all the knowledge-based skills a day one lawyer needs, but as a start, and with supplements, they have value.
Putting aside the validity of the tests as predictive of actual competence (a large and difficult topic), in assessing reliability I expected a technical but reasonably robust and open analysis of the reliability level at which the test was set. I’d have hoped, too, for some examples of questions (some that candidates found easy, some that they found hard) and a detailed, open, if rather narrow, analysis of a statistical notion of reliability: more persuasive on the ability of two markers to consistently arrive at the same judgment than on whether that judgment was highly predictive of day one competence. Example questions with pass/fail rates would also be enormously helpful in generating understanding, and perhaps acceptance, of the tests.
The predictive question is the most important one, but it is hard to deliver on. Delivering on an analysis of the reliability question is relatively straightforward. Yet my modest expectations were not met. It is worth highlighting a few points of concern:
1. We don’t know anything about the representativeness of those who sat the test. We are told that those selected to sit the test are broadly representative, but this is cold comfort. Firstly, the level of attrition between those selected and those actually sitting is large. Those who were asked to sit the test may have been broadly representative, but who dropped out? The reports rely on telling us about representativeness rather than showing us the data to help us evaluate that claim. It’s a frankly bizarre way to go about dealing with data.
2. The same problem manifests with their testing of reliability. Here the data is absolutely central to the claim of the whole report. Here’s the key quote on that: “Analyses, including generalisability analysis, show scores of three exams of 120 questions were slightly less reliable and accurate than the level commonly regarded as the “gold standard” in national licensing.” They do not state what test they used, or what the data from that test found. I’d expect a test, a score and a confidence interval. I’d have wanted, also, an explanation in lay terms of what this would mean for the reliability of the test in practice, i.e. how many candidates it would wrongly classify as passes and fails. That would give statistically literate readers and lay readers a chance to assess the strength of the claim that it is “slightly less” reliable than the accepted standard.
Note that, in any event, it did not meet the accepted standard mentioned. Or, put another way: if it ain’t gold, what colour is it? It is in this context that the SRA CEO’s trumpeting of a step towards a world-class test is worth noting. To which the response might be: (a) how far away are you? and (b) why haven’t you published the most basic information about the quality of your sample, the tests you used, and the actual results from those tests? Publishing such data is entry level, not world class. You have to show us, not tell us. Interestingly, also, the report writers elide reliability and generalisability tests. They are quite different and separate. What they say here smacks of a lack of understanding, or of obfuscation.
3. The same problems are manifest in the analysis of fairness. Tantalisingly, we are told BAME sitters did worse, but that whether candidates did the GDL or went to a Russell Group university were more significant sources of performance variance. The size and robustness of these effects, and their interrelationships, are crucial information here. They are missing. The directions of these effects are, to put it mildly, interesting! Note also that the apparent, potentially discriminatory effect within the OSCE tests is given as a reason for their abandonment (the fact that this will make the test cheaper will have helped the SRA make up its mind here), while the apparent, potentially discriminatory effects in the MCTs are glossed over. A tricky issue. Transparency and openness here demand clarity about what the data says.
4. We can see reasons to doubt the representativeness of the test sitters; reasons to doubt that the tests actually passed a reliability test, or at least a willingness to be open about that; and reasons to wonder at the test’s fairness. But there is one other problem. As a test of competence, we learn that this group of candidates failed more often than normal and, if I understand correctly, more often than not. This does not strike me as a strong pilot of how the test will actually work. Either sitters were not performing at the level one would expect or the test was set at the wrong level. I understand the reasons for this problem, and correcting for it is difficult, but it adds a further uncertainty to an already cloudy picture.
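The misclassification point in (2) can be made concrete. As a rough sketch of my own (not anything drawn from the reports, and with entirely illustrative numbers): treat each candidate’s observed score as their true score plus measurement error, with reliability being the proportion of observed-score variance that is true-score variance, and count how often the pass/fail call flips between true and observed score.

```python
import random

def misclassification_rate(reliability, n=100_000, seed=1):
    """Illustrative simulation: candidates' observed scores are their
    true scores plus measurement error, where `reliability` is the
    proportion of observed-score variance that is true-score variance.
    Returns the share of candidates whose pass/fail classification
    differs between true and observed score. All numbers are made up
    for illustration; nothing here comes from the pilot reports."""
    rng = random.Random(seed)
    true_sd = 0.1      # illustrative spread of true ability
    pass_mark = 0.5    # illustrative cut score, placed at the mean (worst case)
    # Error SD implied by the chosen reliability level.
    error_sd = true_sd * ((1 - reliability) / reliability) ** 0.5
    flips = 0
    for _ in range(n):
        true = rng.gauss(pass_mark, true_sd)
        observed = true + rng.gauss(0, error_sd)
        if (true >= pass_mark) != (observed >= pass_mark):
            flips += 1
    return flips / n
```

In this toy setup, with the cut score at the mean, a reliability of 0.9 (a level sometimes quoted for high-stakes licensing tests) still misclassifies roughly one candidate in ten, and dropping to 0.8 pushes that towards one in seven. This is exactly why “slightly less reliable” needs a number attached before anyone can judge how much it matters.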
So, in sum, this pilot obscures as much as it reveals, does not meet the basic reporting standards of the kind of testing it engages in, and suggests potentially significant issues about validity, fairness, and reliability. It is world class in the way that British volleyball is world class. If the SRA really took a decision based on these reports, I’d be genuinely shocked. There must be a more open, honest, technical report and they should publish it.