“I don’t feel like I’m swimming against the tide in the way that I did thirty years ago.”

In this episode, we talk to Professor Rob Coe. Rob was a maths teacher, then, for many years, Director of the Centre for Evaluation and Monitoring. He is now both Director of Research and Development at Evidence Based Education (EBE) and Senior Associate at the Education Endowment Foundation (EEF). Rob has been doing thoughtful, critical, uncompromising educational research for longer than I’ve been working in schools. When new thinking in English schools gained momentum, he was well placed to influence that thinking.

His work – notably Improving Education: A triumph of hope over experience, the EEF’s Teaching and Learning Toolkit, and EBE’s Great Teaching Toolkit – has managed that rare balance of doing a wealth of hard research, particularly in the area of assessment and evaluation, while conveying it in clear and accessible ways that make sense to busy teachers who don’t have graduate-level training in statistics. I wanted to hear from Rob about what he thought had improved – or at least changed – and why.

We discussed:

  • The reasons for grade inflation in the 1990s and 2000s, and the limits to educational improvement
  • Why Assessment for Learning made little difference in English schools
  • What we can and can’t learn from international tests
  • Why health has improved so much, and education hasn’t
  • How we can scale effective teacher development
  • The role of the Education Endowment Foundation and the successes it has had

Rob’s answers were characteristically thoughtful, original, and thought-provoking.

You can listen to the episode on Spotify, or Apple Podcasts, or read the full transcript below, lightly edited for clarity.

You were a secondary maths teacher before getting into educational research, and were trained by both Dylan Wiliam and Guy Claxton. Tell us a bit about what life was like as a teacher.

I love teaching, and I’ve always had this story in my head that I would or could go back. I suppose I still think that – I’m in my sixties now, so I guess it’s getting a bit less likely, but in principle I could go back. It was just getting into research and loving that so much that pulled me away. So it was all pull and not push.

I was a maths teacher. I taught in two sixth form colleges, and an 11-to-16 school. I’m not sure I was a brilliant teacher. When I look back at what I now know – about learning, memory, and even managing classrooms – I wish I’d known some of that stuff 30 years ago. But I was good in some other ways, perhaps. I loved doing it, and I was a strange convert to it in a way, because when I left school I thought, “There’s no way I’m going to be a teacher. That’s the one thing I’ll never be. Teachers are bastards.” I really couldn’t see myself doing that.

I started doing a bit of teaching. I was coaching resits and things like that for kids in Oxford, and quite enjoyed doing it, and thought, “Maybe I should learn how to do this properly.” So I applied, and I ended up going to Chelsea College, as it was then – it had just merged with Kings. Dylan Wiliam was there. He’d only just started on the staff there. He’d been a head of department in London. Guy Claxton was also there. I don’t think I appreciated it. I just thought, “Well, every PGCE must be like that.” My PGCE was fabulous. I thought it was great. They were so provocative. They wanted to engage with big ideas and make you think hard about stuff. Since then, I’ve spoken to people and asked, “How was your PGCE?” and heard, “It was a waste of time before you got into schools and started doing stuff.” That wasn’t my feeling about it. It was a really good experience. Did it prepare me for the chaos that is the classroom of a newly-qualified teacher? No, of course not. That was a steep learning curve, but did it give me lots to think about? Yes.

Nice to hear – I’m sure Dylan will be pleased to hear that you look back on it fondly. So what then was it that drew you out of the classroom and into educational research?

I’d done a master’s while I was teaching, in maths education, and quite enjoyed the writing and thinking side of it. So I suppose I thought, “I’d like to do a bit more of this.” The obvious way forward seemed to be to do a PhD. It was an indulgence to be honest. I was married at that point, didn’t have any children, and my wife was working – she was a teacher – so she could support me and I could sit in my living room with a computer and tap away all day. That’s more or less what I did for three years.

It was a culture shock to go from one of the most sociable jobs in the world, where you’re surrounded by kids, you’re the centre of attention – there’s no downtime when you’re in school from the minute you walk in the building, pretty much everything’s at high speed – you could spend a week and literally speak to no one apart from your family or whatever. I found that hard in some ways, but I also quite enjoyed it.

I did this PhD and I thought, “I’ll go back into teaching because I love teaching. Why wouldn’t you? I won’t tell anyone I’ve got a PhD because they’ll all think, ‘Who does he think he is?’ No one will call me Dr. Coe or anything silly. It’ll just be – carry on. But I’ll have had three years of fun.” So I did that. Then, towards the end, a job came up. This was at Durham. I applied for the job, got it, and stayed on. I loved doing that as well. I loved doing research. So I’ve been very lucky that I’ve loved doing all the things I’ve done in my life. I meet lots of people and it’s very rare for anybody to say that. So I know how lucky I am.

Then that just grew into doing lots of evaluation, lots of work in assessment. But my background as a classroom teacher was always there with me. I was always thinking, “How would this be relevant? How can this connect with people working in schools?” Hopefully I haven’t lost that perspective too much, or at least the ability to understand what it’s like and what kinds of things help people. That’s basically where I’ve gone.

We did the EEF Toolkit you mentioned. That was a piece of research that came originally from The Sutton Trust – it was Lee Elliot Major, who was then the director of research there. We’d already done bits of work for him and Steve Higgins, my colleague in Durham. We were chatting one day about the difference between what headteachers were saying they would spend money on – there was talk then about the pupil premium being a thing. This would be extra money that schools could choose how to spend to close the disadvantage gap. The question was: what would heads spend that money on, and what would the evidence say they should spend it on?

Sutton had done a survey at that point, and the answers were way out of line with the evidence. Heads said, “We’d employ more staff and have smaller classes. We’d employ more teaching assistants.” Those were the two top things they wanted to spend the money on. Neither of those are bad things, but neither of them pay back in terms of benefits for kids’ learning, given how expensive they are. We knew that – that research wasn’t new. It had been around for many years. So we thought, “Maybe we should try and summarise it in a way that school leaders can understand and will have access to – first of all because they didn’t know about this, but also so that it’ll connect and make sense for them.”

So the expertise of Steve Higgins, who’d been doing this work for many years – secretly collecting all these meta-analyses about different interventions – and Lee Elliot Major, whose background was in journalism but who had also done a physics PhD, a weird combination, turned it into something that did resonate. People looked at it and thought, “There’s something interesting here.” That got me interested in pedagogy – thinking about, “Why doesn’t class size make much difference to how much kids learn? Because teachers are sure that it does. The research says it’s not a massive effect. It’s positive, but it is relatively small.”

That was the initial reaction to what became the EEF toolkit: “It’s wrong.” That was great, because you could then have a debate with people about, “Tell me something that you could do with a class of 15 that you can’t do with a class of 30.” It’s quite hard to think of anything. People would say, “You can give more individual attention, certainly spend less time marking.” Then you say, “What’s the evidence about the impact of that on learning? It’s small at best, probably. How much more individual attention can you give? Is that your dominant pedagogy for how you’re going to spend your lesson – each child one by one?” Because that’s not a very efficient way to go about it. Think about how much individual time each child actually gets: in a 15-person class over an hour, they might get four minutes each; in a 30-person class, that would be two minutes each. So by halving the class you’re adding two minutes per child. Is that enough? What about the other 56 minutes?

So those conversations made me think hard about learning, how it happens, and how teachers manage it. I guess that got us into the inaugural lecture, and then the What makes great teaching? report, and then I was on a journey to thinking about pedagogy much more.

Let’s talk about Improving education: a triumph of hope over experience. It’s one of my favourite ever papers, and a paper that I keep quoting. It was published in 2013. Lots of things since then we wouldn’t necessarily go back to, but there are so many points that were so well and so memorably made around things like, “What should we be looking for in a lesson? How should we evaluate whether things are getting better or not?”

I’ve got a few questions about this paper. At the start, you argue, “Standards have not risen. Teaching has not improved. Research that has tried to support improvement has generally not succeeded. Even identifying which schools and teachers are good is more difficult than we thought.” Again, this is 2013, so some things may have changed. Why did you think that point needed to be stressed?

Maybe it’s something in my nature that likes to puncture balloons, or puncture optimism. Perhaps I’m a cruel person. But there was a smugness, I think. We’d seen in England grades at A-level and GCSE rising massively, and we’d seen ministers claim credit for that. Every year, the results would get a bit better and they would say, “This is evidence of our success.” Partly because I had access to data that we were using in Durham – the monitoring systems at CEM showed things didn’t look as though they were improving on an objective measure that was constant over that time period.

So it was partly the availability of data that no one else had. But it was also thinking, “It’s important that we face this, and we realise that the things we’ve done so far haven’t actually had benefit.” That isn’t the perception you have. When you’re working in a school, it does feel like things are getting better – or it may do. Probably not in every school, but for a whole series of reasons, it may feel like things are improving. Of course in many schools they are improving. It’s just that for every school where they’re improving, there’s another school where they’re going in the opposite direction. So it’s hard to get the overview.

I suppose I thought, “If we are ever going to do anything at a system, or even at a school level, that genuinely does improve things, we need to start by facing the fact that the things we’ve done so far, or at least most of them, haven’t really worked.” Otherwise we carry on. We’ll think, “Things are getting better, it’s great. Or they’re fine, and so let’s carry on doing what we’re doing.” I thought, “That’s not the way to go. We do need to think quite differently about this.”

You’re very fair about it, saying, “It’s not to say that teachers and leaders and everyone isn’t working hard. The question is, on net, are we managing to make sure that students are learning more than they were last year?” Again, in 2013, you look at this and say that GCSE results have improved enormously. You show that a student who’d have got a D in 1996 was, by 2012, likely to get a C at GCSE. Scores had become much higher, and at the same time, nothing was really changing in international tests. What is that process of grade inflation? What was going on?

It was quite easy to explain that in terms of the way the grading was being done. The typical thing the exam boards were doing in that time was they would set a paper. Very difficult to judge how hard a paper’s going to be before people have done it. There’s lots of research on that. Even experts can’t say how hard it’s going to be with any level of precision. So then you identify what you think are likely to be the grade boundaries – so that’s a lot of examiner judgment.

There were statistics being used as well. They would look at things like Key Stage 2, if we’re talking about GCSE. They would look at the overall norms across the cohort. Then, broadly speaking, that statistical evidence would give them a window within which they’d say, “The threshold for a grade C should be between 48 and 52,” or something like that. Then the examiners would look at it and they would always go for the lower end. It was an asymmetry. They weren’t choosing the middle of the statistical range. So more people get the grade C. It was within what was reasonable each year, but there was no long-term anchoring. So each year it would creep up slightly. It seems obvious.

It’s in everyone’s interest for grades to go up. Students like it, teachers like it, parents like it, government like it, the exam boards like it – there’s nobody who doesn’t like it apart from me. For many years I was the only person saying this. The Daily Mail was about the only paper that would print the story every year. So I became quite friendly with journalists there for a few years. That was the case for ages.

Suddenly, in 2010 when the government changed, Michael Gove became interested. We’d had some conversations with him before the election. He’d read some of the things that we’d done on it. He accepted the argument and said, “There’s a problem here. We need to sort it out.” Ofqual already existed at that point, but not quite in the format that it does now. So he strengthened it up as an organisation and appointed people to lead it who were going to sort this out, and very quickly they had put in place a more robust mechanism.

We had comparable outcomes, if you remember that. Then not too long after that, the National Reference Test, although that took a while to come in. All of that was a response to this issue about grade inflation. I would say since about 2011, in terms of GCSE and A-level, that has no longer been the case. It’s stabilized. Obviously COVID – there was a blip there, but other than that, it’s been pretty stable.

I want to ask one more thing about this paper. You were critical of our assumptions about implementation. Again, a memorable quotation for me: “It’s now a rare thing in my experience to meet any teacher in any school in England who would not claim to be doing assessment for learning. And yet the evidence presented above suggests that during the 15 years of this intensive intervention to promote AfL, despite its near-universal adoption, and strong research evidence of substantial impact on attainment, there’s been no, or at best limited effect on learning outcomes nationally.” What then, or what now, are your guesses to explain that paradox?

You’re right, it’s a paradox. It’s an interesting contradiction. In 1998, we had Inside the black box – it became a big national policy. Lots of training and funding. If you were teaching in the early 2000s, you would’ve taken part in that training about Assessment for Learning (AfL). If you speak to any teacher, they would say, “We’re doing AfL.” When you observe classrooms, you might have a slightly different perspective because – one of the things about AfL, formative assessment, is that it’s not that well defined. I mean, you wrote a book, Responsive Teaching, which is basically trying to summarise what it is – it’s not just one thing, it’s a whole lot of different things. But teachers were being trained and they were all good things. The evidence supported all of those things.

So how is it that we hadn’t seen any change in overall standards? Which we didn’t seem to have done. The evidence about this is always a bit incomplete, but as far as you could tell, there wasn’t any shift. I think the answer is because doing these things in classrooms is hard. If you just go on a day’s training about formative assessment or Assessment for Learning, you see some of these techniques explained. You perhaps see them demonstrated. You understand the theory behind them. Let’s say best case – because if you just spend a day on a piece of training, one thing we now know about how learning happens (we knew it then, but certainly more people know it now) is that there’s very little you can learn in one day and never revisit. You have a day on it, however good it is, and chances are it then goes out of your memory, your knowledge, and certainly your practice – you revert back to the things you were doing.

So we should know that it’s unlikely that that’s going to make a big change to people’s practice. Sure enough, it seemed like it didn’t. In fairness to Dylan, and other people who were promoting Assessment for Learning, I don’t think he ever expected it would. In fact, he spent probably the best part of the 20 years after writing Inside the black box trying to learn how to get teachers to do any of this stuff. At the end of 20 years, he had something that worked a bit, which is more than anyone else had.

I’ve dwelt quite a lot on the period up to 2013. To that point: lots of effort, and things hadn’t got much better. Since 2013, you’ve written two toolkits, I’ve written some books, we’ve thought harder about pedagogy, GCSEs have been reformed, lots of things have changed. The million-dollar question is: has it done any good? Are students learning more now than they were in 2013, or 2005, or whenever you want to start?

I think it’s quite hard to say. It was really hard to say in 2013, because we just didn’t have good evidence. The evidence that I cited was a mixture of things. There are a few different assessments that have been used. The CEM ones from Durham, but various others that have been done on a reasonably representative sample of people. Never perfect, but there’s those, and then the international surveys. In particular, PISA, which by then had been going, in strength at least, since 2003, and a bit before that, plus TIMSS and PIRLS. So we’ve got all these different international surveys.

International surveys are designed primarily to compare different countries. I don’t know if that’s quite how they would put it – they’d say it’s to give countries insights – but it’s to learn about the comparative performance of different countries in relation to different policies. They’re not designed to track progress over time. They have all sorts of problems to do with sampling, for example – there’s no country that gets an adequate or perfect sample. England had a few problems with that, particularly in the early years. (We discussed these challenges with John Jerrim in a previous episode.)

You could say, “We won’t look at some of those earlier PISA surveys because they didn’t meet the sampling requirements.” Or you could say, “We’ll look at them, but we need to be a bit cautious.” But most years, if the requirements are met, it’s only just, and there are issues, including in the latest round. So that’s always a problem. Even the way the sample frames are designed can change a bit because, for example, kids who are not in school – they’re not treated as missing from the sample. They’re not even thought about. So that’s an issue, particularly post-COVID, where there are more kids who are not in school, and typically they’re the lower performing kids.

We’ve also seen, in that time, PISA move from being a paper assessment to being digital. There is no way that you can have perfect continuity between those two different modes. They’re different assessments. You can do all the clever equating you like – some people will do better on one and some will do better on the other. All you’re trying to do is balance those out overall. So it’s quite hard to know whether you’ve got constancy there.

So for all these reasons, it’s problematic. But we do have PISA data. I would say, since about 2013, it’s more or less flat. It goes up and down a bit each year. We’ve also got, now, the National Reference Test. That partly came in response to this issue. It’s a good thing we’ve got that. I know it’s quite an expensive thing to do – but knowing how well you’re doing is a high priority. That’s the evidence-based way to improve what you’re doing – you have to start with knowledge. So we’ve had that a few years now, and to begin with, there were some hints that it might be going up. Certainly the maths looked as though it improved for a few years, and everyone was getting quite excited about that. Then we had COVID, and that makes a big difference. I would say if you look at the pattern of both maths and English over the nine or ten years we’ve had it, it’s pretty flat.

Having said that, and going back to PISA, where we have improved is in the rank order – we’re higher. In the absolute score, we’re flat, given fluctuation. I know, when they report results like PISA, they give you a confidence interval. But I would argue that the true margin of uncertainty is bigger than the margin that is estimated by statisticians purely based on random sampling. We have statisticians play this trick where they say, “We can estimate confidence intervals” – which you can, if you make some quite strong assumptions about a well-defined population and a random sample being drawn from it. The reality is that you don’t ever get a fully random sample. And there are lots of other variables in any research study that affect the results, not just sampling. So we try and quantify the sampling error, and we ignore all the other sources of error because they’re a bit too hard to quantify.

There is some research on that. My sense would be it’s probably at least as big again, maybe more so, so we should double those confidence intervals to get an idea of how much we know. So within that variation, and given all the caveats about those studies, the story is we’ve flatlined, but most other countries have got worse since COVID.

In a counterfactual situation, you can say, “If we hadn’t had COVID, we might have been going up.” It’s quite plausible – but the reality is I don’t think we have gone up yet. It is also interesting if you look at the other surveys: TIMSS or PIRLS. In PIRLS, we’ve always been one of the top countries, and we’ve continued to be. That means ten-year-olds’ reading is pretty good. In fact, everything for primary age looks pretty good for England, and it mostly always has. Some scores have risen, and the maths and science in TIMSS, again at primary age, are really strong. We’re one of the top countries for that. Something then seems to go wrong in secondary, and we fall behind. Assuming that’s the right interpretation – quite what goes wrong, I don’t know. Why is it worse in England than in other places?

England seems to have pulled a bit ahead of the other nations of the UK [I have examined the international evidence on this phenomenon]. So that’s an interesting thing. Again, it’s more because they’ve fallen back than because England has raced ahead, but that’s probably attributable to COVID, at least in part.

So it’s quite hard to know still. Perhaps we know a bit more than we did, but I still think there’s quite a lot of uncertainty around that. There certainly hasn’t been a dramatic change. There may have been a very small change.

Tim Oates wrote a blog post making this argument, particularly with reading, that everywhere was getting worse and had been for quite a long time. England had achieved something by managing to stem decline. Is that something that you would agree with?

I mean, that’s right. It’s a bizarre situation where everybody’s getting worse at something. There’s a big contrast with other areas of life. If you look at health, for example, or if you look at crime – people’s perception is often not that, but if you look at the data, almost every health outcome you could point to improves quite dramatically, sometimes over even quite short timescales. Other aspects of social life, broadly speaking – technology, to name another one. Changes there are dramatic, fast, and huge.

Yet education seems to not be any better than it was, twenty, fifty – even a hundred years ago. I don’t think there’s evidence that things have improved, particularly in terms of the ultimate productivity of the process, which is getting large numbers of kids to learn stuff. You could add to that closing the gap. I don’t think, on either measure, we’ve seen progress. We’ve held, and maybe that’s a good thing. You can create a narrative about that. But it does look odd to me, when you look at other aspects of life, and you see huge improvements. Then you see in education – well, not any improvement at all.

Have you got a theory for why that is? In health, there have been technological changes. But in terms of people’s fundamental behaviour – do they remember to take their medicine? Do they behave in healthy ways? We’ve seen falls in things like smoking, but people are still rash and forgetful. Why has education proved harder to improve than other public services?

Why has it? It’s a good question, and any answer is a bit speculative. I don’t think we can have a clear attribution for it. One story would be the rise of evidence-based medicine, although the improvement doesn’t wait for that. That was in the ’90s – medicine was already improving before that. But that probably is a factor. Technology obviously is a big factor in health. Even things like rates of smoking reducing quite dramatically over the last thirty years.

In education, why haven’t we improved? Part of the problem is that we haven’t applied the knowledge that we get from research to trying to think about how we improve practice. How do we think about how we improve education? What are the strategies? If you look at government, they have a range of levers they can pull – mostly quite limited. What have we seen? We’ve seen accountability systems change. I think the evidence about their ability to drive big change is pretty limited. I think we should have accountability – it’s a positive force. But I don’t think it’s transformational. Once you’ve got it, there aren’t further gains to be had from that – not large ones anyway. Schools are still pretty much organised the way they always have been.

For me, teachers’ professional learning is where we should put our hope. That was hinted at in that 2013 paper. My view on that has hardened since then. There are some schools and trusts who do fabulous things around professional development – but mostly they don’t. The gap between what’s mostly done and what would be best practice according to what we currently know is huge – there’s no point of contact. It would be unrecognisable to most schools. I think what we currently do in terms of professional development in many schools, most schools perhaps – it’s not surprising it doesn’t have any benefit. You look at it and you say, “We can see why that wouldn’t work.”

So that would give you flat-line, wouldn’t it? That would say, “The whole system hasn’t improved, but also individuals don’t improve.” Again, that’s an extraordinary thing when you look at other areas of expertise. Mostly, people, over a career, become dramatically better. They’re more expert at the thing they’re doing. Not so in teaching. The evidence is of a small increase. After the first three to five years, hardly any improvement. Maybe a little bit, but not really. Overall, pretty unconvincing. That’s extraordinary. It’s not because after five years you’ve learned everything you could – that you’re as good a teacher as you could ever be. That’s certainly not the case. It’s that we don’t have in place good systems for developing expertise. We haven’t thought about the expertise problem as being part of how we think about system improvement.

That’s, in some ways, a pessimistic view that says, “Everything’s flat and we haven’t improved.” But it’s also quite an optimistic view, because we can see some things we could do that would be much better than where we are. That’s the hurdle you’ve got to get over. You haven’t got to do something perfect, you’ve got to do something that’s better than what you were doing before. Then gradually, if you keep on doing it better than you did before, those gains accumulate and you end up with something that is totally better.

When I interviewed John Jerrim, he said – unprompted – “If you put a gun to my head, I would say that maths learning has improved in England over the last fifteen years.” If we put this metaphorical gun to your head, and we get to our first set of proper post-COVID results – which you could argue is going to be a very long way away – but whenever we think the norm kicks in, do we see things get better?

What is the gun doing? It’s making me not say, “I don’t know”?

Yes. It’s forcing you to shrink your confidence intervals, let’s say.

The answer is we don’t know. I’d be surprised if it’s changed a lot. It may have changed a bit. When you look at the TIMSS primary scores rising – there’s a massive rise in the early 2000s. One story I heard about that was that the National Numeracy Project, which is mostly in primary schools – so applies to TIMSS in Year 5 – was an exercise in realigning the curriculum in England with the curriculum in the TIMSS survey. I mean, it wasn’t explicitly about that, but that’s what it amounted to.

We had done maths before, and it became numeracy. It was much more focused on number, and some parts of what had been in the curriculum went out. All of that coincides with the things that were tested in TIMSS. I’m not saying that that was the design behind it, but it just so happened. By changing our curriculum to be better aligned with the test, you might hope that our performance on that test would improve. Could you interpret that as more learning? Not really, because you are testing some things, you’re not testing others, and that’s always a problem. But in a sense that was a big success. I think in all of these measures that you’re trying to use, you’ve got those sorts of problems as well. The curriculum changes. It’s better or worse aligned to a particular assessment, and ultimately you have to try and avoid that.

The model for this is the National Assessment of Educational Progress in the US, where they’ve been doing that since the 1970s. Of course the curriculum’s changed, and it operates across all 50 states. So again, the curriculum can be quite different. They have to have a model that is relatively flexible to those kinds of changes. So it’s still meaningful to talk about a change over 50 years, even though lots of other things about the curriculum have changed in that time. I think that’s quite a good model.

That’s not the same as the one we’ve got in the National Reference Test. Ours is much more fragile, because it’s the same specification for GCSE maths and English that we’d have today. So as soon as the next reform to qualifications comes along, and we have a different maths and English GCSE spec, the National Reference Test will no longer tell us anything about that. I suppose it’s done well that it’s lasted as long as it has. There doesn’t seem to be an appetite for radical change around GCSE – but it will change at some point. When it does, we would like to know whether, in terms of fundamental skills, or things that are more universal – have those changed? We won’t know about that, because it’s a copy of the current GCSE. For me that’s a missed opportunity. But that’s what we’ve got.

I avoided your question. I’ve pushed the gun away.

You have.

I don’t think we know. But if it has changed, it’s quite a small change. At the EEF, we talk about closing the gap between disadvantaged children and the rest. Broadly speaking, the gap between children on Free School Meals and the others is something like a 0.8 or 0.7 effect size. It’s most of a standard deviation. The change that we’ve seen in maths might be a 0.1 or a 0.05, at best – it’s of that order. It’s a tenth of the size of the impact we’d need to have to close that disadvantage gap. It’s a start. It’s good if it’s happened, but I wouldn’t describe it as transformational.
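[Editor’s note: to make the arithmetic here concrete, a minimal sketch using the illustrative figures Rob quotes – a disadvantage gap of roughly 0.7–0.8 standard deviations and an observed change of roughly 0.05–0.1. The midpoint values below are assumptions for illustration, not measured quantities.]

```python
# Illustrative only: express an observed improvement (in effect-size units,
# i.e. standard deviations) as a fraction of the FSM disadvantage gap.
def fraction_of_gap_closed(change_sd: float, gap_sd: float) -> float:
    """Return the observed change as a fraction of the attainment gap."""
    return change_sd / gap_sd

# Midpoints of the figures quoted in the conversation (assumed, not measured):
gap = 0.75      # FSM vs non-FSM gap, ~0.7-0.8 SD
change = 0.075  # observed change in maths, ~0.05-0.1 SD

print(f"{fraction_of_gap_closed(change, gap):.0%} of the gap")  # prints "10% of the gap"
```

This is just the "a tenth of the size" ratio made explicit; the real EEF conversion between effect sizes and "months of progress" is a separate, more involved mapping.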

As you say, it gives us lots of scope for further improvement. I want to ask about the understanding of effective teaching. So you’ve alluded to doing your PGCE and having a very interesting, thoughtful time, but not necessarily learning everything that we now know about teaching and learning.

How has your own understanding changed, and have you seen that understanding change over the last 15-20 years? I dug out my notes from my training a while back. A lot of the things that we now talk about quite a lot weren’t mentioned at all, might as well not have existed, and didn’t exist in my mind until I later discovered them. What’s changed there?

Quite a lot has changed. If you look at the Early Career Framework and the Initial Teacher Training framework, now combined, there’s a lot of stuff in there that wasn’t previously being mainstreamed – certainly wasn’t part of my PGCE, or yours, or many other people’s. Now that’s part of everybody’s training. So I think in terms of alignment with evidence about pedagogy, we’re at least saying the right things. We’ve still got work to do, in terms of moving from talking about these things, and a theoretical understanding of them, to doing them routinely, habitually, and with skill.

We’re not quite there yet, but at least we are talking about that. So those discussions are very different now from how they were – in England. But you don’t have to go very far to have quite different conversations with people that are completely unaware of the science of learning, or – let’s take something like Rosenshine’s Principles. The funny thing about Rosenshine – he published that review in 2010, but all of that was work that he and others had done in the ’60s, ’70s, or into the ’80s perhaps. None of it was new. We knew about this stuff for a long time. It had just gone a bit out of fashion. In England at least, it’s come back into fashion. But if you go to the US, or the Continent, and you have that conversation with teachers, they look at you blankly: “What is all this stuff?” They’ve literally never heard of any of it.

Many teachers in England as well – we can live in a bubble where we go to ResearchED and we read books, or in your case, write books. Lots of teachers are engaged with that. There are roles now in trusts and other organisations that blend that teaching and research angle, which I think is great. But those are not typical of all schools in the country. There are many schools where people have not heard of Rosenshine – they wouldn’t be able to tell you what it’s about, or the science of learning, or any of those things.

This is the thing that almost first got me interested in having conversations like the one we’re having now – this experience of being in diverse international groups, and realising the couple of Brits in the group would be saying, “What about this?” As you say, it’s generally not… There are places – Australia, I think, is picking up – so it’s not that ideas don’t travel at all. How would you explain this quite unusual path that the people involved in this thinking have gone on over the last few years?

It is hard to explain, and it’s certainly not what I would’ve expected to happen. There was a time when I and others were campaigning to get evidence taken more seriously in thinking about education policy and practice. Literally nobody was interested. No ministers, no teachers, no one in universities. It was a tiny handful of people arguing for this.

When are we talking about?

The late ’90s and early 2000s is when I first started. I finished my PhD in 1998. That’s when I became a researcher and started campaigning, I suppose, for evidence-based practice and policy. The academic community was quite hostile to that whole idea. People in schools were a bit more interested, but it was slightly bemused interest. It wasn’t mainstream. That’s changed completely now. So what are the reasons for that?

The Education Endowment Foundation is part of that. Partly the messages that they’ve been able to put out since 2011. Partly the fact they’ve had money to spend on research: that drives researchers’ practice in interesting ways. People who were arguing, “You shouldn’t ever do randomised controlled trials: they’re immoral, unethical, pointless,” were then queuing up to take the money to do them, because there wasn’t money to do anything else.

ResearchED has also had a place, and various other organisations like that – that have brought people together with a focus on school leadership and teaching, but also looking at research evidence. Arguably Ofsted has also played a role in that. Probably lots of other organisations too.

So it’s quite hard to put your finger on, “Why did this thing happen in England?” As you say, Australia looks as though it’s following quite a similar path, maybe a little bit behind, but probably catching up because you’ve got that advantage. But mostly around the world, it’s not happening. Having said that, people will look at things like PISA and say, “England seems to be doing better than everyone else. We’d better go over there and see what they’re doing.”

Policy – I didn’t mention the Gove–Gibb reforms. I’m not saying that they were a perfectly constructed set of things that I completely agreed with, at all. But I think some of the underlying principles and the components of it probably did contribute to that as well. You’ve had Nick Gibb on this podcast, haven’t you? He does have some responsibility for that. For sure.

I remember not being wild about many aspects of the reforms, but there were elements of it – I had a child walk into my room one day and say, “I’ve just had my fifth go at doing my science module. I know I’ll keep failing it, but they keep making me.” So he was missing my lessons doing these things every three months – they kept throwing him against the wall. Likewise coursework – I think we all knew things that were going on there (I wrote about cheating in tests at the time) – that were rightly got rid of.

So we’ve got these better ideas, and we’ve talked a little bit about the difficulty then of helping teachers in schools to change. You’ve spent time thinking about this, particularly at the EEF. What are we doing right in terms of making sure that more people know and can use these effective pedagogical tools, and what more do we need to do?

I think we’re beginning to learn quite a lot more about this. The turn towards coaching – although I have mixed feelings about it – at least makes us think about professional practice as a set of skills and behaviours. It is not just about theoretical knowledge and understanding – although that also matters – but about a practical craft that you learn by doing. In order to learn by doing, you need to have feedback about what you’re doing, as that’s how we learn pretty much anything.

So how do we get that feedback? Coaching looks like an obvious way to do that. I think where coaching is done well, the evidence is supportive. It can have pretty positive effects. My concern is about the scalability of that as a national strategy, because one of the things that the evidence shows is that the expertise of the coach is the main determinant of how helpful it is. There’s always going to be a short supply of good coaches.

Maybe AI can help with that. Maybe we can find ways to scale it. Maybe I’m overestimating the difficulties – but I do have some concerns. If one of the things that happens because of an appetite for coaching is that all of our best teachers get taken out of classrooms to be coaches, then that could have quite a big downside in terms of learning. One of the things we know is that the variation across teachers is quite large in terms of the impact they have on children’s learning. Our best teachers are a lot better than our worst teachers. So it matters to have a policy that respects that.

What we’ve tried to do at Evidence Based Education is more to say, “How can we scale this thing? How can we make it something that doesn’t depend on having too much of that expertise in your school?” Or, if you do have that expertise – and pretty much every school has some – “How can you find ways to use it that are not so time-consuming?” Coaching is really intensive – you sit down with a person for an hour, you talk about stuff, and then you observe again. What’s been evaluated isn’t a one-session-every-half-term [model]. It’s much more intensive than that. So is it feasible? Maybe. But many schools will look at that and say, “We couldn’t do that.” Or they’ll think they could do it and then not succeed.

So how can we scale that more? How can we use that expertise? Feedback is a key part of that. We’ve tried to create feedback loops for people. We’ve got instruments they can use – feedback tools if you like. Things like student surveys – to ask the students their perceptions of teacher behaviours that are supported in the evidence as being aligned with effective practice. Is that their perception of what they’re seeing? Some people have qualms about student surveys, but it’s just one source of information. It’s coming in, it’s giving you another snapshot and a bit more insight.

We’re keen on using video as well. You can see yourself and other people can see you. You can have a conversation with a group of peers about what you’re seeing in a video and how that could be improved. Then systematically working at changing practice. There’s a lot here about habits and how expertise becomes routinised. It plateaus once you move out of a learning phase – which you’re in at the beginning, because it’s overwhelmingly complex – into more of a routine phase, where you’re not having to learn fast because you’re coping and it’s fine. Then things can become a bit ossified – you settle for that, because it’s working. Part of the issue is that you’re not then getting feedback that says, “Yes, it’s working, but this bit isn’t,” or, “This bit could be better.”

You have to frame it in a way that – it’s not a deficit. It’s not saying you need to improve because you’re not doing well. You need to improve because it’s a moral imperative: you work with children. You need to be as good as you can be. Even if you’re fabulous, you still need to be better. That’s a mindset thing, which some teachers and school leaders have – to improve, people need to believe that they can be, and need to be, better.

Then it’s a dual piece about the skills and understanding. Some approaches to professional development focus much more on understanding, thinking, being a reflective practitioner and doing research and inquiry. Some focus much more on the practice side. Coaching obviously ticks that box. But my view is that you need both sides of that: to build the understanding as well as the skills. They’re quite separate things. You need to explicitly design systems that do both, and integrate them as well.

Then the final piece is about the habits – you have to make that new practice something that is routine and habitual. The “what you do when you’re not thinking about what you do.” Again, explicitly building that habit-building process into the professional learning. So it’s not an accident. “I can do this now” is not enough. You have to get to the point where you can’t not do it, or where your default is to do it. Where, if you’re not thinking about it, that’s what you’re going to do. That’s not immediate. It takes quite a bit of time.

To link that back to the thing that you prompted yourself with – the theory is that all of those are scalable tools, inasmuch as there’s no limit to the number of people who can video themselves, do student surveys, or change their habits?

Yes, exactly. So the effects might be smaller – if I’m a teacher and I could have you come in and give me some coaching, that’s going to have a bigger effect on me than using any of these tools, I’m absolutely sure. But most people can’t. So what can we do that is completely scalable? So yes, student surveys. 150,000 students have completed these surveys; 5,000 teachers have had the feedback from them. I’m yet to hear anyone who says that it wasn’t useful – it tells them something they didn’t know.

Same with video. I’ve never seen anybody watch a video of themselves teaching and not learn something from it. And also not be surprised – you’re in the classroom every day, you think you know what’s happening. Then somebody shows you a video of yourself and you think, “Oh my goodness, is that really what I do?” Sometimes it’s trivial things. “That’s not how I thought I sounded.” But more often, it’s something that is important for teaching and learning.

So that perspective is valuable. It gives you some insight. It makes you think slightly differently. It opens up a possibility of a thing that you could improve. If you’re a teacher in a room with kids, it’s easy to go through many years not consciously thinking, “That’s something I could improve.” Because you don’t get that feedback. Some teachers seek it out – but mostly people don’t, because it’s more work. So why would you? Whereas if you use feedback tools like these, you can’t not think, “I can improve that thing.”

That’s not enough – you have to then be supported to do it. Research on feedback is a little bit disappointing. Feedback can be powerful, but it quite often isn’t, and there are lots of reasons why. One is that most people need quite a lot of help to implement feedback, to understand the changes they need to make, and then to do it – and for it to become a normal, routine part of their practice.

I do agree about the power of that initial insight – I analogise it to threshold concepts and this idea that it’s irreversible and transformational. You see yourself in a new light and you can never unsee that, even if it takes you a long time to change what you do.

You have been heavily involved with the EEF. Tell us a little bit about the role you think the EEF has played in raising the quality of research, and the level of knowledge among school leaders and teachers. Within that, I’d be curious to know how you see that changing over time. When I read some of the early evaluation reports and look at both the programmes being evaluated and the ways the evaluations were done, I think, “Really?” Whereas lots of things there have since changed quite a lot.

The advent of EEF was a very exciting time. It coincided with a huge chunk of funding for research activity, to be managed by EEF, but also lots of other funding being taken away – the Department for Education used to fund research, and they’ve more or less stopped doing that. In general, the climate for research funding tightened up through austerity. Those things combined to drive researchers’ behaviour quite a lot.

What EEF set out to do was to fund large-scale randomised controlled trials of interventions in schools. There would be these programmes that were out there – more or less manualised versions of things that were supported in research. The EEF could evaluate them, we’d find out which ones worked, then we’d just say to people, “These things work, do these things.”

I can remember one of the first – maybe it was even the very first – chatting with people at EEF (I wasn’t formally part of the organisation then) about, “What kinds of things should we start with?” We thought the absolute bankable certainty was that peer tutoring would have a positive impact. So that was one of the first trials – led by a team in Durham and evaluated independently. A couple of years down the road, we’ve done this trial. Lots of previous research shows peer tutoring is effective – quite big effect sizes, and the learning is there both for the tutors and the tutees – it seems a pretty robust finding.

Evaluation comes in: no impact. Everybody’s scratching their head thinking, “What happened here?” In a sense, that set the scene for then the next ten or so years – most of these trials have had quite small impact. Some of them have worked a bit. Quite often, where they have worked, they then have this pipeline which says, “We’ll do a bigger trial, and a more scalable version of the intervention, to see if we can get it to work at scale.” Then the second time it doesn’t work – so you get a positive result, you do a replication, and then the result isn’t replicated. To begin with, EEF would badge them as “promising programmes.” Then they backed off from that a bit, because it was a bit embarrassing to have to say, “It did look promising, but now we’ve changed our mind.” I don’t think that should be embarrassing – that’s how science works.

But the thing I hadn’t predicted was that the size of those effects was going to be smaller than what was in the original literature. Peer tutoring comes out with big effects in the literature – 0.8, 0.7, that sort of size. When you do a good trial of it, that’s reduced by a factor of ten. Plus, even at the scale they’re doing those trials, the confidence intervals around them are pretty big. As I’ve said, that’s just the sampling error – that’s not the whole error. Let’s think of it as bigger than that, even. So most of them are not big enough, given the uncertainty around those estimates, for us to be very confident that the effect isn’t zero, let’s say, or that it’s positive. That’s a bit frustrating maybe. It certainly was surprising for me, because I had thought we would replicate what was in the literature. That hasn’t happened. So I think we’ve learned quite a lot about that.

It’s the same issue in a way. It’s about scaling, implementation, and getting people to do the thing faithfully that you want them to do. In small-scale trials, it’s much easier to do that. You can have a more intensive intervention – a more faithful rendition of it. Therefore you do see bigger effects. There are other reasons as well, but I think that’s a big part of it. So that’s quite tricky.

But meanwhile, the appetite for doing randomised controlled trials has grown. Lots of researchers have had to learn how to do them in order to create the volume of work. So the debate around it has reached a higher level. It’s no longer, “You shouldn’t do this thing, it’s unethical.” It’s much more about, “What about this kind of design? Can we overcome that threat by doing it this way?”

The trials that EEF has done, and continues to do, have become better in quality. They’re really good now. Things like what they call the implementation and process evaluation used to be quite thin, and are now quite sophisticated. We learn much more about the why, where, and how – not just an overall effect size at the end.

There’s much more we can do still there. I think one of the big missing pieces is about heterogeneity. These trials are set up, broadly speaking, to estimate an average effect. Sometimes they will estimate an effect for Pupil Premium as well. What they won’t tell you is, “Did it work in some schools and not others?” Because the school is the unit of randomisation. A school is either in or out – the design is not set up to be able to estimate effects for individual schools.

Yet there’s a fair amount of evidence that that’s where the main variation is – things work in one place and not in another. That might be at classroom or school level. We need to understand that much better, because when the study says, “Overall it’s small” – it’s one month [of additional learning], in EEF speak, which is a 0.06–0.07 effect size, a small effect – hidden inside that average will be some schools where it made a big positive difference, but also some where it probably had quite a negative impact. If you are a school leader trying to make a decision about whether to do this thing, you might say, “On average it’s a positive effect, so I should do it.” But it’s much more useful to know, “In schools that are like this, it has a big effect. If you’re not like that, then it probably has no effect – or even a harmful one.” That would be much more valuable knowledge. Yet we don’t have any of that, partly because of the way the trials are designed.

The implementation reviews were always the thing that struck me. Then, when I did the systematic review with Sam Sims of teacher professional development, and started looking at more of the American and other foreign studies, I realised nobody else was doing this. With most foreign studies, you get to, “What actually happened? You say you did all these things – did anyone turn up?” You never know.

Sam once described you to me, in reverent terms, as “a scientist.” He said, you’re happy to follow the evidence and update wherever it went. I think if you look back at your 2013 paper, you see that preference for – when you look at a classroom, or a programme – “Let’s get to the truth,” rather than let’s come up with a comforting narrative. How easy is it to be a scientist in education?

I think it’s easier now than it used to be. Back in the late ’90s and early 2000s, what I was advocating was a scientific approach to improvement, as well as to doing research. I think there’s a lot of research done in education, in education journals, that isn’t remotely scientific in that way. There’s a place for some of that, but it’s a bit out of balance. We’ve probably got too much that doesn’t use those norms about what constitutes good research. I’m all for debate and diversity, but you have to get the balance right.

But I think the mood has changed. Randomised controlled trials have become almost a default in many cases. It’s hard to imagine, or to remember, that quite a short time ago they absolutely weren’t – literally no one was doing them. There were groups of people who were advocating for them. I can remember, we counted up the number of randomised controlled trials that had been done in England that weren’t health related. In the late ’90s, you could count them on the fingers of one hand – ever. That’s definitely not the case now. We’ve done hundreds – big ones, good ones, more and more being done, and not just by EEF either. So I think that more scientific approach to research is much more embedded. A scientific approach to improvement is a bit less common, but there are definitely pockets of that in the system. Lots of organisations are trying to do that thing. So it’s definitely not a marginal voice anymore. Whether it’s mainstream enough, I don’t know. But I don’t feel like I’m swimming against the tide in the way that I did, say, thirty years ago.

It’s reassuring that at least some things have got better.

I think that’s a massive improvement, and it’s also not something I would’ve predicted. If you’d asked me back then, “Will we ever win this battle?” I would’ve said, “I don’t think so. I’m going to die fighting.”

Is it just the money to EEF, then, that allowed that? You said money comes in and disappears from elsewhere – but why wasn’t the EEF guided by people who were hostile to randomised controlled trials, given they were the vast majority?

It was pure chance. I don’t know – by some fluke maybe. Or maybe by clever design with the people who set it up – it got the right people in, who listened to the right people, and had the right views. I think Kevan Collins is a big part of that story. I didn’t know him before he was at EEF, but I was always very impressed by what he did there. He had a really clear line – “It’s going to be randomised controlled trials. It’s going to be high-stakes outcomes, reading and maths.” EEF now, and other organisations, are maybe a bit more flexible about that. EEF do quite a lot of things that aren’t randomised controlled trials now. I think that’s the right thing to do. But to begin with, it was a pretty hard line. I think that was necessary. It did change the culture and it changed the world.

Having the money does help, because if you’re a research funder, you can control what research gets done. So that’s a way to influence researchers, and that’s definitely been the case. There are still a number of people who think we do too many randomised controlled trials. Some people think they’re not good value for money in terms of what they tell us. I think those are good opinions that are worth debating. But that’s a very different world from the much more existential – “You shouldn’t be doing this thing at all.” Or, “You can’t do it because human beings are not determined.” I think we’ve got beyond that.

Thank you very much.

That’s been a pleasure. I feel like we could talk all day.

When I interviewed Daisy as well – it’s fun to reminisce and think back, because it’s now quite a long time. You’ve been doing this for much longer than I have, but – a lot of water under the bridge, and lots of policy frameworks and dozens of ministers.

I think all those questions are good questions, but they’re also ones that we are just speculating about – why did that change happen? Who knows?