“Are children really improving their skills over time, or are we just inflating the grades?”
In this episode, we speak to John Jerrim. John is a Professor of Education and Social Statistics, and the Director of the Quantitative Social Science Research Centre at UCL Institute of Education. He won the inaugural ESRC Early Career Outstanding Impact Award. He’s a prolific researcher: at the time of recording, his Google Scholar page listed 247 published papers. It’s now up to 258. His work has covered a wide range of topics, but I particularly wanted to learn from him about England’s performance in international tests – one of his specialist subjects.
We discussed:
- How John came to write so many papers – and how he knows whether a paper is a good idea
- The world of international tests: who runs them, who takes part in them and what’s in it for countries, schools, and students
- The reliability of these tests: in the early 2010s, John argued that England’s scores on international tests had fallen because of changes in testing methods. England’s test results have tended to rise since then – can we rely on those results?
- Why the next round of PISA results in England won’t be comparable with the last few rounds
- John’s take on why test scores have gone up in maths – and what hasn’t gone so well
- Other elements of John’s research, including a recent paper on diminishing student engagement, and a paper on the performance of students of East Asian backgrounds in Australia
John balanced the caution of a professional researcher with a willingness to give clear answers. “If you put a gun to my head,” he said twice, before giving his belief about what has improved and why.
You can listen to the episode on Spotify and Apple Podcasts, or read the transcript, below.
Transcript
John, thank you for joining us. Tell us about yourself. Who are you, what do you do? What should listeners know about you?
I’m John Jerrim. I’m Professor of Education and Social Statistics at UCL, and I also now work part-time at ImpactED. I’ve done a lot of research around the international assessments — PISA (Programme for International Student Assessment), PIRLS (Progress in International Reading Literacy Study), TIMSS (Trends in International Mathematics and Science Study) — but also, more generally, lots of randomised controlled trials and impact evaluations in education, and inequalities in education. Anything involving quantitative methods, I’m interested in.
Despite your youthful appearance, Google Scholar lists 247 papers that you’ve co-authored. Which to date are your favourites or which should go on your headstone?
When I got that sent through – it was 240 then – I spat out my coffee a little bit. I hadn’t realised it had quite got to that slightly excessive number. But the ones I look back on most fondly are, to some extent, about the collaborations around the papers. I did one around international assessments with a former PhD student of mine, Laura Ziegler, who did a whole three-year PhD on some of the technicalities around PISA. It was getting into some of the gory detail, which a geek like me particularly enjoys. Other ones: I think the stuff I did with Christian Bokhove and Sam Sims around Ofsted, because there’s absolutely nothing out there on consistency and reliability. We moved the dial slightly, not hugely, but in a land of nothingness, I think we look back on that project and think that was a good one as well.
Tell us about the John Jerrim production function. How does one write 247 papers? When you sit in the office, you magic things up. You’re on the lookout for ideas. Once you’ve got an idea, how do you get from there to writing a completed paper?
Some of them build on each other. You write one paper, and halfway through you think of another paper to write and spin it off the back of it. But I’ve got to the point – and I’ve noticed a few academics are like this – where you have two or three ideas a week for a new paper, and you can’t possibly do them all. What you tend to do is think of these ideas and then sit on them for four months, six months, eight months, and if they still seem a good idea six months down the track, that’s probably where you’ve got a winner. I’ve got to the stage where, rather than having an idea and rushing out to try and do it, it’s often: let’s sit back, and if I still think it’s a good idea in X amount of time, it probably is one.
Let’s get onto international tests. What are they? What are they for? Who’s running them?
The big one’s PISA: the one that everyone’s heard of, an international study run by the OECD (Organisation for Economic Co-operation and Development) across 80 countries, running since 2000 and conducted every three years. The other main ones, which get less attention – I’ve got a paper about how much less attention they get – are PIRLS and TIMSS. They’re run by the IEA (International Association for the Evaluation of Educational Achievement) out of Germany: assessments of maths, science and reading skills amongst 10- and 14-year-olds. There are other international assessments as well: there’s PIAAC (Programme for the International Assessment of Adult Competencies), the adult assessment, and there are other studies – ALLI and some others in other parts of the world that are very much regional ones, conducted in sub-Saharan Africa and so on. But those are the main ones that I’ve worked with.
We’ve got GCSEs, we’ve got A-levels, countries pay to participate. Why do we pay? Why do we care? What do we get from them that we can’t get locally?
One of the arguments for many countries taking part is that, to some extent, they’re an independent benchmark. If we go back to the 2000s, what were we seeing? Big grade inflation, so the argument could be made, “Are our children improving their skills over time, or are we inflating the grades?” That’s one motivation for countries taking part. You could argue that now we’ve got the national reference test, maybe that’s less important at GCSE level, but I still think it’s the motivation I see. The other big benefit of the UK taking part is that it’s one of the few data sources we’ve got that lets us compare educational outcomes across England, Wales, Northern Ireland, and Scotland, which is interesting in and of itself. You’ve got the benchmarking against other countries. When these things are first run, that’s a big new interesting thing – perhaps less so now, because fewer new exciting things are coming up. To be honest, why do we still participate in them? All the OECD countries do PISA. If you pull out, that’s a big political thing – even when Scotland pulled out of doing TIMSS, it was a big political thing. Other countries can poke at it, and that’s one of the reasons why we will always do PISA: you want to avoid the political embarrassment of dropping out.
Every time we go through a PISA cycle or a TIMSS cycle, those of us who follow these things excitedly get these headlines. “Finland’s gone up, Singapore’s gone down,” whatever it is. If I open up the newspaper and it says, “England’s now in the top 10”, or “is out of the top 10,” how much credence should I give to those headline statements?
I always think of this in two ways when deciding whether to get interested in some change over time in these studies. The first is: has it swung 10 points either way? Within 10 points, it could well be sampling error or non-sampling error or whatever. 10 points on these things is a big swing. The second thing I always look for is: are we seeing this sustained over time? Are we seeing a genuine trend? It is not uncommon in these international studies – not only for the UK, but also for other countries – to see some weird spikes. There was one occasion in Ireland where they had a big spike in their reading test scores one particular year, and it went away the next cycle. Those things happen. Quite often, when there’s a big exciting result, it isn’t real, to be honest, and there’s something else strange going on. I do feel they get overplayed to some extent.
If I see a headline saying, “X is good,” what you’re saying is “wait three years, four years, however long the cycle is, and we’ll see if we can believe in it.”
Yes. There are some countries that always appear at the top of these rankings. The East Asian countries will always appear at the top of the maths rankings – always have done, probably always will do. I believe that. I believe they probably are the best countries in the world, or have the kids who are the best at maths in the world. That’s fine. Take into account test effort, which could come into it as well, but I’ll ignore that for the moment. It’s the changes over time that are particularly thorny, where small changes get over-interpreted quite a lot of the time.
Thinking specifically about the quality of data we’ve got about England, can you tell us a bit about some of the reasons we might be a bit wary of English results? What are the issues there?
We’ve had issues with the data for England for a long time. Ever since the early PISA cycles in 2000 and 2003, we’ve ended up with very low response rates. We got kicked out of the 2003 results by the OECD. Not many countries have had that honour – I’m not sure honour is the right word – but it reached a level where they didn’t have confidence in our results. The response rate improved to some extent over time, but there are still definitely some issues in the data, in that lower-achieving and more disadvantaged students are less likely to participate in the study. That’s still the case, probably quite stable over time, but it’d be better if we had some stronger evidence around it.
Every time PISA happens, something seems to happen that could potentially screw up the results a little bit. A good example is 2015, when PISA moved from paper-based to computer-based testing. They’ll swear on their lives that everything’s comparable and nothing’s changed. That’s not true. There should be a break in the series – my strong belief is that they didn’t manage that transition, in terms of measuring trends, as well as perhaps TIMSS has done and perhaps they should have done. Obviously, between 2018 and 2022 you had the pandemic, which throws these things all over the place. It also had some bearing on the timing of the testing across the different countries. It makes comparing data and performance over the 22 years PISA has been going more challenging than first meets the eye.
With non-response, the issue is that PISA is meant to sample the whole nation of 15-year-olds – Year 11s, in our case. Some people don’t end up in the sample, because schools drop out and students aren’t there on the day. Does that seem to be fairly consistent across years? You’ve suggested that up to 40% of students who should be included aren’t. Is it 40% every time? Can we say it washes out, or does it vary and make things even more complicated for us?
It’s an area where we need to know more. The place that should be able to tell you more, but can’t, is the Department for Education and their international statistics team. The non-response rate has probably been pretty stable-ish over time, particularly from the 2006 cycle onwards. We know less about the selectivity of the response across those cycles. If someone was to put a gun to my head and say, “What do you think?”, my guess would be that the bias in the trend over time for England, at least from 2006 onwards, is probably relatively small. The absolute position relative to other countries was probably too high. But in terms of trends over time, I would say it probably comes out in the wash – though I’m to some extent using an educated guess on that.
I guess that’s what you get to do as a Professor of Social Statistics. We think that from 2006 onwards we’re happy-ish with the trend. You said we were higher up the table than we should be. Why do you think that is?
Because our response rates tend to be lower than in a lot of other countries. In some respects, it’s difficult because we’ve got great administrative data that we can look at, which gives us enough rope to hang ourselves with. Other countries also have some problems with non-response, but they are less well-equipped to look in and think about the bias around it. Canada’s a good example of that – I’ve written papers about this. They’ve actually got quite high levels of exclusions and non-response, but they don’t have such good published administrative data to be able to probe it quite as much as you’d want. Having said that, I still feel that, at least relative to some other countries, we’re probably a bit too high.
In 2013, you published a paper which criticised those, particularly politicians, who were claiming that scores had fallen and everything was going wrong in English schools. You said that fall had been substantially overstated. Why was that?
A few reasons. One was this non-response issue that got completely ignored: we don’t know exactly what the bias was in 2000 and 2003, but it’s likely to be upward bias. The OECD bothered to boot us out, which they don’t do lightly, so they obviously had some concerns around the data. But there were also quite a lot of other changes made to the PISA assessment. In those early cycles, we tested a mix of Year 10 and Year 11 students at different points in the academic year. Since 2006, we’ve only tested Year 11 students towards Christmas time.
Interestingly, in the next cycle, we’re going back to testing nearer GCSE time in March, and it’s going to be, I believe, a mix of Year 10 and Year 11 students again. It looks like we’re going back to the future. That is not a good thing for PISA in England, because what Year 11 student wants to be doing a PISA test in March or April, just before their GCSEs? Mad. The final thing I would say on the question you asked is that we do have evidence from other data as well, in terms of the TIMSS dataset. We were an unusual country: apparently going down rapidly in PISA, but looking to be going up in TIMSS. We were a big outlier, with two different studies telling us two quite drastically different stories. The politicians picked the narrative that best suited what they wanted.
Well, they are politicians, aren’t they? It’s not that surprising. Since 2013, or 2010, or wherever you want to draw the line, it looks like England has done pretty well on these tests. You were sceptical about the earlier decline. To what extent would you agree that we can say, “Academic results have got better for students in England”?
Again, if someone was to hold a gun to my head and say, “What do you think?”, I would say probably the strongest evidence is in maths, where I think there has been a degree of improvement over time. You were seeing it in PISA before the pandemic. You were seeing some reasonable evidence from the national reference test. The latest TIMSS results, which I’m still not sure I completely trust, seem to be painting quite a positive picture around maths since the pandemic: no negative effect, possibly some positive effects. If I’m convinced anywhere, it’s probably maths. I do think we have genuinely seen some improvement in our performance over time.
Literacy, I’m less convinced about. I’ve always had the view that the literacy scores in these international assessments are generally pretty stable and don’t change that much over time. In most countries, most of the movement tends to be random-ish. Science, even less so – in fact, off the top of my head, I would be surprised if science were anything other than stable. Maths is where I’m most convinced the strongest gain has been made.
What do you think about PIAAC? This is a survey of adult skills, also run by the OECD, and there seems to be this jump among 16 to 24-year-olds between the 2011 and the 2021 waves in both literacy and numeracy. And it’s a higher jump than any other country manages. How do you rate PIAAC? Because obviously it uses a very different method: they’re literally knocking on people’s doors trying to get them to do skills surveys.
I was going to start to do some stuff with the latest PIAAC round – the data only came out six months ago. Then I looked at the response rate and thought, I can’t be bothered, because it’s gone down drastically over time. From what I remember, last time it was maybe around 60-65%, which is workable. It’s down to around 30% this time – only the Labour Force Survey is looking worse than that these days. The data quality for changes over time there looks particularly tricky. I have some particular issues with adult assessments as well, especially amongst 16 to 24-year-olds. You have very little incentive to try your best. I’ve always got this image of hungover students doing these tests: “Just give me my tenner, I’ll complete this, and go away.” I believe that in school, most kids probably put in a reasonable degree of effort – it’s done in normal assessment conditions. The adult assessments, I think, you probably need to take with a little bit more of a pinch of salt.
I was struck as well by the study that offered Chinese and American students money for getting test questions right. They found that the Chinese students didn’t work any harder on the maths questions, while the American students did. I feel like British students are probably culturally closer to American students than to Chinese students. That makes me wonder whether that reduces our scores.
I agree, and I feel this is probably the biggest challenge with the international assessments. Having said that, I believe that most kids probably try to put in a reasonable degree of effort. But a significant minority probably don’t put in maximum effort, because they’ve got nothing riding on the results. They don’t even get told the results. The schools don’t usually get told their results back either – sometimes they have done, but usually they don’t. There’s very little incentive to be trying your best on them. Test effort could well be playing a big part in the results.
The schools don’t get anything? They do this for the glory of the nation, and they’re doing the research as a favour. They don’t get any compensation for the time and effort or anything?
I’m trying to think back. I think they might get paid – some money in lieu of teacher time. Whether that’s enough or not, I’m not sure. In 2015, we did give them back a nice report around their results. They might still do something. But they don’t get a huge amount. Interestingly, I think it was maybe the 2006 cycle, before austerity hit, when they were trying to up the response rate after getting kicked out in 2003. The old Labour government was awash with money compared to today, and I think every school got to send one of their teachers to Paris for the launch of the results. That’s an incentive.
You don’t get junkets like that anymore. We need to bring back prosperity in some way. It is interesting to see how England seems to have separated itself from the other home nations, and from your calculations, it looked like the overestimate was pretty similar between them. Do you think that English students are doing substantially better than Welsh and Scottish students are, or were?
Wales has certainly had quite significant problems in the international assessments for a little while, and to some extent compared to the other countries as well. I would say there’s a decent enough claim that we’re above them, at least in maths.
So we’ve seen some improvement, particularly in maths. What would you attribute that to?
That’s the million-dollar question. That’s what everyone always wants to tease out of this data: “This country’s doing well at this. Why is that?” And it’s always difficult to say exactly why. If I was to say anything on this front, I would say there has been a lot of attention to maths, key skills, and the knowledge-based side of things – “You will know this” – with a particular focus on mastery and on the bottom tail, helping to lift them up. I do think that probably had something to do with it: a fairly heavy focus on a knowledge-based maths curriculum is probably doing something, maybe. The former government would probably love that.
That’s one of the interesting things about trying to dig into this at this point. You’re saying curricular changes have changed what students are learning and mean that they’re picking up the basics as we want them to. Is that right?
This is all a little bit hand-wavy. It’s difficult to put your finger on exactly why, but I think – to be fair to the previous government – they did have a very clear focus in many ways on, “This is what we’re going to try and do.” I think part of that was knowing it would probably help improve things on the international assessment side. There was very much a key-skills focus, with maths being one of them.
We’ve mentioned lots of positives, or at least some positives. What do you think has been working less well that maybe isn’t captured by these tests over the last 10, 15, 20 years?
That’s a good question. The teacher labour market is obviously an interesting one, in terms of lots of teachers leaving the profession. We did do TALIS (Teaching and Learning International Survey), the international study of teachers – we did 2013 and 2018. They’ve got a new cycle, 2023 or 2024, and the former government pulled out of doing that, probably because it was showing things weren’t looking great on that side of things. That’s probably an area where we’re not doing so well, which you would think would then feed in at some point into pupils’ learning. In terms of other areas not doing so well – I don’t know, I’d have to think a little bit more. Persistent socio-economic gaps not reducing.
You’ve got a recent paper about student engagement falling over time.
Student engagement has definitely become an issue – one that was very much in focus around the pandemic. The pandemic, to some extent, you can see as almost a little bit of a free hit, in that it was a thing lots of places had to manage. But it does seem we’ve got an increasing problem in England in terms of measures of school engagement and school belonging, particularly the decline that happens between Year 5 and Year 9. That fall seems to be sharper in England than in other countries. With the work we do at ImpactED – a report that was out yesterday – we’ve narrowed that timing down quite specifically, and there seems to be a lot happening between the autumn and spring terms of Year 7. After kids have made the transition, as they’re getting used to secondary school, that seems to be a challenging time when kids start disliking school.
Do we have any diagnosis and what do we think schools should do differently?
You’d be better off asking the other guys at ImpactED what their recommendations would be, because they’re now doing lots of follow-up case studies around that very issue. I think there’s going to be more coming in due course, getting into the nuance of what can be done to resolve it.
Not directly related to this, but I think relevant and one of your many interesting papers is the work on East Asian students’ achievement in Australia. What did you look into and what did you find?
This paper’s old now, but I think the message still applies to this day. It was written at a time when we were looking very much to East Asia. The Conservative government – Michael Gove – were pointing to Shanghai and Singapore, going, “They’re doing brilliantly. Their education system must be brilliant.” One of the arguments that’s always made is: to what extent is it curriculum and teaching methods versus other things, like culture and the drive and determination to succeed in education? What I did in that particular paper was look at East Asian children who were living in, and had been brought up in, Australia. They’d been brought up in a Western education system, following very much Western curricula and Western teaching methods – but they obviously have the East Asian heritage: potentially tiger parents, determination, value placed on education. What we ended up seeing was that those East Asian pupils in Australia tended to do just as well on tests like PISA as their peers living in Singapore, Hong Kong, Shanghai, wherever. The conclusion was that a lot of this may well be more about cultural background and the value placed on education than about the education system and the particular nuances of teaching and curriculum methods per se.
They’re experiencing the Australian curriculum like everyone else, but something’s carried over from their background – working hard or home tutoring or whatever – that allows them to get more out of the curriculum than other Australian kids.
Exactly that.
What are you working on at the moment? What are you excited about at the moment?
I’m working on several projects at the moment. We’ve already mentioned the school engagement stuff – doing a lot more work in that area, as well as looking at teacher engagement alongside student engagement and linking that to retention. That’s a big area of interesting research. I’ve got a project with Sam Sims, the National Institute of Teaching, Becky Allen, and Rob Coe, where we’re trying to estimate teacher value-added in England for the first time – purely for research purposes, I should say. The idea is: can you identify, at least in the data, more effective versus less effective teachers in terms of boosting children’s test scores? That’s an interesting one. Very difficult. Very tricky. Not for the faint-hearted, but it’s interesting. I’ve also got an educational assessment piece where we’re trying to work with the Key Stage 2 test score data and work out whether we can report sub-domain scores back to schools to help inform their teaching and curriculum. Can we tell them how well they’re doing on number versus geometry versus other areas? Lots.
All useful and immediate things that schools can benefit from and put to work. Where can readers find out more about your work?
They could either go to my UCL profile or I’ve got my own website, johnjerrim.com. One of the useful things about having an unusual surname is I’m incredibly easy to Google. As long as you can spell my surname correctly, I pop up almost instantly.
Further reading
John has written several papers on the reliability of international tests, including:
- Jerrim, J. (2013). The reliability of trends over time in international education test scores: is the performance of England’s secondary school pupils really in relative decline? Journal of Social Policy, 42(2), 259-279.
- Jerrim, J. (2016). PISA 2012: How do results for the paper and computer tests compare? Assessment in Education: Principles, Policy & Practice, 23(4), 495-518.
- Jerrim, J. (2021). PISA 2018 in England, Northern Ireland, Scotland and Wales: Is the data really representative of all four corners of the UK?. Review of Education, 9(3), e3270.
- Jerrim, J., Lopez‐Agudo, L. A., & Marcenaro‐Gutierrez, O. D. (2022). The impact of test language on PISA scores. New evidence from Wales. British Educational Research Journal, 48(3), 420-445.
During the conversation, we also referenced his papers on:
- Declining student engagement in England.
- The performance of students of East Asian descent in Australia.
Thank you to Aeron Laffere for the wonderful theme tune for the podcast.