Quite often I hear people claiming that “VO2max is not correlated with running performance.” Is that true?

What about similar correlation-based claims about metrics like body composition, running economy, mileage, and maximum heart rate?

Usually, when people make these kinds of claims, they link to a study showing that, among a certain group of runners, the metric in question (e.g. VO2max) does not accurately distinguish between the fastest and slowest runners in the group.

Another version of these claims comes in the form of statements like “among runners with a similar VO2max, those with better (running economy / lactate threshold / some other performance metric) have faster race times.”

So, what should we make of these kinds of claims? My goal in this article is to show why **correlations will always become weaker when you restrict your analysis to a small subset of a population, like elite athletes**.

Let's dig in and see why.

I'm not aiming to directly litigate any specific claims regarding VO2max or body composition or anything else; rather, I want to make a more general point about the issue with drawing conclusions from correlations in small groups with selective inclusion criteria.

This is *not* a “correlation is not causation” article; that’s a topic (indeed, a surprisingly complex and subtle one) for another day.

I’ll be using the recurring example of VO2max and its correlation with performance among elite runners, but only as an illustrative case: the same arguments apply generally.

We’ll close with an examination of better ways to think about the relationship between a metric like VO2max and what levels of performance you can achieve.

This article is a bit more technical than usual, so you’ll have to excuse the statistical deep-dive. It'll be worth it in the end!

## What does it mean for VO2max to be ‘correlated with performance’?

Correlation is a classic statistical measure of how one variable relates to another. Usually it’s denoted as R. It’s actually easier to understand the intuitive definition of the square of correlation, i.e. R^{2}.

That value is equal to the variation in our outcome (say, race performance) that’s explained by the predictor variable, divided by the *total* amount of variation in the outcome. Correlation is just the square root of this value.

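Or, mathematically:

$$R^{2} = \frac{\text{variation in the outcome explained by the predictor}}{\text{total variation in the outcome}}$$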

Take note of that denominator, **the total variation in the outcome**. If there is *any* source of variation in our outcome that isn’t fully explained by our metric of interest, R^{2} will necessarily be less than 1.0.

It’s easier to build intuitions with a few plots. Here are a few examples of different values for R:

The corresponding R^{2} values are 0.09, 0.36, and 0.81 (which shows one unfortunate property of R^{2}: it has awkward, nonlinear scaling).
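To make the "explained divided by total" definition concrete, here is a minimal sketch in Python using simulated data. The helper names (`r_squared`, `simulate`), the sample size, and the random seed are my own choices for illustration, and the target R values of 0.3, 0.6, and 0.9 are simply the square roots of the R^{2} values above.

```python
import random

def r_squared(x, y):
    """R^2 of a least-squares line: explained variation / total variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    explained = sum((fi - mean_y) ** 2 for fi in fitted)  # numerator
    total = sum((yi - mean_y) ** 2 for yi in y)           # denominator
    return explained / total

def simulate(true_r, n=5000, seed=0):
    """Simulated (x, y) pairs whose population correlation is true_r."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    noise_sd = (1 - true_r ** 2) ** 0.5
    y = [true_r * xi + rng.gauss(0, noise_sd) for xi in x]
    return x, y

for true_r in (0.3, 0.6, 0.9):
    x, y = simulate(true_r)
    r2 = r_squared(x, y)
    print(f"target R = {true_r:.1f}: sample R^2 = {r2:.2f}, sample R = {r2 ** 0.5:.2f}")
```

The sample values will wobble a little around the targets because the data are random, but the pattern should hold: halving R cuts R^{2} to roughly a quarter.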

## What is VO2max in running, and how is it measured?

If this article has piqued your interest, you probably know what VO2max is and how it’s measured, but in case you don’t, here’s a refresher.

VO2max is an estimate of the maximum amount of oxygen that your body can extract from the air per minute. It’s a standard measurement in an exercise physiology lab, and it involves fitting the athlete with an airtight mask with a sensor that measures the oxygen concentration in the air the athlete breathes in, and the oxygen concentration in the air the athlete breathes out.

The difference between these two (multiplied by the volume of the breath) is the volume (V) of oxygen (O2) consumed for that breath: **VO2**.

If you put this athlete on a treadmill at a slow speed, and ramp up the speed every one to two minutes, the athlete’s oxygen consumption will steadily climb as the speed increases, until one of two things happens: either (a) VO2 will reach a clear plateau, showing no further increase even as the speed keeps increasing, or (b) the athlete has to stop because of fatigue.

Situation (b) happens more often than you’d think, by the way; something like half of elite athletes do not show the classic “VO2 plateau” that you see in traditional exercise science textbooks [1]. This finding calls into question whether VO2max is *really* a maximum, hence my careful wording above that “VO2max is an *estimate* of maximum oxygen uptake.”

In any case, the maximum value of VO2 reached during the test—either the plateau value, or the highest value reached before the athlete stops—is taken as the maximum volume of oxygen that an athlete can extract, and is called VO2max.

VO2max is traditionally normalized to body mass, so the units of VO2max are milliliters of oxygen consumed, per kilogram of body mass, per minute: mL/kg/min. The average healthy young adult might have a VO2max around 30; fit runners tend to score in the 50s, 60s, and 70s, and elite runners tend to be north of 70.
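The arithmetic behind those units can be sketched in a few lines of Python. This is a deliberately simplified illustration: the function name and every input value below are hypothetical, and a real metabolic cart applies additional gas-volume corrections that I'm ignoring here.

```python
def vo2_ml_per_kg_min(breath_volume_l, breaths_per_min, o2_fraction_in,
                      o2_fraction_out, body_mass_kg):
    """Simplified VO2: (O2 in - O2 out) per breath, scaled to mL/kg/min."""
    # Litres of oxygen consumed in one breath
    o2_per_breath_l = breath_volume_l * (o2_fraction_in - o2_fraction_out)
    # Litres of oxygen consumed per minute
    o2_per_min_l = o2_per_breath_l * breaths_per_min
    # Convert litres to millilitres and normalize to body mass
    return o2_per_min_l * 1000 / body_mass_kg

# Hypothetical values for a hard-working 65 kg runner (assumed, not measured)
example_vo2 = vo2_ml_per_kg_min(
    breath_volume_l=2.5,
    breaths_per_min=45,
    o2_fraction_in=0.2093,   # approximate oxygen fraction of room air
    o2_fraction_out=0.165,   # assumed oxygen fraction of exhaled air
    body_mass_kg=65,
)
print(f"VO2 = {example_vo2:.1f} mL/kg/min")
```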

## Why we might expect VO2max to be correlated with performance

Intuitively, it makes sense why VO2max should be correlated with performance (meaning *race times*): higher oxygen consumption indicates a greater ability to produce energy aerobically, which means you are better able to “mop up” the excess metabolic byproducts generated when churning through lots of ATP to produce muscular force.

Indeed, it certainly *seems like* VO2max should be correlated with faster PRs: elite athletes have a VO2max that’s often double or triple that of a sedentary person, even when matched on body weight.

📚 If you want to learn more about VO2max, ATP, and running performance, check out my book, **Modern Training and Physiology**!

## Why we might not expect VO2max to be correlated with performance

Skeptics of the importance of VO2max correctly point out that many *other* factors beyond oxygen consumption are associated with performance too: how efficiently are you using the energy burned with that oxygen? How mentally tough are you? What fraction of your maximum aerobic power output can you sustain for a given duration? What’s your lactate threshold?

These are all valid points. Even in the context of lab-based testing, you could imagine an alternative universe where physiologists focused obsessively on running economy—usually quantified as oxygen consumption at a set speed—instead of VO2max.

In that universe, we’d be talking about who the most economical runners were, and what training you could do to increase your running economy. And then maybe some renegade would come along and point out that among runners with similar running economy, those with a higher VO2max run faster!

Again, I don’t intend to fully litigate the pro- vs anti-VO2max argument here. This is just providing some context for a relatively uncontroversial point, which is that other factors in addition to VO2max definitely affect performance.

## VO2max and performance both include errors and inaccuracies

In-lab physiology testing has an air of precision to it, but the truth is that even the fanciest commercially available equipment to measure oxygen consumption has around a 2% variation in repeated measurements on identical samples of gas [2]. Even with research-grade equipment, VO2max measurements can be off by as much as 15%!

The same is true for performance: while the timing system itself is super precise, any given athlete’s times will almost certainly have 1-2% variation depending on how the race unfolds, tactical decisions, and day-to-day variability in how you feel, even if your actual fitness is constant.

Remember, both of these sources of error affect the correlation we measure when we treat a single VO2max measurement and a single race performance (personal record or otherwise) as the “true values” for that athlete. They add to the total amount of variation in our data, which, remember, is the denominator when we calculate R^{2}.
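Here is a rough Python sketch of how measurement noise alone drags a correlation down. Everything in it is simulated: the population parameters are invented, the 5% VO2max measurement error is an assumed value somewhere between the 2% and 15% figures above, and the 1.5% race-day variation sits in the 1-2% range mentioned earlier.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 5000

# Hypothetical "true" values: VO2max in mL/kg/min, race speed in km/h
true_vo2 = [rng.gauss(55, 8) for _ in range(n)]
true_speed = [8 + 0.12 * v + rng.gauss(0, 0.6) for v in true_vo2]

# What we actually observe: true value times a random percentage error
measured_vo2 = [v * (1 + rng.gauss(0, 0.05)) for v in true_vo2]       # assumed 5%
measured_speed = [s * (1 + rng.gauss(0, 0.015)) for s in true_speed]  # assumed 1.5%

r_true = pearson_r(true_vo2, true_speed)
r_measured = pearson_r(measured_vo2, measured_speed)
print(f"R using true values:     {r_true:.2f}")
print(f"R using measured values: {r_measured:.2f}")
```

The underlying relationship is identical in both cases; only the noise added on top has changed, and the measured correlation comes out lower.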

## Your sample: A crucial factor in determining the total variation in your outcome

In addition to the random errors in the measurement of VO2max and race performance, there is one very big factor that’s going to affect the calculation of the correlation between these two variables: **who we include in our sample**.

What’s our sample? It’s just the group of runners we’re looking at in our study. *But this is a key factor in our correlation calculation*!

Why? Because if we restrict our sample to only runners of a certain level (say, international-caliber distance runners), *we will severely limit the total variation in our outcome.* What's more, we'll do this in a systematic way, by including people who overperform and excluding people who underperform.

Let's look at some real data to see this effect in action.

## An example with real data: VO2max and its correlation with 3k performance

There’s a very nice dataset from a paper by Anders Aandstad at the Norwegian Defense University College that we can use to explore correlations between VO2max and race performance, as well as understand the more general phenomenon of sample restriction and correlation.

The paper, published in 2021, studied the relationship between lab-measured VO2max and 3000m run performance in 259 military cadets in Norway.

This is a nice sample population since we’d expect a wide range of variation in both VO2max and performance level. Indeed, when we look at the distribution of both, we see a reasonably strong correlation between VO2max and 3k race speed:

The blue line shows a linear regression model fit to the data. The linear fit represents the average 3k speed among people with a given VO2max.^{[1]}

The fact that the real data do not lie perfectly on that line is a result of both measurement errors and the additional factors that affect 3k performance beyond VO2max, like running economy and lactate threshold. This error is called the *residual error*, which makes sense in that it’s the error that’s “left over” after fitting our linear model.

The correlation seen in the plot above, 0.81, is reasonably strong by human performance research standards. The R^{2} value, 0.66, indicates that 66% of the variation in 3k speed can be explained by lab-measured VO2max *for this population*. That last qualifier is very important, as we’re about to see.

## Restricting your sample decreases the strength of a correlation

Now, using the same data, let’s replicate our analysis, but restricting our sample only to those with a 3k time faster than 11:00 (maybe this is the cut-off time for entry into a competitive race, and we’re doing a study using only subjects who qualified for this race). Let’s arbitrarily call these athletes with 3k times below 11:00 “elite athletes.”

This example uses a hard cut-off, but similar effects are at play when there are soft probabilities involved. For example, say you study NCAA DI male distance runners at Power ~~5~~4 schools; 9:30 high school two-milers are not going to be totally absent from your sample, but they will be less likely to be included than 9:10 high school two-milers.

Now what happens to the correlation? It gets smaller. *Much* smaller! R is now 0.48 and R^{2} is now 0.23. Why? *Nothing* has changed about the underlying data, or presumably the biological process that generated the data.

What’s happened is that our denominator in the R^{2} calculation has gotten much smaller—we’ve artificially decreased the variation in 3k speed by including only the faster runners.

Our elite cut-off excluded runners with VO2max values in the 55-65 mL/kg/min range who had slower 3k performances, so among our restricted sample, it looks like VO2max is less predictive of performance than it really is.

Another way of thinking about this problem is that **runners who are good at the 3k for reasons other than their VO2max are more likely to be selected into the sample**, since they are the ones who tend to overperform among runners with the same VO2max.

As a consequence of this selection effect, VO2max is seemingly “uncorrelated” (or at least less correlated) with performance when you look at high-level runners.

The selection effect is clearer when we plot the elite data and its correlation alongside the entire dataset:
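You can reproduce the same shrinking correlation with purely synthetic data. The sketch below does not use the cadet dataset; the population parameters (mean VO2max, slope, and residual spread) are numbers I made up so that the full-sample correlation lands somewhere near 0.8, and the 11:00 cut-off is converted to a speed in meters per second.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)
n = 5000

# Synthetic population (assumed parameters, not real data):
# VO2max in mL/kg/min and 3k speed in m/s, linked by a line plus noise.
vo2max = [rng.gauss(52, 7) for _ in range(n)]
speed = [3.97 + 0.052 * (v - 52) + rng.gauss(0, 0.26) for v in vo2max]

# "Elite" cut-off: a 3k faster than 11:00, expressed as a speed
cutoff_speed = 3000 / (11 * 60)
elite = [(v, s) for v, s in zip(vo2max, speed) if s > cutoff_speed]
elite_vo2 = [v for v, _ in elite]
elite_speed = [s for _, s in elite]

r_all = pearson_r(vo2max, speed)
r_elite = pearson_r(elite_vo2, elite_speed)
print(f"All runners   (n = {n}): R = {r_all:.2f}")
print(f"Elite runners (n = {len(elite)}): R = {r_elite:.2f}")
```

The line-plus-noise process that generated the data is the same for every simulated runner. The only thing that differs between the two correlations is who was allowed into the sample.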

## A better approach: What range of performances is possible for a given VO2max value?

There is a better way to approach correlation-type questions, which is to look at the *range of performances* that are possible for a given level of some metric. In this case, that means asking what range of 3k times is possible for a given VO2max. Happily, the same linear regression model from above can provide this answer.

For the statistically minded, what I’m talking about here is a **prediction interval**. Note that this is not the more common confidence interval about the mean. A prediction interval gives you the range of outcomes (here, 3k performances) that are compatible with a given value of the predictor (here, VO2max).

In our case, we can ask “with 90% confidence, what range of 3k performances are possible with a given VO2max?”

Another way of thinking about a prediction interval is as a statement—“Among runners with a given VO2max, 90% of them will be able to run within this range of times for the 3k.”^{[2]}

90% is of course completely arbitrary; any level of confidence can be used as desired. We can visualize prediction intervals for the data above as follows:

This plot allows an easier visualization for the range of plausible 3k times we’d expect to see for a given VO2max value. For example, among a large group of runners with a VO2max of 60 mL/kg/min, we’d expect 90% of them to be able to run between 10:22 and 12:37 in the 3k.^{[3]}
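For readers who want to see the mechanics, here is a minimal Python sketch of a 90% prediction interval from a simple linear regression. Again, this uses synthetic data with parameters I invented, not the cadet dataset, so the printed times are illustrative only. The helper names are my own, and I use a normal quantile in place of the usual t quantile, which is a reasonable approximation only when the sample is large.

```python
import random
from statistics import NormalDist

def fit_line(x, y):
    """Least-squares line plus the quantities needed for prediction intervals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = (sse / (n - 2)) ** 0.5  # spread of the residual error
    return {"n": n, "mean_x": mean_x, "sxx": sxx,
            "slope": slope, "intercept": intercept, "resid_sd": resid_sd}

def prediction_interval(fit, x0, level=0.90):
    """Range expected to contain `level` of new outcomes at predictor value x0."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # large-sample approximation
    center = fit["intercept"] + fit["slope"] * x0
    se = fit["resid_sd"] * (
        1 + 1 / fit["n"] + (x0 - fit["mean_x"]) ** 2 / fit["sxx"]
    ) ** 0.5
    return center - z * se, center + z * se

def format_time(seconds):
    """Format a duration in seconds as m:ss."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"

# Synthetic population (assumed parameters): VO2max in mL/kg/min, speed in m/s
rng = random.Random(3)
vo2max = [rng.gauss(52, 7) for _ in range(2000)]
speed = [3.97 + 0.052 * (v - 52) + rng.gauss(0, 0.26) for v in vo2max]

fit = fit_line(vo2max, speed)
slow, fast = prediction_interval(fit, 60)

# Convert the speed interval back to 3k race times (faster speed = shorter time)
print(f"90% of simulated runners with a VO2max of 60 run between "
      f"{format_time(3000 / fast)} and {format_time(3000 / slow)}")
```

Note that the interval is computed on speed and only converted to race time at the end, in line with the choice to model speed rather than time.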

Now, prediction intervals *still* won’t save you if your sample is restricted based on the outcome. The reason is the same: in our example, by selecting based on performance level, you’re still systematically excluding a large group of people who under-perform relative to their VO2max.

Here’s a 90% prediction interval from the elite-only group, overlaid with the 90% prediction intervals for the entire dataset:

Even prediction intervals suffer from artificially excluding people with a high VO2max but slow race time. Again, when "overperformers" are more likely to be included in your sample, your correlations (and prediction intervals) will not be accurate anymore.

## Summary

When evaluating how well a given metric correlates with performance, it’s critical to consider the sample that’s being used to make such a claim.

In a highly restricted sample (like a group of elite runners), correlations will *always* appear smaller, because you’re artificially reducing the range of total variation in your outcome (in this case, race performance).

A better approach is to frame the question in terms of prediction intervals: the range of performances that are compatible with a given value of whatever metric we care about. Here’s what I mean:

**Less helpful:** “VO2max is not well-correlated with performance (R^{2} = 0.23) among runners with a 3k time below 11:00.”

**More helpful:** “Among runners with a VO2max of 60 mL/kg/min, 90% will be able to run between 10:22 and 12:37 in the 3k.”

The second and more helpful framing helps us understand a few things about VO2max and performance all at once. In our example from above, three things are clear:

- A higher VO2max enables better performance
- Even for a given VO2max value, a relatively wide range of performances is possible
- If your VO2max is very low, there are performance levels that are completely out of reach

Where your race performance ends up within the range of plausible values for any given VO2max is going to depend on all the “other stuff” that we care about from a performance perspective: running economy, lactate threshold as a percentage of VO2max, anaerobic capacity, and so on.

But saying “VO2max isn’t correlated with performance” is clearly not true: **the apparent lack of correlation is merely an artifact of how we chose our sample**.

The same will be true for *whatever* metric you choose if you restrict your sample based on something that’s correlated with performance.

## Footnotes

^{[1]} It’s simpler to use speed as opposed to pace or race time, since the VO2-to-speed relationship is close to linear. If you plot 3k pace as a function of VO2max, you end up having to use a non-linear curve to adequately fit the data. A formal statistical test for linearity finds no reason to prefer a non-linear curve for 3k speed as a function of VO2max. The header image of this article shows the VO2-to-speed relationship transformed back to race times.

^{[2]} It’s relatively straightforward to use prediction intervals even with models that are more sophisticated. For example, if we had a large dataset of physiology and performance data from college athletes, we could ask “what range of 10k performances are compatible with a female sophomore who runs 60 miles per week and has a VO2max of 55 mL/kg/min?”

^{[3]} There’s a more subtle selection effect at play here: these are not runners who are training for the 3000m! They are Norwegian military recruits, so their 3k performance is “bad” for their VO2max, compared to a trained runner who is presumably training all of the other determinants of 3k performance, like running economy, lactate threshold, and anaerobic power output. That’s why these 3k times seem relatively pedestrian even for an impressive VO2max like 60 mL/kg/min. And for the same reason, it’s not appropriate to predict a *trained runner’s* VO2max from their 3k time using this dataset.