Abusing Statistics

I recently learned an interesting new way to abuse statistics using regressions. (I’ll describe it first in a way that requires no math background, and give some math details at the end.) It can be difficult to tell whether those who abuse statistics are well-intentioned but mistaken, or know full well what they’re doing. Either way, they’re dangerous.

Suppose we conducted a study of retirees in their 60s to find out what percentage of their portfolios they spend each year. Even though this percentage varies across retirees, we want to get an overall sense of whether they’re spending too little or too much.

For the raw data of the study, I’m going to choose unrealistically simple numbers to make the calculations easier. The purpose here is to illustrate abuse of statistics. Here’s the raw data:

1000 retirees have $100,000 saved and spend $6000/year.
100 retirees have $1 million saved and spend $40,000/year.
10 retirees have $10 million saved and spend $200,000/year.

There are several ways we could go about calculating a collective spending percentage that represents all people in the study. They each have some merit, depending on what you’re trying to achieve, as long as you understand what each method gives you.

Median

Out of 1110 retirees, the median is between the 555th and 556th retiree. This is one of the retirees with $100,000 saved who spends $6000/year. So the median annual spending is 6%.
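For readers who like to check numbers, here is a minimal Python sketch of the median calculation. The `groups` list is just the raw data above; the variable names are illustrative and not part of any study.

```python
from statistics import median

# (number of retirees, portfolio in dollars, annual spending in dollars)
groups = [
    (1000, 100_000, 6_000),
    (100, 1_000_000, 40_000),
    (10, 10_000_000, 200_000),
]

# Expand the grouped data into one spending rate per retiree.
rates = [spend / saved for count, saved, spend in groups for _ in range(count)]

median_rate = median(rates)
print(f"retirees: {len(rates)}, median spending rate: {median_rate:.1%}")
```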

If you want to give advice that is relevant to as many retirees as possible, the median is a useful figure. Based on the median, we might write articles discussing how 6% is a high withdrawal rate for a retiree in his or her 60s.

Average retiree

We could calculate an average of the percentages, weighted by the number of retirees. In this case, we have 1000 at 6%, 100 at 4%, and 10 at 2%. The average is then 6420%/1110 = 5.8%.
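The same arithmetic can be sketched in a few lines of Python. This assumes the raw data listed earlier; the names are illustrative.

```python
# (number of retirees, portfolio in dollars, annual spending in dollars)
groups = [
    (1000, 100_000, 6_000),
    (100, 1_000_000, 40_000),
    (10, 10_000_000, 200_000),
]

total_retirees = sum(count for count, _, _ in groups)

# Each retiree's spending percentage counts once, whatever the portfolio size.
sum_of_rates = sum(count * spend / saved for count, saved, spend in groups)

average_rate = sum_of_rates / total_retirees
print(f"average spending rate per retiree: {average_rate:.1%}")
```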

An average like this could be skewed if there were some retirees who spent very high percentages in the year of the study. In our case, this didn’t happen, and the result isn’t much different from the median. Based on this average, we’ll end up warning retirees not to spend too much each year.

Dollar-weighted average

The total savings in each of the 3 groups of retirees is $100 million, for a total of $300 million. The total spending in each of the 3 groups is $6 million, $4 million, and $2 million, for a grand total of $12 million. This amount of spending is 4% of the total savings.
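As a quick check, the dollar-weighted figure can be reproduced with a short Python sketch. It assumes the raw data from the start of the article; the names are illustrative.

```python
# (number of retirees, portfolio in dollars, annual spending in dollars)
groups = [
    (1000, 100_000, 6_000),
    (100, 1_000_000, 40_000),
    (10, 10_000_000, 200_000),
]

# Pool all dollars first, then take a single ratio.
total_saved = sum(count * saved for count, saved, _ in groups)
total_spent = sum(count * spend for count, _, spend in groups)

dollar_weighted_rate = total_spent / total_saved
print(f"saved ${total_saved:,}, spent ${total_spent:,}, rate {dollar_weighted_rate:.1%}")
```

Because every dollar counts equally here, each of the 3 groups carries the same weight even though one group has 1000 retirees and another has only 10.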

This dollar-weighted average of 4% is skewed towards wealthier retirees. If we didn’t know the distribution of retiree figures and only saw this average figure, we might conclude that retirees are spending reasonably. Of course, the truth is that 90% of the retirees in the study are spending at a high rate, and nearly 1% could be spending more. Only about 9% of retirees in the study are spending at a level represented by this dollar-weighted average.

Regression

Even if you haven’t heard of a regression and don’t know what it means, you’ve probably seen one. If you’ve ever seen a chart with a cloud of dots and a line going through the cloud intended to show the trend of the data, that was a type of regression.

If we perform a simple type of regression on the data in this study, the resulting line indicates that retirees spend 2.2% of their portfolios each year. This seems like a ridiculous answer, but it’s what a naive regression gives.

It’s easy to see in this case that this result is hopelessly skewed toward the wealthiest retirees, but it would be less clear with data from a real study where retiree portfolios and spending amounts are all over the map. A non-expert user of statistical software might run what looks like a standard regression, get some number like 2.2%, and believe it. This answer would still be hopelessly skewed toward the spending percentages of the wealthiest retirees, but this fact wouldn’t be obvious.
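Here is a minimal Python sketch that reproduces the 2.2% figure. It assumes the "simple type of regression" is an ordinary least squares line forced through the origin, as in the mathematical details at the end. For such a line, the best slope is the sum of (portfolio × spending) divided by the sum of (portfolio squared). The variable names are illustrative.

```python
# (number of retirees, portfolio in dollars, annual spending in dollars)
groups = [
    (1000, 100_000, 6_000),
    (100, 1_000_000, 40_000),
    (10, 10_000_000, 200_000),
]

# Least squares slope for a line through the origin: sum(x*y) / sum(x*x),
# where x is the portfolio size and y is the annual spending.
sum_xy = sum(count * saved * spend for count, saved, spend in groups)
sum_xx = sum(count * saved * saved for count, saved, _ in groups)

slope = sum_xy / sum_xx
print(f"regression slope: {slope:.1%}")

# Share of the denominator contributed by each group of retirees.
for count, saved, _ in groups:
    share = count * saved * saved / sum_xx
    print(f"{count:>4} retirees with ${saved:,}: {share:.1%} of the weight")
```

The weights come from squared portfolio sizes, so the 10 wealthiest retirees carry about 90% of the weight and the 1000 least wealthy carry about 1%.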

If you only care about the wealthiest retirees, then this result is fine, but this seems unlikely. It’s more likely that the user of the statistical software either doesn’t understand this heavy bias toward the wealthy, or wants to exploit the fact that others don’t understand this heavy bias. Either way, we end up with the incorrect conclusion that retirees don’t spend enough from their portfolios.

Conclusion

In general, avoiding this type of mistake when analyzing data is tricky and requires some deep understanding of probability and statistics. It’s easy for non-experts to get tripped up. Perhaps less commonly, it’s also easy for experts to abuse statistics to mislead others.

----------------------------------------

Mathematical details

In this case, a linear regression amounts to fitting the best straight line we can through our 1110 data points using the ordinary least squares (OLS) method. The following chart shows the points. The large blue point near the origin is 1000 points on top of each other representing the least wealthy retirees. The medium size green point is 100 overlapping points for the millionaire retirees. The smallest red point is 10 overlapping points for the decamillionaires.

The line we wish to fit to these points must go through the origin; it doesn’t make sense for someone to spend money from a portfolio of zero dollars. The following chart shows the ideal lines for each type of retiree. The slope of the line through the blue point is 6%, through the green point is 4%, and through the red point is 2%.

The first thing to notice is that all 3 lines look good enough for the blue point representing the 1000 least affluent retirees. Their wealth is so small compared to the wealthy retirees that even a 2% slope seems good enough. But that would mean these retirees spend $2000 per year and not $6000. So, even a relative error of a factor of 3 represents a small amount on this chart. But it isn’t small from the point of view of these retirees. It matters to them whether they have $2000 or $6000 available to spend each year.

But the math doesn’t care what matters to these retirees. The lines with slope 4% and 6% give enormous errors for the decamillionaire retirees. Because the OLS method squares the errors, the errors for the decamillionaires dominate the errors at the other points even though there are far fewer decamillionaires.
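To make the size of these squared errors concrete, here is a small Python sketch that totals them by group for each of the three candidate slopes. The helper name `squared_error_by_group` is hypothetical, and slopes are passed as whole percentages so the arithmetic stays in exact integers.

```python
# (number of retirees, portfolio in dollars, annual spending in dollars)
groups = [
    (1000, 100_000, 6_000),
    (100, 1_000_000, 40_000),
    (10, 10_000_000, 200_000),
]

def squared_error_by_group(slope_percent):
    """Total squared error per group for a line through the origin.

    slope_percent is a whole number (e.g. 6 for 6%). Every portfolio here is
    a multiple of 100, so the integer division below is exact.
    """
    return [
        count * (saved * slope_percent // 100 - spend) ** 2
        for count, saved, spend in groups
    ]

for slope_percent in (2, 4, 6):
    errors = squared_error_by_group(slope_percent)
    print(f"slope {slope_percent}%: per group {errors}, total {sum(errors):,}")
```

With these numbers, the 6% line that fits 1000 retirees perfectly has a total squared error of 1.64 trillion, while the 2% line that fits only the 10 wealthiest has a total of 56 billion.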

Let’s go through the math for this study to find the OLS-optimized line. Suppose that the slope is m. Then the errors for the retirees of each wealth level are

(100,000)m - 6000 for 1000 retirees
(1 million)m - 40,000 for 100 retirees
(10 million)m - 200,000 for 10 retirees

We need to square these, add them up, and find the value of m that minimizes the total squared errors. The total squared error is

1000*((100,000)m - 6000)^2 + 100*((1 million)m - 40,000)^2 + 10*((10 million)m - 200,000)^2.

To make the numbers more manageable, let’s divide through by a billion to get

(100m-6)^2 + 10*(100m-4)^2 + 100*(100m - 2)^2.

We can already see that the contribution of the least wealthy retirees, and even that of the millionaire retirees, is swamped by the decamillionaire contribution. The simplified total is

1,110,000m^2 - 49,200m + 596.

The first derivative is 2,220,000m - 49,200. This is zero when

m = 49,200 / 2,220,000 = 0.022, or 2.2%.
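The expansion and the minimizing slope can be double-checked with exact arithmetic. This is only a verification sketch of the steps above; the names `a`, `b`, `c`, and `SCALE` are illustrative.

```python
from fractions import Fraction

# (number of retirees, portfolio in dollars, annual spending in dollars)
groups = [
    (1000, 100_000, 6_000),
    (100, 1_000_000, 40_000),
    (10, 10_000_000, 200_000),
]

SCALE = 10**9  # the "divide through by a billion" step

# Total squared error = a*m^2 - b*m + c after scaling.
# The integer divisions are exact for this data.
a = sum(count * saved**2 for count, saved, _ in groups) // SCALE
b = 2 * sum(count * saved * spend for count, saved, spend in groups) // SCALE
c = sum(count * spend**2 for count, _, spend in groups) // SCALE

# Setting the derivative 2*a*m - b to zero gives m = b / (2*a).
best_slope = Fraction(b, 2 * a)

print(f"{a:,}m^2 - {b:,}m + {c}")
print(f"best slope: {best_slope} = {float(best_slope):.4f}")
```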

The following chart shows the line of best fit. Visually, it seems fine, but it almost completely ignores 90% of the retirees. Even the millionaire retirees are mostly ignored. Only the wealthiest retirees matter very much.

It’s important to keep in mind that I’ve described all this in a way that makes the problem apparent. It is very easy to run a statistics package on some real data from a study and not realize that only the wealthiest participants matter much in the results. Any generic advice on retirement spending based on the 2.2% regression figure would be terrible for the least wealthy retirees, incorrect for millionaire retirees, and relevant only to the wealthiest.
