The Bias-Variance Tradeoff, Why People Are So Bad at Predicting the Future, and When Do Heuristics Work? Or: Why Bias Can Be Good
[This is post 4 in the "Structure and Cognition" series; links to all the posts can be found here]
I.
If there were a contest for the scientific findings most
embarrassing for human intelligence, the research comparing expert prediction to simple linear equations would be a strong contender.
When this research program began in the 1940s, clinical
psychologists were curious to see just how much better their clinical judgment would be than statistical models. Dawes and Corrigan (1974) note that “the
statistical analysis was thought to provide the floor to which the judgment of
the experienced clinician could be compared. The floor turned out to be a
ceiling.”
In summarizing this large body of literature, which repeatedly demonstrates the inferiority of human judgment, Nisbett and Ross (1980) write:
"Human judges make less accurate predictions than formulas do, whether they have more information than is fed into the formula or precisely the same amount of information. Human predictions are worse than actuarial ones even if the judges are informed of the weights used by the formula and are worse even if the judges are informed of the specific predictions generated by the formula in the cases in question.
Human judges are not merely worse than optimal regression equations; they are worse than almost any regression equation. Even if the weights in the equation are arbitrary, as long as they are nonzero, positive, and linear, the equation generally will outperform human judges. Human judges do not merely apply invalid weights; they apply their invalid weights unreliably: A computational model of an individual judge, one which calculates the weights applied by the judge to the first N cases, will outperform the judge on the next batch of cases because of improved reliability alone." (p. 141)
Some of this should sound similar to the discussion in my
previous post about simple heuristics outperforming more complicated algorithms
despite having not just less information but no information at all.
The reason this is possible lies in the move from explaining past
data to using that data to make new predictions. The past is a (somewhat) unreliable guide to the future – we can’t just predict that exactly the same events will recur. If it rained last Wednesday, that doesn’t
mean it will rain this Wednesday. But if we aren’t careful, we could easily construct a model that perfectly explains last week’s weather by predicting
rain every 7 days. If the model is based on only a week’s worth of data (or 3
weeks where it coincidentally rained every 7 days, etc.), it would achieve
perfect explanatory accuracy in accounting for last week. But obviously, if we
try to use this model to make new predictions, we’re not going to do very well.
The general problem is that some information is highly predictive of future states, but other information is not, and we often don’t have any way of determining which is which.
Say we’re trying to predict a student’s performance on an exam, and we have their GPA, a conscientiousness score from a personality test, the number of hours they slept the night before, and whether they ate breakfast. We might be inclined to lean on the GPA and conscientiousness more than their breakfast consumption, but how much more? It’s not clear how much information about the test score (if any) is imparted by knowing they ate in the morning.
What we want is a formula that maps our knowledge to the
outcome. Something like: score = 65% GPA + 20% conscientiousness + 5% sleep +
10% random other factors we don’t understand + 0% breakfast. We need to get estimates
to assign these percentages, so we collect some data on these variables in a sample
of students and correlate these data with outcomes.
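As a rough sketch of that estimation step (everything here is invented – the variable names, the sample, and the “true” weights are baked in just so there is something to recover), you could fit the weights with ordinary least squares:

```python
import numpy as np

# Invented data: one row per student, with the four cues from the example
# (GPA, conscientiousness, hours of sleep, ate breakfast or not).
rng = np.random.default_rng(0)
n = 200
gpa = rng.uniform(2.0, 4.0, n)
conscientiousness = rng.normal(0.0, 1.0, n)
sleep_hours = rng.normal(7.0, 1.0, n)
ate_breakfast = rng.integers(0, 2, n)

# A made-up "true" process: mostly GPA, some conscientiousness and sleep,
# no breakfast effect at all, plus unexplained noise.
score = 20 * gpa + 5 * conscientiousness + 1 * sleep_hours + rng.normal(0.0, 5.0, n)

# Estimate the weights by ordinary least squares.
X = np.column_stack([gpa, conscientiousness, sleep_hours, ate_breakfast, np.ones(n)])
weights, *_ = np.linalg.lstsq(X, score, rcond=None)
print(dict(zip(["gpa", "conscientiousness", "sleep", "breakfast", "intercept"],
               weights.round(2))))
```

Notice that breakfast has a true weight of zero in this toy setup, yet the estimated weight will almost never come out as exactly zero – whatever noise happens to line up with breakfast in the sample gets attributed to it, which is exactly the problem described next.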
There are two important goals in this process. First, we want
to make sure that we pick up all the relationships that exist. Second, we want
to avoid accidentally picking up “relationships” that aren’t real, like rain
correlating with Wednesdays.
If 10% of the relationship is actually noise, but in our
sample, half of that noise is coincidentally correlated with breakfast, our model will mistakenly believe that breakfast contributes 5% of the predictive
information. Because this is wrong, it’s going to introduce error when we use
this model to predict our student’s score.
The solution to the first goal is to make sure our estimation is sensitive to trends in the data, to ensure that we notice patterns. The solution to the second goal is to make sure our estimation is not sensitive to trends in the data, to ensure we only pick up patterns that are really there. These two goals are in direct competition with each other. This is the "bias-variance tradeoff": if we reduce the sensitivity, we get a biased model
that misses patterns in the data; if we increase the sensitivity, we mistakenly
pick up noise and treat it as signal. This won’t be noticeable when we fit our
model to the data, but when we try to predict, the problem will cause mistakes.
The technical term for too flexibly fitting past data is overfitting – the model is fit too closely to the original dataset at the cost of generalizing to new data. Effectively, it’s become extra good at predicting every last detail of our dataset, including those we would prefer it ignored because they aren’t going to show up in future data.
[Image caption: “We used a 17-minute sample of your sleeping position to estimate the best-fit mattress for your preferred mattress shape.”]
To avoid overfitting, machine learning models are typically trained on only part of the dataset, with some subset held out for testing the model on data it hasn’t seen before. This ensures it doesn’t just memorize the training data and achieve “perfect” performance that doesn’t generalize. (Models are also penalized for complexity, but there’s still a tradeoff between accuracy and complexity that can lead to overfitting or underfitting.)
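Here is a minimal sketch of that train/test idea (toy data and invented numbers, using scikit-learn for convenience): a very flexible model beats a simple one on the data it was fit to, then loses on the held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Toy data: a gentle linear trend plus noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 60).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 3, 60)

# Hold out part of the data so the model is scored on points it never saw.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 12):  # a simple (biased) model vs. a very flexible one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(x_train)), 2),  # training error
          round(mean_squared_error(y_test, model.predict(x_test)), 2))    # test error
```

Running something like this, the high-degree polynomial typically shows a lower training error and a higher test error than the straight line – the signature of overfitting.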
It’s useful to have a graph here to illustrate this (incidentally, I hope to explain why illustration helps in a few posts).
Bias isn’t inherently better than variance. Though simple biased models can avoid errors due to overfitting, the bias-variance tradeoff
is just that – increasing your bias will reduce your error from variance, but at
the cost of increased error from bias, or underfitting. In some situations, biased
models will do better, while in others the variance is meaningful, and a
complicated model will be best. In general, we can’t tell which is which when fitting
a model, but that doesn’t mean we can’t characterize a particular situation whose
parameters we know well and figure out which side of the tradeoff is best for that scenario.
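(For squared-error prediction, this tradeoff has a standard formal statement: the expected error at a point decomposes into bias squared, plus variance, plus irreducible noise. Changing the model’s flexibility to shrink one of the first two terms tends to grow the other; only the noise term is untouchable.)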
Under conditions where learning from variance is
likely to lead you astray, you want a biased model. With this in mind, we can
understand why simple rules outperformed the complex strategies with more
information. Some of that information, whether it was past stock performance or
customer activity, was noise. The strategies might have been better than 1/N
and the hiatus model, but they were weakened enough by overfitting their extra “knowledge”
that they lost despite this superiority.
In our weather example, once the gross pattern has been
captured by the model, using extra “information” is as likely to hurt as it is
to help. It’s plausible that there are some additional details that are actually
informative, like extra variability in June or whatever, but if we make the model
sensitive enough to capture that information, it’s as likely to pick up enough
noise to reduce performance.
Human prediction tends to suffer from a similar flaw. We often
assume we can discern deep patterns in job applicants and disease symptoms, but
we overfit on what is actually noise. We ignore more reliable cues that predict
the overall pattern and focus on specific personality traits or anecdotes.** This
is akin to getting really excited about the temperature fluctuations in June
and ignoring the difference between July and December.
Tetlock (2017) describes a
"demonstration… that pitted the predictive abilities of a classroom of Yale undergraduates against those of a single Norwegian rat. The task was predicting on which side of a T-maze food would appear, with appearances determined—unbeknownst to both the humans and the rat—by a random binomial process (60 percent left and 40 percent right). The demonstration replicated the classic studies by Edwards and by Estes: the rat went for the more frequently rewarded side (getting it right roughly 60 percent of the time), whereas the humans looked hard for patterns and wound up choosing the left or the right side in roughly the proportion they were rewarded (getting it right roughly 52 percent of the time). Human performance suffers because we are, deep down, deterministic thinkers with an aversion to probabilistic strategies that accept the inevitability of error. We insist on looking for order in random sequences. Confronted by the T-maze, we look for subtle patterns like “food appears in alternating two left/one right sequences, except after the third cycle when food pops up on the right." (p. 40)***
Under conditions where complex models are likely to overfit
the data, increasing bias is worth the cost of a reduced ability to account for
variance. Heuristics can outperform more complex models by avoiding being
actively hurt by taking in additional information. When this isn’t a concern, more
data should generally be helpful, though it’s still costly to gather and
integrate it. (Less knowledge can also be better when there are different costs
for different kinds of mistakes. The classic example of running from a rustling
bush with a 1/1000 chance that there's a predator doing the rustling illustrates
that less information may be useful even if it reduces accuracy because of the
very large discrepancies in the costs of making different types of
errors).
II.
The recognition heuristic is probably the paradigmatic
example of less-is-more heuristics. This heuristic specifically exploits a lack
of knowledge – it can only be used by people who are ignorant. The reason it
works is that ignorance isn’t random. You’re less likely to know obscure information; therefore, if you don’t know something, odds are better than average that it’s obscure.
In the classic demonstration, participants are presented
with a pair of cities, like Hamburg and Dortmund, and are asked which has a
larger population. Counterintuitively, Germans do better than Americans on
American cities and vice versa. For each pair of cities, natives are likely to
have heard of both, and need to rely on more explicit cues to judge which is larger. These cues are noisy and unreliable, so relying on them increases
error. On the other hand, non-natives are more likely to have only heard of one
of the cities, and again, because this knowledge isn’t random, they tend to be
ignorant of the smaller city.
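Here is a sketch of the rule itself (my own toy formalization – Goldstein and Gigerenzer’s full treatment also quantifies how valid recognition is as a cue, which this ignores):

```python
def recognition_heuristic(city_a, city_b, recognized):
    """Guess which of two cities is larger using only recognition.

    `recognized` is the set of city names this person has heard of.
    Returns the guessed-larger city, or None when recognition can't
    discriminate (both known or both unknown) and other cues must take over.
    """
    a_known = city_a in recognized
    b_known = city_b in recognized
    if a_known and not b_known:
        return city_a
    if b_known and not a_known:
        return city_b
    return None  # recognition is silent here

# A hypothetical American who has only heard of one of the two German cities:
print(recognition_heuristic("Hamburg", "Dortmund", recognized={"Hamburg", "Berlin"}))
```

The point is that the rule only fires for the partly ignorant: someone who recognizes both cities gets no help from it.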
However, heuristics
can be successful even without more complex models actively failing. Under
certain environmental conditions, using very few cues can be advantageous not
because other cues are noisy, but because information is unequally distributed
among cues.
In the test
example, above, there were 4 cues: past GPA, conscientiousness, hours of sleep,
and breakfast consumption. If each of these was equally predictive, we would
need to inspect multiple cues to be sure of our prediction. On the other hand,
if past GPA is 95% predictive of test scores, we can do quite well just from
checking one cue (and if we tried using the other cues, we might end up
overfitting, trying to correctly apportion the tiny amount of remaining predictive
information contained in some combination of the other cues).
When a single
cue outweighs all others (combined), the environment is referred to as noncompensatory.
Optimizing models are compensatory – that is, some attributes can compensate
for others. Maybe you know the student’s GPA is a 4.0, but if they don’t sleep
or eat breakfast, those negatives will outweigh a lot of the GPA effect. Under
compensatory conditions, you have to account for all of the cues. On the other
hand, when the environment is noncompensatory, a noncompensatory (heuristic)
strategy that focuses on the most useful cue(s) will perform about as well as
one that accounts for all the cues. This is less likely with a compensatory
environment, which may require much more complex processing.
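To make “noncompensatory” concrete, here’s a toy comparison with binary cues (1 means the cue favors that student, 0 means it doesn’t) and invented weights. Because the top weight exceeds the sum of the rest, whoever wins on GPA wins the full weighted comparison no matter how the other cues fall:

```python
# Toy illustration with binary cues and invented, noncompensatory weights:
# 0.6 > 0.25 + 0.1 + 0.05, so the GPA comparison can never be overturned
# by the remaining cues.
weights = [0.6, 0.25, 0.1, 0.05]   # GPA, conscientiousness, sleep, breakfast

student_a = [1, 0, 0, 0]   # better GPA, loses on every other cue
student_b = [0, 1, 1, 1]

score = lambda cues: sum(w * c for w, c in zip(weights, cues))

print("full weighted sum picks:", "A" if score(student_a) > score(student_b) else "B")  # A
print("GPA alone picks:        ", "A" if student_a[0] > student_b[0] else "B")          # A
```

Flip that condition – say the weights were 0.4, 0.25, 0.2, 0.15 – and the minor cues can gang up to overturn the GPA comparison, which is what makes an environment compensatory and the single-cue shortcut unreliable.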
Compensatory
decision-making fits in nicely with the economic model of “rational” optimizing.
When making a decision, say which ice cream brand to buy, translate everything
into a common currency of utility or money you’d be willing to pay for x, and
then maximize. So, price, taste, class(?), nutrition, distance from your
current location in the grocery aisle, height on the shelf, etc. all get
factored into your decision.
Alternatively, maybe
you value taste more than anything else – there’s no price (in a normal
supermarket, at least) or utility difference that you would accept to make do
with low-tier ice cream. In that case, your preferences are noncompensatory and
you can choose based on one cue: taste. Likewise, if you’re trying to reduce
costs and you’d be fine with any ice cream but prefer to save the maximum
amount, you can choose solely based on price.
Realistically,
there may be a few cues that are important to you and you may compensate
between them and ignore the rest. The point is that this is possible (and seems
obvious) because the cues aren’t all compensatory, so we can get away with
using roughly noncompensatory strategies when buying ice cream. In fact, when
it comes to large purchases, like a house or a car, where we want to compensate
like crazy, it’s often really hard to, e.g., trade off the new hardwood floors
against the location of the property. This is especially unfortunate because
these big choices are often the rarest decisions we make, which means they’re
the ones we have the least experience with.
Even when cues are theoretically compensatory, the structure
of the environment can determine whether heuristics are successful or render
optimizing strategies next to useless.
Let’s say a student wants to choose a course that was
enjoyed by most people who took it in the past. Assume we can divide those who
took the course into “liked the course” and “didn’t like the course.”
Here are some possible strategies:
1. Flip a coin and choose at random
2. Ask a randomly sampled other student for their
opinion
3. Ask 2 randomly sampled students; take the course
if both liked it, don’t take it if they both didn’t like it, and ask a third in
case of a tie
4. Sample the entire population of students
(Example
from Nisbett & Ross, 1980)
We might naively expect strategy 4 to be dominant, given
that it is the most comprehensive and has much more information than the other
strategies.
In fact, though, depending on the environment, the simpler
strategies may be surprisingly successful. (Strategy 1, of course, doesn’t
depend on any external factors, so it will always yield a 50% chance that our
student takes a good course.) At the extremes, if 100% of the student
population either enjoyed or didn’t enjoy a course, strategies 2, 3, and 4
actually produce identical results! Here, complexity of strategy is completely
irrelevant. The environment ensures that all strategies (except 1) will be 100%
accurate.
If only 80% of students enjoy the course, sampling 1 student (strategy 2) raises the probability of making the right decision from 50% (strategy 1) to 80%, and strategy 3 raises it to almost 90%. Sampling all the remaining students can at best yield another 10 percentage points of improvement past this point.
Incidentally, this strategy seems pretty common – asking just one or two
friends, rather than conducting a survey, though trust and knowledge of your
friends’ preferences should improve this strategy beyond a random sample.
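A quick simulation of the first three strategies under the 80 percent scenario (my own sketch; the exact values work out to 50 percent, 80 percent, and 89.6 percent):

```python
import random

def simulate(p_like, strategy, trials=100_000):
    """Fraction of trials in which the strategy makes the right call.

    The right call is to take the course when the majority liked it (p_like > 0.5).
    """
    right_answer = p_like > 0.5
    correct = 0
    for _ in range(trials):
        if strategy == "coin":                      # strategy 1: choose at random
            decision = random.random() < 0.5
        elif strategy == "ask_one":                 # strategy 2: one random student
            decision = random.random() < p_like
        else:                                       # strategy 3: ask two, break ties with a third
            first, second = (random.random() < p_like for _ in range(2))
            decision = first if first == second else (random.random() < p_like)
        correct += decision == right_answer
    return correct / trials

for s in ("coin", "ask_one", "ask_two_tiebreak"):
    print(s, round(simulate(0.8, s), 3))
```

Polling the entire student body (strategy 4) always identifies the majority, so it tops out at 100 percent – the extra ten points mentioned above.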
To further drive home the point about the environment determining
results, consider that if the student body is split 50/50, sampling every
student gets you the exact same result as flipping a coin.
We tend to assume that strategy choice and information
gathering determine the results of our problem solving, but often the
environment can play just as powerful a role in deciding the outcome. When
heuristics can take advantage of structure in the environment, they can achieve
impressive looking outcomes (like 90% accuracy with just 2 students surveyed).
When environmental structure is random or poorly organized, heuristics may
fail, but a systematic approach may do no better – you may as well flip a coin.
** The overfitting story for expert prediction is likely not the whole picture - part of the reason that all forms of linear models outperform experts, even when the weights are random, is that under certain conditions all linear models do about as well as each other (which is to say, quite well). This is known as the "principle of the flat maximum," which may be particularly likely to occur when cues are correlated with one another. See Dawes and Corrigan (1974) and Lovie and Lovie (1986).
*** Choosing options in proportion to their probability of paying out is called "probability matching" and there is a lot of debate over whether it's rational or not. Gigerenzer (2002, chapter 9) argues that in social environments, this strategy might be evolutionarily stable for a group, as a mixed game-theoretic strategy. As is generally the case with these arguments, my inclination is that even if this is true, the correct strategy is to stop using the adaptive rule when it stops being useful.
References:
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological bulletin, 81(2), 95.
Nisbett, R. E., & Ross, L. (1980). Human Inference: Strategies and shortcomings of social judgment.
Tetlock, P. E. (2017). Expert Political Judgment. Princeton University Press.