The Bias-Variance Tradeoff, Why People Are So Bad at Predicting the Future, and When Do Heuristics Work? Or: Why Bias Can Be Good
[This is post 4 in the "Structure and Cognition" series; links to all the posts can be found here]
I.
If there were a contest for the scientific findings most
embarrassing for human intelligence, the research comparing expert prediction to simple linear equations would be a strong contender.
When this research program began in the 1940s, clinical
psychologists were curious to see just how much better their clinical judgment would be than statistical models. Dawes and Corrigan (1974) note that “the
statistical analysis was thought to provide the floor to which the judgment of
the experienced clinician could be compared. The floor turned out to be a
ceiling.”
In summarizing this large body of literature, which repeatedly demonstrates the inferiority of human judgment, Nisbett and Ross (1980) write:
"Human judges make less accurate predictions than formulas do, whether they have more information than is fed into the formula or precisely the same amount of information. Human predictions are worse than actuarial ones even if the judges are informed of the weights used by the formula and are worse even if the judges are informed of the specific predictions generated by the formula in the cases in question.
Human judges are not merely worse than optimal regression equations; they are worse than almost any regression equation. Even if the weights in the equation are arbitrary, as long as they are nonzero, positive, and linear, the equation generally will outperform human judges. Human judges do not merely apply invalid weights; they apply their invalid weights unreliably: A computational model of an individual judge, one which calculates the weights applied by the judge to the first N cases, will outperform the judge on the next batch of cases because of improved reliability alone." (p. 141)
Some of this should sound similar to the discussion in my
previous post about simple heuristics outperforming more complicated algorithms
despite having not just less information but no information at all.
The reason this is possible lies in the move from explaining past
data to using that data to make new predictions. The past is a (somewhat) unreliable guide to the future – we can’t just predict that exactly the same events will recur. If it rained last Wednesday, that doesn’t
mean it will rain this Wednesday. But if we aren’t careful, we could easily construct a model that perfectly explains last week’s weather by predicting
rain every 7 days. If the model is based on only a week’s worth of data (or 3
weeks where it coincidentally rained every 7 days, etc.), it would achieve
perfect explanatory accuracy in accounting for last week. But obviously, if we
try to use this model to make new predictions, we’re not going to do very well.
The general problem is that some information is highly predictive of future states, but other information is not, and we often don’t have any way of determining which is which.
Say we’re trying to predict a student’s performance on an exam, and we have their GPA, a conscientiousness score from a personality test, the number of hours they slept the night before, and whether they ate breakfast. We might be inclined to lean on the GPA and conscientiousness more than their breakfast consumption, but how much more? It’s not clear how much information about the test score (if any) is imparted by knowing they ate in the morning.
What we want is a formula that maps our knowledge to the
outcome. Something like: score = 65% GPA + 20% conscientiousness + 5% sleep +
10% random other factors we don’t understand + 0% breakfast. We need to get estimates
to assign these percentages, so we collect some data on these variables in a sample
of students and correlate these data with outcomes.
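As a rough sketch of that estimation step (everything here is invented – the variable names, the sample, and the “true” weights are baked in just so there is something to recover), you could fit the weights with ordinary least squares:

```python
import numpy as np

# Invented data: one row per student, with the four cues from the example
# (GPA, conscientiousness, hours of sleep, ate breakfast or not).
rng = np.random.default_rng(0)
n = 200
gpa = rng.uniform(2.0, 4.0, n)
conscientiousness = rng.normal(0.0, 1.0, n)
sleep_hours = rng.normal(7.0, 1.0, n)
ate_breakfast = rng.integers(0, 2, n)

# A made-up "true" process: mostly GPA, some conscientiousness and sleep,
# no breakfast effect at all, plus unexplained noise.
score = 20 * gpa + 5 * conscientiousness + 1 * sleep_hours + rng.normal(0.0, 5.0, n)

# Estimate the weights by ordinary least squares.
X = np.column_stack([gpa, conscientiousness, sleep_hours, ate_breakfast, np.ones(n)])
weights, *_ = np.linalg.lstsq(X, score, rcond=None)
print(dict(zip(["gpa", "conscientiousness", "sleep", "breakfast", "intercept"],
               weights.round(2))))
```

Notice that breakfast has a true weight of zero in this toy setup, yet the estimated weight will almost never come out as exactly zero – whatever noise happens to line up with breakfast in the sample gets attributed to it, which is exactly the problem described next.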
There are two important goals in this process. First, we want
to make sure that we pick up all the relationships that exist. Second, we want
to avoid accidentally picking up “relationships” that aren’t real, like rain
correlating with Wednesdays.
If 10% of the relationship is actually noise, but in our
sample, half of that noise is coincidentally correlated with breakfast, our model will mistakenly believe that breakfast contributes 5% of the predictive
information. Because this is wrong, it’s going to introduce error when we use
this model to predict our student’s score.
The solution to the first goal is to make sure our estimation is sensitive to trends in the data, to ensure that we notice patterns. The solution to the second goal is to make sure our estimation is not sensitive to trends in the data, to ensure we only pick up patterns that are really there. These two goals are in direct competition with each other. This is the "bias-variance tradeoff": if we reduce the sensitivity, we get a biased model
that misses patterns in the data; if we increase the sensitivity, we mistakenly
pick up noise and treat it as signal. This won’t be noticeable when we fit our
model to the data, but when we try to predict, the problem will cause mistakes.
The technical term for too flexibly fitting past data is overfitting – the model is fit too closely to the original dataset at the cost of generalizing to new data. Effectively, it’s become extra good at predicting every last detail of our dataset, including those we would prefer it ignored because they aren’t going to show up in future data.
[Image caption: “We used a 17-minute sample of your sleeping position to estimate the best-fit mattress for your preferred mattress shape.”]
To avoid overfitting, machine learning models are typically trained on only part of the dataset, with some subset held out for testing the model on data it hasn’t seen before. This ensures it doesn’t just memorize the training data and achieve “perfect” performance that doesn’t generalize. (Models are also penalized for complexity, but there’s still a tradeoff between accuracy and complexity that can lead to overfitting or underfitting.)
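Here is a minimal sketch of that train/test idea (toy data and invented numbers, using scikit-learn for convenience): a very flexible model beats a simple one on the data it was fit to, then loses on the held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Toy data: a gentle linear trend plus noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 60).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(0, 3, 60)

# Hold out part of the data so the model is scored on points it never saw.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 12):  # a simple (biased) model vs. a very flexible one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(x_train)), 2),  # training error
          round(mean_squared_error(y_test, model.predict(x_test)), 2))    # test error
```

Running something like this, the high-degree polynomial typically shows a lower training error and a higher test error than the straight line – the signature of overfitting.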
It’s useful to have a graph here to illustrate this (incidentally, I hope to explain why illustration helps in a few posts).
Bias isn’t inherently better than variance. Though simple biased models can avoid errors due to overfitting, the bias-variance tradeoff
is just that – increasing your bias will reduce your error from variance, but at
the cost of increased error from bias, or underfitting. In some situations, biased
models will do better, while in others the variance is meaningful, and a
complicated model will be best. In general, we can’t tell which is which when fitting
a model, but that doesn’t mean we can’t characterize a particular situation whose
parameters we know well and figure out which side of the tradeoff is best for that scenario.
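(For squared-error prediction, this tradeoff has a standard formal statement: the expected error at a point decomposes into bias squared, plus variance, plus irreducible noise. Changing the model’s flexibility to shrink one of the first two terms tends to grow the other; only the noise term is untouchable.)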
Under conditions where learning from variance is
likely to lead you astray, you want a biased model. With this in mind, we can
understand why simple rules outperformed the complex strategies with more
information. Some of that information, whether it was past stock performance or
customer activity, was noise. The strategies might have been better than 1/N
and the hiatus model, but they were weakened enough by overfitting their extra “knowledge”
that they lost despite this superiority.
In our weather example, once the gross pattern has been
captured by the model, using extra “information” is as likely to hurt as it is
to help. It’s plausible that there are some additional details that are actually
informative, like extra variability in June or whatever, but if we make the model
sensitive enough to capture that information, it’s as likely to pick up enough
noise to reduce performance.
Human prediction tends to suffer from a similar flaw. We often
assume we can discern deep patterns in job applicants and disease symptoms, but
we overfit on what is actually noise. We ignore more reliable cues that predict
the overall pattern and focus on specific personality traits or anecdotes.** This
is akin to getting really excited about the temperature fluctuations in June
and ignoring the difference between July and December.
Tetlock (2017) describes a
"demonstration… that pitted the predictive abilities of a classroom of Yale undergraduates against those of a single Norwegian rat. The task was predicting on which side of a T-maze food would appear, with appearances determined—unbeknownst to both the humans and the rat—by a random binomial process (60 percent left and 40 percent right). The demonstration replicated the classic studies by Edwards and by Estes: the rat went for the more frequently rewarded side (getting it right roughly 60 percent of the time), whereas the humans looked hard for patterns and wound up choosing the left or the right side in roughly the proportion they were rewarded (getting it right roughly 52 percent of the time). Human performance suffers because we are, deep down, deterministic thinkers with an aversion to probabilistic strategies that accept the inevitability of error. We insist on looking for order in random sequences. Confronted by the T-maze, we look for subtle patterns like “food appears in alternating two left/one right sequences, except after the third cycle when food pops up on the right." (p. 40)***
Under conditions where complex models are likely to overfit
the data, increasing bias is worth the cost of a reduced ability to account for
variance. Heuristics can outperform more complex models by avoiding being
actively hurt by taking in additional information. When this isn’t a concern, more
data should generally be helpful, though it’s still costly to gather and
integrate it. (Less knowledge can also be better when there are different costs
for different kinds of mistakes. The classic example of running from a rustling
bush with a 1/1000 chance that there's a predator doing the rustling illustrates
that less information may be useful even if it reduces accuracy because of the
very large discrepancies in the costs of making different types of
errors).
II.
The recognition heuristic is probably the paradigmatic
example of less-is-more heuristics. This heuristic specifically exploits a lack
of knowledge – it can only be used by people who are ignorant. The reason it
works is that ignorance isn’t random. You’re less likely to know obscure information; therefore, if you don’t know something, odds are better than average that it’s obscure.
In the classic demonstration, participants are presented
with a pair of cities, like Hamburg and Dortmund, and are asked which has a
larger population. Counterintuitively, Germans do better than Americans on
American cities and vice versa. For each pair of cities, natives are likely to
have heard of both, and need to rely on more explicit cues to judge which is larger. These cues are noisy and unreliable, so relying on them increases
error. On the other hand, non-natives are more likely to have only heard of one
of the cities, and again, because this knowledge isn’t random, they tend to be
ignorant of the smaller city.
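Here is a sketch of the rule itself (my own toy formalization – Goldstein and Gigerenzer’s full treatment also quantifies how valid recognition is as a cue, which this ignores):

```python
def recognition_heuristic(city_a, city_b, recognized):
    """Guess which of two cities is larger using only recognition.

    `recognized` is the set of city names this person has heard of.
    Returns the guessed-larger city, or None when recognition can't
    discriminate (both known or both unknown) and other cues must take over.
    """
    a_known = city_a in recognized
    b_known = city_b in recognized
    if a_known and not b_known:
        return city_a
    if b_known and not a_known:
        return city_b
    return None  # recognition is silent here

# A hypothetical American who has only heard of one of the two German cities:
print(recognition_heuristic("Hamburg", "Dortmund", recognized={"Hamburg", "Berlin"}))
```

The point is that the rule only fires for the partly ignorant: someone who recognizes both cities gets no help from it.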
However, heuristics
can be successful even without more complex models actively failing. Under
certain environmental conditions, using very few cues can be advantageous not
because other cues are noisy, but because information is unequally distributed
among cues.
In the test
example, above, there were 4 cues: past GPA, conscientiousness, hours of sleep,
and breakfast consumption. If each of these was equally predictive, we would
need to inspect multiple cues to be sure of our prediction. On the other hand,
if past GPA is 95% predictive of test scores, we can do quite well just from
checking one cue (and if we tried using the other cues, we might end up
overfitting, trying to correctly apportion the tiny amount of remaining predictive
information contained in some combination of the other cues).
When a single
cue outweighs all others (combined), the environment is referred to as noncompensatory.
Optimizing models are compensatory – that is, some attributes can compensate
for others. Maybe you know the student’s GPA is a 4.0, but if they don’t sleep
or eat breakfast, those negatives will outweigh a lot of the GPA effect. Under
compensatory conditions, you have to account for all of the cues. On the other
hand, when the environment is noncompensatory, a noncompensatory (heuristic)
strategy that focuses on the most useful cue(s) will perform about as well as
one that accounts for all the cues. This is less likely with a compensatory
environment, which may require much more complex processing.
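To make “noncompensatory” concrete, here’s a toy comparison with binary cues (1 means the cue favors that student, 0 means it doesn’t) and invented weights. Because the top weight exceeds the sum of the rest, whoever wins on GPA wins the full weighted comparison no matter how the other cues fall:

```python
# Toy illustration with binary cues and invented, noncompensatory weights:
# 0.6 > 0.25 + 0.1 + 0.05, so the GPA comparison can never be overturned
# by the remaining cues.
weights = [0.6, 0.25, 0.1, 0.05]   # GPA, conscientiousness, sleep, breakfast

student_a = [1, 0, 0, 0]   # better GPA, loses on every other cue
student_b = [0, 1, 1, 1]

score = lambda cues: sum(w * c for w, c in zip(weights, cues))

print("full weighted sum picks:", "A" if score(student_a) > score(student_b) else "B")  # A
print("GPA alone picks:        ", "A" if student_a[0] > student_b[0] else "B")          # A
```

Flip that condition – say the weights were 0.4, 0.25, 0.2, 0.15 – and the minor cues can gang up to overturn the GPA comparison, which is what makes an environment compensatory and the single-cue shortcut unreliable.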
Compensatory
decision-making fits in nicely with the economic model of “rational” optimizing.
When making a decision, say which ice cream brand to buy, translate everything
into a common currency of utility or money you’d be willing to pay for x, and
then maximize. So, price, taste, class(?), nutrition, distance from your
current location in the grocery aisle, height on the shelf, etc. all get
factored into your decision.
Alternatively, maybe
you value taste more than anything else – there’s no price (in a normal
supermarket, at least) or utility difference that you would accept to make do
with low-tier ice cream. In that case, your preferences are noncompensatory and
you can choose based on one cue: taste. Likewise, if you’re trying to reduce
costs and you’d be fine with any ice cream but prefer to save the maximum
amount, you can choose solely based on price.
Realistically,
there may be a few cues that are important to you and you may compensate
between them and ignore the rest. The point is that this is possible (and seems
obvious) because the cues aren’t all compensatory, so we can get away with
using roughly noncompensatory strategies when buying ice cream. In fact, when
it comes to large purchases, like a house or a car, where we want to compensate
like crazy, it’s often really hard to, e.g., trade off the new hardwood floors
against the location of the property. This is especially unfortunate because
these big choices are often the rarest decisions we make, which means they’re
the ones we have the least experience with.
Even when cues are theoretically compensatory, the structure
of the environment can determine whether heuristics are successful or render
optimizing strategies next to useless.
Let’s say a student wants to choose a course that was
enjoyed by most people who took it in the past. Assume we can divide those who
took the course into “liked the course” and “didn’t like the course.”
Here are some possible strategies:
1. Flip a coin and choose at random
2. Ask a randomly sampled other student for their
opinion
3. Ask 2 randomly sampled students; take the course
if both liked it, don’t take it if they both didn’t like it, and ask a third in
case of a tie
4. Sample the entire population of students
(Example
from Nisbett & Ross, 1980)
We might naively expect strategy 4 to be dominant, given
that it is the most comprehensive and has much more information than the other
strategies.
In fact, though, depending on the environment, the simpler
strategies may be surprisingly successful. (Strategy 1, of course, doesn’t
depend on any external factors, so it will always yield a 50% chance that our
student takes a good course.) At the extremes, if 100% of the student
population either enjoyed or didn’t enjoy a course, strategies 2, 3, and 4
actually produce identical results! Here, complexity of strategy is completely
irrelevant. The environment ensures that all strategies (except 1) will be 100%
accurate.
If only 80% of students enjoy the course, sampling 1 student (strategy 2) raises the probability of making the right decision from 50% (strategy 1) to 80%, and strategy 3 raises it to almost 90%. Sampling all the remaining students can at best yield another 10 percentage points of improvement past this point.
Incidentally, this strategy seems pretty common – asking just one or two
friends, rather than conducting a survey, though trust and knowledge of your
friends’ preferences should improve this strategy beyond a random sample.
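A quick simulation of the first three strategies under the 80 percent scenario (my own sketch; the exact values work out to 50 percent, 80 percent, and 89.6 percent):

```python
import random

def simulate(p_like, strategy, trials=100_000):
    """Fraction of trials in which the strategy makes the right call.

    The right call is to take the course when the majority liked it (p_like > 0.5).
    """
    right_answer = p_like > 0.5
    correct = 0
    for _ in range(trials):
        if strategy == "coin":                      # strategy 1: choose at random
            decision = random.random() < 0.5
        elif strategy == "ask_one":                 # strategy 2: one random student
            decision = random.random() < p_like
        else:                                       # strategy 3: ask two, break ties with a third
            first, second = (random.random() < p_like for _ in range(2))
            decision = first if first == second else (random.random() < p_like)
        correct += decision == right_answer
    return correct / trials

for s in ("coin", "ask_one", "ask_two_tiebreak"):
    print(s, round(simulate(0.8, s), 3))
```

Polling the entire student body (strategy 4) always identifies the majority, so it tops out at 100 percent – the extra ten points mentioned above.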
To further drive home the point about the environment determining
results, consider that if the student body is split 50/50, sampling every
student gets you the exact same result as flipping a coin.
We tend to assume that strategy choice and information
gathering determine the results of our problem solving, but often the
environment can play just as powerful a role in deciding the outcome. When
heuristics can take advantage of structure in the environment, they can achieve
impressive looking outcomes (like 90% accuracy with just 2 students surveyed).
When environmental structure is random or poorly organized, heuristics may
fail, but a systematic approach may do no better – you may as well flip a coin.
** The overfitting story for expert prediction is likely not the whole picture - part of the reason that all forms of linear models outperform experts, even when the weights are random, is that under certain conditions all linear models do about as well as each other (which is to say, quite well). This is known as the "principle of the flat maximum," which may be particularly likely to occur when cues are correlated with one another. See Dawes and Corrigan (1974) and Lovie and Lovie (1986).
*** Choosing options in proportion to their probability of paying out is called "probability matching" and there is a lot of debate over whether it's rational or not. Gigerenzer (2002, chapter 9) argues that in social environments, this strategy might be evolutionarily stable for a group, as a mixed game-theoretic strategy. As is generally the case with these arguments, my inclination is that even if this is true, the correct strategy is to stop using the adaptive rule when it stops being useful.
References:
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological bulletin, 81(2), 95.
Nisbett, R. E., & Ross, L. (1980). Human Inference: Strategies and shortcomings of social judgment.
Tetlock, P. E. (2017). Expert Political Judgment. Princeton University Press.