Mathematics is to science what seasoning is to cooking. It
is impossible to imagine modern cooking without spices. However, and this is an
extremely important qualification, not all spices lead to good cooking. Some combinations of spices may ruin an otherwise good dish, and other combinations may even cover up the taste of a foul, putrefying, or poisonous dish.
The application of mathematics to science follows the same
rule. While modern science would not exist without mathematics, some
applications of mathematics either ruin what otherwise would be good science or
cover up trivial claims or outright quackery. IQ testing and much of neoclassical economics are examples of the latter. Unfortunately, the fraudulent use of mathematics has spread to other social sciences as well. This is evidenced, among other things, by the fact that most scientists cannot replicate results reported by their peers.
There are many reasons behind this trend, but I would like
to focus on two of them: (i) the sudden
availability of exotic spices in the hands of chefs, and (ii) the symbiotic
relationship between chefs and restaurateurs who employ them.
A. The availability of exotic spices.
As we all know, the quest for spices was the main reason
behind early European explorations. Spices had to be imported to medieval Europe from faraway lands through trade routes fraught with many dangers. This made them rare and very
expensive. As a result, they were available only to those select few who could
afford them. However, the improvements in
transportation and warfare technologies made these exotic spices more and more
available, which gave birth to fine European cuisine. The next push came with globalization, which spread formerly “exotic” cuisines – Chinese, Indian, Middle Eastern, etc. – all over the world, creating a new type of cooking known as “fast food”. Every shopping mall in the US and Europe has
“ethnic” (Chinese, Japanese, Vietnamese, Mediterranean, etc.) eateries offering
hastily prepared, simple, and rather crappy dishes whose tastes are virtually indistinguishable from one another.
The same process occurred in science. Medieval science – limited for the most part to theology – used logic and sophistry as its main methods. However, the development and application of mathematics gave birth to modern physics, astronomy, and other empirical sciences that dethroned theology. The next push came from the
development of computers that made mathematical calculations easy and
effortless. Today, the availability of cheap numerical data manipulation software has made these tools available to anyone involved in any kind of scientific inquiry, including the disciplines called “idiographic sciences” in the continental European tradition, which are concerned
with various manifestations of human behavior (history, anthropology,
psychology, sociology, economics etc.). The
distinctive characteristic of the idiographic sciences, which set them apart from the nomothetic sciences, was their focus on understanding unique human phenomena rather than general laws of nature, which in turn called for qualitative observation rather than quantitative methods.
The availability of cheap computers and computer programs fundamentally
altered not only the direction of research in the idiographic sciences, but
also what kinds of data are being collected. Since qualitative data are more difficult to process with computer software, their collection often takes a back seat in favor of quantitative – or rather pseudo-quantitative – data collected by opinion surveys. They are pseudo-quantitative because they use numerical scales representing intensity (e.g. strongly agree, somewhat agree, neither agree nor disagree, etc.), but they cannot be processed as “real” numbers.
For “real” numbers, such as 1, 2, 3, 4, etc., we can say that the difference between 1 and 2 is the same as that between 3 and 4, and that 2 is twice as big as 1 just as 4 is twice as big as 2. However, when those numbers are used as mere symbols representing multiple choices in an opinion survey, they cease to be “real” numbers. They can be replaced with the letters a, b, c, d, etc., or even with pictograms representing the different choices cooked up by survey designers. The reason they are not “real” numbers but mere labels is that we cannot say that the distance between choice A and choice B (e.g. strongly agree and moderately agree) is the same as that between choice B and choice C (moderately agree and neither agree nor disagree).
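A minimal sketch of the point, with made-up responses and arbitrary codings: the frequency counts do not depend on how the options are labeled, but any “average opinion” computed from the numeric codes does.

```python
from collections import Counter

# Hypothetical survey responses on a five-point agreement scale (invented data).
responses = ["strongly agree", "agree", "neither", "agree", "disagree",
             "strongly agree", "neither", "agree", "disagree", "agree"]

# Frequency counts are unaffected by how the options are labeled.
print(Counter(responses))

# Two equally order-preserving numeric codings of the same options.
coding_a = {"strongly disagree": 1, "disagree": 2, "neither": 3,
            "agree": 4, "strongly agree": 5}
coding_b = {"strongly disagree": 1, "disagree": 4, "neither": 5,
            "agree": 6, "strongly agree": 9}

mean_a = sum(coding_a[r] for r in responses) / len(responses)
mean_b = sum(coding_b[r] for r in responses) / len(responses)
# The "average opinion" changes with the arbitrary coding, a sign that the
# codes are labels, not quantities.
print(mean_a, mean_b)
```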
Research shows that subjective perceptions of quantities
themselves differ from their numerical properties. For example, a 5 percent change in
probability is perceived differently depending on the overall probability of an
outcome (i.e. whether it is 10%, 50% or 90%).
When it comes to opinions and perceptions, that level of subjectivity is
even higher. For example, if I only
“moderately agree” with an opinion on, say, capital punishment, it may not take
much to persuade me to be an agnostic (neither agree nor disagree). However, if I have a strong feeling (strongly
agree or strongly disagree), it typically takes much more to move me toward “moderate” agreement or disagreement.
There are other cognitive biases distorting these
measurements as well. If I ask
respondents to measure, say, the length of sticks by using a measuring tape
calibrated in centimeters or inches, most of the reported results will be reasonably
accurate, even if some respondents may not be familiar with a particular scale
of measurement. This is so because the measurement requires the simple application of a well-calibrated tool that does not change when applied to the object being measured. That is, measuring tapes are typically made of steel and do not expand or shrink to “fit” the particular object being measured. If the measuring tape were made of rubber, however, the measurements it produced would be rubbish, because the length of the tape would change each time it was applied to the object.
Using subjective scales of the type strongly/moderately
agree or disagree is like using a rubber measuring tape. The scale itself changes each time the
measurement is applied. If I am
currently in pain, my perception of it is affected by my current experience, so
I will report my pain as severe on a rating scale. However, if I am asked to evaluate my past
pain experience on the same scale, I would report it as less severe, because
humans have the tendency to forget and minimize past negative experiences. Likewise, if I am asked to record my emotional
state or my views on a policy proposal, my rating would be affected by two
things – how this question is worded, and what I am told or asked to do
before being asked that question. If the previous line of inquiry evoked negative experiences, my answers will differ from those I would give if it had evoked positive ones (the so-called “anchoring bias”).
It is therefore clear that answers solicited using such
measurement scales are nothing more than subjective opinions that will likely
change if the circumstances under which the measurement is taken change. Assigning numbers to these measurements creates the illusion that they represent numerical quantities, similar to the measurements of physical objects taken with a measuring tape made of steel. In reality, it is like assigning numbers to the shapes of clouds in the sky. Transforming these shapes into numbers does not change the idiographic nature of these observations into nomothetic science leading to general laws, because such laws simply do not exist in this particular case. While each cloud was certainly caused by
objective and measurable atmospheric conditions, its particular shape is
idiosyncratic and impossible to predict from these measurable atmospheric
conditions.
More conscientious researchers may refrain from treating such
subjective responses like “real” numbers and limit their analysis to reporting
frequency counts of broad categories of responses (e.g. how many people
agreed and how many disagreed), but the availability of cheap data processing software makes such analysis look “pedestrian”, and pressure is applied to use more “advanced” techniques. I am
speaking from experience here. Some time
ago, an anonymous peer reviewer of my paper using frequency-based contingency
tables showing distributions of opinions collected in a survey called this
technique “pedestrian” and suggested one based on regression. In other words, let’s treat them as “real”
numbers. This advice reminds me of the old joke about the economist who, unable to find a can opener on an uninhabited island, simply assumed he had one.
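For what it is worth, here is a minimal sketch of the “pedestrian” approach with invented data: a frequency-based contingency table that reports how responses distribute across groups without pretending the response codes are quantities. The group names and answers are hypothetical.

```python
from collections import Counter

# Invented (group, response) pairs standing in for survey data.
data = [("urban", "agree"), ("urban", "strongly agree"), ("urban", "disagree"),
        ("rural", "disagree"), ("rural", "strongly disagree"), ("rural", "agree"),
        ("urban", "agree"), ("rural", "neither"), ("urban", "neither"),
        ("rural", "disagree")]

options = ["strongly agree", "agree", "neither", "disagree", "strongly disagree"]
counts = Counter(data)

# A plain contingency table: groups as rows, response options as columns.
header = ["group"] + options
print(" | ".join(h.ljust(17) for h in header))
for group in ("urban", "rural"):
    row = [group] + [str(counts[(group, option)]) for option in options]
    print(" | ".join(cell.ljust(17) for cell in row))
```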
The problem is not limited to assumptions about the quantitative properties of the data; it extends to the kind of research that gains dominance in the social sciences with the advent of cheap computational tools. This new research paradigm favors questions
that can be answered by numerical or quasi-numerical data, because such data
are easy to collect and process. Hence
the proliferation of various opinion surveys.
The idiocy of this approach lies not only in the misinterpretation of numerical data but, more importantly, in the intellectual laziness it fosters. Researchers abandon the difficult intellectual task of trying to understand how people think and under what conditions, in favor of giving them simplistic multiple-choice tests built from prefabricated opinion statements, because such tests are easy to score and process by computer.
If this is not the proverbial drunkard’s search, I do not know what is.
B. The symbiotic relationship between chefs and
restaurateurs
Imagine, if you will, a renowned chef coming to the
restaurateur who employs her and saying “I tested several recipes, but they do
not meet the high standards of this fine dining establishment.” “So what are we going to serve our guests
tonight?” asks the restaurateur. “I do not have anything ready yet,” answers the chef, “but with more testing I should be able to come up with a good recipe in a week or two.” It is likely that this chef would be
looking for another job after this conversation.
The same applies to scientific knowledge. As Bruno Latour (“Science in Action”) aptly
observed, the production of science differs from the presentation of scientific
facts after they have been produced, that is, accepted as facts by the scientific community. Whereas the presentation of agreed-upon facts is subject to only one requirement – truth (i.e. concordance with reality) – the production of science is a much messier process. For one thing, it involves a search in the dark, before “the truth” has been discovered. All that exists during this process are various, often conflicting, claims made by people representing different schools of thought about what they believe is factually true.
Testing these claims requires enormous resources, teams of researchers
and support personnel, laboratories, expensive instruments, and channels of
communication with the scientific community.
The procurement of these enormous resources requires the involvement of
a much larger number of people than the scientists who do the actual
research. It requires institutions –
universities and research laboratories – run by an army of administrative staff.
It involves designers and manufacturers of specialized
equipment without which research would not be possible. It also requires funding, which in turn is procured
by linking the prospective results of scientific inquiries to the interests of people who control financial resources, such as government officials, corporate
bosses, or investors.
All those people – the administrators, the government
officials, the corporate bosses, and the investors – want to see the results of
their efforts and investments. In that
sense, they act like the restaurateur in our story – they expect their chefs to
produce meals of a certain quality, but they will not settle for a chef who tells them that she tested several recipes and found none of them satisfactory. Yet, when we look at science as a set of agreed-upon facts, all that messy production process disappears from view, and we are served the equivalent of a meal on a silver platter that needs to pass only our taste test.
This is why the actual production of science, like cooking,
is very different from tasting the already prepared product. In the idealized world, chefs and scientists look for recipes high and low, test them, and select only those that pass the
rigorous test – the ultimate consistency between the object (a dish or natural phenomenon) and human perception of it (excellent flavor or truth). This is how
scientific research appears to Karl Popper – as a long series of attempts to
falsify different scientific hypotheses to see which one will withstand this
test. This may be a logical way to
proceed if one’s goal is to find THE ultimate truth in some unspecified but
likely long period of time, but it is a waste of effort if one’s goal is to
find a truth that is good enough to pass some rudimentary plausibility test and
satisfy the stakeholders in the scientific production process – the administrators,
the government bureaucrats, the funders, the investors and, last but not least,
the researchers themselves whose academic careers and salaries depend on producing
tangible results. Falsifying hypotheses may produce results in the long run, but the stakeholders, like our restaurateur, need to put food on the table now, not in some distant future. They will not settle for the response “we have eliminated some hypotheses, but we have nothing positive yet.” People providing such responses would soon be
looking for other jobs, whether they are chefs or scientists.
Here is where the ubiquity of computers and statistical software comes in handy – they can produce minimally satisfactory results in a rather short period of time and with relatively little effort. They can do so for two reasons. First, their use requires numbers, which in turn leads to the substitution of qualitative phenomena with something that looks like numbers but really is not. I covered this process in section A of this essay. Second, they substitute statistical correlations for causal models and statistical significance tests for inductive reasoning.
We all learn in Research Methods 101 that correlation is not
causation, and there is no need to repeat this rather hackneyed truth. Suffice it to say that correlation may – but does not have to – imply a causal connection, so finding a correlation is a useful first step in an inquiry into what causes the phenomenon we want to
projects, especially in social and behavioral sciences, this is the only
step. These researchers often shout
Eureka as soon as they discover a meaningful correlation. And how do they know if the correlation is
meaningful? By looking at the result of the statistical significance test, the p-value, which by convention has to be lower than 0.05 (or 5%) for the correlation to be called “significant.” However, yelling Eureka after finding a “significant” statistical correlation is like coming to a five-star restaurant and being served macaroni and cheese out of a can. It is barely edible, all right, but hardly worth the five-star restaurant
price. Here is why.
The p-value simply means that if there were no correlation at all in the population, and we were to repeatedly draw representative samples of the same size as the one at hand, a correlation at least as large as the one we are looking at would turn up in only 5 (or fewer, depending on the actual p-value) of 100 such trials. In other words, if nothing but the luck of sampling were at work, a correlation this large would appear less than 5% of the time. This is all that there is
to it. Does this mean that there is a
cause-effect relation behind this correlation?
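That sampling story can be made concrete with a small simulation on made-up data: draw many samples from a population in which the two variables are genuinely unrelated, and count how often the correlation still crosses the conventional “significance” cutoff (roughly |r| > 0.28, the approximate two-tailed 5% threshold for samples of 50).

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
n, trials = 50, 2000
r_cutoff = 0.279   # approximate two-tailed 5% cutoff for a sample of 50

false_alarms = 0
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]   # unrelated to x by construction
    if abs(pearson_r(x, y)) > r_cutoff:
        false_alarms += 1

# Roughly 5% of samples from an uncorrelated population still look "significant".
# That, and nothing about causes, is what p < 0.05 tracks.
print(false_alarms / trials)
```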
The answer is that we really have no way of knowing with
only this information at hand. Imagine
the following situation. I want to know
if it rained at night when I was asleep, so I look at the grass in my front
yard. If it is dry, I can safely conclude
that it did not rain. But what if it is
wet? Can I conclude that it rained? I can, but I could be dead wrong because rain
is not the only thing that makes grass wet.
It could have been my automatic sprinkler, or it could have been mist on
a chilly night. The same reasoning
applies to statistical significance testing.
What we are really testing here is whether the observed correlation is
an artifact of the statistical procedure – random selection of samples whose
composition slightly varies due to chance.
Our p-value tells us the probability of exactly that: the probability of observing a correlation this large if the null hypothesis (i.e. the hypothesis of no correlation) were true. If that probability is less than 5%, we by convention conclude that the null hypothesis is false. The grass is wet, therefore it rained that night. But the rejection of the null hypothesis is not a sufficient reason to claim that the variables in question are causally connected, just as wet grass is not a sufficient reason to conclude that it rained.
What is more, statistical correlations change depending on
the presence of other variables in our model.
Suppose for example that when I look at prices of cars in different
countries I find that in the US cars cost on average $25k or more, while in the
Global South countries cars cost in the vicinity of $15k. I also discover that there are far more cars
in the US than in the Global South. If I
entered these findings into a computer, I would learn that there is a positive
correlation between the price of cars and the number of cars sold, which would
make most economists scratch their heads, as this contradicts everything they know about prices and the size of demand.
These economists would tell me that I need to consider other factors that affect the prices of and demand for cars, such as earnings, the cost of other goods, the availability of roads, etc. So if, following the economists’ advice, I enter these factors into my model, the value of the initially observed correlation will almost certainly change, and it may even become negative, as the economists expected. However, if I believe that my initial findings were correct and that the correlation between prices and the quantity of cars is genuinely positive due to the monopolization of transportation by car companies, I will measure the availability of alternative means of transportation and enter it into my model in the hope that the relationship between price and the number of cars sold moves back into positive territory.
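A toy simulation of the point, with entirely made-up numbers rather than real car-market data: when a third variable (here, income) drives both price and quantity, the raw correlation between the two can come out positive even though, at any fixed income level, the relationship runs the other way, and it flips sign once income is controlled for.

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def residuals(y, x):
    """Residuals of y after removing its linear dependence on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

random.seed(2)
income, price, quantity = [], [], []
for _ in range(200):                                  # 200 imaginary "countries"
    inc = random.uniform(5, 50)                       # average income, arbitrary units
    p = 10 + 0.4 * inc + random.gauss(0, 1)           # richer countries: higher prices
    q = 2 + 1.0 * inc - 0.8 * p + random.gauss(0, 1)  # at fixed income: higher price -> fewer cars
    income.append(inc)
    price.append(p)
    quantity.append(q)

print(pearson_r(price, quantity))          # positive: the "puzzling" raw correlation
print(pearson_r(residuals(price, income),  # negative once income is held fixed
                residuals(quantity, income)))
```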
If this looks like a wild goose chase, there is certainly some truth to that. Most statistical models are highly dependent on the analytic procedures used to construct them. These include the nature and quality of the
data, the data cleaning and preparation procedures, the type and number of variables
in the model, and in more complex models, such as factor analysis,
researcher-supplied assumptions about the nature of the expected outcome that
are necessary for the computerized analysis to run. As Paul Ormerod (“The Death of Economics”)
argued, complex econometric models can produce a virtually infinite number of solutions,
depending on more or less arbitrary assumptions that analysts build into
them. A similar conclusion was reached
by Stephen Jay Gould (“The Mismeasure of Man”) in regard to complex psychometric
models.
So if these complex statistical tools indeed put the
researchers on a wild goose chase, why is there such a strong pressure in the
research community to use these tools? The
answer is that, unlike in the wild, where the goose typically gets away after leading her pursuer sufficiently far astray, in the academic research world the pursuers almost always get something from such a chase. It may not be the goose they were after, but a few feathers, an insect or two, or a mouse if they are lucky. In other words, they will find the statistically significant correlations that computers invariably produce if run long enough, which they can then claim as their big catch in their research papers and use as bait to attract further research grants and support. There is a tangible product of the chase after all, which makes everyone happy: the chef, the restaurateur and his guest, the researchers, the university administrators, the government bureaucrats, and the funders and investors.
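A sketch of the “feathers” phenomenon, again with invented data: correlate one random outcome with a few dozen predictors that have nothing to do with it, and a couple of them will clear the conventional p < 0.05 bar by chance alone.

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(3)
n, n_predictors = 50, 40
r_cutoff = 0.279   # approximate two-tailed 5% cutoff for a sample of 50

outcome = [random.gauss(0, 1) for _ in range(n)]
catches = []
for j in range(n_predictors):
    predictor = [random.gauss(0, 1) for _ in range(n)]   # unrelated to the outcome
    r = pearson_r(outcome, predictor)
    if abs(r) > r_cutoff:
        catches.append((j, round(r, 2)))

# With 40 unrelated predictors, about two "significant" correlations are expected
# by chance alone -- the feathers that end up in the research paper.
print(catches)
```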
None of it would be possible without computer tools capable of cranking out correlations in a matter of seconds from the garbage that passes for quantitative data. This has transformed science away from what Popper claimed it to be – the falsification of scientific hypotheses. Indeed, there is little economic value in such a pursuit, just as there is little value in a chef deconstructing the food served by other restaurants. What pays off is spending most of one’s energy on cranking out food for thought that the administrators and paying clients will buy. It is little wonder that so few of these results are replicable, since their construction involves a fair amount of chance and data preparation.
We have achieved a lofty goal.
As the proliferation of computers and quantitative analytic tools reaches unprecedented proportions, social science research resembles, more and more closely, an elaborate chase after a wild goose in a thicket of numbers that pass for data. On some great and glorious day, social scientists will reach the object of their physics envy and turn their discipline into what theology was in the Middle Ages: an impressive, logically coherent intellectual edifice whose empirical relevance and predictive power are on a par with those of a chimp randomly pushing computer buttons.