Mathematics is to science what seasoning is to cooking. It
is impossible to imagine modern cooking without spices. However, and this is an
extremely important qualification, not all spices lead to good cooking. Some combinations of spices may ruin an otherwise good dish, and other combinations may even cover up the taste of a foul, putrefying, or poisonous dish.
The application of mathematics to science follows the same
rule. While modern science would not exist without mathematics, some
applications of mathematics either ruin what otherwise would be good science or
cover up trivial claims or outright quackery. IQ testing and much of neoclassical economics are examples of the latter. Unfortunately, the fraudulent use of mathematics has spread to other social sciences as well. This is evidenced, among other things, by the fact that most scientists cannot replicate results reported by their peers.
There are many reasons behind this trend, but I would like
to focus on two of them: (i) the sudden
availability of exotic spices in the hands of chefs, and (ii) the symbiotic
relationship between chefs and restaurateurs who employ them.
A. The availability of exotic spices.
As we all know, the quest for spices was the main reason
behind early European explorations. Spices had to be imported to medieval Europe from faraway lands through trade routes fraught with many dangers. This made them rare and very
expensive. As a result, they were available only to those select few who could
afford them. However, the improvements in
transportation and warfare technologies made these exotic spices more and more
available, which gave birth to fine European cuisine. The next push came with globalization, which spread formerly “exotic” cuisines – Chinese, Indian, Middle Eastern, etc. – all over the world, creating a new type of cooking known as “fast food”. Every shopping mall in the US and Europe has
“ethnic” (Chinese, Japanese, Vietnamese, Mediterranean, etc.) eateries offering
hastily prepared, simple, and rather crappy dishes whose tastes are virtually indistinguishable from one another.
The same process occurred in science. Medieval science – limited for the most part to theology – used logic and sophistry as its main methods. However, the development and application of mathematics gave birth to modern physics, astronomy, and other empirical sciences that dethroned theology. The next push came from the
development of computers that made mathematical calculations easy and
effortless. Today, the availability of cheap numerical data manipulation software has made these tools available to anyone involved in any kind of scientific inquiry, including the disciplines called “idiographic sciences” in the continental European tradition, which are concerned
with various manifestations of human behavior (history, anthropology,
psychology, sociology, economics etc.). The
distinctive characteristic of the idiographic sciences, which set them apart from the nomothetic sciences, was their focus on understanding unique human phenomena rather than general laws of nature, which in turn called for qualitative observation rather than quantitative methods.
The availability of cheap computers and computer programs fundamentally
altered not only the direction of research in the idiographic sciences, but
also what kinds of data are being collected. Since qualitative data are more difficult to process with computer software, their collection often takes a back seat in favor of quantitative – or rather pseudo-quantitative – data collected by opinion surveys. They are pseudo-quantitative because they use numerical scales representing intensity (e.g. strongly agree, somewhat agree, neither agree nor disagree, etc.), but they cannot be processed as “real” numbers.
For “real” numbers, such as 1, 2, 3, 4, etc., we can say that the difference between 1 and 2 is the same as that between 3 and 4, and that 2 is twice as big as 1 just as 4 is twice as big as 2. However, when those numbers are used as mere symbols representing multiple choices in an opinion survey, they cease to be “real” numbers. They can be replaced with the letters a, b, c, d, etc., or even with pictograms representing the different choices cooked up by survey designers. The reason they are not “real” numbers but mere labels is that we cannot say that the distance between choice A and choice B (e.g. strongly agree and moderately agree) is the same as that between choice B and choice C (moderately agree and neither agree nor disagree).
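A minimal sketch of the point, with made-up responses and arbitrary codings: the frequency counts do not depend on how the options are labeled, but any “average opinion” computed from the numeric codes does.

```python
from collections import Counter

# Hypothetical survey responses on a five-point agreement scale (invented data).
responses = ["strongly agree", "agree", "neither", "agree", "disagree",
             "strongly agree", "neither", "agree", "disagree", "agree"]

# Frequency counts are unaffected by how the options are labeled.
print(Counter(responses))

# Two equally order-preserving numeric codings of the same options.
coding_a = {"strongly disagree": 1, "disagree": 2, "neither": 3,
            "agree": 4, "strongly agree": 5}
coding_b = {"strongly disagree": 1, "disagree": 4, "neither": 5,
            "agree": 6, "strongly agree": 9}

mean_a = sum(coding_a[r] for r in responses) / len(responses)
mean_b = sum(coding_b[r] for r in responses) / len(responses)
# The "average opinion" changes with the arbitrary coding, a sign that the
# codes are labels, not quantities.
print(mean_a, mean_b)
```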
Research shows that subjective perceptions of quantities
themselves differ from their numerical properties. For example, a 5 percent change in
probability is perceived differently depending on the overall probability of an
outcome (i.e. whether it is 10%, 50% or 90%).
When it comes to opinions and perceptions, that level of subjectivity is
even higher. For example, if I only
“moderately agree” with an opinion on, say, capital punishment, it may not take
much to persuade me to be an agnostic (neither agree nor disagree). However, if I have a strong feeling (strongly
agree or strongly disagree), it typically takes much more to move me toward “moderate” agreement or disagreement.
There are other cognitive biases distorting these
measurements as well. If I ask
respondents to measure, say, the length of sticks by using a measuring tape
calibrated in centimeters or inches, most of the reported results will be reasonably
accurate, even if some respondents may not be familiar with a particular scale
of measurement. This is so because the measurement requires the simple application of a well-calibrated tool that does not change when applied to the object being measured. That is, measuring tapes are typically made of steel and do not expand or shrink to “fit” the particular object being measured. If the measuring tape were made of rubber, however, the measurements it produced would be rubbish, because the length of the tape would change each time it was applied to the object.
Using subjective scales of the type strongly/moderately
agree or disagree is like using a rubber measuring tape. The scale itself changes each time the
measurement is applied. If I am
currently in pain, my perception of it is affected by my current experience, so
I will report my pain as severe on a rating scale. However, if I am asked to evaluate my past
pain experience on the same scale, I would report it as less severe, because
humans have the tendency to forget and minimize past negative experiences. Likewise, if I am asked to record my emotional
state or my views on a policy proposal, my rating would be affected by two
things – how this question is worded, and what I am told or asked to do
before being asked that question. If the previous line of inquiry evoked negative experiences, my answers will differ from those I would give if it had evoked positive ones (the so-called “anchoring bias”).
It is therefore clear that answers solicited using such
measurement scales are nothing more than subjective opinions that will likely
change if the circumstances under which the measurement is taken change. Assigning numbers to these measurements creates the illusion that they represent numerical quantities, similar to the measurements of physical objects taken with a measuring tape made of steel. In reality, it is like assigning numbers to the shapes of clouds in the sky. Transforming these shapes into numbers does not change the idiographic nature of these observations into nomothetic science leading to general laws, because such laws simply do not exist in this particular case. While each cloud was certainly caused by
objective and measurable atmospheric conditions, its particular shape is
idiosyncratic and impossible to predict from these measurable atmospheric
conditions.
More conscientious researchers may refrain from treating such
subjective responses like “real” numbers and limit their analysis to reporting
frequency counts of broad categories of responses (e.g. how many people
agreed and how many disagreed), but the availability of cheap data processing software makes such analysis look “pedestrian”, and pressure is applied to use more “advanced” techniques. I am
speaking from experience here. Some time
ago, an anonymous peer reviewer of my paper using frequency-based contingency
tables showing distributions of opinions collected in a survey called this
technique “pedestrian” and suggested one based on regression. In other words, let’s treat them as “real”
numbers. This advice reminds me of the old joke about the economist who, unable to find a can opener on an uninhabited island, simply assumed he had one.
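For what it is worth, here is a minimal sketch of the “pedestrian” approach with invented data: a frequency-based contingency table that reports how responses distribute across groups without pretending the response codes are quantities. The group names and answers are hypothetical.

```python
from collections import Counter

# Invented (group, response) pairs standing in for survey data.
data = [("urban", "agree"), ("urban", "strongly agree"), ("urban", "disagree"),
        ("rural", "disagree"), ("rural", "strongly disagree"), ("rural", "agree"),
        ("urban", "agree"), ("rural", "neither"), ("urban", "neither"),
        ("rural", "disagree")]

options = ["strongly agree", "agree", "neither", "disagree", "strongly disagree"]
counts = Counter(data)

# A plain contingency table: groups as rows, response options as columns.
header = ["group"] + options
print(" | ".join(h.ljust(17) for h in header))
for group in ("urban", "rural"):
    row = [group] + [str(counts[(group, option)]) for option in options]
    print(" | ".join(cell.ljust(17) for cell in row))
```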
The problem is not limited to assumptions about the quantitative properties of the data; it extends to the kind of research that gains dominance in the social sciences with the advent of cheap computational tools. This new research paradigm favors questions
that can be answered by numerical or quasi-numerical data, because such data
are easy to collect and process. Hence
the proliferation of various opinion surveys.
The idiocy of this approach lies not only in the misinterpretation of numerical data but, more importantly, in the intellectual laziness it fosters. Researchers abandon the difficult intellectual task of trying to understand how people think and under what conditions, in favor of giving them simplistic multiple-choice tests built from prefabricated opinion statements, because such tests are easy to score and process by computer.
If this is not the proverbial drunkard’s search, I do not know what is.
B. The symbiotic relationship between chefs and
restaurateurs
Imagine, if you will, a renowned chef coming to the
restaurateur who employs her and saying “I tested several recipes, but they do
not meet the high standards of this fine dining establishment.” “So what are we going to serve our guests
tonight?” asks the restaurateur. “I do not have anything ready yet,” answers the chef, “but with more testing I should be able to come up with a good recipe in a week or two.” It is likely that this chef would be
looking for another job after this conversation.
The same applies to scientific knowledge. As Bruno Latour (“Science in Action”) aptly
observed, the production of science differs from the presentation of scientific
facts after they have been produced, that is, accepted as facts by the scientific community. Whereas the presentation of agreed-upon facts is subject to only one requirement – truth (i.e. concordance with reality) – the production of science is a much messier process. For one thing, it involves a search in the dark, before “the truth” has been discovered. All that exists during this process are various, often conflicting, claims made by people representing different schools of thought about what they believe is factually true.
Testing these claims requires enormous resources, teams of researchers
and support personnel, laboratories, expensive instruments, and channels of
communication with the scientific community.
The procurement of these enormous resources requires the involvement of
a much larger number of people than the scientists who do the actual
research. It requires institutions –
universities and research laboratories – run by an army of administrative staff.
It involves designers and manufacturers of specialized
equipment without which research would not be possible. It also requires funding, which in turn is procured
by linking the prospective results of scientific inquiries to the interests of people who control financial resources, such as government officials, corporate
bosses, or investors.
All those people – the administrators, the government
officials, the corporate bosses, and the investors – want to see the results of
their efforts and investments. In that
sense, they act like the restaurateur in our story – they expect their chefs to
produce meals of a certain quality, but they will not settle for a chef who tells them that she tested several recipes and found none of them satisfactory. Yet, when we look at science as a set of agreed-upon facts, all that messy production process disappears from view, and we are served the equivalent of a meal on a silver platter that needs to pass only our taste test.
This is why the actual production of science, like cooking,
is very different from tasting the already prepared product. In the idealized world, chefs and scientists look for recipes high and low, test them, and select only those that pass the
rigorous test – the ultimate consistency between the object (a dish or natural phenomenon) and human perception of it (excellent flavor or truth). This is how
scientific research appears to Karl Popper – as a long series of attempts to
falsify different scientific hypotheses to see which one will withstand this
test. This may be a logical way to
proceed if one’s goal is to find THE ultimate truth in some unspecified but
likely long period of time, but it is a waste of effort if one’s goal is to
find a truth that is good enough to pass some rudimentary plausibility test and
satisfy the stakeholders in the scientific production process – the administrators,
the government bureaucrats, the funders, the investors and, last but not least,
the researchers themselves whose academic careers and salaries depend on producing
tangible results. Falsifying hypotheses may produce results in the long run, but the stakeholders, like our restaurateur, need to put food on the table now, not in some distant future. They will not settle for the response “we have eliminated some hypotheses, but we have nothing positive yet.” People providing such responses would soon be
looking for other jobs, whether they are chefs or scientists.
Here is where the ubiquity of computers and statistical software comes in handy – they can produce minimally satisfactory results in a rather short period of time and with relatively little effort. They can do so for two reasons. First, their use requires numbers, which in turn leads to the substitution of qualitative phenomena with something that looks like numbers but really is not. I covered this process in section A of this essay. Second, they substitute statistical correlations for causal models and statistical significance tests for inductive reasoning.
We all learn in Research Methods 101 that correlation is not
causation, and there is no need to repeat this rather hackneyed truth. Suffice it to say that correlation may – but does not have to – imply a causal connection, so finding a correlation is a useful first step in an inquiry into what causes the phenomenon we want to
projects, especially in social and behavioral sciences, this is the only
step. These researchers often shout
Eureka as soon as they discover a meaningful correlation. And how do they know if the correlation is
meaningful? By looking at the result of the statistical significance test, the p-value, which by convention has to be lower than 0.05 (or 5%) for the correlation to be called “significant.” However, yelling Eureka after finding a “significant” statistical correlation is like coming to a five-star restaurant and being served macaroni and cheese out of a can. It is barely edible, all right, but hardly worth the five-star restaurant
price. Here is why.
The p-value simply means that if there were no correlation at all in the population, and we were to repeatedly draw representative samples of the same size as the one at hand, a correlation at least as large as the one we are looking at would turn up in only 5 (or fewer, depending on the actual p-value) of 100 such trials. In other words, if nothing but the luck of sampling were at work, a correlation this large would appear less than 5% of the time. This is all that there is
to it. Does this mean that there is a
cause-effect relation behind this correlation?
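That sampling story can be made concrete with a small simulation on made-up data: draw many samples from a population in which the two variables are genuinely unrelated, and count how often the correlation still crosses the conventional “significance” cutoff (roughly |r| > 0.28, the approximate two-tailed 5% threshold for samples of 50).

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
n, trials = 50, 2000
r_cutoff = 0.279   # approximate two-tailed 5% cutoff for a sample of 50

false_alarms = 0
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]   # unrelated to x by construction
    if abs(pearson_r(x, y)) > r_cutoff:
        false_alarms += 1

# Roughly 5% of samples from an uncorrelated population still look "significant".
# That, and nothing about causes, is what p < 0.05 tracks.
print(false_alarms / trials)
```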
The answer is that we really have no way of knowing with
only this information at hand. Imagine
the following situation. I want to know
if it rained at night when I was asleep, so I look at the grass in my front
yard. If it is dry, I can safely conclude
that it did not rain. But what if it is
wet? Can I conclude that it rained? I can, but I could be dead wrong because rain
is not the only thing that makes grass wet.
It could have been my automatic sprinkler, or it could have been mist on
a chilly night. The same reasoning
applies to statistical significance testing.
What we are really testing here is whether the observed correlation is
an artifact of the statistical procedure – random selection of samples whose
composition slightly varies due to chance.
Our p-value tells us the probability of exactly that: the probability of observing a correlation this large if the null hypothesis (i.e. the hypothesis of no correlation) were true. If that probability is less than 5%, we by convention conclude that the null hypothesis is false. The grass is wet, therefore it rained that night. But the rejection of the null hypothesis is not a sufficient reason to claim that the variables in question are causally connected, just as wet grass is not a sufficient reason to conclude that it rained.
What is more, statistical correlations change depending on
the presence of other variables in our model.
Suppose for example that when I look at prices of cars in different
countries I find that in the US cars cost on average $25k or more, while in the
Global South countries cars cost in the vicinity of $15k. I also discover that there are far more cars
in the US than in the Global South. If I
entered these findings into a computer, I would learn that there is a positive
correlation between the price of cars and the number of cars sold, which would
make most economists scratch their heads, as this contradicts everything they know about prices and the size of demand.
These economists would tell me that I need to consider other factors that affect the prices of and demand for cars, such as earnings, the cost of other goods, the availability of roads, etc. So if, following the economists’ advice, I enter these factors into my model, the value of the initially observed correlation will almost certainly change, and it may even become negative, as the economists expected. However, if I believe that my initial findings were correct and that the correlation between prices and the quantity of cars is genuinely positive due to the monopolization of transportation by car companies, I will measure the availability of alternative means of transportation and enter it into my model in the hope that the relationship between price and the number of cars sold moves back into positive territory.
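A toy simulation of the point, with entirely made-up numbers rather than real car-market data: when a third variable (here, income) drives both price and quantity, the raw correlation between the two can come out positive even though, at any fixed income level, the relationship runs the other way, and it flips sign once income is controlled for.

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def residuals(y, x):
    """Residuals of y after removing its linear dependence on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

random.seed(2)
income, price, quantity = [], [], []
for _ in range(200):                                  # 200 imaginary "countries"
    inc = random.uniform(5, 50)                       # average income, arbitrary units
    p = 10 + 0.4 * inc + random.gauss(0, 1)           # richer countries: higher prices
    q = 2 + 1.0 * inc - 0.8 * p + random.gauss(0, 1)  # at fixed income: higher price -> fewer cars
    income.append(inc)
    price.append(p)
    quantity.append(q)

print(pearson_r(price, quantity))          # positive: the "puzzling" raw correlation
print(pearson_r(residuals(price, income),  # negative once income is held fixed
                residuals(quantity, income)))
```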
If this looks like a wild goose chase, there is certainly some truth to that. Most statistical models are highly dependent on the analytic procedures used to construct them. These include the nature and quality of the
data, the data cleaning and preparation procedures, the type and number of variables
in the model, and in more complex models, such as factor analysis,
researcher-supplied assumptions about the nature of the expected outcome that
are necessary for the computerized analysis to run. As Paul Ormerod (“The Death of Economics”)
argued, complex econometric models can produce a virtually infinite number of solutions,
depending on more or less arbitrary assumptions that analysts build into
them. A similar conclusion was reached
by Stephen Jay Gould (“The Mismeasure of Man”) in regard to complex psychometric
models.
So if these complex statistical tools indeed put the
researchers on a wild goose chase, why is there such a strong pressure in the
research community to use these tools? The
answer is that, unlike in the wild, where the goose typically gets away after leading her pursuer sufficiently far astray, in the academic research world the pursuers almost always get something from such a chase. It may not be the goose they were after, but a few feathers, an insect or two, or a mouse if they are lucky. In other words, they will find the statistically significant correlations that computers invariably produce if run long enough, which they can then claim as their big catch in their research papers and use as bait to attract further research grants and support. There is a tangible product of the chase after all, which makes everyone happy: the chef, the restaurateur and his guest, the researchers, the university administrators, the government bureaucrats, and the funders and investors.
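A sketch of the “feathers” phenomenon, again with invented data: correlate one random outcome with a few dozen predictors that have nothing to do with it, and a couple of them will clear the conventional p < 0.05 bar by chance alone.

```python
import random
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(3)
n, n_predictors = 50, 40
r_cutoff = 0.279   # approximate two-tailed 5% cutoff for a sample of 50

outcome = [random.gauss(0, 1) for _ in range(n)]
catches = []
for j in range(n_predictors):
    predictor = [random.gauss(0, 1) for _ in range(n)]   # unrelated to the outcome
    r = pearson_r(outcome, predictor)
    if abs(r) > r_cutoff:
        catches.append((j, round(r, 2)))

# With 40 unrelated predictors, about two "significant" correlations are expected
# by chance alone -- the feathers that end up in the research paper.
print(catches)
```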
None of it would be possible without computer tools capable of cranking out correlations in a matter of seconds from the garbage that passes for quantitative data. This has transformed science away from what Popper claimed it to be – the falsification of scientific hypotheses. Indeed, there is little economic value in such a pursuit, just as there is little value in a chef deconstructing the food served by other restaurants. What pays off is spending most of one’s energy on cranking out food for thought that the administrators and paying clients will buy. It is little wonder that so few of these results are replicable, since their construction involves a fair amount of chance and data preparation.
We have achieved a lofty goal.
As the proliferation of computers and quantitative analytic tools reaches unprecedented proportions, social science research resembles, more and more closely, an elaborate chase after a wild goose in a thicket of numbers that pass for data. On some great and glorious day, social scientists will reach the object of their physics envy and turn their discipline into what theology was in the Middle Ages: an impressive, logically coherent intellectual edifice whose empirical relevance and predictive power are on a par with those of a chimp randomly pushing computer buttons.