Chance News 86
Quotations
"Asymptotically we are all dead."
Submitted by Paul Alper
"To err is human, to forgive divine but to include errors in your design is statistical."
Submitted by Bill Peterson
"The only statistical test one ever needs is the IOTT or 'interocular trauma test.' The result just hits one between the eyes. If one needs any more statistical analysis, one should be working harder to control sources of error, or perhaps studying something else entirely."
Krantz is describing how some psychologists view statistical testing. On the same page he describes another viewpoint:
- "Nothing is due to chance. This is the Freudian stance...but fortunately, it has little support among researchers."
Submitted by Paul Alper
Forsooth
Gaydar
The science of ‘gaydar’
by Joshua A. Tabak and Vivian Zayas, New York Times, 3 June 2012
According to internetslang.com, GAYDAR means the "ability to sense a homosexual."
In their NYT article, Tabak and Zayas write
Should you trust your gaydar in everyday life? Probably not. In our experiments, average gaydar judgment accuracy was only in the 60 percent range. This demonstrates gaydar ability — which is far from judgment proficiency. But is gaydar real? Absolutely.
Their complete research paper is at PLoS One; in it, subjects viewed facial photographs of gay and straight individuals and then had a short time to decide each person's sexual orientation. In their first experiment, the faces were shown upright only:
Twenty-four University of Washington students (19 women; age range = 18–22 years) participated in exchange for extra course credit. Data from seven additional participants were excluded from analyses due to failure to follow instructions (n = 4) or computer malfunction (n = 3).
In the second experiment, faces were shown upright and upside-down and the subjects had a short time to decide the sexual orientation:
One hundred twenty-nine University of Washington students (92 women; age range = 18–25 years) participated in exchange for extra course credit. Data from 16 additional participants were excluded from analyses due to failure to follow instructions (n = 12) or average reaction times more than 3 SD above the mean (n = 4).
According to the authors, there are two components of "accuracy": the hit rate, "the proportion of gay faces correctly perceived as gay," and the false alarm rate, "the proportion of straight faces incorrectly perceived as gay." The figure reproduced below (full version here) indicates that accuracy was better when the target gender was women rather than men, and better when the faces were upright rather than upside down. Presumably, random guessing would produce an "accuracy" of about .5.
- Figure 3. Accuracy of detecting sexual orientation from upright and upside-down faces (Experiment 2).
- Mean accuracy (A′) in judging sexual orientation from faces presented for 50 milliseconds as a function of the target’s gender and spatial orientation (upright or upside-down; Experiment 2). Judgments of upright faces are based on both configural and featural processing, whereas judgments of upside-down faces are based only on featural face processing. Error bars represent ±1 SEM.
Reproduced below are the detailed results for the second experiment:
- Table 1. Hit and False Alarm Rates for Snap Judgments of Sexual Orientation in Experiment 2.
The authors conclude with the following statement:
The present research is the first to demonstrate (a) that configural face processing significantly contributes to perception of sexual orientation, and (b) that sexual orientation is inferred more easily from women’s vs. men’s faces. In light of these findings, it is interesting to note the popular desire to learn to read faces like books. Considering how challenging it is to read a book upside-down, it seems that we read faces better than we read books.
Discussion
1. Gaydar "accuracy" seems to be defined in the paper as hit rate / (hit rate + false alarm rate) or, to use terms common in medical tests, positive predictive value = true positive / (true positive + false positive). The paper makes no mention of negative predictive value = true negative / (true negative + false negative). As is illustrated in the Wikipedia article, legitimate medical tests will tolerate a low positive predictive value because a more expensive test exists in the rare case that the disease is actually present; negative predictive values must be high to avoid potentially deadly false optimism. The situation here is somewhat different because the subjects were exposed to an approximately equal number of gays and straights whereas in medical tests, most people in the population do not have the “disease.”
2. Perhaps the analogy with medical testing is inappropriate. That is, an error is an error and no distinction should be made between the two types of errors. Consider the above table for the case of women and upright spatial orientation. The hit rate is .36 and the false alarm rate is .22. If we assume that the 67 subjects viewed 100 gay faces and 100 straight faces, then we obtain the following table for average values:
| Sexual orientation | Judged gay (+) | Judged straight (−) | Total |
|---|---|---|---|
| Gay | 36 | 64 | 100 |
| Straight | 22 | 78 | 100 |
| Total | 58 | 142 | 200 |
This leads to Prob(success) = (36 + 78)/200 = .57 and Prob(error) = (64 + 22)/200 = .43. In effect, the model could be hidden tosses of a coin, with the subjects, in an ESP fashion, guessing heads or tails before each toss. Of course, a Bayesian would then assume a prior distribution and combine it with the results of the study to obtain a posterior probability, avoiding any mention of a p-value based on .5 as the null.
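The table arithmetic, along with the predictive values from Discussion #1, can be checked with a short calculation (a sketch; the 100/100 split of gay and straight faces is the assumption made in the text):

```python
hit_rate = 0.36           # P(judged gay | gay), women / upright condition
false_alarm_rate = 0.22   # P(judged gay | straight)
n_gay = n_straight = 100  # assumed equal split of faces, as in the text

gay_plus = hit_rate * n_gay                    # 36 hits
gay_minus = n_gay - gay_plus                   # 64 misses
straight_plus = false_alarm_rate * n_straight  # 22 false alarms
straight_minus = n_straight - straight_plus    # 78 correct rejections

# Overall success and error probabilities, treating both error types alike
p_success = (gay_plus + straight_minus) / (n_gay + n_straight)
p_error = (gay_minus + straight_plus) / (n_gay + n_straight)
print(p_success, p_error)  # 0.57 0.43

# Medical-testing analogues from Discussion #1
ppv = gay_plus / (gay_plus + straight_plus)          # 36/58, about 0.62
npv = straight_minus / (gay_minus + straight_minus)  # 78/142, about 0.55
print(round(ppv, 2), round(npv, 2))  # 0.62 0.55
```

Note that both predictive values are only modestly above the .5 expected from random guessing with this balanced design.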
3. Why might the following be a useful way of assessing gaydar via a modification of the procedure used in the article? Repeat the second experiment with no gays.
4. Why might the following be a useful way of assessing gaydar via a modification of the procedure used in the article? Repeat the second experiment with no straights.
5. In case the reader feels that Discussion #3 and #4 are deceptive, see this Psychwiki which looks at the history of deception in social psychology:
deception can often be seen in the “cover story” for the study, which provides the participant with a justification for the procedures and measures used. The ultimate goal of using deception in research is to ensure that the behaviors or reactions observed in a controlled laboratory setting are as close to those behaviors and reactions that occur outside of the laboratory setting.
Submitted by Paul Alper
Coin experiments
“The Pleasures of Suffering for Science”, The Wall Street Journal, June 8, 2012
This story, about scientists experimenting on themselves, included a reference to a coin-tossing experiment:
Even mathematics offers an example of physical self-sacrifice, through repetitive stress injury. University of Georgia professor Pope R. Hill flipped a coin 100,000 times to prove that heads and tails would come up an approximately equal number of times. The experiment lasted a year. He fell sick but completed the count, though he had to enlist the aid of an assistant near the end.
A Google search for Prof. Hill turned up the following story at the “Weird Science” website:
If you repeatedly flip a coin, the law of probability states that approximately half the time you should get heads and half the time tails. But does this law hold true in practice?
Pope R. Hill, a professor at the University of Georgia during the 1930s, wanted to find out. But he thought coin-flipping was too imprecise a measurement, since any one coin might be imbalanced, causing it to favor heads or tails.
Instead, he filled a can with 200 pennies. Half were dated 1919, half dated 1920. He shook up the can, withdrew a coin, and recorded its date. Then he returned the coin to the can. He repeated this procedure 100,000 times!
Of the 100,000 draws, 50,145 came out 1920. 49,855 came out 1919. Hill concluded that the law of half and half does work out in practice.
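Since the two counts must sum to 100,000, they necessarily miss 50,000 by the same amount, so the real question is whether a deviation of 145 is typical. A quick simulation sketch, under the assumption of fair 50/50 draws:

```python
import random

# Simulate Hill's can experiment: draws with replacement from a can holding
# equal numbers of 1919 and 1920 pennies, so each draw is 1920 with probability 1/2.
random.seed(1)

def count_1920(n_draws=100_000):
    """Number of draws (out of n_draws) that come out 1920.

    Each random bit decides 1919 vs 1920; counting the 1 bits counts the 1920s.
    """
    return bin(random.getrandbits(n_draws)).count("1")

# Repeat the whole experiment to see how far from 50,000 the count typically lands.
# Theory: the standard deviation is sqrt(100000 * 0.5 * 0.5), about 158, so
# Hill's deviation of 145 is entirely ordinary.
deviations = [abs(count_1920() - 50_000) for _ in range(200)]
print(sorted(deviations)[len(deviations) // 2])  # median deviation; theory puts it near 107
```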
Discussion
1. Do you think that drawing a single coin from among 1919 and 1920 coins - even in a perfectly shaken can - would solve the problem of potential imbalance between heads and tails on any single coin toss? Can you think of any possible imbalance in the can-drawing procedure itself?
2. In the second story, there is a remarkable relationship between Hill’s final counts. What questions, if any, might it raise in your mind about the experiment?
3. Which is the more accurate expectation from a coin-tossing experiment: (a) “heads and tails would come up an approximately equal number of times” (first story) or (b) “approximately half the time you should get heads and half the time tails” (second story)?
Submitted by Margaret Cibes
New presidential poll may be outlier
“Bloomberg Poll Shows Big But Questionable Obama Lead”
Huffington Post, June 20, 2012
A Bloomberg News national poll shows Obama leading his Republican challenger by a “surprisingly large margin of 53 to 40 percent,” instead of the (at most) single-digit margin shown in other recent polls.
While a Bloomberg representative expressed the same surprise as others, she stated that this result is based on a sample with the same demographics as its previous polls and on its usual methodology.
The article’s author states:
The most likely possibility is that this poll simply represents a statistical outlier. Yes, with a 3 percent margin of error, its Obama advantage of 53 to 40 percent is significantly different than the low single-digit lead suggested by the polling averages. However, that margin of error assumes a 95 percent level of confidence, which in simpler language means that one poll estimate in 20 will fall outside the margin of error by chance alone.
See Bloomberg’s report about the poll here.
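The "3 percent margin of error" quoted above matches the standard worst-case formula for a 95% confidence interval. A minimal sketch (the sample size of 1,000 is an illustrative assumption, not taken from the article):

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) for an illustrative national sample of n = 1,000
moe = margin_of_error(0.5, 1000)
print(round(100 * moe, 1))  # -> 3.1 percentage points
```

As the article notes, the 95% confidence level means that about one poll in twenty will land outside this margin by chance alone.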
Submitted by Margaret Cibes
Further discussion from FiveThirtyEight
Outlier polls are no substitute for news
by Nate Silver, FiveThirtyEight blog, New York Times, 20 June 2012
Silver identifies two options for dealing with such a poll, which a number of news sources have described as an "outlier." One could simply choose to disregard it, or else "include it in some sort of average and then get on with your life." He follows with this excellent advice:
My general view...is that you should not throw out data without a good reason. If cherry-picking the two or three data points that you like the most is a sin of the first order, disregarding the two or three data points that you like the least will lead to many of the same problems.
In the case of the Bloomberg poll, because the organization has a good record on accuracy, he has chosen to include it in the overall average of poll results that he uses for FiveThirtyEight forecasts.
A description of one of the further adjustments that Silver makes in his model can be found in his later post Calculating ‘house effects’ of polling firms (22 June 2012). Silver explains that what is often interpreted as movement in public opinion between two different polls can instead reflect systematic tendencies of polling organizations to favor either Democratic or Republican candidates. Reproduced below is a chart from the post that shows the size and direction of this house effect for some major organizations:
As described there, "The house effect adjustment is calculated by applying a regression analysis that compares the results of different polling firms’ surveys in the same states...The regression analysis makes these comparisons across all combinations of polling firms and states, and comes up with an overall estimate of the house effect as a result." Looking at the table, it is interesting to note that these effects are comparable to the stated margin of sampling error for typical national polls.
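The idea behind the house-effect estimate can be illustrated with a toy simulation (a sketch only, not Silver's actual model; the firm names, state margins, and noise level are all made up): each poll's margin is a state baseline plus the firm's house effect plus noise, and comparing firms' surveys of the same states separates the two components.

```python
import random

random.seed(0)

states = {"A": 2.0, "B": -4.0, "C": 7.0}  # hypothetical true margins by state
firms = {"X": 1.5, "Y": -2.0, "Z": 0.5}   # hypothetical house effects (sum to zero)

polls = []  # (firm, state, observed margin)
for firm, house in firms.items():
    for state, base in states.items():
        for _ in range(30):  # each firm polls each state repeatedly
            polls.append((firm, state, base + house + random.gauss(0, 3)))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# With every firm polling every state equally often, the all-firm average in a
# state estimates the state's baseline, and each firm's average gap from that
# baseline estimates its house effect (a simple stand-in for the regression
# across all firm/state combinations described in the post).
state_avg = {s: mean(m for _, st, m in polls if st == s) for s in states}
house_est = {f: mean(m - state_avg[st] for fm, st, m in polls if fm == f)
             for f in firms}

for f in firms:
    print(f, round(house_est[f], 2))  # estimates land near 1.5, -2.0, 0.5
```

Note that the recovered effects are on the same one-to-two point scale as the margin of sampling error, which is the comparison made in the text above.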
Submitted by Bill Peterson
Rock-paper-scissors in Texas elections
“Elections are a Crap Shoot in Texas, Where a Roll of the Dice Can Win”
by Nathan Koppel, The Wall Street Journal, June 19, 2012
The state of Texas permits tied candidates to agree to “settle the matter by a game of chance.” The article describes instances of candidates using a die or a coin to decide an election.
In one case, “leaving nothing to chance, the city attorney drafted a three-page agreement ahead of time detailing how the flip would be conducted.”
However, not any game is permitted:
Tonya Roberts, city secretary for Rice … consulted the Texas secretary of state's office after a city-council race ended last month in a 25-25 tie. She asked whether the race could be settled with a game of "rock, paper, scissors" but was told no. "I guess some people do consider that a game of skill," she said.
For some suggested strategies for winning this game, see “How to Win at Rock, Paper, Scissors” in wikiHow, and/or “To win at rock-paper-scissors, put on a blindfold”, in Discover Magazine.
Discussion
Assume that the use of the rock-paper-scissors game had not been suggested by one of the Rice candidates, who might have been an experienced player. Do you think that a one-time play of this game, between random strangers, could have been considered a game of chance? Why or why not?
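For the discussion question, it may help to note that if both players choose their moves uniformly at random, every outcome is equally likely regardless of skill. A short enumeration sketch:

```python
from itertools import product

moves = ["rock", "paper", "scissors"]
beats = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

# Enumerate all 9 equally likely move pairs when both players choose at random
outcomes = {"win": 0, "lose": 0, "tie": 0}
for a, b in product(moves, moves):
    if a == b:
        outcomes["tie"] += 1
    elif beats[a] == b:
        outcomes["win"] += 1
    else:
        outcomes["lose"] += 1

print(outcomes)  # {'win': 3, 'lose': 3, 'tie': 3}
```

The catch, of course, is that real players do not randomize, which is exactly why the secretary of state's office could regard it as a game of skill.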
Submitted by Margaret Cibes
NSF may stop funding a “soft” science
“Political Scientists Are Lousy Forecasters”
by Jacqueline Stevens, The New York Times, June 23, 2012
A Northwestern University political science professor has written an op-ed piece responding to a House-passed amendment that would eliminate NSF grants to political scientists. To date the Senate has not voted on the bill.
She provides several anecdotes about political scientists having made incorrect predictions and states that she is “sympathetic with the [group] behind this amendment.” She feels that:
[T]he government — disproportionately — supports research that is amenable to statistical analyses and models even though everyone knows the clean equations mask messy realities that contrived data sets and assumptions don’t, and can’t, capture. …. It’s an open secret in my discipline: in terms of accurate political predictions …, my colleagues have failed spectacularly and wasted colossal amounts of time and money. …. Many of today’s peer-reviewed studies offer trivial confirmations of the obvious and policy documents filled with egregious, dangerous errors. ….I look forward to seeing what happens to my discipline and politics more generally once we stop mistaking probability studies and statistical significance for knowledge.
Discussion
1. The author makes a number of categorical statements based on anecdotal evidence. Could her conclusions about political science research be an example of the “availability heuristic/fallacy”?
2. Do you think that the problems the author identifies are limited to, or at least more common in, the area of political science than in the other "soft," or even any "hard," sciences? What information would you need in order to confirm/reject your opinion?
(Disclosure: The submitter's spouse is a political scientist, whose Ph.D. program, including stats, was entirely funded by a government act (National Defense Education Act), but who is also skeptical about some social science research.)
Submitted by Margaret Cibes at the suggestion of James Greenwood