Chance News 84

Quotations

"There's a saying in the San Francisco Bay Area: There are lies, damned lies, and next bus."

reported in Get onboard: It's time to stop hating the bus, Talk of the Nation, NPR, 29 March 2012

Submitted by Jerry Grossman

"Skepticism enjoins scientists — in fact all of us — to suspend belief until strong evidence is forthcoming, but this tentativeness is no match for the certainty of ideologues and seems to suggest to many the absurd idea that all opinions are equally valid. ...What else explains the seemingly equal weight accorded to the statements of entertainers and biological researchers on childhood vaccines?..."

--John Allen Paulos, in Why don’t Americans elect scientists?, New York Times, 13 February 2012

Submitted by Bill Peterson

From Significance magazine, February 2012:

“Diaconis and Mosteller … introduced an adage that they called the law of truly large numbers: with a large enough sample, almost anything outrageous will happen.”

“What a coincidence? It’s not as unlikely as you think”

“[After a drug trial] no one in the world wants to know what the chance is that, for this experimental group, A was better than B …. We know exactly what the chance is, because the event has already happened. …. The probability that A was better, given this evidence, is either 1 or 0. …. What the drug company actually wants to know is, given the evidence from the trial, and possibly given other evidence about the two drugs, what are the chances that a greater proportion of future patients will get better if they take A instead of B? …. [After a hypothesis test] the civilian assumes that his original question has been answered and that the probability that A is better is high. But he should not believe this, because of course it might not be true. The p-value can be small, as small as you like, and the probability that A is better could still be low.”

“Why do statisticians answer silly questions that no one ever asks”

Submitted by Margaret Cibes
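
The point of the last quotation, that a small p-value does not by itself make it probable that A is better, can be illustrated with Bayes' rule. The sketch below is a toy under stated assumptions, not anything taken from the Significance article: the prior, power and significance level are hypothetical inputs, and it conditions on "the trial was significant at level alpha" rather than on an exact p-value.

```python
def prob_a_better_given_significant(prior, power, alpha):
    """Bayes' rule: P(A truly better | trial came out significant).

    prior -- assumed probability, before the trial, that A is truly better
    power -- assumed P(significant result | A truly better)
    alpha -- assumed P(significant result | A not better)
    """
    true_positive = power * prior
    false_positive = alpha * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs: only 1 drug in 100 like A is truly better.
posterior = prob_a_better_given_significant(prior=0.01, power=0.8, alpha=0.05)
print(f"P(A better | significant) = {posterior:.3f}")
```

With these made-up inputs the posterior stays well below one half even though the result is significant at the 5% level; a more optimistic prior would change the answer.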

Forsooth

“Let's start with T. Rowe Price's U.S. stock fund lineup. I have plugged in 15 of its largest actively managed U.S. equity funds. Let's start at the top with T. Rowe Price Blue Chip Growth. [Note] that T. Rowe Price Growth Stock is a tight fit with a 1.00 correlation--the highest it can get. So, we know that owning those two large-growth funds is rather redundant. [Note] that Small-Cap Value has the lowest correlation at 0.91--thus, it's a good choice for diversification purposes.”

“Fun With Mutual Correlation Matrices”, Morningstar, April 2, 2012

Submitted by Margaret Cibes
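
A rough way to see how little diversification a correlation of 0.91 buys: for two funds held in equal weights, the portfolio's volatility relative to a single fund depends only on the correlation if both funds are assumed to have the same volatility. That equal-volatility assumption and the 50/50 weighting are ours, not Morningstar's; the function name is hypothetical.

```python
import math


def portfolio_vol_ratio(rho, weight=0.5):
    """Volatility of a two-fund portfolio divided by the volatility of one fund.

    Assumes both funds have the same volatility, so it cancels out.
    rho    -- correlation between the two funds' returns
    weight -- fraction of the portfolio held in the first fund
    """
    w = weight
    variance_ratio = w**2 + (1 - w) ** 2 + 2 * w * (1 - w) * rho
    return math.sqrt(variance_ratio)


for rho in (1.00, 0.91, 0.0):
    print(f"correlation {rho:.2f}: relative volatility {portfolio_vol_ratio(rho):.3f}")
```

Under these assumptions a correlation of 0.91 lowers volatility by only a few percent relative to holding one fund, while uncorrelated funds would lower it by roughly 29 percent.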

“This past week, Obama sports a 53-44 fav[orable] / unfav[orable] rating among women, but 43-54 among men. That's a whopping 20-point gender gap. .... In 2008, President Barack Obama won women 56-43, while narrowly edging out John McCain among men 49-48. That 12-point gender gap appeared massive at the time, but it appears that we're headed toward an even bigger margin in 2012.”

“Republicans face massive gender deficit”, Daily KOS, March 23, 2012

Submitted by Margaret Cibes
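
The quotation does not say how its "gender gap" figures were computed. The sketch below checks two possible readings against the numbers quoted: the difference between the two groups' margins, and the difference between the two groups' levels of support. The function names are ours. Note also that the 2012 figures are favorability ratings while the 2008 figures are vote shares.

```python
def margin(support, oppose):
    """Support minus opposition within one group, in percentage points."""
    return support - oppose


def gap_in_margins(women, men):
    """Women's margin minus men's margin."""
    return margin(*women) - margin(*men)


def gap_in_support(women, men):
    """Women's support minus men's support."""
    return women[0] - men[0]


# (support, oppose) pairs as quoted in the article.
fav_2012 = {"women": (53, 44), "men": (43, 54)}
vote_2008 = {"women": (56, 43), "men": (49, 48)}

for label, data in (("2012 favorability", fav_2012), ("2008 vote", vote_2008)):
    print(
        label,
        "| gap in margins:", gap_in_margins(data["women"], data["men"]),
        "| gap in support:", gap_in_support(data["women"], data["men"]),
    )
```

The quoted 20 and 12 match the difference-of-margins reading; the difference in support alone gives 10 and 7.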

From Significance magazine, February 2012:

“It is ridiculous to believe that the legal profession has to get involved with the complex parts of statistics. The statistics of gambling uses the frequency approach to statistics, and that is straightforward. The reasoning involved with Bayes is more complex and we cannot expect juries to accept it.”

Letter to the editor

“Strangely, DNA experts are required to give probabilities for their evidence of matching; fingerprint expert[s] are forbidden to.”

“Fingerprints at the crime-scene”

“We cannot increase the probability of winning a [lottery] prize as this is fixed. However, we can increase the amount we can expect to win if we do strike [it] lucky …. [C]hoosing popular combinations of numbers decreases the expected value of your ticket as you have a higher probability of having to share your prize if you win. …. [W]hen you select your numbers, you may as well try to choose a less popular combination that might increase your expected value. Unfortunately, it is not possible to determine what exactly the unpopular combinations are: there is not enough data.”

“Playing the lottery with a little bit of stats know-how…”

Submitted by Margaret Cibes

Chocolate hype

“Association Between More Frequent Chocolate Consumption and Lower Body Mass Index”
Archives of Internal Medicine, March 26, 2012

This report (full text not yet available online) was published as a “research letter.” It involved about 1000 Southern Californians, who were surveyed about their eating and drinking habits and whose BMIs were computed. Funded by the National Institutes of Health, the study found that, among this group, people who ate chocolate more frequently had lower BMIs than less frequent chocolate eaters. The authors, in the “research letter” at least, apparently did not claim any cause-and-effect result and indicated that a controlled study would be needed before jumping to any such conclusion.

The Knight Science Journalism Tracker (subtitled “Peer review within science journalism”) presented a nice critique of press accounts of the “research letter” in “Eat Chocolate. Lose Weight. Yeah, right.”

Before I became aware of the Knight project, I had tracked down a number of press accounts of the study and was encouraged to find it appropriately described in the articles, if not in the headlines, as non-definitive. (This is not to say that all reports were accurate or complete with respect to other aspects of the study.) Comparing media reports to original study reports might be a good exercise for a stats class, when a similarly enticing topic presents itself in the news, and when the original study report, or even an abstract, is available for comparison.

(a) The New York Times[1]: “Dietary studies can be unreliable, ... and it is difficult to pinpoint cause and effect.”
(b) BBC News[2]: “But the findings only suggest a link - not proof that one factor causes the other.”
(c) TODAY[3]: “The researchers only found an association, not a direct cause-effect link.”
(d) The Times of India[4]: “[T]he reasons behind this link between chocolate consumption and weight loss remain unclear.”
(e) Reuters[5]: “Researchers said the findings … don't prove that adding a candy bar to your daily diet will help you shed pounds.”
(f) Poughkeepsie Journal[6]: “The study is limited. It was observational, ... rather than a controlled trial ....”
(g) The Wall Street Journal[7]: “[B]efore people hoping to lose weight indulge in an extra scoop of chocolate fudge swirl, the researchers caution that the study doesn't prove a link between frequent chocolate munching and weight loss ….”

Discussion

1. According to Knight, the original purpose of the project was to study “non-cardiac effects on statin drugs,” not chocolate consumption. Should this information have been reported in the articles? Why or why not?
2. Do you think that the size of the chocolate-consuming group was the same as the size of the entire group of respondents? Why would you want to know the “sample” size behind the chocolate results?
3. How would you monitor a controlled study involving a group of people who were instructed to eat chocolate occasionally and another group who were instructed to eat it more frequently?

Submitted by Margaret Cibes

"Hangman" and conditional probability

A better strategy for hangman
by Nick Berry, lifehacker blog, 5 April 2012

If we order the 26 letters by their frequency of occurrence in English, we get:

ETAOIN SHRDLU CMFWYP VBGKQJ XZ

But does it follow that this is the right order for guessing your first letter in a game of Hangman? A better strategy, Berry argues, is to condition on the length of the word, information that we have at the start of the game. He develops the following table:

Number of letters	Optimal calling order
1	A I
2	A O E I U M B H
3	A E O I U Y H B C K
4	A E O I U Y S B F
5	S E A O I U Y H
6	E A I O U S Y
7	E I A O U S
8	E I A O U
9	E I A O U
10	E I A O U
11	E I A O D
12	E I A O F
13	I E O A
14	I E O
15	I E A
16	I E H
17	I E R
18	I E A
19	I E A
20	I E

He has some interesting conclusions based on the above conditional probabilities:

The most challenging (least deterministically obvious) words to guess are three letter words. It can take up to ten guesses before getting a letter on the board!
With fewer than three letters, it gets easier (there are fewer possible words), and with more than three letters it becomes less likely there will be any words that you cannot find a letter for quickly.
For five letter words, the best first guess is the letter S. This is the only time a consonant is the most likely first guess letter.
For four letter words, the first non-vowel guess is an S, followed by B and then F (remember, these are only called if all preceding letters have failed to hit).
No row contains more than ten guesses, and since a Hangman game takes eleven fails to lose, it is impossible to come up with any English letter word that will fail at Hangman without a single letter appearing on the board (assuming the optimal search strategy above is followed).
A should only be your first guess if the word length is four or fewer letters. If five letters, go for S first. Between six and twelve letters try E and above that you should call I.
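
A calling order of this kind can be sketched as a greedy search: among the words of the given length, call the letter that appears in the most words; if it misses, discard every word containing it and repeat. This is our reading of "only called if all preceding letters have failed to hit," not necessarily Berry's exact procedure, and the ten-word list below is a made-up stand-in for a real dictionary.

```python
from collections import Counter


def calling_order(words, length):
    """Greedy first-hit calling order for words of a given length.

    Each step picks the letter contained in the most remaining words
    (ties broken alphabetically), then drops every word that letter hits.
    """
    remaining = [w.upper() for w in words if len(w) == length]
    order = []
    while remaining:
        counts = Counter()
        for word in remaining:
            counts.update(set(word))  # count each letter once per word
        best = min(counts, key=lambda ch: (-counts[ch], ch))
        order.append(best)
        # Keep only the words on which every call so far has missed.
        remaining = [w for w in remaining if best not in w]
    return order


# Hypothetical toy dictionary; a real run would use a full word list.
TOY_WORDS = ["cat", "dog", "sun", "sky", "bat", "hat", "pig", "owl", "egg", "ivy"]
print(calling_order(TOY_WORDS, 3))
```

The length of the returned list is the worst-case number of calls needed to get a first letter on the board for that word length, which is the quantity compared with the eleven allowed fails above.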

How will this work in practice? Berry notes that "Battle plans are excellent up until the first shot is fired!" Indeed, as Andrew Gelman points out in his blog post, Hangman tips (4 April 2012), if you knew your opponent was guessing according to these probabilities, you could adjust your word selection.

Discussion

1. The sequence

ETAOIN SHRDLU CMFWYP VBGKQJ XZ

comes from imagining a very large bucket that contains every word used in a very large English book, repeats included, so that common words such as "and," "or," "but," "down," "I," "me," "birth" and so on appear very frequently, while words such as "hyperventilate" or "colonoscopy" appear less frequently. However, if one thinks of a large bucket that contains all the words in the English language, with each word appearing only once, then the order of letter frequency, according to Berry, is

ESIARN TOLCDU PMGHBY FVKWZX QJ

Now, if we are told the specific length of a word, we can ignore the words of any other length in this second bucket, and this leads to the above table.

2. The sequence

ETAOIN SHRDLU

is beloved by the cognoscenti because of its connection to the printing trade (en.wikipedia.org/wiki/ETAOIN_SHRDLU).
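
The two buckets described in item 1 can be sketched as two ways of counting letters in the same text: once over every word as it occurs, and once over the set of distinct words. The toy sentence and function names below are ours, chosen only to show that the two counts can rank letters differently.

```python
from collections import Counter


def token_letter_counts(text):
    """Letter counts with every occurrence of every word included."""
    words = text.upper().split()
    return Counter(ch for w in words for ch in w if ch.isalpha())


def type_letter_counts(text):
    """Letter counts with each distinct word included only once."""
    vocabulary = set(text.upper().split())
    return Counter(ch for w in vocabulary for ch in w if ch.isalpha())


def by_frequency(counts):
    """Letters from most to least frequent, ties broken alphabetically."""
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


# Hypothetical toy text; "the" and "and" repeat, as common words do.
TOY_TEXT = "the cat and the hat and the bat"
token_counts = token_letter_counts(TOY_TEXT)
type_counts = type_letter_counts(TOY_TEXT)
print("every occurrence:", "".join(by_frequency(token_counts)))
print("distinct words:  ", "".join(by_frequency(type_counts)))
```

In this toy text the repeated "the" pushes T, H and E up the first ordering, much as common words shape the ETAOIN SHRDLU sequence. Restricting the second count to words of one length gives the starting point for the table.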

Submitted by Paul Alper
