Chance News 22
Quotation
I think you're begging the question, said Haydock, and I can see looming ahead one of those terrible exercises in probability where six men have white hats and six men have black hats and you have to work it out by mathematics how likely it is that the hats will get mixed up and in what proportion. If you start thinking about things like that, you would go round the bend. Let me assure you of that!
Agatha Christie
From the Probability Web Quotations
The Mirror Crack'd
Forsooth
The following Forsooths are from the November 2006 RSS NEWS.
At St John's Wood station alone, the number of CCTV cameras has jumped from 20 to 57, an increase of 300 per cent.
Metro
3 May 2006
Now 78% of female veterinary medicine students are women, almost a complete turn-around from the previous situation.
The Herald (Glasgow)
4 May 2006
Drought to ravage half the world within 100 years
Half the world's surface will be gripped by drought by the end of the century, the Met Office said yesterday.
Times online
6 October 2006
Estimating the diversity of dinosaurs
Proceedings of the National Academy of Sciences
Published online before print September 5, 2006
Steve C. Wang, and Peter Dodson
Fossil hunters told: Dig deeper
Philadelphia Inquirer, September 5, 2006
Tom Avril
Steve Wang is a statistician at Swarthmore College and Peter Dodson is a paleontologist at the University of Pennsylvania. Their study was widely reported in the media. You can find references to the media coverage and comments by Steve here.
A few definitions might be helpful: genera (singular genus) are groups of like species; nonavian means not derived from birds; fossiliferous means containing fossils; and a rock outcrop is the part of a rock formation that appears above the surface of the surrounding land. In their paper the authors provided the following description of their results.
Despite current interest in estimating the diversity of fossil and extant groups, little effort has been devoted to estimating the diversity of dinosaurs. Here we estimate the diversity of nonavian dinosaurs at 1,850 genera, including those that remain to be discovered. With 527 genera currently described, at least 71% of dinosaur genera thus remain unknown. Although known diversity declined in the last stage of the Cretaceous, estimated diversity was steady, suggesting that dinosaurs as a whole were not in decline in the 10 million years before their ultimate extinction. We also show that known diversity is biased by the availability of fossiliferous rock outcrop. Finally, by using a logistic model, we predict that 75% of discoverable genera will be known within 60-100 years and 90% within 100-140 years. Because of nonrandom factors affecting the process of fossil discovery (which preclude the possibility of computing realistic confidence bounds), our estimate of diversity is likely to be a lower bound.
In this problem we have a sample of dinosaurs that lived on the earth. These dinosaurs are classified into groups called genera. We can count the number of specimens of each genus in our sample. From this we want to estimate the total number of dinosaur genera that have roamed the earth. Many different methods for doing this have been developed, and the authors of this study use one of the newer methods. We have discussed other examples of this problem in previous issues of Chance News, and it might help to review these briefly.
One of the first statistical studies of species was carried out by R.A. Fisher and illustrated in terms of determining the number of species of Malayan butterflies. His method is described in the paper 'The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population', R.A. Fisher; A. Steven Corbet; C.B. Williams, The Journal of Animal Ecology, Vol. 12, No. 1, pp. 42-58 (available from JSTOR).
Corbet provided the following data from his sampling of the Malayan butterflies:
n | observed | expected number
1 | 118 | 156.44
2 | 74 | 74.52
3 | 44 | 47.33
4 | 24 | 33.82
5 | 29 | 25.77
6 | 22 | 20.46
7 | 20 | 16.71
8 | 19 | 13.93
9 | 20 | 11.79
10 | 15 | 10.11
11 | 12 | 8.76
12 | 14 | 7.65
13 | 6 | 6.73
14 | 12 | 5.95
15 | 6 | 5.29
16 | 9 | 4.73
17 | 9 | 4.24
18 | 6 | 3.81
19 | 10 | 3.44
20 | 10 | 3.11
21 | 11 | 2.83
22 | 5 | 2.57
23 | 3 | 2.34
24 | 3 | 2.14
In this table n is the number of times a species occurs in the sample. The second column gives the number of species that occur n times in the sample. So we see that 118 species occurred once in the sample, 74 twice and 44 three times. The third column gives the expected number of species that occur n times under Fisher's model, which we will explain next. Thus the expected numbers for n = 1, 2, 3 are 156.44, 74.52 and 47.33.
Fisher's model assumes that the number of times a species occurs in a sample has a Poisson distribution:
<math>P(n) = e^{-m}\frac{m^n}{n!}, \quad n = 0, 1, 2, \ldots</math>
For a given species, m is the expected number of individuals of that species in the sample. Since m is expected to vary among the species, Fisher treats it as a random variable. He chooses a distribution for m that leads to the following expected number of species that appear n times in a random sample:
<math>\frac{\alpha x^n}{n}</math>
Here <math>\alpha</math> and x are parameters. If S is the number of species observed and N the sample size, <math>\alpha</math> and x can be determined as the values that satisfy the following two equations:
<math>S = -\alpha \ln(1-x), \qquad N = \frac{\alpha x}{1-x}</math>
From our data we find that S = 501 and N = 3306. Using these values we find that x = .95268 and <math>\alpha = 164.21.</math> These do not agree with the values obtained by the authors but we believe them to be correct.
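The calculation of S, N, x and <math>\alpha</math> can be sketched in a few lines of Python. This is only an illustrative sketch, not necessarily how the values above were obtained. It assumes the standard log-series relations <math>S = -\alpha \ln(1-x)</math> and <math>N = \alpha x/(1-x)</math>, and it solves for x by bisection. The function names ratio and solve_x are hypothetical helpers introduced here.

```python
import math

# Observed counts from Corbet's table: observed[n-1] is the number of
# species that occurred exactly n times in the sample (n = 1..24).
observed = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14,
            6, 12, 6, 9, 9, 6, 10, 10, 11, 5, 3, 3]

S = sum(observed)                                        # species observed
N = sum(n * k for n, k in enumerate(observed, start=1))  # individuals sampled

def ratio(x):
    """S/N implied by the log-series relations (assumed, see lead-in):
    S = -alpha*ln(1-x) and N = alpha*x/(1-x), so S/N = -(1-x)*ln(1-x)/x."""
    return -(1.0 - x) * math.log(1.0 - x) / x

def solve_x(target, lo=1e-9, hi=1.0 - 1e-12, tol=1e-12):
    """Bisection: ratio(x) decreases from about 1 to about 0 on (0, 1)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ratio(mid) > target:
            lo = mid   # ratio still too large, so x must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

x = solve_x(S / N)
alpha = N * (1.0 - x) / x

# Expected number of species occurring n times under the model: alpha*x^n/n
expected = [alpha * x**n / n for n in range(1, len(observed) + 1)]

print(f"S = {S}, N = {N}, x = {x:.5f}, alpha = {alpha:.2f}")
for n, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{n:2d} | {obs:3d} | {exp:7.2f}")
```

Running the sketch should give values of x and <math>\alpha</math> close to those quoted above, along with the expected-number column of the table.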
Fisher was interested in finding a distribution that could approximate the distribution of the number of times a species occurs in a sample, and the distribution that he proposed has been widely used in species studies. Another interesting question would be: can you estimate the total number of species of Malayan butterflies from a sample? This is the kind of question that Wang and Dodson addressed for dinosaur genera in their study. Among the first to tackle this problem were I. J. Good and G. H. Toulmin in their paper "The Number of New Species, and the Increase in Population Coverage, when a Sample is Increased", Biometrika, Vol. 43, (June, 1956), pp. 45-63.
To be continued
I wasn't making up data, I was imputing!
An Unwelcome Discovery, by Jeneen Interlandi, The New York Times, October 22, 2006.
The New York Times has an informative summary of a recent scandal involving a prominent researcher at the University of Vermont, Eric Poehlman. The Poehlman scandal represents perhaps the biggest case of research fraud in recent history.
He presented fraudulent data in lectures and in published papers, and he used this data to obtain millions of dollars in federal grants from the National Institutes of Health — a crime subject to as many as five years in federal prison.
The first person to speak up about the possibility of fraud in Poehlman's work was one of his research assistants, Walter DeNino.
The fall that DeNino returned to the lab, Poehlman was looking into how fat levels in the blood change with age. DeNino’s task was to compare the levels of lipids, or fats, in two sets of blood samples taken several years apart from a large group of patients. As the patients aged, Poehlman expected, the data would show an increase in low-density lipoprotein (LDL), which deposits cholesterol in arteries, and a decrease in high-density lipoprotein (HDL), which carries it to the liver, where it can be broken down. Poehlman’s hypothesis was not controversial; the idea that lipid levels worsen with age was supported by decades of circumstantial evidence. Poehlman expected to contribute to this body of work by demonstrating the change unequivocally in a clinical study of actual patients over time. But when DeNino ran his first analysis, the data did not support the premise.
When Poehlman saw the unexpected results, he took the electronic file home with him. The following week, Poehlman returned the database to DeNino, explained that he had corrected some mistaken entries and asked DeNino to re-run the statistical analysis. Now the trend was clear: HDL appeared to decrease markedly over time, while LDL increased, exactly as they had hypothesized.
Although DeNino trusted his boss implicitly, the change was too great to be explained by a handful of improperly entered numbers, which was all Poehlman claimed to have fixed. DeNino pulled up the original figures and compared them with the ones Poehlman had just given him. In the initial spreadsheet, many patients showed an increase in HDL from the first visit to the second. In the revised sheet, all patients showed a decrease. Astonished, DeNino read through the data again. Sure enough, the only numbers that hadn’t been changed were the ones that supported his hypothesis.
Poehlman brushed DeNino's concerns aside, so DeNino started asking around and found that other graduate students and postdocs had similar concerns. He got some cautionary advice from a former postdoctoral fellow:
Being associated with either falsified data or a frivolous allegation against a scientist as prominent as Poehlman could end DeNino’s career before it even began.
and from a faculty member who shared lab space with Poehlman, who advised:
If you’re going to do something, make sure you really have the evidence.
So DeNino started looking for the evidence.
DeNino spent the next several evenings combing through hundreds of patients’ records in the lab and university hospital, trying to verify the data contained in Poehlman’s spreadsheets. Each night was worse than the one before. He discovered not only reversed data points, but also figures for measurements that had never been taken and even patients who appeared not to exist at all.
DeNino presented his evidence to the university counsel and the response of Poehlman (to his department chair, Burton Sobel) was rather startling.
The accused scientist gave him the impression that nothing was wrong and seemed mostly annoyed by all the fuss. In his written response to the allegations, Poehlman suggested that the data had gotten out of hand, accumulating numerous errors because of handling by multiple technicians and postdocs over the years. “I found that noncredible, really, for an investigator of Eric’s experience,” Sobel later told the investigative panel. “There had to be a backup copy that was pure,” Sobel reasoned before the panel. “You would not have postdocs and lab techs in charge of discrepant data sets.” But Poehlman told Sobel that there was no master copy.
At the formal hearing, Poehlman had a different defense.
First, he attributed his mistakes to his own self-proclaimed ineptitude with Excel files. Then, when pressed on how fictitious numbers found their way into the spreadsheet he’d given DeNino, Poehlman laid out his most elaborate explanation yet. He had imputed data — that is, he had derived predicted values for measurements using a complicated statistical model. His intention, he said, was to look at hypothetical outcomes that he would later compare to the actual results. He insisted that he never meant for DeNino to analyze the imputed values and had given him the spreadsheet by mistake.
The New York Times article points out how pathetic this attempted explanation was.
Although data can be imputed legitimately in some disciplines, it is generally frowned upon in clinical research, and this explanation came across as hollow and suspicious, especially since Poehlman appeared to have no idea how imputation was done.
A large portion of the article examines how research fraud can occur in a system that is supposed to be self-correcting.
First, the people who are most likely to notice fraud are junior investigators who are subordinate to their research mentor. It's psychologically and emotionally difficult to confront someone who has devoted time to your professional development. Even when an investigator is emotionally willing to confront their mentor, they have their career concerns to worry about.
The principal investigator in a lab has the power to jump-start careers. By writing papers with graduate students and postdocs and using connections to help obtain fellowships and appointments, senior scientists can help their lab workers secure coveted tenure-track jobs. They can also do damage by withholding this support.
Every university will have a system in place to investigate claims of fraud. But there are problems here as well.
All universities that receive public money to conduct research are required to have an integrity officer who ensures compliance with federal guidelines. But policing its scientists can be a heavy burden for a university. “It’s your own faculty, and there’s this idea of supporting and nurturing them,” says Ellen Hyman-Browne, a research-compliance officer at the Children’s Hospital of Philadelphia, a teaching hospital. Moreover, investigations cost time and money, and no institution wants to discover something that could cast a shadow on its reputation.
“There are conflicting influences on a university where they are the co-grantor and responsible to other investigators,” says Stephen Kelly, the Justice Department attorney who prosecuted Poehlman. “For the system to work, the university has to be very ethical.”
Poehlman himself was careful and chose areas where fraud would be especially difficult to detect. He specialized in presenting longitudinal data, which is very expensive to replicate. He also presented research results that confirmed what most researchers had suspected, rather than results that would undermine existing theories of nutrition.
Poehlman was sentenced to one year and one day in federal prison, making him the first researcher to serve time in jail for research fraud.
“When scientists use their skill and their intelligence and their sophistication and their position of trust to do something which puts people at risk, that is extraordinarily serious,” the judge said. “In one way, this is a final lesson that you are offering.”
Questions
1. Do you have experience with a researcher changing the data values after seeing the initial analysis results? What would make you suspicious of fraud?
2. Is the peer-review system of research self-correcting? What changes could be made to this system?
3. When is imputation legitimate and when is it fraudulent?
Submitted by Steve Simon