Chance News 20
Quotations
Like dreams, statistics are a form of wish fulfillment. - Jean Baudrillard
According to a WSJ article criticizing alternative medicine by Dr. Jerome Groopman of the Harvard Medical School, the office wall of Dr. Stephen Straus, who directs NCCAM (formerly the Office of Alternative Medicine, within the National Institutes of Health), displays the following framed quotation: "The plural of anecdote is not evidence." This useful and insightful aphorism appears in various versions, as can be seen here.
Forsooth
"People who live longer have a greater chance of developing cancer in old age." Heard on the "Today" news programme on BBC Radio 4 and reported to the MEDSTATS discussion group by Ted Harding.
A clumsy attempt at anonymization
A Face is Exposed for AOL Searcher No. 4417749. Michael Barbaro and Tom Zeller, Jr. The New York Times (August 9, 2006).
Statisticians frequently deal with confidentiality issues when deciding what type of data and what amount of detail should be withheld to protect sensitive information about individual patients or institutions. It's not an easy task and there are some subtle traps. And sometimes there are not so subtle traps.
At the request of some researchers, America Online (AOL) released data on 20 million web searches performed by 650 thousand AOL users over a three-month span. They released the data not just to those researchers, but to the general public. AOL quickly realized that this was a bad idea and removed the database, but it had already been copied to many locations. It is unlikely that they will ever be able to persuade the web owners at all the other locations to take the files offline.
The data was anonymized by replacing the user name with a random number. This is important, because some of the search terms are for rather sensitive items. Examples of things that people searched on are
- "can you adopt after a suicide attempt" or
- "how to tell your family you're a victim of incest."
But replacing a name by a number did not come even close to anonymizing all of the records. The problem is that people do web searches about things that reveal hints about themselves. Actual searches listed in the database included things like geographic locations:
- "gynecology oncologists in new york city,"
- "orange county california jails inmate information,"
- "employment needed- louisville ky," or
- "salem probate court decisions,"
or places where the searchers shopped or banked or got health care,
- "gerards restaurant in dc,"
- "st. margaret's hospital washington d.c.,"
- "l&n federal credit union," or
- "mustang sally gentlemans club,"
or products that the searchers owned,
- "cheap rims for a ford focus," or
- "how to change brake pads on scion xb,"
or their hobbies,
- "knitting stitches," or
- "texas hold'em poker on line seminars."
It gets even more revealing when people do web searches on their relatives or even themselves.
These individual searches are, according to one report, like individual pieces in a mosaic. Put enough of them together and you can get a really clear picture of who the searcher is. Can you actually identify people from their web searches? The answer is yes.
According to the New York Times report, one user, with the ID number 4417749, searched for
- "landscapers in Lilburn, Ga," and
- "homes sold in shadow lake subdivision gwinnett county georgia,"
as well as the names of several people, all of whose last names were Arnold. It didn't take long for the New York Times to track down a 62-year-old widow named Thelma Arnold.
Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. "I had no idea somebody was looking over my shoulder."
This is an important lesson that statisticians have been aware of for some time. An individual piece of information by itself may not compromise someone's privacy, but will do so when it is combined with other pieces of information. Knowing that someone lives in a small town still preserves anonymity, but when that small town name appears in a database of all pediatric heart transplant cases, you have a problem.
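The point about combining pieces of information can be illustrated with a minimal sketch. The records below are invented for illustration and are not drawn from the AOL data or any real database; the helper smallest_group is a hypothetical name. It reports the size of the smallest group of records that share the same values on a chosen set of fields. A group of size 1 means that one record is unique on those fields, so anyone who knows those facts about a person can single out that person's record.

```python
from collections import Counter

# Hypothetical records: (town, age band, procedure). All values are invented.
records = [
    ("Springfield", "0-9", "heart transplant"),
    ("Springfield", "0-9", "tonsillectomy"),
    ("Springfield", "10-19", "tonsillectomy"),
    ("Smallville", "0-9", "heart transplant"),
    ("Springfield", "0-9", "tonsillectomy"),
    ("Smallville", "10-19", "tonsillectomy"),
]


def smallest_group(rows, fields):
    """Size of the smallest group of rows that agree on the given field positions."""
    counts = Counter(tuple(row[i] for i in fields) for row in rows)
    return min(counts.values())


# Town alone: every record shares its town with at least one other record.
k_town = smallest_group(records, [0])

# Town, age band, and procedure together: at least one record is unique.
k_all = smallest_group(records, [0, 1, 2])

print("smallest group by town:", k_town)
print("smallest group by town + age band + procedure:", k_all)
```

In this toy data the town by itself never isolates a single record, but the three fields together do, which is the same effect as the small-town name appearing in a database of rare cases.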
Questions
1. List some of the other things that people might search on that would potentially reveal their identities.
2. Could this data set be cleaned up to the point where it could be truly thought to be anonymized?
3. Why would a researcher be interested in what people search for on the Internet? What sort of information would be useful for someone in Marketing?
Submitted by Steve Simon
Mean vs. Median
Who's Counting: It's Mean to Ignore the Median
ABCNews.com, 6 August 2006
John Allen Paulos
This latest installment of "Who's Counting" focuses on the distinction between the mean and median. Paulos begins with the familiar example of housing prices, and goes on to discuss the implications for interpreting newly released data on the performance of the US economy for 2004. Republicans point out that the economy grew at a rate of 4.2%, and complain that they are not getting enough credit for the good news. Democrats counter that real median income is falling and poverty is rising. How can both be true? Just as a few expensive houses in a neighborhood can pull the mean substantially above the median, gains by a wealthy few at the top of the income ladder can pull up the mean, even if most people are not benefiting.
To show that this is happening, Paulos cites work on income distribution by economists Thomas Piketty and Emmanuel Saez. According to their calculations, the richest one percent, whose incomes exceed $315,000, gained on average nearly 17% over the year in question. However, the good news did not extend very far down the income distribution. Looking at the top five percent of all incomes, the average gain is described as "minimal." This means that the gains were concentrated near the very top. In fact, even among the top one percent, Piketty and Saez found that half of the income gains went to the top tenth of the group.
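How the mean can rise while the median falls is easy to check with a small made-up example. The ten household incomes below (in thousands of dollars) are invented for illustration and are not taken from the article or from the economists' data. Nine households each lose a little, while the single richest household gains a lot.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars; all numbers are invented.
before = [20, 25, 30, 35, 40, 45, 50, 60, 80, 400]

# One year later: the nine lower incomes each fall by 1, the top income rises by 100.
after = [x - 1 for x in before[:-1]] + [before[-1] + 100]

mean_before, mean_after = mean(before), mean(after)
median_before, median_after = median(before), median(after)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this sketch the mean income goes up even though nine of the ten households are worse off, and the median goes down. Both "the economy grew" and "median income fell" are accurate descriptions of the same numbers.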
Paulos points out that the pattern of the income distribution can be described mathematically in terms of so-called "power laws," which apply to a variety of observed phenomena, including Internet surfing and investing. A general description of power laws from Wikipedia can be found here.
Submitted by Bill Peterson