Sandbox: Difference between revisions

From ChanceWiki

Revision as of 15:39, 4 March 2010

Predicting medal counts

Canada vs. the United States: Who wins?
by Daniel Gross, Slate, 12 February 2010

The article is subtitled "An economist predicts the medal counts for the Vancouver Olympics." The economist is Daniel Johnson of Colorado College. He has developed a model that predicts medal counts for countries from their population, per capita income, climate, political structure and host-nation status. You can read his latest predictions in a press release from the College, which is subtitled "Economics professor’s formula ignores athletes’ skills, yet proves remarkably accurate." Indeed, Johnson's web site notes that:

Beijing 2008 Summer Games had a 0.93 correlation with our predicted results
The Torino 2006 Winter Games had a 0.93 correlation with our predicted results

His record at the 2008 Beijing games was reported in a Wall Street Journal article that summer, entitled Want to Predict Olympic Champs? Look at GDP.

Johnson's bottom line for Vancouver was that Canada would win the most medals, with 27 total, just ahead of the US and Norway with 26 each. So far, however, Canada has not been as successful as many anticipated, including the Canadians themselves, who had adopted the motto "Own the Podium." Nate Silver has been regularly updating medal projections at FiveThirtyEight.com. His 22 February post, which came the day after the US hockey team's upset victory over Canada, is entitled Canada Not Owning the Podium. The New York Times has a data map of the current medal count. Data from past Olympics are also available, and the historical progression can be viewed in a Gapminder-style animation.

DISCUSSION QUESTION
The Slate article concludes by saying, "In a couple of weeks, we'll check back with professor Johnson and see how his model performed this year." If Canada continues to lag, what explanations do you think he might offer?

Update

With the Games now concluded, we can report that the top medal winners in Vancouver were the US (37), Germany (30), Canada (26), and Norway (23). Shown below is a scatterplot of 2010 total medals vs. 2006 total medals, for all countries that won at least one medal in either year. The correlation coefficient is 0.916. By this measure, we would have done well simply predicting that the Vancouver totals would match the Torino totals.

Medals.jpg

DISCUSSION QUESTION
Is the correlation coefficient r an adequate summary for this plot? What does this say about the appropriateness of r as a measure of accuracy for medal predictions?
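
One way to explore this question is to compare r with a direct measure of forecast error. The sketch below is illustrative only: the medal totals are made-up numbers, not actual Olympic results, and the names pearson_r and mean_abs_error are hypothetical helpers introduced for this example. It shows that a forecast that is wrong by a constant factor can still correlate perfectly with the outcomes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean_abs_error(predicted, actual):
    """Average absolute gap between predicted and actual counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical medal totals for six countries (NOT real data).
actual = [30, 20, 10, 5, 2, 1]

# A forecast that is close to the actual totals.
good_forecast = [28, 22, 9, 6, 2, 1]

# A forecast that predicts exactly half of every country's total.
halved_forecast = [a / 2 for a in actual]

r_good = pearson_r(good_forecast, actual)
r_halved = pearson_r(halved_forecast, actual)
mae_good = mean_abs_error(good_forecast, actual)
mae_halved = mean_abs_error(halved_forecast, actual)

print(f"good forecast:   r = {r_good:.3f}, mean abs error = {mae_good:.2f}")
print(f"halved forecast: r = {r_halved:.3f}, mean abs error = {mae_halved:.2f}")
```

In this toy setting the halved forecast has the higher correlation but by far the larger error, because r measures only how closely the points follow some straight line, not whether that line is "predicted = actual." Students might also try removing the one or two largest countries to see how strongly a few high-count points drive r.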

Submitted by Bill Peterson

Census errors

Can you trust Census data?
by Justin Wolfers, New York Times, Freakonomics blog, 2 February 2010

Census Bureau obscured personal data—Too well, some say
by Carl Bialik, Numbers Guy column, Wall Street Journal, 6 February 2010


These stories describe problems with the Census Bureau's IPUMS (Integrated Public Use Microdata Series) data, which provides subsamples of Census data to outside researchers. In order to protect the privacy of citizens, the records are altered slightly. For example, incomes may be rounded and ages may be tweaked by a small amount. Ideally this would make it impossible to identify any particular individual, while at the same time not introducing any important distortion into the overall demographic profile.
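
To make the idea concrete, here is a minimal sketch of this kind of masking applied to synthetic records. It is not the Census Bureau's actual procedure, which the articles do not spell out; the perturbation rule (shift each age by at most 2 years, round incomes to the nearest 1,000), the function names, and the data are all assumptions chosen for illustration.

```python
import random


def perturb_age(age, rng, max_shift=2):
    """Shift an age by a small random whole number of years (hypothetical rule)."""
    return max(0, age + rng.randint(-max_shift, max_shift))


def round_income(income, nearest=1000):
    """Round an income to the nearest multiple of `nearest` (hypothetical rule)."""
    return int(round(income / nearest)) * nearest


def mean_age(rows):
    """Average age over a list of (age, income) records."""
    return sum(age for age, _ in rows) / len(rows)


def count_65_plus(rows):
    """Number of records with age 65 or above."""
    return sum(1 for age, _ in rows if age >= 65)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic (age, income) records; these are NOT Census data.
records = [(rng.randint(18, 90), rng.randint(10_000, 150_000))
           for _ in range(10_000)]

# Masked copy: each record is altered slightly, as described above.
masked = [(perturb_age(age, rng), round_income(income))
          for age, income in records]

print("mean age, original vs masked:", mean_age(records), mean_age(masked))
print("count 65+, original vs masked:", count_65_plus(records), count_65_plus(masked))
```

When the perturbation is symmetric and small, as assumed here, individual records change but aggregates such as the mean age should move very little. That is the ideal against which the reported distortions can be judged.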

Unfortunately, it appears that serious distortions have resulted. A recent NBER working paper details the problems, which seem to be especially pronounced in data for ages 65 and above. The Freakonomics post reproduces the following graph from the paper

http://graphics8.nytimes.com/images/2010/02/02/opinion/Census-Chart/blogSpan.jpg

showing how total population estimates based on the microdata diverge from the actual Census counts for older Americans. Breakdowns within particular age groups are also distorted. For example, the Wall Street Journal article has an interactive graphic revealing how data released in 2006 showed inexplicable fluctuations from one year of age to the next in the percentage of women who were married (those errors were corrected in 2007).

As Bialik notes, "The anomalies highlight how vulnerable research is to potential problems with underlying numbers supplied by other sources, even when the source is the government. And they illustrate how tricky it can be to balance privacy with accuracy."

Submitted by Bill Peterson