Statistics becomes interesting to non-methodologists only when taught in a research context that is relevant to them. Real data sets supplemented by sufficient background information provide just such a context. Despite this, many textbook authors and instructors of applied statistics rely on artificial data sets to illustrate statistical techniques. In this paper, we argue that artificial data sets should be eliminated from the curriculum and replaced with real data sets. Toward this end, we present the rationale for using real data sets and describe the characteristics that we have found make data sets particularly good for instructional use. Because real data sets can also present problems for instructors, we discuss the difficulties that we have encountered when using real data and some of our strategies for compensating for them. We conclude by presenting two authentic data sets and an annotated bibliography of dozens of primary and secondary data sources.
The CAUSE Research Group is supported in part by a member initiative grant from the American Statistical Association’s Section on Statistics and Data Science Education.