Mary Richardson
Department of Statistics
Grand Valley State University
1 Campus Drive
Allendale, MI 49401-9403
Statistics Teaching and Resource Library, March 17, 2003
© 2003 by Mary Richardson, all rights reserved. This text
may be freely shared among individuals, but it may not be
republished in any medium without express written consent from the
authors and advance notification of the editor.
Dawson (1995) presented a data set
giving a population at risk and fatalities for an “unusual
episode” (the sinking of the ocean liner Titanic) and discussed
the use of the data set in a first statistics course as an
elementary exercise in statistical thinking, the goal being to
deduce the origin of the data. Simonoff (1997) discussed the use
of this data set in a second statistics course to illustrate
logistic regression. Moore (2000) used an abbreviated form of the
data set in a chapter exercise on the chi-square test. This
article describes an activity that illustrates contingency table
(two-way table) analysis. Students use contingency tables to
analyze the “unusual episode” data (from Dawson 1995) and attempt
to use their analysis to deduce the origin of the data. The
activity is appropriate for use in an introductory college
statistics course or in a high school AP statistics
course.
Key words: contingency table (two-way table), conditional distribution
Objectives
After completing the activity,
students will understand:
- How to construct and interpret a contingency table.
- How to construct and interpret conditional distributions.
- The usefulness of contingency tables.
Materials and equipment
The activity can be completed
interactively or as homework. If the activity is to be completed
interactively, students work in groups of three to five. Each
student needs a copy of the activity.
Time involved
The estimated interactive
completion time is a one-hour class period.
Activity description
Prior to completing the activity,
students should be familiar with the basics of setting up
contingency tables.
To begin the interactive activity, the
background of the data is discussed. The sinking of the ocean
liner Titanic after colliding with an iceberg on April 15th, 1912
is referred to as an “unusual episode.” The initial data tables
(given in the Student’s Version of the activity) give counts for
the population at risk and the deaths for the people (passengers
and crew) aboard the Titanic. The 2201 people at risk are categorized by economic
status (I, II, III, or Other), age (child or adult), gender
(female or male), and survival status (survived or did not
survive). Economic status is determined based on the class in
which the passengers traveled: first-class (I), second-class (II),
third-class (III), or crew member (Other). The goal of completing
the activity is to determine the historical mortality episode that
produced the data.
Through using two-way tables to analyze
the data, students discover interesting characteristics of the
data that should help them to determine the nature of the “unusual
episode.” To complete the activity, students are asked to answer a
series of questions based on the data. Each question is intended
to highlight a different characteristic of the data.
Teacher notes
While working on the activity,
students are allowed to ask the instructor questions about the
origin of the data. One question that is commonly asked is “What
is the Other group?” The instructor might answer this question by
pointing out that there are no children in the Other group, only 3
female deaths, and this group does not completely fit into an economic
status characterization. Another question that is commonly asked
is “When did this 'unusual episode' occur?” The instructor might
choose to answer this question by giving the year of the sinking
of the Titanic (which will more than likely give away the answer)
or the instructor might simply say that the “unusual episode” is
not a recent event. Another point that the instructor might want
to make is that the “unusual episode” was an isolated incident and
that there were only 2201 people at risk (which might eliminate
erroneous guesses for which many thousands or even millions of
people were at risk).
Some interesting characteristics of
the data are:
- 68% of the people at risk died
- 92% of the people who died were
male
- The death rate was higher for the lower
economic status groups (especially among females)
- There were no children in the Other economic
status group, and only 3 of the 673 deaths in that group were females
- The only deaths of children were in
third class
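The characteristics listed above can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original activity. The cell counts are the ones commonly distributed for the Dawson (1995) data set and are an assumption here; they should be verified against the tables in the Student's Version. The names `DATA` and `totals` are illustrative only.

```python
# Counts as widely distributed for the Dawson (1995) data set. They are an
# assumption here and should be checked against the Student's Version.
# Key: (economic status, age, gender) -> (people at risk, deaths)
DATA = {
    ("I", "adult", "male"): (175, 118),
    ("I", "adult", "female"): (144, 4),
    ("I", "child", "male"): (5, 0),
    ("I", "child", "female"): (1, 0),
    ("II", "adult", "male"): (168, 154),
    ("II", "adult", "female"): (93, 13),
    ("II", "child", "male"): (11, 0),
    ("II", "child", "female"): (13, 0),
    ("III", "adult", "male"): (462, 387),
    ("III", "adult", "female"): (165, 89),
    ("III", "child", "male"): (48, 35),
    ("III", "child", "female"): (31, 17),
    ("Other", "adult", "male"): (862, 670),
    ("Other", "adult", "female"): (23, 3),
    ("Other", "child", "male"): (0, 0),
    ("Other", "child", "female"): (0, 0),
}
STATUSES = ("I", "II", "III", "Other")


def totals(keep=lambda status, age, gender: True):
    """Return (at risk, deaths) summed over the cells selected by keep."""
    cells = [counts for key, counts in DATA.items() if keep(*key)]
    return sum(n for n, _ in cells), sum(d for _, d in cells)


# Overall death rate: deaths divided by people at risk.
at_risk, deaths = totals()
overall_rate = deaths / at_risk

# Share of all deaths that were male (conditioning on having died).
male_deaths = totals(lambda s, a, g: g == "male")[1]
male_share_of_deaths = male_deaths / deaths

# Death rate within each economic status and gender.
rate_by_status_gender = {}
for status in STATUSES:
    for gender in ("female", "male"):
        n, d = totals(lambda s, a, g: s == status and g == gender)
        rate_by_status_gender[(status, gender)] = d / n

# Number of child deaths in each economic status group.
child_deaths_by_status = {
    status: totals(lambda s, a, g: s == status and a == "child")[1]
    for status in STATUSES
}

print(f"Overall death rate: {overall_rate:.0%}")
print(f"Male share of deaths: {male_share_of_deaths:.0%}")
for (status, gender), rate in rate_by_status_gender.items():
    print(f"Death rate, status {status}, {gender}: {rate:.1%}")
print("Child deaths by status:", child_deaths_by_status)
```

Under these assumed counts, the first two figures round to the 68% and 92% quoted in the list, and the female death rate rises steadily from status I to status III.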
In a typical class with several
groups, at least one of the groups will usually correctly guess
the origin of the data. Here are some example group responses to
the question: “What 'unusual episode' in history do you think this
data set describes?”
“This is probably the death stats
for the sinking of the Titanic since rich were put on the
lifeboats first and women and children took precedence over men.
The 'other' could/would be crew explaining why there would be no
children in that category.”
“We think this unusual
episode is the sinking of the Titanic. We believe this because
the ship did consist of men/women and children. The reason for
the women’s death being so low is due to the fact that they were
the first to be shipped on the safety of the other boats. We
also believe that economic status I, II, III and other is the
wealth distribution throughout the boat, I consisted of the
wealthy, II consisted of the middle class, III consisted of the
lower deck which had a hard time escaping because they were so
close to the bottom of the ship, and we believe that the other
class represents the workers on the ship. (They were the closest
to the bottom of the ship as well and last to get off the ship
as well.) This 'unusual episode' is the sinking of the Titanic,
and that is our educated guess.”
“The data set could be
explaining WWI. The rich could buy their way out of the war so
they wouldn’t have as many people at risk. Women would be found
in hospitals and other non-battle areas so they would be less at
risk. And, children would not be present for the most part of
the war.”
“We think the set describes the Civil War. Our
reasoning is because men fought in the wars and the Civil War is
when women started to be nurses for the Army. They were exposed
to the battlefield. The children that died could have been at
risk due to their age. If the child was near 18 years old they
would have gone to fight. If they were not 18 years old they
would be considered children still.”
“This unusual
episode data could be explaining heart failure. Look at the data
it shows that men at a lower economic status die of it. This
holds true for heart failure. More men die of heart failure than
women and children. Also the lower the economic status you are
the less treatment you are able to receive.”
“We
initially figured this data was describing the Black Plague,
which would describe the differences in deaths in the different
social classes. But this wouldn’t support the differences in
gender and age. Our best guess is that this data describes the
Nazi persecution of the Jews in the 30’s and early 40’s. Higher
educated men and women were likely considered either useful or
desirable and lower income children very undesirable or useful.
The gender differences are probably explained by men being
subjected to more harsh conditions because of physical work
ability.”
“We believe the unusual episode that is being
described is the sinking of the Titanic. First of all we see
that a high number of male adults perish, and a formidably
smaller amount of adult women and children perished. This would
support the 'women and children first' ideal of 1912. Based on
economic status we can see that a larger number of high-class
citizens (male and female alike) managed to survive. While the
highest numerical amount of deaths occurred in the lower two
classes. In fact, the only children that perished were lower
class ones. We also see by sheer number, there were more men,
more lower class citizens, and few children. All of these
factors would have been common place in travel (due to society,
immigration and other factors) during the era of the tragedy. In
general the total number of occupants seems similar to those
that would have been aboard, plus the high mortality rate (68%)
is common knowledge of the event.”
Through completing the activity,
students see an illustration of the usefulness of two-way tables
for summarizing two categorical variables. In addition,
constructing appropriate conditional distributions illustrates how
to informally use two-way tables to determine if two categorical
variables may be associated.
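A point that students often miss is that a two-way table has two sets of conditional distributions, and they answer different questions. The sketch below is a hedged illustration in Python rather than part of the original activity. It uses a gender by outcome table whose counts are collapsed from the commonly distributed version of the Dawson (1995) data (an assumption; check them against the Student's Version), and the function names are hypothetical.

```python
# Gender by outcome, collapsed over economic status and age.
# Counts are assumed from the widely distributed Dawson (1995) data.
TABLE = {
    "male": {"died": 1364, "survived": 367},
    "female": {"died": 126, "survived": 344},
}


def row_conditionals(table):
    """Distribution of the column variable within each row."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: n / row_total for col, n in cells.items()}
    return result


def column_conditionals(table):
    """Distribution of the row variable within each column."""
    columns = list(next(iter(table.values())))
    result = {}
    for col in columns:
        col_total = sum(table[row][col] for row in table)
        result[col] = {row: table[row][col] / col_total for row in table}
    return result


# "Of the males, what fraction died?" (outcome given gender)
outcome_given_gender = row_conditionals(TABLE)
# "Of those who died, what fraction were male?" (gender given outcome)
gender_given_outcome = column_conditionals(TABLE)

for gender, dist in outcome_given_gender.items():
    print(f"{gender}: {dist['died']:.1%} died")
print(f"Male share of deaths: {gender_given_outcome['died']['male']:.1%}")
```

If the two genders had similar distributions of outcome, the table would give no informal evidence of an association. Here the row conditionals differ sharply, which is what suggests that gender and survival are associated.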
After completion of the
activity, the instructor might lead a summary discussion. One
possible point for discussion is that, taken as a whole, the data
set is hard to interpret. There are many classifications, and raw
counts cannot be compared directly because the subgroup sizes are unequal. However,
by breaking down the data, focusing on two-way tables, and
calculating conditional percentages, more useful information can
be obtained. We can see that women had a much lower likelihood of
death than men, and the rich had a lower likelihood of death than
the poor (especially for women). At this point, students quite
often comment on the fact that the motion picture Titanic
(released in the 1990s) portrays the third-class passengers
(whose cabins were in the lower level of the ship) as being
prevented from moving to the top level of the ship after the
collision with the iceberg (although this portrayal has not been
confirmed historically).
A point of caution here is that
the activity involves a very informal analysis. In general,
collapsing an initial contingency table over variables without
examining associations between all of the variables at once leaves
open the possibility of Simpson’s paradox occurring (an association
that reverses direction once the table is collapsed). The
instructor should preface completion of the activity by telling
students that a more formal analysis of contingency table data
can be completed with more sophisticated statistical tools.
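The caution about collapsing can be made concrete with the same data. The following Python sketch is an illustration added here, not part of the original activity. It compares the third-class (III) and crew (Other) groups using counts collapsed over age from the commonly distributed version of the Dawson (1995) data, which is an assumption to be checked against the Student's Version.

```python
# (economic status, gender) -> (people at risk, deaths), collapsed over age.
# Counts are assumed from the widely distributed Dawson (1995) data.
COUNTS = {
    ("III", "male"): (510, 422),
    ("III", "female"): (196, 106),
    ("Other", "male"): (862, 670),
    ("Other", "female"): (23, 3),
}


def death_rate(status, gender=None):
    """Death rate for a status group, optionally restricted to one gender."""
    cells = [
        counts
        for (s, g), counts in COUNTS.items()
        if s == status and gender in (None, g)
    ]
    at_risk = sum(n for n, _ in cells)
    deaths = sum(d for _, d in cells)
    return deaths / at_risk


# Collapsed over gender: one death rate per economic status group.
collapsed = {s: death_rate(s) for s in ("III", "Other")}

# Kept separate by gender: one death rate per (status, gender) cell.
by_gender = {
    (s, g): death_rate(s, g)
    for s in ("III", "Other")
    for g in ("male", "female")
}

for status, rate in collapsed.items():
    print(f"Status {status}, all: {rate:.1%}")
for (status, gender), rate in by_gender.items():
    print(f"Status {status}, {gender}: {rate:.1%}")
```

Under these assumed counts, the Other group has a slightly higher death rate than third class when gender is ignored, yet a lower death rate among males and among females separately. The reversal arises because the Other group is almost entirely male, and males had a much higher death rate in every group.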
Assessment
Students should understand how to
construct and interpret a contingency table. In addition, students
should understand how to construct and interpret conditional
distributions.
The following test question can be used to
assess student understanding.
An insurance company has examined
a large number of claims resulting from low speed collisions of
vehicles and has classified the claims according to type of
vehicle and to whether the claim was for more than $10,000. The
data are shown below.
|              | Type of Vehicle             |
| Claim Amount | Car | Truck | Sport utility |
| >$10,000     | 147 | 120   | 270           |
| ≤$10,000     | 470 | 280   | 330           |
a. The company would like to
learn more about the relationship between claim amount and
type of vehicle. In particular, the company would like to
compare the claim amounts for each type of vehicle. What
conditional distributions should the company
compute?
b. Provide the conditional
distributions stated in part a.
c. Do you think there is an
association between the type of vehicle and the claim amount?
Explain.
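For instructors who want to check student answers, one reasonable reading of the question is that the company should compute the conditional distribution of claim amount within each type of vehicle. The Python sketch below works under that assumption, using the counts from the table in the question. The names are illustrative and the code is not part of the original assessment.

```python
# Claim counts by type of vehicle, taken from the table in the question.
CLAIMS = {
    "Car": {"> $10,000": 147, "<= $10,000": 470},
    "Truck": {"> $10,000": 120, "<= $10,000": 280},
    "Sport utility": {"> $10,000": 270, "<= $10,000": 330},
}


def conditional_distribution(counts):
    """Convert the counts for one vehicle type into proportions."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Distribution of claim amount, conditional on type of vehicle.
claim_given_vehicle = {
    vehicle: conditional_distribution(counts)
    for vehicle, counts in CLAIMS.items()
}

for vehicle, dist in claim_given_vehicle.items():
    shares = ", ".join(f"{level}: {p:.1%}" for level, p in dist.items())
    print(f"{vehicle}: {shares}")
```

If the share of claims above $10,000 differs noticeably across vehicle types, that is informal evidence of an association between type of vehicle and claim amount.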
References
Dawson, Robert J. M. (1995). The ‘Unusual Episode’ Data Revisited. Journal of Statistics Education [on-line] 3(3). (http://www.amstat.org/publications/jse/v3n3/datasets.dawson.html).
Moore, David S. (2000). The Basic Practice of Statistics, 2nd edition. New York: W. H. Freeman and Company.
Simonoff, Jeffrey S. (1997). The ‘Unusual Episode’ and a Second Statistics Course. Journal of Statistics Education [on-line] 5(1). (http://www.amstat.org/publications/jse/v5n1/simonoff.html).