"Visualizing Big Data in the Introductory Course"
David Kahle, Baylor University
Abstract
One of the major challenges in modern computational science is big data. While there are several aspects to the problem, one less-discussed aspect is how to teach big data challenges for the first time. A common core component of the introductory statistics course is graphical descriptive statistics - the visual display of information as a means of organizing and summarizing data - for both the preliminary analysis of data and the communication of results. In this presentation we discuss how instructors might motivate big data challenges in the introductory course through the visualization of one- and two-dimensional variables, and beyond, for continuous and discrete variables.
Materials
- Download slides (PDF)
Comments
Nicholas Horton:
We've moved *far* beyond stem and leaf plots here. This is exactly what we want to have students see early and often. Kudos to David Kahler for an accessible overview of how to create meaningful and accessible graphical displays for "medium" data. @askdrstats
Nicholas Horton:
Oops: apologies to David Kahle for the mangling of his name.
Homer White:
I wonder if you could say a bit about how you assess students' understanding of sophisticated plots that are made to address difficult features (more dimensions, overplotting, etc.) specific to a dataset. Do you have them interpret and discuss plots that are produced for them, or do they learn to produce some of these plots themselves? In an introductory course I have difficulty finding time to teach students to produce anything beyond a few "named" plots (histogram, density plot, bar graph, scatterplot, etc.).
Homer White:
I can't stop thinking about that last plot -- the pd-graphics embedding. Can you post somewhere the R code you used to produce it? It would be very cool to have a function to produce plots like it.
David Kahle:
Sure! How about here?
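# Bin carat into size classes; these become the facet rows
diamonds$size <- cut(diamonds$carat, c(0, .50, 1.0, 1.5, 2.0, 2.5, 3.0, 10))
# Boxplots of price by clarity, filled by cut, one row of panels per size class
qplot(clarity, price, data = subset(diamonds, .5 < carat & carat < 2.5),
      geom = "boxplot", fill = cut, notch = FALSE) +
  facet_grid(size ~ .)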
Homer White:
ggplot2 is amazing.
David Kahle:
Oh, and you'll need this as well, first:
library(ggplot2); theme_set(theme_bw())
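For readers on recent ggplot2 releases, where qplot() is deprecated, the snippet posted above can be sketched with ggplot() instead. This is only an illustrative rewrite, not the code used in the talk. It assumes the diamonds dataset that ships with ggplot2, and the names d, d_mid, and p are placeholders introduced here.

```r
library(ggplot2)
theme_set(theme_bw())

# Work on a copy so the packaged dataset is left untouched (d is a placeholder name).
d <- ggplot2::diamonds

# Bin carat into size classes; cut() makes right-closed intervals such as (0.5,1].
d$size <- cut(d$carat, c(0, .50, 1.0, 1.5, 2.0, 2.5, 3.0, 10))

# Keep stones strictly between 0.5 and 2.5 carats, as in the posted subset() call.
d_mid <- subset(d, .5 < carat & carat < 2.5)

# Price by clarity, boxes filled by the cut column, one facet row per size class.
p <- ggplot(d_mid, aes(x = clarity, y = price, fill = cut)) +
  geom_boxplot(notch = FALSE) +
  facet_grid(size ~ .)

print(p)
```

Inside aes(), cut refers to the cut column of the data rather than the base R function, which is why the same name can appear in both roles.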
Tulia E Rivera-Florez:
Even for medium-size samples, loading the whole data set and efficient data access are important in exploratory analysis. Is R the best option?
David Kahle:
In terms of computations, I think most programs, including R, would be fine for any dataset practical for the intro course (you'd also be limited by the students' hardware). Perhaps a more important consideration would be the interface: R is natively not a drop-down menu kind of application, which is bad for most intro courses. If programs like JMP or Tableau are available, maybe they'd be preferable. On the other hand, R has the advantage of being free and cross-platform; with a third-party GUI it might be the app of choice.