By Julie Deeke, Tori Ellison
Information
We have developed and published online content for an undergraduate second course in data science. The course takes a practical approach, with two overarching themes: selecting an appropriate analysis technique and communicating the limitations of any technique and analysis. In this Beyond session, we will discuss a lesson on simulating sampling distributions for linear regression coefficients. The lesson highlights the need to select a technique appropriate for the available data and shows students the consequences of choosing an inappropriate one. It also exemplifies how we build core data science and statistical concepts rigorously without relying on calculus or advanced mathematics. Instead, students work through hands-on Python-based activities that develop their coding abilities while they explore a statistical quantity specific to this regression context: mean squared error. The content for the entire course is free and accessible to all students, and its sole prerequisite is one introductory data science course. Over the last year, we have taught approximately 600 undergraduate students using this website as our course resource. The course includes units on data wrangling, linear regression, feature selection (including cross-validation), logistic regression, simulating sampling distributions, and statistical inference.
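To give a flavor of the kind of activity described above, here is a minimal Python sketch of simulating a sampling distribution for a regression slope. It is an illustration under stated assumptions and not the lesson's actual code. The synthetic population, the function name `simulate_slope_distribution`, and all parameter values are hypothetical. Mean squared error is computed here as the average squared residual on each sample, which may differ from the definition used in the course.

```python
import numpy as np

def simulate_slope_distribution(x_pop, y_pop, sample_size, n_reps, seed=0):
    """Repeatedly sample from a population, fit a line, and record the results.

    Hypothetical helper for illustration only.
    Returns the fitted slopes and the in-sample mean squared errors.
    """
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_reps)
    mses = np.empty(n_reps)
    for i in range(n_reps):
        # Draw one random sample (without replacement) from the population.
        idx = rng.choice(len(x_pop), size=sample_size, replace=False)
        x, y = x_pop[idx], y_pop[idx]
        # np.polyfit with deg=1 returns [slope, intercept].
        slope, intercept = np.polyfit(x, y, deg=1)
        residuals = y - (intercept + slope * x)
        slopes[i] = slope
        # Assumption: MSE as the mean of squared residuals (dividing by n).
        mses[i] = np.mean(residuals ** 2)
    return slopes, mses

# Assumed synthetic population: y = 2 + 3x + noise, with noise sd = 2.
pop_rng = np.random.default_rng(42)
x_pop = pop_rng.uniform(0, 10, size=10_000)
y_pop = 2.0 + 3.0 * x_pop + pop_rng.normal(0, 2.0, size=10_000)

slopes, mses = simulate_slope_distribution(x_pop, y_pop, sample_size=50, n_reps=1000)

print(f"mean of simulated slopes: {slopes.mean():.3f}")
print(f"spread (sd) of simulated slopes: {slopes.std(ddof=1):.3f}")
print(f"average in-sample MSE: {mses.mean():.3f}")
```

A histogram of `slopes` approximates the sampling distribution of the slope, and its spread shows how much the estimate varies from sample to sample, with no calculus required.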