BOF-T07: Resources that enable far more efficient and accurate data preparation using R software. Discussion of the key features they might usefully contain.


Hilary Watt and Dr. Mintu Nath


Information

Background: Preparing data for analysis is often very time-consuming, especially in R. Yet R is a popular choice of statistical software, in part because it is free. Many users are hampered by gaps in their knowledge, which they generally try to fill through internet searches. Yet many do not know what tasks are required, what types of function are available, or what search terms to use. Many do not know what preparation is needed before functions for merging and reshaping data can be used effectively. One supervisor reported that it took a few supervisory inputs before the reason for their students’ failure to merge became clear. The issue (no ID variable had been specified) was too obvious for the supervisor to think of checking, yet understandable given how overwhelmed students can feel.

Accuracy: R does not include the checks, balances and warnings against major errors that commercial software often provides. There is a substantial risk of errors being introduced during data manipulation. Furthermore, online resources that teach functions do not generally warn against common errors, such as accidentally multiplying up observations when merging. Students may not be aware of standard file management strategies, such as keeping a tidy R script that prepares the data for analysis and retaining the dataset as initially received. With both in place, any version of the dataset can be recreated, so students can confidently overwrite previous versions. Many students otherwise keep multiple copies of their data, and confess to being confused over which is most appropriate for which analysis. Supervisors may omit to teach these strategies, perhaps because they feel too obvious.
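
As an illustration of how observations can be accidentally multiplied up when merging, the following is a minimal sketch in base R. The data frames `patients` and `clinics` and their columns are hypothetical, invented for this example, and the check shown is one possible approach rather than the method used in the resource described here.

```r
# Hypothetical data: one row per patient, and a lookup table in which
# id 2 has been entered twice by mistake.
patients <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
clinics  <- data.frame(id = c(1, 2, 2, 3),
                       clinic = c("A", "B", "C", "A"))

# Before merging: list any IDs that occur more than once in the lookup table.
dup_ids <- unique(clinics$id[duplicated(clinics$id)])

# Merge on an explicitly named ID variable, keeping every patient.
merged <- merge(patients, clinics, by = "id", all.x = TRUE)

# After merging: attaching a lookup table should not add rows.
rows_added <- nrow(merged) - nrow(patients)
if (rows_added > 0) {
  message(rows_added, " extra row(s) created by the merge; duplicated id(s): ",
          paste(dup_ids, collapse = ", "))
}
```

Comparing the row count before and after the merge is a cheap check; the duplicated IDs then show where to look in the lookup table.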

Remedy: Resources were prepared that demonstrate a sensible workflow, hence indicating what tasks are required and in what order. Attention is given to providing easily modifiable versions of commands, with appropriate instructions for using them. A few accuracy checks are included, especially around the merge command. Detailed instructions and checklists for data preparation are provided, especially for the merge and reshape commands. The starting point was a translation of resources from Stata, where it is much clearer which commands are needed for most preparation tasks (whereas the possibilities within R are endless, making it far harder to determine a core list of crucial commands).
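
To give a flavour of what an accuracy check around a reshape might look like, here is a short sketch using base R's `reshape()`. The data frame `wide` and its column names are hypothetical, and the checks are illustrative assumptions, not taken from the resource itself.

```r
# Hypothetical wide data: one row per person, one column per visit.
wide <- data.frame(id   = c(1, 2, 3),
                   bp_1 = c(120, 135, 128),
                   bp_2 = c(118, 130, 125))

# Reshape to long format: one row per person per visit.
long <- reshape(wide,
                direction = "long",
                varying   = c("bp_1", "bp_2"),
                v.names   = "bp",
                timevar   = "visit",
                times     = c(1, 2),
                idvar     = "id")

# Accuracy checks: stop with an error if the reshape did something unexpected.
stopifnot(nrow(long) == nrow(wide) * 2)                       # two visits each
stopifnot(!any(duplicated(long[, c("id", "visit")])))         # unique id-visit pairs
stopifnot(sum(long$bp) == sum(wide$bp_1) + sum(wide$bp_2))    # no values lost
```

Each `stopifnot()` line states one expectation about the result, so a failure points directly at what went wrong.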

Freely available online resource: https://bookdown.org/hcwatt99/Data_Wrangling_Recipes_in_R/ by Hilary Watt.

Feedback with permission to circulate: “Exceptional... some of the commands and tricks… could save me a few days of work each” (PhD student). “Really helpful practical tips for those annoying things that often aren’t covered in other courses” (post-doc). “Fabulous. really practical guidance. makes the hard work of restructuring and preparing data significantly easier … I haven’t seen this focused content presented in one place in such an easy to access form before.” (Lead of a module where this material was relevant.) From a professor who met the author once at a statistics teaching meeting, in a public LinkedIn recommendation: “The book is a great companion to any Statistics course that uses R and RStudio. In my own teaching…, students complete weekly data analysis exercises... Because the term time is limited, we never have sufficient time to teach how to prepare, code, clean, restructure and otherwise manipulate the data before you submit it to the actual analysis. Now I have a book to recommend to students, which is sufficiently detailed to address pretty much every query they may have. Excellent!”

Planned extensions: Comprehensive checks with extensive debugging strategies will be added in future versions. Guidance on file management strategies will also be included in later versions.

For Discussion: What key features should be included in resources to make data preparation efficient and accurate?

Authors: Hilary C Watt (Imperial College London, UK), Mark Cunningham (Imperial College London, UK), Mintu Nath (University of Aberdeen, UK).