Data Cleaning and Data Analysis: Sharing Data with CASE
What you should deliver to CASE
Data cleaning can be a time-consuming endeavor. CASE provides data cleaning services for clients who do not have access to those resources elsewhere. Cleaning data is the first step toward ensuring that any subsequent analyses are transparent and reproducible. For clients who want to clean their own data and contract with CASE only for data analysis services, we recommend following steps similar to those described here and providing the items below to the CASE analysts.
1. The raw data
2. A tidy data set
3. A codebook describing each variable and its values in the tidy data set
4. An explicit and exact recipe (the data cleaning script) that transformed 1 into 2 and 3
1. The raw data:
In some instances, CASE analysts may need to refer back to the raw file to obtain information. Often the raw data will be contained in an unformatted Excel file with multiple worksheets, a .csv file, or a tab-delimited file. However, raw data may also arrive in less common formats, such as a binary or JSON file.
Raw data has had no software run on it, has had no values manipulated in any way, is complete (nothing has been removed or deleted), and contains no summaries.
2. The tidy dataset:
The tidy data set should have each variable in one (and only one) column. Only the variables of interest should be included in the tidy data set. That is, if the raw data has 154 variables but you are only interested in 4 of them, the tidy dataset will include 4 variables. Each unique observation should be in its own row.
If your tidy dataset is an Excel file, each table should be in its own file. Excel files should not contain multiple worksheets, macros should not be applied to the data, and nothing should be highlighted. CASE prefers that tidy datasets be shared as .csv or tab-delimited .txt files rather than as Excel files or software-specific data files (e.g., .dta, .sas7bdat, .sav).
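As a minimal sketch of the structure described above, the following Python snippet keeps only the variables of interest and writes them out as comma-separated text, with one variable per column and one observation per row. The column names (id, age, sbp, notes) and the helpers make_tidy and write_csv are hypothetical and serve only as illustration; they are not part of any CASE tooling.

```python
import csv
import io


def make_tidy(raw_rows, keep):
    """Return rows that contain only the variables listed in `keep`."""
    return [{name: row[name] for name in keep} for row in raw_rows]


def write_csv(rows, fieldnames, handle):
    """Write rows to an open text handle as comma-separated values."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical raw data: one dict per observation, including a variable
# ("notes") that is not of interest for the analysis.
raw_rows = [
    {"id": "1", "age": "34", "sbp": "118", "notes": "ok"},
    {"id": "2", "age": "51", "sbp": "135", "notes": "recheck"},
]

keep = ["id", "age", "sbp"]          # the variables of interest
tidy_rows = make_tidy(raw_rows, keep)

# An in-memory buffer stands in for open("tidy.csv", "w", newline="").
buffer = io.StringIO()
write_csv(tidy_rows, keep, buffer)
tidy_csv = buffer.getvalue()
print(tidy_csv)
```

The raw rows are left unmodified; the tidy rows are a new object, which mirrors the recommendation to deliver the raw data and the tidy data set as separate files.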
3. The codebook:
Variables require more descriptive information than can be provided in a data set. The codebook contains this additional information, particularly:
- Information about the data structure (cross-sectional, wide, long, cross-sectional time series, etc.).
- Information about the variables, including the units of measurement.
- Information about the summary choices made, particularly if the tidy data set includes scale scores.
- Information about the study design and/or instrument(s) used to collect the data.
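One possible way to start a codebook is to generate a skeleton with one entry per variable in the tidy file and then complete the descriptive fields by hand. The sketch below assumes a tidy .csv file in which empty cells mark missing values; the function name codebook_skeleton and the entry fields are hypothetical choices, not a CASE requirement.

```python
import csv
import io


def codebook_skeleton(handle):
    """Build one codebook entry per variable in a tidy CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    entries = []
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        present = [value for value in values if value != ""]
        entries.append({
            "variable": name,
            "n_missing": len(values) - len(present),
            "n_unique": len(set(present)),
            "units": "",        # to be filled in by hand
            "description": "",  # to be filled in by hand
        })
    return entries


# Hypothetical tidy data; the second observation has no recorded age.
tidy_csv = "id,age,sbp\n1,34,118\n2,,135\n3,51,135\n"
entries = codebook_skeleton(io.StringIO(tidy_csv))
for entry in entries:
    print(entry)
```

The generated counts are only a starting point. Units of measurement, summary choices, and study design information still have to be written by someone who knows the data.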
4. Data cleaning script:
A data cleaning script is a series of commands that transforms the raw data (1) into the tidy data set (2). The script may be written in R, Python, SAS, Stata, SPSS, etc. At a minimum, script files should state the operating system on which the software was run (Mac/Windows/Linux), the version of the software, and the author(s) of the script. Other desirable features include sufficient comments, generic directory paths, and logical indentation.
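The minimum header information could be recorded along the following lines. This is a sketch for a script written in Python; the author name, the directory layout, and the function script_header are hypothetical placeholders. Paths are written relative to the project directory so that the script does not depend on one person's machine.

```python
import platform
from pathlib import Path

# Hypothetical author list; replace with the actual author(s) of the script.
AUTHORS = ["A. Analyst"]

# Generic directory paths: relative to the project directory rather than
# hard-coded absolute locations. The layout shown here is an assumption.
PROJECT_DIR = Path(".")
RAW_FILE = PROJECT_DIR / "data" / "raw" / "raw_data.csv"
TIDY_FILE = PROJECT_DIR / "data" / "tidy" / "tidy_data.csv"


def script_header(authors):
    """Describe who wrote the script and where it was run."""
    return "\n".join([
        "Authors: " + ", ".join(authors),
        "Operating system: " + platform.system() + " " + platform.release(),
        "Software: Python " + platform.python_version(),
    ])


header = script_header(AUTHORS)
print(header)
```

Printing the header at the top of the log means the operating system and software version are captured each time the script is run, rather than relying on a comment that may go out of date.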
What you should expect from CASE
Providing CASE with a properly tidied data set will decrease the workload of the analyst, which should shorten the time it takes to receive your results. However, to ensure quality analytical decisions, the analyst will check your data cleaning script, ask questions about it, and confirm that running it on the raw data reproduces the tidy data set you delivered.
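One simple way to check that a regenerated tidy file matches the delivered one is to compare file checksums. The sketch below illustrates the idea and is not a description of the procedure CASE analysts actually follow; the file names and helper functions are hypothetical. A byte-for-byte comparison is strict, so harmless differences such as line endings would also be reported as a mismatch.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True when both files have exactly the same bytes."""
    return sha256_of(path_a) == sha256_of(path_b)


# Small temporary files stand in for the delivered tidy data set, the
# output of re-running the cleaning script, and a file that differs.
with tempfile.TemporaryDirectory() as tmp:
    delivered = Path(tmp) / "tidy_delivered.csv"
    regenerated = Path(tmp) / "tidy_regenerated.csv"
    altered = Path(tmp) / "tidy_altered.csv"
    delivered.write_text("id,age\n1,34\n", encoding="utf-8")
    regenerated.write_text("id,age\n1,34\n", encoding="utf-8")
    altered.write_text("id,age\n1,35\n", encoding="utf-8")

    matches_regenerated = files_identical(delivered, regenerated)
    matches_altered = files_identical(delivered, altered)

print(matches_regenerated, matches_altered)
```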
Upon receipt of your results, you should expect the following items:
- A fully reproducible analysis script that performs each of the analyses requested.
- Log/output files of all the commands and results.
- Additional files (.eps or .pdf) of any figures generated from the analysis.
This document is built upon the guide “How to share data with a statistician.”