Workshop 2: tidy data

Author

Januar Harianto

1 Example data

Download the following files for this workshop.

lotr.xlsx | ferret.xlsx | fish_station.csv

2 Data entry

Data is the reason why we can do science. We tend to invest a lot of time and effort in collecting data – in fact it forms the bulk of our work.

In this workshop we will quickly go through the process of data entry and work on a few examples together.

3 Tidy data

  • The tidy data concept was popularised by Hadley Wickham in his paper in the Journal of Statistical Software in 2014 – Tidy Data.
  • Not a new concept – tidy data is basically data in “long” format as opposed to “wide” format. Excel users tend to prefer the wide format but it is not ideal for data analysis.
  • Both formats contain the exact same information, so there is no loss of data. However, the long format is more suitable for data analysis and visualisation and is the preferred format for most statistical software.

Recording data in long format will save you time and effort in the long run.

Sex BW
F 2.15
M 2.55
F 2.95
F 2.70
M 2.20
F 1.85
M 2.55
M 2.60

 

F M
2.15 2.55
2.95 2.20
2.70 2.55
1.85 2.60
Figure 1: Data should be in long format (left) where each row is an observation and each column is a variable. This is the preferred format for most statistical software. Wide format (right) is also common, but may require additional steps to analyse or visualise in some instances.

4 Three rules

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

The above rules mean that you need a clear understanding of what a variable, observation and value are.

5 How to tidy data?

Ideally, you would want to keep raw data intact and document the steps you took to tidy the data. This is important for reproducibility and transparency. Do it in Excel, R, or any other software you are comfortable with – the key is good record keeping.

6 Important considerations

  1. Each column should only contain one variable, and the variable should not contain mixed data types (e.g. numeric and character).
  2. When naming columns, avoid spaces and special characters. Use underscores or periods to separate words.
  3. Don’t use formatting, and avoid merged cells.
  4. Understand the difference between missing data and zero. It is better to leave cells as blank if the data is missing, or use NA.
  5. Include brief metadata directly in the data table as a separate column. For extensive metadata, it is recommended to document it in a separate file or worksheet.