Week 3: Data analysis

1 Introduction

In today’s practical, we will focus on data analysis. You will work with your group to analyse the data you collected in the previous lab. Cheatsheets are available for you to use, and your demonstrators are ready to help.

You may need to clean your data before you can analyse it.

Once you have finished with your own data, and if you have time, try analysing the additional datasets in Exercise 3.

2 Things to prepare

Make sure you have access to the data you collected in the previous lab. This can be obtained from your group’s Google Drive folder.
Bring your laptop with the necessary data analysis software (e.g., R, Jamovi) installed. See Workshop 01 materials for installation instructions.
Lecture 3a may be helpful for understanding the concepts we will be applying in this practical, as it covers model assumptions and diagnostics.

Important

You will be working in the same groups as last week. If you were not present last week, join the group closest to you and work with them.

3 Learning outcomes

By the end of this practical, you should be able to:

Fit a model to your data.
Check if the data meets the assumptions of the model.
Interpret the model output.
(Optional) Have working templates for fitting general linear models (GLMs) to data with different structures, such as comparing group means or modelling relationships between continuous variables.

What to submit at the end of the practical

Upload your plot and model output to Google Drive. We will spend some time discussing the results at the end of the practical.

4 Workshop

Today’s workshop is a hands-on preview of running models in both R and Jamovi.

5 Lab activity

Google Drive

When ready, work with your group to analyse the data you collected in the previous lab using a general linear model. Click on the link above to access the data that you have collected.

You will need to:

Fit your intended model to the data.
Check if the data meets the assumptions of the model.
Interpret the model output.

Of particular importance is the assessment of assumptions using residual plots, rather than formal tests of assumptions.

Your demonstrators are ready to help. We expect you to be able to select the appropriate model for your data, but you can consult our cheatsheets and your demonstrators if you are unsure.

If using Jamovi

Please install the GAMLj3 module, which lets you run general linear models in Jamovi with many additional options. To install it:

Select the Analyses tab.
Click on the Modules button and select jamovi library.
In the Available tab, search for GAMLj3 and click on the INSTALL button.

While you are at it, you may also want to install the flexplot module in Jamovi, which uses GLM fundamentals to automatically generate plots for you.

Your data is available to download on Canvas. Please check that you have all the data you need before proceeding. You also have access to the data collected by other groups should you wish to practice your data analysis skills.

Background

If you have followed study design principles, the data you have collected should be clean and ready for analysis. However, it is still a good idea to inspect your data further before proceeding with the analysis. This includes a check for:

Missing data
Outliers
Data entry errors
Assumptions of the statistical model, and whether the data meets them

Missing data, outliers and data entry errors

These are what we sometimes call systematic errors. They can be detected by looking at the data and checking for unusual values, or cross-checking methods within your group. Make sure to remove or correct these errors before proceeding with the analysis.
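
The checks above can be sketched in a few lines of code. This is only an illustration, written in Python for convenience (the same checks can be done in R or Jamovi). The records, the column names `group` and `height_cm`, and the plausible range of 0 to 100 are all made up for this example; substitute your own variables and limits.

```python
# Hypothetical records: None marks a missing value, 480.0 looks like a typo
# for 48.0, and "b" is an inconsistent label for group "B".
rows = [
    {"id": 1, "group": "A", "height_cm": 45.2},
    {"id": 2, "group": "A", "height_cm": None},
    {"id": 3, "group": "B", "height_cm": 480.0},
    {"id": 4, "group": "b", "height_cm": 51.7},
    {"id": 5, "group": "B", "height_cm": 49.9},
]

def find_missing(rows, column):
    """Return the ids of rows where `column` has no recorded value."""
    return [r["id"] for r in rows if r[column] is None]

def find_out_of_range(rows, column, low, high):
    """Return the ids of rows whose value falls outside a plausible range."""
    return [r["id"] for r in rows
            if r[column] is not None and not (low <= r[column] <= high)]

def find_unexpected_labels(rows, column, allowed):
    """Return the ids of rows whose label is not one of the allowed levels."""
    return [r["id"] for r in rows if r[column] not in allowed]

missing = find_missing(rows, "height_cm")
suspect = find_out_of_range(rows, "height_cm", low=0, high=100)
bad_labels = find_unexpected_labels(rows, "group", allowed={"A", "B"})

print("Missing values in rows:", missing)
print("Values outside plausible range in rows:", suspect)
print("Unexpected group labels in rows:", bad_labels)
```

A flagged row is a prompt to go back to your notes or your group, not an automatic reason to delete the observation.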

6 Exercise 2 – Data analysis

Fitting a model to data

Recall that you have an empirical model e.g. y ~ x. You will need to use an appropriate statistical model to fit this relationship to your data.
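
To make the idea of fitting y ~ x concrete, here is a minimal sketch of what the software does for the simplest case, a straight line fitted by ordinary least squares. It is written in Python purely for illustration, the data are invented, and `fit_line` is a helper defined here rather than a function from any package. In practice you would let R or Jamovi do this for you.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares estimates for y = intercept + slope * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    # Sum of squares of x and sum of cross-products of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Invented example data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```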

Note

You should already know what model you are fitting to your data as it is part of your study design!

Checking model assumptions

Before interpreting your results, it is important to check that your chosen model is suitable for your data. This means assessing whether the data meet the assumptions required by the model. If these assumptions are not met, your results may be misleading.

The main assumptions for general linear models are:

Normality of residuals: The differences between observed and predicted values (residuals) should be roughly normally distributed.
Homogeneity of variance: The spread of residuals should be similar across all levels of your predictors.
Independence of observations: Each data point should be independent of the others. This is assumed, so we do not need to test it explicitly.

The simplest way to check these assumptions is by examining residual plots. Formal statistical tests are also available, such as the Shapiro-Wilk test for normality and Levene’s test for homogeneity of variance, but they can be overly sensitive (especially with large datasets) and may flag departures that are too small to matter.
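
If you are curious what a residual plot is actually showing, the sketch below computes the two quantities it displays, fitted values and residuals, for a straight-line model. It is an illustration in Python with invented data whose scatter deliberately grows with x. The comparison of spread between the lower and upper halves of the fitted values is a rough numerical stand-in for eyeballing the plot, not a formal test, and the helper names are made up for this example.

```python
from statistics import fmean, pstdev

def fit_line(x, y):
    """Ordinary least squares estimates for y = intercept + slope * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fitted_and_residuals(x, y):
    """Return fitted values and residuals (observed minus fitted)."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid

# Invented data: the scatter around the line gets larger as x increases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.8, 11.1, 15.5, 14.4]

fitted, resid = fitted_and_residuals(x, y)

# Order residuals by fitted value, then compare spread in each half.
ordered = [r for _, r in sorted(zip(fitted, resid))]
half = len(ordered) // 2
spread_low = pstdev(ordered[:half])
spread_high = pstdev(ordered[half:])

for f, r in zip(fitted, resid):
    print(f"fitted = {f:6.2f}   residual = {r:6.2f}")
print(f"spread (low fitted values)  = {spread_low:.2f}")
print(f"spread (high fitted values) = {spread_high:.2f}")
```

A much larger spread at one end of the fitted values is the "fan" shape that suggests unequal variance in a residuals versus fitted plot.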

Remember, minor violations of normality are often not a major concern (can you explain why?), but large differences in variance can seriously affect your conclusions. Always discuss any issues and how you addressed them in your analysis.

Violation of the normality assumption is often tolerated when the sample size is large, because the Central Limit Theorem implies that the sampling distribution of the mean will be approximately normal regardless of the shape of the population distribution.
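
You can see the Central Limit Theorem at work with a small simulation. The sketch below, in Python, repeatedly draws samples from a strongly right-skewed (exponential) population and measures how skewed the resulting sample means are. The choice of population, the sample sizes of 2 and 50, the 2000 replicates and the random seed are all arbitrary choices made for this illustration.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Moment-based skewness: close to 0 for a symmetric distribution."""
    m = fmean(values)
    s = pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

def simulate_sample_means(n, replicates, rng):
    """Means of `replicates` samples of size n from an exponential population."""
    return [fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(replicates)]

rng = random.Random(1)  # fixed seed so the simulation is repeatable
means_small = simulate_sample_means(n=2, replicates=2000, rng=rng)
means_large = simulate_sample_means(n=50, replicates=2000, rng=rng)

skew_small = skewness(means_small)
skew_large = skewness(means_large)
print(f"skewness of sample means, n = 2:  {skew_small:.2f}")
print(f"skewness of sample means, n = 50: {skew_large:.2f}")
```

The means from the larger samples should be noticeably less skewed, even though the underlying population has not changed.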

Note

Your demonstrators may show you how to transform variables to better meet the assumptions of your chosen model.
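
As one example of why a transformation can help, the sketch below uses two invented groups in which the values of the second group are exactly ten times those of the first. On the raw scale the spreads differ greatly, but on the log scale they are the same. This is an idealised illustration in Python, not a recipe; whether a log (or any other) transformation suits your data is something to discuss with your demonstrator.

```python
from math import log
from statistics import stdev

# Invented data: group_b is group_a multiplied by 10.
group_a = [1.0, 2.0, 4.0]
group_b = [10.0, 20.0, 40.0]

# Ratio of standard deviations on the raw scale.
raw_ratio = stdev(group_b) / stdev(group_a)

# Ratio of standard deviations after a natural-log transformation.
log_a = [log(v) for v in group_a]
log_b = [log(v) for v in group_b]
log_ratio = stdev(log_b) / stdev(log_a)

print(f"SD ratio on raw scale: {raw_ratio:.2f}")
print(f"SD ratio on log scale: {log_ratio:.2f}")
```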

Running models

Running the model is probably the least eventful part of the analytical workflow and takes only a few seconds regardless of the software you are using, especially for general linear models. Make sure that you record the software used and the specific statistical technique selected so that the analysis is reproducible.

Interpretation

Interpreting the output of a statistical model is a skill that takes time to develop. You will need to be able to:

Understand the important parts of the output, e.g. the F-statistic, p-value and degrees of freedom.
Explain what the output means in the context of your data.
Explain what the output means in the context of your research question.

Your lectures and this week’s workshop will help you develop these skills.
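
To see where some of these numbers come from, the sketch below works through the sums of squares, degrees of freedom, F-statistic and R-squared for a straight-line model. It is an illustration in Python with invented data, and `regression_summary` is a helper written for this example. The p-value is left out because it requires the F distribution, which your software evaluates for you.

```python
from statistics import fmean

def regression_summary(x, y):
    """Sums of squares, degrees of freedom, F and R-squared for y ~ x."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    # Partition the total variation into model and residual parts.
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_model = ss_total - ss_resid

    df_model = 1        # one slope estimated
    df_resid = n - 2    # n observations minus intercept and slope
    f_statistic = (ss_model / df_model) / (ss_resid / df_resid)
    return {
        "ss_model": ss_model, "ss_resid": ss_resid, "ss_total": ss_total,
        "df_model": df_model, "df_resid": df_resid,
        "f_statistic": f_statistic, "r_squared": ss_model / ss_total,
    }

# Invented example data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

summary = regression_summary(x, y)
for name, value in summary.items():
    print(f"{name:12s} {value:.4f}")
```

A large F-statistic means the model explains much more variation per degree of freedom than is left in the residuals; what that means for your research question still depends on context.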

Your data may not be interpretable due to the way it was collected. This is ok – a lesson learnt. You can still discuss the issues with the data and what you would do differently next time. Meanwhile, you can still practice your data analysis skills on your data (or download another group’s data).

7 Exercise 3 – putting it all together

The general linear modelling framework is incredibly intuitive for building statistical models. Use the datasets below to repeat Exercises 1 and 2. Some hints:

Try to add more than one predictor to your model and see how the output changes.
Use your understanding of data structure to determine if you have set your model up correctly – i.e. is that variable really numeric?
The data may need to be cleaned before you can analyse it. This is a good opportunity to practice your data cleaning skills.
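
The hint about whether a variable is really numeric can be illustrated with a small invented example. Below, `site` is recorded as 1, 2 and 3, but these are labels for three different sites, not measurements. The Python sketch compares what happens when the codes are treated as numbers (a slope) versus as categories (group means). The data and variable names are made up for this illustration.

```python
from statistics import fmean

# Invented data: site codes are labels, and site 2 has the highest response.
site = [1, 1, 2, 2, 3, 3]
response = [4.0, 6.0, 8.0, 10.0, 4.0, 6.0]

# Treating site as numeric: least squares slope of response on site code.
x_bar, y_bar = fmean(site), fmean(response)
sxx = sum((s - x_bar) ** 2 for s in site)
sxy = sum((s - x_bar) * (r - y_bar) for s, r in zip(site, response))
slope = sxy / sxx

# Treating site as categorical: one mean per site.
groups = {}
for s, r in zip(site, response):
    groups.setdefault(s, []).append(r)
group_means = {s: fmean(values) for s, values in groups.items()}

print(f"slope when site is treated as numeric: {slope:.2f}")
print(f"means when site is treated as categorical: {group_means}")
```

Here the slope is zero, which would suggest no effect of site, even though the site means clearly differ. Setting the variable type correctly in your software avoids this trap.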

Note

On top of the simple models you have been taught so far (e.g., using a single continuous or categorical predictor), you will see how the GLM framework can be extended to include multiple predictors of different types. These will be touched on in the next few weeks but you can get a head start now – that’s the beauty of GLMs!

Weight data for domestic cats: cats.xlsx

Handy hints:

Read the metadata if you are unsure what the variables mean.
You can look at the relationships between the continuous variables for each sex of cat. Is there a way to combine these relationships into one model?
Can Sex be used as a response variable in this model? You don’t have to do this, but it is an interesting question to consider.

Cherry trees: cherry.xlsx

Handy hints:

This dataset seems very simple and should be easy to analyse… or is it?
Use what you have learnt about simple linear regression and add more predictors to your model!
Think about the assumptions of the model and whether they are met. Do you need to alter your interpretation of these diagnostic plots in any way?

(Extra challenge) effects of dietary protein on pigs: pigs.xlsx

Handy hints:

This dataset is messy and may need to be tidied up before you can analyse it.
Think of how a model with a two-level categorical predictor (which compares two groups) can be extended to a predictor with multiple levels to compare multiple groups. Does this change the way you need to set up your model?
Are you brave enough to look for interactions between variables?
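
For the hint about extending a two-level predictor to several levels, the sketch below shows one common way software represents a categorical predictor: indicator (dummy) columns, one for each level other than a chosen reference level. With this coding in a model containing only that predictor, the intercept is the reference group mean and each coefficient is a difference from it. The diet labels and weight gains are invented for this Python illustration and are not taken from pigs.xlsx.

```python
from statistics import fmean

def dummy_code(labels, reference):
    """Return the non-reference levels and one row of 0/1 indicators per label."""
    levels = sorted(set(labels) - {reference})
    rows = [[1 if lab == lev else 0 for lev in levels] for lab in labels]
    return levels, rows

def treatment_coefficients(labels, y, reference):
    """Intercept (reference mean) and each level's difference from it."""
    groups = {}
    for lab, yi in zip(labels, y):
        groups.setdefault(lab, []).append(yi)
    means = {lab: fmean(values) for lab, values in groups.items()}
    intercept = means[reference]
    effects = {lab: m - intercept for lab, m in means.items() if lab != reference}
    return intercept, effects

# Invented data: three hypothetical protein levels.
diet = ["low", "low", "medium", "medium", "high", "high"]
gain = [10.0, 12.0, 14.0, 16.0, 19.0, 21.0]

levels, design = dummy_code(diet, reference="low")
intercept, effects = treatment_coefficients(diet, gain, reference="low")

print("indicator columns:", levels)
for label, row in zip(diet, design):
    print(f"{label:7s} {row}")
print(f"intercept (mean of 'low') = {intercept}")
print(f"differences from 'low'    = {effects}")
```

A two-level predictor is the special case with a single indicator column, so the model set-up is the same; only the number of coefficients changes.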

(Extra challenge) Sugar cane disease data: cane.xlsx

Handy hints:

This dataset is a bit more complex than the others. You may need to spend some time understanding the variables before you can analyse the data.
You do not need to use all the variables in your analysis. There are two possible response variables in this dataset - you can choose which one to use.

8 Submit your results

Once complete, please upload your results to Google Drive for archival. If there’s time, we will discuss the results together.

9 End

That’s it. Module 1 practicals are now complete. I hope you have learnt a few things about experimental design and data analysis. We will continue to build on these skills in the next module using real-world data.