Tidy correlation tests in R

Statistics

R:Advanced

Correlation

Tests

Author

Dominic Royé

Published

April 17, 2019

Modified

December 30, 2024

When we try to estimate the correlation coefficient between multiple variables, the task is more complicated in order to obtain a simple and tidy result. A simple solution is to use the tidy() function from the broom package. In this post we are going to estimate the correlation coefficients between the annual precipitation of several Spanish cities and climate teleconnections indices: download. The data of the teleconnections are preprocessed, but can be downloaded directly from crudata.uea.ac.uk. The daily precipitation data comes from ECA&D.

Packages

In this post we will use the following packages:

Package	Description
tidyverse	Collection of packages (visualization, manipulation): ggplot2, dplyr, purrr, etc.
broom	Convert results of statistical functions (lm, t.test, cor.test, etc.) into tidy tables
fs	Provides a cross-platform, uniform interface to file system operations

# install the packages if necessary

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("broom")) install.packages("broom")
if (!require("fs")) install.packages("fs")


# load packages
library(tidyverse)
library(broom)
library(fs)

Import data

First we have to import the daily precipitation of the selected weather stations.

Create a vector with all precipitation files using the function dir_ls() of the fs package.
Import the data using the map() function of the purrr package that applies another function to a vector or list, and then we can join them together in a single data.frame with list_rbind().
Select the columns that interest us, b) Convert the date string into a date object using the ymd() function of the lubridate package, c) Create a new column yr with the years, d) Divide the precipitation values by 10 and reclassify absent values -9999 by NA, e) Finally, reclassify the ID of each weather station creating a factor with new labels.

More details about the use of the dir_ls() and list_rbind() functions can be found in this previous post.

# precipitation files
files <- dir_ls(regexp = "txt")
files

# import all files and join them together
pr <- files |> map(read_csv, skip = 20) |> list_rbind()
pr

# create levels for the factor
id <- unique(pr$STAID)

# the corresponding labels
lab <- c("Bilbao", "Santiago", "Barcelona", "Madrid", "Valencia")

# first changes
pr <- select(pr, STAID, DATE, RR) |>
  mutate(
    DATE = ymd(DATE),
    RR = ifelse(RR == -9999, NA, RR / 10),
    STAID = factor(STAID, id, lab),
    yr = year(DATE)
  )
pr

We still need to filter and calculate the annual amount of precipitation. Actually, it is not correct to sum up precipitation without taking into account that there are missing values, but it should be enough for this practice. Then, we change the table format with the spread() function, passing from a long to a wide table, that is, we want to obtain one column per weather station.

pr_yr <- filter(pr, DATE >= "1950-01-01", 
                DATE < "2018-01-01") |>
          group_by(STAID, yr) |>
          summarise(pr = sum(RR, na.rm = TRUE))
pr_yr

pr_yr <- spread(pr_yr, STAID, pr)
pr_yr

The next step is to import the climate teleconnection indices.

# teleconnections
telecon <- read_csv("teleconnections_indices.csv")
telecon

Finally we need to join both tables by year.

data_all <- left_join(pr_yr, telecon, by = "yr")
data_all

Correlation test

A correlation test between paired samples can be done with the cor.test() function of R Base. In this case between the annual precipitation of Bilbao and the NAO index.

cor_nao_bil <- cor.test(data_all$Bilbao, 
                        data_all$NAO,
                        method = "spearman"
                )
cor_nao_bil

str(cor_nao_bil)

We see that the result is in an unmanageable and untidy format. It is a console summary of the correlation with all the statistical parameters necessary to get a conclusion about the relationship. The orginal structure is a list of vectors. However, the tidy() function of the broom package allows us to convert the result into a table format.

tidy(cor_nao_bil)

Apply the correlation test to multiple variables

The objective is to apply the correlation test to all weather stations and climate teleconnection indices.

First, we must pass the table to the long format, that is, create a column/variable for the city and for the value of the corresponding precipitation. Then we repeat the same for the teleconnections indices.

data <- pivot_longer(data_all, Bilbao:Valencia, names_to = "city", values_to = "pr") |>
            pivot_longer(NAO:AO, names_to = "telecon", values_to = "index")
data

To apply the test to all cities, we need the corresponding groupings. Therefore, we use the group_by() function for indicating the two groups: city and telecon. In addition, we apply the nest() function of the tidyr package (tidyverse collection) with the aim of creating lists of tables nested per row. In other words, in each row of each city and teleconnection index we will have a new table that contains the year, the precipitation value and the value of each teleconection, correspondingly.

data_nest <- group_by(data, city, telecon) |> nest()
head(data_nest)

str(head(slice(data_nest, 1)))

The next step is to create a function, in which we define the correlation test and pass it to the clean format using the tidy() function, which we apply to each groupings.

cor_fun <- function(df) cor.test(df$pr, df$index, method = "spearman") |> tidy()

Now we only have to apply our function to the column that contains the tables for each combination between city and teleconnection. To do this, we use the map() function that applies another function to a vector or list. What we do is create a new column that contains the result, a statistical summary table, for each combination.

data_nest <- mutate(data_nest, model = map(data, cor_fun))
head(data_nest)

str(head(slice(data_nest, 1)))

How can we undo the list of tables in each row of our table?

First we eliminate the column with the data and then simply we can apply the unnest() function.

corr_pr <- select(data_nest, -data) |> unnest()
corr_pr

The result is a table in which we can see the correlations and their statistical significance for each city and teleconnection index.

Heatmap of the results

Finally, we make a heatmap of the obtained result. But, previously we create a column that indicates whether the correlation is significant with p-value less than 0.05.

corr_pr <- mutate(corr_pr, sig = ifelse(p.value < 0.05, "Sig.", "Non Sig."))

ggplot() +
  geom_tile(
    data = corr_pr,
    aes(city, telecon, fill = estimate),
    linewidth = 1,
    colour = "white"
  ) +
  geom_tile(
    data = filter(corr_pr, sig == "Sig."),
    aes(city, telecon),
    linewidth = 1,
    colour = "black",
    fill = NA
  ) +
  geom_text(
    data = corr_pr,
    aes(city, telecon,
      label = round(estimate, 2),
      fontface = ifelse(sig == "Sig.", "bold", "plain")
    )
  ) +
  scale_fill_gradient2(breaks = seq(-1, 1, 0.2)) +
  labs(x = NULL, y = NULL, fill = NULL) +
  coord_cartesian(expand = F) +
  theme_void() +
  theme(axis.text = element_text())

Reuse

CC BY-SA 4.0

Citation

For attribution, please cite this work as:

Royé, Dominic. 2019. “Tidy Correlation Tests in R.” April 17, 2019. https://dominicroye.github.io/blog/2019-04-17-tidy-correlation-tests-in-r/.