# install the packages if necessary
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("broom")) install.packages("broom")
if (!require("fs")) install.packages("fs")
# load packages
library(tidyverse)
library(broom)
library(fs)Tidy correlation tests in R
When we try to estimate the correlation coefficient between multiple variables, the task is more complicated in order to obtain a simple and tidy result. A simple solution is to use the tidy() function from the broom package. In this post we are going to estimate the correlation coefficients between the annual precipitation of several Spanish cities and climate teleconnections indices: download. The data of the teleconnections are preprocessed, but can be downloaded directly from crudata.uea.ac.uk. The daily precipitation data comes from ECA&D.
Packages
In this post we will use the following packages:
| Package | Description |
|---|---|
| tidyverse | Collection of packages (visualization, manipulation): ggplot2, dplyr, purrr, etc. |
| broom | Convert results of statistical functions (lm, t.test, cor.test, etc.) into tidy tables |
| fs | Provides a cross-platform, uniform interface to file system operations |
Import data
First we have to import the daily precipitation of the selected weather stations.
- Create a vector with all precipitation files using the function
dir_ls()of the fs package. - Import the data using the
map()function of thepurrrpackage that applies another function to a vector or list, and then we can join them together in a singledata.framewithlist_rbind(). - Select the columns that interest us, b) Convert the date string into a date object using the
ymd()function of thelubridatepackage, c) Create a new column yr with the years, d) Divide the precipitation values by 10 and reclassify absent values-9999byNA, e) Finally, reclassify the ID of each weather station creating a factor with new labels.
More details about the use of the dir_ls() and list_rbind() functions can be found in this previous post.
# precipitation files
files <- dir_ls(regexp = "txt")
files# import all files and join them together
pr <- files |> map(read_csv, skip = 20) |> list_rbind()
pr# create levels for the factor
id <- unique(pr$STAID)
# the corresponding labels
lab <- c("Bilbao", "Santiago", "Barcelona", "Madrid", "Valencia")
# first changes
pr <- select(pr, STAID, DATE, RR) |>
mutate(
DATE = ymd(DATE),
RR = ifelse(RR == -9999, NA, RR / 10),
STAID = factor(STAID, id, lab),
yr = year(DATE)
)
prWe still need to filter and calculate the annual amount of precipitation. Actually, it is not correct to sum up precipitation without taking into account that there are missing values, but it should be enough for this practice. Then, we change the table format with the spread() function, passing from a long to a wide table, that is, we want to obtain one column per weather station.
pr_yr <- spread(pr_yr, STAID, pr)
pr_yrThe next step is to import the climate teleconnection indices.
# teleconnections
telecon <- read_csv("teleconnections_indices.csv")
teleconFinally we need to join both tables by year.
data_all <- left_join(pr_yr, telecon, by = "yr")
data_allCorrelation test
A correlation test between paired samples can be done with the cor.test() function of R Base. In this case between the annual precipitation of Bilbao and the NAO index.
cor_nao_bil <- cor.test(data_all$Bilbao,
data_all$NAO,
method = "spearman"
)
cor_nao_bilstr(cor_nao_bil)We see that the result is in an unmanageable and untidy format. It is a console summary of the correlation with all the statistical parameters necessary to get a conclusion about the relationship. The orginal structure is a list of vectors. However, the tidy() function of the broom package allows us to convert the result into a table format.
tidy(cor_nao_bil)Apply the correlation test to multiple variables
The objective is to apply the correlation test to all weather stations and climate teleconnection indices.
First, we must pass the table to the long format, that is, create a column/variable for the city and for the value of the corresponding precipitation. Then we repeat the same for the teleconnections indices.
data <- pivot_longer(data_all, Bilbao:Valencia, names_to = "city", values_to = "pr") |>
pivot_longer(NAO:AO, names_to = "telecon", values_to = "index")
dataTo apply the test to all cities, we need the corresponding groupings. Therefore, we use the group_by() function for indicating the two groups: city and telecon. In addition, we apply the nest() function of the tidyr package (tidyverse collection) with the aim of creating lists of tables nested per row. In other words, in each row of each city and teleconnection index we will have a new table that contains the year, the precipitation value and the value of each teleconection, correspondingly.
The next step is to create a function, in which we define the correlation test and pass it to the clean format using the tidy() function, which we apply to each groupings.
Now we only have to apply our function to the column that contains the tables for each combination between city and teleconnection. To do this, we use the map() function that applies another function to a vector or list. What we do is create a new column that contains the result, a statistical summary table, for each combination.
How can we undo the list of tables in each row of our table?
First we eliminate the column with the data and then simply we can apply the unnest() function.
The result is a table in which we can see the correlations and their statistical significance for each city and teleconnection index.
Heatmap of the results
Finally, we make a heatmap of the obtained result. But, previously we create a column that indicates whether the correlation is significant with p-value less than 0.05.
ggplot() +
geom_tile(
data = corr_pr,
aes(city, telecon, fill = estimate),
linewidth = 1,
colour = "white"
) +
geom_tile(
data = filter(corr_pr, sig == "Sig."),
aes(city, telecon),
linewidth = 1,
colour = "black",
fill = NA
) +
geom_text(
data = corr_pr,
aes(city, telecon,
label = round(estimate, 2),
fontface = ifelse(sig == "Sig.", "bold", "plain")
)
) +
scale_fill_gradient2(breaks = seq(-1, 1, 0.2)) +
labs(x = NULL, y = NULL, fill = NULL) +
coord_cartesian(expand = F) +
theme_void() +
theme(axis.text = element_text())
