Package | Description |
---|---|
tidyverse | Collection of packages (visualization, manipulation): ggplot2, dplyr, purrr, etc. |
mapSpain | Administrative boundaries of Spain at different levels |
sf | Simple Feature: import, export and manipulate vector data |
janitor | Simple functions for examining and cleaning dirty data |
patchwork | Simple grammar to combine separate ggplots into the same graphic |
I recently came across a map from the National Atlas of Spain showing the number of libraries by municipality. However, one thing directly caught my attention. There’s a saying that many maps show only the population, and this seems to be the case here. The map does not provide any remarkable information about the distribution of libraries; it merely shows where the most people live. It’s a very common pitfall. There are many phenomena that depend on population. It can be assumed that the number of libraries largely depends on the number of inhabitants in each municipality. Therefore, we see cities like Madrid, Barcelona, Seville, Valencia, and many provincial capitals with a higher number of libraries.
“A good practice is always to normalize variables that depend highly on the population. Pairing absolute numbers with normalized data is also beneficial. However, there are cases where the absolute number may be of interest. For instance, when considering poverty, normalized numbers may obscure the significant number of individuals affected.
Normalization of population-dependent variables can be performed in different ways, depending on the context and type of data. In our case, it involves dividing the number of libraries by the total population. To avoid very small numbers, it is common to multiply by a fixed population, such as 10,000. Another alternative, but not in this case, would be the use of percent.
\[\text{Libraries per 10,000 inhabitants} = \left( \frac{\text{Number of libraries}}{\text{Total population}} \right) \times 10,000\]
If you divide the other way around, it gives you a different perspective. You are calculating how many people, on average, have access to a single library.
\[\text{Inhabitants per library} = \frac{\text{Total population}}{\text{Number of libraries}}\]
Let’s get to work!
Packages
# install the packages if necessary
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("mapSpain")) install.packages("mapSpain")
if (!require("sf")) install.packages("sf")
if (!require("janitor")) install.packages("janitor")
if (!require("patchwork")) install.packages("patchwork")
# packages
library(tidyverse)
library(readxl)
library(sf)
library(mapSpain)
library(janitor)
library(patchwork)
Data
In this post we will use a dataset of libraries from Spain for 2022 (download)1 which I obtained from the Ministry of Culture. In this step we also import the population data from the INE and municipality boundaries from the mapSpain
package.
# import library data
bib <- read_excel("datos.xlsx") |>
clean_names() |>
filter(pais == "España")
bib
# A tibble: 7,482 × 26
niden nombre nombre_ca pais codigo_ca ca codigo_provincia provincia
<chr> <chr> <lgl> <chr> <chr> <chr> <chr> <chr>
1 397-4003003 Bibli… NA Espa… 01 Anda… 04 Almería
2 397-4003001 Bibli… NA Espa… 01 Anda… 04 Almería
3 397-4003002 Bibli… NA Espa… 01 Anda… 04 Almería
4 580-4004001 Bibli… NA Espa… 01 Anda… 04 Almería
5 589-4005001 Bibli… NA Espa… 01 Anda… 04 Almería
6 398-4006001 Bibli… NA Espa… 01 Anda… 04 Almería
7 590-4007001 Bibli… NA Espa… 01 Anda… 04 Almería
8 8468-10180 Bibli… NA Espa… 01 Anda… 04 Almería
9 399-4011001 Bibli… NA Espa… 01 Anda… 04 Almería
10 412-7108 Bibli… NA Espa… 01 Anda… 04 Almería
# ℹ 7,472 more rows
# ℹ 18 more variables: codigo_municipio <chr>, municipio <chr>,
# entidad_menor <chr>, direccion <chr>, distrito_postal <chr>,
# tipologia <chr>, telefono <chr>, correo_electronico <chr>,
# pagina_web <chr>, tipo_biblioteca <chr>, ano_fundacion <dbl>,
# telefono2 <chr>, longitud <chr>, latitud <chr>, correo_prestamo <chr>,
# acceso_catalogo <chr>, titularidad <chr>, gestion <chr>
# region and municipality boundaries
ccaa <- esp_get_ccaa() # for the map
mun <- esp_get_munic()
mun
Simple feature collection with 8131 features and 7 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -13.21926 ymin: 34.70178 xmax: 4.32409 ymax: 43.7889
Geodetic CRS: ETRS89
First 10 features:
codauto ine.ccaa.name cpro ine.prov.name cmun name LAU_CODE
382 01 Andalucía 04 Almería 001 Abla 04001
379 01 Andalucía 04 Almería 002 Abrucena 04002
374 01 Andalucía 04 Almería 003 Adra 04003
375 01 Andalucía 04 Almería 004 Albánchez 04004
358 01 Andalucía 04 Almería 005 Alboloduy 04005
373 01 Andalucía 04 Almería 006 Albox 04006
350 01 Andalucía 04 Almería 007 Alcolea 04007
364 01 Andalucía 04 Almería 008 Alcóntar 04008
352 01 Andalucía 04 Almería 009 Alcudia de Monteagud 04009
349 01 Andalucía 04 Almería 010 Alhabia 04010
geometry
382 POLYGON ((-2.77744 37.23836...
379 POLYGON ((-2.88984 37.09213...
374 POLYGON ((-2.93161 36.75079...
375 POLYGON ((-2.13138 37.29959...
358 POLYGON ((-2.70077 37.09674...
373 POLYGON ((-2.15335 37.54576...
350 POLYGON ((-3.05663 36.88506...
364 POLYGON ((-2.65344 37.33238...
352 POLYGON ((-2.27371 37.2416,...
349 POLYGON ((-2.5425 36.97485,...
# population data
pob <- read_csv2("33570.csv") |> janitor::clean_names()
pob
# A tibble: 10,739,520 × 5
sexo municipios edad_grupos_quinquenales periodo total
<chr> <chr> <chr> <chr> <dbl>
1 Total Total Nacional Todas las edades 1 de enero de 2022 47475420
2 Total Total Nacional Todas las edades 1 de enero de 2021 47385107
3 Total Total Nacional Todas las edades 1 de enero de 2020 47450795
4 Total Total Nacional Todas las edades 1 de enero de 2019 47026208
5 Total Total Nacional Todas las edades 1 de enero de 2018 46722980
6 Total Total Nacional Todas las edades 1 de enero de 2017 46572132
7 Total Total Nacional Todas las edades 1 de enero de 2016 46557008
8 Total Total Nacional Todas las edades 1 de enero de 2015 46624382
9 Total Total Nacional Todas las edades 1 de enero de 2014 46771341
10 Total Total Nacional Todas las edades 1 de enero de 2013 47129783
# ℹ 10,739,510 more rows
Preparation
For the library data, all that is needed is to create the municipality code and count the number of libraries. Then, we use the left_join()
function to join the vector data of the municipalities and the number of libraries. We also prepare the population data. Here, we must choose the whole population (any gender and age) and the year 2022. The municipality code must be extracted as a 5-digit number.
# select needed columns and create municipality code
bib <- select(bib, codigo_provincia, codigo_municipio) |>
mutate(LAU_CODE = str_c(codigo_provincia, codigo_municipio))
# count libraries
bib_count <- count(bib, LAU_CODE)
# join with boundaries
mun <- left_join(mun, bib_count)
# filter population data
pob_mun <- filter(
pob, sexo == "Total",
municipios != "Total Nacional",
edad_grupos_quinquenales == "Todas las edades",
str_detect(periodo, "2022")
) |>
select(municipios, total) |>
mutate(LAU_CODE = str_extract(municipios, "[0-9]{5}"))
Visualization
To create a proportional symbol map, we simply use the centroid of the municipality for the location. It is important to order the observations from highest to lowest, which ensures that the smaller circles are drawn on top of the larger ones.
In ggplot2, we use geom_sf()
for the vector data and indicate size with the column n. Additionally, we set the size range with scale_size()
and specify the breaks in the legend. We are also making a few small aesthetic changes.
m1 <- st_centroid(mun) |> arrange(-n)
p1 <- ggplot() +
geom_sf(data = ccaa, fill = "grey90", colour = "white") +
geom_sf(
data = m1,
aes(size = n),
alpha = .5, shape = 21, fill = "#7a0177", colour = "white"
) +
scale_size(range = c(1, 20), breaks = c(10, 50, 150, 400)) +
labs(size = NULL, title = "Absolute number of libraries") +
theme_void()
p1
This map clearly resembles the one in the national atlas. Big cities, such as Madrid, Barcelona or Valencia, are highlighted.
But what would we see if we normalize the variable?
To do this, we must first join the population data and calculate the rate.
m2 <- st_centroid(mun) |>
left_join(pob_mun) |>
mutate(norm_pob = n * 10000 / total) |>
arrange(-norm_pob)
select(m2, n, total:norm_pob)
Simple feature collection with 8131 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -13.12607 ymin: 34.75565 xmax: 4.2862 ymax: 43.7256
Geodetic CRS: ETRS89
First 10 features:
n total norm_pob geometry
1 1 19 526.31579 POINT (-1.30825 43.01419)
2 1 52 192.30769 POINT (-2.812271 40.57015)
3 1 57 175.43860 POINT (-2.868489 41.72316)
4 1 86 116.27907 POINT (-6.472424 40.34306)
5 1 95 105.26316 POINT (-0.4256999 40.02079)
6 1 96 104.16667 POINT (-1.24694 42.2997)
7 1 96 104.16667 POINT (-5.350944 39.71565)
8 1 102 98.03922 POINT (-0.3685699 41.34055)
9 1 104 96.15385 POINT (-5.427423 40.37489)
10 1 104 96.15385 POINT (-5.658987 39.72373)
Now we simply repeat the same map using the norm_pob column.
p2 <- ggplot() +
geom_sf(data = ccaa, fill = "grey90", colour = "white") +
geom_sf(
data = m2,
aes(size = norm_pob),
alpha = .5, shape = 21, fill = "#7a0177", colour = "white"
) +
scale_size(range = c(.5, 10), breaks = c(1, 10, 100, 500)) +
labs(size = NULL, title = "Libraries per 10,000 inhabitants") +
theme_void()
p2
This map now tells a completely different story. We see a high number of libraries per inhabitant in certain regions, while others are more homogeneous with a consistently low number of libraries. However, there are also impossible values that stand out. For example, in the north, Roncesvalles reaches 526 libraries per 10,000 inhabitants! That can’t be right, can it? This issue is caused by the very small municipalities. In this case, there is one library for every 19 inhabitants. In any case, what is also noticeable is that there are significantly fewer libraries per inhabitant in major cities like Madrid and Barcelona, with only 1 library per 10,000 inhabitants. It is not easy to find a multiplier that works uniformly to reflect both small and large municipalities.
To solve the problem of small municipalities, one possible good strategy would be to exclude all those with less than 100 inhabitants. It can even help in reducing overlapping.
p3 <- ggplot() +
geom_sf(data = ccaa, fill = "grey90", colour = "white") +
geom_sf(
data = filter(m2, total > 100),
aes(size = norm_pob),
alpha = .5, shape = 21, fill = "#7a0177", colour = "white"
) +
scale_size(range = c(.1, 10), breaks = c(1, 10, 50, 90)) +
labs(size = NULL, title = "Libraries per 10,000 inhabitants (without municipalities < 100)") +
theme_void()
p3
Finally, we create a comparison with the patchwork grammar.
p1 + p2 + p3 & theme(plot.title = element_text(size = 20, hjust = .5),
legend.text = element_text(size = 12))
Different ways of telling different stories! A completely different question is how much the overlapping of circles prevents us from reading the map accurately. For instance, we could use a Dorling cartogram for the population and use color to show the number of libraries. Below, you see an example with the foreign population by origin. The R Code for this I will post another time.
Footnotes
Include also population data.↩︎