An R tutorial to download and plot some Queensland population data

Posted:

Tags: R, Tidyverse, ggplot, Tutorial


Introduction

In this tutorial I will download Queensland population data from the Australian Bureau of Statistics (ABS), tidy it up a bit, and plot some graphs with ggplot. The data in the spreadsheet are not in the most useful format so this tutorial will use tidyverse tools to manipulate the data into something a bit more useful.

The objectives are to:

  • Download the spreadsheet from the ABS,
  • Extract data from the spreadsheet,
  • Wrangle the data to get into into a useful format,
  • Plot some of the data using ggplot.

The tools you will use are:

  • readxl to selectively use data from a spreadsheet,
  • stringr to tidy up column names,
  • kable and kableExtra to display data in a pretty table
  • lubridate to change a date format,
  • gather and separate to make useful variables in the data set,
  • ggplot to make a line graph,
  • wesanderson colour themes for nice plot colours,
  • you will also see how to make a custom ggplot theme.

Load the required libraries. I use the wesanderson package for picking manual colours in my plots.

library(tidyverse)
library(lubridate)
library(readxl)
library(stringr)
library(wesanderson)
library(kableExtra)

Make a custom ggplot theme

Making a theme is not too hard. You reaaly just have to type a ggplot theme name at the console without the parentheses and copy what you need into a function. This is my custom ggplot theme. It sets some ggplot options so I don’t have to add them to the plotting code.

theme_marquess <- function(){
  theme_minimal(base_size = 12, base_family="Avenir") %+replace%
    theme(rect = element_rect(fill = 'white', linetype = 0, colour = NA, size=NA),
          panel.background = element_blank(),
          panel.grid.minor.x = element_line(colour = NA, size = NA),
          panel.grid.minor.y = element_line(colour = 'gray90', size = 0.2),
          panel.grid.major.x = element_line(size = NA,  colour = NA),
          panel.grid.major.y = element_line(size = 0.3, linetype = 'solid', colour = "gray80"),
          strip.background = element_rect(fill = "grey80", colour = "grey20"),
          # Optional - sets x and y axis lines
          # axis.line = element_line(colour="gray50", size=0.3, lineend="square"),
          # Optional - sets a border around the chart. Leaving it off provides a cleab 538 style chart
          # panel.border = element_rect(linetype = "solid", fill = NA, color='gray60', size=1),
          legend.key = element_rect(fill="transparent", colour=NA)
    )
}

Get the data.

The spreadsheet files are available on the ABS website. I am using TABLE 53. Estimated Resident Population By Single Year Of Age, Queensland. I tend to download the spreadsheet locally and read in the data with readxl because the spreadsheets are periodically updated.

When I have downloaded the file to my working directory I read it in with the code below. There are a number of worksheets in the spreadsheet and we are only interested in two of them. One is the main data worksheet (Data1), and the other is the Index worksheet that I will use for column names.

The data worksheet is relatively straightforward. The range argument allows a certain range of cells that have data to be loaded. This means eyeballing the spreadsheet in advance to identify what needs to be extracted. Obtaining the column names reads the Index worksheet, specifies a range of cells to read, defines the data type in the cells (as text), and gives the tibble column a name.

qld <- read_excel(path = '3101053.xls',
                  sheet = 'Data1',
                  range = 'A10:GU57')

columns <- read_excel(path = '3101053.xls',
                      sheet = 1,
                      range='A13:A214',
                      col_types = 'text',
                      col_names = 'Header')

Tidy up the data (part 1).

The data set you get is 47 rows and 203 columns of population data. The columns show the unit age population levels from males aged 1 years old and under to 100 years old and over. Following that is the same for females. So 101 columns for males and 101 for females, plus 1 column for years results in 203 columns.

The Index worksheet that was used to extract strings to use as column headers is long and contains redundant information and punctuation. These can all be removed using the stringr package as follows. Just remove what is not necessary and trim the whitespace as follows.

columns$Header <- str_remove(columns$Header, pattern = "Estimated Resident Population ;  ")
columns$Header <- str_remove(columns$Header, pattern = "\ ; ")
columns$Header <- str_remove(columns$Header, pattern = "and over")
columns$Header <- str_remove(columns$Header, pattern = ";")
columns$Header <- str_trim(columns$Header, side = 'right')


colnames(qld) <- c('Year', columns$Header)

Look at the data.

You can use the kable and kableExtra packages to make a pretty table to look at bits of the data. Obviously in a regular R script you would just output to console. However, I just discovered kableExtra yesterday so I am keen to include the kable_styling function to make a Bootstrap style table.

kable(qld[1:5, 1:5], "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year Male 0 Male 1 Male 2 Female 0 Female 1 Female 2 Female 3
1971-06-01 19091 18909 18480 18378 17975 17982 17511
1972-06-01 20106 19460 19239 19380 18669 18349 18172
1973-06-01 19586 20438 19999 18819 19785 19122 18844
1974-06-01 19281 19988 20898 18266 19234 20287 19654
1975-06-01 18738 19513 20275 17834 18583 19582 20681

Tidy up the data (part 2).

As you can see the data in wide and untidy. Each column has tow components, sex and age, that should be separated out. Also, the data would benefit from being converted to long format. The Year column is not formatted as a year, it is still in a date format. We can fix this with a short number of steps.

  • Gather the data to long format using the columns with Sex_Age the key, Population as the value, and ignore the Year column.
  • Covert the Sex_Age columns to 2 columns, Sex and Age, using separate.
  • Convert the date to a year.
  • Convert the data in the Age columnn to integers.
qld_long <- qld %>% gather(Sex_Age, Population, -'Year')
qld_pop <- qld_long %>% separate(Sex_Age, into = c('Sex', 'Age'), sep = " ")
qld_pop$Year <- year(qld_pop$Year)
qld_pop$Age <- as.integer(qld_pop$Age)

Plot the data

Now we can begin to plot the data using ggplot. There are many ways to slice the data and plot it, but to keep thing simple let’s just do a line plot of population count of males and females by year for each year of age, for a particular year.

We will use filter to pick out a particular year to plot and make a line graph using geom_line, and my theme that I included earier.

qld_pop %>% filter(Year == 2017) %>%
  ggplot(aes(x=Age, y=Population, color=Sex)) +
  geom_line(size=1.1, alpha=0.8) +
  labs(x = "Age",
       y = 'Number of people',
       title = "Population of Queensland in 2017 by sex and age.") +
  scale_color_manual(values=wes_palette("Royal1")) +
  theme_marquess()

Click the image below to enlarge.

About
A website with some tutorials about he R programming language, and other things.
Meta

©John Marquess

Powered by Jekyll, Knitr, and Bootstrap.

Source available on Github.