Tutorial - make population pyramids with Queensland data


Tags: R, Tidyverse, ggplot, Tutorial, Epidemiology

In this tutorial I will use Queensland population data already downloaded from the Australian Bureau of Statistics (ABS), and rearrange the data to produce population pyramids using ggplot. The population pyramid is one of the most popular ways to visualise the age structure of a population. Each age group is represented by a bar, and the bars are stacked one above the other, youngest at the bottom and oldest at the top, with the sexes on opposite sides, which produces a simple shape.

Over recent years, the structure of populations has markedly changed, and the population pyramids have taken on more of a dome-like structure. This is nicely represented in the Queensland data. You can read more in the article from The Economist.

The data for this tutorial are the same as those used in the previous tutorial, in which a spreadsheet was downloaded from the ABS website.

The clean data are included here for convenience.

qld_pop <- read.csv(file = 'http://marquess.me/data/qld_pop.csv')

The objectives of this tutorial are to:

  • Recode variables in the data into intervals, or bins,
  • Perform some reasonably involved data manipulation,
  • Make a plot using ggplot.

The tools you will use are:

  • dplyr to manipulate the data,
  • cut_interval to group individual ages into age bins,
  • mutate to make new variables,
  • gather and spread to make useful variables in the data set,
  • ggplot to make the pyramid.

Only the main tidyverse library is required for this tutorial, plus ggthemes for the theme used in the final plot.


Recode the individual ages into age bins

The rows of single-year age data need to be grouped into 5-year age bins. The cut_interval function splits a variable into a given number of intervals of equal width. The data have ages from under 1 year old to over 100, which we want to split into 5-year age groups, so we need 20 intervals. We use the closed = 'right' argument to specify that the higher age in each bin is inclusive. We can supply a vector of labels, one for each interval.

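age_bins <- c('0-5','6-10','11-15','16-20','21-25',
              '26-30','31-35','36-40','41-45','46-50',
              '51-55','56-60','61-65','66-70','71-75',
              '76-80','81-85','86-90','91-95','96-100')  # labels after '21-25' are assumed to continue the same 5-year pattern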

qld_pop$age_bin <- cut_interval(qld_pop$Age, n=20, closed='right', labels=age_bins)
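To see how the binning behaves without downloading the data, here is a minimal sketch on a made-up vector of ages. The names ages, lower, upper, and bin_labels are illustrative and not part of the original tutorial, and the labels are built from the bin edges rather than typed by hand. The sketch relies on the default right-closed intervals of the underlying cut function.

```r
library(ggplot2)  # cut_interval() lives in ggplot2

ages <- 0:100  # hypothetical stand-in for qld_pop$Age

# Build the 20 labels from the bin edges: "0-5", "6-10", ..., "96-100"
lower <- c(0, seq(6, 96, by = 5))
upper <- seq(5, 100, by = 5)
bin_labels <- paste(lower, upper, sep = "-")

# Split the age range into 20 equal-width, right-closed intervals
binned <- cut_interval(ages, n = 20, labels = bin_labels)

table(binned)  # count how many single-year ages fall in each bin
```

Because the lowest value is included in the first interval, the first bin covers six single-year ages (0 to 5) while every other bin covers five.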

Prepare the data for plotting

In this part of the exercise we perform a number of steps to convert the line list of population counts for each year, single year of age, and sex into an aggregated percentage for each sex, age group, and year. The steps are as follows:

  • filter the population line list to select the three years we intend to plot,
  • group_by to group by year, sex, and age group,
  • summarise to sum the population in each group,
  • spread the data to obtain male and female columns,
  • mutate to calculate the percent in each group of males and females,
  • remove the redundant male and female population count columns,
  • mutate the male data to a negative value. This is important for plotting later on because of the way pyramid plots are constructed.
  • rename the columns,
  • gather the data set into long format.

pop_pyr_data_pct <-
  qld_pop %>%
  filter(Year %in% c(1977,1997,2017)) %>%
  group_by(Year, Sex, age_bin) %>%
  summarise(count=sum(Population)) %>%
  spread(Sex, count) %>%
  mutate(pct_F = Female*100/sum(Female),  pct_M = Male*100/sum(Male)) %>%
  mutate(pct_M = -pct_M) %>%
  select(-Female, -Male) %>%
  rename(AgeGrp = age_bin, Female=pct_F, Male=pct_M) %>%
  gather(Sex, Percent, -Year, -AgeGrp)
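In recent versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The following is a sketch of the same reshaping with those functions, not the code used in the original tutorial. It assumes dplyr 1.0 or later and tidyr 1.0 or later, and uses a small made-up table (toy_pop) in place of qld_pop so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for qld_pop after binning
toy_pop <- tibble(
  Year       = rep(c(1977, 2017), each = 4),
  Sex        = rep(c("Female", "Male"), times = 4),
  age_bin    = rep(rep(c("0-5", "6-10"), each = 2), times = 2),
  Population = c(100, 110, 90, 95, 120, 130, 80, 70)
)

toy_pct <- toy_pop %>%
  group_by(Year, Sex, age_bin) %>%
  summarise(count = sum(Population), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = count) %>%   # one column per sex
  group_by(Year) %>%                                       # percentages are within each year
  mutate(Female = Female * 100 / sum(Female),
         Male   = -Male * 100 / sum(Male)) %>%             # negative so males plot to the left
  ungroup() %>%
  rename(AgeGrp = age_bin) %>%
  pivot_longer(c(Female, Male), names_to = "Sex", values_to = "Percent")

toy_pct
```

Grouping by Year explicitly before mutate makes it clear that each sex sums to 100 percent within a year, instead of relying on the grouping left over from summarise.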

We obtain a data set that looks like this.

Year  AgeGrp  Sex     Percent
1971  0-5     Female  11.509087
1971  6-10    Female   9.987273
1971  11-15   Female   9.662507
1971  16-20   Female   8.697683
1971  21-25   Female   8.049023
1971  26-30   Female   6.629657
1971  31-35   Female   5.737124
1971  36-40   Female   5.451008

Plot the data

Now we can begin to plot the data using ggplot. Each pyramid plot is a plot of two sides, with male and female data displayed on either side of zero on the x-axis. That was the reason to convert the male data to negative values. The plot is simply a bar plot with some additional formatting.

Here are the steps to construct the plot:

  • load the data and set the aesthetics for age group on x, percent on y, and sex as the fill colour of the bars,
  • set stat = 'identity' and use a suitable width value (personal preference) to separate the bars,
  • because we have negative values for the male data, set the axis labels manually with labels and breaks. The breaks and labels need to be the same length to align,
  • flip the chart on its side with coord_flip(),
  • add a main title,
  • tinker with the format and colours of the theme,
  • I used theme_tufte from ggthemes for a nice clean graph and modified the font in this plot,
  • importantly, facet the plot by year so that each year has its own pyramid.

Here is the code for the plot.

library(ggthemes)  # provides theme_tufte()

ggplot(pop_pyr_data_pct, aes(x = AgeGrp, y = Percent, fill = Sex)) +   # map age group, percent, and sex
  geom_bar(stat = "identity", width = .85) +   # draw the bars
  scale_y_continuous(breaks = seq(-10, 10, length.out = 5),
                     labels = c('10%', '5%', '0', '5%', '10%')) +   # show male percentages as positive
  coord_flip() +  # flip axes
  labs(title = "Changes in Queensland population structure since 1977") +
  scale_fill_manual(values = c("#899DA4", "#C93312")) +
  theme_tufte(base_size = 12, base_family = "Avenir") +   # complete theme first, so the tweaks below are kept
  theme(plot.title = element_text(hjust = .5),   # centre plot title
        axis.ticks = element_blank()) +
  facet_grid(. ~ Year)   # one pyramid per year
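The hand-typed axis labels have to be kept in step with the breaks. As an alternative sketch, the labels argument of scale_y_continuous also accepts a function, so the labels can be derived from the breaks. The helper abs_pct and the toy data frame are hypothetical names used only for illustration, and this version prints '0%' where the plot above prints '0'.

```r
library(ggplot2)

# Hypothetical toy data: male percentages are already negative
toy <- data.frame(
  AgeGrp  = factor(rep(c("0-5", "6-10"), times = 2), levels = c("0-5", "6-10")),
  Sex     = rep(c("Female", "Male"), each = 2),
  Percent = c(10, 8, -9, -7)
)

# Hypothetical helper: label each break with its absolute value
abs_pct <- function(x) paste0(abs(x), "%")

p <- ggplot(toy, aes(x = AgeGrp, y = Percent, fill = Sex)) +
  geom_col(width = .85) +   # geom_col() draws bars from the values as given
  scale_y_continuous(breaks = seq(-10, 10, by = 5), labels = abs_pct) +
  coord_flip()

p
```

With a label function, changing the breaks (for example to steps of 2) does not require editing a matching vector of labels.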


©John Marquess
