require(data.table)
require(tidyverse)
fread("https://www.dropbox.com/s/7xb76wliw4qoy4q/Output_Area_%282001%29_to_Output_Area_%282011%29_to_Local_Authority_District_%282011%29_Lookup_in_England_and_Wales.csv?dl=1",
showProgress = FALSE,
select = c(1: 4)) |>
as_tibble() |>
full_join(
fread("https://www.dropbox.com/s/wn6y5juef85nlgf/OA_%282011%29_to_OA_%282021%29_to_Local_Authority_District_%282022%29_for_England_and_Wales_Lookup.csv?dl=1",
showProgress = FALSE,
select = c(1: 5)) |>
as_tibble(),
by = "OA11CD") |>
rename(CHG11 = CHGIND, CHG21 = CHNGIND) |>
mutate(CHG21 = ifelse(CHG21 == "x", "X", CHG21)) -> df
Creating harmonised pseudo-Output Areas
for the 2001, 2011 & 2021 Censuses (England and Wales)
Please cite as
Harris R (2022) Creating harmonised pseudo-Output Areas for the 2001, 2011 & 2021 Censuses (England and Wales). https://profrichharris.github.io/census/harmonised.
The resulting files can be downloaded from https://www.dropbox.com/sh/jq4dkokxyflxpei/AAD1GOvoN7R_YyRQSN3aSMQ4a?dl=0
Introduction
One of the great step forwards from the 2001 Census onwards was the use of statistical Output Areas (OAs) at the most granular scale that are kept as consistent as possible from one census to the next. However, a few do change so, to make analysing change over time a little easier, we can lightly aggregate them into a consistent set of pseudo-OAs that have the same geography for each census. In many cases the pseudo-OA and the actual OAs will be the same but in some neighbouring OAs will be merged together.
Source data
The key to it all are the following lookup files: https://geoportal.statistics.gov.uk/maps/output-area-2001-to-output-area-2011-to-local-authority-district-2011-lookup-in-england-and-wales-1
They show which OAs are modified or stay the same from one census to the next. We can begin by reading them into R, joining them together, and tidying them up a little:
What we obtain is a data file, df
, which shows all the ‘connections’ between Output Areas from the 2001 Census, OA01CD
, to the 2011 Census, OA11CD
, to the 2021 Census, OA21CD
. For example, OA E00000001 is unchanged (“U”) across censuses, whereas E00000004 is merged (“M”) into E00166758 for 2011 but split (“S”) into OAs that include E00190429 in 2021:
|>
df filter(OA01CD == "E00000001")
# A tibble: 1 × 8
OA01CD OA01CDO OA11CD CHG11 OA21CD CHG21 LAD22CD LAD22NM
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 E00000001 00AAFA0001 E00000001 U E00000001 U E09000001 City of London
|>
df filter(OA01CD == "E00000004")
# A tibble: 2 × 8
OA01CD OA01CDO OA11CD CHG11 OA21CD CHG21 LAD22CD LAD22NM
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 E00000004 00AAFA0004 E00166758 M E00190429 S E09000001 City of London
2 E00000004 00AAFA0004 E00166758 M E00190433 S E09000001 City of London
The extra OA01CDO
arises because the original 2001 Census Output Area codes were recoded prior to 2011. OA01CDO
are the original codes, OA01CD
are what they became.
Stage 1
Imagine an Output Area, X, from one census is split into Y and Z for the next. That means X, Y and Z are all connected, where X is the parent of Y and Z. Alternatively, imagine X and Y from one census are merged into Z for the next. They are still all connected but, in the second case, Z is the parent of X and Y. Applying similar logic allows a consistent OA code, OAXXCD
, to be applied to the majority of cases in the data file, df
.
|>
df mutate(OAXXCD = NA) |>
mutate(OAXXCD = ifelse(CHG11 == "U" & CHG21 == "U", OA21CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "M" & CHG21 == "U", OA21CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "S" & CHG21 == "U", OA01CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "U" & CHG21 == "M", OA21CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "U" & CHG21 == "S", OA11CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "S" & CHG21 == "S", OA01CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "M" & CHG21 == "M", OA21CD, OAXXCD)) |>
mutate(OAXXCD = ifelse(CHG11 == "M" & CHG21 == "S", OA11CD, OAXXCD)) -> df
However, there are some cases that cannot be resolved in this way, of which some are,
|>
df filter(is.na(OAXXCD))
# A tibble: 1,146 × 9
OA01CD OA01CDO OA11CD CHG11 OA21CD CHG21 LAD22CD LAD22NM OAXXCD
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 E00000233 00ABGD0001 E00175112 S E00185671 M E09000002 Barkin… <NA>
2 E00000105 00ABFZ0005 E00165805 X E00165805 U E09000002 Barkin… <NA>
3 E00000729 00ACGA0007 E00000729 U E00178723 X E09000003 Barnet <NA>
4 E00000731 00ACGA0009 E00000731 U E00178723 X E09000003 Barnet <NA>
5 E00000731 00ACGA0009 E00000731 U E00178889 X E09000003 Barnet <NA>
6 E00000731 00ACGA0009 E00000731 U E00178893 X E09000003 Barnet <NA>
7 E00000733 00ACGA0011 E00000733 U E00178724 X E09000003 Barnet <NA>
8 E00000733 00ACGA0011 E00000733 U E00178891 X E09000003 Barnet <NA>
9 E00000733 00ACGA0011 E00000733 U E00178892 X E09000003 Barnet <NA>
10 E00000734 00ACGA0012 E00000734 U E00178724 X E09000003 Barnet <NA>
# … with 1,136 more rows
They arise in the following circumstances,
|>
df filter(is.na(OAXXCD)) |>
select(CHG11, CHG21) |>
group_by(CHG11, CHG21) |>
summarise(n = n()) |>
print(n = Inf)
`summarise()` has grouped output by 'CHG11'. You can override using the
`.groups` argument.
# A tibble: 9 × 3
# Groups: CHG11 [5]
CHG11 CHG21 n
<chr> <chr> <int>
1 M X 24
2 S M 456
3 S X 140
4 U X 246
5 X S 14
6 X U 209
7 <NA> M 4
8 <NA> S 15
9 <NA> U 38
They are very few of them – only r length(which(is.na(df$OAXXCD)))`
of 190743 entries in the file – but it is good to resolve them.
Stage 2
A small number of the cases that have yet to be resolved arise because there is no connection between the 2001 Census codes and the 2011 Census ones:
|>
df filter(is.na(CHG11))
# A tibble: 57 × 9
OA01CD OA01CDO OA11CD CHG11 OA21CD CHG21 LAD22CD LAD22NM OAXXCD
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 <NA> <NA> E00165720 <NA> E00165720 U E08000012 Liverpool <NA>
2 <NA> <NA> E00165673 <NA> E00165673 U E08000035 Leeds <NA>
3 <NA> <NA> E00165738 <NA> E00165738 U E07000245 West Suffolk <NA>
4 <NA> <NA> E00165746 <NA> E00165746 U E08000026 Coventry <NA>
5 <NA> <NA> E00165758 <NA> E00165758 U E08000012 Liverpool <NA>
6 <NA> <NA> E00165767 <NA> E00165767 U E06000003 Redcar and C… <NA>
7 <NA> <NA> E00165770 <NA> E00165770 U E06000003 Redcar and C… <NA>
8 <NA> <NA> E00165773 <NA> E00188170 M E06000003 Redcar and C… <NA>
9 <NA> <NA> E00165703 <NA> E00165703 U E06000007 Warrington <NA>
10 <NA> <NA> E00165705 <NA> E00165705 U E06000007 Warrington <NA>
# … with 47 more rows
I will resolve this through proximity matching. Specifically, by giving the population weighted centroid of OA11CD
a holding OA01CD
of the nearest population weighted centroid from 2001. The population weighted centroids are available from https://geoportal.statistics.gov.uk/maps/output-areas-december-2001-population-weighted-centroids-2 and from https://geoportal.statistics.gov.uk/maps/output-areas-dec-2011-pwc.
require(RANN)
read_csv("https://www.dropbox.com/s/1fcm9ggpxudv72l/Output_Areas_%28December_2001%29_Population_Weighted_Centroids.csv?dl=1") |>
select(X, Y, oa01cd) |>
rename(OA01CD = oa01cd) -> pts01
read_csv("https://www.dropbox.com/s/bsyg152nkcthvzc/Output_Areas_%28Dec_2011%29_PWC.csv?dl=1") |>
select(X, Y, OA11CD) |>
semi_join(df |> filter(is.na(OA01CD)), by = "OA11CD") -> pts11
<- nn2(data = pts01 |> select(-OA01CD),
i query = pts11 |> select(-OA11CD),
k = 1)$nn.idx[,1]
left_join(df,
data.frame(OA11CD = pts11$OA11CD, nearest = pts01$OA01CD[i]),
by = "OA11CD") |>
mutate(OA01CD = ifelse(!is.na(nearest), nearest, OA01CD)) |>
select(-nearest) -> df
Note that whilst all the cases where is.na(CHG11)
have now been given a holding OA01CD
, the OAXXCD
has yet to be assigned. That will be included in the next stage.
|>
df filter(is.na(CHG11)) |>
select(OA01CD, CHG11, OAXXCD)
# A tibble: 57 × 3
OA01CD CHG11 OAXXCD
<chr> <chr> <chr>
1 E00033058 <NA> <NA>
2 E00056821 <NA> <NA>
3 E00152565 <NA> <NA>
4 E00048351 <NA> <NA>
5 E00032881 <NA> <NA>
6 E00061137 <NA> <NA>
7 E00061229 <NA> <NA>
8 E00061071 <NA> <NA>
9 E00063383 <NA> <NA>
10 E00063066 <NA> <NA>
# … with 47 more rows
Stage 3
The next stage is best illustrated with an example. The Output Area E00000233 has yet to be assigned an OAXXCD
because of how it has been split but subsequently merged in successive censuses:
|>
df filter(is.na(OAXXCD)) |>
select(OA01CD, OA11CD, OA21CD, OAXXCD, CHG11, CHG21) |>
slice(n = 1)
# A tibble: 1 × 6
OA01CD OA11CD OA21CD OAXXCD CHG11 CHG21
<chr> <chr> <chr> <chr> <chr> <chr>
1 E00000233 E00175112 E00185671 <NA> S M
What we need to do is find every output area that is connected to E00000233 through any of the censuses. An initial search reveals,
|>
df filter(is.na(OAXXCD)) |>
slice(n = 1) -> search
|>
df filter(OA01CD %in% search$OA01CD |
%in% search$OA11CD |
OA11CD %in% search$OA21CD) -> found
OA21CD
found
# A tibble: 8 × 9
OA01CD OA01CDO OA11CD CHG11 OA21CD CHG21 LAD22CD LAD22NM OAXXCD
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 E00000233 00ABGD0001 E00175110 S E00175110 U E09000002 Barking… E0000…
2 E00000233 00ABGD0001 E00175111 S E00185716 S E09000002 Barking… E0000…
3 E00000233 00ABGD0001 E00175111 S E00185773 S E09000002 Barking… E0000…
4 E00000233 00ABGD0001 E00175111 S E00185814 S E09000002 Barking… E0000…
5 E00000233 00ABGD0001 E00175112 S E00185671 M E09000002 Barking… <NA>
6 E00000233 00ABGD0001 E00175121 S E00175121 U E09000002 Barking… E0000…
7 E00000233 00ABGD0001 E00175122 S E00175122 U E09000002 Barking… E0000…
8 E00000242 00ABGD0010 E00000242 U E00185671 M E09000002 Barking… E0018…
But we also need to check for the the connections of the connections and so forth, recursively, until no more are found:
<- nrow(found)
n <- 0
n.previous
while(n > n.previous) {
<- n
n.previous <- found
search
|>
df filter(OA01CD %in% search$OA01CD |
%in% search$OA11CD |
OA11CD %in% search$OA21CD) -> found
OA21CD
<- nrow(found)
n }
Once this is done, all of the connected output areas can be assigned the same identifying OAXXCD
.
|>
df mutate(OAXXCD = ifelse(OA01CD %in% found$OA01CD |
%in% found$OA11CD |
OA11CD %in% found$OA21CD,
OA21CD $OA21CD[1],
found-> df OAXXCD))
This whole process now needs to be repeated for every instance of where is.na(OAXXCD
) until they are all exhausted, which will take a few minutes to complete.
while(nrow(df |> filter(is.na(OAXXCD))) > 0) {
|>
df filter(is.na(OAXXCD)) |>
slice(n = 1) -> found
<- nrow(found)
n <- 0
n.previous
while(n > n.previous) {
<- n
n.previous <- found
search
|>
df filter(OA01CD %in% search$OA01CD |
%in% search$OA11CD |
OA11CD %in% search$OA21CD) -> found
OA21CD
<- nrow(found)
n
}
|>
df mutate(OAXXCD = ifelse(OA01CD %in% found$OA01CD |
%in% found$OA11CD |
OA11CD %in% found$OA21CD,
OA21CD $OA21CD[1],
found-> df
OAXXCD))
}
The end result is that all of the Output Areas from each of the censuses are now assigned to one of 174363 unique pseudo-OAs, each with its own OAXXCD
.
df
# A tibble: 190,743 × 9
OA01CD OA01CDO OA11CD CHG11 OA21CD CHG21 LAD22CD LAD22NM OAXXCD
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 E00000001 00AAFA0001 E00000001 U E00000001 U E09000001 City o… E0000…
2 E00000051 00ABFX0015 E00000051 U E00000051 U E09000002 Barkin… E0000…
3 E00000002 00AAFA0002 E00166756 M E00166756 U E09000001 City o… E0016…
4 E00000003 00AAFA0003 E00000003 U E00000003 U E09000001 City o… E0000…
5 E00000004 00AAFA0004 E00166758 M E00190429 S E09000001 City o… E0016…
6 E00000004 00AAFA0004 E00166758 M E00190433 S E09000001 City o… E0016…
7 E00000005 00AAFA0005 E00000005 U E00000005 U E09000001 City o… E0000…
8 E00000006 00AAFA0006 E00166758 M E00190429 S E09000001 City o… E0016…
9 E00000006 00AAFA0006 E00166758 M E00190433 S E09000001 City o… E0016…
10 E00000007 00AAFA0007 E00000007 U E00000007 U E09000001 City o… E0000…
# … with 190,733 more rows
length(unique(df$OAXXCD))
[1] 174363
Recoding
These pseudo-OAs are a light aggregation of the 175434 2001 Census, 181408 2011 Census, and 188880 2021 Census Output Areas. The OAXXCD
that have been assigned to them are a combination of various OA01CD
, OA11CD
and OA21CD
codes. To avoid confusion, it may be better to replace those with custom numbered identifiers:
|>
df arrange(OA01CD) -> df
|>
df select(OAXXCD) |>
filter(!duplicated(OAXXCD)) -> gps
data.frame(OAXXCD = gps$OAXXCD, i = 1: nrow(gps)) -> gps
|>
df left_join(gps, by = "OAXXCD") |>
mutate(i = str_pad(i, width = 7, pad = "0")) |>
mutate(OAXXCD = paste0("X", i, substr(OAXXCD, 1, 1))) |>
select(-i) -> df
We should also reset the OA01CD
for those based on spatial proximity back to their original NA
value.
|>
df mutate(OA01CD = ifelse(is.na(CHG11), NA, OA01CD)) -> df
Look-up files
We are now in a position to save the look-up files.
|>
df write_csv("OAXXCD_full_lookup.csv")
|>
df select(OA01CDO, OA01CD, OAXXCD) |>
unique(na.rm = TRUE) |>
arrange(OA01CD) |>
write_csv("OA01CD_to_OAXXCD.csv")
|>
df select(OA11CD, OAXXCD) |>
unique(na.rm = TRUE) |>
arrange(OA11CD) |>
write_csv("OA11CD_to_OAXXCD.csv")
|>
df select(OA21CD, OAXXCD) |>
unique(na.rm = TRUE) |>
arrange(OA21CD) |>
write_csv("OA21CD_to_OAXXCD.csv")
Higher-level geographies
It is helpful to know which higher-level geographies the pseudo-OAs fall within. Some of the various that could be used include MSOAs and their ‘friendly names’ (available from the House of Commons Library), local authority districts, and regions. We begin by matching these geographies to the 2021 Census OA geography and adding in their residential population counts from census table TS021, which are needed in the next step.
left_join(df |>
select(OAXXCD, OA21CD, LAD22CD, LAD22NM),
read_csv("https://www.dropbox.com/s/vcaw0emrgz5efta/OA21_LSOA21_MSOA21_LAD22_EW_LU.csv?dl=1",
progress = FALSE,
col_select = c(1, 4)) |>
rename(OA21CD = 1, MSOA21CD = 2), by = "OA21CD") |>
left_join(read_csv("https://www.dropbox.com/s/o3albmackgz0zao/MSOA-Names-2.2.csv?dl=1", progress = FALSE,
col_select = c(1, 4)) |>
rename(MSOA21CD = 1, MSOA21NM = 2), by = "MSOA21CD") |>
left_join(read_csv("https://www.dropbox.com/s/8um5h5quf9wc2vj/OA21_RGN22_LU.csv?dl=1", col_select = 1:3) |>
rename(OA21CD = 1, RGN21CD = 2, RGN21NM = 3), by = "OA21CD") |>
left_join(read_csv("https://www.dropbox.com/s/lw8r575noohhuu5/TS021-2021-1-filtered-2022-11-30T10_03_36Z.csv?dl=1",
progress = FALSE,
col_select = c(1, 5)) |>
rename(OA21CD = 1, Popn = 2) |>
group_by(OA21CD) |>
summarise(Popn = sum(Popn)) |>
ungroup(), by = "OA21CD") -> geography
geography
# A tibble: 190,743 × 9
OAXXCD OA21CD LAD22CD LAD22NM MSOA21CD MSOA21NM RGN21CD RGN21NM Popn
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 X0000001E E00000001 E09000001 City o… E020000… City of… E12000… London 176
2 X0000002E E00166756 E09000001 City o… E020000… City of… E12000… London 279
3 X0000003E E00000003 E09000001 City o… E020000… City of… E12000… London 255
4 X0000004E E00190429 E09000001 City o… E020000… City of… E12000… London 281
5 X0000004E E00190433 E09000001 City o… E020000… City of… E12000… London 226
6 X0000005E E00000005 E09000001 City o… E020000… City of… E12000… London 112
7 X0000004E E00190429 E09000001 City o… E020000… City of… E12000… London 281
8 X0000004E E00190433 E09000001 City o… E020000… City of… E12000… London 226
9 X0000006E E00000007 E09000001 City o… E020000… City of… E12000… London 144
10 X0000002E E00166756 E09000001 City o… E020000… City of… E12000… London 279
# … with 190,733 more rows
There is no guarantee that the pseudo-OAs will all fit exactly into the higher-level geographies, although most map exactly to 2021 Census OAs and will. For the ones that don’t, they are assigned the best fit, where best fit is defined by weight of population:
|>
geography select(OAXXCD, MSOA21CD, MSOA21NM,
LAD22CD, LAD22NM,|>
RGN21CD, RGN21NM, Popn) unique() -> geography
|>
geography group_by(OAXXCD) |>
summarise(Popn = sum(Popn), .groups = "keep") |>
left_join(geography |>
group_by(OAXXCD, MSOA21CD, MSOA21NM) |>
summarise(Popn = sum(Popn), .groups = "keep") |>
group_by(OAXXCD) |>
arrange(OAXXCD, desc(Popn)) |>
select(-Popn) |>
slice(1) |>
ungroup(), by = "OAXXCD") |>
left_join(geography |>
group_by(OAXXCD, LAD22CD, LAD22NM) |>
summarise(Popn = sum(Popn), .groups = "keep") |>
group_by(OAXXCD) |>
arrange(OAXXCD, desc(Popn)) |>
select(-Popn) |>
slice(1) |>
ungroup(), by = "OAXXCD") |>
left_join(geography |>
group_by(OAXXCD, RGN21CD, RGN21NM) |>
summarise(Popn = sum(Popn), .groups = "keep") |>
group_by(OAXXCD) |>
arrange(OAXXCD, desc(Popn)) |>
select(-Popn) |>
slice(1) |>
ungroup(), by = "OAXXCD") |>
select(OAXXCD, Popn, MSOA21CD, MSOA21NM, LAD22CD, LAD22NM,
-> geography
RGN21CD, RGN21NM)
|>
geography write_csv("OAXXCD_to_MSOAXXNM_to_LADXXNM_RGNXXNM.csv")
geography
# A tibble: 174,363 × 8
# Groups: OAXXCD [174,363]
OAXXCD Popn MSOA21CD MSOA21NM LAD22CD LAD22NM RGN21CD RGN21NM
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 X0000001E 176 E02000001 City of London E09000001 City of L… E12000… London
2 X0000002E 279 E02000001 City of London E09000001 City of L… E12000… London
3 X0000003E 255 E02000001 City of London E09000001 City of L… E12000… London
4 X0000004E 507 E02000001 City of London E09000001 City of L… E12000… London
5 X0000005E 112 E02000001 City of London E09000001 City of L… E12000… London
6 X0000006E 144 E02000001 City of London E09000001 City of L… E12000… London
7 X0000007E 348 E02000001 City of London E09000001 City of L… E12000… London
8 X0000008E 175 E02000001 City of London E09000001 City of L… E12000… London
9 X0000009E 503 E02000001 City of London E09000001 City of L… E12000… London
10 X0000010E 255 E02000001 City of London E09000001 City of L… E12000… London
# … with 174,353 more rows
Boundary file for the pseudo-OAs
To map the pseudo-OAs, we will create a boundary file for them, based on the 2021 Census OA boundaries. We begin by loading in those boundaries and appending the OAXXCD
codes.
require(sf)
st_read("https://www.dropbox.com/s/2ffk5mwezyyjzre/Output_Areas_%28December_2021%29_Boundaries_Generalised_Clipped_EW_%28BGC%29.geojson?dl=1") |>
left_join(df |> select(OAXXCD, OA21CD) |> unique(), by = "OA21CD") -> map
Reading layer `OA_2021_EW_BGC' from data source
`https://www.dropbox.com/s/2ffk5mwezyyjzre/Output_Areas_%28December_2021%29_Boundaries_Generalised_Clipped_EW_%28BGC%29.geojson?dl=1'
using driver `GeoJSON'
Simple feature collection with 188880 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -6.418667 ymin: 49.86479 xmax: 1.763706 ymax: 55.81112
Geodetic CRS: WGS 84
map
Simple feature collection with 188880 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -6.418667 ymin: 49.86479 xmax: 1.763706 ymax: 55.81112
Geodetic CRS: WGS 84
First 10 features:
OBJECTID OA21CD GlobalID SHAPE_Length
1 1 E00000001 {BC5EB21B-D42B-4715-A771-2C27575A08F0} 0.005263040
2 2 E00000003 {A1A2B34F-320E-4BB8-ACB4-7CA7CA16EF9C} 0.003439877
3 3 E00000005 {9337DA1A-FE0F-4210-9C95-ED2D20FD6287} 0.004284711
4 4 E00000007 {B336E11F-AF26-48A6-AC67-44F5B8B8840A} 0.016351659
5 5 E00000010 {CA8F9874-CDF5-4C1A-9D39-F74A410DAE44} 0.002386808
6 6 E00000013 {B9B93246-BBFB-46AF-8011-6B8ED5908956} 0.005347143
7 7 E00000018 {FCB7C232-4165-44DA-8FF8-B6CC383B528A} 0.012839007
8 8 E00000019 {715E0851-2375-4D6A-94AC-532BA0833C60} 0.007299853
9 9 E00000020 {4251A2FF-8FBE-4C11-AECE-044F23AB9AF2} 0.004090582
10 10 E00000021 {48B2DF49-45ED-48A4-B7FC-E0E32C1AD088} 0.025416295
SHAPE_Area OAXXCD geometry
1 9.001827e-07 X0000001E MULTIPOLYGON (((-0.09447861...
2 5.819360e-07 X0000003E MULTIPOLYGON (((-0.09576465...
3 1.109536e-06 X0000005E MULTIPOLYGON (((-0.09627474...
4 9.843568e-06 X0000006E MULTIPOLYGON (((-0.09600549...
5 2.724187e-07 X0000008E MULTIPOLYGON (((-0.09688753...
6 1.138881e-06 X0000011E MULTIPOLYGON (((-0.09525163...
7 6.407396e-06 X0000013E MULTIPOLYGON (((-0.0924986 ...
8 1.524146e-06 X0000014E MULTIPOLYGON (((-0.08991936...
9 7.621724e-07 X0000015E MULTIPOLYGON (((-0.09147971...
10 1.786263e-05 X0000016E MULTIPOLYGON (((-0.08810299...
The next step is to aggregate the 2021 Census OAs to the pseudo-OA geography.
aggregate(map |> select(OAXXCD), by = list(map$OAXXCD), FUN = length) |>
rename(OAXXCD = 1, nOA21 = 2) -> map
map
Simple feature collection with 174363 features and 2 fields
Attribute-geometry relationship: 0 constant, 1 aggregate, 1 identity
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -6.418667 ymin: 49.86479 xmax: 1.763706 ymax: 55.81112
Geodetic CRS: WGS 84
First 10 features:
OAXXCD nOA21 geometry
1 X0000001E 1 POLYGON ((-0.09447861 51.51...
2 X0000002E 1 POLYGON ((-0.09725971 51.52...
3 X0000003E 1 POLYGON ((-0.09576465 51.52...
4 X0000004E 2 POLYGON ((-0.09505768 51.51...
5 X0000005E 1 POLYGON ((-0.09627474 51.51...
6 X0000006E 1 POLYGON ((-0.09600549 51.51...
7 X0000007E 1 POLYGON ((-0.08020353 51.52...
8 X0000008E 1 POLYGON ((-0.09688753 51.52...
9 X0000009E 2 POLYGON ((-0.0940139 51.521...
10 X0000010E 1 POLYGON ((-0.09635707 51.52...
Then to append the higher-level geographies.
|>
map left_join(geography, by = "OAXXCD") -> map
And save the file:
|>
map st_write(dsn = "OAXXCD_Boundaries_Generalised_Clipped_EW.geojson")
Writing layer `OAXXCD_Boundaries_Generalised_Clipped_EW' to data source
`OAXXCD_Boundaries_Generalised_Clipped_EW.geojson' using driver `GeoJSON'
Writing 174363 features with 9 fields and geometry type Unknown (any).
|>
map st_write(dsn = "OAXXCD_Boundaries_Generalised_Clipped_EW.shp")
Writing layer `OAXXCD_Boundaries_Generalised_Clipped_EW' to data source
`OAXXCD_Boundaries_Generalised_Clipped_EW.shp' using driver `ESRI Shapefile'
Writing 174363 features with 9 fields and geometry type Unknown (any).
Warning in CPL_write_ogr(obj, dsn, layer, driver,
as.character(dataset_options), : GDAL Message 1: One or several characters
couldn't be converted correctly from UTF-8 to ISO-8859-1. This warning will not
be emitted anymore.
Here are the consistent pseudo-OAs for London.
require(ggplot2)
|>
map filter(RGN21NM == "London") |>
ggplot() +
geom_sf()
Creating census counts for the pseudo-OAs
The following file is sourced from 2001 Census data and gives counts of the various ethnic groups per 2001 Census Output Area.
read_csv("https://www.dropbox.com/s/xd256vgc98qvtct/ethnicity2001.csv?dl=1")
# A tibble: 175,434 × 15
OA01CD Persons.01 WBRI.01 WIRI.01 WOTH.01 INDN.01 PKST.01 BANG.01 AOTH.01
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 E00000001 215 170 4 28 10 0 0 0
2 E00000002 101 70 3 19 3 0 0 0
3 E00000003 207 166 0 23 8 0 3 0
4 E00000004 278 227 6 29 4 0 0 0
5 E00000005 234 175 9 24 16 0 0 0
6 E00000006 221 184 8 20 3 0 0 3
7 E00000007 168 91 9 42 3 0 0 0
8 E00000008 169 111 3 30 8 3 0 0
9 E00000009 104 67 9 22 0 3 0 0
10 E00000010 134 106 4 9 0 0 0 3
# … with 175,424 more rows, and 6 more variables: BCAR.01 <dbl>, BAFR.01 <dbl>,
# BOTH.01 <dbl>, CHNE.01 <dbl>, OTHR.01 <dbl>, MIXD.01 <dbl>
Those counts can be tallied up into the pseudo-OAs:
read_csv("https://www.dropbox.com/s/xd256vgc98qvtct/ethnicity2001.csv?dl=1") |>
left_join(read_csv("OA01CD_to_OAXXCD.csv") |> select(OA01CD, OAXXCD), by = "OA01CD") |>
group_by(OAXXCD) |>
summarise(across(where(is.numeric), sum)) -> census.01
.01 census
# A tibble: 174,363 × 15
OAXXCD Persons.01 WBRI.01 WIRI.01 WOTH.01 INDN.01 PKST.01 BANG.01 AOTH.01
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 X0000001E 215 170 4 28 10 0 0 0
2 X0000002E 270 181 6 49 11 3 0 0
3 X0000003E 207 166 0 23 8 0 3 0
4 X0000004E 499 411 14 49 7 0 0 3
5 X0000005E 234 175 9 24 16 0 0 0
6 X0000006E 168 91 9 42 3 0 0 0
7 X0000007E 213 143 16 39 0 3 0 3
8 X0000008E 134 106 4 9 0 0 0 3
9 X0000009E 255 161 14 31 12 3 0 0
10 X0000010E 393 278 25 43 11 0 3 3
# … with 174,353 more rows, and 6 more variables: BCAR.01 <dbl>, BAFR.01 <dbl>,
# BOTH.01 <dbl>, CHNE.01 <dbl>, OTHR.01 <dbl>, MIXD.01 <dbl>
The same process can be applied to the 2021 Census counts.
read_csv("https://www.dropbox.com/s/5b4nm5a6wrtwjcv/ethnicity2021.csv?dl=1") |>
left_join(read_csv("OA21CD_to_OAXXCD.csv") |> select(OA21CD, OAXXCD), by = "OA21CD") |>
group_by(OAXXCD) |>
summarise(across(where(is.numeric), sum)) -> census.21
.21 census
# A tibble: 174,363 × 15
OAXXCD Persons.21 WBRI.21 WIRI.21 WOTH.21 INDN.21 PKST.21 BANG.21 AOTH.21
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 X0000001E 176 112 0 43 0 1 0 0
2 X0000002E 279 83 9 90 22 0 4 7
3 X0000003E 255 174 0 35 11 0 0 7
4 X0000004E 507 326 18 101 11 1 0 11
5 X0000005E 112 65 0 20 4 0 0 5
6 X0000006E 144 62 4 40 1 0 0 7
7 X0000007E 348 126 5 109 40 7 1 4
8 X0000008E 175 83 0 37 3 0 1 3
9 X0000009E 503 297 13 115 13 0 0 8
10 X0000010E 255 128 6 54 3 1 2 9
# … with 174,353 more rows, and 6 more variables: BCAR.21 <dbl>, BAFR.21 <dbl>,
# BOTH.21 <dbl>, CHNE.21 <dbl>, OTHR.21 <dbl>, MIXD.21 <dbl>
Because the geography is consistent across the censuses, it is possible to calculate change. For example, to calculate and to map the percentage change in the White British population in London’s neighbourhoods from 2001 to 2011:
|>
map filter(RGN21NM == "London") |>
inner_join(census.01 |>
select(OAXXCD, WBRI.01),
by = "OAXXCD") |>
inner_join(census.21 |>
select(OAXXCD, WBRI.21),
by = "OAXXCD") |>
mutate(PCTCHG = (WBRI.21 - WBRI.01) / WBRI.01 * 100) |>
mutate(PCTCHG = sign(PCTCHG) * sqrt(abs(PCTCHG))) |>
ggplot() +
geom_sf(aes(fill = PCTCHG), col = "transparent") +
scale_fill_gradient2("Sq root of PCTCHG", low = "dark blue",
mid = "white",
high = "dark red", midpoint = 0) +
ggtitle("Percentage change in White British population, 2001-11") +
theme_dark()