<- c('sf', 'tidyverse', 'tmap', 'httr', 'jsonlite', 'rvest',
packages 'sp', 'ggpubr', 'corrplot', 'broom', 'olsrr', 'spdep',
'GWmodel', 'devtools', 'rgeos', 'lwgeom', 'maptools', 'rsample', 'Metrics', 'SpatialML')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p, repos = "http://cran.us.r-project.org")
}library(p, character.only = T)
}
Take Home Exercise 3: Predicting HDB Public Housing Resale Pricies using Geographically Weighted Methods
Background Information
In every country, housing is a crucial part of household wealth. Whether for a couple or an individual, owning a home has always been a dream or aspiration. Several things influence housing costs. Some of them have a worldwide scope, such the overall health of a nation’s economy or the level of inflation. Some may focus more on the properties themselves. You can further divide these characteristics into structural and geographic ones. Structural considerations are aspects of the property itself, such as its size, configuration, and tenure. Locational factors are elements that relate to the area around the properties, such as closeness to a daycare center, a public transportation stop, and a shopping center.
Standard predictive models for house resale prices were created utilizing the Ordinary Least Square (OLS) approach. The spatial autocorrelation and spatial variability present in geographic data sets like housing transactions, however, were not taken into account by this method. The OLS estimate of predictive housing resale pricing models may produce inaccurate, inconsistent, or ineffective findings due to the presence of spatial autocorrelation (Anselin 1998). With this restriction, Geographic Weighted Models were developed to calibrate house resale price prediction models.
The Objective
FOr this exercise, I am required to forecast HDB resale prices in Singapore for the months of January and February 2023 at the sub-market level (i.e. HDB 3-room, HDB 4-room, and HDB 5-room). The traditional OLS method and GWR approaches must be used to build the predictive models. I must also contrast the performance of the geographically weighted approaches and the traditional OLS method.
1. Datasets
Aspatial dataset:
- HDB Resale data: a list of HDB resale transacted prices in Singapore from Jan 2017 onwards. It is in csv format which can be downloaded from Data.gov.sg.
Geospatial dataset:
- MP14_SUBZONE_WEB_PL: a polygon feature data providing information of URA 2014 Master Plan Planning Subzone boundary data. It is in ESRI shapefile format. This data set was also downloaded from Data.gov.sg
Location factors with regards to the geographic coordinates:
Downloaded from Data.gov.sg.
Eldercare data is a list of eldercare in Singapore. It is in shapefile format.
Hawker Centre data is a list of hawker centres in Singapore. It is in geojson format.
Parks data is a list of parks in Singapore. It is in geojson format.
Supermarket data is a list of supermarkets in Singapore. It is in geojson format.
CHAS clinics data is a list of CHAS clinics in Singapore. It is in geojson format.
Childcare service data is a list of childcare services in Singapore. It is in geojson format.
Kindergartens data is a list of kindergartens in Singapore. It is in geojson format.
Downloaded from Datamall.lta.gov.sg.
MRT data is a list of MRT/LRT stations in Singapore with the station names and codes. It is in shapefile format.
Bus stops data is a list of bus stops in Singapore. It is in shapefile format.
Location factors without geographic coordinates:
Downloaded from Data.gov.sg.
- Primary school data is extracted from the list on General information of schools from data.gov portal. It is in csv format.
Retrieved/Scraped from other sources
CBD coordinates obtained from Google.
Shopping malls data is a list of Shopping malls in Singapore obtained from Wikipedia.
Good primary schools is a list of primary schools that are ordered in ranking in terms of popularity and this can be found at Local Salary Forum.
2. Loading the R packages
The following code chunk will perform the following task. A list called packages will be created and will consists of all the R packages required to accomplish this exercise. There will be a check to see if R packages on package have been installed in R and if not, they will be installed. After which when all the R packages have been installed, the packages will then be loaded
library(xaringanExtra)
More on the packages used:
sf: used for importing, managing, and processing geospatial data
tidyverse: used for importing, wrangling and visualising data. It consists of a family of R packages, such as:
readr for importing csv data,
readxl for importing Excel worksheet,
tidyr for manipulating data,
dplyr for transforming data, and
ggplot2 for visualising data
tmap: provides functions for plotting cartographic quality static inpoint patterns maps or interactive maps by using leaflet API.
httr: Useful tools for working with HTTP organised by HTTP verbs (GET(), POST(), etc). Configuration functions make it easy to control additional request components (authenticate(), add_headers() and so on).
- In this analysis, it will be used to send GET requests to OneMapAPI SG to retrieve the coordinates of addresses.
jsonlite: A simple and robust JSON parser and generator for R. It offers simple, flexible tools for working with JSON in R, and is particularly powerful for building pipelines and interacting with a web API.
rvest: A new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup.
- In this analysis, it will be used to scrape data for shopping malls and good primary schools
sp: provides classes and methods for dealing with spatial data in R.
ggpubr: provides some easy-to-use functions for creating and customizing ggplot2 based publication ready plots
- In this analysis, it will be used to arrange multiple ggplots.
corrplot: For Multivariate data visualisation and analysis
broom: Takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibble.
- In this analysis, functions like tidy and glance will be used to construct a tibble / summmary of the model which is easier to look at.
oslrr: Used to build OLD and performing diagnostic tests.
spdep: For spatial dependence statistics.
GWmodel: Calibrate geographical weighted family of modes.
devtools: used for installing any R packages which is not available in RCRAN. In this exercise, I will be installing using devtools to install the package xaringanExtra which is still under development stage.
xaringanExtra: is an enhancement of xaringan package. As it is still under development stage, we can still install the current version using install_github function of devtools. This package will be used to add Panelsets to contain both the r code chunk and results whereever applicable.
3. Importing of Aspatial Data & Wrangling
read_csv() function of readr package will be used to import resale-flat-prices into R as a tibble data frame called resale. glimpse() function of dplyr package is used to display the data structure
<- read_csv("data/aspatial/resale-flat-prices.csv") resale
glimpse(resale)
Rows: 148,373
Columns: 11
$ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…
When we load in the dataset for the first time, we can see that:
The dataset contains 11 columns with 148,373 rows. The columns that are present in the data are: month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price.
For this take home exercise 3, we are allowed the option to choose to perform our analysis between either 3, 4 or 5 room flat transactions. Therefore, I will be selecting the 3 room flat transactions during the transaction period from 1st January 2021 to 31st December 2022. Test data should be included for January and February 2023 resale prices.
3.1 Filtering HDB Resale Data
filter() function of dplyr package will be used to select the desired flat_type and dates which will be stored in rs_subset.
<- filter(resale,flat_type == "3 ROOM") %>%
rs_subset filter(month >= "2021-01" & month <= "2023-02")
unique() function of R package will be used to check whether flat_type and month have been extracted successfully.
unique(rs_subset$month)
[1] "2021-01" "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07"
[8] "2021-08" "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"
[15] "2022-03" "2022-04" "2022-05" "2022-06" "2022-07" "2022-08" "2022-09"
[22] "2022-10" "2022-11" "2022-12" "2023-01" "2023-02"
unique(rs_subset$flat_type)
[1] "3 ROOM"
glimpse() function will be used to view the overall resale transactions available for 3 room flat in Singapore.
glimpse(rs_subset)
Rows: 13,780
Columns: 11
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "331", "534", "561", "170", "463", "542", "170", "…
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO K…
$ storey_range <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 73…
$ flat_model <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1980, 1980, 1986, 1980, 1981, 1986, 1976, 19…
$ remaining_lease <chr> "59 years", "58 years 02 months", "58 years 01 mon…
$ resale_price <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 27…
From the glimpse() function result, we can see that from Jan 2021 to December 2022, there are 23,656 transactions for 3 room flat in Singapore.
3.2 Transforming HDB Resale Data Columns
Here, mutate function of dplyr package will be used to create columns such as:
address
: concatenation of theblock
andstreet_name
columns using paste() function of base R package.remaining_lease_yr
&remaining_lease_mth
: split the year and months part of theremaining_lease
respectively using str_sub() function of stringr package then converting the character to integer using as.integer() function of base R package. After performing mutate function, we will store the new data inrs_transform
<- rs_subset %>%
rs_transform mutate(rs_subset, address = paste(block,street_name)) %>%
mutate(rs_subset, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2)))%>%
mutate(rs_subset, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))
head(rs_transform)
# A tibble: 6 × 14
month town flat_type block street_name storey_range floor_area_sqm flat_model
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 2021… ANG … 3 ROOM 331 ANG MO KIO… 04 TO 06 68 New Gener…
2 2021… ANG … 3 ROOM 534 ANG MO KIO… 04 TO 06 68 New Gener…
3 2021… ANG … 3 ROOM 561 ANG MO KIO… 01 TO 03 68 New Gener…
4 2021… ANG … 3 ROOM 170 ANG MO KIO… 07 TO 09 60 Improved
5 2021… ANG … 3 ROOM 463 ANG MO KIO… 04 TO 06 68 New Gener…
6 2021… ANG … 3 ROOM 542 ANG MO KIO… 04 TO 06 68 New Gener…
# ℹ 6 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
# resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
# remaining_lease_mth <int>
3.2.1 Sum remaining lease in months
There are some NA values in remaining lease months with value of 0. We need to multiply the remaining_lease_yr by 12 to convert into months. The remaining_lease_mths column will be created using mutate function of dplyr package which contains the summation of the remaining_lease_yr and remaining_lease_mths using rowSums() of R package.
$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform<- rs_transform %>%
rs_transform mutate(rs_transform, remaining_lease_mths = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mth")])) %>%
select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model,
lease_commence_date, remaining_lease_mths, resale_price)
head(rs_transform)
# A tibble: 6 × 12
month town address block street_name flat_type storey_range floor_area_sqm
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2021-01 ANG M… 331 AN… 331 ANG MO KIO… 3 ROOM 04 TO 06 68
2 2021-01 ANG M… 534 AN… 534 ANG MO KIO… 3 ROOM 04 TO 06 68
3 2021-01 ANG M… 561 AN… 561 ANG MO KIO… 3 ROOM 01 TO 03 68
4 2021-01 ANG M… 170 AN… 170 ANG MO KIO… 3 ROOM 07 TO 09 60
5 2021-01 ANG M… 463 AN… 463 ANG MO KIO… 3 ROOM 04 TO 06 68
6 2021-01 ANG M… 542 AN… 542 ANG MO KIO… 3 ROOM 04 TO 06 68
# ℹ 4 more variables: flat_model <chr>, lease_commence_date <dbl>,
# remaining_lease_mths <dbl>, resale_price <dbl>
3.3 Retrieval of Addresses & its Coordinates
Data such as postal codes and coordinates of the addresses is required to get the proximity to various location factors later on.
3.3.1 Creaion of unique list of addresses
Unique addresses will be created and stored in add_list. unique() function of base R package will be used to extract the unique addresses and sort() function of R package to sort the unique vector.
<- sort(unique(rs_transform$address)) add_list
3.3.2 Creation of function to retrieve Coordinates from OneMapSG API
A dataframe postal_coords will be created to store all final retrieved coordinates. The GET() function of httr package will be used to perform a GET request to https://developers.onemap.sg/commonapi/search. There are a few search arguments variables and information we have to take note of
searchVal: Unique keywords that user will enter to filter results
returnGeom {Y/N}: Yes or No to check if user want to return the geometry
getAddrDetails {Y/N}: Yes or No to check if user want to return address details for a point
Return JSON response will contain many fields but we are only interested in postal code and coordinates like Longitude and Latitude. A new dataframe new_row will be created and is used to store each final set of coordinates retrieved. There is also the need to check the number of responses because some searched location have 0, some only have 1 result and others have many. Finally, the JSON result will be appended to the dataframe postal_coords using rbind() function of R.
<- function(add_list){
get_coords
# Create a data frame to store all retrieved coordinates
<- data.frame()
postal_coords
for (i in add_list){
#print(i)
<- GET('https://developers.onemap.sg/commonapi/search?',
r query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
<- fromJSON(rawToChar(r$content))
data <- data$found
found <- data$results
res
# Create a new data frame for each address
<- data.frame()
new_row
# If single result, append
if (found == 1){
<- res$POSTAL
postal <- res$LATITUDE
lat <- res$LONGITUDE
lng <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
new_row
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
<- res[res$POSTAL != "NIL", ]
res_sub
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
<- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
new_row
}
else{
<- head(res_sub, n = 1)
top1 <- top1$POSTAL
postal <- top1$LATITUDE
lat <- top1$LONGITUDE
lng <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
new_row
}
}
else {
<- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
new_row
}
# Add the row
<- rbind(postal_coords, new_row)
postal_coords
}return(postal_coords)
}
3.3.3 Retrieve resale coordinates using the get_coords function
<- get_coords(add_list) coords
3.3.4 Check results
Checking if columns contain any empty values
is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ] coords[(
[1] address postal latitude longitude
<0 rows> (or 0-length row.names)
The outcome reveals that 215 CHOA CHU KANG CTRL has no postal code but does have available coordinates. The postal code to enter when searching on gothere.sg is 680215.
3.3.5 Combination of HDB resale and coordinate data
After getting the coordinates, we must merge them with the HDB resale dataset using the dplyr package’s left join() method. The information will be kept in rs coords.
<- left_join(rs_transform, coords, by = c('address' = 'address')) rs_coords
head(rs_coords)
# A tibble: 6 × 15
month town address block street_name flat_type storey_range floor_area_sqm
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2021-01 ANG M… 331 AN… 331 ANG MO KIO… 3 ROOM 04 TO 06 68
2 2021-01 ANG M… 534 AN… 534 ANG MO KIO… 3 ROOM 04 TO 06 68
3 2021-01 ANG M… 561 AN… 561 ANG MO KIO… 3 ROOM 01 TO 03 68
4 2021-01 ANG M… 170 AN… 170 ANG MO KIO… 3 ROOM 07 TO 09 60
5 2021-01 ANG M… 463 AN… 463 ANG MO KIO… 3 ROOM 04 TO 06 68
6 2021-01 ANG M… 542 AN… 542 ANG MO KIO… 3 ROOM 04 TO 06 68
# ℹ 7 more variables: flat_model <chr>, lease_commence_date <dbl>,
# remaining_lease_mths <dbl>, resale_price <dbl>, postal <chr>,
# latitude <chr>, longitude <chr>
3.4 Writing file to rds
<- write_rds(rs_coords, "Data/rds/rs_coords.rds") rs_coords_rds
3.5 Reading the rs_coords_rds file
<- read_rds("Data/rds/rs_coords.rds")
rs_coords glimpse(rs_coords)
Rows: 13,780
Columns: 15
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ address <chr> "331 ANG MO KIO AVE 1", "534 ANG MO KIO AVE 10", …
$ block <chr> "331", "534", "561", "170", "463", "542", "170", …
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO …
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM",…
$ storey_range <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "…
$ floor_area_sqm <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 7…
$ flat_model <chr> "New Generation", "New Generation", "New Generati…
$ lease_commence_date <dbl> 1981, 1980, 1980, 1986, 1980, 1981, 1986, 1976, 1…
$ remaining_lease_mths <dbl> 708, 698, 697, 770, 698, 709, 768, 652, 682, 664,…
$ resale_price <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 2…
$ postal <chr> "560331", "560534", "560561", "560170", "560463",…
$ latitude <chr> "1.36211140145298", "1.37405846295585", "1.370577…
$ longitude <chr> "103.85076647513", "103.854168170426", "103.85785…
3.5.1 Transform and Assign CRS
The Latitude & Longitude are in decimal, therefore the projected CRS will be WGS84. We will need to assign the CRS of 4326 first before transforming it to 3414 which is the EPSG code for Singapore SVY21
<- st_as_sf(rs_coords,
rs_coords_sf coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)
st_crs(rs_coords_sf)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
3.5.2 Plotting HDB resale points
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="blue", size = 0.02)