Energy Consumption Analysis Of Commercial Buildings Using GAM

Author

Yuchen Xue

Published

June 30, 2018

Intro

This is the final assignment of the course “Regression Analysis” at National Taiwan University of Science and Technology (NTUST). The purpose of this final assignment is to test the students’ knowledge of the R language and their ability to analyze a dataset using a linear model.

In this assignment, I selected a modified version of the open source dataset published by the energy supply company EnerNOC, which contains the 30-minute energy consumption data for 100 commercial/industrial sites for the year 2012. I split the data into training and testing subset, built several linear models using Generalized Additive Model (GAM) with different conditions using the training set, predicted the energy consumption and examined the prediction result using the test data. The model could successfully generalize the change in energy consumption and achieved a MAPE value of 14.89.

1. Data Preparation

1.1 Introduction to the Dataset

EnerNOC GreenButton Data is a subset of the Open EnerNOC data repository. The raw data was provided by the EnerNOC electricity supplier and contains anonymous 5-minute electricity consumption data of 100 commercial/industrial sites for the year 2012. The simplified version contains data at 30-minute intervals.

The explanatory variables are:

value: Electricity consumption by timestamp
week: In which week was the data collected
date: On which day was the data collected
type: Type of the building

1.2 Overview of the Dataset

Load Necessary Packages.

library(httr)
library(feather)
library(data.table)
library(mgcv)

Loading required package: nlme

This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

library(car)

Loading required package: carData

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

Read the Data and show the overview of the data.

# Download the data as a tempfile and loaded it locally
url = "https://raw.githubusercontent.com/yuchen-xue/Learn-R-Quarto/main/content/data/DT_4_ind"
GET(url, write_disk(tf <- tempfile(fileext = "")))

Response [https://raw.githubusercontent.com/yuchen-xue/Learn-R-Quarto/main/content/data/DT_4_ind]
  Date: 2025-03-18 21:58
  Status: 200
  Content-Type: application/octet-stream
  Size: 3.58 MB
<ON DISK>  /tmp/RtmpxGXCb9/file19532474ade9

DT <- as.data.table(read_feather(tf))

# Remove the tempfile
unlink(tf)

str(DT)

Classes 'data.table' and 'data.frame':  70080 obs. of  5 variables:
 $ date_time: POSIXct, format: "2012-01-02 00:00:00" "2012-01-02 00:30:00" ...
 $ value    : num  1590 1564 1560 1585 1604 ...
 $ week     : chr  "Monday" "Monday" "Monday" "Monday" ...
 $ date     : Date, format: "2012-01-02" "2012-01-02" ...
 $ type     : chr  "Commercial Property" "Commercial Property" "Commercial Property" "Commercial Property" ...
 - attr(*, ".internal.selfref")=<externalptr>

Plot the Data.

ggplot(data = DT, aes(x = date, y = value)) +
  geom_line() + 
  facet_grid(type ~ ., scales = "free_y") +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        strip.text = element_text(size = 9, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")

We can see that the electricity consumption of the category Food Sales & Storage is not affected by weekdays or weekends.

1.3 Processing of Information about Days and Weeks

We use the car::record() function to easily describe the relationship between the electricity consumption and each day of the week. We do this by appending a new column that associates each day of the week with a unique number.

DT[, week_num := as.integer(car::recode(week,
    "'Monday'='1';'Tuesday'='2';'Wednesday'='3';'Thursday'='4';
    'Friday'='5';'Saturday'='6';'Sunday'='7'"))]
unique(DT[, week])

[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"

unique(DT[, week_num])

[1] 1 2 3 4 5 6 7

We extract information related to “industry”, “data”, “week” and “period” from the dataset. Since the data was collected every half an hour, there’re 48 consecutive observations within a day, thus we have period <- 48.

n_type <- unique(DT[, type])
n_date <- unique(DT[, date])
n_weekdays <- unique(DT[, week])
period <- 48

We select the electricity consumption of a commercial building, store it as the variable data_r and plot the data.

type == n_type[1] stands for “Commercial Property”, whereas date %in% n_date[57:70] corresponds to two weeks.

data_r <- DT[(type == n_type[1] & date %in% n_date[57:70])]

ggplot(data_r, aes(date_time, value)) +
  geom_line() +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")

We re-organize the data in accordance with the change of days and weeks.

N <- nrow(data_r) # number of rows in the training set
window <- N / period # number of days in the training set
matrix_gam <- data.table(Load = data_r[, value],
                         Daily = rep(1:period, window),
                         Weekly = data_r[, week_num])
head(matrix_gam)

       Load Daily Weekly
      <num> <int>  <int>
1: 1630.875     1      1
2: 1611.201     2      1
3: 1657.160     3      1
4: 1653.042     4      1
5: 1702.316     5      1
6: 1648.411     6      1

2. Model Building

2.1 The First Model

We use the mgcv:gam() function to build the GAM model. The periodic change in days is described by a “cubic regression spline”, whereas the periodic change in weeks is described by “P-splines”.

gam_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
               s(Weekly, bs = "ps", k = 7),
             data = matrix_gam,
             family = gaussian)

Inspect the summary of the model.

summary(gam_1)$r.sq

[1] 0.7718406

summary(gam_1)$sp.criterion

  GCV.Cp 
245544.9

GCV is an indicator of the fit of the model. The lower this value is, the fitter the model is. In addition we can see that R-sq is not high, which indicates the bad performance of the model.

Compare the difference between the prediction and the reality over those two weeks.

matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:nrow(matrix_gam),], aes(1:nrow(matrix_gam)))+
           labs("lab")+
           geom_line(aes(y=Load, color="Real"), size = 0.8)+
           geom_line(aes(y = Predict, color = "Predict"), size = 0.8)

It doesn’t look nice. We carefully inspect the electricity consumption during the first week.

row_Mon <- nrow(matrix_gam)/2
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:row_Mon,], aes(1:row_Mon))+
           labs("lab")+
           geom_line(aes(y=Load, color="Real"), size = 0.8)+
           geom_line(aes(y = Predict, color = "Predict"), size = 0.8)

This model can only predict the trend of the weekdays’ electricity consumption, but fails in predicting the exact amount of electricity consumption. We do a detailed inspection on the electricity consumption on Monday.

row_Mon <- nrow(matrix_gam)/14
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:row_Mon,], aes(1:row_Mon))+
           labs("lab")+
           geom_line(aes(y=Load, color="Real"), size = 0.8)+
           geom_line(aes(y = Predict, color = "Predict"), size = 0.8)

The problem is that the real electricity consumption at the end of the day is not at the same level as it at the beginning of the day, but the model shows a pure periodic change within the day, which is different from the reality. Thus we need to change our mindset and build another model.

2.2 The Second Model

This time we use the method of interaction between different scale and build a model by combining Daily and Weekly.

gam_2 <- gam(Load ~ s(Daily, Weekly),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_2)$r.sq

[1] 0.9352108

summary(gam_2)$sp.criterion

  GCV.Cp 
71162.37

According to R.sq and p-value, we can say that this model performs better.

Plot the difference between predicted result and the real result.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_2$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.2")

We can see that the fit during Monday to Thursday significantly improved.

2.3 The Third Model

Next, we use another advanced method of interaction and use a smooth function called “tensor product”.

gam_3 <- gam(Load ~ te(Daily, Weekly,
                       bs = c("cr", "ps")),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_3)$r.sq

[1] 0.9268452

summary(gam_3)$sp.criterion

 GCV.Cp 
79724.9

Plot gam_3.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_3$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.3")

2.4 The Fourth Model

We can make it even better. For example let the knots (a concept that is similar to dimension) fit the periodic change of days and weeks better.

gam_4 <- gam(Load ~ te(Daily, Weekly,
                        k = c(period, 7),
                        bs = c("cr", "ps")),
              data = matrix_gam,
              family = gaussian)
 
summary(gam_4)$r.sq

[1] 0.9727604

summary(gam_4)$sp.criterion

  GCV.Cp 
34839.46

We can see that R-sq has increased a little bit. But the most significant is edf, which has increased 5 times.

Plot gam_4

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_4$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.4")

2.5 The Fifth Model

All right, what about be greedier and combine all the previous models? Let’s examine our thought by building gam_5.

gam_5 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
                    s(Weekly, bs = "ps", k = 7) +
                    ti(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps")),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_5)$r.sq

[1] 0.9717469

summary(gam_5)$sp.criterion

  GCV.Cp 
35772.35

Although p-value remains \(0\), but R-sq decreased and GCV increased. This means that the performance is not as good as the previous gam_4 model.

Plot gam_5.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_5$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.5")

2.6 The Sixth Model

Now is our last attempt. Here we add another method of tensor product interations and introduce a stricter condition by setting full = TRUE.

gam_6 <- gam(Load ~ t2(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps"),
                       full = TRUE),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_6)$r.sq

[1] 0.9738273

summary(gam_6)$sp.criterion

  GCV.Cp 
32230.68

Plot gam_6.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_6$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.6")

This plot looks even better.

2.7 Comparison of the Models

With so many models, how to decide which one is the best? Just ask the omnipotent AIC.

AIC(gam_1, gam_2, gam_3, gam_4, gam_5, gam_6)

             df       AIC
gam_1  17.46996 10248.993
gam_2  30.70080  9415.768
gam_3  25.65709  9492.545
gam_4 121.41166  8912.611
gam_5 115.80849  8932.746
gam_6 100.12005  8868.628

Apparently gam_4, gam_5, gam_6 are on the leading board. gam_6 has the best performance, while gam_4 comes in second.

Next we plot gam_4, gam_6 together and see what we found.

layout(matrix(1:2, nrow = 1))
plot(gam_4, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.4 with te()")
plot(gam_6, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.6 with t2()")

These contour lines indicate each model’s response on Weekly and Daily. Although they look similar, the contour of gam_6 has more wave-like patterns. This is an indication of its higher sensitivity.

Visualization of the Best Performing Model

Before the end of this section, let’s see what we can do to make the plot of gam_6 looks better. Firstly we use the vis.gam function from the package mgcv.

#　vis.gam(gam_6, main = "t2(D, W)", plot.type = "contour",
#　        color = "terrain", contour.col = "black", lwd = 2)
vis.gam(gam_6, main = "t2(D, W)", 
        color = "terrain", contour.col = "black", lwd = 2)

We can see that the electricity consumption on weekdays are higher than on weekends. The peak hours are arround 3 pm from Monday to Thursday.

Without using the contour.col option, we can make a 3D plot.

vis.gam(gam_6, n.grid = 50, theta = 35, phi = 32, zlab = "",
        ticktype = "detailed", color = "topo", main = "t2(D, W)")

Change the viewing angle

vis.gam(gam_6, n.grid = 50, theta = 190, phi = 20, zlab = "",
        ticktype = "detailed", color = "topo", main = "t2(D, W)")

3. Analysis of the Contribution of Each Explanatory Variable

3.1 Models Building

Now let’s see what would happen if we discard explanatory variables one by one.

gam_6D <- gam(Load ~ t2(Daily, 
                       k = period,
                       bs = "cr",
                       full = TRUE),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_6D)$r.sq

[1] 0.5129567

summary(gam_6D)$sp.criterion

  GCV.Cp 
518169.2

Em, it doesn’t look good.

Now let’s discard variable Daily.

gam_6W <- gam(Load ~ t2(Weekly,
                       k = 7,
                       bs =  "ps",
                       full = TRUE),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_6W)$r.sq

[1] 0.2502016

summary(gam_6W)$sp.criterion

  GCV.Cp 
791454.6

We can see it looks much worse.

Let’s maintain a rigorous attitude and use ANOVA to compare the difference between these three leading models. Firstly we discard variable Weekly and see what will happen

anova(gam_6, gam_6D, test="F")

Analysis of Deviance Table

Model 1: Load ~ t2(Daily, Weekly, k = c(period, 7), bs = c("cr", "ps"), 
    full = TRUE)
Model 2: Load ~ t2(Daily, k = period, bs = "cr", full = TRUE)
  Resid. Df Resid. Dev      Df   Deviance      F    Pr(>F)    
1    550.77   15740821                                        
2    661.13  339050831 -110.37 -323310010 106.61 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Then we discard variable Daily and see what will happen

anova(gam_6, gam_6W, test="F")

Analysis of Deviance Table

Model 1: Load ~ t2(Daily, Weekly, k = c(period, 7), bs = c("cr", "ps"), 
    full = TRUE)
Model 2: Load ~ t2(Weekly, k = 7, bs = "ps", full = TRUE)
  Resid. Df Resid. Dev      Df   Deviance     F    Pr(>F)    
1    550.77   15740821                                       
2    667.88  526095056 -117.11 -510354235 158.6 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is clear – non of the variable Weekly and variable Daily can be dropped!

3.2 Model Plotting

Plot the result of discarding variable Weekly.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_6D$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.6")

We can see that there is no weekly difference in the electricity consumption when the variable Weekly is dropped.

Plot the result of discarding variable Daily.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_6W$fitted.values,
                                   data_time = data_r[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.6")

We can see that there is no difference in the electricity consumption over the 24 hours of a day when the variable Daily is dropped.

4. Prediction on Electricity Consumption

Lastly, the most exciting part – let’s predict the electricity consumption for the next two weeks.

data_test <- DT[(type == n_type[1] & date %in% n_date[71:84])]
matrix_test <- data.table(Load = data_test[, value],
                           Daily = rep(1:period, window),
                           Weekly = data_test[, week_num])
pred_week <- predict(gam_6, matrix_test[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test[, .(value, date_time)],
                        data.table(value = pred_week,
                                   data_time = data_test[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datat[, type := c(rep("Real", nrow(data_test)), rep("Predicted", nrow(data_test)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Predict the electricity consumption of the next month.

data_test <- DT[(type == n_type[1] & date %in% n_date[71:98])]
matrix_test <- data.table(Load = data_test[, value],
                           Daily = rep(1:period, window),
                           Weekly = data_test[, week_num])
pred_week <- predict(gam_6, matrix_test[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test[, .(value, date_time)],
                        data.table(value = pred_week,
                                   data_time = data_test[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datat[, type := c(rep("Real", nrow(data_test)), rep("Predicted", nrow(data_test)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

5 Additional Information

Additional content that was added at the end of the semester.

Function Definition

Define the MAPE function.

mape <- function(real, pred){
  return(100 * mean(abs((real - pred)/real)))
}

Define the criteria for model evaluation (R-sq, GCV, MAPE).

gam_eval <- function(model){
    return(data.table(RSQ=summary(model)$r.sq, 
                      GCV=summary(model)$sp.criterion, 
                      MAPE=mape(data_new[, value], model$fitted.values)))
}

Define the function for plotting models.

gam_plot <- function(model, title){
    datas <- rbindlist(list(data_new[, .(value, date_time)],
                        data.table(value = model$fitted.values,
                                   data_time = data_new[, date_time])))
datas[, type := c(rep("Real", nrow(data_new)), rep("Fitted", nrow(data_new)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = title)
}

Model Building

Use the first season’s data to train a model.

data_new <- DT[(type == n_type[1] & date %in% n_date[1:91])]

ggplot(data_new, aes(date_time, value)) +
  geom_line() +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")

Re-organize the data and add information about the electricity consumption of the previous one day and the previous week.

matrix_new <- data.table(Load = data_new[, value],
                         PrevDayLoad = c(data_new[1:48, value], data_new[1:4320, value]),
                         PrevWeekLoad = c(data_new[1:336, value], data_new[1:4032, value]),
                         Daily = rep(1:period, window),
                         Weekly = data_r[, week_num])

Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
check.names, : Item 4 has 672 rows but longest item has 4368; recycled with
remainder.

Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
check.names, : Item 5 has 672 rows but longest item has 4368; recycled with
remainder.

Build the first model.

gam_new_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
               s(Weekly, bs = "ps", k = 7),
             data = matrix_new,
             family = gaussian)

Evaluate the first model.

eval_1 <- gam_eval(gam_new_1)
eval_1

         RSQ      GCV     MAPE
       <num>    <num>    <num>
1: 0.7194149 311287.5 18.89976

Plot the first model.

gam_plot(gam_new_1, "Fit from gam_new_1")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the second model.

gam_new_2 <- gam(Load ~ s(Daily, Weekly),
             data = matrix_new,
             family = gaussian)

Evaluate the second model.

eval_2 <- gam_eval(gam_new_2)
eval_2

         RSQ      GCV     MAPE
       <num>    <num>    <num>
1: 0.8487935 168039.3 10.62365

Plot the second model.

gam_plot(gam_new_2, "Fit from gam_new_2")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the third model.

gam_new_3 <- gam(Load ~ te(Daily, Weekly,
                       bs = c("cr", "ps")),
             data = matrix_new,
             family = gaussian)

Evaluate the third model.

eval_3 <- gam_eval(gam_new_3)
eval_3

         RSQ      GCV     MAPE
       <num>    <num>    <num>
1: 0.8423209 175028.3 10.18985

Plot the third model.

gam_plot(gam_new_3, "Fit from gam_new_3")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the fourth model.

gam_new_4 <- gam(Load ~ te(Daily, Weekly,
                        k = c(period, 7),
                        bs = c("cr", "ps")),
              data = matrix_new,
              family = gaussian)

Evaluate the fourth model.

eval_4 <- gam_eval(gam_new_4)
eval_4

         RSQ      GCV     MAPE
       <num>    <num>    <num>
1: 0.8796766 135799.2 8.319575

Plot the fourth model.

gam_plot(gam_new_4, "Fit from gam_new_4")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the fifth model.

gam_new_5 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
                    s(Weekly, bs = "ps", k = 7) +
                    ti(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps")),
             data = matrix_new,
             family = gaussian)

Evaluate the fifth model.

eval_5 <- gam_eval(gam_new_5)
eval_5

         RSQ    GCV     MAPE
       <num>  <num>    <num>
1: 0.8801601 135306 8.337419

Plot the fifth model.

gam_plot(gam_new_5, "Fit from gam_new_5")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the sixth model.

gam_new_6 <- gam(Load ~ t2(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps"),
                       full = TRUE),
             data = matrix_new,
             family = gaussian)

Evaluate the sixth model.

eval_6 <- gam_eval(gam_new_6)
eval_6

         RSQ      GCV    MAPE
       <num>    <num>   <num>
1: 0.8802681 134768.2 8.32655

Plot the sixth model.

gam_plot(gam_new_6, "Fit from gam_new_6")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the seventh model – without using smooth function of GAM.

gam_new_simple <- gam(Load ~ Daily+Weekly,
              data = matrix_new,
              family = gaussian)

Evaluate the seventh model.

eval_7 <- gam_eval(gam_new_simple)
eval_7

         RSQ      GCV     MAPE
       <num>    <num>    <num>
1: 0.4302044 629325.7 25.61359

Plot the seventh model.

gam_plot(gam_new_simple, "Fit from gam_new_simple")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the eighth model – only apply te smooth function on the variable Daily.

gam_new_te_daily <- gam(Load ~ te(Daily, bs = "cr", k = period) +Weekly,
             data = matrix_new,
             family = gaussian)

Evaluate the eighth model.

eval_8 <- gam_eval(gam_new_te_daily)
eval_8

         RSQ      GCV     MAPE
       <num>    <num>    <num>
1: 0.6366094 402558.4 20.44199

Plot the eighth model.

gam_plot(gam_new_te_daily, "Fit from gam_new_te_daily")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the nineth model – only apply te smooth function on the variable Weekly.

gam_new_te_weekly <- gam(Load ~ te(Weekly, bs = "ps", k = 7) +Daily,
             data = matrix_new,
             family = gaussian)

Evaluate the nineth model.

eval_9 <- gam_eval(gam_new_te_weekly)
eval_9

         RSQ    GCV   MAPE
       <num>  <num>  <num>
1: 0.5124041 539142 24.279

Plot the nineth model.

gam_plot(gam_new_te_weekly, "Fit from gam_new_te_weekly")

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Analysis of the models

Use a table to analyse the nine models.

eval_table <- bind_rows(eval_1, eval_2, eval_3, eval_4, eval_5, eval_6, eval_7, eval_8, eval_9)
all_aic <- AIC(gam_new_1, gam_new_2, gam_new_3, gam_new_4, gam_new_5, gam_new_6, gam_new_simple, gam_new_te_daily, gam_new_te_weekly)$AIC

eval_table[, AIC :=all_aic]
eval_table

         RSQ      GCV      MAPE      AIC
       <num>    <num>     <num>    <num>
1: 0.7194149 311287.5 18.899761 67646.26
2: 0.8487935 168039.3 10.623654 64953.21
3: 0.8423209 175028.3 10.189855 65131.27
4: 0.8796766 135799.2  8.319575 64020.79
5: 0.8801601 135306.0  8.337419 64004.82
6: 0.8802681 134768.2  8.326550 63987.99
7: 0.4302044 629325.7 25.613586 70721.15
8: 0.6366094 402558.4 20.441993 68769.43
9: 0.5124041 539142.0 24.278996 70045.54

Predict the electricity consumption of season 2 using the fourth model.

data_test_2qt <- DT[(type == n_type[1] & date %in% n_date[92:183])]
matrix_test_2qt <- data.table(Load = data_test_2qt[, value],
                           Daily = rep(1:period, 91),
                           Weekly = data_test_2qt[, week_num])
pred_2qt <- predict(gam_new_4, matrix_test_2qt[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test_2qt[, .(value, date_time)],
                        data.table(value = pred_2qt,
                                   data_time = data_test_2qt[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datat[, type := c(rep("Real", nrow(data_test_2qt)), rep("Predicted", nrow(data_test_2qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Predict the electricity consumption of season 3 using the fourth model.

data_test_3qt <- DT[(type == n_type[1] & date %in% n_date[183:274])]
matrix_test_3qt <- data.table(Load = data_test_3qt[, value],
                           Daily = rep(1:period, 91),
                           Weekly = data_test_3qt[, week_num])
pred_3qt <- predict(gam_new_4, matrix_test_3qt[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test_3qt[, .(value, date_time)],
                        data.table(value = pred_3qt,
                                   data_time = data_test_3qt[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datat[, type := c(rep("Real", nrow(data_test_3qt)), rep("Predicted", nrow(data_test_3qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Predict the electricity consumption of season 4 using the fourth model.

data_test_4qt <- DT[(type == n_type[1] & date %in% n_date[274:365])]
matrix_test_4qt <- data.table(Load = data_test_4qt[, value],
                           Daily = rep(1:period, 91),
                           Weekly = data_test_4qt[, week_num])
pred_4qt <- predict(gam_new_4, matrix_test_4qt[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test_4qt[, .(value, date_time)],
                        data.table(value = pred_4qt,
                                   data_time = data_test_4qt[, date_time])))

Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

datat[, type := c(rep("Real", nrow(data_test_4qt)), rep("Predicted", nrow(data_test_4qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Compute the MAPE of the predictions.

mape_2qt <- mape(matrix_test_2qt[1:(7*period)]$Load, pred_2qt)
mape_3qt <- mape(matrix_test_3qt[1:(7*period)]$Load, pred_3qt)
mape_4qt <- mape(matrix_test_4qt[1:(7*period)]$Load, pred_4qt)
mapes <- cbind(mape_2qt, mape_3qt, mape_4qt)
mapes

     mape_2qt mape_3qt mape_4qt
[1,] 12.01464 14.89136 12.93795