Forecasting hourly electricity consumption for sets of ...927793/FULLTEXT01.pdfintermittent and...

DEGREE PROJECT, IN , SECONDMASTER'S PROGRAMME, ICT INNOVATIONLEVEL

STOCKHOLM, SWEDEN 2015

Forecasting hourly electricityconsumption for sets of householdsusing machine learning algorithms

THOMAS LINTON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

Abstract

To address inefficiency, waste, and the negative consequences of electricity generation, com-panies and government entities are looking to behavioural change among residential con-sumers. To drive behavioural change, consumers need better feedback about their electricityconsumption. A monthly or quarterly bill provides the consumer with almost no useful infor-mation about the relationship between their behaviours and their electricity consumption.Smart meters are now widely dispersed in developed countries and they are capable of pro-viding electricity consumption readings at an hourly resolution, but this data is mostly usedas a basis for billing and not as a tool to assist the consumer in reducing their consump-tion.

One component required to deliver innovative feedback mechanisms is the capability toforecast hourly electricity consumption at the household scale. The work presented by thisthesis is an evaluation of the effectiveness of a selection of kernel based machine learningmethods at forecasting the hourly aggregate electricity consumption for different sized setsof households. The work of this thesis demonstrates that k-Nearest Neighbour Regressionand Gaussian process Regression are the most accurate methods within the constraints ofthe problem considered. In addition to accuracy, the advantages and disadvantages of eachmachine learning method are evaluated, and a simple comparison of each algorithms com-putational performance is made.

Keywordsmachine learning, kernel methods, k-nearest neighbour, kernel ridge regression, gaussianprocesses, support vector regression, electricity, forecasting

I

Abstract (Swedish)

För att ta itu med ineffektivitet, avfall, och de negativa konsekvenserna av elproduktion såvill företag och myndigheter se beteendeförändringar bland hushållskonsumenter. För attskapa beteendeförändringar så behöver konsumenterna bättre återkoppling när det gällerderas elförbrukning. Den nuvarande återkopplingen i en månads- eller kvartalsfaktura gerkonsumenten nästan ingen användbar information om hur deras beteenden relaterar tillderas konsumtion. Smarta mätare finns nu överallt i de utvecklade länderna och de kange en mängd information om bostäders konsumtion, men denna data används främst somunderlag för fakturering och inte som ett verktyg för att hjälpa konsumenterna att minskasin konsumtion.

En komponent som krävs för att leverera innovativa återkopplingsmekanismer är förmåganatt förutse elförbrukningen på hushållsskala. Arbetet som presenteras i denna avhandlingär en utvärdering av noggrannheten hos ett urval av kärnbaserad maskininlärningsmetoderför att förutse den sammanlagda förbrukningen för olika stora uppsättningar av hushåll.Arbetet i denna avhandling visar att "k-Nearest Neighbour Regression" och "Gaussian Pro-cess Regression" är de mest exakta metoder inom problemets begränsningar. Förutom nog-grannhet, så görs en utvärdering av fördelar, nackdelar och prestanda hos varje maskinin-lärningsmetod.

II

Table of Contents

1 Introduction 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Project Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4.1 Benefits, ethics and sustainability . . . . . . . . . . . . . . . . . . . . . . . 31.5 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.6 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.6.1 Weather data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.7 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Theoretical Background 72.1 Residential electricity consumption . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Supervised learning and regression . . . . . . . . . . . . . . . . . . . . . 92.2.2 Kernel-based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.3 Time series forecasting considerations . . . . . . . . . . . . . . . . . . . . 112.2.4 Bias and variance tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.5 Cross validation and model evaluation . . . . . . . . . . . . . . . . . . . 122.2.6 Error measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Scientific methods 173.1 Data collection and description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1.1 Preprocessing and data quality . . . . . . . . . . . . . . . . . . . . . . . . 183.2 Explanatory analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.1 Building types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.2.2 Heating, lighting and consumption . . . . . . . . . . . . . . . . . . . . . . 193.2.3 Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.3.1 k-Nearest Neighbour Regression . . . . . . . . . . . . . . . . . . . . . . . 233.3.2 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.3.3 Gaussian Process Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 253.3.4 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Experiments 314.0.5 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1 Validation and learning procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 314.2 k-Nearest Neighbour Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.3 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.4 Gaussian Process Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.5 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

III

4.6 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.6.1 Implementation considerations . . . . . . . . . . . . . . . . . . . . . . . . 39

5 Conclusions 415.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Bibliography 43

Appendices 1

A Plots for k-Nearest Neighbour Regression implementation 3

B Plots for Kernel Ridge Regression implementation 7

C Plots for Gaussian Process Regression implementation 11

D Plots for Support Vector Regression implementation 15

IV

List of Figures

2.1 Line plots comparing plots of consumption for 5 households as individualsand aggregated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Kernel plots as a function of a single input . . . . . . . . . . . . . . . . . . . . . . 112.3 The tradeoff between bias and variance . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Line plots showing average hourly consumption for each hour of the day fordifferent building types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 Scatter plots showing the correlation between outdoor air temperature andenergy consumption for different heating types. . . . . . . . . . . . . . . . . . . 20

3.3 Plots of averaged consumption for the entire time period covered by the dataset for apartments and houses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4 Autocorrelation for the electricity consumption for apartments and houses . . 223.5 Sample functions generated by different kernel types . . . . . . . . . . . . . . . 263.6 Optimal separating hyperplane according to maximum margin in two dimen-

sions with support vectors filled . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.7 Visualisation of ε support vector regression . . . . . . . . . . . . . . . . . . . . . 29

4.1 Values of k by housing type for different set sizes. Each colour represents adifferent value for the set size, and the numbers used are shown in the legend. 33

A.1 Example forecasts using k-NNR over a four day period for the NAR model . . 3A.2 Example forecasts using k-NNR over a four day period for the TIME model . 4A.3 Scatter plots of observed vs. forecast values for k-NNR implementation . . . . 5

B.1 Example forecasts using KRR over a four day period using the NAR model . . 7B.2 Example forecasts using KRR over a four day period using the SE kernel and

the TIME model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8B.3 Scatter plots of observed vs. forecast values for KRR implementation using

the SE kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

C.1 Example forecasts for the GPR implementation using the exponential kerneland the NAR model with 95% confidence interval . . . . . . . . . . . . . . . . . 11

C.2 Example forecasts for the GPR implementation using the exponential kerneland the TIME model with 95% confidence interval . . . . . . . . . . . . . . . . 12

C.3 Scatter plots of observed vs. forecast values for the GPR implementationusing the exponential kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

D.1 Example forecasts for the SVR implementation over a four day period usingthe SE kernel and the NAR model . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

D.2 Example forecasts for the SVR implementation over a four day period usingthe SE kernel and the TIME model . . . . . . . . . . . . . . . . . . . . . . . . . . 16

D.3 Scatter plots of observed vs. forecast values for the SVR implementation usingthe SE kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

V

List of Tables

2.1 Categorisation of common error metrics for time series forecasting . . . . . . 14

3.1 Variables available in the data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2 Number of households of each house type and average occupancy . . . . . . . 18

4.1 Features used by each of the models . . . . . . . . . . . . . . . . . . . . . . . . . 314.2 Grid search values for the k-NNR implementation . . . . . . . . . . . . . . . . . 324.3 Results for implementation of k-NNR algorithm . . . . . . . . . . . . . . . . . . 334.4 Model fit and forecast times for k-NNR algorithm . . . . . . . . . . . . . . . . . 344.5 Grid search values for KRR implementation . . . . . . . . . . . . . . . . . . . . . 344.6 Results for implementation of KRR algorithm . . . . . . . . . . . . . . . . . . . . 344.7 Model fit and forecast times for KRR algorithm . . . . . . . . . . . . . . . . . . . 344.8 Results from forecasts using GPR with three different kernels . . . . . . . . . . 364.9 Model fit and forecast times for GPR algorithm . . . . . . . . . . . . . . . . . . . 374.10 Grid search values for SVR implementation . . . . . . . . . . . . . . . . . . . . . 374.11 Results from forecasts using the SVR algorithm . . . . . . . . . . . . . . . . . . . 384.12 Model fit and forecast times for SVR algorithm . . . . . . . . . . . . . . . . . . . 38

VI

Chapter 1

Introduction

The world is facing an electricity crisis. The continual increase in electricity consumptionrequires a corresponding increase in electricity production, and the environment is sufferingthe consequences. Much of the worlds electricity is generated through burning fossil fuels,an activity which causes environmental problems like global warming through the releaseof carbon dioxide into the atmosphere [32, p.33]. Other methods of electricity generationmay not have such a direct effect on the environment, but the negative consequences arestill evident: the environmental impact of mining uranium and disposing of nuclear wastefor nuclear power and the use of fossil fuels in the manufacturing of solar panels are justtwo examples.

The electricity we generate is valuable, but it is often treated as an inexhaustible resourceand little thought is given to how it is consumed leading to inefficiency and waste. In theresidential sector, the feedback consumers receive about electricity consumption contributesto this problem. It is critical that this feedback is improved to help residential consumersdecrease their consumption and assist in reducing overall electricity demand and its envi-ronmental consequences.

1.1 Background

The residential sector consumes a large percentage of generated electricity in developedcountries, accounting for approximately 29% [2] in Europe and 22% [1] in Sweden withresidential consumption forecast to rise until at least 2030 [1]. Residential consumptionhas increased over the previous decades with the rise in electricity hungry consumer ap-pliances. In 2009 it was found that in the United States heating and cooling is no longerthe main consumer of household electricity, with appliance consumption rising from 24%of household usage to 34.6% [4]. In the residential sector, consumption is directly relatedto consumer behaviour, and residential consumption patterns can more easily be reducedbecause the end uses are not as essential as they might be in other sectors.

The European Environment Agency is targeting behavioural change among residential con-sumers as a way to improve energy efficiency due to a "body of evidence in the academicliterature which demonstrates that there is potential for energy savings due to measurestargeting behaviour change." [12, p. 5] Behavioural change is the modification of humanbehaviour, and in the context of this thesis the behaviours are those which result in elec-tricity consumption, for example the use of heating, lighting, or appliances. In addition toelectricity savings, behavioural change can help address the problem of peak load demand,which is mostly caused by the residential energy sector [13]. The potential for electricity

1

savings through behavioural change has been an area of study for decades with studies andtrials showing promising results. A review of 38 different methods of direct feedback be-tween 1975 and 2000 found that highest energy savings can be achieved through feedbacksystems that provide near real time data on displays that are mounted in the home (eitheras a standalone appliance or through a television or computer) [19].

To promote behavioural change among residential consumers the feedback they receiveabout their electricity consumption needs to be improved. The typical feedback is a monthly(or even quarterly) bill that is too far removed from the actual consumption to provide anyvalue. This form of feedback makes it almost impossible for a consumer to draw a linkbetween their behaviours and the amount of electricity they consume. The provision ofintermittent and infrequent consumption feedback to consumers is for historical reasons.Electricity meters used to be physically read and consumers were provided with an invoicebased on the readings, or based on a forecast of consumption using old meter readings.Fortunately more data about electricity consumption is now available due to smart meterroll outs in many countries. Smart meters are capable of providing hourly or daily con-sumption figures. It is estimated that by the end of 2020 around 800 million smart meterswill be installed worldwide, with market penetration in North America and Europe beingover 70% [5]. In Sweden 95% [37] of households have a smart meter that is capable ofrecording hourly consumption data along with the corresponding timestamp. Despite theavailability of this data, electricity providers typically don’t utilise it to provide any usefulfeedback to consumers.

1.1.1 Project Context

This project is conducted in co-operation with Greenely, a start up based in Stockholm,Sweden. Greenely is focused on harnessing smart meter data to cause behavioural changeamong residential electricity consumers through improved consumption feedback. Greenely’sinitial product offering will be a mobile application (hereinafter referred to as "the applica-tion") that serves as a medium for the delivery of this feedback. As part of the prototyping ofthe application a survey and behavioural analysis was conducted to determine the importantdrivers of electricity conserving behaviour. It was found that the users social environmentis important as a direct driver of behaviour, and as a method for disrupting and modifyingelectricity consumption habits. The users social environment is defined by the people theyassociate with and their attitudes toward the environment and electricity.

To bring the users social environment into the mobile application, Greenely may implementa feature allowing households to be grouped into teams with the common goal of reducingtheir electricity consumption. Incentivising people as teams rather than as individuals hasbeen shown to be a strong motivator [10] and it may be more effective in the long termthan providing feedback to individual households. A team could consist of family members,friends, apartments in an apartment building, or whole geographic regions. Dependingupon the implementation, teams could be formed by users, or be predefined based on hous-ing characteristics, which would be the case for geographic location based teams.

Based on the aggregate historical consumption of the team, a forecast of future consump-tion could enable several forms of feedback. For example, the application may notify themembers of a team when their aggregate consumption was above or below forecast values,and produce plots of the teams consumption compared to forecasts. Another example is acountry wide ranking of cities according to how far above or below the forecast consumptionthe cities actual consumption was over a period of time.

A system to forecast future consumption for teams of households can thus be seen as abuilding block for Greenely to provide innovative forms of feedback. The implementation

2

of this functionality is the motivation for the work presented in this thesis. For the remainderof this thesis the word "set" will be used for a team.

1.2 Problem

The problem of forecasting electricity consumption for countries is a well researched area,and many methods have been applied to the problem. However, research into forecastingat the household scale is sparse. The problem considered by this thesis is forecasting theaggregate electricity consumption for small (less than 100) sets of households. Studies havefound that as the number of households in a set increases, the data becomes more structuredand more complex forecasting algorithms that leverage this structure begin to perform moreaccurately than simple algorithms [27]. This work is an investigation into how a selectionof machine learning algorithms perform on varying sized sets of households. In particular,the thesis is concerned with the following questions:

• How does the set size affect the accuracy of the algorithms?

• What are the advantages and disadvantages of the algorithms, and what limitationsmight they place on system considered in Section 1.1.1?

• Which of the algorithms are the most suitable for the implementation of the systemconsidered in Section 1.1.1?

Additionally, this thesis intends to provide a general foundation for the implementationof the system considered in Section 1.1.1, and some consideration of the feasibility of thesystem and restrictions that might be placed on its eventual implementation.

1.3 Purpose

The purpose of the thesis is to present a theoretical and practical investigation into the useof several machine learning methods for forecasting the aggregate electricity consumptionfor different sized sets of households. The purpose of this work is to determine how effectivethese methods are at forecasting so as to inform the implementation of the system outlinedin Section 1.1.1.

1.4 Goal

The goal of the degree project is to develop a system for forecasting the aggregate electricityconsumption for different sized sets of households. Therefore this thesis presents the workdone in establishing a theoretical foundation for such a system.

1.4.1 Benefits, ethics and sustainabilityThis thesis will primarily benefit Greenely. It will allow the implementation of innovativemechanisms of feedback to consumers about their electricity consumption. This will bene-fit consumers by assisting them in reducing their expenses, and the environment, throughreduced electricity demand.

Secondarily, the project will benefit other researchers interested in the area of electricityconsumption forecasting for small numbers of households and analogous scenarios by pre-senting an evaluation of the performance of several prominent algorithms and theoreticalinformation about the implementation of the algorithms.

3

This project uses a real data set that contains extremely sensitive information like addresses,house occupancy numbers, and electricity consumption patterns for individual householdswhich could be used in malicious ways. This document will be published publicly so noinformation that could identify individual households should be present. The data set willremain on a single machine with an encrypted drive and not be transmitted to any othermachines or distributed to third parties for any reason.

This work is conducted using a fixed data set from a specific point in time, but the researchshould remain applicable and relevant for a long period. The characteristics of electricityconsumptions and the patterns of consumption for a household are unlikely to change overthe coming decades, except for an increasing trend which changes too slowly to alter theresults of this research.

1.5 Research Methodology

In designing a research methodology for this work the research onion of Saunders et al. [41,p.108] was used as a framework. The research onion is a useful tool for designing an effec-tive research strategy and ensuring that all of the methodological aspects of the research areconsidered. The research onion consists of layers. Starting from the outermost layer theselayers are: philosophies, approaches, strategies, choices, time horizons, and techniques andprocedures.

The forecast of electricity consumption involves the analysis and use of quantitative datasets and the evaluation of forecast accuracy is based on quantitative error measures, and sothe methodologies employed in this work are primarily quantitative.

The first action when designing a research methodology is the choice of research philoso-phy, which is an appropriate first decision because it informs later decisions about researchmethods. Philosophical assumptions describe how the researcher views the world. For thiswork the positivist philosophical assumption will be used, which is the primary paradigmused in the quantitative realm. The positivist paradigm is working with the observable,and only drawing inferences from observable phenomena [41]. In other words, this projectassumes the only source of truth is the data.

The research method will be experimental because it is focused on cause and effect, andthe relationships and causality between dependent and independent variables [41, p.142].The research time horizon will be cross-sectional, meaning that it is the study of electric-ity consumption at a particular point in time [41, p.155]. This has consequences for thesustainability of this research, which was discussed in Section 1.4.1. Finally, the researchtechniques and procedures will come from field of machine learning.

1.6 Delimitations

This analysis in this project is confined to a single real world data set obtained for householdsin and around the Swedish city of Västerås.

The algorithms implemented are from the field of machine learning. Further, the methodsused are a subset of machine learning algorithms that employ a kernel. The commonalitybetween the methods from the use of a kernel enables a common thread of discussion forthe algorithms and a shared theoretical basis. This project is not an exhaustive investigationof all forecasting methods, and it is acknowledged that are some algorithms that may out-perform the implemented methods. Perhaps the most notable absence are neural network

4

algorithms, which are often successful in time series forecasting competitions, and are themost prominent in forecasting studies [14, p. 41]. Zhang provides a recent overview offorecasting with neural networks [52]. Initial testing with some recurrent neural networkstructures revealed poor accuracy compared to the used kernel based methods, which waslikely a consequence of overfitting due to the small training data size. Overfitting is commonin neural networks where training data is limited, and various methods have been proposedto deal with the issue [30]. It may be possible to achieve better results using neural net-works, but it may require a more advanced implementation than those which are commonlyavailable in machine learning libraries.

Based on analytics of how users behave within the prototype of the application, it wasestimated that it will take approximately a week of using the application before the userbecomes interested in this more advanced functionality, and approximately another weekbefore they form a set and expect feedback about that sets performance. At that time,the application will have access to two weeks of historical smart meter data for that user,and consequently the algorithms presented in this work will be restricted to two weeks ofhistorical data as the basis for forecasts. An additional reason for this restriction is that theforecasts should be reasonably responsive to changes in consumption habits, and so fixingthe usage of historical data to two weeks ensures that the forecasts will adapt over time. Anadditional requirement is that the forecast should be an hourly forecast for a complete day.Other sizes of training data and forecast lengths are not considered in this work, although itis acknowledged that the results may be improved with changes to these parameters.

This work considers Swedish households and infrastructure. The results and theory pre-sented may not generalise to all countries. It is likely that the results are relevant for otherNorthern European countries, but may not be relevant for more temperate climates.

1.6.1 Weather dataIt is common in the forecasting of electricity consumption to use weather variables to im-prove forecast accuracy, for example outdoor air temperature and cloud cover. This is effec-tive due to the high correlation between these variables and electricity consumption. Theimplementation considered by this work places no restrictions on the geographic locationof the houses that form the set, and so if the set is formed from households that are geo-graphically far from each other weather variables will have little meaning. However, it isacknowledged that weather variables could be beneficial to the accuracy of the models, andthe eventual implementation of the system could use a heuristic to determine whether toinclude weather variables as inputs to the algorithms. For example, a determination basedon the maximum distance between houses in the set.

Note some weather data is used as part of the explanatory data analysis but not as an inputto the algorithms.

1.7 Outline

Chapter 2 will present the theoretical background of machine learning and the methodsused, some theoretical considerations specific to time series data, and the extant researchrelated to electricity consumption forecasting. Chapter 3 will provide a description andexplanatory analysis of the data set used, and a description of the algorithms implementedfor the experimental section. Chapter 4 includes the experimental setup and the results fromthe experiments along with a discussion of the results. Chapter 5 presents the conclusionsof this work and some future direction.

5

Chapter 2

Theoretical Background

In many fields, observations can be collected about a phenomenon, called data, and thenanalysed to extract information about the underlying phenomenon.

Methods for extracting knowledge from data have arisen in several fields, namely statistics,machine learning and data mining. Although these fields have different approaches andphilosophies, they have the common goal of extracting knowledge from data. Breiman [16]describes Statistics as focused on developing parameters for some stochastic model thatresult in that model fitting the data well (the data modelling approach). Machine learning,on the other hand, takes a more algorithmic approach where the internal workings of themodel are unimportant, and predictive accuracy is the prime consideration (the algorithmicmodelling approach). A simplification that may serve to illustrate this further is to say thatStatistics wants to see and understand the data generating process, but machine learningonly wants to replicate it. As such, Statistics primarily deals with models, and machinelearning focuses on learning algorithms and procedures with more of an emphasis on largescale data. The third field, Data mining, is more focused on performing analysis on data,for example to uncover anomalies or patterns [14, p. 8]. The lines between the fields arebecomingly increasingly blurred, and there is a significant crossover between the techniquesthat each field uses.

Within the scope of problems where the goal is to extract knowledge from data is a subsetof problems that involve predicting the future from historical time series data. These areknown as forecasting problems. Forecasting problems are found in many areas, such asmeteorology, economics and load forecasting.

The analysis of time series data is traditionally an endeavour for statisticians whose aim isprimarily to find the internal structure of a time series to "gain a better understanding ofthe dynamic process by which the time series data are generated." [35, p.17]. This involveslooking at the characteristics of a time series such as trend and seasonality. These character-istics can be analysed and the signal decomposed into its constituent parts. Models are builtusing techniques like ARMA and ARIMA (see [35] for further information on these meth-ods). This is the data modelling approach discussed by Breiman in [16] which he argueshas disadvantages for statisticians due to the emphasis being placed on the model, ratherthan the problem and the data [16, p.214]. When the focus is on the model significantassumptions need to be made about the data being modelled. The forecasting algorithmsconsidered by this work must perform well for many different time series with varying char-acteristics (according to the problem described in Section 1.2), which are inherited fromthe households that form the set. It was assumed that data modelling techniques would be

7

too dependent on the characteristics of each time series because of the assumptions madeabout the data.

This work will be approached from the machine learning perspective, with the focus pri-marily placed on predictive performance.

2.1 Residential electricity consumption

The electricity consumption of a household is determined by the appliances in use andthe amount of energy they consume. The consumption can be divided into three cate-gories.

The first is a base load, which is caused by those appliances that are always on. This in-cludes appliances like refrigerators, clocks, and appliances which are in standby. In general,refrigeration is the largest contributor to base load [20]. The base load follows a reasonablysmooth pattern. The second category is seasonal load which is primarily attributable toheating and cooling systems. The seasonal load varies with the type of heating and coolingsystem, and with the temperature ranges for the geographic region. The final category isactive load, which are those appliances used actively by the occupants, such as televisions,washing machines, and lighting. This is the most problematic category for forecasting al-gorithms, because the usage pattern can be random as it is dependant on the occupantsand their occupancy patterns [23, p.935]. The appliances that make up active load alsotend to use large amounts of electricity. A survey of Irish households found the most energyconsuming appliances to be tumble dryers and dishwashers [23].

For an individual household, the electricity consumption is highly stochastic due to the activeload. Algorithms for forecasting may perform poorly because the consumption pattern isvery noisy. The consumption pattern may have some structure, for example consumptionis generally higher in the evenings, but the pattern is still quite unpredictable. When aset of households is considered, the aggregate consumption pattern should become morepredictable because the effect of the individual occupants becomes less important to theoverall signal, and the actions of all the occupants then contribute to smooth out the patternof consumption. Figure 2.1 shows that patterns in the data emerge when the aggregateconsumption over 5 households is graphed.

(a) Individual (b) Aggregate

Figure 2.1: Line plots comparing plots of consumption for 5 households as individualsand aggregated

8

2.2 Machine learning

Machine learning is a multidisciplinary field that draws from probability and statistics, pat-tern recognition, and computational learning theory. It has been a rising field over theprevious several decades, and has gained widespread attention because of its ability to un-cover meaning in the increasingly large data sets available for analysis. Today, machinelearning is so pervasive that the average person encounters systems built using machinelearning methods on a daily basis.

Machine learning problems can be categorised according to the type of data that the systemwill learn from. The three fundamental categories are supervised learning, unsupervisedlearning, and reinforcement learning. For a general introduction to learning frameworkssee [25]. This thesis deals exclusively with the application of supervised learning algo-rithms.

2.2.1 Supervised learning and regression

The goal of a supervised learning algorithm is to learn the mapping from inputs x ∈ X tooutputs y ∈ Y when given a data set of input and output pairs D = (xi , yi)

Ni . Typically the

input x is a multidimensional vector of numbers, and yi is a variable dependent on x thatis either one of a finite number of values or a real valued scalar [34]. The term supervisedis used because the output y "supervises" the learning process. The inputs x are referred toas features.

In the case where the output y is one of a finite number of values the problem is knownas a classification problem, and when it is a real valued scalar it is a regression problem.Classification problems are useful when something needs to be classified into one of sev-eral categories, such as in the case of classifying an email as spam or not spam. Regressionproblems are similar except the output value is a real valued scalar. Some examples of re-gression applications are stock market predictions, temperature forecasting, or height andweight predictions. The forecast of electricity consumption is another example of a regres-sion problem because the output variable is a real valued scalar representing the electricityconsumption at some point in the future.

For most learning algorithms that are capable of performing regression the aim is to approx-imate the unknown function f : X→ Y from D. Because the true function underlying thedata generating process is generally not known, it is typical for learning algorithms to makesome assumptions about the function. These assumptions can be embedded in the choiceof algorithm and in the parameters of the algorithm. In particular, an important assump-tion about the data is whether the features have a linear or non-linear relationship with thetarget variable.

Most machine learning algorithms were developed for linear data, but through the use ofkernels linear algorithms can be adapted to nonlinear problems.

2.2.2 Kernel-based methods

A subset of machine learning algorithms is those which use kernels. A kernel is effectively ameasure of similarity between data points. The importance of similarity between data pointscan be highlighted by considering how a learning algorithm might make a prediction basedon an unseen input. To map an unseen feature vector to a target variable the algorithmcould proceed by comparing the unseen feature vector to seen feature vectors accordingto some measure of similarity (which is defined by the kernel). The algorithm then knows

9

that the target variable for the unseen feature vector should be in the region of the targetvariables for the seen feature vectors.

κ : X × X→ R, (x , x ′) 7→ κ(x , x ′) (2.1)

κ(x , x ′) = ⟨ϕ(x),ϕ(x ′)⟩ (2.2)

where k is known as the kernel,ϕ is called its feature map which maps into some dot productspace called the feature space and ⟨·, ·⟩ denotes an inner product.

The advantage of using a kernel to define similarity between points is that the kernel al-lows the construction of algorithms that work in dot product spaces [26]. The use of adot product space allows the algorithm to work in a higher dimensionality, where a linearrelationship in the data may exist even if no such linear relationship existed in the originaldimensions of the problem. This is known as the kernel trick. Any linear model can be turn-ing into a non-linear model by the use of the kernel trick. The use of dot products makessolving these problems computationally tractable, where working directly in that higherdimensional space may not have been possible.

KernelsThis section provides the formulation of the kernels that will be used in this project.

Perhaps the most common type of kernel is the squared exponential (SE) kernel, whichis often also known as the radial basis function kernel. It is the default kernel in manykernel algorithm implementations because it is a solid approach to any problem where noadditional knowledge about the data is available but "general smoothness assumptions" canbe made [44].

κSE(d) = ex p(− d2

2ℓ2) (2.3)

where ℓ is a parameter known as length-scale, which affects how quickly the function canchange direction, and d is the distance between two points, typically given by |x − x ′| .The SE kernel has been criticised on the basis that modelling physical processes with sucha smooth function is unrealistic [45] with the Matérn class of kernels being proposed as analternative. The Matérn class of kernels are given by:

κMAT ERN (d) =21−v

Γ (v)(p

2vdℓ)Kv(p

2vdℓ) (2.4)

where Γ is the gamma function, Kv is the modified Bessel function of the second kind, d isthe distance separating the two points x and x ′, and ℓ and v are non-negative parameters ofthe function. The Matérn formulation is much more complex, but it gives rise to a number ofuseful kernels through the parameter v. As v→∞ the function converges to the SE kernelintroduced above, and when v = 1

2 the function produces the exponential or Ornstein-Uhlenbeck kernel:

kEX P(d) = ex p(−dℓ) (2.5)

10

Additionally in this project the Matérn kernel with v = 32 is used which should provide a

good mixture between the sharpness of the exponential kernel and the smoothness of theSE kernel. The Matérn32 kernel is given by:

κM32(d) = (1+p

3dℓ)ex p(

p3dℓ) (2.6)

(a) Exponential kernel (b) RBF kernel (c) M32 Kernel

Figure 2.2: Kernel plots as a function of a single input

2.2.3 Time series forecasting considerationsA time series forecast can be made for the short-term, medium-term, or long-term. A forecastmay consist of multiple steps into the future, such as a daily forecast for five days. The lengthof the forecast into the future is called the forecast horizon and the point from which it ismade is called the forecast origin or just the origin. In general, forecasts with a larger forecasthorizon are less accurate. This is because errors can aggregate and amplify as the forecastmoves further away from the origin.

Machine learning algorithms are at a disadvantage compared to other techniques used fortime series forecasting because most algorithms assume that each observation is indepen-dent and identically distributed, which is not the case in time series data where observationsclose together in time are correlated.

One method of feature design for time series forecasting is to use lagged values of thetarget output variable. For example, the features could be the previous 5 hours of electricityconsumption. This is known as an autoregressive (AR) model where the target value isassumed to be a linear combination of lagged values, and a nonlinear autoregressive (NAR)model in the nonlinear case. The length of the forecast horizon imposes constraints on theuse of lagged values because all of the lagged values must be available at each step in theforecast horizon. Taking the case of a 24 step forecast horizon and feature set consisting ofthe previous 5 observations, at steps t > 5 none of the required lagged observations will beavailable.

To enable the use of lagged variables in autoregressive models various forecasting strate-gies are available. The two most common methods of dealing with this are the recursive anddirect forecasting strategies [14]. The simplest approach is the recursive strategy. The re-cursive strategy is an iterative approach where the forecast is made one step at a time, withpreviously made predictions being incorporated as observations to enable forecasts for latersteps. The disadvantage of this approach is that errors in predictions get incorporated intolater inputs resulting in an aggregation of errors over the forecast horizon. The direct ap-proach is where a different model is built for each step in the forecast horizon, which allowsfor each model to be built using lagged values that would be available at that step.

Another approach for feature design for time series forecasting is to extract features from thetimestamp of the observation. For example, in hourly data the hour, day of the week, day of

11

the year, etc. could be passed as numerical inputs to the algorithm, causing the algorithm tolearn the relationship between the time and output. This converts the time series probleminto a more standard regression one, because the algorithm does not learn a relationshipbetween historical outputs and the current output, it only learns a relationship between atimestamp and an output.

2.2.4 Bias and variance tradeoffThe bias and variance tradeoff describes the problem of balancing between two sources oferrors: an error due to bias and an error due to variance.

The error due to bias is the difference between the expected prediction of a model and thecorrect value over the training data. It is a systematic error that the model will always makeeven if trained with several data sets of equal quality. A high bias can occur by fitting asimple model to relatively complicated data, such as fitting a straight line to nonlinear data.This is known as underfitting the data. Conversely, fitting a relatively complex model tosimple data is known as overfitting, which will lead to a low bias because of the flexibilityin the model, but it may lead to a high variance because the model is too conditioned to thetraining data.

The error due to variance is the variability of a prediction of a model at a given point, orthe amount by which the predictions of a model trained with one data set differs from theexpected predicted value over all training data sets. As such, it is a good measure of howwell a model will generalise when predicting for unseen data.

model complexity

erro

r

training errortest error

Figure 2.3: The tradeoff between bias and variance

Figure 2.3 provides a visual representation of this concept. As model complexity increases,the error in the training data will decrease, but the model may be overfitting the trainingdata, which will lead to high errors when the model is used with test data.

2.2.5 Cross validation and model evaluationTo address the bias-variance trade off, a model must be assessed on its ability to predicttarget values for unseen observations. This section describes a procedure for assessing theperformance of a model for unseen observations in time series data. In this section a distinc-tion is made between the in-sample (training) error and the out-of-sample (test) error. Theout-of-sample error is a measure of how well the model can generalise to new data.

12

A basic method of model validation is the holdout method:

1. Split the data into two subsets, a training set and a testing set.

2. Train the model using the training set.

3. Make predictions using the testing set.

4. Compute the out-of-sample error by comparing the predictions made using the testingset with the actual observations using an appropriate error measure (see Section 2.2.6for a further discussion on error measures).

This method has two obvious drawbacks: the data set may be small and not all of it can beused for training if it is divided in two and the performance of the model is dependent on thelocation of the split. A number of cross-validation procedures exist to mitigate these issues.A common cross-validation method is the K-fold method. The data is split into K − 1 foldsand an iterative procedure is used so that each fold is used as a test set with the remainderof the data being used as the training set. The error measurement then becomes the averageof the errors for each of the folds.

For time series data K-fold cross validation is not appropriate because all observations in thetest set must have occurred temporally later than the observations in the training set. Theholdout method can be used for time series cross validation if the data ordered temporallyand the test data set is chosen from the end. This has been termed last block validation [15].This idea has been expanded to achieve a more rigorous cross-validation procedure for timeseries data.

Tashman distinguishes between two methods for building training sets and the test sets fromtime series data: fixed-origin and rolling-origin [47]. The forecasting origin is denoted bythe time t of the last observation in the observation in the training set. The last observationin the forecast horizon is denoted by n.

In fixed-origin procedures the origin t does not change and forecasts are then generated fort +1, t +2, ...t + n. It is clear that the drawback discussed above regarding the dependencyof errors on the location of the forecast origin is present, and depending on the error mea-surement used the resultant error may be a "melange of near-term and far-term forecasterrors" [47, p. 439].

Rolling-origin procedures can mitigate these problems by regenerating the model at eachstep before making predictions for the remaining forecast horizon. When n = 3, for exam-ple, three forecasts will be generated from origin t, two forecasts will be generated fromorigin t+1, and one forecast will be generated from origin t+2. The value of n should be atleast as large as the maximum forecast required by the model, and may be larger to ensuresome minimum number of forecast are made at the maximum forecast length. [47].

2.2.6 Error measures

This section will provide a brief evaluation of common methods for measuring the accuracyof forecasts. The methods considered here are a selection that are popular in the literatureand regularly used in time series forecasting competitions.

Percentage based errors become independent of the scale of the data by scaling accordingto the actual value being forecast. Consider the MAPE which is given by:

MAPE =1n

n∑t=1

��At − Ft

At

�� (2.7)

13

Category Acronym Name

Percentage basedMAPE Mean Absolute Percentage ErrorMdAPE Median Absolute Percentage ErrorsMAPE Symmetric Mean Absolute Percentage ErrorsMdAPE Symmetric Median Absolute Percentage Error

RelativeMRAE Mean Relative Absolute ErrorMdRAE Median Relative Absolute ErrorGMRAE Geometric Mean Relative Absolute Error

Scaled MASE Mean Absolute Scaled Error

Table 2.1: Acronyms for common error metrics for time series forecasting.

where At and Ft are the actual and forecasted value at time t. It is clear from the MAPEformulation that the error is undefined if At is 0, and as At becomes very close to 0 theerror also exhibits undesirable characteristics. The sMAPE error metric first appeared as anadjustment to MAPE [8, p.385] as a method of addressing the issues with small values ofAt . It is generally given by:

sMAPE =1n

n∑t=1

Ft − At

(|At |+ |Ft |)/2 (2.8)

Despite having symmetric in the name, the sMAPE is not very symmetrical. It treats forecastsunder the actual value more harshly than forecasts over the value. Additionally, although itis a percentage based error the percentage ranges from 0 to 200%.

An alternative method of scaling is to use scale according to an error calculated using someother method of forecasting as a benchmark. A commonly used benchmark is the naiveforecasting method where each forecast is equal to the previous observation. The generalformulation of a relative error measure is:

Rt =Et

Et∗ (2.9)

where Et is the forecasting model being evaluated and Et∗ is the benchmark forecast method.These methods are shown in the relative category of Table 2.1 with each relative measurebeing the mean, median, and geometric mean of the sum of Rt for all t. This avoids someof the issues with percentage based errors, but relative measures still exhibit some undesir-able statistical properties because Et∗ can be small [28, p. 684]. To address this Hyndmanproposed the mean absolute scaled error (MASE) which is similar to the relative error for-mulation but the error is scaled by the in-sample mean absolute error (MAE) using the naiveforecasting method. It is given by:

qt =Et

1n−1

∑ni=2 |Yi − Yi−1| (2.10)

MASE avoids the problems that other methods have, and is also easily interpretable. In thecase where qt < 1 the forecast method achieves a better forecast on average than the naivebenchmark [28, p.685] method, and vice-versa.

The MASE measure will be used as a measure of the algorithms performance along with thesMAPE metric. Despite the problems with the sMAPE metric, it will also be used because

14

of its popularity in forecasting competitions (e.g. NN3) and to give other forecasters afamiliar reference to assess the results of this work. Because it is roughly interpretable as apercentage it is also useful for people who are not familiar with the MASE measure.

2.3 Related work

The forecasting of electricity demand has long been an area of research because of its impor-tance to electricity providers. An accurate forecasting model allows an electricity provider toplan their generation and distribution more efficiently. This research is generally conductedusing consumption for an entire grid or country, which is a different problem to forecastingat the smaller scale of households. Very little research on forecasting the aggregate con-sumption for different sized sets of households was found. However, some research hasbeen conducted into forecasting consumption for large buildings, which shares some of thecharacteristics of the problem this work considers.

Dong et al. used support vector regression (SVR) to forecast the monthly energy consump-tion of four commercial buildings in Singapore with good results [21]. The approach used astepwise search for the optimal hyperparameters, and a similar approach is adopted in thiswork. SVR was also used by Humea et al. [27] to forecast the consumption for varying sizedsets of households. They found that SVR performed worse than simple linear regression forsmall set sizes, but SVR began to outperform linear regression when the set contained morethan 32 households.

Some authors have suggested grouping households with similar consumption patterns be-fore forecasting [29], which would allow patterns and structure to emerge in the data faster.Unfortunately this is not applicable for the problem considered by this work because theforecast is being made for sets that are formed in different ways, and are not necessarilyformed by grouping households with similar consumption patterns.

There exists a range of related work that doesn’t directly consider the problem of electricityconsumption forecasting, but considers time series forecasting in general. Ahmed et al. con-ducted a survey of several different machine learning methods using the M3 data set whichshowed an "unambiguous ranking" [7] among the methods, with multilayer perceptrons (atype of neural network) and Gaussian processes being the most accurate. The M3 data setconsists of economic data, so the methods may not perform as effectively on other typesof data. Additionally, the study was limited to one-step ahead forecasting. Sapankevychand Sankar provide a survey of the use of support vector machines (SVM) for time-seriesforecasting [39] in which they found that SVMs are a viable approach to time-series fore-casting. Nearest neighbour algorithms have been applied to time series forecasting [31] andan extension to the nearest neighbour method was made by McNames particularly for timeseries data [31]. A handful of other machine learning methods have also been used for timeseries forecasting, including Gaussian processes [24], regression trees [50] and Bayesianmodels [11].

15

Chapter 3

Scientific methods

This chapter outlines the scientific methods used to collect, describe, and analyse the data.

3.1 Data collection and description

The experiments conducted in this work use a real data set containing hourly electricityconsumption obtained from smart meters for the period February 01, 2014 to December 31,2014. The data set consisted of readings from 93 individual households located in Västerås,Sweden. The houses contained in the data set were part of a pilot project conducted byGreenely for the purpose of testing a prototype of the application. The participants wereselected by a survey, and they had all indicated an interest in participating in a trial forthe application. The impact of the selection process on the characteristics of the electricityconsumption patterns in the data set is unknown, but it is likely that the average age of theparticipants is below the average age of the Swedish population because smart phone useis more prevalent among younger people [42].

The consumption data was read from the households smart meters and transmitted to theelectricity provider using the EDIEL protocol, which is a standard for data interchange inthe gas and electricity markets (see www.ediel.org). The electricity provider then relayedthe data to Greenely using the same protocol. The data was not manipulated in any way bythe electricity provider and the data received by Greenely was the raw data directly fromthe households smart meter.

From the hourly readings the following information was included or derived for use withthe algorithms:

Notation DescriptionHR Hour of consumptionWK Week of consumptionWKD Day of the week of consumptionISWKDAY Whether the day was a workdayLAGXX Amount of electricity consumed XX hours agoCONSUMPTION Amount of electricity consumed for this hour

Table 3.1: Variables available in the data set

It is possible to derive more features from the timestamp of the electricity consumption,such as the day, month, or year of consumption. Because the training data is limited to atwo week period those additional features would not provide useful data to the algorithms.

17

In addition to the hourly electricity consumption, the data set contained information aboutthe households such as the building type, heating type, number of occupants, and loca-tion.

3.1.1 Preprocessing and data qualityThe data set did not contain any missing values for any of the households and so no methodsof dealing with missing data was required.

The data set did contain some unusual consumption values, but it is not clear if they wereerroneous measurements or if they were due to unusual end uses in the households, forexample the use of power tools. The consumption levels were still within a range where thedata could have been valid, and so no other preprocessing was performed.

3.2 Explanatory analysis

This section contains some explanatory analysis of the data set used in this project andhighlights the important characteristics found.

3.2.1 Building typesThe data set contains buildings of two different types: apartments and houses. Apartmentsare characterised by lower occupancy than houses and a smaller living space.

House type Number Average occupancy Average area (m2)Apartment 29 1.62 77.27House 64 3.21 144.42

Table 3.2: Number of households of each house type and average occupancy

(a) Apartment (b) Houses

Figure 3.1: Line plots showing average hourly consumption for each hour of the dayfor different building types

The most noticeable difference between the consumption patterns of the two housing typesis in the morning, where houses show a small peak in consumption and apartments do not.For each of the housing types there is a significant difference between the consumptionpatterns of a workday and a non-workday. Houses have higher minimum and maximumconsumption values, which is to be expected given that they have higher average occupancyand larger areas.

18

Apartments and houses will be treated separately for the remainder of this analysis becauseof their different characteristics.

3.2.2 Heating, lighting and consumption

The heating system in a house can have a significant impact on the households electricityconsumption. In Sweden, the heating in apartments does not contribute to the electricityconsumption for the apartment. Apartment heating is centralised for the building, and sothe electricity supply used for the heating comes from a different source. Thus it is notpossible to see the effects of a heating system on an apartments consumption.

For houses, the heating type can either be district (telehating), direct, or heatpump based.These three types contribute to electricity consumption in differing amounts. The districtand heatpump methods consume the most electricity, and district consumes a relativelysmall amount. The amount of electricity consumed by a heating system correlates stronglywith the outdoor air temperature. As discussed in Section 1.6.1, outdoor air temperaturewill not be used as an input for the algorithms, and so the periodic nature of the consumptioncaused by heating systems will have to be captured through other features.

To show the effect of outdoor air temperature on consumption, outdoor air temperature wasrecorded for each hour of consumption for each household by querying observed outdoorair temperature at the nearest Swedish Meteorological and Hydrological Institute (SMHI)weather station. To do this, the Google Geocoding API [3] was used to get a latitude andlongitude for each households based on its address, and the nearest weather station wasfound by comparing distances calculated using the Vincenty formula. The temperature wasqueried from the quality controlled "corrected archive" of historical observations providedby the weather station and SMHI. All of the households in the dataset are located within anapproximately 100km radius, so averaging the outdoor air temperature among householdsstill provides some worthwhile information.

Figure 3.2 shows the correlations between energy consumption and outdoor air temper-ature for different heating types. As expected the correlations are the strongest for theheating types that are most dependent on electricity, the direct and heatpump methods, asmaller correlation for the district heating type which is more electricity efficient, and aneven smaller correlation for apartment based heating.

Because the apartment based heating contributes nothing to electricity consumption thecorrelation shown for apartments shows that outdoor air temperature correlates with elec-tricity consumption regardless of the heating system. This is intuitively correct, becausein colder temperatures people tend to spend more time indoors, and because colder tem-peratures generally occur in the early morning and late evening when consumption is atits highest according to Figure 3.1. The small correlation shown is likely due to lightingsystems, which also contribute to the larger correlations shown between outdoor air tem-perature and consumption for houses.

Figure 3.3 shows the seasonality of the consumption patterns. During the colder seasonsthe consumption is significantly higher, largely due to the impact of heating and lightingsystems. A seasonal pattern can be seen for apartments in the maximum consumption val-ues, which are higher in colder weather due to the use of lighting. A seasonal pattern is alsoevident for the minimum values of houses but not apartments because heating systems areactive 24 hours a day.

19

(a) Apartment (b) Direct

(c) Heatpump (d) District

Figure 3.2: Scatter plots showing the correlation between outdoor air temperatureand energy consumption for different heating types.

20

(a) Apartments

(b) Houses

Figure 3.3: Plots of averaged consumption for the entire time period covered by thedata set for apartments and houses

21

3.2.3 Autocorrelation

Autocorrelation shows the correlation of a signal with itself as a function of the time lagbetween the two points. It is a useful visualisation to determine periodic phenomena intime series data sets.

(a) Apartment autocorrelation (b) Apartment autocorrelation function

(c) House autocorrelation (d) House autocorrelation function

Figure 3.4: Autocorrelation for the electricity consumption for apartments and houses

The autocorrelation plots in Figure 3.4 highlight the periodic nature of electricity consump-tion patterns. Apartments and houses show a correlation that peaks strongly around the 24hour period, with houses also showing a smaller correlation around the 12 hour period. Theautocorrelation is useful for designing features which included lagged values of the targetoutput variable where an approach is to attempt to capture the autocorrelation structure inthe feature vector.

3.3 Algorithms

This section outlines the theory and mathematics behind the algorithms implemented inthis project. The mathematical derivations for each algorithm may not be presented in itsentirety because the intention is to provide mathematics that will help the readers under-standing and to show parameters that will be relevant in the implementations. A referencewill be provided to the complete derivation if one is not given.

22

3.3.1 k-Nearest Neighbour Regression

The k-NNR algorithm is a simple but versatile algorithm. It’s simplicity makes it easy toimplement, and the results easy to interpret and explain. Despite it’s simplicity, it oftenperforms well.

The k-NNR takes a different approach to learning than the other algorithms considered inthis work. It uses a local learning approach, while the other methods use a global learningapproach. Global learning attempts to approximate a function that maps all possible inputvalues to an output. This is achieved by fitting a distribution over the data [6, p.13]. Theassumption is that there is some true function that is generating the data, and that functioncan be approximated using the learning algorithm. Conversely, local learning regards thelearning of such a hidden function improbable or difficult, and so it proceeds by using onlylocal but useful information from the data. Local learning does not attempt to model theentire input domain, but attempts to approximate a function that can map a single inputto an output by selecting training data that is similar to the input point. As such, the as-sumptions it makes about the data are weaker. This means it adapts well to different datasets, even if there is no apparent structure, but the predictions are also unstable because thealgorithm relies only on a small number of training examples [25].

The k-Nearest Neighbour Regression (k-NNR) algorithm finds k neighbours which are theclosest to the input point according to some distance metric. Despite not strictly being akernel-based method, k-NNR has a strong link with kernel-based methods because it alsouses a function to determine the similarity of points. In a kernel-based method the distancefunction is defined by the kernel, and in k-NNR it is typically Euclidean distance, althoughit is possibly to modify the k-NNR algorithm to use kernels [51].

In the case of regression, a prediction for a target output is made by averaging the outputof the k neighbours nearest to the given input vector:

y∗ =1k

k∑j=1

y j (3.1)

where y j is the jth nearest neighbour. The k parameter has to be chosen carefully as it con-trols the bias-variance tradeoff of the fit. A higher k leads to more neighbours contributingto the output leading to a smoother fit and lower variance and higher bias, and vice-versafor smaller values of k.

The k-NNR algorithm is an example of a lazy-learning algorithm. No model is maintained,and all computation is deferred until the output from the algorithm is requested. In a naiveimplementation of the algorithm the consequence of this is that it can be extremely fast forsmall data sets, but the computational demand increases linearly with the size of the data setbecause all observations need to be considered to find the k nearest neighbours. However,there are various other approaches to make the neighbour search more computationallyefficient [33].

Modifications can be made to the distance and averaging functions used to improve per-formance. A common approach is to use a weighted averaging function for averaging theoutputs of the k neighbours so that nearer points have a higher contribution to the out-put.

The implementation of k-NNR used in this project is based on the open source scikit-learn [36]package for the Python programming language.

23

3.3.2 Kernel Ridge RegressionKernel ridge regression is an algorithm that is based on perhaps the most common regressionalgorithm: linear regression. Ridge regression is a regularised version of linear regression,and kernel ridge regression is a further adaptation to allow the algorithm to work with akernel. This section first introduces linear regression, and then expands the theory intoridge and kernel ridge regression.

Regression problems were introduced in Section 2.2.1. For linear regression, the standardmodel is given by:

y = xT w+ ε, (3.2)

where x is an input vector, w is a vector of weights, y is the observed target value, and εwhich is noise that is assumed to be normally distributed with zero mean and σ2

n varianceso that ε ∼ N(0,σ2

n). The weight vector w is estimated by minimising the error betweenobservations and predictions. A common method is the ordinary least squares estimator(OLS) which minimises the sum of squared residuals. In matrix form and assuming zeronoise this is given by:

w= (XTX)−1XTy (3.3)

where X is the design matrix and y is the target values. The problem with OLS linearregression is that there is no constraints on the size of the coefficients in w, which meansthat for very large coefficients the variance of the solution can be very high (while the biasis very low). This can occur when there is multicollinearity between the features. To controlthe variance the coefficients can be regularised to control how large they grow at the costof bias. This is what ridge regression achieves by introducing a penalisation on the size ofthe coefficients, which shrinks the coefficients towards zero, and providing a mechanism tocontrol the bias-variance tradeoff. The closed form of ridge regression is given by:

wridge = (XTX+λI)−1XTy (3.4)

where λ controls the penalisation on the size of the coefficients. Note that if λ = 0 thesolution is the same as that for OLS linear regression given in 3.3. For more informationrefer to [25].

The solution can be manipulated to use the kernel trick to allow it to be effective for non-linear data. As was mentioned in the general background of kernel-based methods in Sec-tion 2.2.2 this involves reformulating the expression so that everything is expressed as innerproducts. In the case of kernel ridge regression, this is achieved by using the Woodburyidentity:

wridge = (XTX+λI)−1XTy= XT (XXT +λI)−1y (3.5)

The prediction for a new input x∗ is then:

y∗ = x∗T wridge

= x∗T XT (XXT +λI)−1y

= κ(x∗)(K+λI)−1y

(3.6)

24

where K is the kernel matrix given by Ki j = k(xi ,x j) (also known as the Gram matrix),κ(x∗) = κ(xi ,x∗) (a vector of covariances between the test point and the training points)and κ is a kernel function [40].

It is important to note that the solution to the kernel ridge regression problem depends onthe entire training data, which is not the case for all the algorithms described in this section.This means that the time the algorithm needs to make a prediction grows with the size ofthe data set.

The implementation of KRR used in this project is based on the open source scikit-learn [36]package for the Python programming language.

3.3.3 Gaussian Process Regression

A Gaussian distribution defines a probability distribution for a single random variable. Sim-ilarly a Gaussian process is a distribution over functions. The definition for a Gaussianprocess provided by Rasmussen et al. in [38] is: A Gaussian Process is a collection of randomvariables, any finite number of which have a joint Gaussian distribution.

A Gaussian process is given by:

f (x)∼ GP(m(x), k(x , x ′)) (3.7)

With m(x), the mean, and k(x, x’), the covariance function, being denoted by:

m(x) = E[ f (x)],k(x , x ′) = E[( f (x)−m(x))( f (x ′)−m(x ′))] (3.8)

In most situations the mean function is assumed to be zero for simplicity, but there are someadvantages to choosing a non-zero mean function. For further discussion on non-zero meanfunctions refer to [38, p.27].

For the purposes of forecasting the most important component of the Gaussian process isthe covariance matrix. The covariance matrix corresponds to a kernel function makingGaussian processes belong to the class of kernel-based algorithms. The choice of kernelfunction is critical, as it determines almost all the properties of the resultant model and em-beds the assumptions about how similarity between points decays with distance in the truefunction underlying the process. The selection of an appropriate kernel and correspondingcovariance matrix should be made in accordance with the characteristics of the data beingmodelled.

The covariance matrix is what "implies a distribution over functions" [38, p.14]. For exam-ple, the corresponding covariance matrix for some inputs can be used to create a Gaussianvector, which can then be used to generate values as a function of the inputs. The plotsin Figure: 2.2 show some examples of functions that were sampled from a Gaussian pro-cess. By examining these functions some intuition can be gained about the kernels thatmatch with the different characteristics of the data being modelled. The exponential kernelis effective at modelling data that changes sharply, while the SE kernel provides smootherfunctions, and the M32 falls somewhere in the middle.

It is mathematically possible to show that the sum or product of two kernels is also a kernel(see [38, p.95]) which allows for the design of new kernels based on composites. Figure 3.5d

25

(a) Exponential kernel functions (b) SE kernel functions

(c) M32 kernel functions (d) SE * Exponential kernel functions

Figure 3.5: Sample functions generated by different kernel types

shows sample functions generated from the product of an SE and exponential kernel, whichinherits some of the properties of both kernels.

The covariance function is used to define the prior knowledge that is held about the shape ofthe function being modelled, but to be useful for prediction or forecasting this knowledgeneeds to be updated when training data is presented. This is achieved through bayesianinference. A joint probability distribution of the training outputs y and the test outputs y∗in the noise free case is defined as:

�yy∗

�∼ N

�0,

�K K T∗K∗ K∗∗

��(3.9)

Given n training points and n∗ test points then K∗ is n∗xn matrix of covariances evaluated forall pairs of the training and test points, and similarly for the other matrices K(X , X , K(X∗, X∗.Loosely speaking, the posterior is computed by rejecting functions from the prior that are notcompatible with the training data. To do this the joint probability distribution is conditioned(for a complete derivation of this see Appendix 1 of [38]) on the observations:

y∗|X∗, X , y ∼ N(K∗K−1y, K∗∗ − K∗K−1K T∗ ) (3.10)

This distribution can also be sampled in the same way as the prior. To obtain an actualestimate for a point the mean can be used:

26

y∗ = K∗K−1y (3.11)

Which has a variance:

var(y∗) = K∗∗ − K∗K−1K T∗ (3.12)

The variance is used to calculated the 95% confidence interval of the prediction as±1.96p

var(y∗).This derivation was for the noise free case, but in the case of noise an additional term isintroduced and the prediction for a point becomes:

y∗ = K∗(K +σ2n I)−1y (3.13)

where σ2n represents the variance of the noise. This prediction is equivalent to kernel ridge

regression introduced in the previous section. GPR is closely related, because it is a bayesiangeneralisation of kernel ridge regression, which gives GPR the benefits of the bayesianframework, such as the ability to estimate the uncertainty in a prediction through the vari-ance. Other methods, such as support vector regression do not have this benefit.

The kernel used to create the covariance matrix may have hyperparameters, such as thelength-scale for the SE kernel discussed in 2.2.2. These hyperparameters, denoted by θcan be chosen by taking the derivative of the marginal log likelihood with respect to thehyperparameters and using any gradient based optimisation algorithm to find the maximummarginal log likelihood. The marginal log likelihood is given by [38, ch. 5]:

logp(y|X,θ ) = −12

y T K−1 − 12

log|K | − n2

log2π (3.14)

The Gaussian process implementation in this work uses the GPy [9] package for the Pythonprogramming language.

3.3.4 Support Vector RegressionSupport vector regression (SVR) is based on support vector machines (SVM) which wereoriginally designed by Vapnik et al [49] [48] to solve pattern recognition problems. A briefintroduction to SVM theory and their application to classification problems will be presentedfirst, and then the theory extended into the realms of regression.

Considering first the simple case of a two dimensional space with linearly separable pointsbelonging to two different classes (a classification problem). The goal of a SVM is to finda straight line (a hyperplane) that will separate these points. Infinitely many hyperplanesexist that can separate the two classes, but which is the optimal choice? In SVM theory theoptimal hyperplane is the one that maximises the distance between the hyperplane and thenearest data points, which is known as the margin.

Mathematically, the hyperplane is defined by:

w · x+ b = 0 (3.15)

Where w is a vector of weights and b is a bias. We can find two hyperplanes such that thepoints are separated, and no points fall between the two (the dotted lines in Figure 3.6),such that:

27

x

y

w· x+

b =0w

· x+b =

1

w· x+

b =−1

2||w||

b||w||

w

Figure 3.6: Optimal separating hyperplane according to maximum margin in twodimensions with support vectors filled

w · xi + b ≥ 1 for yi = +1 (3.16)

w · xi + b ≤ 1 for yi = −1 (3.17)

Combining equations 3.16 and 3.17 gives:

yi(w · xi + b)− 1≥ 0 ∀i (3.18)

The region bounded by the two hyperplanes described in 3.16 and 3.17 is the margin dis-cussed above, and geometry can show that it is equal to 2

||w|| , which means that finding themaximum hyperplane is equivalent to:

minimise12||w||2 such that yi(w · xi + b)− 1≥ 0 ∀i (3.19)

Note that 12 ||w||2 was used here as a substitution for ||w||, which leads to the same solution

but provides the more familiar formulation that is solvable with quadratic programmingtechniques. This was a simple case where the data was separable, what happens when thedata is not separable so that it is not possible to find a separating hyerplane? In 1995 Cortesand Vapnik introduced the concept of a soft margin [18] which allows for some misclassi-fications. Using a soft margin, a slack variable ξi is introduced to measure the misclassifi-cation in the data, and an additional term is added to 3.19 to penalise misclassifications.The optimisation becomes a trade off between maximising the margin and minimising theerror penalty which is controlled by the parameter C . Equations 3.16 and 3.17 and 3.18become:

w · xi + b ≥ 1 for yi = +1− ξi (3.20)

w · xi + b ≤ 1 for yi = −1+ ξi (3.21)

28

minimise12||w||2 + C

L∑i=1

ξi such that yi(w · xi + b)− 1+ ξi ≥ 0 ∀i (3.22)

Regression using support vector machinesIn 1996 the SVM was extended to allow it to perform regression by Vapnik, Drucker, Burges,Kaufman and Smola in [22]. This involves a change to the penalty function that was intro-duced to allow for misclassifications. The penalty function in a regression SVM only appliesa penalty if the predicted value yi is greater than some distance ε away from the actualvalue, which creates a "tube" around the hyperplane called the ε-insensitive tube. Further,the penalty function gives one of two penalties to outputs that fall outside the tube depend-ing on whether they occur above (ξ+) or below (ξ−) the tube. This type of regression usingsupport vector machines is known as ε-support vector regression. There are other methodsof doing regression with support vector machines, such as least squares [46].

With the regression penalty function, equation 3.22 becomes:

min12||w||2 + C

L∑i=1

(ξ+i + ξ−i ) (3.23)

subject to the constraints that ξ+i ≥ 0,ξ−i ≥ 0∀i and t i ≤ yi + ε+ ξi , t i ≥ yi − ε− ξi wheret i is the actual value and yi is a predicted value. In other words, a penalty is only appliedif the value falls outside the ε-insensitive tube and the penalty is different depending onwhether the point is above or below the tube [43].

x

yy

y +ε+

y −ε−

ξ+

ξ−

Figure 3.7: Visualisation of ε support vector regression

The parameters C and ε are tuneable and they determine how harshly the SVM should treaterrors. These will be relevant in the later sections where the implementation of the SVMfor regression is made. The optimisation problem of 3.23 can be solved using Lagrangianmultipliers, giving a solution of:

f (x) =N∑

i=1

(αi −α∗i )κ(x i , x) + b (3.24)

where αi and α∗i are Lagrangian multipliers, and κ is the kernel.

29

The solution to the SVR is sparse in the input space, the result depends only on a subset oftraining data (the support vectors).

The implementation of SVR used in this project is based on the open source scikit-learn [36]package for the Python programming language.

30

Chapter 4

Experiments

This chapter describes the procedures used to conduct the experiments, and presents theresults and a discussion.

4.0.5 Models

The experiments were conducted using two different models with different features. Thebasis for this was discussed in Section 2.2.3.

The first was a nonlinear autoregressive model, which will be referred to as the NAR model.It used lagged values of the consumption for the previous 24 hours as a feature set {LAG01,LAG02, ..., LAG24} which were chosen to capture the autocorrelation structure shown inFigure 3.4. Because not all of the lagged values are available for each step in the forecasthorizon, a recursive forecasting strategy was implemented. The second model, referred toas the TIME model, uses features derived from the timestamp of the electricity consump-tion. The feature set is {HR, WK, WKD, ISWORKDAY}. The two models are summarised inTable 4.1.

Model Name FeaturesNAR {LAG01, LAG02, ..., LAG24}TIME {HR, WK, WKD, ISWORKDAY}

Table 4.1: Features used by each of the models

The experiments were conducted on a 1.8GHz Intel i5 with 8GB of memory.

4.1 Validation and learning procedure

This section describes the method of searching for hyperparameters, performing cross vali-dation, and evaluating the performance of each of the implementations.

To evaluate the models a rolling-origin validation procedure with fixed training size (Al-gorithm 1) was used. This procedure was introduced in Section 2.2.5. According to theconstraints of the problem discussed in Section 1.2 the training set size was fixed at 335observations (two weeks), and the forecast horizon was set at 24 points. The size of thedata set, the training set, and the forecast horizon meant that 315 models were built andeach made a 24 point forecast.

31

Algorithm 1 Rolling-origin with fixed training set size

train size← 336test size← 24step← 24for i← train size, |D| do

train←Di∀i ∈ {i..., train size+ i}test←Di∀i ∈ {train size+ i + 1..., train size+ i + 1+ test size}<build model using training set, evaluate and record performance using test set>i← i + step

If the algorithm implementation involved determining some optimal hyperparameters a gridsearch was executed with parameters likely to contain optimal values for the hyperparam-eters based on knowledge of the data and the theory behind the algorithms described inSection 3.3. The grid search used a rolling-origin procedure for cross validation with anincreasing training window size (Algorithm 2) and a minimum training window of 96 ob-servations. Because of the size of the training set this resulted in a 10 fold cross valida-tion.

Algorithm 2 Rolling-origin cross validation with increasing training set size

train size← 96test size← 24step← 24for i← train size, |D| do

train←Di∀i ∈ {1..., train size+ i}test←Di∀i ∈ {train size+ i + 1..., train size+ i + 1+ test size}<build model using training set, evaluate and record performance using test set>i← i + step

4.2 k-Nearest Neighbour Regression

The parameter k was selected using a grid search over the values in Table 4.2. The k val-ues for the grid search correspond to the number of days in the working week and a fullweek and multiples thereof, which should provide a good criteria for the search. Two dif-ferent weighting functions were used, one that provides a uniform weighting (denoted asUniform) to all of the nearest neighbours, and one that weights the contribution accordingto the distance from the test point (denoted as Distance). The weighting function for thecontribution from each of the k-nearest neighbours was also included in the grid search.The results from the k-NNR algorithm are shown in Table 4.3 with the best results for eachset size and house type highlighted in grey.

Hyperparameter Valuesk 5, 7, 10, 14, 15, 20, 21, 25, 28Weighting Uniform, Distance

Table 4.2: Grid search values for the k-NNR implementation

Figure 4.1 shows that for both apartments and houses the value of k tends to decrease as thenumber of households in the set increases. The smoother electricity consumption patternreduces the advantages of averaging over many training examples.

32

(a) Apartments (b) Houses

Figure 4.1: Values of k by housing type for different set sizes. Each colour represents adifferent value for the set size, and the numbers used are shown in the legend.

MASE sMAPE (%)Type Model Set size Avg Max Min Avg Max Min

Apartment

NAR

5 1.07 3.54 0.40 27.66 61.57 10.4910 1.02 2.98 0.44 21.51 56.29 10.7825 0.98 2.56 0.38 15.25 34.74 5.7629 (All) 0.96 2.47 0.38 14.32 32.86 6.75

TIME

5 0.97 3.32 0.38 24.91 50.58 12.6210 0.94 2.74 0.42 19.49 42.33 9.6525 0.90 2.40 0.33 14.05 31.51 5.8229 (All) 0.87 2.41 0.35 13.06 32.14 6.01

House

NAR

5 1.23 2.90 0.44 28.29 52.30 10.2510 1.15 2.54 0.51 17.54 35.84 7.1525 1.14 2.62 0.56 12.78 27.21 5.9364 (All) 1.16 2.82 0.35 10.46 23.89 3.94

TIME

5 1.17 2.71 0.51 26.97 56.67 8.8610 1.12 2.82 0.44 16.93 34.84 5.1725 1.06 2.73 0.49 11.99 26.32 4.8464 (All) 1.13 3.96 0.43 10.18 27.51 3.68

Table 4.3: Results for implementation of k-NNR algorithm

The difference between k-NNR and the other methods implemented in this work was high-lighted in Section 3.3.1. Because it is a local learning method it makes weaker assumptionsabout the data, and does not try and discover an underlying function responsible for gener-ating the data across the entire input domain. For this reason, it should perform better thanthe other algorithms with small set sizes. As the set size increases and the pattern becomesmore regular it becomes more valid to make stronger assumptions about the data.

The k-NNR algorithm is very fast to fit, which is a consequence of the small training datasize. The number of examples it has to examine to determine the nearest neighbours issmall.

4.3 Kernel Ridge Regression

The hyperparameters for KRR were selected using a SE kernel with hyperparameters thatwere selected using a grid search over the values in Table 4.5. The results from the KRRalgorithm are shown in Table 4.6 with the best results for each set size and house typehighlighted in grey.

33

Model Type Average model fit time (s) Average forecast time (s)

NARApartment 0.154 0.008House 0.168 0.008

TIMEApartment 0.133 < 0.001House 0.139 < 0.001

Table 4.4: Model fit and forecast times for k-NNR algorithm

Hyperparameter Valuesλ 0.0001, 0.001, 0.1, 1ℓ 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100

Table 4.5: Grid search values for KRR implementation


Apartment

NAR

5 1.13 3.62 0.43 29.62 117.80 14.2310 1.07 2.93 0.46 22.72 54.26 10.6325 1.04 5.05 0.45 16.65 128.75 6.9229 (All) 1.01 3.43 0.42 15.36 48.94 7.24

TIME

5 1.09 3.52 0.47 28.68 82.36 14.0610 1.02 3.20 0.39 21.52 50.38 10.2925 1.00 2.65 0.40 15.92 38.50 7.3729 (All) 0.97 2.66 0.44 14.87 33.28 6.96

House

NAR

5 1.24 3.09 0.39 28.38 60.69 9.3410 1.21 3.23 0.49 18.35 81.92 7.5325 1.13 4.63 0.49 12.59 33.97 4.8064 (All) 1.14 3.28 0.40 10.28 32.19 3.37

TIME

5 1.23 2.78 0.36 28.25 57.57 10.9110 1.18 2.99 0.53 17.91 42.55 6.9425 1.17 3.50 0.45 13.29 39.56 4.6964 (All) 1.26 4.09 0.42 11.55 39.39 4.29

Table 4.6: Results for implementation of KRR algorithm

In all cases the forecasts for KRR are less accurate than for k-NNR. Comparing the exampleforecasts in Figure B.2 to the predictions for k-NNR in Figure A.2 the forecasts follow asmoother fit because of the KRR algorithm attempting to fit a function to the data, ratherthan just averaging like k-NNR. Perhaps the only advantage of KRR for this application isthat it is the simplest of the global learning models implemented here, and is more easilyunderstandable than GPR and SVR.

Model Type Average fit time (s) Average forecast time (s)



Table 4.7: Model fit and forecast times for KRR algorithm

34

4.4 Gaussian Process Regression

The GPR algorithm was implemented using the three kernels described in Section 2.2.2.The hyperparameters of the kernels were selected through an optimisation process usingthe Broyden-Fletcher-Goldfarb-Shanno (BGFS) algorithm [17].

The results from the GPR algorithm are shown in Table 4.8 with the best results for each setsize and house type highlighted in grey. In some of the experiments no results were availablebecause the optimisation routine of the software package that was used failed to find any pa-rameters for the kernels. The optimisation process used to find the optimal hyperparametersin Gaussian processes is typically a gradient descent algorithm that optimises the marginallikelihood p(y|X ,θ ). This optimisation problem is non-convex so there is no guarantee ofconvergence. Further details on the optimisation problem can be found in [38]. Because ofthe failure of the optimisation routine the hyperparameters were manually set and reason-able results were achieved, but the results were excluded due to the imperfect optimisation.The result for the 25 house set with the M32 kernel shows that the optimisation routinefailed for some models because of the high maximum errors for MASE and sMAPE. Thereason that the M32 kernel was able to produce more results under optimisation comparedwith the SE kernel is due to the less rigorous smoothness constraints of the kernel (this isalso demonstrated by the exponential kernel never failing during optimisation).

The exponential kernel performed the best for GPR, while the SE kernel performed the worstand the M32 kernel fell in between the two. This speaks to the irregularity of the data, andthe need for the use of a kernel that does not have smoothness constraints like the SE kernel.The result for the M32 kernel is expected given the performance of the other two, becauseit falls between them in terms of the smoothness of the functions that it generates (see forexample the sampled functions from the Gaussian process prior in Figure 3.5).

It was shown in Section 3.3.3 that GPR is a bayesian generalisation of KRR, and so whenusing the same kernel they should produce similar results, which they do. In the case of theNAR model for apartments the results are extremely similar. For KRR and GPR to produceexactly the same results a more thorough search over the hyperparameters for KRR wouldbe required. Because the exponential kernel was more effective than the SE kernel forGPR, it can be said that the exponential kernel would make the KRR implementation moreeffective.

The exponential kernel nearly matches the performance of the k-NNR algorithm, and beatsit when the set consists of 64 houses. This is an important result, because it shows thatin the case of many households, the electricity consumption pattern starts to become moreregular, and a benefit can be gained from assuming an underlying function in the data. Itis likely that a similar result would have been apparent for apartments if the data set hadcontained more apartments. It can be seen that with 29 apartments the accuracy of GPRcomes close to k-NNR. This also confirms the result seen by Humeau et al. in [27] wheremethods that rely on an internal structure in the data begin to outperform other methods asthe aggregate consumption from more households are considered. For the implementationof the prediction system it is likely that GPR would be more effective than k-NNR for largesets like neighbourhoods or cities.

35

MASE sMAPE (%)Type Model Set size Kernel Avg Max Min Avg Max Min

Apartment

NAR

5 κEX P 1.11 3.48 0.42 28.50 55.86 13.0110 κEX P 1.05 2.99 0.49 21.87 47.10 12.0525 κEX P 0.99 2.54 0.37 15.46 36.47 6.5229 (All) κEX P 0.96 2.44 0.38 14.41 33.17 6.485 κSE 1.13 3.67 0.42 29.31 82.51 14.8710 κSE 1.07 3.10 0.45 22.65 57.94 11.4925 κSE 1.03 2.94 0.47 16.34 41.68 7.3629 (All) κSE 1.02 8.04 0.38 15.87 159.83 6.635 κM32 1.11 3.55 0.43 28.56 56.41 14.2710 κM32 1.04 2.85 0.45 21.81 49.20 11.5025 κM32 0.98 2.52 0.45 15.31 36.59 7.2329 (All) κM32 0.95 2.41 0.41 14.37 32.83 7.35

TIME


House

NAR

5 κEX P 1.24 3.77 0.38 28.27 59.33 11.8110 κEX P 1.16 2.59 0.44 17.63 48.39 7.1825 κEX P 1.10 2.75 0.45 12.35 27.91 4.9564 (All) κEX P 1.09 2.26 0.46 9.68 19.15 3.925 κSE 1.26 4.97 0.36 28.83 99.84 11.4310 κSE N/A N/A N/A N/A N/A N/A25 κSE N/A N/A N/A N/A N/A N/A64 (All) κSE N/A N/A N/A N/A N/A N/A5 κM32 1.23 3.24 0.37 28.06 65.87 9.7110 κM32 1.14 2.66 0.44 17.27 51.47 6.8625 κM32 1.17 10.84 0.50 13.71 200.00 4.7864 (All) κM32 N/A N/A N/A N/A N/A N/A

TIME


Table 4.8: Results from forecasts using GPR with three different kernels

36

Overall, GPR performed well and the ability of GPR to optimise the hyperparameters of thekernel is an advantage, particularly in small data sets where it enables a good trade offbetween fitting the data and smoothing. Additionally the probabilistic framework and theability of the algorithm to return an uncertainty measurement along with a prediction is a bigadvantage. The plots shown in Appendix C use the variance returned with the prediction toplot a 95% confidence interval. The results also highlight the importance of using a kernelthat is appropriate for the data and shows that it is worthwhile to explore other possiblekernels, rather than always relying on the SE kernel which is the default choice in manyalgorithm implementations.

Model House type Average model fit time (s) Average forecast time (s)



Table 4.9: Model fit and forecast times for GPR algorithm

4.5 Support Vector RegressionSVR was implemented using a SE kernel with hyperparameters that were selected using agrid search over the values in Table 4.10. The results of the SVR implementation are shownin Table 4.11 with the best results for each set size and house type highlighted in grey.

Hyperparameter ValuesC 1, 10, 100ℓ 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100ε 0, 0.01, 0.1, 0.5, 1, 2, 4

Table 4.10: Grid search values for SVR implementation

In general, SVR performs slightly better than KRR, but not as effectively as GPR. The majordifference between SVR and KRR/GPR is shown in the optimisation problem. The optimisa-tion problem of SVR yields a sparse solution. This has consequences for the computationalefficiency of the method. In theory, the sparse solution should take longer to find but pre-dictions should be made faster when the solution is known. The benchmarks in Table 4.12confirm that SVR is the slowest algorithm for building the model, but the advantages of thesparse solution were not apparent in the prediction speeds because of the small training setsize. The benefit would become clear if the training data was an order of magnitude larger.Given that this was the primary advantage that SVR had over GPR and KRR it seems clearthat GPR is the preferable solution for the implementation considered.

4.6 Discussion of results

For all of the algorithms the results show a decreasing error as the set size increases, whichwas expected due to the more regular consumption pattern. Apartments tended to have alower MASE but a higher sMAPE. As was expected, the lower occupancy patterns of apart-ments led to more random consumption patterns, making the scaling factor of the MASEmeasurement larger (see Section 2.2.6) due to the poor performance of the naive forecast-ing method on the in-sample data. The use of both a relative measure and a percentagebased error was beneficial to the understanding of the results.

37


Apartment

NAR

5 1.07 3.92 0.37 27.56 62.72 11.7410 1.04 3.27 0.43 21.68 51.59 10.3125 1.00 2.52 0.32 15.73 34.04 6.8329 0.98 2.40 0.34 14.77 31.40 6.19

TIME

5 1.06 3.70 0.44 27.23 58.14 13.9910 0.98 3.20 0.42 20.50 50.54 11.2325 0.96 2.82 0.35 15.00 36.15 5.2429 (All) 0.94 2.56 0.41 14.20 31.63 6.25

House

NAR

5 1.22 6.16 0.30 27.67 81.71 9.6710 1.15 2.61 0.54 17.50 39.36 6.8925 1.12 2.61 0.52 12.59 26.92 4.7964 (All) 1.13 2.51 0.42 10.11 21.86 3.20

TIME

5 1.19 3.08 0.35 27.45 56.53 10.1410 1.15 3.17 0.51 17.42 44.60 6.8225 1.11 2.95 0.50 12.57 35.90 5.3664 (All) 1.15 4.07 0.47 10.41 29.08 3.33

Table 4.11: Results from forecasts using the SVR algorithm

Model Type Average model fit time (s) Average forecast time (s)


TIMEApartment 9.735 0.001House 3.159 0.001

Table 4.12: Model fit and forecast times for SVR algorithm

In Appendices A, B, C, and D the plots of forecasted values against actual values shows thatall of the algorithms tend to forecast values that are less than the actual values. Based on theplots, this effect is worse for apartments and primarily occurs for the lowest consumptionvalues. A possible explanation of this is that during the period when the occupants of anapartment are not active the consumption is caused by the base load which does not varywith time, and fitting a function over the consumption pattern may lead it to underestimatethe base load at times due to the smoothness constraints of the function. The underesti-mation is worse for apartments than houses, and this may be because when the occupantsare not active in a house a seasonal and base load (see Sections 2.1 and Sections 3.2.2) arepresent, leading to a smoother consumption pattern.

In all cases the TIME model performed better than the NAR model. The aggregation of errorsover each step of the forecast causes the NAR model to perform poorly overall, but it wasexperimentally confirmed that the NAR model performs significantly better than the TIMEmodel when making 24 one-step predictions without the use of the previously predictedvalues in the feature vectors. This is not possible in practice for reasons discussed in 2.2.3.It may be possible to improve the results of the NAR model by using another forecastingstrategy, such as the direct strategy discussed in 2.2.3.

The TIME model generally allowed for faster model creation times and faster predictions dueto the lower dimensionality of the feature set. The benchmarks of each of the algorithmswere largely dependent on the size of the grid search, which was larger than necessarybecause the same grid search was used for both the NAR and TIME models. The NAR model

38

required much lower values for the length-scale of the kernel ℓ because of the characteristicsof the data. If only a single model was used, the width of the grid search could be narrowedto improve the speed of the learning process, however the best results were achieved bythe algorithms that didn’t use a grid search (GPR) or didn’t require a wider grid search forkernel hyperparmeters (k-NNR).

4.6.1 Implementation considerationsIn Section 1.1.1 an implementation of the forecasting system within the application wasintroduced. Therefore, it is important to point out any limitations and requirements thatstem from the use of these algorithms. Two feedback mechanisms were discussed that wouldbe based on the forecasting system: producing plots of forecasts for comparison with actualconsumption, and notifications about the performance of the set.

Plotting the forecasts for comparison with actual consumption is not so problematic, if theresults are clearly wrong the user is likely to ignore them. Although this is not an idealsituation, it is less troublesome than the use of notifications that are triggered through acomparison of the forecast and the actual consumption, which has the potential to annoyusers. Of particular concern are the large maximum errors for the smaller set sizes, whichare over 50% for the 5 household sets. Some further investigation could be done into thesituations that lead to these large errors, as they might be caused by public holidays. If thatwere found to be true, the mobile application could use a heuristic to prevent delivery offeedback to users based on the forecasting algorithm on public holidays.

As discussed earlier in this section, the algorithms generally forecast under the actual con-sumption values. This should be kept in mind for the implementation, but it can be seenas a positive. Section 2.2.6 showed that the sMAPE penalises forecasts less than the actualvalue more harshly than forecasts that were greater than the actual value. This allows anincreased confidence when triggering notifications to users based on perceived decrease intheir consumption against the forecast consumption. Threshold values for triggering notifi-cations in the actual implementation should be adjusted accordingly.

In general, the errors are significantly higher for the smaller set sizes, and it may be moreeffective to use large predefined sets such as neighbourhoods or cities rather than allow-ing users to form their own sets to ensure some minimum set size where the errors areacceptable. It is unlikely, for example, that users will form a set containing more than 25households on their own.

The computational tractability of each of the algorithms with a large number of sets is un-likely to be a consideration for the implementation. It could become an issue in the case ofSVR, but since it offers no advantages and other algorithms achieve better accuracy it is nota candidate for the implementation.

39

Chapter 5

Conclusions

This thesis presents the evaluation of kernel methods for short term forecasting of the ag-gregate electricity consumption for varying sized sets of households. The evaluation wasconducted for the implementation of a forecasting system in a mobile application to pro-vide users with a benchmark value to compare their electricity consumption to. The pur-pose of the system, and the mobile application in general, is to provide feedback about theusers electricity consumption to assist them in reducing their household electricity consump-tion.

Each kernel method was experimentally tested with different sized sets of households. Twoerror measurements were used to evaluate the algorithm, one commonly used percentagebased measurement (sMAPE), and a newer measurement that provides a built in compari-son with a naive forecast (MASE). The methods were also evaluated for speed by a simplebenchmark of the time taken to build the models and the time for prediction.

The work considered one local learning method (k-NNR) and three global learning methods(KRR, GPR, and SVR). It was found that the local learning method was the most effectivefor small set sizes because it does not assume any underlying function in the data. Fittinga function becomes more advantageous at larger set sizes, which was shown when GPRoutperformed other methods when forecasting for the largest set size (64 houses) in thedata set.

These results show that for the implementation of the system considered in Section 1.1.1 k-NNR will be the most effective for small sets, like a set of all the apartments in an apartmentbuilding, and GPR will be most effective for larger scale sets like neighbourhoods, suburbs,or cities. Although similar accuracy could be achieved through the use of KRR instead ofGPR, the advantages of the probabilistic framework of GPR make it a better candidate forimplementation.

5.1 Future work

This work considered apartments and houses separately due to their different characteris-tics, and it would be interesting to consider the effects that sets containing both apartmentsand households would have on the algorithms. Further, weather data could also be intro-duced to see its impact.

The implementation of the NAR model in this work used an iterative method to achievea multi-step prediction. Several alternatives to the iterative method have been proposedthat reduce the effect of errors aggregating over each step. The direct method could be

41

implemented which may perform better. An excellent overview of multi-step forecastingmethods is available in [14].

The implementation of each of the methods could also be improved. The KRR and SVRalgorithms could be improved through the implementation of an exponential kernel, butGPR offers more beneficial properties than these two algorithms so there isn’t any reason todo this. The results achieved by GPR may improve with the use of composite kernels to bettercapture the periodicity in the data, for example the product of a periodic and exponentialkernel. It is likely that the advantages of composite kernels would become more apparentas the set size gets larger (> 100).

There are a host of other methods that may be effective for this application. Neural networksare one method that may provide better results, as they are popular in the literature and inforecasting competitions. The traditional statistical methods for forecasting such as ARMAand ARIMA are also an avenue to explore.

42

Bibliography

[1] Energy Policies of IEA Countries, Sweden, 2013 Review. https://www.iea.org/

publications/freepublications/publication/Sweden2013_free.pdf.Accessed: 2015-02-15.

[2] Final energy consumption by sector and fuel (CSI 027/ENER 016) - Assessment pub-lished Jan 2015. http://www.eea.europa.eu/data-and-maps/indicators/

final-energy-consumption-by-sector-8/assessment-2. Accessed: 2015-05-01.

[3] The Google Geocoding API. https://developers.google.com/maps/

documentation/geocoding/. Accessed: 2015-04-12.

[4] Residential energy consumption survey (RECS) - Energy Information Administration.http://www.eia.gov/consumption/residential/. Accessed: 2015-04-12.

[5] The smart meter revolution: Towards a smarter future.https://m2m.telefonica.com/multimedia-resources/

the-smart-meter-revolution-towards-a-smarter-future. Accessed:2015-02-10.

[6] Global learning vs. local learning. In Machine Learning, Advanced Topics in Scienceand Technology in China, pages 13–27. Springer Berlin Heidelberg, 2008.

[7] Nesreen K Ahmed, Amir F Atiya, Neamat El Gayar, and Hisham El-Shishiny. An empir-ical comparison of machine learning models for time series forecasting. EconometricReviews, 29(5-6):594–621, 2010.

[8] J Scott Armstrong and Long-Range Forecasting. From crystal ball to computer. NewYork ua, 1985.

[9] The GPy authors. GPy: A gaussian process framework in python. http://github.com/SheffieldML/GPy, 2012–2014.

[10] Philip Babcock, Kelly Bedard, Gary Charness, John Hartman, and Heather Royer. Let-ting down the team? evidence of social effects of team incentives. Technical report,National Bureau of Economic Research, 2011.

[11] David Barber, A Taylan Cemgil, and Silvia Chiappa. Bayesian time series models. Cam-bridge University Press, 2011.

[12] A-D Barbu, Nigel Griffiths, and Gareth Morton. Achieving energy efficiency throughbehaviour change: what does it take? 2013.

[13] Robert Bartels and Denzil G Fiebig. Residential end-use electricity demand: resultsfrom a designed experiment. The Energy Journal, pages 51–81, 2000.

43

https://www.iea.org/publications/freepublications/publication/Sweden2013_free.pdf

https://www.iea.org/publications/freepublications/publication/Sweden2013_free.pdf

http://www.eea.europa.eu/data-and-maps/indicators/final-energy-consumption-by-sector-8/assessment-2

http://www.eea.europa.eu/data-and-maps/indicators/final-energy-consumption-by-sector-8/assessment-2

https://developers.google.com/maps/documentation/geocoding/

https://developers.google.com/maps/documentation/geocoding/

http://www.eia.gov/consumption/residential/

https://m2m.telefonica.com/multimedia-resources/the-smart-meter-revolution-towards-a-smarter-future

https://m2m.telefonica.com/multimedia-resources/the-smart-meter-revolution-towards-a-smarter-future

http://github.com/SheffieldML/GPy

http://github.com/SheffieldML/GPy

[14] Souhaib Ben Taieb, Rob J Hyndman, and Gianluca Bontempi. Machine learning strate-gies for multi-step-ahead time series forecasting. 2014.

[15] Christoph Bergmeir and José M Benítez. On the use of cross-validation for time seriespredictor evaluation. Information Sciences, 191:192–213, 2012.

[16] Leo Breiman et al. Statistical modeling: The two cultures (with comments and arejoinder by the author). Statistical Science, 16(3):199–231, 2001.

[17] Charles George Broyden. The convergence of a class of double-rank minimizationalgorithms 1. general considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 1970.

[18] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning,20(3):273–297, 1995.

[19] Sarah Darby et al. The effectiveness of feedback on energy consumption. A Review forDEFRA of the Literature on Metering, Billing and direct Displays, 486:2006, 2006.

[20] Anibal De Almeida, Paula Fonseca, Barbara Schlomann, and Nicolai Feilberg. Charac-terization of the household electricity consumption in the eu, potential energy savingsand specific policy recommendations. Energy and Buildings, 43(8):1884–1894, 2011.

[21] Bing Dong, Cheng Cao, and Siew Eang Lee. Applying support vector machines to pre-dict building energy consumption in tropical region. Energy and Buildings, 37(5):545–553, 2005.

[22] Harris Drucker, Chris JC Burges, Linda Kaufman, Alex Smola, Vladimir Vapnik, et al.Support vector regression machines. Advances in neural information processing systems,9:155–161, 1997.

[23] Steven Firth, K Lomas, A Wright, and R Wall. Identifying trends in the use of do-mestic appliances from household electricity consumption measurements. Energy andBuildings, 40(5):926–936, 2008.

[24] Agathe Girard, Carl Edward Rasmussen, Joaquin Quinonero-Candela, and RoderickMurray-Smith. Gaussian process priors with uncertain inputs? application to multiple-step ahead time series forecasting. 2003.

[25] Trevor Hastie, Robert Tibshirani, Jerome Friedman, T Hastie, J Friedman, and R Tib-shirani. The elements of statistical learning, volume 2. Springer, 2009.

[26] Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods inmachine learning. The annals of statistics, pages 1171–1220, 2008.

[27] Samuel Humeau, Tri Kurniawan Wijaya, Matteo Vasirani, and Karl Aberer. Electricityload forecasting for residential customers: Exploiting aggregation and correlation be-tween households. In Sustainable Internet and ICT for Sustainability (SustainIT), 2013,pages 1–6. IEEE, 2013.

[28] Rob J Hyndman and Anne B Koehler. Another look at measures of forecast accuracy.International journal of forecasting, 22(4):679–688, 2006.

[29] Dejan Ilic, Per Goncalves da Silva, Stamatis Karnouskos, and Malte Jacobi. Impactassessment of smart meter grouping on the accuracy of forecasting algorithms. InProceedings of the 28th Annual ACM Symposium on Applied Computing, pages 673–679. ACM, 2013.

44

[30] Steve Lawrence, C Lee Giles, and Ah Chung Tsoi. Lessons in neural network training:Overfitting may be harder than expected. In AAAI/IAAI, pages 540–545, 1997.

[31] James McNames. A nearest trajectory strategy for time series prediction. In Pro-ceedings of the International Workshop on Advanced Black-Box Techniques for NonlinearModeling, pages 112–128. KU Leuven Belgium, 1998.

[32] Efstathios E Michaelides. Alternative energy sources. Springer Science & Business Me-dia, 2012.

[33] Marius Muja and David G Lowe. Fast approximate nearest neighbors with automaticalgorithm configuration. VISAPP (1), 2, 2009.

[34] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[35] Ajoy K Palit and Dobrivoje Popovic. Computational intelligence in time series forecasting:theory and engineering applications. Springer Science & Business Media, 2006.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011.

[37] Jurek Pyrko. Am i as smart as my smart meter is?–swedish experience of statisticsfeedback to households. In Proceedings of the ECEEE, pages 1837–1841, 2011.

[38] Carl Edward Rasmussen. Gaussian processes for machine learning. Citeseer, 2006.

[39] Nicholas I Sapankevych and Ravi Sankar. Time series prediction using support vectormachines: a survey. Computational Intelligence Magazine, IEEE, 4(2):24–38, 2009.

[40] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learn-ing algorithm in dual variables. In (ICML-1998) Proceedings of the 15th InternationalConference on Machine Learning, pages 515–521. Morgan Kaufmann, 1998.

[41] Mark Saunders, Philip Lewis, and Adrian Thornhill. Research methods for businessstudents. Pearson Education Limited, fifth edition, 2011.

[42] Aaron Smith. Smartphone ownership - 2013 update. Pew Research Center: WashingtonDC, 12, 2013.

[43] Alex J Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statis-tics and computing, 14(3):199–222, 2004.

[44] Alex J Smola, Bernhard Schölkopf, and Klaus-Robert Müller. The connection betweenregularization operators and support vector kernels. Neural networks, 11(4):637–649,1998.

[45] Michael L Stein. Interpolation of spatial data: some theory for kriging. Springer Science& Business Media, 1999.

[46] Johan AK Suykens and Joos Vandewalle. Least squares support vector machine classi-fiers. Neural processing letters, 9(3):293–300, 1999.

[47] Leonard J Tashman. Out-of-sample tests of forecasting accuracy: an analysis andreview. International Journal of Forecasting, 16(4):437–450, 2000.

[48] V Vapnik and A Chervonenkis. On a perceptron class. Automation and Remote Control,25:112–120, 1964.

45

[49] Vladimir Vapnik. Pattern recognition using generalized portrait method. Automationand remote control, 24:774–780, 1963.

[50] Bo-Suk Yang, Andy Chit Chiow Tan, et al. Multi-step ahead direct prediction for themachine condition prognosis using regression trees and neuro-fuzzy systems. ExpertSystems with Applications, 36(5):9378–9387, 2009.

[51] Kai Yu, Liang Ji, and Xuegong Zhang. Kernel nearest-neighbor algorithm. NeuralProcessing Letters, 15(2):147–156, 2002.

[52] G Peter Zhang. Neural networks for time-series forecasting. In Handbook of NaturalComputing, pages 461–477. Springer, 2012.

List of Abbreviations

AR Autoregressive

GPR Gaussian process regression

k-NNR k-Nearest neighbour regression

KRR Kernel ridge regression

NAR Nonlinear autoregressive

SE Squared exponential (kernel)

SVR Support Vector Regression

List of Mathematical Symbols

⟨·, ·⟩ notation for an inner product

D dimension of the input space X

D a data set

GP Gaussian process: f ∼ GP(m(x), k(x,x′))

Γ the Gamma function

k the k parameter in k-nearest neighbour regression

κ a kernel function

κ(x,x′) covariance or kernel function evaluated at x and x

′

K or K(X , X ) n × n covariance (or Gram) matrix

K∗ n × n∗ matrix K(X , X∗), the covariance between training and test cases

Kv the modified Bessel function of the second kind

ℓ characteristics length-scale of kernel

ϕ feature map

X input space

X D × n matrix of training inputs, also called the design matrix

X∗ D × n∗ matrix of the test inputs

y 1 × n vector of the training target variables

yi ith training target variable

y∗ 1 × n vector of the test target variables

Y output space

49

Appendices

These appendices include selected plots from the implementations including plots of exam-ple predictions. The predictions were made for a four day period using the methodologydescribed in Section 4. This means that each plot shows the output of four models witheach being trained on the preceding two weeks of consumption data.

Appendix A

Plots for k-Nearest NeighbourRegression implementation

(a) 5 apartments (b) 29 apartments

(c) 5 houses (d) 64 houses

Figure A.1: Example forecasts using k-NNR over a four day period for the NAR model

3



Figure A.2: Example forecasts using k-NNR over a four day period for the TIME model

4

(a) All apartments NAR model (b) All apartments TIME model

(c) All houses NAR model (d) All houses TIME model

Figure A.3: Scatter plots of observed vs. forecast values for k-NNR implementation

5

Appendix B

Plots for Kernel Ridge Regressionimplementation



Figure B.1: Example forecasts using KRR over a four day period using the NAR model

7



Figure B.2: Example forecasts using KRR over a four day period using the SE kerneland the TIME model

8



Figure B.3: Scatter plots of observed vs. forecast values for KRR implementation usingthe SE kernel

9

Appendix C

Plots for Gaussian ProcessRegression implementation



Figure C.1: Example forecasts for the GPR implementation using the exponentialkernel and the NAR model with 95% confidence interval

11



Figure C.2: Example forecasts for the GPR implementation using the exponentialkernel and the TIME model with 95% confidence interval

12



Figure C.3: Scatter plots of observed vs. forecast values for the GPR implementationusing the exponential kernel

13

Appendix D

Plots for Support Vector Regressionimplementation



Figure D.1: Example forecasts for the SVR implementation over a four day periodusing the SE kernel and the NAR model

15



Figure D.2: Example forecasts for the SVR implementation over a four day periodusing the SE kernel and the TIME model

16



Figure D.3: Scatter plots of observed vs. forecast values for the SVR implementationusing the SE kernel

17

www.kth.se

Forecasting hourly electricity consumption for sets of ...927793/FULLTEXT01.pdfintermittent and...

Documents

Transcript of Forecasting hourly electricity consumption for sets of ...927793/FULLTEXT01.pdfintermittent and...