A machine-learning approach to estimating the …...A machine-learning approach to estimating the...

IN DEGREE PROJECT MATHEMATICS,SECOND CYCLE, 30 CREDITS

, STOCKHOLM SWEDEN 2018

A machine-learning approach to estimating the performance and stability of the electric frequency containment reserves

HENRIK EKESTAM

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ENGINEERING SCIENCES

A machine-learning approach to estimating the performance and stability of the electric frequency containment reserves HENRIK EKESTAM Degree Projects in Optimization and Systems Theory (30 ECTS credits) Degree Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2018 Supervisor at Svenska Kraftnät: Andreas Westberg Supervisor at KTH: Anders Forsgren Examiner at KTH: Anders Forsgren

TRITA-SCI-GRU 2018:281 MAT-E 2018:63

Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Abstract

The stability and reliability of the power system is of utmost importance, with one measure being thefrequency quality. For a number of years, the frequency quality has been decreasing in the Nordicsynchronous area. The Revision of the Nordic Frequency Containment Process project has introduceda proposed set of pre-qualification requirements to ensure the stability and performance of frequencycontainment reserves. The purpose of this thesis has been to examine the potential of complementing theevaluation of the requirements through the use of machine-learning methods applied to signals sampledduring normal operation of a power plant providing frequency containment. Several simulation modelshave been developed to generate such signals with the results fed into five machine-learning algorithms forclassification: decision tree, adaboost of decision tree, random forest, support vector machine, and a deepneural network. The results show that on all of the simulation models it is possible to extract informationregarding the stability and performance while with high accuracy preserving the distribution of physicalparameters of the approved samples. The conclusion is that machine-learning methods can be used toextract information from operation signals and that further research is recommended to determine howthis could be put to practice and what precision is needed.

Sammanfattning

Stabilitet och palitlighet hos kraftsystemet ar av yttersta vikt, med frekvenskvaliteten som en indikator.Under ett antal ar har frekvenskvaliteten sjunkit inom det nordiska synkronomradet. Projektet TheRevision of the Nordic Frequency Containment Process har foreslagit nya pre-kvalificeringskrav syftandestill att sakerstalla stabilitet och prestanda hos frequency containment reserves. Syftet med detta exa-mensarbete har varit att utforska mojligheterna att komplettera utvarderingen av dessa krav genom attanvanda masininlarningsmetoder applicerade pa signaler hamtade fran normal drift av ett kraftverk somlevererar frequency containment. Flera simuleringsmodeller har utvecklats for att generera sadana signalersom sedan har analyserats av fem olika maskininlarningsmetoder for klassificering: beslutstrad, adaboostav beslutstrad, random forest, stodvektormaskin samt ett djup neuralt natverk. Resultaten visar att detfor samtliga simuleringsmodeller har varit mojligt att extrahera information kring stabilitet och prestandaoch samtidigt med hog noggrannhet bevara fordelningen av fysikaliska parametrar hos godkanda prover.Slutsatsen ar att maskininlarningsmetoder kan anvandas for att extrahera information fran driftsignalersamt att fortsatta undersokningar rekommenderas for att avgora hur denna information kan anvandaspraktiskt, och vilken precision i bedomningarna som da skulle kravas.

i

Contents

1 Introduction 11.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Limitations in scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4 Structure of the report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Theory - Frequency containment reserves 32.1 Power usage and frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Frequency containment reserves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Pre-qualification requirements for FCR-N . . . . . . . . . . . . . . . . . . . . . . . 42.2.2 Pre-qualification procedure for FCR-N . . . . . . . . . . . . . . . . . . . . . . . . . 52.2.3 Simulation models for hydro power plants . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Per unit scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3.1 Machine base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3.2 FCP base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.3.3 Conversion table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Theory - Machine learning 103.1 Introduction to machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Estimating the quality of classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.1 Bias – Variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2.2 Training, validation and test sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2.3 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.3 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3.1 Decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3.2 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.3.3 Adaboost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.3.4 Support vector machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.3.5 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Methodology 214.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 Key performance indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2.1 Time domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.2.2 Frequency domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.2.3 Mixed domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.4 Data management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.4.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.4.2 Scaling and centring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.5 Evaluation of classification quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.5.1 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.5.2 Accuracy assessment table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.5.3 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.5.4 Classifier parameter distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5 Linear model 315.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.2.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325.2.2 Comparison of classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345.2.3 Confusion matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355.2.4 Classifier parameter distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3 Discussion of model and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

ii

6 Non-linear model 386.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39



7 Non-linear model with noise 457.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46



8 Sub-sampled non-linear model with noise 528.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

8.2.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528.2.2 Comparison of classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538.2.3 Classifier parameter distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


9 Discussion 579.1 Suggestions for further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

10 Conclusions 61

11 Literature 62

Appendix A Code libraries 63

Appendix B Full feature selection 64B.1 Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64B.2 Non-linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65B.3 Non-linear model with noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

iii

List of symbols

Symbol Physical unit Description

b % Backlash

C MW Available FCR-N capacity for a power plant

df Hz Scale factor for normal frequency band

dP MW Scale factor total FCR-N capacity

2D s Estimated backlash

∆f Hz Frequency deviation from nominal

∆P MW Delivered FCR-N

∆t s Sample interval

epHz% Droop, inverse static gain

Ek J Kinetic energy

ϕ ◦ Phase angle

f Hz Grid frequency

fn Hz Nominal grid frequency

F Transfer function of controller

G Transfer function of system

H s Inertia constant

J kg ·m2 Moment of inertia

k %Hz Load frequency dependency

Ki s−1 Integral gain of controller

Kp 1 Proportional gain of controller

Mp 1 Maximal sensitivity peak value for performance

Ms 1 Maximal sensitivity peak value for stability

ω rads Angular frequency

P MW Power

r Hz Reference value frequency deviation

s Laplace variable

S Sensitivity transfer function

Sn MW Rated power

T s Period, inverse of frequency

T s Simulated time interval

Ti s Integration time constant

Tw s Water time constant

Ts s Servo time constant

w MW Power disturbance

Y0 % Gate set point

In many cases the parameters have been scaled to use per unit values. For the used per unit systems,refer to section 2.3.

iv

List of abbreviations

Abbreviation Description

AB Adaboost — a machine learning classifier

ARX Autoregressive exogenous (-model)

CI Confidence interval

CV Cross validation

DT Decision tree — a machine learning classifier

ER Error rate

FCP Frequency containment process (-project)

FCR Frequency containment reserves

FCR-D Frequency containment reserves, disturbed operation

FCR-N Frequency containment reserves, normal operation

FRR Frequency restoration reserves

KPI Key performance indicator

ML Machine learning

MSE Mean square error

NN Neural network — a machine learning classifier

PI Proportional, integral (-controller)

RAR Analysis and review of Requirements for Automatic Reserves (-project)

ReLU Rectified linear unit — a neural network activation function

RF Random forest — a machine learning classifier

SVM Support vector machine — a machine learning classifier

TSO Transmission system operator

v

1 Introduction

The stability and reliability of the power system is of utmost importance. By Kirchhoff’s first law [1],the power production and consumption in the power system will be in balance. One consequence ofthis balance in an alternating current system is that frequency deviations will occur if an imbalancein production and demand exists. Such frequency deviations may be seen as an indicator of decreasedreliability of the system and is thus to be stabilised and corrected by dedicated Frequency containmentreserves (FCR). During normal – undisturbed – operation such containment reserves are denoted FCR-N.

For a number of years, the frequency quality has been decreasing in the Nordic synchronous area [2].To remedy this decrease in frequency quality the Nordic transmission system operators 2014 initiatedthe Revision of the Nordic Frequency Containment Process (FCP) project. As part of the revision theFCP-project has introduced a proposed set of pre-qualification requirements to ensure the stability andperformance of FCR providers [3]. The purpose of this Master’s thesis has been to examine the potentialof complementing the evaluation of the FCP requirements through the use of machine-learning methodsapplied to the input and output signals sampled during normal operation of a FCR-N providing powerplant.

1.1 Aim

The aim of the project is to examine the potential of complementing the evaluation of the FCP requirementson performance and stability through the use of machine-learning methods applied to the input andoutput signals sampled during normal operation of a FCR-N providing power plant. The examinedmethods should be non-invasive while keeping the physical interpretability and transparency with regardsto the information handled by the algorithm and how it is used.

1.2 Method

Outline:

• Perform simulations on existing models of a hydro power plant.

• Take the generated input and output signals and calculate key performance indicators with physicalinterpretability.

• Use the key performance indicators as input to the examined machine-learning algorithms to trainmodels for evaluation.

• Evaluate the accuracy of the resulting machine-learning models.

• Assess the potential of examined methods, the requisites for successful operation and make sugges-tions for further research.

The methodology of the thesis is further described in chapter 4.

1.3 Limitations in scope

The project is to:

• Constitute only an initial examination of the potential of the proposed methods to complement theFCP pre-qualification.

1

• Not to change or evaluate the established models of a hydro power plant. The model is taken to beequivalent to reality.

• The FCP pre-qualification requirements [3] are given and are used as the answer key to which powerplants are to be seen as qualified.

1.4 Structure of the report

Chapter 1 is an introduction to this report. Chapter 2 contains a depiction of the theoretical backgroundregarding frequency containment of the power grid and the proposed requirements on stability andperformance. A corresponding introduction to the machine-learning setting and the applied classifiersis given in chapter 3. The methodology of this thesis – including the simulations, calculation of keyperformance indicators and construction of the machine-learning models, as well as evaluation of themodels – is presented in chapter 4. Four variants of the simulation model with gradually increasingcomplexity have been applied, where chapter 5, 6, 7, and 8 respectively contains a detailed description ofthe simulation models as well as the results on each model, and a shorter discussion of the respectivemodel and corresponding results. A more in-depth discussion of the methodology and the general resultsas well as suggestions for further research are given in chapter 9. The conclusions of the thesis arepresented in chapter 10.

2

2 Theory - Frequency containment reserves

2.1 Power usage and frequency

The reliability of the power system is of paramount importance. By Kirchhoff’s first law [1], every amountof power consumed at the edge of the power grid has to be supplied by the grid at the same instant as itis used by the consumer. Hence, the power grid reacts instantaneously to the demands of the market atevery point in time. The first line of defence against changes in demand is the inertia of the synchronousmachines in the system. The kinetic energy Ek of a generator with frequency ω is

Ek = Jω2

2, (1)

where J is the moment of inertia of the rotating mass. At nominal frequency fn = ωn

2π = 50 Hz therelation may be taken to be

Ek,n = Jω2n

2= H · Sn, (2)

where Sn is the rated power of the generator at nominal frequency and H is the inertia constant [4]. Bytaking a time-derivative on the preceding equation a relation between power usage and angular velocitymay be derived as

Pt − Pg = Jωdω

dt. (3)

This is the so-called swing equation, where Pt is the power injected into the generator by the turbine andPg is the power extracted from the generator by the power grid [5]. It is readily seen that an imbalancein power produced by the turbines and the power consumed by the grid leads to a change in angularfrequency with a corresponding change in kinetic energy of the generator. A synchronous generator issynchronised with the power grid, i.e. the frequency of the generator is the same as the frequency of thevoltage in the power grid. Thus, when energy is taken from the generators in the system in order toincrease supply a corresponding dip in frequency is seen across the power grid. By the same reasoninga decrease in demand of the consumers will be compensated by the system as increased kinetic energyof the generators, with a corresponding increase in frequency. It can thus be concluded that changes inelectric frequency of the power grid may be used as a measure of the imbalances in power production andconsumption throughout the power grid.

2.2 Frequency containment reserves

The frequency of the system needs to be stabilised at a nominal value as it indicates the productionbalance in the grid. Devices connected to the grid may also be harmed if the electric frequency driftsoutside of the design specifications for the device [4]. Thus deviations in frequency need to be containedin magnitude. By the preceding section a deviation from nominal frequency arises when the kineticenergy of the generators is used to even out imbalances in supply and demand of power. Special powersources and/or power sinks are deployed within the system in order to react to sudden changes infrequency and try to contain them by providing additional supply or demand. Such reserves are calledFrequency Containment Reserves (FCR). Under nominal circumstances the frequency is to be held atf = 50.0± 0.1 Hz; the reserves responsible under these conditions is called FCR-N. When the system isdisturbed, i.e. the frequency is outside of the nominal band another set of reserves by the name FCR-Dis activated. The two reserves work together to stabilise the power imbalance in the system as well as the

3

frequency deviation. The frequency may then be restored by the slower frequency restoration reserves(FRR) that also restores the FCR capacity. [6]

2.2.1 Pre-qualification requirements for FCR-N

FCR-N capacity is supplied by actors on the market on behalf of the transmission system operator (TSO).Since 2017 a new scheme of pre-qualification of FCR-N capacity has been proposed by the Nordic TSOsin order to ensure the quality of the frequency reserves [3]. The Frequency Containment Process project(FCP) has established two conditions on performance and stability respectively to be fulfilled by theFCR-N supplier in a pre-qualification process. The two conditions are understood by regarding the systemas of the general form depicted in figure 1 below and described in a non-linear model of a hydro powerplant in the following section.

F(s) G(s)ΣΣ

Control unit System

Disturbance

Output

y

d

r

Figure 1. General overview of a feedback system with disturbance.

Two versions of the system are considered when stipulating the necessary conditions on performanceand stability. The first version corresponds to a scenario where the amount of inertia within the systemis taken to be at a minimal value and thus models the worst case scenario. This scenario is used forthe stability criterion in order to ensure robust stability, i.e. the system should be stable even undermodel uncertainties. The transfer function representing the system is in this case denoted Gmin(s). Inthe second scenario the amount of inertia within the system is set at the average value which is used toestablish the performance criterion. Thus, the performance condition states that the system should havesome set performance at nominal conditions without regard to uncertainties in model and control. Thetransfer function for the system with average inertia is denoted Gavg(s). [3]

By defining the sensitivity transfer function as

S(s = jω) =1

1 + F (s)G(s), (4)

the stability criterion from the FCP project may be formulated as

||Smin(s)||∞ < MS , (5)

i.e. the supremum value of the sensitivity function for the minimal inertia system is to be below athreshold taken to be Ms = 2.31 [3]. Meanwhile, the performance requirement is stated as

||Savg(s)||∞ <σf

||D(s)Gavg(s)||∞, (6)

or equivalently

4

||Savg(s)Gavg(s)D(s)||∞ < σf , (7)

where Gavg is the transfer function of the average inertia system and D(s) is the transfer function fromunfiltered white noise to the disturbance [3], such that

d = D(s) · w, (8)

where w is specified to be a white noise source and d is the disturbance depicted in figure 1 above. Theperformance and stability conditions are illustrated in figure 2 below. The constant σf represents thepower spectral density of the frequency deviation and scales to 1 in per unit scaling [3], further describedin section 2.3 below.

10-4 10-3 10-2 10-1 100 10110-2

10-1

100

101

102

Mag

nitude

(abs)

Performance req. Stability req.

Frequency (rad/s)

Requirements by the FCP-Project

Smin(s)Savg(s)

Figure 2. Illustration of the stability and performance requirements in equation (5) and (6) respectively.The solid blue line is to be below the dashed blue line at all frequencies for the stability requirement to befulfilled. Similarly, the red solid line is to be below the dashed line of the same colour for fulfilment of theperformance requirement.

2.2.2 Pre-qualification procedure for FCR-N

The performance and stability requirements described in the preceding section are evaluated for a powerplant by performing sine-in-sine-out tests to estimate the response of the control unit F (s) and byextension the sensitivity transfer function S(s). The test are conducted by disconnection of the feedbackloop in Figure 1 above and replacement of the feedback signal with a sine wave of specific frequencysuperimposed on a nominal signal with fn = 50 Hz, and measuring the corresponding output. Bymeasuring the gain and phase shift of the signal the transfer function at that specific frequency maybe estimated. By performing these sine tests at a representative range of frequencies the total responseof the sensitivity function may be approximated and evaluated against the performance and stabilityrequirements.

By the FCP project the following time periods are to be used with the sine tests [7], with T = 2πω :

T (seconds) 10 15 25 40 50 60 70 90 150 300

Here the time periods are approximately evenly spaced on a logarithmic scale.

In addition to the sine-in-sine-out test a step response test is to be performed to establish the maximalcapacity of the power plant for providing FCR-N[7], as illustrated in Figure 3 below.

5

0 50 100 150 200 250 300

time (a.u.)

49.8

49.9

50

50.1

50.2

f (H

z)

17

18

19

20

21

22

23

P (M

W)

FCR-N Normalisation step sequence

ΔP1

ΔP2

ΔP3ΔP4

Output

Input

Figure 3. Pre-qualification step sequence with corresponding definitions of ∆P1,∆P2,∆P3,∆P4.

From the step sequence the backlash of the power plant is estimated as the parameter 2D, which is usedto determine the available FCR-N capacity C the power plant may deliver. The parameters 2D and Care defined as:

2D =||∆P1| − |∆P2||+ ||∆P3| − |∆P4||

2, (9)

C =|∆P1|+ |∆P3| − 2D

2. (10)

In addition, the FCP-requirements state that the results from the sine-sweep should be interpolatedbetween data points, thus increasing the difficulty of fulfilment. The requirements also allow for areduction of the requirements by 5 % to account for measurement uncertainties [3]. Neither of theseconditions has been considered for this thesis, with partially cancelling effects.

2.2.3 Simulation models for hydro power plants

The simulation model used throughout this thesis is based on the reference model of a hydropower plantdefined by the FCP project. The simulation model is presented in Figure 4 below and will be brieflydiscussed in this section together with the corresponding pre-qualification model in Figure 5.

6

Servo Penstock

ep

ep

0

r [p.u.]

Δ PFCR [p.u.fcp]

PI

1/f0ScaleΔf

System inertia

w Disturbance [MW]

Δ f

Frequency [p.u.fcp]

Pn2R · 600/C

Local2Global

1/df

scaleΔf

1/CScale ΔP

BLGR

F(s)

Figure 4. Model used for simulation of the operation of a hydro power plant and its FCR-N characteristics.The dashed box corresponds to the control unit F (s) in Figure 1.

The simulation model consists of two sub-systems, the power system G(s) and the control systemF (s). The power system represents the frequency deviation that arises when inertia is used to even outimbalances in power production and consumption, while the control system models the inflow of waterinto the power system and the resulting power generation by the turbines. The power system transferfunction may be derived by taking a Laplace transform on the swing equation. The control systeminvolves a proportional-integral (PI) regulator with parameters Kp and Ki representing proportionaland integral gain respectively, a servomotor with time constant Ts that regulates the water flow intothe turbines and a gate rate limiter that limits the rate of change of the gate servo signal. The waterflow from the gate servo is modelled with a backlash, i.e. hysteresis, and sent into a system representingthe waterways with water time constant Tw. The water ways block has a zero in the right hand of thecomplex plane and is thus a non-minimum phase system. The parameter ep, called droop, limits thestatic gain of the PI-controller, which becomes 1/ep [5]. The static gain represents the system behaviourfor a constant deviation from the set point. Increasing values of the droop decreases the gain and thusdecreases the response from the system for a given deviation in frequency. The parameter Pn2R is ascaling factor introduced to scale the signal between per unit systems, further discussed in section 2.3.

The backlash models how the system resists acting on small changes in input, i.e. no change in waterflow happen at all until the desired change is above some threshold. This is a highly non-linear behaviourthat introduces hysteresis. Thus, it is harder to react to small changes in frequency within the non-linearmodel that includes backlash than it would be in a strictly linear model. A semi-linearised model isachieved by setting the backlash in between the servomotor and the waterways to zero. This version ofthe simulation model is in the continuation referred to as the linear model. This semi-linearised modelstill contains some non-linearities in the form of the gate rate limiter.

Servo Penstock

ep

Droop

Δ P

FCR [MW]PI

Δ f

Sine wave [Hz]

1/f0

Scale

Pn2R · 600/C

Local2GlobalBLGR

Figure 5. Model used in simulated pre-qualification tests to evaluate the FCR-N stability and performancerequirements.

The model for simulated pre-qualification tests is based on the simulation model but with the feedbackloop disconnected, and is depicted in Figure 5 above.

7

2.3 Per unit scaling

Quantities are sometimes scaled to express them as fractions of typical values for a production unit withthe intent of making production units comparable. For example the power output may be divided withthe rated power to get an expression for the utilisation of the plant. The rated power Sn is then used asthe base for power as expressed per unit by calculating

Ppu =P

Pbase=

P

Sn. (11)

The FCP project defines two such bases, the machine base and the FCP base, introduced below. [3]

2.3.1 Machine base

The machine base is obtained by taking the rated power of all FCR-N providing plants summed as thepower base. The frequency base is taken as nominal frequency, i.e. 50 Hz. Hence the power and frequencybases become:

Pbase = Sn = ep · dP ·f0

df· dPC

= 1 p.u. , fbase = f0 = 50 Hz = 1 p.u. (12)

The expression for the power base may be somewhat simplified by introduction of the parameter Pn2R

Pn2R = ep · dP ·f0

df, (13)

such that the power base becomes Pbase = Pn2R · dPC . Since the droop ep per the FCP-project[3] isdefined as

ep =df/f0

dP/Sn−FCR, (14)

the parameter Pn2R is equivalent to the rated capacity Sn−FCR of a FCR-N providing plant. It is alsoseen that with a FCR-N capacity of C per plant, and a total capacity of dP , the number n of such plantshas to be

n =dP

C, (15)

and thus

Pn2R · dPC

= n · Sn−FCR . (16)

8

2.3.2 FCP base

The purpose of the FCP base is to scale the FCR-N contribution from a power plant with the maximalallowed contribution C, per the FCP project pre-qualification step sequence illustrated in Figure 3. Thefrequency is scaled with the full activation frequency deviation, i.e. 0.1 Hz, for FCR-N. Thus the powerand frequency bases become:

Pbase = C = 1 p.u. , fbase = df = 0.1 Hz = 1 p.u. (17)

2.3.3 Conversion table

Desired quantity = Scale factor×Given quantity (18)

Desired quantity

Given quantityP PMB PFCP

P 1 Pn2R · dPC C

PMB 1 1Pn2R ·

C2

dP

PFCP 1

Here the following relation holds, per equation (16):

1

Pn2R· C

2

dP=

C

n · Sn−FCR. (19)

9

3 Theory - Machine learning

In this chapter the machine-learning setting is introduced in section 3.1, and an explanation of how theaccuracy of classification can be assessed is given in section 3.2. The classifiers that have been appliedare discussed in section 3.3.

3.1 Introduction to machine learning

The machine-learning problem is to establish a mathematical model that well expresses the relationbetween some input and output, i.e. to find a function that creates a mapping from the input to theoutput. The model may be used for either prediction or inference. In the prediction setting the aim is topredict properties of the unknown output from given inputs. This can mathematically be seen as findingcorrelations in a known data set without regards to causation. In the inference setting the aim is insteadto find such causations in the data set. The main difference between predication and inference is thus thatwhen performing predictions the aim is to estimate the output corresponding to a given input, i.e. makeas correct predictions as often as possible, while the aim when performing inference is to explain why theoutput changes in that specific way. Hence a prediction model focuses on achieving god estimates whilean inference model puts more emphasis on the interpretation of the connections that arises within themodel. The result of this difference is that when creating a model for performing predictions the modelshould be evaluated by the prediction accuracy on independent, previously unseen, data. The sources ofthe correlations found by the model are of less interest. An inference model is harder to evaluate becauseof the need to discern the difference between correlations and causations. All models explored within thisthesis are in the prediction setting.

The model is created by a specified machine-learning algorithm that is supplied with a set of knownhistorical inputs with corresponding outputs. The inputs as well as the outputs may be quantitative orqualitative, i.e. categorical. If the output to be estimated by the model is quantitative the problem isof the regression type, while categorical outputs correspond to a classification problem. As the aim ofthis thesis is to predict if a power plant fulfils the FCP-requirements or not all models explored will beclassification models.

The process of creating the model from historical data is called to train the model. The inputs andoutputs corresponding to the same historical event may be collected into a set denoted as a trainingsample. The sample consists of a vector of inputs, denoted the features of the sample, as well as theoutput which is denoted as a label in the classification setting. The data set for training the model consistof all the historical samples that are known at the time of training. Some of the samples in the data setare withheld from the model at the time of training to instead be used as independent data for estimatingthe classification accuracy of the trained model. The samples withheld from training constitutes the testset, while the samples used for training is denoted the training set. In addition to enabling an assessmentof the classification accuracy, the partitioning enables a method to reduce the amount of overfitting in theconstructed model. Overfitting occurs when the model is permitted to learn from irrelevant information inthe historical data set, for example random noise in the data or statistical anomalies with the distributionof the samples. Such dependence on irregularities in the train data will appear as decreased test setaccuracy even while the training set accuracy remains high.

All of the classifiers examined for this thesis are represented by the choices of parameters and hyper-parameters of the respective machine-learning model. The hyperparameters describe the general structureof the model while the parameters represent the precise details within that structure. For example, ifpolynomial regression is applied the hyperparameter would be the degree of the polynomial while theparameters of the model correspond to the coefficients of the polynomial. The training process introducedabove only decides on the choices of parameters. If in addition hyperparameter selection is to be appliedan additional validation set of samples withheld from the training process is needed to independentlydecide on hyperparameter choices. Alternatively, cross validation may be applied on the training set, atthe cost of increased computational complexity.

10

The machine-learning setting is further examined in the sections that follow, with special regard given tomethods for estimating the classification accuracy as well as increasing it.

3.2 Estimating the quality of classification

When training on the historical samples of features – as described in the preceding section – the qualityof the resulting classifier has to be examined. Section 3.2.1 establishes a limit on accuracy that exists in asetting with random noise and the resulting bias-variance trade-off that is to be considered when designinga classifier. Methods to perform this trade-off are introduced in section 3.2.2 by dividing the availableobservations into train, validation and test sets and in section 3.2.3 by performing cross validation.

3.2.1 Bias – Variance trade-off

In regression and classification problems the objective is to construct a machine-learning model that fitsthe known training data and generalizes well to unknown test data. The bias is a measure of how well themodel accommodates the supplied training data where a low bias means that the model is well fitted tothe data. The variance measures the extent to which the model changes when a different set of trainingdata is supplied, e.g. the impact of noise in the data has on the model. In general it can be said that amodel that fits the training data well will also be fitted to the accompanying noise in the data. Hence,a low bias model will typically be of high variance, and vice versa. Thus it is in general impossible tominimize both the bias and the variance of a model at the same time, resulting in a bias-variance trade-offthat has to be made when constructing the model. This phenomenon is illustrated in Figure 6 below.

x-1 -0.5 0 0.5 1

y

-1.5

-1

-0.5

0

0.5

1

1.5

2Low order polynomial

High bias - Low variance

x-1 -0.5 0 0.5 1

y

-2

-1

0

1

2High order polynomial

Low bias - high variance

Figure 6. Illustration of the Bias-Variance trade-off, here shown for a linear regression on a dataset withunknown noise. The low order polynomial does not fit the data very well, i.e. is biased, but also not verysensitive to changes in data and noise and thus of low variance. The high order polynomial fits the data wellbut is very sensitive to changes in data or noise. Hence, the high order polynomial is of low bias but highvariance. The bias-variance trade-off is to find a regression model that balances these two phenomena.

The bias-variance trade-off may be studied mathematically by introducing the setting y = f(x) + ε, wherey is the true output, x is the input vector of features, f is some deterministic but unknown function andε is random noise independent of the input with mean zero and variance σ2. Let S be a given set of pairsof inputs and outputs, S = {(x1, y1), (x2, y2), . . . , (xn, yn)}, denoted the training set. The training setis a subset of all possible, perhaps infinite, pairs of input and output data. From this set an estimatorf of the function f is to be made using some specified method of estimation. Because of the limitedtraining set and random noise the estimator is to be seen as a realisation from an infinite class of possibleestimators that could be created by that specific method. It is thus desirable to determine some generalproperties of the estimator f with regard to unseen pairs of input and output data (x0, y0) /∈ S.

11

One measure of the quality of the class of estimators that has been used is the mean square error (MSE),defined as

MSE = E

[(y0 − f(x0)

)2]. (20)

It is readily seen that the MSE is non-negative, and equal to zero only for an ideal estimator. Anotherquality measure that is useful in discrete classification is the error rate (ER) here taken to be

ER =1

n

n∑i=1

I(yi 6= yi), (21)

where I is an indicator function that is equal to 0 when the classification is correct and equal to 1otherwise. Both measures can be decomposed into the variance and bias of the estimator together withthe variance of the noise term, here shown for the MSE. First some preliminary results:

1. Var(X) = E[(X − E[X])2] = E[X2]− E[X]2 ⇒⇒ E[X2] = Var(X) + E[X]2

For some random variable X.

2. Var(y) = E[(y − E[y])2] = E[(y − E[f ])2] = E[(y − f)2] = E[ε2] == Var(ε) + E[ε]2 = Var(ε) = σ2

Here it has been used that f is deterministic, i.e. that E[f(x0)] = f(x0), and that the noise haszero mean: E[ε] = 0.

3. E[y] = E[f(x)] + E[ε] = f(x).

From these results it is possible to find a decomposition of the MSE according to

MSE = E[(y − f)2] = E[y2 − 2yf + f2] =

= E[y2]− 2E[yf ] + E[f ]2 =

= Var(y) + E[y]2 − 2E[yf ] + Var(f) + E[f ]2 =

= Var(y) + Var(f) + (E[y]2 − 2E[yf ] + E[f ]2) =

= Var(y) + Var(f) + (f2 − 2f · E[f ] + E[f ]2) =

= Var(y) + Var(f) + (f − E[f ])2 =

= σ2 + Var(f) + Bias(f)2.

Here Var(f) = E[(f(x)−E[f(x)])2] is a measure of the spread in the estimator when varying the training

data while Bias(f(x)) = E[f(x) − f(x)] is a measure of the mean deviation of the estimator from thetrue function with x held constant. σ2 represents the irreducible error and arises from the variance ofthe noise. Note that in the above given derivation of the decomposition it has been assumed that thevariables exist in a continuous setting and that ε and f are independent. A derivation in the more generalsetting is found in [8].

In general it can be said that a more complex model for the estimator, i.e. a model with more degrees offreedom, will be better suited to explain the training data. The bias term arises when a model is used forestimation that is less complex than the function to be estimated, e.g. if a linear estimator f is used toestimate a quadratic function f . Thus, increasing the complexity of the model will typically lead to adecrease in bias. Variance, on the other hand, arises when the estimator tries to fit the noise term ε, aswell as the input x, to the output y. A more complex model will be more able to adjust to the error term,a phenomena called overfitting. Hence, increasing model complexity leads to increasing variance of theestimator.

12

Since the bias term is decreasing while the variance increases with increasing complexity, a trade-offbetween minimising bias and variance has to be made in order to minimize the MSE. This Bias - Variancetrade-off in the error needs to be considered when designing and evaluating classifiers and is illustrated inFigure 7 below. Two methods that have been used are dividing the available data into training and testsets, and cross validation respectively. The methods are introduced in the sections that follow.

Model complexity0 1 2 3 4 5

0

1

2

3

4Simulated contribution to MSE

Variance

Bias2

2

MSE

Figure 7. Simulated contribution to mean square error from variance of the estimator, bias squared andirreducible error σ2 due to the variance of the noise as the complexity of the model increases.

3.2.2 Training, validation and test sets

One way to perform the bias–variance trade-off is to randomly partition the available data into two sets:a training set and a test set. The training set is used to fit the model parameters while the test set isused to evaluate the classification error of the resulting trained model. This method reduces the risk ofoverfitting, since the data set used to train the method is different from the set where the errors areevaluated, and computationally efficient. When training and test set has been used in this project, a70/30 split has typically been performed, i.e. 70 % of the data goes into the training set and 30 % intothe test set.

One disadvantage of performing this partitioning is that the available data points for fitting parametersare reduced as some data points are withheld for verification. Another disadvantage of partitioning thedata arises when hyperparameters of a method need to be chosen in addition to the ordinary parameters.For example, when fitting a polynomial

pn(x) = c0 + c1x+ c2x2 + . . .+ cnx

n, (22)

the parameters are the coefficients ci of the polynomial while the degree n of the polynomial is ahyperparameter. If the coefficients are fitted for polynomials of various degrees on the training set, asecond validation set is needed to determine the optimal value of the hyperparameter, i.e. which of thepolynomials perform best. A third test set is then needed to evaluate the error of the final model. If onlytwo sets are used either the parameters or the hyperparameters are at risk of overfitting. Hence furtherpartitioning is needed over the training and test set when fitting hyperparameters, further reducing theamount of data available for training the model as illustrated in Figure 8 below. To avoid this problem,cross validation may instead be performed, as described in the next section.

13

Training set Valida�on set Test set

Fit parameters

Determinehyperparameters

Es�mateerror

Figure 8. To independently determine parameters, hyperparameters and estimate classification errors theset of all observations has to be divided into three separate sets, if cross validation is not applied.

3.2.3 Cross validation

As an alternative to partitioning the data set into training, validation and test sets, k-fold cross validationmay be performed. The idea is to randomly partition the set of observations into k folds, use one of thefolds as the validation set and train the model on the remaining k − 1 folds. Using the validation setsome specified error measure is calculated. The process is then repeated so that each of the k folds isused once as a validation set and k − 1 times included in the training set. The k-fold cross validation(CVk) estimate of the error is then achieved by calculating the average over the chosen error measure:

CVk =1

k

k∑i=1

Errori . (23)

For this thesis the error rate in equation (21) has been chosen as the error measure when performingcross validation. By using cross validation more data can be used for training the model and less forvalidation, since averaging the results increases the accuracy in the error estimates. This comes at theexpense of having to retrain and retest the ML-model k times, once for each fold. For this project, avalue of k = 10 has been used when performing k-fold cross validation.

Cross validation is used to combine the training set with the validation set, i.e. the sets to fit theparameters and hyperparameters respectively. To accurately estimate the final error rate it is still neededto have a set of data that is independent of the train and validation data to counteract overfitting. Hence,a separate test set is still required for error estimation on the final model.

Iter 1

Iter 2

Iter 3

Train data

Validation data

Cross validation

Figure 9. Example of partitioning the original train data set into k = 3 folds to perform 3-fold crossvalidation. In each iteration the train subset is used to train the model while the validation subset is used toevaluate the hyperparameters of the model. The results are then averaged over the folds.

14

3.3 Classifiers

In this section the classifiers that have been considered in this project are introduced together with theirrespective parameters and hyperparameters.

3.3.1 Decision tree

The decision tree algorithm works by performing recursive binary splitting of the feature space, asillustrated in Figure 10 below. At each node in the tree, a single feature from the input vector x is chosen,xi, and a threshold parameter a is chosen to split the feature space in two halves according to

H1 = {X|Xi < a} , H2 = {X|Xi ≥ a} . (24)

This process is repeated recursively until all training samples at a node has the same classification or athreshold for the height of the tree is reached. Note that the same feature xi may be split several timesat different nodes in the tree and for different values of a. To assess which feature to split at a node, andat which value, the Gini index has been used:

G =

K∑k=1

pmk (1− pmk) =

K∑k=1

pmk −K∑k=1

p2mk = 1−

K∑k=1

p2mk. (25)

Here K is the total number of classes, pmk is the proportion of data points currently in region m that areof class k. It is readily seen that the Gini index is non-negative and approaches zero as more and more ofthe samples in region m have the same classification, i.e. if one of the ratios pmk approaches one and theothers go to zero. Thus, the Gini index is a measure of impurity such that a lower index at a node meansthat a larger fraction of the samples at that node has the same classification. The feature xi to split andthe threshold a to split at is chosen greedily at each node to achieve the maximal reduction in Gini index.

If the tree is allowed to grow without bounds it may cause severe overfitting on the training data. Oneway to prevent this is to use a hyperparameter that limits the maximal height of the classification tree.Another possibility is to let the tree grow until no further reductions in Gini index can be achieved andthen post-“prune” the tree by using the validation set to remove nodes such that the validation set errordecreases.

Figure 10. Example of a decision tree as created by the decision tree classifier. When a new data point isto be classified, the algorithm starts at the top of the illustration and follows a path by answering yes or noto the questions stated in each box within the path. The final classification is found when reaching one ofthe end nodes of the tree, i.e. approved or not approved.

15

3.3.2 Random forest

The decision tree algorithm as described in the preceding section will, if not limited, lead to a classifierwith very low bias and high variance. The idea behind the random forest algorithm is to use many suchclassifiers and reduce the variance by averaging the results, or taking the majority vote in the classificationsetting.

If the same training set is used to train all the decision trees the resulting classifiers will be highlycorrelated or perhaps even identical and hence the resulting reduction in variance will be minuscule. Todecorrelate the trees and improve the variance reduction two different strategies are applied: bootstrappingand random feature selection. Bootstrapping means that each decision tree gets its own training setrandomly sub-sampled from the main training set. The high variance of the decision tree algorithm willlead to large changes in the tree structure even for small changes in training data and thus decorrelationof the classifiers.

If one of the features dominates when calculating Gini index reduction that feature may still be used atthe top of most trees even with random sub-sampling of data, and hence bootstrapping may not lead toas much decorrelation as anticipated. Thus, in addition to bootstrapping, random feature selection isused to provide decorrelation. Random feature selection means that at a split, instead of choosing amongthe full set of available features only a randomly selected subset of features are considered for makingthe split. A high performing feature that otherwise would be placed at the top of each of the trees willthen randomly be rejected for that position, and thus decorrelation occurs as another feature is selectedfor the top spot of that tree. A typical choice is to only consider the square root of the total amount offeatures at each node split, which means that most features are disregarded at each single split.

The hyperparameters of the random forest algorithm are the number of decision trees to average overand the number of features to consider at each split, in addition to the hyperparameters of the decisiontree algorithm itself.

3.3.3 Adaboost

The main principle behind the Adaboost algorithm is to weigh the training samples such that the trainingprocess considers it more important to correctly classify some samples than others. Initially all samplesare weighted evenly and a decision tree classifier is trained on the data with a severe limit on the maximalheight of the tree. The resulting weak classifier will be high bias and low variance. The next step is tocalculate the residual of the classification, i.e. find the misclassified samples, and increase their weightwhile decreasing the weight of the samples that were correctly classified. This process is repeated untilsome specified limit on the number of decision trees are reached. The final result of the Adaboost classifieris achieved by weighting the results of each individual decision tree with their respective training errorand then taking the majority vote of the classifiers. The technical details of the algorithm are found in [9].

The main hyperparameters of the Adaboost algorithm are the number of decision trees to use and themaximal depth of each individual tree.

3.3.4 Support vector machine

The idea behind the support vector machine is to find a hyperplane that separates the feature space intotwo regions such that all data points of class one are on one side of the plane and all points that belongto class two are on the other side of the plane [10]. A hyperplane in a p dimensional space is defined bythe equation

w0 + w1x1 + w2x2 + . . .+ wpxp = w0 +

p∑n=1

wnxn = w0 + wTx = 0, (26)

16

which in two dimensions reduces to the equation of a straight line. Here x is the vector of coordinates, wthe corresponding vector of weights and w0 the bias. By setting the bias term w0 to zero the hyperplanewould pass through the origin. If the problem is linearly separable, i.e. if there exists a hyperplane suchthat all points can be correctly separated, infinitely many such hyperplanes can typically be created bysmall rotations or translations of the plane. It is then desirable to find the unique separating hyperplanewhich maximises the margin to the points closest to the plane. It may also happen that no such separatinghyperplane exists at all; in that case it would be desirable to let some data points be misclassified andfind the hyperplane that best separates the classes instead of failing to find a plane that does it perfectly.This leads to an optimisation problem as follows

minw,w0,ξ

wTw + C

n∑i=1

ξi (27)

S.t. yi(w0 + wTxi

)≥ 1− ξi, (27a)

ξi ≥ 0, i = 1, . . . , n. (27b)

Here wTw is inversely proportional to the square of the margin, ξi is slack variables that allows samples toviolate the margin while C is a penalty for margin violation. yi contains the classification of sample i suchthat yi = sgn(w0 + wTxi), i.e. yi ∈ {1,−1} where the sign corresponds to which side of the hyperplanethe sample is situated. n is the total number of samples to be trained on.

The decision boundary in the optimisation problem (27) above will be linear because of the linear innerproduct 〈w, xi〉 = wTx in constraint (27a). By replacing the inner product with a generalized innerproduct a non-linear decision boundary may be obtained. Such a generalisation is referred to as a kernelK(u, v) = φ(u)Tφ(v) for some vectors u and v, where φ(·) is a non-linear transformation. By applying thenon-linear transformation to the problem in (27), the support vector machine problem may be obtainedwith a primal and dual formulation as follows:

minw,w0,ξ

wTw + C

n∑i=1

ξi (P)

S.t. yi(w0 + wTφ(xi)

)≥ 1− ξi, (Pa)

ξi ≥ 0, i = 1, . . . , n. (Pb)

minα

1

2αTQα− eTα (D)

S.t. yTα = 0, (Da)

0 ≤ αi,≤ C i = 1, . . . , n. (Db)

In the dual problem α is the vector of Lagrange-multipliers, e is a vector of all ones, and Q is a positivesemidefinite matrix with Qij = yiyjK(xi, xj), where K(u, v) is the chosen kernel. The dual problem is aquadratic problem with linear constraints and is the problem used in the optimisation routine. Note thatin the dual formulation the weight vector w and non-linear transformation φ(·) is not used explicitly.

When a new sample with feature vector x0 is to be classified the decision function is

y0 = sgn

(n∑i=1

yiαiK(xi, x0)

). (30)

The above given formulation of a support vector machine classifier is suitable only for binary classificationas y0 ∈ {−1, 1}, but the method may be adapted to multi-class classification. Further details and a morethorough derivation of the optimisation problem may be found in [10].

The following kernels have been considered when using the support vector machine classifier in thisproject.

Linear kernel

K(u, v) =

p∑j=1

ujvj . (31)

17

Polynomial kernel

K(u, v) =

1 + γ

p∑j=1

ujvj

d

. (32)

Here d is the degree of the polynomial and γ is a positive constant.

Radial kernel

K(u, v) = exp

−γ p∑j=1

(uj − vj)2

, (33)

where γ is a positive constant.

Linear kernel Polynomial kernel (d=3) Radial kernel

Figure 11. Two-dimensional example of a support vector machine classifier with three different kernels.The decision boundary is shown in black with the margin at each side shown in red and blue respectively.For the linear kernel some data points violates the margin as the data is not linearly separable.

Figure 11 above shows the three types of kernels applied to the same training data set in a support vectormachine classifier. Margin violation is seen for the linear kernel as the data is not linearly separable.

The hyperparameters of the support vector machine are the margin violation penalty C, the choice ofkernel and, if applicable to the kernel, the coefficient γ and the polynomial degree d.

3.3.5 Neural network

The artificial neural network classifier is based on individual units, denoted artificial neurons, orderedin layers such that each layer is connected to one preceding layer and one succeeding layer [11]. Theartificial neuron constitutes of a mathematical model inspired by some aspects of a biological neuron –hence the name – and is depicted in Figure 12 below. The idea behind the artificial neuron is to take avector x as inputs to the neuron and create a weighted sum θ = wTx, where w is a vector of weights.This weighted sum is used as stimulus for the neuron; if the stimulus is above some threshold the neuronactivates and outputs a 1, otherwise the output remains at the initial 0. The activation is controlled byan activation function with the simplest form being an ordinary step function. The threshold is typicallymodified by introducing a bias term to the weighted sum θ. Mathematically this may be formulated as

θ = wTx+ b, y = f(θ), f(θ) =

{1, if θ > 0

0, if θ < 0(34)

18

x1

x2

xn

w1

w2

wn

Weights

Biasb

Inputs ΣOutput

y

Activationfunction

Artificial neuron

Figure 12. Structure of an artificial neuron.

The artificial neural network consists of layers, where the first and last layers are denoted visible layers,and the layers in-between are hidden layers. The individual neurons are contained in the hidden layers,such that each neuron in hidden layer i take input from the output of hidden layer i− 1. The outputof layer i is in turn sent to hidden layer i+ 1 and used as inputs for the latter layer. The first and lasthidden layer is instead connected to the visible input and output layer respectively. The input layerconsists of the vector of features made available to the algorithm while the information sent to the outputlayer constitutes the resulting classification. This structure is illustrated in Figure 13 below.

Input Hiddenlayer 1

Hiddenlayer 2

Output

1 2 3 4

Neural network with four layers

Figure 13. General structure of an artificial neural network. Each circle in the hidden layers corresponds toan individual neuron.

The weights w and biases b throughout the neural network is updated through a gradient decent method,such that

v(i+1) = v(i) − η∇C. (35)

Here v is a collection of the weights and biases in vector form, η the step length of the algorithm, oftendenoted the learning rate of the network, and C is some error measure, e.g. the mean square error. Thegradient ∇C is calculated numerically using the backpropagation algorithm [12]. The gradient shouldin principle be calculated once for each training sample, with a corresponding update of parametersafter each calculation. To increase calculation performance it is typically chosen to use several randomlyselected samples at once when calculating the gradient and performing the parameter update. Thisupdate algorithm is denoted stochastic gradient descent with the number of simultaneously used samples

19

denoted the batch size. Since the algorithm is not guaranteed to converge to a stationary solution in afeasible time the number of iterations are pre-determined; this hyperparameter is denoted the numberof training steps. A more in-depth look at the neural network updated with stochastic gradient descentthrough the backpropagation algorithm may be found in [12].

Several variations of the structure of a neural network exist. The type of network here described andused for this thesis is a fully connected feed forward deep neural network. A fully connected networkmeans that each individual neuron in a layer is connected to every neuron in the preceding layer as wellas every neuron in the succeeding layer. An example of a non-fully connected network is the convolutionalneural network, where some of the connections between layers has been removed, forcing that part ofthe network not to take all available information in account but instead to focus on smaller details. Thenetwork is feed forward as information is allowed to flow only in one direction, in contrast to for examplea recurrent network where information is allowed to loop back into an earlier layer. That the network isseen as deep means that the network constitutes of more than one hidden layer.

In addition to the various ways of connecting the neurons into a network, the individual neurons maybe varied, especially with regard to the activation function f(θ) in equation (34), also depicted in figure12 above. The implementation used throughout this thesis has used the rectified linear unit (ReLU) –equivalent to a unit ramp – instead of the previously introduced unit step, as a way to ease the gradientcalculations and also as ReLU has been shown to increase final classification accuracy compared to otherchoices of activation functions [13].

The hyperparameters of the applied neural network classifier are the number of hidden layers, the numberof neurons in each layer, the batch size, the learning rate and the number of training steps.

20

4 Methodology

In this chapter the methodology that has been developed for use within the project is presented. Section4.1 contains a brief description of the four simulation models that have been used to generate signals offrequency and power from a simulated power plant as well as an answer key to how that power plant is tobe classified. The generated signals are then used as a base for calculation of key performance indicatorswhich is used as features for the machine-learning methods. The key performance indicators are definedin section 4.2. Section 4.3 contains a description of how hyperparameter-tuning for the machine-learningmethods were performed, while section 4.4 is a depiction of how the features where managed with regardto feature selection and scaling. In section 4.5 the methods of evaluation that have been used for thetrained classifiers is described.

4.1 Simulations

The frequency and power signals that are used by the machine-learning classifiers for classification havebeen obtained by performing simulations on a model of a hydro power plant. Four such models have beendeveloped with gradually increasing complexity, all based on the reference model introduced in section2.2.3. The four models are:

• a linear model, obtained by setting the backlash to b = 0. Further discussed in chapter 5.

• a non-linear model, with backlash allowed to be non-zero. Further discussed in chapter 6.

• a non-linear model with noise, where measurement noise and uncertainties have been added to theresulting signals. Further discussed in chapter 7.

• a sub-sampled model, based on the noise model and used to examine the effects of sampling interval.Further discussed in chapter 8.

Several power plants have been simulated by random selection of some of the parameters in the respectivesimulation model. For each simulated power plant signals of frequency and power have been generatedcorresponding to two hours of normal operation of that plant. In addition to the simulation of normaloperation the pre-qualification process introduced in Figure 5 has been simulated for each respective plantto generate an answer key to the fulfilment of the FCP conditions on stability and performance. For thisthesis a plant has been classified as approved if it simultaneously fulfils both the stability and performanceconditions, otherwise as non-approved. This process was repeated to generate a representative sampleof simulated power plant for each simulation model. The data set of answer keys together with keyperformance indicators calculated on the generated signals were then used to construct a machine-learningmodel for the data as well as evaluation of the same model.

For further details regarding the simulation models, model parameters and number of simulations, as wellas the results, refer to the chapter for the respective model.

4.2 Key performance indicators

Several key performance indicators (KPI) have been defined as a way to characterise the system frommeasurements of the input signal f(t) and the output signal P (t). The KPIs are used as the input featuresto the machine-learning algorithms, as discussed in Section 3.1. Throughout this section ∆f(t) is takento be the deviation in frequency from the nominal f0 = 50 Hz while ∆P (t) is the corresponding changein power production, i.e. the activated FCR-N capacity. The indicators are constructed by analysingthe signals in the time domain and the frequency domain respectively; section 4.2.1 introduces the keyperformance indicators in the time domain while the indicators from the frequency domain are introducedin section 4.2.2 below. Some indicators are calculated from a combination of values form the time and

21

frequency domain; these indicators are introduced in section 4.2.3. Figure 14 below presents an exampleof a frequency signal with corresponding FCR-N contribution from a simulated power plant.

Figure 14. Illustration of the frequency and power signals that has been used as a base for the calculationof the key performance indicators.

4.2.1 Time domain

The following key performance indicators were defined in the time domain of the system. Here the symbolX denotes the arithmetic mean of the quantity X while X represents an estimation of the quantity. X∗

is the complex conjugate of X.

Standard deviation of the power signalThe sample standard deviation of the power signal sP is calculated as

sP =

√√√√ 1

N − 1

N∑i=1

(∆Pi −∆P

)2, (36)

where N is the number of samples, ∆Pi is the power signal at sample i, and ∆P is the arithmetic meanof ∆P .

Standard deviation of the frequency signalDefined analogously as for the power signal, the sample standard deviation of the frequency signal iscalculated as

sf =

√√√√ 1

N − 1

N∑i=1

(∆fi −∆f

)2. (37)

Quotient sP and sfThe quotient sP /sf is introduced as a possible measure of the amplification of the input signal within thesystem.

Relative arc length f(t)The arc length of the frequency signal is a measure of the variation of the signal over the measured timeperiod. The measure is normalized by the interval length to achieve the relative arc length, taken to be

lf =

∫ t1t0

√1 +

(d∆fdt

)2

dt∫ t1t0

1dt=

1

(t1 − t0)

∫ t1

t0

√1 +

(d∆f

dt

)2

dt . (38)

22

Relative arc length P(t)Defined analogously as for f(t), the relative arc length of P (t) is calculated as

lP =1

(t1 − t0)

∫ t1

t0

√1 +

(d∆P

dt

)2

dt . (39)

Quotient arc length P (t), arc length f(t)The quotient lP /lf is introduced as a possible measure of the quality of the control work exerted by thesystem.

Correlation coefficient P(t), f(t)The correlation coefficient for two variables is defined as the sample covariance normalized with thesample variances of each of the variables:

r =Cov(∆P,∆f)√

Var(∆P ) ·Var(∆f). (40)

The correlation coefficient is unit-less and takes values between −1 and 1.

Cross correlation P(t), f(t)The cross correlation is defined as

(f ? g)(t) =

∫ ∞−∞

f∗(t) · g(t+ τ)dt. (41)

It is a measure of the correlation of the two signals when the signal g(t) is time-shifted by τ units. It issimilar to the convolution of the signals, but without the time-reversal of one of the signals that is usedin the convolution process. Since the correlation for an ideal system is expected to be close to −1 [3],two key performance indicators is defined by finding the value on τ that minimizes the cross correlationtogether with the corresponding value on the cross correlation.

R2 value for the regression ∆P (t) = a · f(t) + bThe coefficient of determination R2 is a measure of the amount of variance of the dependent variablethat can be explained by the regression. For an ideal system the power signal is expected to be close toproportional to the frequency, which corresponds to a R2 value of 1 for the regression P (t) = a · f(t) + b.The R2 value is calculated as

R2 = 1−

∑Ni=1

(∆Pi − ∆Pi

)2

∑Ni=1

(∆Pi −∆Pi

)2 . (42)

Here ∆Pi is the power at sample i and ∆Pl the corresponding estimated value from the regression. ∆Pis the mean power.

4.2.2 Frequency domain

This section introduces the key performance indicators that have been defined in the frequency domain.The indicators are calculated by performing system identification in order to estimate the gain and thephase angle of the system F for different frequencies. This was initially implemented using two separatemethods, ARX and SRIVC, by use of the corresponding MATLAB functions arx() and tfest(). SRIVCwere later omitted as its accuracy were deemed worse than ARX and the time to perform a simulationwere seen to increase by 300 % with it included. Both methods try to fit a linear model to the provided

23

input and out signals – ARX uses a discretised model while SRIVC uses a continuous model [14]. Themethods both need to be provided with the number of zeros and poles to use in the estimation; for ARXthe number of zeros m has been set to m = 15 and the number of poles to n = 15 while SRIVC weretried with m = 4, and n = 8, as recommended in [14] for each method. From the estimations by ARXkey performance indicators has been calculated as follows.

Bandwidth

The angular frequency wB for which the amplitude gain is equal to√

1/2

Phase shift at bandwidthThe phase shift ϕs at the bandwidth of the estimated transfer function is defined as the phase angle ϕBat angular frequency wB minus the phase ϕε as ω goes to zero, i.e.

ϕs = ϕB − ϕε, ε→ 0. (43)

When calculated numerically ε = 10−4 has been used.

Estimated fit of system identification10 percent of the signal samples has been used to estimate how well the resulting transfer functioncorresponds to the measured data, by calculating the normalized root mean square error (NRMSE) as

NRMSE = 100

(1−||y − y||2||y − y||2

), (44)

where y is the verification data, y is the estimated data and y is the average over the verification data.The values of the normalized error are in the interval (−∞, 1], where a value of one corresponds to aperfect identification while decreasing values indicates increasing deviations between the estimation andverification data. When used as a key performance indicator the NRMSE value has been regularised bysetting all negative values to zero to reduce the interval and ensure numerical stability.

Active control workThe following expression is used as an approximate measure of the amount of control work done inantiphase by the system, i.e. the useful work which the system tries to counteract frequency deviationswith:

WA =

∫ ∞0

−Re{F (ω)}dω . (45)

The minus sign is taken in order to get positive values when the system produces some counteraction inaggregate over the frequencies. The integral has been numerically evaluated using the trapezoidal ruleand the output from the MATLAB function bode().

Reactive control workThe reactive control work, i.e. the amount of control applied orthogonally to the input and thus non-useful,is correspondingly approximated by

WR =

∫ ∞0

Im{F (ω)}dω . (46)

The integral has been numerically evaluated using the trapezoidal rule and the output from the MATLABfunction bode().

Bandwidth × Phase shift at bandwidthThe bandwidth times the phase shift at the bandwidth (wB × ϕs) has been introduced as an indicator

24

of the statistical interaction between these two values. The interaction is a measure of the relationshipbetween the two variables; it is large in magnitude when both the ingoing variables are large, and smallwhen at least one of the variables are small.

4.2.3 Mixed domain

This section introduces key performance indicators that do not fit into the time/frequency domain division.

Bandwidth × quotient arc length P (t), arc length f(t)The interaction between the bandwidth and the quotient of the arc length P (t) with the arc length f(t)is introduced as a key performance indicator. It is measure of the extent to which the bandwidth and thequotient of the arc lengths takes respectively large values at the same time. Calculated as ωB × lP /lf .

2Dpu

The parameter 2Dpu is calculated from the step response test performed during the pre-qualificationprocess and is used as an estimation of the backlash in the hydro power plant. If this indicator is tobe available the step response section of the pre-qualification process thus has to be performed, leavingout only the sine-sweeps. The indicator has not been used on the linear model as the value of 2Dpu isexpected to be close to zero when the backlash b is set to zero in the simulations. The parameter iscalculated as

2Dpu =2 · 2D

|∆P1|+ |∆P3|, (47)

with definitions as in equation (9).

4.3 Hyperparameter tuning

The hyperparameters of a machine-learning model are the parameters that describe the structure of themodel, e.g. the maximal height of a decision tree, and were introduced for each respective classifier inSection 3.3 above. This section describes the process used to select the values of the hyperparameters foreach classifier.

The optimal values on the hyperparameter for each method were obtained by performing 10-fold cross-validation on the combined train and validation data set. A manual grid search was performed for eachtype of classifier to find the combination of hyperparameters that maximises the classification accuracy onthe cross validation set. The test set was not used during the hyperparameter tuning to reduce the riskof overfitting on the hyperparameters. The test set was instead used to provide an independent estimateof the resulting classification accuracy after the hyperparameters had been selected.

25

Table 1. Examined hyperparameter choices during the tuning process, for each classifier respectively.

Classifier Hyperparameter Examined values

Decision tree Max depth {2, 4, 6, 8, 12, 16, 20, 24, 30}

Random forest Number of estimators {16, 24, 32, 40, 48, 56, 64, 80}

Max depth {2, 4, 6, 8, 12, 16, 20, 24, 30}

Min samples split {2, 4, 6, 8}

Adaboost Number of estimators {25, 37, 50, 62, 75, 100}

Support vector machine Kernel Radial, Linear, Polynomial (d ∈ {2, 3, 4})

C {5, 10, 15, 20, 25, 30}

γ, radial kernel {0.0625, 0.125, 0.25, 0.5, 1}

γ, linear/polynomial kernel {0.125,0.25,0.5,1,2}

Neural network Learning rate {0.05, 0.1, 0.2, 0.5}

Number of hidden layers {1, 2, 3, 4, 5}

Units in each hidden layer {8, 10, 12, 15, 18, 20, 25, 30}

Batch size {50, 100, 200, 500}

Train steps {500, 800, 1000, 1200, 1500, 2000}

The tested hyperparameters for each classifier is shown in Table 1 above. The hyperparameters wereinitially chosen by hand-tuning of each respective hyperparameter and then expanding the achievedinterval to be able to capture a near-optimal value even if the optimal value differs in magnitude, whilekeeping computational feasibility with regard to the number of examined hyperparameters.

4.4 Data management

4.4.1 Feature selection

Feature selection for tree based methods

A recursive feature selection procedure was developed in order to determine performance indicators withlow contribution to the resulting classification, inspired by results in [15]. The feature selection for the treebased methods – decision tree, adaboost and random forest – was performed by training a random forestclassifier on the training data and summing the reductions in Gini index that each feature contributedwith. The sum over all features was then scaled to one and the resulting value for each feature wastaken to be a measure of the relative importance of that feature. The feature with the lowest importancewas then eliminated and the process recursively repeated until only one feature remained. After eachelimination the classification accuracy was determined for a random forest classifier on the training setto evaluate the resulting effect of the selected features at that point. A 95 % confidence interval for theaccuracy after each rejected feature was obtained by performing bootstrap sub-sampling on the test data.The results from the described algorithm were then taken as a baseline of what could be achieved withfeature selection. The feature selection was thereafter hand tuned by manual inspection of the results tofurther reduce the number of selected features as well as increase the accuracy.

26

Feature selection for support vector machines

Feature selection for support vector machines were implemented using a SVM trained with a linear kernel,as suggested in [16]. When trained with a linear kernel, the magnitude of each weight wi in equation (26)represents the relative importance of that feature to the direction of the separating hyperplane. Weightsthat are smaller in magnitude are taken to be of less use for the classifier. Hence, the feature correspondingto the weight wi smallest in magnitude are eliminated in each iteration. After each elimination, a SVMwith the desired kernel is trained and bootstrap sampling performed to evaluate the results of theelimination on the kernel of interest. The hereby achieved feature selection was then hand-tuned toachieve maximal reduction in number of selected features as well as to maximise the accuracy.

Feature selection for neural networks

Feature selection for the neural network method was performed using the same feature ranking algorithmas for random forest, per suggestion in [11]. After each feature elimination the accuracy were evaluatedusing a neural network classifier and bootstrap sampling. The results achieved from the random forestalgorithm were then used as a basis for manual hand-tuning to achieve the best performance and maximalfeature reduction specifically for the neural network classifier.

4.4.2 Scaling and centring

The key performance indicators have been centred to mean zero and scaled to standard deviation one byperforming the following transformation elementwise

X =X − µxσx

, (48)

where X is a vector combining one KPI over all the samples, for example a vector of the arc length ofthe frequency signals, µx is the arithmetic mean of the vector and σx are the standard deviation of thevector. The resulting vector X thus contains the information of one KPI from all the samples scaled suchthat the vector X has standard deviation one and mean zero. This process is repeated separately foreach KPI. The values of µx and σx are known for the historical training set and have to be saved forfuture use of the classifier with newly acquired unscaled data.

4.5 Evaluation of classification quality

4.5.1 Bootstrapping

Bootstrapping is a resampling method that is applied to a data set in order to achieve an estimate ofthe probability distribution from which the data set has been generated. It has been used in this thesisin order to obtain an estimate of the test set accuracy of the classifiers together with a correspondingestimate of a confidence interval. It has also been used to construct the Adaboost and Random forestclassifiers.

The main principle behind the bootstrapping method is to take a known data set and randomly samplevalues from that data set with replacement (i.e. the same value may be selected several times), thusobtaining a bootstrapped version of the data set of the same size as the original. For example, if theoriginal data set is {1, 2, 3, 4}, a bootstrapped data set could be {2, 1, 2, 3}. The process of interest, e.g.calculating the classification accuracy, is then performed on the new, bootstrapped, data set. This processis repeated N number of times resulting in a results set of N results of the process, e.g. N estimationsof the classification accuracy based on N bootstrapped data sets. The resulting data set could then beused to estimate properties of the process that has been performed. For example, to obtain a confidence

27

interval for the classification accuracy. The number N should typically be taken as large as possible withconsideration to the computing resources available.

4.5.2 Accuracy assessment table

Table 2 below presents an example of an accuracy assessment table as used throughout this report toassess and compare the achieved accuracy after feature selection has been made.

Table 2. Example of an accuracy assessment table.

Classifier Accuracy (%) Std. dev. (%) 95 % CI (%) Hyperparameters

Decision tree 79 2.5 75-84 Max depth=8

The accuracy is a measure of the fraction of samples that were correctly classified, equivalent to theerror rate in equation (21). Standard deviation and the 95 % confidence interval (CI) were calculatedusing 300-fold bootstrap sampling on the test data, with the train data unmodified. The hyperparametercolumn shows the 10-fold cross-validated hyperparameter choices after feature selection has been made.

4.5.3 Confusion matrix

Table 3 below presents an example of the confusion matrices that has been used to analyse variousaspects of the classification accuracy, in addition to the overall accuracy of the classifier. The data isobtained by using the already trained classifier on the test data and determining the different kinds ofmisclassifications that has occurred.

Table 3. Example of a confusion matrix as used in this report.

Precision (%) Recall (%) F1-score (%) Support

Not approved 91 90 91 119

Approved 91 92 91 125

Aggregated 91 91 91 244

For each category of classification — approved or not approved — several accuracy measures has beenused, with definitions in equation (49) and examples in Table 4 below:

Precision =tp

tp+ fp, Recall =

tp

tp+ fn, F1 = 2 · precision · recall

precision + recall. (49)

A positive sample (p) is a sample that has been given the classification of interest, in this context eitherapproved or not approved. A sample that has been given any other classification is denoted as a negative(n). If a positive sample is correctly classified it is called a true positive (tp), otherwise a false positive(fp). A negative sample that should have been positive is a false negative (fn) while a sample that hascorrectly been put outside of the class of interest is seen as a true negative (tn).

In this context the accuracy measure in section 4.5.2 may then be taken as

Accuracy =tp+ tn

tp+ fp+ tn+ fn. (50)

28

Table 4. Explanation of concepts used in confusion matrices with corresponding examples.

Measure Explanation Example

Precision

The number of samples that has correctly beengiven a specific classification divided by thetotal number of samples that has been giventhat classification

117 samples have been classified asnot approved by the model. Out ofthose, 107 is correctly classified. Theprecision is then 107/117 = 91 %

Recall

The number of samples that has correctly beengiven a specific classification divided by the totalnumber of samples that should have been giventhat classification per the answer key.

107 samples have been correctly clas-sified as not approved by the model.The actual number (support) of notapproved samples is 119. The recallis then 107/119 = 90 %

F1-Score

The harmonic mean of precision and recall:

F1 = 2 · precision · recall

precision + recall

F1 = 2 · 91 · 90

91 + 90= 90 %

SupportThe total number of samples in a category perthe answer key

119 samples should have been clas-sified as not approved by a perfectclassifier.

Note that the precision, recall and F1-score all take values in the range 0–100 %

4.5.4 Classifier parameter distributions

The reliability of the classifiers have been evaluated by calculating the distributions of three parametersexpected to influence the behaviour of a power plant to ensure physical reasonability of the classifiers.An example of such a parameter distribution is presented in Figure 15 below.

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Kp · ep, approved Ti, approved 2Dpu, approved

Classifier

Normalized parameter distributions after classification

Figure 15. Example of a parameter distribution histogram as used in this thesis.

29

Three parameters are examined in these histograms:

• Kp · ep ,

• Ti =1

ep ·Ki,

• 2Dpu = b · ep ·f0

df,

where b is the backlash parameter in the simulations. The histograms are separated for approved andnon-approved samples. For each parameter one column corresponds to one classifier, where the targetcolumn denotes the expected distribution per the answer key from the simulations. Each column isseparately normalized to sum to one. Note that the colour coding in the histograms changes with themaximal value for that specific histogram and thus comparing values between histograms by comparingonly colour values is not a well-defined operation. Instead the colour has to be converted to the actualvalue using the supplied colour key.

30

5 Linear model

5.1 Simulations

The initial training and testing data were obtained by performing simulations on the linear model asillustrated in Figure 4 in Section 2.2.3. Three parameters were subjected to a parameter sweep asdescribed in Table 5 below while the rest of the parameters in the model were held constant with values asdepicted in Table 6. The input and output signals of the model were as described in Table 7 below. 600simulations were performed with randomly selected parameters from independent uniform distributionsto serve as input for the machine-learning algorithms. A typical 70 %/30 % into train and test data wouldlead to 180 samples in the test set, which were deemed too few to perform meaningful statistical analyseson. Hence, 300 samples were randomly chosen to serve as combined training and validation data in 10fold cross validation. The remaining 300 simulations were withheld from the algorithms to serve as testdata. Simulations that exhibited signs of numerical instability were omitted from further analysis.

The parameters described in Table 5, as well as the range of each parameter, were taken from a simulationon the FCP-project’s reference model [3]. The parameter set for each individual simulation was randomlysampled such that the expected number of qualified samples would be at 50 %, per the qualificationresults from the FCP-simulation.

The power imbalance signal for the simulations was taken from the RAR-profile [17], which is an estimationof power imbalance over 93 non-consecutive days. For each simulation, a two hour interval was randomlyselected from the profile to act as the disturbance. Thus, different disturbance signals were used for eachsimulation with a distribution corresponding to the RAR-profile.

Table 5. Range of values used in parameter sweep in the linear model. The parameters were for eachsimulation randomly selected from independent distributions such that about 50 % of the samples would bequalified. The per unit values is taken in the machine base.

Parameter Description Range Step size Unit

Kp Proportional coefficient PI-regulator 1-10 0.5 p.u.

Ti Integral time constant PI-regulator 10-100 10 s

ep Droop 2-12 2 %

31

Table 6. Values of parameters held constant during simulations of the linear model. The per unit values istaken in the machine base.

Parameter Description Value Unit

b Backlash 0 p.u.

T Simulated time interval 7200 s

∆t Sample interval 0.2 s

Y0 · Tw Gate set point × time constant water ways 1.5 p.u.

RG Gate slew rate 0.05 %/s

Ts Gate servo time constant 0.2 s

dP Scale factor total FCR-N capacity 600 MW

df Scale factor for normal frequency band 0.1 Hz

f0 Nominal frequency 50 Hz

Ek Kinetic energy at nominal frequency 190 000 MWs

Sn System load 42 000 MW

H Inertia constant 4.5 MWs/MW

k Load frequency dependency 0.1 %/Hz

Table 7. Specification of the input and output signals of the model.

Parameter Description Unit

∆P FCR-N contribution to the system p.u.

∆f Deviation from nominal frequency p.u.

w Power imbalance p.u.

r Reference value for frequency deviation ∆f p.u.

5.2 Results


Ranking of the usefulness of the key performance indicators were performed as described in section 4.4.1above with results presented in Table 8, Table 9 and Table 10 below, as well as in Figure 16.

32

Table 8. Result of the feature selection process for the tree based classifiers. The key performance indicatorsare approximately ranked with increasing contribution to classification accuracy. Non-selected features arenot shown.

# of KPIs Added KPI DomainAccuracy

(%)Std. dev.

(%)95 % CI

(%)

1 Active control work Frequency 62 2.6 57-67

2 Bandwidth Frequency 82 2.2 77-86

3 Std.dev P(t)Std.dev f(t)

Time 80 2.5 74-84

4 Relative arc length f(t) Time 78 2.3 74-83

5 Reactive control work Frequency 80 2.1 76-84

6 Std. dev P(t) Time 85 2.0 81-89

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Number of selected features

0

10

20

30

40

50

60

70

80

90

100

Tes

t se

t ac

cura

cy

Impact of feature selection on test set accuracy

Figure 16. Illustration of the change in test set accuracy as the amount of chosen features changes. Theshaded region represents a 95 % confidence interval for the accuracy while the dashed blue line shows thefinal number of selected features. Corresponds to Table 31 in Appendix B.

Table 9. Result of the feature selection process for the support vector machine classifier. The key performanceindicators are approximately ranked with increasing contribution to classification accuracy. Non-selectedfeatures are not shown.


(%)Std. dev.

(%)95 % CI

(%)


2 Bandwidth × phase loss Frequency 69 2.4 64-74


Time 69 2.7 64-75

4 Rel.arc length P(t)Rel.arc length f(t)

Time 73 2.4 69-78

5 Std. dev P(t) Time 83 2.2 78-87

33

Table 10. Result of the feature selection process for the neural network classifier. The key performanceindicators are approximately ranked with increasing contribution to classification accuracy. Non-selectedfeatures are not shown.


(%)Std. dev.

(%)95 % CI

(%)



Time 72 2.8 66-77


5.2.2 Comparison of classifiers

Table 11 below presents the accuracy of the examined classifiers on 244 samples of previously unseen testdata together with 10-fold cross validated hyperparameter choices. As a comparison, Table 12 presentsthe corresponding results if no feature selection is made. The values for accuracy, standard deviation and95 % confidence interval were obtained by performing bootstrap sampling on the available test data.

Table 11. Classifier performance on the linear model after feature selection together with cross validatedchoices on hyperparameters.



Random forest 85 2.0 81-88Estimators=64,Max depth=8,

Min samples split=4

Adaboost 83 2.1 79-87 Estimators=50

Support vector machine 87 2.0 83-91Kernel=Radial,

C=20,γ=.25

Neural network 88 2.5 82-92

Learning rate=0.1,hidden layers=[12,12],

batch size=100,train steps= 1000

Table 12. Comparison of classifier performance on the linear model with and without feature selection.

After feature selection Before feature selection

Classifier Accuracy (%) 95 % CI (%) Accuracy (%) 95 % CI (%)

Decision tree 79 75-84 69 64-75

Random forest 85 81-88 82 78-87

Adaboost 83 79-87 81 76-85

Support vector machine 87 83-91 82 78-86

Neural network 88 82-92 88 83-89

34

5.2.3 Confusion matrices

Table 13. Confusion matrix for(Decision tree / Random forest / Adaboost / Support vector machine / Neural network)

Precision (%) Recall (%) F1-Score (%) Support

Not approved 85 / 89 / 85 / 89 / 92 75 / 82 / 83 / 88 / 85 80 / 85 / 84 / 88 / 88 164

Approved 74 / 80 / 80 / 86 / 84 85 / 88 / 82 / 87 / 91 79 / 84 / 81 / 86 / 87 136

Aggregated 80 / 85 / 83 / 87 / 88 79 / 85 / 83 / 87 / 88 79 / 85 / 83 / 87 / 88 300


Figure 17 below shows the relationship between the parameters Kp, Ki, ep and the classification resultsfor each classifier. As a comparison, the distribution of the parameters for approved and non-approvedsamples per the simulation answer key are supplied in the Target column. A more detailed specificationof how the histograms were calculated is found in Section 4.5.4.

35

Kp · ep, approved

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25


Kp · ep, non-approved

Ti, approved

Ti, non-approved

Classifier

Classifier

Figure 17. Normalized parameter distributions for Kp · ep and Ti = 1Ki·ep

, separated into approved and

non-approved samples for each classifier respectively. Target is the distribution per the simulation resultsand is included for comparison. Each distribution separately normalized to sum to one.

36

5.3 Discussion of model and results

This section contains a short discussion of the results specifically for the linear model. A more generaldiscussion concerning the over-all results is found in chapter 9.

• Feature selection

– The 17 available features may be reduced drastically while maintaining accuracy. The finalnumber of selected features is of the order 3 to 6, depending on method.

– Slight increases in accuracy are seen after feature selection, for all classifiers but the neuralnetwork. The changes are statistically significant only for decision tree on the current test set.

– The gains of feature selection were the largest for decision tree and support vector machine:DT gains 10 %-points while SVM gains 5 %-points.

– The three features selected for the neural network is shared with the rest of the methods,although support vector machine uses bandwidth × phase loss instead of just bandwidth

• Accuracy comparison

– No statistically significant differences are shown for the test set accuracy after bootstrappingand feature selection.

– Decision tree is the worst performer. This is seen with significance only before feature selection.

• Confusion matrices

– Precision tends to be higher for not approved with recall being higher for approved samples,for all methods. This indicates a slight tendency for the methods to be too generous withapproval.

• Classifier parameter distributions

– The over-all result is that the parameter distributions conform well to the target from theanswer key.

– Adaboost shows some shift from the bin 0.3–0.4 for Kp · ep, approved, to 0.4–0.5.

– All methods tend to have some problems with the sharp peak at 40–60 in Ti, approved, suchthat the normalized distributions are smoothed out. When compared with the distribution ofthe non-approved it seems that this is because some of the values in the interval that shouldhave been approved end up with a classification of non-approved. Another possible explanation,with less support in the data, is that samples that should have been non-approved with Tiin the range 60-100 have crept into the approved set. The normalization process would thenprovide the smoothing.

37

6 Non-linear model

6.1 Simulations

The training and test data were obtained by performing simulations on the non-linear model as illustratedin Figure 4 in Section 2.2.3. Four parameters were subjected to a parameter sweep as described in Table14 below while the rest of the parameters in the model were held constant with values as depicted in Table15. The input and output signals of the model were the same as for the linear model and described inTable 7. 1800 simulations were performed with randomly selected parameters from independent uniformdistributions to serve as input for the machine-learning algorithms. 1260 of the simulations were randomlyselected to serve as training data, with the remaining 540 samples withheld for use as test data.

The parameters described in Table 5, as well as the range of each parameter, were taken from a simulationon the FCP-project’s reference model [3]. The parameter set for each individual simulation was randomlysampled such that the expected number of qualified samples would be at 50 %, per the qualificationresults from the FCP-simulation.

The power imbalance signal for the simulations was taken from the RAR-profile [17], which is an estimationof power imbalance over 93 non-consecutive days. For each simulation, a two hour interval was randomlyselected from the profile to act as the disturbance. Thus, different disturbance signals were used for eachsimulation with a distribution corresponding to the RAR-profile.

Table 14. Range of values used in parameter sweep in the non-linear model. The parameters were for eachsimulation randomly selected from independent distributions such that about 50 % of the samples would bequalified. The per unit values is taken in the machine base.

Parameter Description Range Step size Unit

Kp Proportional coefficient PI-regulator 1-10 0.5 p.u.

Ti Integral time constant PI-regulator 10-100 10 s

ep Droop 2-12 2 %

b Backlash 0-0.012 0.001 p.u.

38

Table 15. Values of parameters held constant during simulations of the non-linear model. The per unitvalues is taken in the machine base.

Parameter Description Value Unit

T Simulated time interval 7200 s

∆t Sample interval 0.2 s

Y0 · Tw Gate set point × time constant water ways 1.5 p.u.

RG Gate slew rate 0.05 %/s

Ts Gate servo time constant 0.2 s

dP Scale factor total FCR-N capacity 600 MW

df Scale factor for normal frequency band 0.1 Hz

f0 Nominal frequency 50 Hz

Ek Kinetic energy at nominal frequency 190 000 MWs

Sn System load 42 000 MW

H Inertia constant 4.5 MWs/MW

k Load frequency dependency 0.1 %/Hz

6.2 Results


Ranking of the usefulness of the key performance indicators were performed as described in section 4.4.1above with results presented in Table 16, 17 and 18 below.

Table 16. Result of the hand tuned feature selection process for the tree based classifiers. The keyperformance indicators are approximately ranked with increasing contribution to classification accuracy.Non-selected features are not shown.


(%)Std. dev.

(%)95 % CI

(%)


2 2Dpu n/a 76 1.9 73-80


4 Relative arc length P(t) Time 81 1.7 78-84


39

Table 17. Result of the feature selection process for the support vector machine classifier. The keyperformance indicators are approximately ranked with increasing contribution to classification accuracy.Non-selected features are not shown.


(%)Std. dev.

(%)95 % CI

(%)



Time 65 2.0 61-69

3 2Dpu n/a 66 2.2 62-70

4 R2 Time 72 2.0 68-76

5 Std. dev P(t) Time 81 1.7 78-85




(%)Std. dev.

(%)95 % CI

(%)



3 2Dpu n/a 71 1.6 69-72


Time 73 3.1 66-74




Time 75 2.6 70-77

8 Bandwidth ×rel.arc length P (t)/f(t)

Combined 74 2.1 72-77

9 Std. dev P(t) Time 79 2.5 74-81


Table 19 below presents the accuracy of the examined classifiers on 540 samples of previously unseen testdata together with 10-fold cross validated hyper parameter choices. As a comparison, Table 20 presentsthe corresponding results if no feature selection is made. The values for accuracy, standard deviation and95 % confidence interval were obtained by performing bootstrap sampling on the available test data.

40

Table 19. Classifier performance on the non-linear model after feature selection together with cross validatedchoices on hyper parameters.




Min samples split=6


Support vector machine 82 1.8 79-87Kernel=Radial,

C=20,γ=.25




Table 20. Comparison of classifier performance on the non-linear model with and without feature selection.



Decision tree 80 77-84 80 77-83

Random forest 82 78-85 81 78-85

Adaboost 80 77-83 80 77-84






Not approved 84 / 83 / 82 / 86 / 82 76 / 79 / 78 / 75 / 75 80 / 81 / 80 / 80 / 79 272

Approved 78 / 80 / 79 / 80 / 79 85 / 84 / 82 / 89 / 86 81 / 82 / 80 / 84 / 82 267

Aggregated 81 / 82 / 80 / 83 / 81 81 / 82 / 80 / 82 / 81 80 /82 / 80 / 82 / 81 539


Figure 18 below shows the relationship between the parameters Kp, Ki, ep, b and the classification resultsfor each classifier. As a comparison, the distribution of the parameters for approved and non-approved

41

samples per the simulation answer key are supplied in the Target column. A more detailed specificationof how the histograms were calculated is found in Section 4.5.4.

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

Kp · ep, approved


Ti, approved

Ti, non-approved

2Dpu, approved

2Dpu, non-approved

Classifier

Classifier


Figure 18. Normalized parameter distributions for Kp · ep, Ti = 1Ki·ep

and 2Dpu = b · ep · f0/df , separated

into approved and non-approved samples for each classifier respectively. Target is the distribution per thesimulation results and is included for comparison. Each distribution separately normalized to sum to one.

42


This section contains a short discussion of the results specifically for the non-linear model. A more generaldiscussion concerning the over-all results is found in chapter 9.

• Hyperparameters

– Some changes in hyperparameter selection are seen compared to the linear model. Randomforest was optimally used with half the number of estimators and slightly more minimal samplesat a split; Adaboost increased the number of estimators, while neural network increased thehidden units in each layer form 12 to 20. All other hyperparameters remain the same aspreviously. All hyperparameters are of the same order of magnitude for both the linear andnon-linear models. The differences between the linear and non-linear models are considered tobe small and insignificant.


– The 18 available features may be reduced drastically while maintaining accuracy. The numberof selected features was in the range 5 to 9, up from 3 to 6.

– Neural network goes from the lowest number of selected features at 3 for the linear model tothe highest at 9 for the non-linear model. The data do not support any specific conclusionsregarding why this happens.

– The gains from performing feature selection are less clear compared to for the linear model.Neural network loses some accuracy after feature selection. This result might indicate that thefeature selection routine for NN is insufficient.

– The non-linear model is the first model for which the indicator 2Dpu is introduced. The resultsindicate that this feature is of use for all the methods.


– The non-linear model still shows no statistically significant differences between the methodson the test set accuracy after bootstrapping and feature selection.

– All of the methods but decision tree lose some accuracy after introduction of backlash. DTinstead gains one percentage point.

– Neural network, support vector machine, and random forest is still the best performing methods,if NN taken without feature selection.



– This is the same tendency as was seen for the linear model.


– The non-linear model is the first model where backlash/2Dpu has been used, and thus the firstfor which the distribution of that parameter can be addressed.

– The general result is that the parameter distributions conform quite well to the target fromthe answer key.

43

– The variations are larger than what was seen for the linear model.

– Adaboost no longer shows the shift from the bin 0.3–0.4 for Kp · ep, approved, to 0.4–0.5, thatwere seen for the linear model.

– The problems with the peak at 40–60 in Ti, approved, that all the methods had are lesspronounced. Some problems still remain with support vector machine and especially with theneural network, where too many samples have been classified as non-approved in the region.

– All of the methods miss a peak in Ti, non-approved in the bracket 70–80, which is smoothedout in all cases.

– Over-all, most of the inaccuracies from the linear model have been replaced in the non-linearmodel by others, but the problem at 40–60 in Ti, approved remains, albeit less pronounced.

44

7 Non-linear model with noise

7.1 Simulations

The simulations on the non-linear model — as described in the preceding chapter — were extended byintroducing measurement inaccuracies and limitations in measurement precision corresponding to themaximal allowed errors per the requirements of the FCP-project [7]. The simulations were performed asfor the original non-linear model while adding random uniform noise and rounding to the signals ∆fand ∆P respectively. The power signal ∆P were subject to noise constructed with mean zero and limits±0.5% of the machines rated power Pn [MW], and then rounded down to the closest multiple of 0.1 MW.The frequency signal ∆f were subject to noise constructed with mean zero and limits ±10 mHz androunded down to the closest multiple of 1 mHz. I.e. the signals with noise were constructed as

∆f = ∆f + vf [Hz], ∆P = ∆P + vP [MW] (51)

vf ∼ U(− 10

1000,

10

1000

), vP ∼ U

(− 0.5

100· Pn,

0.5

100· Pn

)(52)

Apart from this, the simulations with added signal errors were handled identically to the simulationswithout noise. Figure 19 below illustrates the results of adding noise to the signals.

Figure 19. Illustration of the power and frequency signals before (left) and after (right) measurementinaccuracies and precision limits has been introduced.

45

7.2 Results


Table 22. Result of the hand tuned feature selection process for the tree based classifiers. The keyperformance indicators are approximately ranked with increasing contribution to classification accuracy.Non-selected features are not shown.


(%)Std. dev.

(%)95 % CI

(%)

1 2Dpu n/a 72 2.0 68-75

2 Bandwidth × phase loss Frequency 76 2.0 71-79



Table 23. Result of the feature selection process for the support vector machine classifier. The keyperformance indicators are approximately ranked with increasing contribution to classification accuracy.Non-selected features are not shown.


(%)Std. dev.

(%)95 % CI

(%)



3 2Dpu n/a 70 1.9 66-73


Time 71 1.9 67-74

5 R2 Time 76 1.9 72-79


7 Cross correlation P(t), f(t) Time 78 1.7 75-81

8 Std. dev P(t) Time 80 1.8 77-84

46



(%)Std. dev.

(%)95 % CI

(%)

1 2Dpu n/a 68 1.8 65-70


3 Std. dev P(t) Time 69 1.7 67-72



Time 75 3.1 71-80

6 R2 Time 79 1.1 77-80


Table 25 below presents the accuracy of the examined classifiers on 538 samples of previously unseen testdata together with 10-fold cross validated hyper parameter choices. As a comparison, Table 26 presentsthe corresponding results if no feature selection is made. The values for accuracy, standard deviation and95 % confidence interval were obtained by performing bootstrap sampling on the available test data.

Table 25. Classifier performance on the non-linear model with noise after feature selection together withcross validated choices on hyper parameters.




Min samples split=4


Support vector machine 80 1.7 77-83Kernel=Radial,C=20,γ=.125




47

Table 26. Comparison of classifier performance on the non-linear model with noise with and without featureselection.



Decision tree 74 70-77 71 67-75

Random forest 78 75-82 76 72-79

Adaboost 74 70-77 75 72-79






Not approved 77 / 78 / 74 / 81 / 82 68 / 77 / 73 / 79 / 75 72 / 77 / 74 / 80 / 79 271

Approved 71 / 77 / 73 / 79 / 79 79 / 78 / 75 / 81 / 86 75 / 77 / 74 / 80 / 82 267

Aggregated 74 / 77 / 74 / 80 / 81 74 / 77 / 74 / 80 / 81 74 / 77 / 74 / 80 / 81 538


Figure 20 below shows the relationship between the parameters Kp, Ki, ep, b and the classification resultsfor each classifier. As a comparison, the distribution of the parameters for approved and non-approvedsamples per the simulation answer key are supplied in the Target column. A more detailed specificationof how the histograms were calculated is found in Section 4.5.4.

48

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

Classifier

Classifier

Kp · ep, approved


Ti, approved 2Dpu, approved

Ti, non-approved 2Dpu, non-approved





49


This section contains a short discussion of the results specifically for the non-linear model with noise. Amore general discussion concerning the over-all results is found in chapter 9.

• Hyperparameters

– Only small changes were seen in hyperparameter selection compared to earlier models. Adaboostand neural network returned to the choices from the linear model, with random forest partiallyreturning. The maximal depth for decision tree decreased slightly to 6 from 8 for the precedingmodels. For support vector machine the hyperparameter γ is halved. All hyperparameters areof the same order of magnitude as for both the linear and non-linear models. The differencescompared to the non-linear model are small and most is a return to the hyperparameterchoices for the linear model. All variations between the linear, non-linear and noise model areconsidered to be small and insignificant.


– The 18 available features may be reduced drastically while maintaining accuracy. The numberof selected features was in the range 4 to 8.

– Random forest and neural network uses fewer selected features than for the non-linear model,while support vector machine increases in number of selected features. The effect of this isthat now SVM uses the largest number of features. No discernible pattern is seen regardingthis and hence the fluctuations are taken to be stochastic in nature.

– Decision tree, random forest and support vector machine make gains from the feature selection,with the changes close to significant. Adaboost sees a small loss while the neural network isunaffected. The confidence interval for NN before and after feature selection indicates thatperhaps a small increase in accuracy happens for NN, but this effect is far from significant.Compared to the preceding model NN no longer loses in accuracy after feature selection.

– 2Dpu is still of use for all the methods.


– This time statistical significance for the differences between decision tree and adaboost versussupport vector machine.

– Random forest, support vector machine, and neural network lose 2–4 %-points compared tothe non-linear model without noise. Decision tree and adaboost loses 6 %-points.

– Neural network, support vector machine, and random forest still perform best. The differencefrom SVM to decision tree and adaboost respectively is for this model shown with statisticalsignificance, after feature selection.



– This is the same tendency as was seen for the linear and non-linear models, although lessenedthis time.

50



– The variations are of similar type and magnitude as for the non-linear model without noise.

– Adaboost does not show the shift from the bin 0.3–0.4 for Kp · ep, approved, to 0.4–0.5, thatwere seen only for the linear model. This indicates a random fluctuation.

– The irregularities with the peak at 40–60 in Ti, is similar to before. The problem were supportvector machine and neural network have too many samples classified as non-approved in theregion is now also seen for the remaining methods.

– All of the methods miss a peak in Ti, non-approved in the bracket 70–80, which is smoothedout in all cases. This deviation from the target is slightly less pronounced than before.

– Over-all, most of the inaccuracies vary between the models, but the larger ones, like theproblem at 40–60 in Ti approved, remains from the non-linear model.

51

8 Sub-sampled non-linear model with noise

8.1 Simulations

The simulations on the non-linear model with measurement inaccuracies as depicted in the precedingchapter were extended to examine the effects of different sampling intervals of the signals ∆f and ∆Prespectively. 1800 simulations were performed with a sampling interval of ∆t = 0.2 s. Three sets of 1800simulations each were then constructed by taking the results and independently add noise for each set.The second and third sets were low-pass filtered and down sampled to a sampling interval of 1 s and3 s respectively. The same hyperparameter selection was used for the down-sampled models as for theoriginal noise model, for all methods.

8.2 Results


Table 28. Comparison of selected features for each algorithm at different signal sampling intervals.

Classifier type Sampling interval (s)

0.2 1 3

Tree algorithms • 2Dpu

• Active control work• Bandwidth × phase loss• Relative arc length P(t)—

• 2Dpu

• Active control work• Bandwidth × phase loss• Relative arc length P(t)• Std. dev P(t)

• 2Dpu

• Active control work• Bandwidth × phase loss• Relative arc length P(t)• Std. dev P(t)

Support vector machine1 • 2Dpu

• Active control work—• Cross-correlation f(t), P(t)• Reactive control work—• Rel. arc length P(t)• R2

—• Std. dev P(t)—

• Std.dev P(t)Std.dev f(t)

• 2Dpu

• Active control work—• Cross-correlation f(t), P(t)• Reactive control work————• Std. dev P(t)—


• 2Dpu

—• Bandwidth × phase loss• Cross-correlation f(t), P(t)—• Relative arc length f(t)——

• Rel.arc length P(t)Rel.arc length f(t)

—• Std. dev f(t)


Neural network2 • 2Dpu

• Bandwidth × phase loss—• Reactive control work——• R2

• Std. dev P(t)


• 2Dpu

—————• R2

• Std. dev P(t)


• 2Dpu

—• Correlation f(t), P(t)—• Relative arc length f(t)• Relative arc length P(t)—• Std. dev P(t)


1SVM: The list of selected features for the longest sampling interval differs from the preceding. The change in test setaccuracy if the set of features for the 1 s interval is used also for the 3 s interval is 1.6 percentage points downwards.

2NN: The change in test set accuracy if the set of features for the 1 s interval is used also for the 3 s interval is 0.9percentage points downwards.

52


Table 29. Classifier performance on the sub-sampled non-models with noise after feature selection togetherwith 95 %-confidence intervals.

Accuracy (%) 95 % Confidence interval (%)

Classifier ∆t = 0.2 ∆t = 1 ∆t = 3 ∆t = 0.2 ∆t = 1 ∆t = 3

Decision tree 74 72 76 70-77 69-76 72-80

Random forest 78 78 77 75-82 74-81 72-81

Adaboost 74 72 73 70-77 68-76 69-77

Support vector machine 80 80 79 77-83 76-83 75-83

Neural network 78 78 78 74-82 73-82 73-83


Figure 21 and 22 below shows the relationship between the parameters Kp, Ki, ep, b and the classificationresults for each classifier. As a comparison, the distribution of the parameters for approved and non-approved samples per the simulation answer key are supplied in the Target column. A more detailedspecification of how the histograms were calculated is found in Section 4.5.4.

53

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

Classifier

Classifier


Kp · ep, non-approved Ti, non-approved 2Dpu, non-approved

Normalized parameter distributions after classification(Δt = 1 s)




54

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

targe

t dt ab rfsv

m nn

Classifier

10

30

50

70

90

Ti

0

0.05

0.1

0.15

0.2

targe

tdtab rfsv

mnn

Classifier

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

2D

pu

0

0.05

0.1

0.15

0.2

0.25

0.3

Classifier

Classifier

Normalized parameter distributions after classification(Δt = 3 s)


Kp · ep, non-approved Ti, non-approved 2Dpu, non-approved




55


This section contains a short discussion of the results specifically for the sub-sampled non-linear modelswith noise. A more general discussion concerning the over-all results is found in chapter 9.

• Hyperparameters

– The same hyperparameters have been used as for the preceding model.


– The 18 available features may be reduced drastically while maintaining accuracy. For the threesub-sampled models the number of selected features varied in the range 4 to 8, in accordancewith previous results.

– The random forest feature selection, also used for decision tree and adaboost, is seen to bemostly stable. Support vector machine and neural network tends to more swing a lot more inthe selection. The changes in selection between models make a difference for SVM and NN, asnoted in the footnotes.

– No comparison with/without feature selection has been made for the sum-sampled models.From previous results, feature selection is expected to make a small but positive contributionto the accuracy.

– The feature 2Dpu is still selected by all the methods. The data up until this point suggeststhat 2Dpu strongly contributes to classification accuracy.


– The statistical significance in the difference between support vector machine and decisiontree and adaboost respectively remains for the two shorter sampling intervals. DT and ABrehabilitates somewhat for ∆t = 3 and thus the significances are lost.

– The general trend is a slight downward trend in accuracy as sampling interval increases inlength. The exception is decision tree, where the results are unstable and swing up and down.

– Neural network, support vector machine, and random forest still perform best, but this resultis not fully significant for this sample size.



– No conclusive differences is seen in parameter distributions between non-linear, non-linear withnoise and sub-sampled models. All variations are of similar type and magnitude.

– The irregularities regarding the peak at 50–60 in Ti, is similar to before. All of the modelsexhibit the same tendency.

– All methods miss a peak in Ti, non-approved in the bracket 70–80, which is smoothed outin all cases. This is somewhat expected, as all noise simulations share the same underlyingsignals, but with separately calculated noise.

– Over-all, most of the inaccuracies vary between the models, but the larger ones, like theproblem at 40–60 in Ti approved, remains from the non-linear model.

56

9 Discussion

The purpose of this thesis has been to examine the potential of complementing the evaluation of theproposed FCP requirements through the use of machine-learning methods applied to the input and outputsignals sampled during normal operation of a FCR-N providing power plant. The methodology to performsuch an examination has consisted of two main parts; the first being to perform a series of simulations ona set of models taken to represent a power plant, and the second being to apply a set of machine-learningmethods to the resulting simulation results. The output of the machine-learning classifications has thenbeen analysed to achieve an estimate of the general accuracy of the methods as well as to search forirregularities in the output indicating systematic errors in parts of the analysed domain.

Some assumptions, approximations as well as simplifications have been made during the process, especiallywhen performing the simulations of a power plant. The main simplification is that just a single powerplant is taken to provide the full FCR-N capacity of the system, with no special regard taken to how thatplant would interact with other FCR-providing plants as well as the rest of the system. The modelling ofthis individual power plant also contains approximations and simplifications in various parts, e.g. thelinearised water ways. Furthermore, assumptions have been made regarding the parameter set usedin the simulations regarding the parameter range and granularity, as well as which parameters whereheld constant and at which values. These assumptions may not necessarily hold validity outside of thesimulation model. Some of the harm caused by these assumptions, approximations and simplifications maybe rectified by the machine-learning algorithms, with their possible capability to construct classificationmodels that generalizes well to unseen data and new conditions. To heavily depend on such generalisationsthough, is to depend on what is in effect one kind of extrapolation, perhaps a perilous commitment.Because of these limitations in the simulations as well as the unknown capabilities of the machine-learningmethods to compensate, the hereby achieved results is to be taken rather as an indication of what kindof information that may be extracted from non-invasive testing of a power plant, than as a promise tobe representative of what the result would be analysing the present power system as a whole. Furtherrefinements to the methodology are needed to be able to claim the latter.

Under these reservations to the interpretation of the result the general conclusion is that the frequency andpower signals contain information correlated with the stability and performance of an FCR-N providingpower plant. It has been shown that some of this information transfer to the constructed set of keyperformance indicators. Finally, it has been shown that the resulting machine-learning models producesclassifications with parameter distributions that follows the expectation from the simulations, indicatingphysically reasonable classification results and possibly an ability of the models to generalize well. A moredetailed analysis follows of the discussions that have been carried out at the end of each model chapter.The division of the analysis has been made in correspondence with the division in the model chapters.

Classification accuracy

The classification accuracy aggregated over the simulation models is presented in Table 30 below.

Table 30. Aggregated accuracy results for all machine-learning methods on all simulation models.

(1s) (3s)

Decisiontree 79 80 74 72 76

Randomforest 85 82 78 78 77

Adaboost 83 80 74 72 73

Supportvectormachine

87 82 80 80 79

Neuralnetwork 88 80 78 78 78

Classifier Linear Non-linearNon-linearwithnoise

Sub-sampledwithnoise

57

The general trend for the classifiers random forest, support vector machine and neural network hasconsistently been decreasing classification accuracy as more advanced simulation models have been applied,with more noise and measurement uncertainties. Some inconsistencies appear for the classifiers adaboostand especially decision tree where the results vary up and down between the simulation models, a trendnot seen for the other classifiers. Adaboost and decision tree have also consistently been performing worseaccuracy-wise, albeit only sparingly shown with statistical significance on the limited test set. Thesetwo trends indicate that the algorithms adaboost and decision tree should be taken as less promising forclassification uses than the others. The remaining methods – RF, SVM, and NN – are taken to be ofabout equally promising nature for future work as they have been performing similarly with each methodtaking the top spot on a least one of the simulation models. The fact that the models have performed sosimilarly, in combination with a typical accuracy confidence interval width of about 7 %, indicates that alimiting factor has been the number of samples available for the methods to classify. To better separatethe accuracy of the classifiers, and to do so with statistical significance, a larger number of simulationswould be advisable.

Hyperparameter selection

Relatively stable hyperparameter selections were obtained between the models, with variations that weresmall, and inconclusive. Many of the changes between two models were reversed after the succeedingmodel was applied. The results indicate that the initial hyperparameter selection from the linear modelholds well for all the proceeding models, and hence that the hyperparameter selection routine is stable.

Feature selection

The feature selection algorithms resulted in drastically reduced number of selected features for all classifiersand over all the models. The 18 features were reduced to at the most 9 selected features at the higherend. At the lower end, 3 selected features were enough for the neural network to achieve 88 % accuracyon the linear model. The effect of feature selection on classification accuracy has been quite small,mostly increasing the accuracy slightly. NN has been more resilient to increases, mostly keeping the sameaccuracy before and after feature selection, with a small loss of accuracy on the non-linear model.

The random forest feature selection algorithm, used for all the tree based classifiers, has been mostlystable with regard to selected features. Only very small variations were seen over all the models withregard to both which features were selected and the number of features for each model, for a total of 8features selected at least once with each model varying between 4 and 6 selected features. The supportvector machine and neural network feature selections respectively swing more, with SVM having selected11 features at least once and between 5 and 8 each time. For NN, 11 features have been selected at leastonce, with the number of selected each time varying between 3 and 9. The changes in selection for SVMand NN affect the classification accuracy, if an earlier selection is used the loss was about 1− 2 %, onthe noise model, although the difference is not significant on this sample size. The conclusion is thatthe feature selection has been more stable for the tree based methods than for SVM and NN, possiblyindicating that the latter two methods are more sensitive to the presentation of the available informationcontent in the selected features.

The indicator 2Dpu is selected for all the methods on all the models, indicating high information content.The quotient of the standard deviations has for SVM and NN been used on all the models, while for therest of the methods it has been used only on the linear model. Features calculated from the frequencydomain have frequently been selected, indicating the importance of the ARX system identification function.

Confusion matrices

The confusion matrices on all models have consistently reported higher precision for samples that wereclassified as non-approved, and higher recall for samples that should have been approved. This indicatesa slight tendency to be too generous with approval. This trend was largest for the linear model, anddecreasing for the latter models. The main conclusion from the confusion matrices becomes that themethods have achieved their accuracy through a quite balanced trade-off on precision and recall, indicating

58

that the machine-learning models might have the ability to generalize well.

Classifier parameter distributions

Over-all the parameter distributions conform quite well to the target from the answer key, with mostlysmall deviations appearing in the data relative to the target. These deviations changes seemingly randombetween the models and do not seem to indicate any systematic errors by the methods. The resultsfrom the linear model contain fewer deviations than for the other models. No obvious differences areseen between non-linear, non-linear with noise and sub-sampled models. For the latter models it holdsthat all variations are of similar character and magnitude. That the linear model contains fewer andsmaller deviations combined with the conformity of the remaining models indicates that the models withnon-linearities has been harder to classify, either because of inherent uncertainties within the underlyingsignals or because of the relatively smaller number of samples with regard to the number of parametersvaried. Hence, the results might point to the number of simulation samples as a constraint on theachievable accuracy.

All methods have had some problems with Ti in the general region of 40− 80 s, with the exact placementvarying somewhat between the models. The general trend is that sharp peaks have been smoothedby the classifiers. Part of this effect is expected as random errors in classification combined with thenormalization in the histograms will make extreme values tend towards the mean. As this effect is morepronounced specifically for the Ti parameter indicates that part of this effect may be systematic, perhapsbecause the introduced KPIs does not contain enough information to accurately capture the dependenceof the stability and performance of the plant on the Ti parameter of the PI-regulator.

The conclusion from the parameter distributions is that smaller deviations arise from random noise inthe results such that the deviations changes between the models and simulations. Larger deviations areseen for the Ti parameter that could indicate systematic classification errors. The random errors couldperhaps be rectified by a larger sample size, while any systematic errors would have to be handled by theaddition of KPIs with new information content.

9.1 Suggestions for further research

This thesis has shown that under some model assumptions and approximations the frequency and powersignals from a power plant contains information that could be used to estimate the performance andstability for providing frequency containment. This has been done through calculation of key performanceindicators on the signals to characterise the power plant, indicators then fed to one of several machine-learning algorithms. Some suggestions will be made for optimising the process and putting it to use inthe real power grid.

To proceed with the herein suggested methodology one recommended action would be to examine anddetermine the purpose for which a machine-learning method would be used. This examination wouldfocus on the information that the method would need to provide, as well as the desired level of accuracy ofthat information. The next step could be to perform simulations that better represents the conditions forwhich the method would be applied, with special regard given to the composition of the simulation model,the parameter choices as well as input signals. The machine-learning method could then be suppliedwith training data that is more representative of the use for which it is intended, thereby increasing thevalidity of the achieved accuracy results and perhaps increase the models ability to generalize well tounseen but closely related conditions. During such simulations it would also be advisable to increase thenumber of simulations that are performed, as some signs have been exhibited that the models have beensomewhat starved for data samples.

After the intended use of the machine-learning approach has been determined, and a representative dataset of such use has been made available, the process should be optimised to maximise the accuracy ofclassification. This could be done with regard to the choices of machine-learning method, the featureand hyperparameter selection as well as the introduced set of key performance indicators. One notablecharacteristic of the indicators that have been stipulated in this thesis is that they all are global in

59

scope, i.e. they all take the full two hour signal in account. One interesting area to research further is ifindicators with varying time horizons, e.g. correlations between just parts of the signals, could be used toincrease the accuracy of the method.

The final suggestion for further research is that all the results here obtained, as well as all the resultsthat would come from the above given suggestion, are obtained from a simulation model and hence insome sense hypothetical in nature. At some point the results and machine-learning models will have to bevalidated towards real world data and the existing power grid for them to be applicable for real world use.

60

10 Conclusions

The aim of the project has been to examine the potential of machine learning as a complement toevaluate the proposed pre-qualification requirements on performance and stability for FCR-N per theFCP-project. The examined methods were to be non-invasive while keeping the physical interpretabilityand transparency with regards to the information handled by the algorithm and how it was used. Thishas been performed by developing a set of key performance indicators through use of information in thetime and frequency domain respectively. These indicators have been used as feature input to severalmachine-learning algorithms which were then evaluated for usefulness as classifiers for stability andperformance per the FCP-requirements. The selected machine-learning methods were: decision tree,adaboost, random forest, support vector machine, and a artificial deep neural network.

Several simulation models have also been developed in order to be able to calculate the performanceindicators and supply the machine-learning methods with data to classify. The methods have beenconsecutively more detailed starting with a linearised model and ending with a sub-sampled non-linearmodel with simulated noise. The classification accuracy started at up to 88 % on the linear model,steadily declining through increasing model complexity to end at up to 79 % on the non-linear model withnoise and 3 s sample interval. The methods random forest, support vector machine and neural networkhave consistently performed better on the supplied data sets, although the difference to decision treeand adaboost has been small and only sparingly statistically significant. The methods have all beenshown to be slightly too generous with classification in the sense that more samples that should havebeen approved actually did become classified as such.

The parameter distributions have indicated that the errors from the classifiers are to be seen mostlyas random. Some evidence exist to indicate that some systematic errors may have occurred regardingthe parameter Ti, possibly caused by information content regarding the influence from that parametermissing in the information supplied by the presently chosen key performance indicators.

The general conclusion is that the proposed methodology has potential to act as a complement to theevaluation process for the FCP-requirements. Further studies are needed to determine in what waymachine learning should be used in the process and the level of accuracy and precision that wouldbe needed for such applications, as well as which of the examined machine-learning methods is to beimplemented.

61

11 Literature

[1] G. R. Kirchhoff, “Ueber den durchgang eines elektrischen stromes durch eine ebene, insbesonere durcheine kreisformige,” in Mitglied des physikalischen Seminars zu Konigsberg, pp. 487–514, Annalen derPhysik und Chemie, 1845.

[2] E-Bridge, Analysis & Review of Requirements for Automatic Reserves in the Nordic SynchronousSystem - Final report. ENTSO-E, 2011.

[3] R. Eriksson et al., FCR-N Design of requirements. ENTSO-E, 2017.

[4] K. Walve, Kraftsystemets dynamik och dimensionering. Svenska kraftnat, 2007.

[5] E. Dahlborg, Pay for performance. Uppsala universitet, 2015.

[6] ENTSO-E, “Business glossary.” https://docstore.entsoe.eu/data/data-portal/glossary/

Pages/home.aspx, May 2018.

[7] FCP-Project, Technical requirements for frequency containment reserve provision in the Nordicsynchronous area. ENTSO-E, 2017.

[8] P. Domingos, “A unified bias-variance decomposition and its applications,” in Proceedings of theSeventeenth International Conference on Machine Learning, 2000.

[9] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and anapplication to boosting,” Journal of Computer and System Sciences, vol. 55, pp. 119–139, August1997.

[10] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297,September 1995.

[11] N. Vecoven, Feature selection with deep neural networks. University of Liege, 2017.

[12] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015.

[13] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of theFourteenth International Conference on Artificial Intelligence and Statistics, vol. 15, pp. 315–323,PMLR, April 2011.

[14] D. T. Duong et al., “Estimation of hydro turbine-governor system’s transfer function from pmumeasurements,” in IEEE Power and Energy Society General Meeting, pp. 1–5, IEEE, July 2016.

[15] S. Alvarez de Andres and R. Dıaz-Uriarte, “Gene selection and classification of microarray datausing random forest,” BMC Bioinformatics, vol. 7, January 2006.

[16] I. Guyon et al., “Gene selection for cancer classification using support vector machines,” MachineLearning, vol. 46, pp. 389–422, January 2002.

[17] E-Bridge, Analysis & Review of Requirements for Automatic Reserves in the Nordic SynchronousSystem - Simulink Model Description. ENTSO-E, 2011.

62

https://docstore.entsoe.eu/data/data-portal/glossary/Pages/home.aspx

https://docstore.entsoe.eu/data/data-portal/glossary/Pages/home.aspx

A Code libraries

Power plant simulations

MATLAB + Simulink https://mathworks.com

Decision tree, Random forest, Adaboost, Support vector machine

Scikit-learn http://scikit-learn.org

Neural network

Tensorflow https://www.tensorflow.org

Auxiliary libraries

Numpy http://www.numpy.org

Pandas https://pandas.pydata.org

63

https://mathworks.com

http://scikit-learn.org

https://www.tensorflow.org

http://www.numpy.org

https://pandas.pydata.org

B Full feature selection

B.1 Linear model

Table 31. Full feature selection for random forest on the linear model. The indicators are ranked withdecreasing contribution to overall classification accuracy.

# of KPIs Added KPI DomainContributionof new KPI

(%)

Accuracy(%)

Std. dev.(%)

95 % CI(%)

1 Active control work Frequency 100 62 2.6 57-67

2 Bandwidth - Arx Frequency 44 82 2.2 77-86


Time 31 80 2.5 74-84

4 Relative arc length f(t) Time 14 78 2.3 74-83

5 Reactive control work Frequency 15 80 2.1 76-84

6 Std. dev P(t) Time 11 85 2.0 81-89

7 Correlation f(t), P(t) Time 8.3 85 1.8 81-88


Combination 8.8 83 2.1 80-87

9 Bandwidth × phase loss Frequency 6.7 84 2.1 80-88

10 Std. dev f(t) Time 4.4 85 2.2 80-89

11 R2 Time 5.1 84 2.2 79-87

12 Phase loss Frequency 4.7 85 2.0 81-89

13 Relative arc length P(t) Time 4.5 83 2.1 79-87


Time 4.1 85 1.9 81-89

15 Cross correlation P(t), f(t) Time 3.9 84 2.1 80-83

16 Cross correlation time con-stant τ

Time 0.0 80 2.3 76-85

17 Est fit - Arx Frequency 0.0 85 2.0 81-89

64

B.2 Non-linear model

Table 32. Full feature selection for random forest on the non-linear model. The indicators are ranked withdecreasing contribution to overall classification accuracy.


(%)

Accuracy(%)

Std. dev.(%)

95 % CI(%)

1 Relative arc length f(t) Time 100 63 2.2 59-67

2 2Dpu n/a 50 76 1.8 73-79


4 Relative arc length P(t) Time 18 81 1.7 78-85

5 Reactive control work Frequency 13 80 1.9 76-84


Time 12 80 1.8 76-83

7 Bandwidth × phase loss Frequency 8.0 81 1.7 77-84

8 Phase loss Frequency 7.8 83 1.6 79-86

9 Std. dev P(t) Time 5.1 81 1.7 78-84



Combination 4.4 81 1.7 78-85

12 Bandwidth Frequency 4.4 81 1.6 78-84

13 Correlation P(t),f(t) Time 3.2 80 1.7 77-84


Time 3.4 82 1.7 78-85

15 R2 Time 3.2 81 1.7 78-84

16 Std. dev f(t) Time 2.5 81 1.7 78-84


Time 1.4 80 1.7 78-84


65

B.3 Non-linear model with noise

Table 33. Full feature selection for random forest on the non-linear model with noise. The indicators areranked with decreasing contribution to overall classification accuracy.


(%)

Accuracy(%)

Std. dev.(%)

95 % CI(%)

1 2Dpu n/a 100 72 1.9 68-76

2 Bandwidth × phase loss Frequency 49 75 1.8 71-78



Combined 19 75 1.8 71-78

5 Relative arc length P(t Time 12 78 1.9 73-81

6 Std. dev P(t) Time 9.8 77 1.9 74-81

7 Phase lag at bandwidth Frequency 9.0 77 1.9 74-81

8 Bandwidth Frequency 7.8 76 1.8 72-79

9 Reactive control work Frequency 6.0 76 1.9 72-80

10 R2 Time 6.0 77 1.7 73-80


Time 5.3 77 2.0 73-81


Time 5.4 77 1.8 73-81

13 Correlation f(t), P(t) Time 3.9 75 1.8 71-78

14 Relative arc length f(t) Time 3.6 78 1.8 75-82

15 Std. dev f(t) Time 2.5 77 1.8 73-80




Time 1.8 73 1.7 70-77

66

TRITA -SCI-GRU 2018:282

www.kth.se

A machine-learning approach to estimating the …...A machine-learning approach to estimating the...

Documents

Transcript of A machine-learning approach to estimating the …...A machine-learning approach to estimating the...