UK Transmission Datamining Methods for Demand Forecasting at National Grid Transco David Esp A...

UK Transmission. Datamining Methods for Demand Forecasting at National Grid Transco. David Esp. A presentation to the Royal Statistical Society local meeting of 24 February 2005 at the University of Reading, UK.


  • Slide 1

UK Transmission. Datamining Methods for Demand Forecasting at National Grid Transco. David Esp. A presentation to the Royal Statistical Society local meeting of 24 February 2005 at the University of Reading, UK.
Slide 2 Contents. Introduction: National Grid Transco (the company); Gas Demand & Forecasting. Datamining, especially Adaptive Logic Networks. Datamining for Gas Demand Forecasting: Framing the Problem; Data Cleaning; Model Inputs; Model Production. Scope for Improvement. Conclusions.
Slide 3 UK Transmission. Introduction to: National Grid Transco.
Slide 4 National Grid Transco (NGT). Part of the NGT Group (www.ngtgroup.com). NGT Group has interests around the globe, particularly in the US. NGT-UK consists of: National Grid (NG): electricity transmission (not generation or distribution); Transco (T): gas transmission.
Slide 5 UK Transmission. Introduction to: Gas Demand and its Forecasting at National Grid Transco.
Slide 6 Breakdown of Demand. National Transmission System (NTS): many large industrials; gas-fired power stations. 13 Local Distribution Zones (LDZs): mostly domestic. The presentation will focus on models for this level only.
Slide 7 Forecasting Horizons. Within day (at five different times). Day ahead. Up to one week ahead.
Slide 8 Gas Demand Daily Profiles.
Slide 9 What Factors Drive Gas Demand? Weather: thermostats; heat leakage from buildings; heat distribution in buildings (hot air rises); gas-powered plant efficiencies. Consumer behaviour: season (e.g. stay indoors when dark); holidays. Weather-influenced consumer behaviour: perception of weather (actual or forecast); adjustment of thermostats.
Slide 10 Weather. Temperature (1 °C = 5 to 6%). Wind (above 10 knots, 1 knot = 0.5%). Cooling power: wind-chill (a function of wind and temperature). (Plus straight, delayed and moving-average derivations of all the above.)
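The "straight, delayed and moving average derivations" mentioned on Slide 10 can be illustrated with a minimal sketch. The function names, the one-day lag, the three-day window and the temperature values are all assumptions chosen for illustration; the presentation does not state which lags or windows were actually used.

```python
from statistics import mean


def delayed(series, lag):
    """Series shifted by `lag` steps; the first `lag` entries are None (no history yet)."""
    return [None] * lag + list(series[:len(series) - lag])


def moving_average(series, window):
    """Trailing moving average over `window` steps; None until enough history exists."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(series[i + 1 - window:i + 1]))
    return out


# Hypothetical daily mean temperatures in degrees C (illustrative values only).
temperature = [4.0, 6.0, 5.0, 3.0, 2.0, 7.0]

features = {
    "temp": temperature,                         # straight
    "temp_lag1": delayed(temperature, 1),        # delayed by one day
    "temp_ma3": moving_average(temperature, 3),  # 3-day moving average
}
```

The same two helpers could be applied to wind speed or cooling power; each derived series then becomes one more candidate model input.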
Slide 11 Demand Temperature Relationships.
Slide 12 Temperature Effects.
Slide 13 Seasonal Temperature Sensitivity of Gas Demand. [Chart axes: millions of cubic metres (mcm); percentage change in demand.]
Slide 14 Consumer Behaviour. Seasonal transitions (autumn and spring). Bank holidays (typically -5 to -20% variation). Adjustment of thermostats & timers in (delayed) response to weather, e.g. protracted or extreme cold spells. Weather forecast effects. Special events.
Slide 15 UK Transmission. Introduction to Datamining: What & Why.
Slide 16 Datamining. A generally accepted definition: "The non-trivial extraction of implicit, previously unknown and potentially useful information from data" [Frawley, Piatetsky-Shapiro & Matheus]. In practice: the use of novel computational tools (algorithms & machine power). Information may include models, such as neural networks. A higher-level concept, of which datamining forms a (key) part: Knowledge Discovery from Databases (KDD). Relationship: Knowledge > Information > Data.
Slide 17 Datamining Techniques. What are they? Relatively novel computer-based data analysis & modelling algorithms. Examples: neural nets, genetic algorithms, rule induction, clustering. In existence since the 1960s, popular since 1995. What advantages have they over traditional methods? More automatic: less reliance on forecasting expertise; fewer man-hours (more computer-hours). Potentially more accurate: new kinds of model, more accurate than existing ones; greater accuracy overall, when used in combination with existing models. Knowledge discovery might lead to improvements in existing models.
Slide 18 Data cleaning: Self-Organizing Map, used to highlight atypical demand profiles and cluster typical ones. Adaptable (nonlinear nonparametric) modelling: Adaptive Logic Network (ALN). Automatically produces models from data.
Better than a neural network. Input selection: Genetic Algorithm (GA). Selects the best combination of input variables for the model. Also optimizes an ALN training parameter, the learning rate. (Slide 18 title: Core Methods & Tools.)
Slide 19 Experience. 1995-1999: financial, electrical & chemical problems. 1999: diagnosis of oil-filled equipment (e.g. supergrid transformers) by Kohonen SOM. 2000: electricity demand forecasting (encouraging results; business need disappeared). 2001-2: EUNITE datamining competitions. 2003: gas demand forecasting experiments. 2004: gas demand forecasting models in service. 2005: more gas models, also focusing on wind power.
Slide 20 UK Transmission. Introduction to Datamining: Nonlinear Nonparametric Models. The core datamining method applied to gas demand forecasting.
Slide 21 Some Types of Problem. Linear, e.g. y = mx + c. Non-linear and smooth. Monotonic, e.g. y = x^3. Non-monotonic, e.g. y = x^2. Discontinuous (in gradient), e.g. y = max(0, x). We might not know the type of function in advance.
Slide 22 Parametric Modelling. [Figure panels: Linear (1st Order Polynomial) Fit; 3rd Order Polynomial Fit.]
Slide 23 Non-Parametric Modelling. [Figure panels: One Linear Segment; Two Linear Segments.] Linear segmentation is not the only non-parametric technique. The key feature is growth, hence no constraint on degrees of freedom.
Slide 24 Non-Parametric Modelling. [Figure panels: Three Linear Segments; Four Linear Segments.] No need for prior knowledge of the nature of the underlying function. The underlying function does not have to be smooth, monotonic etc.
Slide 25 Parametric Modelling Method. A formula is known or at least assumed: typically a polynomial (e.g. linear); may be any kind of formula; can be discontinuous. Model complexity is constrained: this tends to make the training process robust and data-thrifty. A model of complexity exactly as required by the problem should be slightly more accurate than a non-parametric model, which can only approximate this degree of complexity.
Specialist regression tools can be applied for different classes of function: linear (or linearizable), smooth, discontinuous...
Slide 26 Parametric Modelling Method: e.g. Multiple Linear Regression. Advantages: extremely fast both to train and use; if well-tailored to the problem, should give optimal results. Disadvantages: requires uncorrelated inputs; makes assumptions about data distributions.
Slide 27 Non-Parametric Modelling: Benefits. Advance knowledge of the problem is not required: domain-specific knowledge, though helpful, is not vital; no assumptions about population density or independence of inputs. Model complexity is unconstrained. Advantage: the model may capture unimagined subtleties. Disadvantages: training demands greater time, data volume and quality; the model may grow to become over-complex, e.g. fitting every data point. Additional possibilities: feasibility study (determine if any model is possible at all); knowledge discovery (analyze the model to determine an equivalent parametric model).
Slide 28 Non-Parametric Modelling: Issues. Might not be completely flexible; the learning algorithm may have limitations. We may need to partition the problem manually. The model might not generalize to the extent theoretically possible. Much greater need for training data. Can over-fit (resulting in errors): extra measures are needed to prevent this. Longer training time (may not be an issue).
Slide 29 UK Transmission. Introduction to Datamining: Nonlinear Nonparametric Models: Under, Optimal and Over Fitting. This section applies to many nonlinear nonparametric modelling methods, not just neural networks.
Slide 30 Example: Underlying (2-D) Function. A privileged view: we would not normally know what the function looked like. z = 1000 sin(0.125 x) cos(4 / (0.2 y + 1)).
Slide 31 Undertrained Model. ALN model with 24 segments, i.e. planes. Too angular (from privileged knowledge).
Slide 32 Optimally Trained Model. ALN model with 300 planes. Looks very similar to our defined function.
Slide 33 Overtrained Model. An ALN with 1500 planes joins the dots of the data instead of generalising.
Slide 34 Determining Optimality of Fit. The function is not known in advance: it might be smooth, it might be wrinkly; we don't know. What are our requirements on the model? What degree of accuracy is needed? Any constraints on shape or rates of change? How do we assess the model's quality? Test against a held-back set of data. Analyze the model's characteristics (this assumes we know what to require or expect), e.g. sensitivity to inputs (at various parts of the data space) or cross-sections (of each variable, for different set-points of the other variables).
Slide 35 Traditional Cross-Validation. Validate on data that is randomly or systematically selected from the same period as the training data. Train on the training data (grey) until error is least on the cross-validation data (blue). Actual use will be in the future (green), on data which is not yet available. [Figure label: Future data (unavailable).]
Slide 36 Back-Validation. Validate on data that, relative to the training data, is as old as the future is new. Train on the training data (grey) until error is least on the back-validation data (blue). Reason: like the future data, the back-validation data is an edge. This method has been shown by experiment to be superior to traditional cross-validation for both gas and electricity problems. [Figure labels: Back-val. data; Training (regression) data; Future data (unavailable).]
Slide 37 Optimal and Over Training. This is deliberate over-training. The optimum point is where the (purple) back-validation (Backval) error curve is at a minimum, namely Epoch 30. This agrees well with the minimum for the holdback (pseudo-future) data.
Slide 38 UK Transmission. Introduction to Datamining: Nonlinear Nonparametric Models: Example Algorithms.
Slide 39 Machine Learning / Natural Computing / Basis Function Techniques. Derive models more from data (examples) than from knowledge. Roots in nature and philosophy, e.g.
artificial intelligence & robotics, but converging with traditional maths & stats. Many types of algorithm: evolutionary / genetic algorithms; neural network (e.g. MLP-BP or RBF): popular; support vector machine: fashionable; Adaptive Logic Network: experience; regression tree; rule induction; instance (case) and cluster based.
Slide 40 UK Transmission. Introduction to Datamining: Nonlinear Nonparametric Models: Example Algorithms: Neural Networks (ANNs). Focussing on the Multi Layer Perceptron (MLP).
Slide 41 Neural Networks - Brief Overview (1). But how many neurons or layers? Repeatedly experiment (grow, prune).
Slide 42 Neural Networks - Brief Overview (2). Inspired by nature (and used to test it). Output is the sum of many (basis) functions, typically S-shaped. Each function is offset and scaled by a different amount. Very broadly analogous to Fourier etc. Given data, produce its underlying model.
Slide 43 Neural Networks - Brief Overview (3).
Slide 44 UK Transmission. Introduction to Datamining: Nonlinear Nonparametric Models: Example Algorithms: Adaptive Logic Networks (ALNs).
Slide 45 Main Advantages over ANNs: Theoretical. No need to define anything like a number of neurons or layers: ALNs automatically grow to the required extent; no need for an outer loop of experimentation (e.g. pruning). Basis functions are more independent, hence: easier and faster learning; greater accuracy; faster execution. Less black-box: can be understood. Function inversion: can run backwards.
Slide 46 Main Advantages over ANNs: Observed. Better accuracy: sharper detail. Better training: faster, more reliable and more controllable.
Slide 47 UK Transmission. Adaptive Logic Networks: How they Work: ALN Structure.
Slide 48 What is an ALN? A proprietary technique developed by William Armstrong, formerly of the University of Alberta, founder of Dendronic Decisions Limited in Canada (WWW.DENDRONIC.COM). A combined set of Linear Forms (LFs). An LF: y = offset + a_1 x_1 + a_2 x_2 + ...
An ALN initially has one LF, making it the same as normal linear regression. After optimizing its own fit, each LF can divide into independent LFs. ALNs are generated in a descriptive form that can be translated into various programming languages (e.g. VBA, C or Matlab).
Slide 49 Minimum (Min) & Maximum (Max) Operators in ALNs. y = Min(a,b,c): lines cut down. y = Max(a,b,c,d): lines cut up. [Diagram labels: Inputs; Linear Forms (regressions); Min / Max; Output.]
Slide 50 Min & Max Combined. LeftHump = Min(a,b,c). RightHump = Min(d,e,f,g). y = Max(LeftHump, RightHump). [Diagram labels: Inputs; Linear Forms; Min (LeftHump); Min (RightHump); Max; Output.]
Slide 51 ALNs are Trees of Linear Forms. More complex trees are possible. Can grow to any number of layers, any number of linear forms. During training, each leaf (linear form) can split into a min or max branch. Later in training, leaves can be recombined as necessary. Tree complexity can be limited by: tolerance (a sufficiently accurate leaf won't split any further; can be fixed or varying across the data space); direct constraint (e.g. max. depth = 5); indirectly, by stopping training at minimum validation error.
Slide 52 UK Transmission. Introduction to Datamining: Nonlinear Nonparametric Models: Example Algorithms: ALNs vs. MLPs: Simple Demo. Demonstration of ALN benefits through a trivial example.
Slide 53 Artificial Problem: with smooth regions and a sharp point.
Slide 54 Neural Net - 4 Hidden Neurons.
Slide 55 Handicapped ALN: Tolerance = 0.6, 4 Linear Forms.
Slide 56 Neural Net - Further Training.
Slide 57 Unhandicapped ALN. Offset is simply for clarity of presentation.
Slide 58 UK Transmission. Adaptive Logic Networks: How they Work: Further Details.
Slide 59 A Snapshot of Training. A data point is presented. It pulls the linear form it influences towards itself (in proportion to the learning factor). Side-effect: orange points no longer influence that LF, but will now pull up the other two LFs.
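The Min/Max structure of Slides 49 and 50 and the training snapshot of Slide 59 can be sketched in a few lines. This is an illustration only, not the Dendronic implementation: the tuple representation of a linear form, the function names, the example coefficients and, in particular, the normalised update rule (the active form moves a fraction `rate` of the way towards the data point) are assumptions.

```python
def lf_value(lf, x):
    """Evaluate a linear form, stored as (offset, weights), at input vector x."""
    offset, weights = lf
    return offset + sum(w * xi for w, xi in zip(weights, x))


def two_humps(x, left, right):
    """Slide 50 structure: y = Max(Min(left forms), Min(right forms))."""
    left_hump = min(lf_value(lf, x) for lf in left)
    right_hump = min(lf_value(lf, x) for lf in right)
    return max(left_hump, right_hump)


def train_step(lfs, x, target, rate):
    """One update of y = Max(lfs): only the form that sets the output at x is moved.

    Assumed rule: a normalised step, so the active form moves a fraction `rate`
    of the way towards the data point at x.
    """
    active = max(range(len(lfs)), key=lambda i: lf_value(lfs[i], x))
    offset, weights = lfs[active]
    error = target - lf_value(lfs[active], x)
    norm = 1.0 + sum(xi * xi for xi in x)
    lfs[active] = (
        offset + rate * error / norm,
        [w + rate * error * xi / norm for w, xi in zip(weights, x)],
    )
    return active


# Hypothetical one-input example: two triangular humps.
left = [(0.0, [1.0]), (4.0, [-1.0])]     # rises then falls, peak at x = 2
right = [(-6.0, [1.0]), (14.0, [-1.0])]  # rises then falls, peak at x = 10
y_left_peak = two_humps([2.0], left, right)
y_right_peak = two_humps([10.0], left, right)

# Hypothetical training snapshot for y = Max(LF1, LF2, LF3).
lfs = [(0.0, [1.0]), (2.0, [0.0]), (6.0, [-1.0])]
moved = train_step(lfs, [5.0], target=6.0, rate=0.1)
```

Because only the active linear form is adjusted, the other forms are left untouched by this data point, which is the independence of basis functions that Slide 45 lists as an advantage.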
[Figure labels: LF1, LF2, LF3; y = Max(LF1, LF2, LF3).]
Slide 60 ALN Learning: LF Splitting. If repeated adjustments of a given LF fail to reduce error below Tolerance, the LF splits into two and the process is repeated for each one independently. Due to random elements of training, they wander apart to cover different portions of the data space. [Figure axes: input axis; output axis.]
Slide 61 Recap: ALN Structure. During training ALNs can grow into complex trees. Branches are Max and Min operators. Leaves are Linear Forms. Trees can be of any depth. The one shown here is just a simple example. Transformation may be possible into a more efficient form where initial branches are if..then rules. [Diagram labels: Inputs; MAX; MIN; MAX; Output.]
Slide 62 ALNs can be Compiled into DTRs. [Figure: linear pieces 1 to 6 along input axis x, with intervals labelled Min(5,6), Min(4,5), Min(2,3,4) and Min(1,2).] Example: for x in this interval only pieces 4 and 5 play a role.
Slide 63 Bagging - Averaging Several ALNs. A very simple way to improve accuracy. Applicable to any set of diverse models having the same goal, for example standard MLP neural nets. For ALNs, diversity arises through the random number generator affecting the training process, e.g. the order in which data are presented. BestMean is a proven refinement, e.g. reject results outside 2 * stdev, then compute the new mean.
Slide 64 UK Transmission. Model Development. How datamining methods were brought to bear on our gas demand forecasting problem.
Slide 65 Stages of Model Production. Framing the problem. Data preparation: data cleaning; derived variables, partitioning. Input selection. ALN training. Implementation in code: conversion of the ALN to a convenient programming language. Quality assessment: user-testing in the target environment.
Slide 66 UK Transmission. Model Development: Framing the Problem.
Slide 67 How should we frame the problem? We are in a vacuum here, so we need to guess or preferably experiment. Hourly or daily? The main requirement is for daily total demand. Summing hourly demands tends to give greater accuracy. Absolute or relative?
But d(Demand)/d(Temperature) varies with Temperature. One big model for all LDZs, all year round? Separate models for each LDZ? Split the year into parts, or just flag or normalize each part? What parts? GMT/BST? Seasons? Christmas? Easter? Try clustering, and make a model for each cluster. Also try experiments based on intuition & guesswork.
Slide 68 Traditional framing of the problem. Daily totals. Linear relationships. Only model standard days: employ normalization (adjustment factors) for special days such as bank holidays. Compute the change in demand.
Slide 69 New framing of the problem. Based on experience & intuition. Hourly totals (daily = sum of hourlies). Nonlinear relationships. Model all days: no need for normalization (adjustment factors). Absolute demand.
Slide 70 Experience: Clustering of Electricity Profiles. Coloured areas are clusters, each with a distinctive daily demand profile. Red text is our interpretation. Kohonen SOM, as implemented in Eudaptics Viscovery SOM-Mine.
Slide 71 Clustering of Gas Profiles. Not such a detailed picture as for electricity. Yellow-ish areas indicate similar profiles; red-ish areas indicate more varying profiles. [Map labels: Jan & Dec; Jan; Feb; Mar & Nov; Apr; May & Oct; June; July; Aug & Sept.]
Slide 72 Find the Best Structure for the Model. By experiment. Experiments (on one typical LDZ): one model for the whole year; separate models for each of four clusters; separate models for the GMT, BST and Xmas & New Year periods; separate models for GMT and BST, experimenting with various types of indicator for the Xmas-NY period (straight flags & fuzzy flags). This last option produced the best results.
Slide 73 Final Structure for the Model. Produce separate models for each season of each LDZ. Two seasons: GMT & BST. The Easter and Xmas-NY periods are indicated by separate fuzzy flags. 13 LDZs. Each model will contain a Bag of 10 ALNs. The Bag returns the BestMean of the 10 ALNs. BestMean rejects results outside 2 * stdev. Thus 260 ALNs need to be produced (13 LDZs x 2 seasons x 10 ALNs).
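The BestMean rule used by each Bag (Slides 63 and 73) might look like the following minimal sketch. The function name `best_mean`, the use of the population standard deviation, and the ten forecast values are assumptions for illustration; the slides say only that results outside 2 * stdev are rejected before the mean is recomputed.

```python
from statistics import mean, pstdev


def best_mean(predictions, k=2.0):
    """Mean of the predictions after rejecting those more than k * stdev from the raw mean.

    Assumption: the population standard deviation is used; the slides do not say which.
    """
    centre = mean(predictions)
    spread = pstdev(predictions)
    kept = [p for p in predictions if abs(p - centre) <= k * spread]
    return mean(kept) if kept else centre


# Hypothetical outputs of a Bag of 10 ALNs for one hour (one ALN has gone astray).
bag_outputs = [100.0, 101.0, 99.0, 100.0, 102.0, 98.0, 100.0, 101.0, 99.0, 150.0]
forecast = best_mean(bag_outputs)

# Number of ALNs implied by the final model structure.
total_alns = 13 * 2 * 10  # LDZs x seasons x ALNs per Bag
```

In this example the plain mean would be 105, whereas rejecting the single stray output brings the forecast back to the value the other nine ALNs agree on.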
Slide 74 UK Transmission. Model Development: Data Preparation: Data Cleaning.
Slide 75 Data Cleaning. Data problems: some actual demands are unrealistic; atypical demands are not useful for training. Detection method: Viscovery, a commercial Kohonen / SOM tool, was used to highlight unusual profiles; ranges and profiles were also manually checked & plotted in Excel.
Slide 76 Greater Requirement for Data Quality. Our models may be more demanding than traditional ones in terms of data quality. Since our models are non-parametric, they may be more susceptible to glitches in the data (they may try to model them). It is possible that the available data will not meet our quality requirements. The existing data is clean in respect of daily totals, but hourly figures are traditionally less important.
Slide 77 Bad Profile Detection. Once again, making use of Eudaptics Viscovery SOM-Mine. Arguably the best possible two-dimensional representation of an n-dimensional problem. The aspect ratio is based on the first two principal components. It shows the main shape of the problem. Outlier profiles (possible errors) show up as red blemishes. Yellow-ish areas are groups of similar profiles. Red-ish areas indicate abnormalities.
Slide 78 Bad Profile - Positive Glitch.
Slide 79 Bad Profile - Negative Glitch.
Slide 80 Bad Profile - Wobble.
Slide 81 Bad Profile - Clock-change Artefact.
Slide 82 UK Transmission. Model Development: Data Preparation: Model Inputs.
Slide 83 Data Preparation: Derive additional variables as possible inputs. Think up as many candidate inputs as possible. Anthropomorphize: think like an ALN. Sine and cosine of day and of year: represent and maintain the cyclic nature of the diurnal and annual cycles. The annual gas cycle is approximately a sine wave (obvious knowledge).
Moving average of temperature. Cooling power (wind chill). Days since 1 April 1990 (basis for spotting long-term trends). Fuzzy flags (special periods): these merely highlight the incidence of special days; they do not indicate demand effects.
Slide 84 Input Selection (1). Around 60 potential inputs, implying 2^60 possible choices: too many for exhaustive search. Systematic search may be infeasible: the search space may be rough; inputs may interact, especially in an unknown nonlinear model. In previous projects, standard methods such as correlation-based input selection, or adding or pruning inputs one at a time, have failed to find the optimum selection. The chosen selection method: genetic algorithm, a proven jack-of-all-trades discrete optimization method, with a fitness function based on training and testing disposable ALNs.
Slide 85 Input Selection (2). No simple consistent method (too many interactions and nonlinearities), so use a genetic algorithm. Unsurprisingly, the inputs having the greatest correlation to the output were chosen by the GA. However, below a certain threshold of correlation, the correspondence is weaker: the GA chose some inputs having tiny correlation instead of other inputs of greater correlation. Only 32 choices in this example. The small black stumps indicate inputs chosen by the GA.
Slide 86 Input Selection (3): Genetic Algorithm (GA). Inspired by Darwin's theory of evolution. Our GA: around 100 generations of 50 individuals, initially random. An individual is a specific choice of inputs. Reproduction: crossover (mating) makes a new individual by combining randomly selected features from some of the fittest existing individuals; mutation (small random changes) inverts one or more decisions as to which inputs to use. Survival of the fittest: the fitness of an individual is assessed by training an ALN with the given input selection, then testing it on separate test data. In practice we train a few ALNs and average their results.
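The loop described on Slide 86 can be sketched as a small genetic algorithm over on/off input masks. Everything specific here is an assumption made so that the example is self-contained: the function name `evolve`, uniform crossover, the number of survivors, the mutation rate, and above all the toy fitness function, which stands in for "train a few disposable ALNs and test them" (not reproduced here).

```python
import random


def evolve(n_inputs, fitness, generations=100, population=50,
           survivors=10, mutation_rate=0.02, seed=0):
    """Search for the on/off input mask with the highest fitness."""
    rng = random.Random(seed)
    # Initially random individuals: each is one specific choice of inputs.
    pop = [[rng.random() < 0.5 for _ in range(n_inputs)] for _ in range(population)]
    for _ in range(generations):
        # Survival of the fittest: keep the best individuals unchanged.
        pop.sort(key=fitness, reverse=True)
        elite = pop[:survivors]
        children = []
        while len(elite) + len(children) < population:
            mum, dad = rng.sample(elite, 2)
            # Crossover: take each decision at random from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            # Mutation: occasionally invert a decision about using an input.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)


# Toy stand-in for the real fitness: reward three "useful" inputs,
# with a small penalty for every input selected.
N_INPUTS = 8
USEFUL = {0, 3, 5}


def toy_fitness(mask):
    return sum(1 for i in USEFUL if mask[i]) - 0.1 * sum(mask)


best = evolve(N_INPUTS, toy_fitness)
chosen_inputs = [i for i, used in enumerate(best) if used]
```

With a real fitness function each evaluation involves model training, so the cost of a run is dominated by the number of individuals evaluated (generations times population) rather than by the GA bookkeeping itself.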
Slide 87 Input Selection (4a): Genetic Algorithm: The Principles: Survival of the Fittest. Survivors plus their offspring (produced by crossover & mutation).
Slide 88 Input Selection (4b): Genetic Algorithm: The Principles: Crossover.
Slide 89 Input Selection (4c): Genetic Algorithm: The Principles: Mutation.
Slide 90 Input Selection (4d): Genetic Algorithm: The Principles: Overall Loop.
Slide 91 UK Transmission. Model Development: Model Production.
Slide 92 ALN Training. Tool: AlnFit-NGT. Source code adapted from Dendronic Decisions Limited. Underlying Dendronic Learning Engine (a standard DLL). Method: back-validation. Oldest year of data used for validation. Most recent years of data used for training. Train to the point (epoch) of least error on validation data.
Slide 93 Implementation in Code. Automatically translate the descriptive form to VBA. Ultimately implement as a set of ActiveX DLLs. Topmost: a Wrapper DLL, which provides a standard interface to the user program, generates derived inputs, and decides which model to run (based on LDZ & time of year). ALN DLLs (one for GMT, one for BST): contain LDZ-specific models as classes. Type Definitions DLL.
Slide 94 UK Transmission. Scope for Improvement.
Slide 95 Remaining Technical Issues - 1. Knowledge refinement: find the best way to use recent demand or demand error. Improved weather inputs: wind direction; more than one weather station in the same LDZ. Refinement of our methods and tools: automatic data error detection; genetic algorithm (make it more robust and efficient, e.g. distributed); ALN training improvements.
Slide 96 Remaining Technical Issues - 2. Metrics: needed for model optimization and quality assessment. Different metrics targeted at model developer and user? Kinds of metrics: traditional (MAPE and max. abs. error); proposed (median abs. error and average of top-10% abs. errors). For comparability, normalize by st. dev.? Data sampling and input selection: is there a better way? WAID?
Slide 97 Future Development. Refinements: Within-Day Fixer (part-developed).
Arbitrary-Horizon Fixer. Kalman Filter (on-line adaptation). Future problems: national gas demand; wind power. Wish: hands-off Model Development Server.
Slide 98 UK Transmission. Conclusions.
Slide 99 Regarding NGT: NGT have made effective use of datamining methods for electricity and gas demand forecasting. Quick & dirty feasibility models. Longer-development, high-accuracy production models. When run in combination with existing models, the overall accuracy is improved, with financial benefits. More general lessons: ALNs are great! For such problems, back-validation is better than cross-validation.
Slide 100 UK Transmission. - End - Any Questions?
Slide 101 Datamining-Based Gas Demand Forecasting Models. Phase-I models: in service since July 2004. Phase-II models: GMT models in service since January 2005; BST models currently under development (for March 05). Phase II enhancements: more intensive Genetic Algorithm (GA) runs (greater number of generations; greater mutation probability; greater choice of inputs); individual GA runs for each LDZ (hence potentially different input variables). Methodology verified by experiment.
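The back-validation procedure recommended in the conclusions (and described on Slides 36 and 92) can be sketched as a data split plus a stopping rule. The record layout of (year, value) pairs, the function names and the error values below are hypothetical; the sketch shows only the idea of validating on the oldest year and stopping at the epoch of least validation error.

```python
def back_validation_split(records, validation_years=1):
    """Split (year, value) records: oldest year(s) for validation, the rest for training."""
    years = sorted({year for year, _ in records})
    val_years = set(years[:validation_years])
    validation = [r for r in records if r[0] in val_years]
    training = [r for r in records if r[0] not in val_years]
    return training, validation


def best_epoch(validation_errors):
    """1-based epoch at which the validation error is smallest."""
    index = min(range(len(validation_errors)), key=validation_errors.__getitem__)
    return index + 1


# Hypothetical records and a hypothetical validation-error curve.
records = [(2000, 1.0), (2001, 2.0), (2002, 3.0), (2000, 4.0)]
training, validation = back_validation_split(records)
stop_at = best_epoch([5.0, 3.0, 2.0, 2.5, 4.0])
```

A traditional cross-validation split would instead draw the validation records from the same years as the training records, which is the contrast the presentation draws.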