time series analysis · Time Series vs Sequence Data •Temporal data may be discrete or...

DataWarehousingandDataMining:TimeSeriesAnalytics

Berlin,February5,2018

PatrickSchäfer

Vorlesung: https://hu.berlin/vl_dwhdm17Übung: https://hu.berlin/ue_dwhdm17

Motivation• Temporaldataiscommoninmanydataminingapplicationsanddatawarehouses• ThereisatimedimensioninallM/ERmodels

• Applicationdomainsrangefrom:• Sensordata:environmentalsensorsmeasuretemperature,pressurehumidity

• Medicaldevices:electrocardiogram(ECG)andelectroencephalogram(EEG)

• Financialmarket:stockprices,economicindicators,productsales

• Meteorologicaldata:sedimentsfromdrillholes,earthobservationsatellitedata

Motivation

• Dataanalyticsbasedonsimilarity isthetoolforexploringthesedatasets• Examplesofsimilarityqueriesinclude• Findallstocksthatshowsimilartrends• Findthemostunusualheartbeatinapatient’sECGrecording• Findfrequentpatternsinabirdsoundrecording• FindthepatientswiththemostsimilarECGrecording• Findproductswithasimilarsalespattern

TimeSeriesDefinition

• Definition:ATimeSeriesisasequence(orderedcollection)ofnrealvaluesattimestamps !", … , !% :

& = (", … , (%

• TimeSeriesmaybeunivariateormultivariate• Univariate:asinglevalue)* isassociatedwitheachtimestamp+*.• Multivariate:, values)* = (./, ….0) areassociatedwitheachtimestamp+*.

• Thedimensionalityofatimeseriesreferstothenumberofvaluesateachtimestamp

TimeSeriesvsSequenceData

• Temporaldatamaybediscreteorreal-valued(continuous)• Timeseriescontainrealvaluesandsequence data(i.e.webclickstreams,DNAsequences,documents)referstodiscretedata• Differentalgorithmsapplicableforsequencedata:Hashing,Tries,NaïveBayes,Boyer-Moore,vectorspacemodel,tf-idf,wordembeddings…• Insomecasesatimesseriescanbeconvertedtoasequencebyuseofdiscretization,therebymakeuseofsequenceminingalgorithms

Pre-processing:MissingData

• Itisconvenienttohavetimeseriesthatareequallyspacedandsynchronizedacrosstimestamps• However,itiscommonfortimeseriestocontainmissingdata• Onemethodistoapplylinear,i.e.estimatethe(missing)valuesatdesiredtimestamps• Linearinterpolation:)23/ and)2 arevaluesofthetimeseriesattimes+*3/ and+*

)456 = )23/ ++456 − +*3/ 9 )2 − )23/

+* − +*3/+*3/ +*+456

missingdata

Equallyspaced

Pre-processing:NoiseRemoval

• Removesshort-termfluctuationandnoise• Binning:• Dividethedataintodisjoint intervalsofsize..ThencalculatethemeanvaluesT = );/ …);6 , with< =4

>,ineachinterval:);* =

∑ @ABCADB CEF GF

>• Thisreducesthenumberofpointsbyafactorof.

• Smoothing(Moving-Averages)• Dividethedataintooverlapping intervalsofsizekoverwhichtheaveragesarecalculated• Thus,theaverageiscomputedateachtimestamp+/, +> , +H, +>I/ , … ratherthanonlyattheintervalboundaries +/, +> , +>I/, +H> , … 7

Pre-processing:Normalization

• Timeseriesneedtobenormalized,especiallywhendifferentsensorsareusedtorecordthedata• Normalizationputsthedataonthesamescaletomakecomparisonsmeaningful(i.e.FahrenheitundCelsius)• Twomethodsarecommonlyused:• Z-normalization (isgenerallypreferred)• Range-basednormalization

Pre-processing:Z-Normalization

• Z-normalization:LetJ andK bethemeanandstandarddeviationofthetimeseries& = (",… , (% ,then

LMNO, P = )Q/, … , )Q4

with)′* =@C3S

Pre-processing:Range-basednormalization

• Range-basednormalization:Theminimumandmaximumvaluesoveralltimeseriesaredetermined,theneachvalueofthetimeseries& = (",… , (% ismappedtorange(0,1)by:

OUMVW_MNO, P = )Q/, … , )Q4

with)′* =@C30*4

YZ[ 30*4

TimeSeriesSimilarity

• Atthecoreoftimeseriesdataanalyticsthereare:• atimeseriesrepresentation.• asimilaritymeasuretocomparetwotimeseries.

SimilarityMeasures(DistanceMeasures)• Thesimilarityoftwotimeseries\ andP isexpressedintermsofarealvalueusingadistancemeasure:] \, P → ℝ`

• Asimilaritymeasureistheinverseofthedistancemeasure:itqualifiessimilar(/dissimilar)timeseriesbyasmall(/large)value• Timeseriessimilaritymeasuresaredesignedwithapplicationspecificgoals• MostcommonmethodsareEuclideandistance(ED)andDynamicTimeWarpingDistance(DTW)• OthercommondistancemeasuresareLongestCommonSubsequenceorEditDistance

EuclideanDistance(ED)

• Definition:TheEuclideandistancebetweentwotimeseries\ = (a/, … , a4) andC=(b/, … , b4),bothoflengthM,isdefinedas:

]cd \, e = f a* − b* H

• TheEDappliesalinearalignmentofthetimeaxis• EDcannotcopewithvariablelengthtimeseries• EDruntimeisO(n)

DynamicTimeWarping(DTW)

• DynamicTimeWarpingappliesanelastictransformationofthetimeaxistodetectsimilarshapesthathaveadifferentphase• Thisisessentiallyapeak-to-peakandvalley-to-valleyalignmentoftwotimeseries• Intuition:DTWcanbethoughtofasanextensionoftheED,whichusestwoindiceshandi representingbothtimeaxis• Theseindicesareincrementedindependently:

]djk \, e = f a* − b2H

DynamicTimeWarping(CostMatrix)

• DTWstartsbycomputingacostmatrixMwiththedistancesbetweenallpairsofvaluesin\ = (a/ …a4) ande = (b/ … b4)

• ThismatrixhasthedimensionalityMH andisgivenby:l*,2 = a* − b2

H, h, i ∈ 1…M

• DTWthensearchesfortheoptimalwarpingpath

DynamicTimeWarping(WarpingPath)

• Definition:AwarpingpathpisdefinedasasetoftuplesthatdefinesatraversalofthecostmatrixMwhereash andirepresentindicesofthevaluesin\ = (a/ …a4) ande =(b/ … b4):

o/ = ", " , 2,2 , …, M − 1, M − 1 , %, %oH = ", " , 1,2 , 1⏟

, 3⏟t

, 2,3 , …, M − 1, M , %, %

• Avalidwarpingpathhastosatisfytwoconditions:• Thestarthastobe(1,1)andtheendhastobe(n,n)• Thepathmayproceedbyamaximumofoneindex:0 ≤ hwI/ − hw ≤ 1 and0 ≤ iwI/ − iw ≤ 1 forallU < M

• EDisthespecialcaseofthediagonalinthecostmatrix

DynamicTimeWarping(Distance)

• Definition:TheDTWdistanceisdefinedasthewarpingpathwiththeminimaltotalcostthroughthecostmatrixl]djk \, e = ,hM f a* − b2

(*,2)∈y

|o ∈ l

• AsDTWisapeak-to-peakandvalley-to-valleyalignmentoftwotimeseries,itmayproducesuboptimalresultsifthereisavariablenumberofpeaksandvalleys• DTWruntimeis{(MH)

DynamicTimeWarping(Algorithm)int DTWDistance(

Time Series q[1..n], TimeSeries c[1..n]) { // cost matrixM:= array [0..n, 0..n] for i := 1 to n

M[i, 0] := infinity for i := 1 to n

M[0, i] := infinity M[0, 0] := 0

//find optimal pathfor i := 1 to n

for j := 1 to n cost := D(q[i], c[j]) // measure distance M[i, j] := cost + minimum(

M[i-1, j ], // insertion M[i , j-1], // deletion M[i-1, j-1]) // match

return M[n, n] }

DynamicTimeWarping(WarpingWindow)

• Awarpingwindowconstraintcanbesettoreducethesearchspace(computationalcomplexity)• Itdefinestheamountofwarpingallowedbetweeneachpairofpointsonthewarpingpatho:

h − i ≤ O 9 M∀ h, i }o

• DTWwithawarpingwindowconstraintOhasacomputationalcomplexityof{(MO)

Similarity:Wholevs.SubsequenceMatching

• Wholematching:thedistanceiscalculatedbetweentwowholetimeseries• Subsequencematching:Searchesforthebestsubsequencewithinalongertimeseries• ComplexityfortimeserieslengthnandsubsequencelengthwwithEuclideandistance:• Wholematching:{(M)• Subsequencematching:{ < M − <

long time series C

slide along

sliding window S

Query Q

Similarity D(Q,S)

most similar subsequence

Query Q

Query QQuery Q

Similarity D(Q,S)

dataset DStime series TWhole Matching

Subsequence Matching

TimeSeriesDataTransformations

• Avarietyofmethodsexistforreducingthedimensionalityoftimeseries(i.e.fornoisefiltering,fasterprocessing),• Real-valuedmethodstransformintoasmallernumberofnumericvaluesandsymbolicmethodstransformintodiscretevalues(sequences)

PiecewiseAggregateApproximation(PAA)

• Intuition:torepresentatimeseriesoflengthnthedataisdividedintowequal-sizedintervalsandthemeanvalueineachintervaliscalculated• PAA:atimeseriesP = ()/, … , )4) oflengthnisrepresentedbyaw-dimensionalsequenceofmeanvaluesC = ()/, … , )6),wherei-th elementiscalculatedas:

)*� =<

2Ä46 *3/ I/

PAA:0,08-0,08-0,240,180,450,340,00-0,190,050,900,03-0,32-0,70-0,210,30-1,72

PiecewiseAggregateApproximation(PAA)

DiscreteFourierTransform(DFT)

• Intuition:eachseriesoflengthncanbeexpressedbyalinearcombinationofsmoothperiodicsinusoidalseries• Eachwaveisrepresentedbyacomplexnumber(FourierCoefficient)• ThisrepresentationiscalledFrequencyDomain• TheDFTconcentratesmostofitsenergyinthefirstfewFouriercoefficients• Low-passfilter:OnecanapproximateatimeseriesbyitsfirstwFouriercoefficients

DFT:0-8.81 -20.7 -11.9 -6.28 -8.02 -0.67 15.31 -18.7-18.36 -5.67 16.84 -8.919 -23.8010.70 -21.92 25.255-1.321...

• TheDFTdecomposesatimeseriesToflengthnintoasumofnorthogonalbasisfunctionsusingsinusoidwaves• AFouriercoefficient(sinusoidwave)isrepresentedbythecomplexnumber: ÅÇ = OWUÉÇ, h,UVÇ , ÑNOÖ = 0,1… , M − 1

• Then-pointDFTofatimeseriesP = ()/ …)4) isthengivenby:]ÜP P = Å`,… , Å43/ = (OWUÉ`, h,UV`, … , OWUÉ43/, h,UV43/)

• withÅÇ =/

4∑ )* 9 W

3AáàâCä4

*Ä/ , ÑNOÖ} 0, M , i = −1�

• ThefirstFouriercoefficientisequaltothemeanofatimeseriesandcanbediscardedtoobtainoffsetinvariance:Å` =

4∑ )* 9 W

`4*Ä/

• UsingonlythefirstfewFouriercoefficientsisequaltwolow-passfiltering(smoothening)atimeseries• ComputationalComplexity:TheFastFourierTransformhasacomputationalcomplexityof{(MlogM)tocomputetheDFTofatimeseriesoflengthn

SymbolicAggregateApproximation(SAX)

• SAXconvertsthetimeseriesintoadiscretesequenceofsymbolsby• X-axis:AppliesdimensionalityreductionusingthePAAtransformation

• Y-axis:TransformseachPAAvalueusingdiscretizationtoasymbol

• Valuessampledfromaz-normalizedtimeseriesfollowGaussiandistribution• SAXappliesdiscretizationbasedonGaussiandistributiontoproduceequi-probablesymbols• Arealvaluedtimeseriesisthenrepresentedbyaword

116 J. Lin et al.

0 20 40 60 80 100 120

Fig. 5 A time series is discretized by first obtaining a PAA approximation and then using prede-termined breakpoints to map the PAA coefficients into SAX symbols. In the example above, withn = 128, w = 8 and a = 3, the time series is mapped to the word baabccbc

the time series to a binary vector. They demonstrated that discretizing the timeseries before clustering significantly improves the accuracy in the presence ofoutliers. We note that “clipping” is actually a special case of SAX, where a = 2.

3.3 Distance measures

Having introduced the new representation of time series, we can now definea distance measure on it. By far the most common distance measure for timeseries is the Euclidean distance (Keogh and Kasetty 2002; Reinert et al. 2000).Given two time series Q and C of the same length n, Eq. 3 defines their Euclid-ean distance, and Fig. 6A illustrates a visual intuition of the measure.

D (Q, C) ≡

!""#n$

(qi − ci)2 (3)

If we transform the original subsequences into PAA representations, Q andC, using Eq. 1, we can then obtain a lower bounding approximation of theEuclidean distance between the original subsequences by

DR(Q, C) ≡%

i=1(qi − ci)

This measure is illustrated in Fig. 6B. If we further transform the data intothe symbolic representation, we can define a MINDIST function that returnsthe minimum distance between the original time series of two words:

MINDIST(Q, C) ≡%

'dist(qi , ci)

(2 (5)

The function resembles Eq. 4 except for the fact that the distance betweenthe two PAA coefficients has been replaced with the sub-function dist(). Thedist() function can be implemented using a table lookup as illustrated in Table 4.

from:Linetal.:ExperiencingSAX:anovelsymbolicrepresentationoftimeseries

SymbolicAggregateApproximation(SAX)

SymbolicFourierApproximation(SFA)

• SFArepresentseachrealvaluedtimeseriesbyaword• SFAiscomposedof

a) approximation usingtheFouriertransformand

b) a dataadaptivediscretization• ThediscretizationintervalsarelearnedfromtheFouriertransformeddatadistributionratherthanusingfixedintervals

DFT0-8.81-20.7-11.9-6.28-8.02-0.6715.31-18.7-18.36-5.67-16.84-8.919[...]

DiscretizationCBBCCDCBBCBCB[...]

Raw:0.26790.24800.18280.08170.0051-0.023-0.052-0.082-0.111-0.075-0.032-0.022-0.029[...]

SymbolicFourierApproximation(SFA)

ComparisonofSFAandSAX

• Discretizationandapproximationcauselossofinformation• Thehigherthenumberofsymbolsandthealphabetsize,themoreexactistherepresentation

Propertiesof Symbolic Representations

• Noiseremoval• SAX:UsingPAAandquantization.• SFA:UsingtheDFT(low-passfilter)andquantization.

• Stringrepresentation• Allowsforstringdomainalgorithmslikehashingorthebag-of-wordstobeapplied

• DimensionalityReduction• AllowforindexinghighdimensionaldatausingtheiSAX index(SAX)ortheSFAtrie(SFA)

• Storagereduction• Sequenceshaveamuchlowermemoryfootprintthanreal-valuedtimeseries,i.e.lessthan1byte(“Char”)vs8byte(“Double”)foreachtimestamp

TimeSeriesDataAnalyticsTasks

Euclidean Distance

Cylinder

Funnel

caffeinchlorogenic acid

Motif Discovery Classification

Clustering

Discords

abnormal

Motifs

• AMotifisfrequentlyoccurringpatternorshapeinatimeseries• Distance-based(“exact”Motifs)

• Asubsequenceofatimeseriesissaidtosupportamotif,ifthedistancebetweenthesubsequenceandthemotifislessthanathreshold

• Sequentialpatternmining(“approximate”Motifs)• Discretizationisappliedtoconvertthetimeseriesintoasequence.Motifsdiscoverylendsmethodsfromsequentialpatternmining From:JonasSpenger’s BachelorThesis

Results - Quality

● 85 planted motif time series● 8 planted sequences (red) per

time series● Task: Recover the planted

motifs (red regions)

Sequences (red) from: Chen, Yanping, et al. "The ucr time series classification archive."

Distance-basedMotifs

• Thesearecommonlydefinedascontiguoussequencesofatimeseries• Approximatedistancematch• Amotif é/, … , é6 issaidtoapproximatelymatchacontiguoussubsequenceoftimeseriesT = )/ …)4 atpositionh with< ≤ M ,ifthedistancebetween é/ … é6 and )* …)*I63/ isatmost}.• ThemostcommondistancemeasureusedistheEuclideandistance

• Motifcount:Thenumberofmatcheswithinthreshold} ofamotiftothetimeseriesisdefinedasthenumberofmatches• Atypicalgoalistofindthemostfrequentmotifs

NaïveAlgorithmMotif findBestMotif(

Time Series (y1,…,yn), WindowLength w, Threshold epsilon)

for i := 1 to n-w+1 do

candidate_motif = (yi,…,yi+w-1);

for j:=1 to n-w+1 do

D = computeDistance(candidate_motif, (yj,…,yj+w-1));

If (D < epsilon) and (non-trivial match) then

increment count of candidate_motif by 1

end if

end for

if (candidate_motif has highest count so far) then

update best_candidate to candidate_motif;

end if

return best_candidate;

• Thenaïvealgorithmextractsallcandidatemotifsoflengthwfromatimeseriesandcomputesthedistancetoalloffsets

• Thenumberofmatchesiscounted• Trivialmatchesareoverlappingsub-sequences

(i.e.i =j)• Theapproachrequiresanested-loopandthe

numberofoperationsisequaltothesizeofthetimeseries,thus{ MH distancecomputations

• TotalcomplexityforEuclideandistance{ MH<

Ideastospeed-up Motifdiscovery

• Computeafastlowerboundforthedistance• Ifthelowerboundisgreaterthanepsilonthentherealdistancedoesnothavetobecomputed

• Approaches:• UsePAAandcomputedistanceonlyformeanvalues

]hé+ Å, è ≥ ,� 9 ]hé+(ÅQ, èQ)• UseSAXandaprecomputedlookuptablefordistancesbetweensymbols

RelationtoSequentialPatternMining

• First,convertthetimeseriesintoasequencesusingforexampleSAX• NowapplyStringminingalgorithms• ItemsetMining:discoverfrequentitemsets usingassociationruleminingalgorithms(ARM).TheA-Priorialgorithmwillbepresentedinthelecture

Questions?

time series analysis · Time Series vs Sequence Data •Temporal data may be discrete or...

Documents

Transcript of time series analysis · Time Series vs Sequence Data •Temporal data may be discrete or...

RSA6100A Series Real-Time Spectrum Analyzers, RSA5100A ... · Series Real-Time Signal Analyzers to take measurements in different application areas. To work through these examples

Indexing Time Series. Time Series Databases A time series is a sequence of real numbers, representing the measurements of a real variable at equal time.

TDS 200-Series Digital Real-Time Oscilloscope User …mmrc.caltech.edu/Oscilliscope/TDS200 User.pdf · User Manual TDS 200-Series Digital Real-Time Oscilloscope 071-0398-03 This document

Real-Time Remote Spectrum Analyzer - V5 RSA Series · 2018. 1. 23. · Real-Time Remote Spectrum Analyzer - V5 RSA Series Real-time RF/EMF Data-Logger with integrated PC, 1Hz to 20GHz

Real-time anomaly detection system for time series at scaleproceedings.mlr.press/v71/toledano18a/toledano18a.pdfReal-time anomaly detection system for time series at scale Meir Toledano,

REAL-TIME COMPUTER CONTROL OF A HIGH-PERFORMANCE, SERIES ELASTIC

Time series forecasting used for real-time anomaly ... · Time series forecasting used for real-time anomaly detection on websites author: Georgios Galvas A thesis presented for the

Asteroid Observations - Real Time Operational Intelligence Series

TDS 200-Series Digital Real-Time Oscilloscope User Manual

The Time Series Performance Of UK Real Estate Indices

A Time-Series Analysis of the Real Wages-Employment Relationship

AIE Adaptive Image Enhancer Series Real Time Video Processor ICs

RSA6100B Series Real-Time Signal Analyzers RSA5100A Series ... · Series Real-Time Signal Analyzers to take measurements in different application areas. To work through these examples

TDS 200-Series Digital Real-Time Oscilloscope Service Manual 200_Digital... · Service Manual TDS 200-Series Digital Real-Time Oscilloscope 071-0492-02 This document supports firmware

Real-Time Data Security Webinar | SQLstream | July 2013 Series

TDS 200-Series Digital Real-Time Oscilloscope Service … 200... · Service Manual TDS 200-Series Digital Real-Time Oscilloscope 071-0492-02 This document supports firmware version

Programmer Manual TDS 200-Series Digital Real-Time ...mmrc.caltech.edu/Oscilliscope/OSC_TDS200_PROG.pdf · Programmer Manual TDS 200-Series Digital Real-Time Oscilloscope 071-0493-01

RSA5100A Series Real-Time Signal Analyzer Service Manual

Hard Real-Time Computing Systems - rd.springer.com978-1-4614-0676-1/1.pdf · Hard Real-Time Computing Systems. Real-Time Systems Series Series Editor John A. Stankovic University

AccuTOF LC series DART (Direct Analysis in Real Time ... · PDF fileAccuTOF LC series DART (Direct Analysis in Real Time) Applications Notebook Edition April 2016 s Atmospheric pressure