
ICMSCSME 2015    ISBN 978-602-72198-2-3

PROCEEDING

International Conference on Mathematics, Statistics, Computer Sciences, and Mathematics Education

"Exploring Mathematics and its Application in the Future"

2-3 October 2015
Hasanuddin University
Makassar, Indonesia


Proceeding of International Conference on Mathematics, Statistics, Computer Sciences, and Mathematics Education (ICMSCSME) 2015

ISBN 978-602-72198-2-3

Editors

Dr. Nurdin, S.Si., M.Si.

Sri Astuti Thamrin, S.Si., M.Stat., Ph.D.

Edy Saputra, S.Si.

Reviewer

Prof. Dr. S. Arumugam

Dr. Kiki A. Sugeng

Dr. Loeky Haryanto, M.S., M.Sc., M.A.T.

Utami Dyah Syafitri, Ph.D.

Sri Astuti Thamrin, S.Si., M.Stat., Ph.D.

___________________________________________________________

Publisher: Fakultas MIPA UNHAS

Address: Jl. Perintis Kemerdekaan KM 10 Tamalanrea 90245 Makassar

___________________________________________________________


FOREWORD FROM CHAIRPERSON OF ICMSCSME 2015

Assalamu ‘alaikum warahmatullahi wabarakatuh

And sincere greetings to all.

It is my great pleasure to welcome all our invited speakers and participants to the International Conference on Mathematics, Statistics, Computer Sciences, and Mathematics Education 2015 (ICMSCSME 2015), jointly organized by the Mathematics Department, Faculty of Mathematics and Natural Sciences, Hasanuddin University, and the Indonesian Mathematical Society (IndoMS) Sulawesi Region. The conference is attended by around 200 participants from Nepal, the Philippines, India, Slovakia, Malaysia, and Indonesia.

It is hoped that ICMSCSME 2015 will catalyze and increase academic and research collaborations between the institutions involved, both internationally and locally. I sincerely hope that this will spur further advancement of scientific research and fruitful collaborations between organizations.

I would like to congratulate all the speakers and participants for their participation in ICMSCSME 2015. On behalf of the conference organizing committee, I would like to take this opportunity to thank all who have contributed, either directly or indirectly, to the success of the event for their generous contributions. Finally, to all of the ICMSCSME 2015 committee: thumbs up for a job well done. May Allah's blessing be upon you, Aamin.

Thank you,

Wassalam,

Dr. Nurdin

Chair of ICMSCSME 2015


FOREWORD BY DEAN OF MATHEMATICS AND NATURAL

SCIENCES FACULTY HASANUDDIN UNIVERSITY

I would like to congratulate the Mathematics Department, Faculty of Mathematics and Natural Sciences, Hasanuddin University, and the Indonesian Mathematical Society Region Sulawesi (IndoMS) for successfully organizing this joint event of the International Conference on Mathematics, Statistics, Computer Sciences, and Mathematics Education 2015 (ICMSCSME 2015) and the South East Asian Mathematical Society School (SEAMS School) on Coding and Graphs 2015.

It gives me great pleasure to welcome all distinguished guests, invited speakers, invited lecturers, and participants to UNHAS and to Makassar, Indonesia. For some of you this may be your first visit to Makassar, and I wish you SELAMAT DATANG. I hope your brief stay in Makassar will be a memorable and fruitful one.

UNHAS is committed to fulfilling the strategy set forth in the National Higher Education Plan for Indonesian higher education institutions. This conference demonstrates the commitment of UNHAS to promoting internationalization as one of its main agendas. Its commitment to international research collaboration includes collaboration in producing new findings, in teaching and learning, and in service activities, creating opportunities for collaborative efforts and thus enhancing research and possible research exchange.

It is the aspiration of UNHAS to become an established research university, and UNHAS continuously promotes international research collaboration. I sincerely hope that this joint conference will be a platform where international research collaborations can be fostered and subsequently nurtured.

I hope this proceeding is of benefit to all readers.

Yours faithfully,

Dr.Eng. Amiruddin


TABLE OF CONTENTS

COVER ……………………………………………………………………. i

PREFACE …………..………………………………………………………. iii

TABLE OF CONTENTS …………………………………………………… v

KEYNOTE SPEECH

Ismail Mohd and Ahmad Kadri Junoh, Non Usury Model For Conventional And Islamic Banking System ............ 1-19

Stephanus Suwarsono, Involving Culture in the Teaching and Learning of Mathematics, as an Approach for Exploring and Understanding the Applications of Mathematics ............ 20-28

ORAL PRESENTATION

I. MATHEMATICS

M.1. Idha Sihwaningrum, Ari Wardayani, Suroto. Weak Type Inequality for Maximal Operators on Morrey Spaces over Metric Measure Spaces ............ 29-33

M.2. M. Imran, Naimah Aris. Existence of global attractor in strongly continuous semigroups {T_t} that has Lyapunov function of a metric space ............ 34-40

M.3. Marjo-Anne B. Acob, Jean O. Loyola. An Algorithm for Propagating Graceful Trees Using the Adjacency Matrix of a Given Graceful Caterpillar ............ 41-48

M.4. Naimah Aris, A. Muh. Amil Siddik, Muh. Nur. The Existence of Global Attractor for a Strongly Continuous Semigroup in Metric Space ............ 49-54

M.5. Nur Erawaty. Integer Solutions for Pell's Equation ............ 55-62

M.6. Nur Ilmiyah Djalal, Armin Lawi, Aidawayati Rangkuti. Supply Chain Management 3-Echelon ............ 63-67


M.7. Ratianingsih, R., Jaya, A.I., Santule, M.B. The Predator-Prey Model of Fishery Cultivation in Conservation Zone with Top-Predator Attack ............ 68-73

M.8. Usman Pagalay, Budimawan, Silvia Anggraini. The Stability Of Harvesting Logistic Model On Fishery ............ 74-78

II. STATISTICS

S.1. Erna Tri Herdiani. Variance Vector Control Chart with Mean Square Successive Difference Method (MSSD) To Monitoring Variability Multivariate Control Process ............ 79-84

S.2. Fatimah Ashara, Erna Tri Herdiani, and M. Saleh. Estimation Parameter of Vector Autoregressive Model Using by Two Stage Least Square Method ............ 85-88

S.3. Herianti, Anisa, Ladpoje. Multiple Imputation with PMM Method To Estimate Missing Data On Nonresponse Item ............ 89-96

S.4. Miftahuddin, Anisa K., Asma G. A Review of the Time Effects in the SST Data using Modified GamboostLSS Models ............ 97-107

S.5. Muflihah, Armin Lawi, Erna Tri Herdiani. Prediction of Rainfall by State Space Model For Missing Data ............ 108-112

III. COMPUTER SCIENCE

SC.1. Heliawaty Hamrul, Hardiana. Data Warehouse Software To Support Arrangement Of Standart 3 Borang Accreditation ............ 113-121

SC.2. Loeky Haryanto, Nurdin. An algorithm for searching the total edge-irregularity strength of the corona graph Pm⊙Pn with minimal weights on the edges of its subgraph Pm ............ 122-132

SC.3. Monika S. Rahayu, Armin Lawi, Sri Astuti Thamrin. The Comparison Multiclass Classification Using Support Vector Machine ............ 133-144


SC.4. Octavian, Supri Bin Hj Amir, Armin Lawi. Determining of the Relations of Specific Variables in Massive Database Using Association Rule Learning ............ 145-152

SC.5. Rahmawati, Sri Astuti Thamrin, Armin Lawi. Kernel Bayesian-Based Classification For Microarray Data ............ 153-159

SC.6. Rini Anggraini, Armin Lawi, Sri Astuti Thamrin. Ensemble Support Vector Machine Optimization Using Adaboost Algorithm ............ 160-166

IV. MATHEMATICS EDUCATION

ME.1. Budi Nurwahyu, Tatag Y.E.S., St. Suwarsono. Students' Concept Image of Permutation and Combination viewed from Difference of Gender with High Ability in Basic Mathematics ............ 167-185

ME.2. Ety Tejo Dwi Cahyowati. The Reduction of Student Misperception in Set Topic through Cognitive Conflict ............ 186-191

ME.3. Georgina Maria Tinungki. The Role of Cooperative Learning Type Team Assisted Individualization to Improve the Students' Mathematics Communication Ability in the Subject of Probability Theory ............ 192-199

ME.4. Masduki, Rita Pramujiyanti Khotimah. An Error Analysis of Students to Solve The First Order Differential Equations ............ 200-205

ME.5. Muslimin, Muh. Hajarul Aswad A. Alternative Completion of Poverty in Indonesia through Mudharabah ............ 206-210

ME.6. Nasrullah. Using Daily Problems to Measure Math Literacy and Characterise Mathematical Abilities for Students in South Sulawesi ............ 211-218

ME.7. Nining Setyaningsih, Sri Rejeki and Sri Sutarni. Developing A Mathematics Instructional Model Based on RAKIR (Child Friendly, Innovative, Creative and Realistics) At Junior High School ............ 219-227

ME.8. Rita Pramujiyanti Khotimah, Masduki. Problem Solving Ability of Students to Solve Ordinary Differential Equations ............ 228-235

ME.9. Saleh Haji, M. Ilham Abdullah. Developing Students' Mathematical Communication Through Realistic Mathematics Learning ............ 236-244


ME.10. Sitti Busyrah Muchsin. Dwi's Concept Understanding Concerning Operation on Integer ............ 245-250

ME.11. Sitti Maesarah. Analysis Mastery of Mathematics Teacher of Implementation Curriculum 2013 in the Junior School ............ 251-256

ME.12. Tedy Machmud, Sumarno Ismail, Nursia Bito. Development of PCL Approach in Mathematics Learning Integrated with Character Education at Junior High Schools in Gorontalo Province ............ 257-263

ME.13. Yuda Satria Nugraha. The Effectiveness of Graph Theory's Learning Model Based on Decision-Making System Using Analytical Hierarchy Process (AHP) (Case Study of Semester IV-C Students Academic Year 2014/2015) ............ 264-270

ME.14. Yumiati. The Application of Connecting, Organizing, Reflecting, and Extending Learning in Enhancing Students' Algebraic Thinking Skill ............ 271-283

II. STATISTICS


Variance Vector Control Chart with Mean Square Successive Difference Method (MSSD) To Monitoring Variability Multivariate Control Process

Erna Tri Herdiani¹

¹Department of Mathematics, Faculty of Mathematics and Natural Sciences, Hasanuddin University, Jl. Perintis Kemerdekaan Km 10 Tamalanrea
email: [email protected]

ABSTRACT

Multivariate control charts for monitoring the variance-covariance matrix typically use statistics based on the matrix determinant and the matrix inverse. When the data have many variables, these statistics are difficult to compute. To overcome this, the vector variance has been proposed as an alternative to the pre-existing M-Box and Jennrich statistics. In general, however, the variance-covariance matrix involved in the vector variance is estimated from the full data set (FDS). In this paper we investigate the effect on the variance vector control chart of estimating the variance-covariance matrix with the Mean Squared Successive Differences (MSSD) method. The results are applied to weather data for the city of Makassar from 2003 to 2012.

Key Words: multivariate control charts, control charts, vector variance, variance-covariance matrix, mean square successive difference

1. Introduction

Multivariate control charts are used to monitor several variables jointly in a process; usually the mean vector and the variance-covariance matrix are controlled. Charts for the variance-covariance matrix are generally based on the matrix determinant and the matrix inverse; see [7], [6]. For data with many variables, the vector variance was proposed as an alternative to the M-Box and Jennrich statistics; see [3], [4]. In practice the variance-covariance matrix is estimated from a sample. When it is estimated by the maximum likelihood method, which involves all available observations, the estimation is said to use the Full Data Set (FDS) [1]. In this paper we investigate the construction of vector variance charts when the variance-covariance matrix is estimated with the Mean Squared Successive Differences (MSSD) method [5].

2. Estimation Of Variance-Covariance Matrix

Let $x_1, x_2, \ldots, x_n$ be a random sample from a multivariate normal distribution with $p$ variables, mean vector $\mu$ and variance-covariance matrix $\Sigma$. The variance-covariance matrix can be estimated with the Full Data Set (FDS) method or the Mean Squared Successive Differences (MSSD) method. The FDS estimator of $\Sigma$, obtained by maximum likelihood using all $n$ observations, is


$$S_1 = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{T}.$$

The second estimator uses the differences between successive pairs of observations, $\mathbf{v}_i = \mathbf{x}_i - \mathbf{x}_{i-1}$, $i = 2, 3, \ldots, n$. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a multivariate random sample with $p$ variables, where the $j$-th vector is
$$\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{pj})^{T}, \qquad j = 1, 2, \ldots, n,$$
with $p$ the number of quality characteristics and $n$ the number of samples. The estimate of the mean vector is
$$E(\hat{\mu}) = \bar{\mathbf{X}} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)^{T}, \qquad \bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \quad i = 1, 2, \ldots, p.$$
Stacking the successive differences as rows gives the matrix
$$V = (\mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n)^{T},$$
so that the MSSD estimator is
$$S_2 = \frac{1}{2}\,\frac{V^{T}V}{n}.$$
Both estimates, $S_1$ and $S_2$, will be used to estimate the sample variance-covariance matrix appearing in the vector variance statistic.
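Both estimators are simple to compute from an $n \times p$ data matrix. The following is a minimal NumPy sketch (illustrative code, not from the paper; the array `X` stands in for the weather data) of the FDS estimate $S_1$ and the MSSD estimate $S_2$ defined above.

```python
import numpy as np

def fds_covariance(X):
    """S1: sample covariance from the full data set (FDS)."""
    X = np.asarray(X, dtype=float)
    D = X - X.mean(axis=0)              # centered observations x_i - xbar
    return D.T @ D / (X.shape[0] - 1)   # (1/(n-1)) sum (x_i - xbar)(x_i - xbar)^T

def mssd_covariance(X):
    """S2: estimate based on mean squared successive differences (MSSD)."""
    X = np.asarray(X, dtype=float)
    V = np.diff(X, axis=0)              # rows v_i = x_i - x_{i-1}, i = 2,...,n
    return V.T @ V / (2.0 * X.shape[0]) # S2 = V'V / (2n), as written in the text

# Example with simulated data: p = 4 quality characteristics, n = 12 observations
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
S1, S2 = fds_covariance(X), mssd_covariance(X)
```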

3. Vector Variance Control Chart

Let $X$ be a random vector that is the superposition of $X^{(1)}$ and $X^{(2)}$, of dimensions $p$ and $q$ respectively, so that $X = \left(X^{(1)T}, X^{(2)T}\right)^{T}$. Suppose also that $\mu^{(i)} = E\!\left[X^{(i)}\right]$, $i = 1, 2$, and $\Sigma_{ij} = E\!\left[(X^{(i)} - \mu^{(i)})(X^{(j)} - \mu^{(j)})^{T}\right]$, $i, j = 1, 2$. The covariance matrix of $X$, denoted $\Sigma$, can then be written in partitioned form as
$$\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}.$$
[3] suggests, following [2], using $Tr(\Sigma_{12}\Sigma_{21})$ to measure the linear relationship between the two random vectors $X^{(1)}$ and $X^{(2)}$. This parameter is called the covariance vector; it is the sum of all diagonal elements of $\Sigma_{12}\Sigma_{21}$. Likewise, as proposed in [3], $Tr(\Sigma_{11}^{2})$ and $Tr(\Sigma_{22}^{2})$ are called the vector variance (VV) of $X^{(1)}$ and $X^{(2)}$, respectively. If $p = q = 1$, the covariance vector is the square of the covariance, while the vector variance is the square of the classical variance. The vector variance is written with the symbol $Tr(\Sigma^{2})$; when the variance-covariance matrix is estimated by the sample variance-covariance matrix $S$, the sample vector variance is denoted $Tr(S^{2})$.


[3] states that the asymptotic distribution of the vector variance is $N\!\left(Tr(\Sigma^{2}), \frac{8\,Tr(\Sigma^{4})}{n-1}\right)$, i.e. with mean $Tr(\Sigma^{2})$ and variance $\frac{8\,Tr(\Sigma^{4})}{n-1}$. This variance is used to establish the control limits of the vector variance control chart.

Theorem 1
Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from the multivariate normal distribution $N_p(\mu, \Sigma)$. If $Tr(S^{2}) \xrightarrow{d} N\!\left(Tr(\Sigma^{2}), \frac{8\,Tr(\Sigma^{4})}{n-1}\right)$, then the vector variance control chart has Upper Control Limit (UCL) $Tr(\Sigma^{2}) + 3\sqrt{\frac{8\,Tr(\Sigma^{4})}{n-1}}$ and Lower Control Limit (LCL) $Tr(\Sigma^{2}) - 3\sqrt{\frac{8\,Tr(\Sigma^{4})}{n-1}}$. The corresponding significance level is $\alpha = 0.0027$.

Furthermore, if the variance-covariance matrix is estimated by FDS, the variance vector control chart becomes that of Theorem 2.

Theorem 2
If the variance-covariance matrix $\Sigma$ is estimated with the full data set (FDS) as
$$S_1 = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{T},$$
then the vector variance control chart has control limits
UCL: $Tr(S_1^{2}) + 3\sqrt{\frac{8\,Tr(S_1^{4})}{n-1}}$ and LCL: $Tr(S_1^{2}) - 3\sqrt{\frac{8\,Tr(S_1^{4})}{n-1}}$.

Furthermore, if the variance-covariance matrix is estimated by MSSD, the variance vector control chart becomes that of Theorem 3.

Theorem 3
If the variance-covariance matrix $\Sigma$ is estimated with the Mean Square Successive Difference (MSSD) method as
$$S_2 = \frac{1}{2}\,\frac{V^{T}V}{n},$$
then the vector variance control chart has control limits
UCL: $Tr(S_2^{2}) + 3\sqrt{\frac{8\,Tr(S_2^{4})}{n-1}}$ and LCL: $Tr(S_2^{2}) - 3\sqrt{\frac{8\,Tr(S_2^{4})}{n-1}}$.
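As a hedged illustration of Theorems 2 and 3 (not the authors' software), the sketch below computes the centre line $Tr(S^{2})$ and the 3-sigma control limits from any covariance estimate $S$ ($S_1$ or $S_2$) and the sample size $n$; the function name is hypothetical.

```python
import numpy as np

def vv_control_limits(S, n, k=3.0):
    """Return (LCL, CL, UCL) of the vector variance chart for a covariance estimate S."""
    S = np.asarray(S, dtype=float)
    S2 = S @ S
    cl = np.trace(S2)                                    # centre line: Tr(S^2)
    sigma = np.sqrt(8.0 * np.trace(S2 @ S2) / (n - 1))   # sqrt(8 Tr(S^4) / (n - 1))
    return cl - k * sigma, cl, cl + k * sigma

# e.g. lcl_fds, cl_fds, ucl_fds = vv_control_limits(S1, n=10)   # one value per yearly subgroup
```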

Theorems 2 and 3 are applied below to weather data for Makassar City from 2003 to 2012.

4. Result and Discussion: Study of Weather Data in Makassar, 2003 to 2012

This case study uses yearly data on air temperature ($\mathbf{X}_1$), solar irradiation ($\mathbf{X}_2$), humidity ($\mathbf{X}_3$) and wind speed ($\mathbf{X}_4$) in Makassar City from 2003 to 2012, arranged in subgroups $q$, where $q$ indicates the year, defined by $k = 1, 2, \ldots, q$, with


$p = 4$ quality characteristics, namely temperature (degrees Celsius), solar irradiation (percent), humidity (percent) and wind speed (knots), symbolized by $\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3$ and $\mathbf{X}_4$.

Table 1. Values of $Tr(S_1^{2})$ and $Tr(S_2^{2})$ for the Makassar City weather data, 2003-2012

Year    Tr(S1^2)   Tr(S2^2)
2003    304340     19874.27
2004    236560     17392.14
2005    146460      7621.045
2006    293190     16591.38
2007    211490     22240.75
2008    708460       825.891
2009    257300     16574.13
2010     73927     19950.08
2011    325540     13309.43
2012     99778      7806.186

Source: Data processing, 2015

The data in Table 1 are shown in Figure 1.

Figure 1. Values of $Tr(S_1^{2})$ and $Tr(S_2^{2})$ by year

The variance vector control charts based on the two different estimates of the variance-covariance matrix are given in Table 2.

Table 2. Control limits of the variance vector control chart

         FDS        MSSD
UCL    888,750    143,700
CL     233,240     38,048
LCL   -422,280    -67,599

Source: Data processing, 2015


Both control charts are shown in Figure 2.

Figure 2. Variance vector control charts (yearly $Tr(S^{2})$ values plotted against the FDS and MSSD control limits)

Based on Figure 2, the variance vector control chart obtained with the FDS method shows all observations inside the control limits, whereas the chart obtained with the MSSD method shows one observation outside the control limits, namely the year 2008. It is therefore necessary to recompute the variance vector control chart with the MSSD method, eliminating the 2008 data, until all the data fall within the control limits.

When the 2008 data are eliminated, the variance vector control chart with the MSSD method gives the following values:

Table 3. Control limits of the vector variance control chart with the MSSD method, before and after omitting the 2008 data

         Before omission   After omission
UCL          143,700           160,090
CL            38,048            40,531
LCL          -67,599           -79,031

Source: Data processing, 2015

The results obtained after the 2008 data are eliminated can be seen in Figure 3.



Figure 3. Vector variance control chart with the MSSD method after the 2008 data have been omitted (yearly $Tr(S^{2})$ values plotted against UCL = 160,090 and LCL = -79,031)

This completes the vector variance control charts for the Makassar weather data from 2003 to 2012. The charts based on the two different estimates, FDS and MSSD, differ: the MSSD method detects the 2008 data as an outlier, while the FDS method does not. Further research can therefore examine which of the two, FDS or MSSD, performs better.

5. Conclusion

The vector variance control chart is based on the sample variance-covariance matrix estimated with either FDS or MSSD. The MSSD chart detects the 2008 data as out of control, while the FDS chart does not. To determine which chart performs better, the two charts should be compared using the Average Run Length (ARL).

REFERENCES

[1] Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis. Third Edition, pp. 251-282. Stanford University.
[2] Cleroux, R. (1987). Multivariate Association and Inference Problems in Data Analysis. Proceedings of the Fifth International Symposium on Data Analysis and Informatics, Vol. 1, Versailles, France.
[3] Djauhari, M.A. (2007). A Measure of Data Concentration. Journal of Probability and Statistics, Vol. 2, No. 2, 139-155.
[4] Herdiani, E.T. and Djauhari, M.A. (2013). Asymptotic Distribution of Vector Variance Standardized Variable Without Duplication. Journal of Concrete and Applicable Mathematics (JCAAM), Vol. 11, No. 1, January 2013, 87-95.
[5] Levinson, W.A., Holmes, D.S., and Mergen, A.E. (2002). Variation Charts for Multivariate Processes. Quality Engineering, Vol. 14, Issue 4, 539-545.
[6] Sindelar, M.F. (2007). Multivariate Statistical Process Control for Correlation Matrices. Pittsburgh: University of Pittsburgh.
[7] Tang, G.Y.N. (1998). The Intertemporal Stability of the Covariance and Correlation Matrices of Hong Kong Stock Returns. Applied Financial Economics, 8, pp. 359-365.



Estimation Parameter of Vector Autoregressive Model Using by Two Stage Least Square Method

Fatimah Ashara¹, Erna Tri Herdiani², and M. Saleh³

Department of Mathematics, Faculty of Mathematics and Natural Sciences, Hasanuddin University, Jl. Perintis Kemerdekaan Km.10 Tamalanrea, Makassar, Indonesia
*Corresponding Author: [email protected]

ABSTRACT

The Vector Autoregressive (VAR) model is a very useful analytical tool for understanding the reciprocal relationships (interrelationships) between economic variables and for describing a structured economy. Two important assumptions must hold for time series data to be formed into a VAR model: stationarity, and normally distributed, mutually independent errors. The method usually used to estimate the parameters of the VAR model is OLS; here we develop this by using the TSLS method to estimate the parameters of the VAR model. The Two Stage Least Squares (TSLS) method has two stages: in the first stage the parameters are estimated by OLS, and in the second stage the results of the first-stage estimation are used to obtain the TSLS estimate. This work shows that, in addition to OLS, the VAR model can also be estimated using the TSLS method.

Keywords: Vector Autoregressive models, OLS, TSLS, regression analysis

1. Introduction

A forecasting method is a way to predict or estimate, quantitatively or qualitatively, what will happen in the future based on relevant data from the past; forecasting methods are thus expected to provide greater objectivity. Forecasting requires a particular method, and which method to use depends on the data and information to be predicted and on the objectives to be achieved. One frequently used forecasting approach is time series analysis.

Time series analysis is basically used to analyze data while taking the influence of time into account. Data collected periodically in a time sequence, whether hourly, daily, weekly, monthly or yearly, can be analyzed with time series methods, and such analysis can be carried out not only for one variable (univariate) but also for many variables (multivariate).

One model for multivariate data is the Vector Autoregressive (VAR) model. VAR was first introduced by C.A. Sims (1972) as a development of the work of Granger (1969). One use of the VAR model is forecasting, especially short-term forecasts. VAR models can also be used to examine dynamically the effect of changes in one variable of the system on


another variable (Juanda and Junaidi, 2012). VAR analysis can basically be paired with a simultaneous equation model, since it considers several dependent variables together in one model. The VAR model can therefore be estimated using the two stage least squares method, because the TSLS method is generally used to estimate simultaneous equations whose variables are interrelated, and the resulting estimates are consistent and efficient.

2. Vector Autoregressive (VAR) Model

VAR models can be used to determine causal relationships. As part of econometrics, the VAR model is one of the topics of multivariate time series analysis. A time series $Z_t$ follows the VAR(p) model if it satisfies
$$Z_t = \phi_0 + \phi_1 Z_{t-1} + \cdots + \phi_p Z_{t-p} + \varepsilon_t, \qquad p > 0,$$
where
$Z_t = (y_{1t}, \ldots, y_{nt})'$ is of size $(n \times 1)$;
$\phi_i$ is a coefficient matrix of size $(n \times n)$;
$\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{nt})'$ is $n$-dimensional white noise;
$p$ is the order (number of lags) of the model and $n$ the number of variables.
Because $\phi_0$ is assumed to be zero, the VAR model can be written as
$$Z_t = \phi_1 Z_{t-1} + \cdots + \phi_p Z_{t-p} + \varepsilon_t,$$
or, in matrix form (shown here for one lag),
$$\begin{bmatrix} Z_{1t}\\ Z_{2t}\\ \vdots\\ Z_{nt}\end{bmatrix} = \begin{bmatrix}\phi_{11} & \phi_{12} & \cdots & \phi_{1n}\\ \phi_{21} & \phi_{22} & \cdots & \phi_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ \phi_{n1} & \phi_{n2} & \cdots & \phi_{nn}\end{bmatrix}\begin{bmatrix} Z_{1,t-1}\\ Z_{2,t-1}\\ \vdots\\ Z_{n,t-1}\end{bmatrix} + \begin{bmatrix}\varepsilon_{1t}\\ \varepsilon_{2t}\\ \vdots\\ \varepsilon_{nt}\end{bmatrix}.$$
From the matrix form above, the model can be written in the general form
$$Y = BX + \varepsilon.$$
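To make the compact form concrete, the following sketch (assumptions mine: `Z` is a T x n array of observations and $\phi_0 = 0$) builds the matrices $Y$ and $X$ of $Y = BX + \varepsilon$ for a VAR(p) by stacking the $p$ lags; it is an illustration, not code from the paper.

```python
import numpy as np

def var_design(Z, p):
    """Return Y (n x (T-p)) and X (n*p x (T-p)) such that Y ≈ B X for a VAR(p)."""
    Z = np.asarray(Z, dtype=float)
    T, n = Z.shape
    Y = Z[p:].T                                                  # current values z_t, t = p+1,...,T
    X = np.vstack([Z[p - k:T - k].T for k in range(1, p + 1)])   # stacked lags z_{t-1},...,z_{t-p}
    return Y, X
```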

3. Result and Discussion

In general, VAR models can be estimated with Ordinary Least Squares (OLS). Since VAR analysis considers several dependent variables together in one model, it can be paired with a simultaneous equation model, and the VAR model can therefore also be estimated with the Two Stage Least Squares (TSLS) method. In the OLS estimation method, following [2], the VAR model is written in matrix form as
$$Y = BX + \varepsilon,$$
and the parameters are estimated by minimizing the sum of squares
$$\varepsilon'\varepsilon = (Y - BX)'(Y - BX),$$
which yields the estimator
$$\hat{B} = YX'(XX')^{-1}.$$
The TSLS parameter estimation follows [4]. TSLS estimates the parameters in two stages; its estimator is likewise obtained by minimizing the sum of squares of the equation


$$A = \gamma\hat{Y} + \varepsilon,$$
so that
$$\varepsilon'\varepsilon = (A - \gamma\hat{Y})'(A - \gamma\hat{Y}),$$
which yields the estimator
$$\hat{\gamma} = A\hat{Y}'(\hat{Y}\hat{Y}')^{-1}.$$
The task here is to derive the VAR model parameter estimation by the TSLS method. The VAR model in general is
$$Z_t = \phi_1 Z_{t-1} + \cdots + \phi_p Z_{t-p} + \varepsilon_t,$$
which in matrix form is
$$Y = BX + \varepsilon,$$
or
$$\mathbf{y} = (X'\otimes I_n)\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

3.1 VAR model parameter estimation, stage 1 (OLS)

The OLS estimator is obtained by minimizing the sum of squared errors. With
$$\boldsymbol{\varepsilon} = \mathbf{y} - (X'\otimes I_n)\boldsymbol{\beta},$$
we have
$$\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{y} - (X'\otimes I_n)\boldsymbol{\beta})'(\mathbf{y} - (X'\otimes I_n)\boldsymbol{\beta}),$$
so that
$$S(\beta) = \mathbf{y}'\mathbf{y} + \boldsymbol{\beta}'(XX'\otimes I)\boldsymbol{\beta} - 2\boldsymbol{\beta}'(X\otimes I)\mathbf{y}.$$
Setting the partial derivative of the squared error to zero,
$$\left.\frac{\partial S(\beta)}{\partial\beta}\right|_{\beta=\hat{\beta}} = (XX'\otimes I)\hat{\boldsymbol{\beta}} - (X\otimes I)\mathbf{y} = 0,$$
gives
$$\hat{\boldsymbol{\beta}} = \left[(XX')^{-1}\otimes I^{-1}\right](X\otimes I)\mathbf{y},$$
or
$$\mathrm{vec}(\hat{B}) = \hat{\boldsymbol{\beta}} = \left((XX')^{-1}\otimes I\right)(X\otimes I)\,\mathrm{vec}(Y),$$
so that
$$\hat{B} = YX'(XX')^{-1}.$$
Next, the fitted values of $Y$ are determined to obtain the data used in the TSLS stage:
$$\hat{Y} = \hat{B}X = YX'(XX')^{-1}X = P_{xy}X.$$

3.2 VAR model parameter estimation, stage 2 (TSLS)

Suppose there is a matrix $A$ of size $(n \times t)$ to be regressed on the matrix $\hat{Y}$; the model formed is
$$A = \gamma\hat{Y} + \varepsilon,$$
or
$$\mathbf{a} = (\hat{Y}'\otimes I)\boldsymbol{\gamma} + \boldsymbol{\varepsilon},$$
where $\gamma$ is the coefficient of the $\hat{Y}$ regressor. The partial derivative is again used to minimize the sum of squared errors, with
$$\boldsymbol{\varepsilon} = \mathbf{a} - (\hat{Y}'\otimes I)\boldsymbol{\gamma}.$$


Then
$$\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{a} - (\hat{Y}'\otimes I)\boldsymbol{\gamma})'(\mathbf{a} - (\hat{Y}'\otimes I)\boldsymbol{\gamma}),$$
so that
$$S(\gamma) = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = \mathbf{a}'\mathbf{a} - 2\boldsymbol{\gamma}'(\hat{Y}\otimes I)\mathbf{a} + \boldsymbol{\gamma}'(\hat{Y}\hat{Y}'\otimes I)\boldsymbol{\gamma}.$$
To obtain the TSLS estimate, the partial derivative of the squared error is set to zero:
$$\left.\frac{\partial S(\gamma)}{\partial\gamma}\right|_{\gamma=\hat{\gamma}} = -2(\hat{Y}\otimes I)\mathbf{a} + 2(\hat{Y}\hat{Y}'\otimes I)\hat{\boldsymbol{\gamma}} = 0,$$
which gives
$$\hat{\boldsymbol{\gamma}} = \left((\hat{Y}\hat{Y}')^{-1}\otimes I^{-1}\right)(\hat{Y}\otimes I)\mathbf{a} = \left((\hat{Y}\hat{Y}')^{-1}\hat{Y}\otimes I\right)\mathbf{a},$$
so that
$$\hat{\gamma} = \mathrm{vec}\!\left(A\hat{Y}'(\hat{Y}\hat{Y}')^{-1}\right) = A\hat{Y}'(\hat{Y}\hat{Y}')^{-1}.$$
Because $\hat{Y} = P_{xy}X$, this can be written as
$$\hat{\gamma} = A(P_{xy}X)'\left(P_{xy}X(P_{xy}X)'\right)^{-1}.$$
Since $P_{xy}$ is a symmetric matrix, $P_{xy}' = P_{xy}$, so
$$\hat{\gamma} = AX'P_{xy}\left(P_{xy}XX'P_{xy}\right)^{-1} = AX'(XX')^{-1}P_{xy}^{-1}, \qquad (4.14)$$
where $P_{xy} = YX'(XX')^{-1}$, so that the TSLS estimator is
$$\hat{\gamma} = AX'(XX')^{-1}\left(YX'(XX')^{-1}\right)^{-1}.$$
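The two-stage procedure derived above can be summarized in a short NumPy sketch (an illustrative translation, not the authors' implementation). Here $Y$ and $X$ are the matrices of the compact VAR form, for example those produced by the `var_design` sketch in Section 2, and $A$ is the block that is regressed on the stage-1 fitted values; all names are illustrative.

```python
import numpy as np

def tsls_var(Y, X, A):
    """Stage 1: OLS fit of Y = B X; stage 2: regress A on the fitted values Y_hat.

    Y : (n x T) current values, X : (n*p x T) stacked lags, A : (m x T) target block
    with the same number of columns as Y.
    """
    B_hat = Y @ X.T @ np.linalg.inv(X @ X.T)                  # B_hat = Y X'(XX')^-1
    Y_hat = B_hat @ X                                         # fitted values (stage 1)
    gamma_hat = A @ Y_hat.T @ np.linalg.inv(Y_hat @ Y_hat.T)  # gamma_hat = A Y_hat'(Y_hat Y_hat')^-1
    return B_hat, gamma_hat
```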

4. Conclusion

Based on the results of this study it can be concluded that, theoretically, the TSLS estimator for the VAR model is
$$\hat{\gamma} = AX'(XX')^{-1}\left(YX'(XX')^{-1}\right)^{-1}.$$

REFERENCES

[1] Laub, Alan J. (2005). Matrix Analysis for Scientists and Engineers. University of California, California.
[2] Lutkepohl, Helmut (1991). New Introduction to Multiple Time Series Analysis. European University, New York.
[3] Susilawati, Sumarni (2014). Estimasi Parameter Model Vector Autoregressive Generalized Space Time Autoregressive Menggunakan Metode Two Stage Least Squares. Thesis, Universitas Hasanuddin, Makassar.
[4] Wang, S. & Hsiao, C. (2006). Modified Two Stage Least Squares Estimator for the Estimation of a Structural Vector Autoregressive Integrated Process. Journal of Econometrics, 135: 427-463.
[5] Wei, William W.S. (1994). Time Series Analysis. Addison-Wesley Publishing Company, California.


Multiple Imputation with PMM Method To Estimate Missing Data On Nonresponse Item

Herianti¹, Anisa², Ladpoje³

1,2,3 Department of Mathematics, Faculty of Mathematics and Natural Sciences, Hasanuddin University, Jl. Perintis Kemerdekaan Km.10 Tamalanrea, Makassar, Indonesia
¹email: [email protected]

ABSTRACT

Missing data are incomplete information commonly found in surveys, censuses and experiments. Several statistical analyses have been developed to deal with missing data, such as Multiple Imputation (MI), Maximum Likelihood, weighting methods, and so on. One MI approach is the Predictive Mean Matching (PMM) method. PMM is a technique that fills in missing data with values imputed from the observations closest to the values predicted by the model. This paper reviews the application of the PMM method to data that are not normally distributed and to normally distributed data. Both data sets are complete; values are deleted in such a way as to form a monotone missing data pattern and to fulfil the MAR (Missing At Random) missing data mechanism. The simulated missing data are analyzed with the PMM method using up to ten imputations, and the imputation results are used to compute the Relative Efficiency (RE). The RE values for the non-normally distributed data converge to 1 faster than those for the normally distributed data, so this study finds that the PMM method works well on non-normally distributed data.

Key Words: Multiple Imputation, PMM, Monotone Missing Data Pattern, MAR, Relative Efficiency.

1. Introduction

Many researchers have developed imputation methods, which are statistical procedures for dealing with missing data. Imputation methods are divided into single imputation and MI. Filling in a single value for each missing datum is called single imputation. MI is a technique that replaces each missing value with two or more plausible values as a representation of the missing value [1]. PMM is one multiple imputation method; its advantage is that it ensures the imputed values are more reasonable when the assumption of normality is violated [2].


The MI-with-PMM procedure predicts the missing values from the other variables; the missing data are then filled in m times to generate m complete data sets. The m complete data sets are analyzed with a standard procedure (e.g. regression), and the results are combined to obtain the RE for inference.

Basuki [3], using the 2007 IBS survey data for East Java, indicated that the PMM method works better on non-normally distributed data with a univariate missing data pattern. This paper therefore investigates the application of the PMM method to non-normally distributed and normally distributed data with a monotone missing data pattern.

2.1 Patterns and Mechanisms of the Missing Data

The missing data pattern relates to the configuration of missing values observed in a data set. Little and Rubin [4] describe several missing data patterns. The first is the general pattern, where the missing values have an arbitrary arrangement. The second, the univariate pattern, is a pattern in which a single variable is incomplete. The third, the multivariate pattern, occurs when a subset of the sample leaves the questionnaire incomplete because of loss of contact, refusal, or other reasons. The fourth, the monotone pattern, occurs when the number of complete observations on the first question item is larger than on the second, the number on the second larger than on the third, and so on; for example, a respondent moves house before the end of the study and the researchers cannot reach the new location. The fifth, the file matching pattern, is a planned missing data pattern; it is useful for collecting a large number of questionnaire items while reducing the burden on respondents. The sixth, the factor analysis pattern, arises when the first item is a variable X of size n x k that is completely missing and the second item is a variable Y of size n x p that is completely observed, with k < p; it can be seen as a factor analysis or a multivariate regression analysis with no observed predictor variables.

Missing data mechanisms describe possible relationships between the measured variables and the probability of missing data [5]. Little and Rubin [4] distinguish three types. Missing Completely at Random (MCAR) occurs when the probability of missing data on a variable is related neither to the other observed variables nor to the value of the variable itself. Missing at Random (MAR) occurs when the probability of missing data on a variable is related to other observed variables but not to the missing value itself; as an illustration, in a study measuring the weight and height of students in Makassar, female respondents will tend to refuse to answer the question about their weight. Not Missing at Random (NMAR) occurs when the probability of missing data on a variable depends on the variable itself.


2.2 MI with the Predictive Mean Matching Method

Imputation methods fill in missing values to deal with nonresponse items. They are divided into single imputation and multiple imputation. A method that imputes a single value for each nonresponse item is called single imputation; at the analysis stage, the value obtained from single imputation is treated as if it were real data. The disadvantage of single imputation is that the value used to replace the missing data does not reflect the variability of the sample values under the nonresponse model that actually holds. This disadvantage can be resolved by using multiple imputation [6].

MI is a technique that replaces the missing data values with two or more acceptable values representing the probability distribution: there are m values for each missing datum, forming m completed data sets [7]. A composite method was first introduced by Rubin in 1987 and then developed by Little in 1988 to handle multivariate nonresponse; Little called this composite method Predictive Mean Matching. The PMM method is basically the same as the regression method, the difference being that each missing value is imputed from the observation closest to the value predicted by the model [6].

Analysis of missing data with MI and PMM needs to take into account the missing data pattern, the missing data mechanism, the variable types, and the distribution of the data. The PMM method assumes that the missing data mechanism is MAR and works on a monotone missing data pattern.

2.3 Relative Efficiency of Multiple Imputation

The relative efficiency of the imputation results is used to determine how good the estimates of the population parameters are. It is related to how much data is missing and to the number m of imputations performed. According to Bruin [9], when the amount of missing data is very low, adequate efficiency can be achieved with only a few imputations; if the amount of missing data is larger, more imputations m are usually required to achieve a sufficient efficiency value. Some literature uses 3 to 5 imputations, while Schafer [10] recommends 3 to 10 imputations if a large amount of information is missing. A method is said to be efficient if the RE value equals one [6].

The parameter $\mathbf{b}' = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)^T$ is the regression coefficient vector obtained from the completed (imputed) data. The point estimate of each component of $\mathbf{b}'$, say $b$, has average $\bar{b}$ over the m imputations, calculated as
$$\bar{b} = \frac{1}{m}\sum_{i=1}^{m} b_i, \qquad (1)$$


where $b_i$ is the point estimate of the component from imputation $i$, $\bar{b}$ is the average of the $b_i$, and $m$ is the number of imputations.

Let $W_i$ be the variance obtained from the variance-covariance matrix of the regression parameters for imputation $i$; it equals (Mean Square Error) $\times\,(X^T X)^{-1}$. The variance estimate in multiple imputation is partitioned into the within-imputation variance and the between-imputation variance. $\bar{W}$ is the average within-imputation variance over the m imputations; according to Yuan [8], it is
$$\bar{W} = \frac{1}{m}\sum_{i=1}^{m} W_i, \qquad (2)$$
where $W_i$ is the within-imputation variance of imputation $i$.

The between-imputation variance is given by
$$B = \frac{1}{m-1}\sum_{i=1}^{m}(b_i - \bar{b})^2, \qquad (3)$$
where $B$ is the between-imputation variance. According to Yuan [8], the total imputation variance combines the two variances as
$$T = \bar{W} + \left(1 + \frac{1}{m}\right)B, \qquad (4)$$
where $T$ is the total imputation variance.

Because the statistic $(b - \bar{b})\,T^{-1/2}$ approximately follows a t-distribution, the degrees of freedom [8] can be written as
$$df = (m-1)\left[1 + \frac{m\bar{W}}{(m+1)B}\right]^2, \qquad (5)$$
where $df$ is the degrees of freedom. The statistic $r$ is defined as the relative increase in variance due to nonresponse [8]:
$$r = \frac{(1 + m^{-1})B}{\bar{W}}. \qquad (6)$$
A large value of m results in a small $r$ and large degrees of freedom $df$, so that the distribution will be nearly normal [6]. Another very useful statistic related to nonresponse is the fraction of missing information, a value that affects the


speed of convergence: the larger the fraction, the more slowly the estimates converge. The fraction is calculated as
$$\gamma = \frac{(r+2)/(df+3)}{r+1}, \qquad (7)$$
where $\gamma$ is the fraction. The relative efficiency is the efficiency obtained by using m imputations; its value is obtained from m and $\gamma$ by the formula [8]
$$RE = \left(1 + \frac{\gamma}{m}\right)^{-1}, \qquad (8)$$
where $RE$ is the relative efficiency.
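The pooling quantities (1)-(8) for a single regression coefficient can be computed directly. The sketch below is an illustration (hypothetical inputs, not the authors' Excel worksheets), assuming the m point estimates and their within-imputation variances are supplied as arrays.

```python
import numpy as np

def relative_efficiency(b, W):
    """Combine m imputations of one coefficient and return its relative efficiency."""
    b, W = np.asarray(b, dtype=float), np.asarray(W, dtype=float)
    m = len(b)
    b_bar = b.mean()                                      # (1) pooled point estimate
    W_bar = W.mean()                                      # (2) within-imputation variance
    B = ((b - b_bar) ** 2).sum() / (m - 1)                # (3) between-imputation variance
    T = W_bar + (1 + 1 / m) * B                           # (4) total variance
    df = (m - 1) * (1 + m * W_bar / ((m + 1) * B)) ** 2   # (5) degrees of freedom
    r = (1 + 1 / m) * B / W_bar                           # (6) relative increase in variance
    gamma = ((r + 2) / (df + 3)) / (r + 1)                # (7) fraction of missing information
    return 1.0 / (1 + gamma / m)                          # (8) relative efficiency
```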

3. Methodology

This paper uses two groups of data for the simulation. The first group is not normally distributed and the second is normally distributed, and each group has three variables: X1, X2, and Y1 for the non-normally distributed data, and X3, X4, and Y2 for the normally distributed data. Both are complete data sets from surveys conducted by Ilham Nurhidayah [11] and Agustina Karoma [12], students of Hasanuddin University.

Each data group is designed so that values are omitted to form a monotone missing data pattern, with omission at random to fulfil the MAR (Missing At Random) missing data mechanism, as shown in Table 1. Three omission simulations are carried out on each data group: the first omits 2% of X2 and 5% of Y1, the second 5% of X2 and 10% of Y1, and the last 10% of X2 and 20% of Y1.

Table 1. Simulation of omission on the data: 2% on X2 and 5% on Y1 (samples 1-38 on the variables X1, X2 and Y1; shaded cells denote missing data, unshaded cells complete data)

The missing values are estimated with the PMM method; each missing value is replaced using m = 3 to m = 10 imputations. The steps of the PMM method are [6]:
a. Model the complete data with the regression model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.


b. Estimate the parameters of the equation $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with the least squares method. After the model parameter estimates are found, calculate the estimated error variance by the formula
$$\hat{\sigma}^2 = \frac{\varepsilon^T\varepsilon}{n-p}, \qquad (9)$$
where $\hat{\sigma}^2$ is the estimated error variance, $\varepsilon^T\varepsilon$ is the sum of squares due to error, $n$ is the number of complete observations, and $p$ is the number of parameters.
c. The estimates obtained in step b are used in the imputation stage with the following steps (a sketch of these steps is given below):
1. Calculate $\sigma_{*i}^2 = \hat{\sigma}^2(n-p)/g_i$, where $i = 1, 2, \ldots, m$ and $g_i$ is a random variable generated from the chi-square distribution with $n-p$ degrees of freedom ($\chi^2_{n-p}$).
2. Calculate the new parameter values $\boldsymbol{\beta}_{*i} = \hat{\boldsymbol{\beta}} + \sigma_{*i}\mathbf{V}_u\mathbf{Z}$, where $i = 1, 2, \ldots, m$, $\mathbf{V}_u$ is the upper triangular matrix of the Cholesky decomposition and $\mathbf{Z}$ is a vector of $p$ random variables generated from the standard normal distribution $N(0, 1)$.
3. Replace the missing data by $y_{*j} = \mathbf{X}\boldsymbol{\beta}_{*i}$, where $j$ indexes the respondents with a nonresponse item.
4. Impute by taking the observed value closest to $y_{*j}$, then repeat steps 1 to 3 m times.

The results of the m complete data sets are combined to obtain the RE for inference (Excel 2010 was used for all the simulations).
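A rough sketch of one PMM imputation following steps a-c is given below; it is an illustration only (the authors carried out the simulations in Excel 2010), and `X_obs`, `y_obs`, `X_mis` are hypothetical names for the complete cases and the design rows with a missing response. Step c.2 is implemented with the lower Cholesky factor of $(X'X)^{-1}$; the paper writes the upper factor, and any factor $M$ with $MM' = (X'X)^{-1}$ gives a draw with the same covariance.

```python
import numpy as np

def pmm_impute_once(X_obs, y_obs, X_mis, rng):
    """One PMM draw: returns imputed values for the missing responses."""
    n, p = X_obs.shape
    XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = XtX_inv @ X_obs.T @ y_obs                 # step b: least squares fit
    resid = y_obs - X_obs @ beta_hat
    sigma2_hat = resid @ resid / (n - p)                 # (9) estimated error variance
    g = rng.chisquare(n - p)                             # step c.1: chi-square draw
    sigma_star = np.sqrt(sigma2_hat * (n - p) / g)
    L = np.linalg.cholesky(XtX_inv)                      # factor of (X'X)^-1 used in step c.2
    beta_star = beta_hat + sigma_star * (L @ rng.standard_normal(p))
    y_star = X_mis @ beta_star                           # step c.3: predictive means
    fitted_obs = X_obs @ beta_star
    # step c.4: donate, for each missing case, the observed y whose predictive mean is closest
    idx = np.abs(fitted_obs[None, :] - y_star[:, None]).argmin(axis=1)
    return y_obs[idx]

# Repeat with fresh draws m times (m = 3,...,10) to obtain the m completed data sets.
```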

4. Result and Discussion

The main results of the simulation are presented in Table 2 and Table 3. Table 2 displays the relative efficiency of the imputation results of the PMM method for the non-normally distributed data. For missing data of 2% on X2 and 5% on Y1, the parameter estimates $\hat{\beta}_{*i}$ are obtained with the PMM method using 3 to 10 imputations, and similarly for the other two columns of relative efficiency. The columns of relative efficiency values are similar to one another. The PMM method is said to be efficient if the relative efficiency equals 1, and all three columns of relative efficiency values converge towards one fairly quickly. Although each column corresponds to a different amount of missing data, the RE values do not change significantly. This means that, for the non-normally distributed data, estimating the missing data yields essentially unbiased estimation with only 3 to 10 imputations.


Table 2. Relative Efficiency of the Imputation Results of the PMM Method on Non-Normally Distributed Data

Number of     Parameter   Missing 2% on X2   Missing 5% on X2   Missing 10% on X2
Imputations                and 5% on Y1       and 10% on Y1      and 20% on Y1
3             b*0          0.9999             0.9998             0.9960
              b*1          0.9999             0.9999             0.9999
              b*2          0.9999             0.9999             0.9999
5             b*0          0.9999             0.9995             0.9995
              b*1          0.9999             0.9999             0.9999
              b*2          0.9999             0.9999             0.9999
10            b*0          0.9999             0.9999             0.9994
              b*1          0.9999             0.9999             0.9999
              b*2          0.9999             0.9999             0.9999

Source: The results of data analysis, 2015

Table 3. Relative Efficiency of the Imputation Results of the PMM Method on Normally Distributed Data

Number of     Parameter   Missing 2% on X4   Missing 5% on X4   Missing 10% on X4
Imputations                and 5% on Y2       and 10% on Y2      and 20% on Y2
3             b*0          0.9998             0.9859             0.9801
              b*1          0.9939             0.9994             0.9636
              b*2          0.9969             0.9999             0.9767
5             b*0          0.9997             0.9981             0.9951
              b*1          0.9871             0.9999             0.9964
              b*2          0.9922             0.9999             0.9977
10            b*0          0.9999             0.9997             0.9988
              b*1          0.9988             0.9999             0.9996
              b*2          0.9993             0.9999             0.9997

Source: The results of data analysis, 2015

The results in Table 3 differ from those in Table 2. Table 3 gives the relative efficiency of the imputation results of the PMM method for the normally distributed data; its relative efficiency values move more slowly towards convergence, and each column has different RE values. The larger the amount of missing data, the more slowly the RE value approaches 1. This means that for these data the PMM method needs more than 10 imputations to produce an unbiased estimate.

5. Conclusion

This paper discusses the PMM method on non-normally distributed and normally distributed data. The sample size is 38; this complete simulation data set is used to create nonresponse data that follow a monotone missing data pattern and the MAR mechanism. The missing data are estimated with 3 to 10 imputations for each data group. Based on the simulations, the PMM method works better on the non-normally distributed data than on the normally distributed data, although this is likely valid only for small sample sizes and small amounts of missing data. Future research should consider a larger population size with a larger proportion of missing data.


REFERENCES

[1] Durrant, Gabriele B. (2005). Imputation Methods for Handling Item-Nonresponse in the Social Sciences: A Methodological Review. National Centre for Research Methods Working Paper Series.
[2] Horton, N. and Lipsitz, S. (2001). Multiple Imputation in Practice: Comparison of Software Packages for Regression Models with Missing Variables. Journal of the American Statistical Association, 55: 244-255.
[3] Basuki, R. (2009). Imputasi Berganda Menggunakan Metode Regresi dan Metode Predictive Mean Matching untuk Menangani Missing Data. Thesis, Institute of Technology Sepuluh November, Surabaya.
[4] Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Cambridge: John Wiley & Sons, Inc., pp. 3-8.
[5] Enders, Craig K. (2010). Applied Missing Data Analysis. New York: A Division of Guilford Publications, Inc.
[6] Mardiah, Hafti (2010). Imputasi Missing Value pada Data yang Mengandung Outlier. Thesis, University of Padjadjaran, Bandung, pp. 18-24.
[7] Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Canada: John Wiley & Sons, Inc.
[8] Yuan, Yang C. (2000). Multiple Imputation for Missing Data: Concepts and New Development. Rockville, MD: SAS Institute Inc., pp. 5.
[9] Bruin, J. (2006). Statistical Computing Seminars: Missing Data in SAS Part 1. UCLA Statistical Consulting Group. http://www.ats.ucla.edu/stat/sas/seminars/missing_data/mi_new_1.htm
[10] Schafer, Joseph L. (1999). Multiple Imputation: A Primer. Statistical Methods in Medical Research, 8: 3-15.
[11] Ilham, Nurhidayah (2014). Analisis Faktor-Faktor Yang Mempengaruhi Laba Usaha Dagang Pada Pasar Tradisional Di Kabupaten Pangkep. Mini Thesis, Hasanuddin University, Makassar.
[12] Karoma, Agustina R. (2013). Analisis Faktor-Faktor Yang Mempengaruhi Pola Konsumsi Mahasiswa Indekos Di Kota Makassar. Mini Thesis, Hasanuddin University, Makassar.


A Review of the Time Effects in the SST Data using Modified GamboostLSS Models

Miftahuddin*, Anisa K., Asma G.

Department of Mathematics and Statistics, Faculty of Mathematics and Natural Sciences, Syiah Kuala University, Hasanuddin University, Benazir Bhutto University
Jl. T.M. Abdul Rauf No.6 Darussalam, Banda Aceh, Indonesia
*Corresponding Author: [email protected]

ABSTRACT

To predict Sea Surface Temperature (SST) data precisely, the time effects of covariates need to be investigated on monthly and yearly bases; several climate features, e.g. humidity, temperature and rainfall, are of utmost importance for SST. Various approaches are used to investigate the effects of covariates on the SST data. We propose generalized additive models for location, scale, and shape fitted by boosting that take AR(1) autocorrelation into account (called gamboostLSS-AR(1)). The proposed method is applied to SST data and the results indicate that there are significant relationships between the covariates and the response. The gamboostLSS-AR(1) models are used to examine the effects of trends in the data.

Keywords: Annual and seasonal effects, gamboostLSS-AR(1), MPI-AR(1).

1. Introduction

Climate data sets have been used to obtain parameter uncertainties, where various parameter effects, such as global, local, marginal or partial aggregate levels and correlation effects, were also found (Magnus et al. [1]). In 2012, Magnus et al. [1] proposed a climate model to investigate the effects of solar radiation and the greenhouse effect on global warming; their analysis is based on data from land stations only and does not consider the relationship between sea and land data. Global warming interacts with SST patterns [2]. The increase in global temperature has a significant impact on the earth's climate. The earth's climate system is influenced by a large number of parameters, and Sea Surface Temperature (SST) is one of them: it affects the regional climate, which in turn influences global climate variability, specifically in the tropical Indian Ocean [3, 4]. SST data are very useful for obtaining an indication of the earth's climate, its variability, and tropical climate variability [3, 5, 6, 7].

In this paper the SST data set is used to model the relationship between sea and land variables. The variables have different measurement scales; the observed ranges are SST (27-31 degrees Celsius), air temperature (23-29 degrees Celsius), relative


humidity (70-100 percent) and rainfall (0-400 millimetres). The SST data come from one buoy in the Indian Ocean at position 1.5N 90E, covering 2006 to 2012 with 2263 daily observations. Three climate features have missing values: 4.1 percent of air temperature, 0.044 percent of humidity and 4.286 percent of the rainfall covariate.

The SST data obtained from the sea buoy are used for modelling. The issues in fitting SST data include gaps (missing observations) and autocorrelation in the available data. We propose modified gamboostLSS models using a penalized spline (P-spline) basis function [8] to overcome these problems in fitting the SST data. The marginal prediction interval (MPI) was investigated in [9] for generalized additive models for location, scale and shape (GAMLSS) without considering autocorrelation in the model fitting; in this paper we investigate MPI-AR(1), with autocorrelation at lag 1, in gamboostLSS model fitting. A model that takes the time autocorrelation effect into account provides many useful insights, and the hyper-parameters such as location, scale, and shape provide more detailed information. The proposed models have a flexible structure and smoothness that incorporate many covariate effects, and they can also be used to investigate further the effects of the time covariates on the location, scale and shape parameters.

2. GamboostLSS Models Considering Autocorrelation

Autocorrelation, or serially correlated errors, in the data can affect the response over time. The SST data are collected from different locations and therefore show variability. The SST data have two types of autocorrelation, spatial and temporal; temporal autocorrelation can arise at periodic time units such as daily, monthly, seasonal, and annual scales. Generally, autocorrelation occurs together with heteroscedasticity (serial correlation) when the assumptions (a) $E[\varepsilon\varepsilon'|X] = \sigma^2 I_n$ and (b) $E[\varepsilon_t \varepsilon_{t-1}] = 0$ may be violated [10, 11, 12].

We suggest an autoregressive AR(1) model in which a generalized differencing approach is used to handle autocorrelation in the data by incorporating an autoregressive process. Consider the additive model (AM) $y = f + \varepsilon$, where
$$Y_i = \beta_0 + \sum_{j=1}^{p} f_j(X_{ij}) + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$
so the errors $\varepsilon_t$ and $\varepsilon_{t-1}$ are $\varepsilon_t = y_t - f_t$ and $\varepsilon_{t-1} = y_{t-1} - f_{t-1}$. Referring to equation (1), we use the AR(1) model in our experiments in the form
$$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t, \qquad t = 1, 2, \ldots, n. \qquad (2)$$
If we assume the $u_t$ are uncorrelated random errors with zero mean and constant variance, then


E[ut] = 0, Var[ut] = σu 2, and, Cov[ut, us] = 0, t ≠ s, …(3)

and let us assume that ε ~ N(0, σ²Λ), where Λ is a correlation matrix defined through an AR(1) process with parameter ρ. The generalized additive model (GAM) structure is given as

g(μ) = g(E[Y | X_1, X_2, ..., X_p])    …(4)

where g(·) is a known link function. From equation (1) we have

f*(X) = β_0 + Σ_{j=1}^{p} f_j(x_j).

We assume that the response Y is univariate and continuous and that the loss function ρ is differentiable with respect to f*(X) [13, 14, 15]. The function f*(X) is estimated by minimizing the expected loss ρ(·), such that

f*(·) = argmin E_{Y,X}[ρ(Y_i, f*(X_i))]    …(5)

based on the training data (y_i, x_i), i = 1, ..., n. Also suppose that f*(X, β) is an approximating function with a set of parameters β ∈ ℝ^p. Because the expectation in (5) is unknown, it is minimized in practice by a gradient boosting algorithm. Furthermore, the function f(·) can be estimated by minimizing the empirical risk (ER),

ER = (1/n) Σ_{i=1}^{n} ρ(Y_i, f*(X_i))    …(6)

which is implemented by a gradient boosting algorithm, e.g. functional gradient descent [16].
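To make the empirical risk minimization in equation (6) concrete, the following minimal R sketch implements component-wise functional gradient descent (L2 boosting) on simulated data; the toy covariates, the step length nu, and the number of iterations mstop are illustrative assumptions, not the settings used for the SST data.

# Minimal sketch of functional gradient descent for the empirical risk (6),
# using squared-error loss and simple component-wise linear base-learners.
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))        # hypothetical covariates
y <- 2 * X$x1 + sin(3 * X$x2) + rnorm(n, sd = 0.3)   # toy response

nu    <- 0.1               # step-length factor
mstop <- 100               # number of boosting iterations
f_hat <- rep(mean(y), n)   # start from the offset (here the mean of y)

for (m in seq_len(mstop)) {
  u    <- y - f_hat                                  # negative gradient of the L2 loss
  fits <- lapply(X, function(x) lm(u ~ x))           # fit every base-learner to u
  sse  <- sapply(fits, function(fm) sum(residuals(fm)^2))
  best <- which.min(sse)                             # base-learner with the smallest risk
  f_hat <- f_hat + nu * fitted(fits[[best]])         # take a small step along it
}
# f_hat now approximates f*(X), the minimizer of the empirical risk (6)

In gamboostLSS this loop is carried out for each distribution parameter in turn, with P-spline base-learners in place of the simple linear fits above.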

To observe the functional and distributional effects when constructing a model with time covariates, consider GAMLSS (generalized additive models for location, scale, and shape) without random effects:

g_d(ϕ_d) = β_{0,ϕ_d} + Σ_{j=1}^{p} f_{j,ϕ_d}(x_dj) = η_{ϕ_d},   d = 1, 2, 3, 4.    …(7)

The above model consists of the following terms:
β_{0,ϕ_d}, d = 1, 2, 3, 4, are the intercept terms of the four submodels;
Σ_{j=1}^{p} f_{j,ϕ_d}(x_dj) = X_d β_d is the parametric term; ϕ_d and η_{ϕ_d} = η_d are vectors of length n;
f_{j,ϕ_d} is the type of effect that covariate j has on the distribution parameter ϕ_d;
β_d^T = (β_{1d}, ..., β_{p_d'd}) is a parameter vector of length p_d';
X_d is a known design matrix of order n × p_d'.

For instance, f_{j,ϕ_d}(x_dj) may be a linear or smooth effect, a categorical effect, or another type of effect, depending on the characteristics of the covariates [8, 17]. Each distribution has a fitting function, and through a link function as in equation (4), precision can be achieved in the fitting process [18]. From the GAMLSS equation (7) above, g_d(·) is a monotonic link function that relates the distribution parameter ϕ_d to the predictor η_d:

g_1(μ) = η_μ = β_{0,μ} + Σ_{j=1}^{p} f_{j,μ}(x_j) = X_1 β_1;   g_2(σ) = η_σ = β_{0,σ} + Σ_{j=1}^{p} f_{j,σ}(x_j) = X_2 β_2    …(8a)

g_3(υ) = η_υ = β_{0,υ} + Σ_{j=1}^{p} f_{j,υ}(x_j) = X_3 β_3;   g_4(τ) = η_τ = β_{0,τ} + Σ_{j=1}^{p} f_{j,τ}(x_j) = X_4 β_4    …(8b)


The GAMLSS model in equation (7) is represented by observations (y_i, x_i^T), i = 1, 2, ..., n, where y_i is the response variable and x_i = (x_i1, ..., x_ip)^T is the vector of covariates. The conditional density function f_Y(y_i | ϕ_i) depends on

ϕ_i = (μ, σ, υ, τ)    …(9)

a vector of four distribution parameters, where μ, σ, υ, and τ are the location, scale, skewness, and kurtosis parameters, respectively [8, 17, 19, 20]. In general, each distribution parameter is modelled through its own additive predictor η_{ϕ_d} and depends on additive covariate effects, such as nonlinear, smooth, and interaction effects [17, 18]. The location parameter of a distribution refers to a measure of its center, the scale parameter refers to its variance or dispersion, and the shape parameters refer to skewness and kurtosis. The optimization of the distribution parameters in equation (9) for the gamboostLSS models is

(μ̂, σ̂, υ̂, τ̂) = argmin_{η_μ, η_σ, η_υ, η_τ} E_{Y,X}[ρ(Y_i, η_μ(X), η_σ(X), η_υ(X), η_τ(X))]    …(10)

with loss function ρ = -L, the negative log-likelihood of the response distribution, evaluated on the training data. As in equation (6), a gradient boosting approach is used to minimize the empirical risk,

ER = (1/n) Σ_{i=1}^{n} ρ(Y_i, η_{ϕ_d})    …(11)

The P-spline penalized least squares criterion with autocorrelated errors is

PLS(β) = (u - Bβ)^T V^{-1} (u - Bβ) + λ W(β, m)    …(12)

where V = [v_ij] is the correlation matrix, as suggested in [21].
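As an illustration of criterion (12), the following minimal R sketch computes the penalized least squares estimate for a B-spline basis with a second-order difference penalty (so that W(β, m) = β^T D_m^T D_m β with m = 2) and an AR(1) correlation matrix V with entries v_ij = ρ^|i-j|; the toy series, ρ, λ, and the basis dimension are illustrative assumptions, not the settings used for the SST data.

# Minimal sketch of the penalised least squares criterion (12) with an
# AR(1) correlation matrix; all settings are illustrative assumptions.
library(splines)

set.seed(1)
n    <- 300
tt   <- seq_len(n)
u    <- sin(2 * pi * tt / 100) + 0.2 * as.numeric(arima.sim(list(ar = 0.8), n))

B    <- bs(tt, df = 20)                         # B-spline design matrix (n x 20)
D    <- diff(diag(ncol(B)), differences = 2)    # second-order difference penalty matrix
rho  <- 0.8
V    <- rho^abs(outer(tt, tt, "-"))             # AR(1) correlation matrix, v_ij = rho^|i-j|
Vinv <- solve(V)
lambda <- 10                                    # smoothing parameter

# beta minimising (u - B beta)' V^{-1} (u - B beta) + lambda * beta' D'D beta
beta_hat <- solve(t(B) %*% Vinv %*% B + lambda * crossprod(D),
                  t(B) %*% Vinv %*% u)
fit <- B %*% beta_hat                           # smoothed series

In the boosting context the same penalty enters through the P-spline base-learners, while the AR(1) structure is handled by the generalized differencing described above.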

3. Methodology

To account for autocorrelation in the gamboostLSS model fitting, we use an AR(1) autocorrelation model and apply the resulting gamboostLSS-AR(1) model to the SST data set. In general, the procedure of gamboostLSS-AR(1) model fitting is as follows (a minimal sketch of this workflow, under illustrative assumptions, is given after the list):

a). Determine the parameter ρ by using a generalized least squares technique.
b). Decide the parameters of the continuous and time covariates in the base-learner specification.
c). Determine the distributional assumption for the parameters in the gamboostLSS-AR(1) model.
d). Apply the single autocorrelation coefficient ρ in the gamboostLSS-AR(1) model fitting.
e). Determine a suitable fit for the gamboostLSS-AR(1) model to obtain an appropriate global model fit, which produces submodels. The global model relates to the data response, while the submodels, called local fits, relate to the covariates. By tuning hyper-parameters we can fit the time covariates in the model so as to obtain an appropriate global model fit.
f). Select the appropriate model fit to obtain the optimal global and local model fits by cross-validation of the risk (CVrisk).


g). Specify the MPI-AR(1) of the local model fits, mainly for the time effects. We use step-length factors ν = 0.01, 0.05, and 0.1 and different stopping iterations (mstop) to obtain the MPI-AR(1).
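As mentioned above, a minimal R sketch of steps a) to f), using the gamboostLSS and mboost packages, is given below; the data frame d, its column names (sst, airtemp, humidity, rainfall, nrdays, doy), the Gaussian location and scale family, and the tuning values are hypothetical stand-ins, not the authors' actual code or settings.

# Minimal sketch of the gamboostLSS-AR(1) workflow (steps a-f); variable
# names, the distributional assumption, and tuning values are illustrative.
library(gamboostLSS)   # also loads mboost (bols, bbs, boost_control)

d <- read.csv("sst_buoy_1.5N90E.csv")   # hypothetical file name and column layout

# (a) estimate the AR(1) coefficient rho from preliminary residuals
prelim <- lm(sst ~ airtemp + humidity + rainfall, data = d)
rho    <- as.numeric(arima(residuals(prelim), order = c(1, 0, 0))$coef["ar1"])

# generalized (Cochrane-Orcutt style) differencing with this rho
gdiff <- function(x, rho) x[-1] - rho * x[-length(x)]
dstar <- as.data.frame(lapply(d, gdiff, rho = rho))

# (b)-(d) linear and P-spline base-learners; Gaussian location and scale model
fm <- sst ~ bols(airtemp) + bbs(airtemp, df = 4) +
            bols(humidity) + bbs(humidity, df = 4) +
            bols(rainfall) + bbs(rainfall, df = 4) +
            bbs(nrdays, df = 4) + bbs(doy, df = 4, cyclic = TRUE)

mod <- gamboostLSS(fm, data = dstar, families = GaussianLSS(),
                   control = boost_control(mstop = 1000, nu = 0.01))

# (e)-(f) tune the stopping iteration by cross-validated risk (may be slow)
# cvr <- cvrisk(mod)
# mstop(mod) <- mstop(cvr)

The GaussianLSS() family and the differencing step stand in for the distributional assumption of step c) and the generalized least squares estimate of ρ in step a); in practice these choices follow the diagnostics of the SST data and yield the submodels discussed in Section 4.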

Figure 1. Scatterplot of the Sea Surface Temperature data in the period 2006-2012 from the buoy at position 1.5N90E, available at www.pmel.noaa.gov/tao/.

Figure 1 shows that the data have characteristics such as gaps, irregular peaks, periodicity, and autocorrelation. We summarize the SST data from the buoy in the following table.

Table 1. Univariate description of SST dataset during the period of 2006-2012

Variable Minimum Q1 Median Mean Q3 Maximum

SST 27.90 28.78 29.09 29.13 29.42 30.87

Temperature 23.57 25.90 26.47 26.45 27.00 30.95

Humidity 74.00 86.00 89.00 89.15 92.27 102.23

Rainfall 0.00 0.00 0.90 12.18 11.00 414.00

Table 1 displays the statistical description of the SST climate features from the buoy. The measures of central tendency (mean and median) in the table are close to each other for each variable, except for the rainfall covariate. The ranges (maximum minus minimum) of the variables are as follows: SST (2.97 °C), temperature (7.38 °C), humidity (28.23%), and rainfall (414 mm).

4. Results and Discussion

In this study, we examined the SST data from the Tropical Atmosphere Ocean (TAO) moored ocean buoy positioned at 1.5N90E in the Indian Ocean for the period 2006-2012, with 2066 complete-case daily observations. The three climate parameters from the Meulaboh land station cover the same period.


Preliminary analysis of the SST data (Figure 2a) reveals residual autocorrelation in the autocorrelation function (ACF). The estimated AR(1) coefficient for the buoy is ρ = 0.8835944, which falls in the high autocorrelation category.
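A coefficient of this kind can be obtained, for instance, by inspecting the residual ACF and fitting an AR(1) model to the residuals of a preliminary fit; the object res below is a hypothetical vector of such residuals.

# Illustrative check of residual autocorrelation and the AR(1) coefficient.
res <- residuals(prelim)            # residuals of a preliminary fit (hypothetical)
acf(res, lag.max = 50)              # residual ACF, cf. Figure 2(a)
ar1 <- arima(res, order = c(1, 0, 0))
ar1$coef["ar1"]                     # lag-1 autocorrelation coefficient (rho)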

Figure 2. GamboostLSS-AR(1) model fitting for the SST data from the buoy at position 1.5N90E.

The ACF plot can be used to detect the pattern of autocorrelated errors in the SST data. The changes in the pattern of peaks and magnitudes over various periods are also displayed. Figure 2(b) displays a smooth gamboostLSS-AR(1) model fit that takes time autocorrelation into account. The model produces 12 submodels, as seen in Figure 3.


Figure 3. GamboostLSS-AR(1) model fitting produces 12 submodels.

Figure 3 shows the local model fits of the SST dataset obtained with the gamboostLSS-AR(1) model. It consists of panels presenting the climate features, namely temperature, humidity, and rainfall, and each panel represents a submodel of the gamboostLSS-AR(1) model. It can be seen that the mu parameters of temperature and rainfall show similar linear curves, in accordance with the linear base-learner. In contrast, the mu and sigma parameters of the temperature and rainfall covariates move in opposite directions.

The mu parameter of temperature under the smooth base-learner has a distinctive curve, whereas the mu and sigma parameters of humidity have upward and downward curves, respectively. In this figure we also include Nrdays and Doy as time covariates to capture the annual and seasonal effects, respectively, of the fitted model of the SST data. For the annual effects, the mu parameter shows a decreasing trend before 1000 days and then increases, with a peak at about 1500 days; the annual effects then decrease before the gap and are fairly stable after the gap in the mu parameter.

The trend in the sigma parameter is different: the annual effects decrease at about 500 days, increase slightly before the gap, and increase sharply after the


gap. For the seasonal term, the mu parameter shows a peak at about 150 days (around April), whereas the sigma parameter has a V-shaped curve.

Several studies have examined climate features such as rainfall variability in relation to SST variability. The study in [22] states that there is no clear relationship between rainfall and SST, mainly from October to March, in region B (which includes Sumatra Island), whereas from December to February high precipitation and low SST are observed. Although [22] used monthly observations and different time periods, our study interestingly shows that the relationship between SST and the seasonal effects from April to August is downward, but at a different level (in magnitude) from July to August, with January as the baseline, as captured in Figure 3. The graph shows that there is an increase in the monthly effects from October to April, but at a different level (in magnitude) from October to February.

The SST data experiment shows how the P-spline smoothing property and the gradient boosting in the gamboostLSS-AR(1) model can help to discover the underlying variability structure of the time covariates and other covariates. The result illustrates the trade-off between appropriate global and local fitting and detailed visualization of the location, scale, and shape parameters. Other approaches have addressed the interpretability of covariates related to the spatial nature of climate features, such as local variation via econometric models with kernel techniques [1] and generalized linear models (GLMs) in [23].

In addition, as they inherit the properties of gamboostLSS [8, 9], gamboostLSS-AR(1) models can be applied at low computational cost, to high-dimensional data and large data sets, with built-in variable selection, and they can handle complex data structures flexibly enough to cover the common issues of SST data. In this model, gradient boosting is central to the fitting process: it provides prediction accuracy, handles various risk functions, performs model fitting and variable selection simultaneously, and addresses multicollinearity issues [13, 15, 16].

Therefore, we consider 80% and 95% confidence levels for the MPI of the SST data. It can be seen in the figures that models with different values of ν and mstop have similar MPI-AR(1) patterns. This is interesting because different values of the boosting control parameters do not change the MPI-AR(1) patterns. However, we do not present all plots of the MPI-AR(1) patterns because they are structurally similar to those obtained from the gamboostLSS-AR(1) model fitting.


(a) MPI-AR(1) of seasonal effects   (b) MPI-AR(1) of annual effects

Figure 4. The MPI-AR(1) of the seasonal and annual effects of the gamboostLSS-AR(1) model fitting for the SST data, with ρ = 0.8835944, step-length factor ν = 0.01, and mstop = 110000.

Furthermore, the results are presented for the step-length factor ν = 0.01, as depicted in Figure 4. As can be seen from Figures 4(a) and 4(b), the interval curves appear wider where data are available; in other words, the curves over the missing data (the gap) are closer to each other. The MPI-AR(1) of the seasonal effects at the buoy shows a bimodal curve. Removing the autocorrelation effect has a significant impact on the MPI-AR(1).

The smoothness of the SST data fit depends on the selection of the hyper-parameters of the base-learners of the gamboostLSS-AR(1) model. These parameters include the degrees of freedom df, the number of knots, the stopping iteration mstop, and the autocorrelation coefficient ρ. This selection also determines the number of submodels produced by the model.

5. Conclusion

One of the issues in SST data is the presence of autocorrelation. Therefore, we have proposed the gamboostLSS-AR(1) model to deal with this common issue of SST data. We have applied a generalized differencing technique to reduce the time autocorrelation of the SST data in the fitting process.

Removing autocorrelation with the AR(1) model has a large impact on the global and local model fits. By tuning the hyper-parameters we can achieve appropriate gamboostLSS-AR(1) models, which give flexible and interpretable estimates of the annual and seasonal effects in the climate features. The proposed model can be used to further investigate the effects of the time covariates on the location, scale, and shape parameters.


We also computed the MPI-AR(1) for the gamboostLSS-AR(1) model fit. The choice of hyper-parameters in the model affects the MPI-AR(1). Through the MPI-AR(1) of gamboostLSS-AR(1), the missing data in the gap can be estimated with confidence intervals.

REFERENCES

[1] Magnus, J. R., Melenberg, B., and Muris, C. Global Warming and Local Dimming: The

Statistical Evidence. Journal of the American Statistical Association, vol. 106: 452-

464, Taylor and Francis, 2012.

[2] Xie, S. P., Deser, C., Vecchi, G. A., Ma, J., Teng, H. and Wittenberg, A. T. Global warming pattern formation: Sea Surface Temperature and Rainfall. Journal of American Meteorological Society, vol. 23: 966–986, 2010.

[3] Schott, F. A., Xie, S. P., and McCreary, J. P. Indian Ocean Circulation and Climate

Variability. Reviews of Geophysics, vol. 47: 1-46, American Geophysical Union, 2009.

[4] Dommenget, D. and Jansen, M. Notes and Correspondence: Prediction of Indian Ocean

SST Indices with a Simple Statistical Model: A Null Hypothesis. Journal of the Climate,

vol. 22: 4930-4938, American Meteorological Society, 2009.

[5] North, G.R. and Stevens, M.J. Detecting climate signals in the surface temperature record.

Journal of climate, vol. 11: 563-577, 1998.

[6] Deser, C., Alexander, M. A., Xie, S. P. and Phillips, A. S. Sea Surface Temperature

Variability: Patterns and Mechanisms. The Annual Review of Marine Science, vol. 2:

115-143, 2010.

[7] B. P. Kumar, J. Vialard, M. Lengaigne, V. S. N. Murty, M. J. McPhaden, M. F. Cronin,

and K. G. Reddy. Evaluation of Air-sea heat and momentum fluxes for the tropical oceans

and introduction of TropFlux. CLIVAR, vol 58: 1-9, 2012.

[8] Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. Generalized Additive Models

for Location, Scale and Shape for High Dimensional Data: a Flexible Approach Based

on Boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 2012.

[9] Hofner, B., Mayr, A., and Schmid, M. GamboostLSS: An R Package for Model

Building and Variable Selection in the GAMLSS Framework, CRAN, 2014.

[10] Greene, W. H. Econometric Analysis. Prentice Hall, 2011.

[11] Hsiao, C. Analysis of Panel Data. Cambridge University Press, 2003.

[12] Baltagi, B. H. Econometric Analysis of Panel Data. John Wiley and Sons, 2005.

[13] Schmid, M. and Hothorn, T. Boosting additive models using component-wise P-splines.

Technical Report, no. 002: 1–21, 2007.

[14] Henning, C. and Kutlukaya, M. Some thoughts about the design of loss functions.

REVSTAT-Statistical Journal, vol. 5, no. 1: 19–39, 2007.

[15] Natekin, A. and Knoll, A. Gradient boosting machines, a tutorial. Frontiers in

Neurorobotics, vol. 7, 2013.

[16] Buhlmann, P. and Yu, B. Boosting with the L2 loss: regression and classification. Journal

of the American Statistical Association, vol. 98: 324–339, 2003.

[17] Mayr, A., Fenske, N., Hofner, B., Kneib, T., and Schmid, M. GAMLSS for High-

Dimensional Data-a Flexible Approach Based on Boosting. Ludwig-Maximilians-

Universitat Munchen: 1-29, 2010.


[18] Rigby, R. A. and Stasinopoulos, D. M. A flexible regression approach using GAMLSS in R.

University of Athens, 2010.

[19] Rigby, R. A. and Stasinopoulos, D. M. Generalized Additive Models for Location, Scale

and Shape. The Journal of the Royal Statistical Society: Series C (Applied Statistics),

vol. 54: 507-554, 2005.

[20] Stasinopoulos, D. M. and Rigby, R. A. Generalized Additive Models for Location Scale

and Shape GAMLSS in R. Journal of Statistical Software. American Statistical

Association, vol. 23: 1-46, 2007.

[21] Diggle, P. J. and Hutchinson, M. F. On spline smoothing with autocorrelated errors.

Australian and New Zealand Journal of Statistics, vol. 31, no. 1: 166–182, 1989.

[22] Aldrian, E. and Susanto, R. D. Identification of three dominant rainfall regions within

Indonesia and their relationship to sea surface temperature. International Journal of

Climatology, vol. 23: 1435-1452, 2003.

[23] Chandler, R. E. On the use of generalized linear models for interpreting climate

variability. Environmetrics, vol. 16: 699-715, 2005.


Prediction of Rainfall by State Space Model For Missing Data

Muflihah*, Armin Lawi, Erna Tri Herdiani

Department of Mathematics, Faculty of Mathematics and Natural Sciences, Hasanuddin

University,

Jl. Perintis Kemerdekaan Km.10 Tamalanrea, Makassar, Indonesia

*Corresponding Author: [email protected]

ABSTRACT

A state space model is a model used with the Kalman filter equations, consisting of an observation equation and a transition equation. This paper aims to apply the state space model to the estimation of missing data and to determine the accuracy of the state space estimates of the missing data. The model is applied to time series data of monthly rainfall. From the rainfall data, 19 observations were treated as missing and estimated using the state space model. The estimation results were analysed using a paired samples t-test. The state space model estimation yields a Theil's U statistic of 0.0742, which indicates that the model is valid and feasible to use.

Keywords: state space model, Kalman filter, missing data, rainfall.

1. Introduction

Missing data are undesirable for researchers because they may cause difficulties in the analysis and decision-making process. Cryer (1986) [2] stated that if data are missing in the middle of a series, it is necessary to estimate the missing values. Research in [4] examined the estimation of missing data using several models, one of which is the state space model, applied to wallet theft data from Chicago, and produced optimal estimates of the missing data. A common treatment for missing data is to fill in the missing values with the average of the data series. This method has many deficiencies because it reduces the variability of the data, which may bias the correlations in the data, so the method is not recommended.

Other methods often used to handle missing data are listwise deletion and pairwise deletion; however, if missing values occur in a large part of the data, much of the information in the data is wasted, so these methods are also not appropriate. In 2006, David [4] conducted research on estimating missing data using several models, one of which is the state space model. The study used theft data from Chicago, in which the data are non-stationary and five consecutive observations were removed at random positions, and it produced estimates close to the actual values. The state space model is a relatively new approach in time series analysis. This modelling approach can incorporate multiple time series


models, such as Box-Jenkins ARIMA and structural time series models. The model is both general and flexible: general in the sense that it can be applied to all time series that can be modelled with the Box-Jenkins method, and flexible because it can be applied to univariate and multivariate time series. Within the ARIMA framework, this approach facilitates the handling of missing data because, with one-step-ahead prediction, the estimates can be updated easily.

In addition, Kalman filter research has been carried out by Mirawati et al. (2013), who made predictions for rainfall data and found that the method described the actual rainfall patterns fairly well. Daniel and Leo (2013) [3] conducted research on the estimation of missing data using state space models applied to nonlinear and AR(1) data, with satisfactory results. The Kalman filter is a recursive algorithm that gives an optimal estimate based on a series of information and knowledge about the parameters of the state space model (Mumtaz, 2009). Its main purpose is to estimate the state vector θ_t. There are two steps involved in this process, called the prediction stage and the update stage, and the Kalman filter provides the corresponding prediction equations.

Motivated by these studies, this paper applies the state space model, through the Kalman filter, to handle missing data in rainfall time series.

2. Methodology

The data for this study are secondary data obtained from the Center for Meteorology, Climatology and Geophysics (BMKG) Region IV Makassar, in the form of rainfall data for the Tanete Rilau district of Barru for the period 1980 to 2014. The analysis method used in this study is the estimation of missing data using the state space model, applied to the Tanete Rilau, Barru rainfall data.

The state space model is one of the modelling approaches in time series analysis. The model consists of two equations, an observation equation and a transition equation. The observation equation is

X_t = H^T θ_t + ε_t,   ε_t ~ iid N(0, Z_t)    (1)

and the transition equation is

θ_t = G θ_{t-1} + K η_t,   η_t ~ iid N(0, Q_t)    (2)

An advantage of the state space model is that different types of time series models can be put into the state space formulation.

The estimation process is carried out numerically in RStudio using the function na.StructTS from the zoo package; Microsoft Excel and SPSS for Windows version 20 are also used for graphing and calculating error rates. The work stages of this study are as follows. The first phase is the identification of the data: the data are prepared as a time series in Microsoft Excel, then the


data are converted from Microsoft Excel into a csv file and imported from the csv file into RStudio. The second phase is the estimation of the missing data (the Kalman filter process): the zoo package is loaded in RStudio and na.StructTS is applied to the formatted data; the results are then plotted against the original data, and the estimation is repeated by treating some observed values as missing (emptying some observation data). The third phase is the verification of the model: the estimates of the simulated missing data are compared with the actual observed values to determine the accuracy of the state space model estimates of the missing data. All of the above steps produce the estimated values of the missing data. A minimal sketch of the estimation step is given below.
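The sketch below illustrates the estimation step with na.StructTS from the zoo package; the file name, the column name, and the simulated missing positions are illustrative assumptions.

# Minimal sketch of Kalman-filter imputation with zoo::na.StructTS;
# file name and column layout are illustrative assumptions.
library(zoo)

rain <- read.csv("rainfall_tanete_rilau.csv")      # hypothetical csv exported from Excel
x    <- ts(rain$rainfall, start = c(1980, 1), frequency = 12)

x_na <- x
set.seed(1)
miss <- sample(seq_along(x), 19)                   # treat 19 observations as missing
x_na[miss] <- NA

x_hat <- na.StructTS(x_na)                         # state space model + Kalman smoother fill-in
cbind(actual = x[miss], estimated = x_hat[miss])   # compare, as in Table 1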

3. Results and Discussion

The rainfall data used are monthly rainfall data for Tanete Rilau, Barru; based on the long-term average, the rainfall type of Tanete Rilau is the monsoon type. The data used in the missing data estimation with the state space model comprise 180 observations, with some values treated as missing at random; the number of missing values is 19. The estimates of the simulated missing data obtained with the state space model can be seen in Table 1.

Table 1. Comparison of the actual values and the state space model estimates

NO  TIME       ACTUAL  ESTIMATED
1.  Feb 1982   574     300
2.  Oct 1982   9       69
3.  Jan 1985   458     497
4.  Apr 1990   201     222
5.  Feb 1991   257     350
6.  Mar 1993   254     309
7.  Jul 1993   15      51
8.  Sep 1994   0       -27
9.  Jan 1995   485     461
10. May 1995   99      99
11. Sep 1995   16      48
12. Nov 1999   386     329
13. Aug 2003   7       1
14. Jun 2006   130     78
15. Aug 2008   0       -26
16. Mar 2008   433     385
17. Dec 2008   677     690
18. Apr 2014   209     210
19. Sep 2014   0       -9

Source: BMKG Wilayah IV Makassar (Actual)


From Table 1 we can see that some of the state space model estimates are negative, whereas rainfall values should be non-negative, with a minimum of 0. All the observations with negative estimates actually have the value 0. This indicates that the state space model has difficulty predicting the value 0, so the negative estimates are set to 0, because the lowest estimated value from the state space model is negative while the lowest possible rainfall value is 0.
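For example, this truncation of negative estimates to zero can be applied directly to the estimated values; est below is a hypothetical vector of state space estimates.

est_nonneg <- pmax(est, 0)   # replace negative rainfall estimates with 0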

The comparison of the actual data with the state space model estimates is presented in the following charts.

(a) Column chart   (b) Line chart

Figure 1. Comparison of the rainfall data with the state space model estimates.

To further test the validity of the state space model and determine its appropriateness, Theil's U statistic and its decomposition were computed, with the results shown in Table 2.

Table 2. Theil's U statistics for the state space model estimates

U      UM     US     UC     UM + US + UC
0.074  0.003  0.006  0.997  0.999

Based on Table 2, Theil's U value of 0.074 is close to 0, the value UM of 0.0031 is less than 0.2, and the proportions UM, US, and UC sum to approximately 1 (0.999).
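As a rough illustration of how the quantities in Table 2 can be computed, the sketch below uses one common definition of Theil's U together with the standard bias (UM), variance (US), and covariance (UC) proportions of the mean squared error; actual and estimated are hypothetical vectors such as the two columns of Table 1, and this is not necessarily the exact formula used by the authors.

# Illustrative computation of Theil's U and the UM/US/UC proportions.
theil_u <- function(actual, estimated) {
  mse <- mean((estimated - actual)^2)
  U   <- sqrt(mse) / (sqrt(mean(actual^2)) + sqrt(mean(estimated^2)))
  s_a <- sqrt(mean((actual - mean(actual))^2))        # population standard deviations
  s_e <- sqrt(mean((estimated - mean(estimated))^2))
  r   <- cor(actual, estimated)
  c(U  = U,
    UM = (mean(estimated) - mean(actual))^2 / mse,    # bias proportion
    US = (s_e - s_a)^2 / mse,                         # variance proportion
    UC = 2 * (1 - r) * s_a * s_e / mse)               # covariance proportion
}
# theil_u(actual, estimated)   # UM + US + UC sum to 1 by construction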

4. Conclusion

Estimation of missing data using the state space model is valid and suitable for use according to the Theil's U criterion. For future studies, it is suggested that a distributional assumption with positive support be used in order to avoid negative estimated values.


REFERENCES

[1] Aswi and Sukarna (2006). Analisis Deret Waktu. Makassar: Andira Publisher.

[2] Cryer, J.D (1986). Time Series Analysis. Boston: PWS-KENT Publishing

Company

[3] Daniel, B.K and Leo, O.O (2013). Generalized Estimation of Missing

Observation in Nonlinear Time Series Model using State Space
Representation. American Journal of Theoretical and Applied Statistics.

2(2):21-28.

[4] David, Sheung Chi Fung. (2006). Methods for the Estimation of Missing

Values in Time Series. Australia : Faculty of Communications, Health and

Science - Edith Cowan University.

[5] Janacek, G., & Swift, L. (1992). Time Series Forecasting Simulation &

Application. West Sussex, England : Ellis Horwood Limited.

[6] Mumtaz, H. (2009) State Space models and The Kalman Filter. (Online),

(http://www.pftac.org/filemanager/files/Macro_Training/CCBS_2009/3_kalm

anfilter.pdf, accessed 17 January 2014).

[7] Zivot, Eric. (2006). State Space Models and The Kalman Filter. (Online),

(http://faculty.washington.edu/ezivot/econ584/notes/statespacemodels.pdf,

accessed 26 April 2013).