Rough set based decision tree for identifying vulnerable and food insecure households


Page 1: Rough set based decision tree for identifying vulnerable and food insecure households

Rough Set based Decision Tree for Identifying Vulnerable and Food Insecure Households

Rajni Jain1, S. Minz2 and P. Adhiguru1

1Sr. Scientist, NCAP, Pusa, New Delhi

2Associate Professor, Jawaharlal Nehru University

Page 2: Rough set based decision tree for identifying vulnerable and food insecure households

Outline Problem Knowledge Discovery Process Data Mining Classification Task of Data Mining Methodology: RDT Dataset for this Study Classifier Model Evaluation

Page 3: Rough set based decision tree for identifying vulnerable and food insecure households

Problem of Food Security

• Available funds are most often scarce.
• The food security program needs to be targeted at the most vulnerable group.
• Exhaustive surveys conducted exclusively for this purpose would be very costly and time-consuming.
• Simple concepts need to be learned to help identify target beneficiaries on the basis of morphological characteristics.

Page 4: Rough set based decision tree for identifying vulnerable and food insecure households

Knowledge Discovery in Dataset

Data → Selection → Target Data → Preprocessing → Pre-processed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation → Knowledge

• The selection phase defines the KDD problem by focusing on the subset of data attributes or data samples on which KDD is to be performed.

• Preprocessing includes removing noise and handling missing data; care must be taken not to introduce any unwanted bias.

• Transformations may include combining attributes or discretizing continuous attributes.

• In the data mining step, many different learning and modeling algorithms are potential candidates.

Page 5: Rough set based decision tree for identifying vulnerable and food insecure households

Data Mining Tasks

Classification: Decision Tree, Decision Rule

Summarization: Association rules, Characteristic rules

Page 6: Rough set based decision tree for identifying vulnerable and food insecure households

Classification

Step I: Training Data → Classification Algorithm → Rules/Tree/Formula

Step II: Estimate the predictive accuracy of the model. If it is acceptable, proceed to Step III.

Step III: New Data → Classification Rules → Label the class

Page 7: Rough set based decision tree for identifying vulnerable and food insecure households

Data

• Training Data: the data used for developing the model
• Test Data: the data used to estimate the evaluation parameters of the model
• New Data: the condition attributes are known but the decision attribute is not

Page 8: Rough set based decision tree for identifying vulnerable and food insecure households

Basis of Classification Algorithms

• Rough Sets
• Decision Tree Learning
• Statistics
• Neural Networks
• Genetic Algorithms

No single method is suitable for all types of domains.

Page 9: Rough set based decision tree for identifying vulnerable and food insecure households

Methodology: Machine Learning

• Rough Sets
• Decision Tree induction
• Rough set based Decision Tree induction (RDT), in two phases:
  RS for dominant attribute selection
  J4.8 for decision tree induction

Page 10: Rough set based decision tree for identifying vulnerable and food insecure households

Rough Sets

• Introduced in the early 1980s by Prof. Z. Pawlak, a Polish mathematician
• Indiscernible: similar objects (say patients, households, etc.)
• Indiscernibility Relation

Id  H  M  T   F
1   n  y  h   y
2   y  n  h   y
3   y  y  vh  y
4   n  y  n   n
5   y  n  h   n
6   n  y  vh  y

Page 11: Rough set based decision tree for identifying vulnerable and food insecure households

Indiscernibility Relation - contd..

U/IND(H)={{1,4,6}{2,3,5}}

U/IND(F)={{1,2,3,6},{4,5}}

Flu Patients

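$\underline{P}(X) = \bigcup \{\, Y \in U/IND(P) : Y \subseteq X \,\}$

$\overline{P}(X) = \bigcup \{\, Y \in U/IND(P) : Y \cap X \neq \emptyset \,\}$

$BN_P(X) = \overline{P}(X) - \underline{P}(X)$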

Id  H  M  T   F
1   n  y  h   y
2   y  n  h   y
3   y  y  vh  y
4   n  y  n   n
5   y  n  h   n
6   n  y  vh  y
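
The partitions and approximations on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names (`partition`, `lower_approximation`, `upper_approximation`) are illustrative and do not come from the slides, and the choice of P = {H, M, T} with X taken as the flu patients (F = y) is an assumption made for the example.

```python
# Six-object table from the slide: Id -> attribute values.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h",  "F": "y"},
    2: {"H": "y", "M": "n", "T": "h",  "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n",  "F": "n"},
    5: {"H": "y", "M": "n", "T": "h",  "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}


def partition(table, attrs):
    """Return U/IND(attrs): objects grouped by identical values on attrs."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())


def lower_approximation(table, attrs, x):
    """Union of the indiscernibility classes fully contained in x."""
    return set().union(*(c for c in partition(table, attrs) if c <= x))


def upper_approximation(table, attrs, x):
    """Union of the indiscernibility classes that intersect x."""
    return set().union(*(c for c in partition(table, attrs) if c & x))


# Concept X: the flu patients (assumed to mean F = y).
flu = {obj for obj, row in TABLE.items() if row["F"] == "y"}
P = ["H", "M", "T"]  # assumed attribute set for the example

lower = lower_approximation(TABLE, P, flu)
upper = upper_approximation(TABLE, P, flu)
boundary = upper - lower

print("U/IND(H) =", partition(TABLE, ["H"]))
print("U/IND(F) =", partition(TABLE, ["F"]))
print("lower =", lower, "upper =", upper, "boundary =", boundary)
```

Under these assumptions, objects 2 and 5 share the same values of H, M and T but differ on F, so they fall in the boundary region and X is a rough set rather than a crisp one.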

Page 12: Rough set based decision tree for identifying vulnerable and food insecure households

Lower and Upper Approximation

[Figure: a grid of 30 cells numbered 1 to 30, five per row, with an oval drawn over part of the grid.]

• Let the bigger square represent the domain of the universe.
• The small squares represent the partitions of the universe for a given set of attributes P. All objects in a partition are indiscernible.
• The oval represents the concept X to be defined.

Lower approximation: $\underline{P}(X) = \{13, 14, 18, 19\}$

Upper approximation: $\overline{P}(X) = \{7, 8, 9, 12, 13, 14, 15, 17, 18, 19, 20, 22, 23, 24\}$

[Figure: a second grid of 12 cells numbered 1 to 12, four per row.]

In the second square, $\underline{P}(X) = \{7\}$ and $\overline{P}(X) = \{7\}$, so X is a crisp set.

Page 13: Rough set based decision tree for identifying vulnerable and food insecure households

Important Terms

• Reduct (R): a minimal set of attributes that preserves the IND relation
• Decision-relative reduct
• Core (C): the intersection of all reducts
• Johnson's method for efficient computation of a single reduct
• GA-based algorithm for computation of multiple reducts
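
To make the terms above concrete, the sketch below finds the decision-relative reducts and the core of the small six-object table used on the earlier rough set slides. It is an exhaustive search written for illustration only; it is neither Johnson's method nor the GA-based algorithm. The function names are hypothetical, and the criterion used (a subset is a decision-relative reduct if it is minimal and keeps the same positive region as the full set of condition attributes) is the usual rough-set definition, assumed here because the slide does not spell it out.

```python
from itertools import combinations

# Six-object table from the earlier slides; F is treated as the decision attribute.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h",  "F": "y"},
    2: {"H": "y", "M": "n", "T": "h",  "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n",  "F": "n"},
    5: {"H": "y", "M": "n", "T": "h",  "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}
CONDITIONS = ["H", "M", "T"]
DECISION = "F"


def positive_region(table, attrs, decision):
    """Objects whose indiscernibility class on attrs has one decision value."""
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), []).append(obj)
    pos = set()
    for members in classes.values():
        if len({table[o][decision] for o in members}) == 1:
            pos.update(members)
    return pos


def decision_relative_reducts(table, conditions, decision):
    """Exhaustively list minimal subsets that preserve the positive region."""
    full = positive_region(table, conditions, decision)
    reducts = []
    for size in range(1, len(conditions) + 1):
        for subset in combinations(conditions, size):
            # Skip supersets of an already found reduct: they are not minimal.
            if any(set(r) <= set(subset) for r in reducts):
                continue
            if positive_region(table, subset, decision) == full:
                reducts.append(subset)
    return reducts


reducts = decision_relative_reducts(TABLE, CONDITIONS, DECISION)
core = set(CONDITIONS).intersection(*map(set, reducts))

print("reducts:", reducts)
print("core:", core)
```

The search is exponential in the number of attributes, which is why heuristics such as Johnson's method or genetic algorithms are used on real datasets.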

Page 14: Rough set based decision tree for identifying vulnerable and food insecure households

Architecture of RDT Model

Data → Reduct Computation Algorithm → Reduct → Remove attributes absent in reduct → Reduced Training Data → ID3 Algorithm → DT
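
The flow on this slide can be sketched end to end on a toy table. The code below is a hedged illustration, not the authors' implementation (the deck lists Rosetta, Weka and C++ glue programs as the actual tools): the six-object table from the rough set slides stands in for the training data, the reduct {H, T} is assumed to have been computed beforehand, and `project` and `id3` are hypothetical helper names for a hand-written ID3.

```python
from collections import Counter
from math import log2

# Stand-in training data: the six-object table from the rough set slides.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h",  "F": "y"},
    2: {"H": "y", "M": "n", "T": "h",  "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n",  "F": "n"},
    5: {"H": "y", "M": "n", "T": "h",  "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}


def project(table, keep):
    """Remove every attribute that is not listed in keep."""
    return [{a: row[a] for a in keep} for row in table.values()]


def entropy(rows, target):
    """Shannon entropy of the target attribute over rows."""
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def id3(rows, attrs, target):
    """Build a nested-dict decision tree by greedy information gain."""
    labels = [r[target] for r in rows]
    # Leaf: pure node, or no attributes left (fall back to the majority label).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder

    best = max(attrs, key=gain)
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in rows})}}


REDUCT = ["H", "T"]                        # assumed output of the reduct step
reduced = project(TABLE, REDUCT + ["F"])   # reduced training data
tree = id3(reduced, REDUCT, "F")           # decision tree induction
print(tree)
```

Because attribute M is dropped before induction, the tree can only test H and T, which is the point of the reduct step: fewer attributes and, typically, a smaller tree.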

Page 15: Rough set based decision tree for identifying vulnerable and food insecure households

Decision Tree

[Figure: example decision tree with test nodes CHLD, HAGE and LAND; branch labels include y/n and young, middle, old, very old; leaves are labelled 0 or 1.]

Page 16: Rough set based decision tree for identifying vulnerable and food insecure households

Dataset Source

• Primary survey data on 180 rural households from three villages, collected as part of a project by Dr. P. Adhiguru at the National Centre for Agricultural Economics and Policy Research (NCAP), India
• Three different production systems from Dharmapuri district of Tamil Nadu state
• Actual food intake was measured by the 24-hour recall method; the corresponding nutrient intake was worked out later

Page 17: Rough set based decision tree for identifying vulnerable and food insecure households

Attributes

• Attributes are the variables in the dataset that are used to describe the objects.
• Any attribute is either qualitative or quantitative.
• In a classification problem, two types of attributes are considered:
  Condition attributes: independent variables
  Class or decision attributes: dependent variable

Page 18: Rough set based decision tree for identifying vulnerable and food insecure households

Food Groups: Cereals and Millets, Pulses, Green leafy vegetables, Fruits, Milk, Fats and oils, Roots and Tubers, Sugar

Nutrients: Protein, Energy, Calcium, Iron, Vitamin A, Vitamin C

Energy is used as a proxy for measuring the food insecurity of the household.

Page 19: Rough set based decision tree for identifying vulnerable and food insecure households

Morphological Attributes

HouseHold_Id
1. Land: whether the household has its own land
2. Hedu: highest education of the head of the household
3. Hage: age of the head of the household
4. Chld: whether there are children in the family
5. Flsz: number of members in the family
6. PrWm: proportion of women to family size
7. Hstd: whether the household owns a homestead garden
8. Pear: proportion of earning to family size
PCENER: energy per capita per day in kcal
9. Decision: derived from PCENER

Page 20: Rough set based decision tree for identifying vulnerable and food insecure households

Average Calorie Intake

• In Tamil Nadu, average intake per consumer unit per day = 2347 kcal
• In Tamil Nadu, calorie intake of the lowest decile per consumption unit per day = 1551 kcal
• For All India, calorie intake of the lowest decile per consumption unit per day = 1954 kcal
• To identify the poorest of the poor, the lowest-decile average figure was used.
• If Energy < 1500, the decision attribute is labeled 0, meaning poorest of the poor or vulnerable to food insecurity; otherwise it is labeled 1, meaning not vulnerable to food insecurity.
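
The labelling rule (Energy < 1500 gives 0, otherwise 1) is simple enough to state as code. The following is a minimal sketch: the function name `decision_label` and the sample household values are hypothetical, and only the 1500 kcal cut-off and the 0/1 meaning come from the slide.

```python
THRESHOLD_KCAL = 1500  # cut-off stated on the slide


def decision_label(pcener):
    """Return 0 (vulnerable to food insecurity) or 1 (not vulnerable).

    pcener is the energy intake per capita per day in kcal.
    """
    return 0 if pcener < THRESHOLD_KCAL else 1


# Hypothetical households, for illustration only (not survey data).
sample_pcener = {"hh_a": 1320.0, "hh_b": 1500.0, "hh_c": 2347.0}
labels = {hh: decision_label(kcal) for hh, kcal in sample_pcener.items()}
print(labels)
```

Note that a household at exactly 1500 kcal is labelled 1, because the rule on the slide uses a strict inequality.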

Page 21: Rough set based decision tree for identifying vulnerable and food insecure households

Revisiting Problem

• Available funds are most often scarce.
• The food security program needs to be targeted at the most vulnerable group.
• Exhaustive surveys conducted exclusively for this purpose would be very costly and time-consuming.
• Simple concepts need to be learned to help identify target beneficiaries on the basis of morphological characteristics.

Page 22: Rough set based decision tree for identifying vulnerable and food insecure households

Concepts to be Learned from Rural Household Dataset

• Decision Tree: a hierarchical structure with a root node and subtrees as children
• Rules: a tree may be mapped to rules by traversing the paths from the root to the leaves
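
Mapping a tree to rules by walking each root-to-leaf path can be sketched as follows. The nested-dict tree representation, the function names and the toy tree are all assumptions made for illustration; the toy tree borrows attribute names from the deck but is not the tree learned in the study.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path.

    A tree is either a leaf label or a dict {attribute: {value: subtree}}.
    """
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(rule):
    """Render a rule as an IF ... THEN ... string."""
    conditions, label = rule
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {tests} THEN decision = {label}"


# Hypothetical toy tree, not the tree reported in the slides.
TOY_TREE = {"CHLD": {"y": {"LAND": {"y": 1, "n": 0}}, "n": 1}}

rules = tree_to_rules(TOY_TREE)
for rule in rules:
    print(format_rule(rule))
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves in the tree.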

Page 23: Rough set based decision tree for identifying vulnerable and food insecure households

Software

• Rosetta for rough set analysis
• Weka for decision tree induction
• C++ programs for interfacing between the two software packages
• Excel for evaluation of the classifiers

Page 24: Rough set based decision tree for identifying vulnerable and food insecure households

Description of Learning Algorithms

Algorithm Description

RS Rough set with full discernibility decision relative reduct

CJU Continuous data, J4.8, unpruned DT

CJP Continuous data, J4.8 algorithm, pruned DT

DID3 RS based discretization, no reduct, ID3

RDT RS based discretization, global reduct, ID3

DJU Discretized using RS, J4.8, unpruned

DJP Discretized using RS, J4.8, pruned

RJU Discretized, global reducts, J4.8, unpruned DT

RJP Discretized, global reducts, J4.8, pruned DT

DRJU Discretized, dynamic reduct, J4.8, unpruned DT

DRJP Discretized using RS, dynamic reduct, J4.8, pruned DT

Page 25: Rough set based decision tree for identifying vulnerable and food insecure households

DT and corresponding rules

Page 26: Rough set based decision tree for identifying vulnerable and food insecure households

Evaluation

• Experiment using 10-fold cross-validation
• Accuracy on test data (A)
• Complexity (S)
• Number of rules (Nr)
• Number of attributes (Na)
• Cumulative Score (CS)

$CS = \frac{1}{4}\left(A + \frac{1}{S} + \frac{1}{N_r} + \frac{1}{N_a}\right)$
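
As a worked example, the cumulative score can be computed as one quarter of the sum of accuracy and the reciprocals of complexity, number of rules and number of attributes. The sketch below assumes that accuracy enters as a fraction (73% becomes 0.73); the slides do not state this, but it is the reading that reproduces the CS values reported in the comparison table. The function name is hypothetical.

```python
def cumulative_score(accuracy_pct, complexity, n_rules, n_attrs):
    """CS = (A + 1/S + 1/Nr + 1/Na) / 4, with A given in percent.

    Assumption: accuracy is converted to a fraction before averaging.
    """
    a = accuracy_pct / 100.0
    return 0.25 * (a + 1.0 / complexity + 1.0 / n_rules + 1.0 / n_attrs)


# Rows taken from the deck's comparison table: (A, S, Nr, Na).
reported = {
    "RS":   (51, 1003, 149, 6.7),
    "CJP":  (73, 40, 10, 7),
    "DRJP": (73, 43, 9, 4.0),
}

for name, row in reported.items():
    print(name, round(cumulative_score(*row), 2))
```

Higher accuracy raises the score, while larger trees, more rules and more attributes lower it, so CS rewards classifiers that are both accurate and simple.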

Page 27: Rough set based decision tree for identifying vulnerable and food insecure households

Evaluation of Simplified DT

Accuracy = 73%
Complexity = 43
Number of rules = 9
Number of attributes = 4

0: poorest and vulnerable to food insecurity
1: not vulnerable to food insecurity

Page 28: Rough set based decision tree for identifying vulnerable and food insecure households

Comparing Algorithms using CS

Id    A   S     Nr   Na   CS
RS    51  1003  149  6.7  0.17
CJU   69  173   26   8    0.21
CJP   73  40    10   7    0.25
DID3  60  262   79   7.3  0.19
RDT   59  269   82   6.8  0.19
DJU   67  188   56   7.1  0.21
DJP   73  43    16   4.2  0.26
RJU   68  177   55   6.4  0.21
RJP   72  43    17   4.0  0.27
DRJU  67  186   56   6.6  0.21
DRJP  73  43    9    4.0  0.28

Page 29: Rough set based decision tree for identifying vulnerable and food insecure households

Nutrition Dataset

[Figure: four bar-chart panels (Accuracy in %, Complexity, Rules, Attributes) comparing the algorithms RS, CJU, CJP, DID3 (shown as DID), RDT, DJU, DJP, RJU, RJP, DRJU and DRJP.]

Page 30: Rough set based decision tree for identifying vulnerable and food insecure households

DT(DRJP) - Nutrition Data

[Figure: decision tree with test nodes CHLD, HAGE, FLSIZE and PEAR; branch labels include y/n and numeric intervals; leaves are labelled 0 or 1.]

Accuracy = 73%, Complexity = 43, Attributes = 4, Rules = 9

Page 31: Rough set based decision tree for identifying vulnerable and food insecure households

Benefits

• Cost-effective
• Timely
• Simple to understand and implement
• No scope for personal bias

Page 32: Rough set based decision tree for identifying vulnerable and food insecure households

Constraints

• Development or model building requires expertise
• Lack of synergy among disciplines
• Requires an adequate sample of data
• Region-specific
• Mindset favouring conventional and traditional techniques


Page 34: Rough set based decision tree for identifying vulnerable and food insecure households

?

Page 35: Rough set based decision tree for identifying vulnerable and food insecure households

Thank You