Rough set based decision tree for identifying vulnerable and food insecure households


Rajni Jain (1), S. Minz (2) and P. Adhiguru (1)

(1) Sr. Scientist, NCAP, Pusa, New Delhi

(2) Associate Professor, Jawaharlal Nehru University

Outline
• Problem
• Knowledge Discovery Process
• Data Mining
• Classification Task of Data Mining
• Methodology: RDT
• Dataset for this Study
• Classifier Model
• Evaluation

Problem of Food Security
• Most often, available funds are scarce.
• The food security program needs to be targeted at the most vulnerable group.
• Exhaustive surveys exclusively for this purpose would be very costly and time consuming.
• Simple concepts need to be learned to help identify target beneficiaries on the basis of morphological characteristics.

Knowledge Discovery in Dataset

[Figure: KDD process. Data → (Selection) → Target Data → (Preprocessing) → Pre-processed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge]

• The selection phase defines the KDD problem by focusing on the subset of data attributes or data samples on which KDD is to be performed.
• During preprocessing, which includes removing noise and handling missing data, care must be taken not to introduce any unwanted bias.
• Transformations may combine attributes or discretize continuous attributes.
• In the data mining step, many different learning and modeling algorithms are potential candidates.
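The transformation step mentioned above (discretizing a continuous attribute) can be sketched as follows. This is only an illustration: the cut points are hypothetical and are not the ones produced by the rough set based discretization used in the study; the interval labels are borrowed from the age categories shown later in the deck.

```python
from bisect import bisect_right

def discretize(value, cuts, labels):
    """Map a continuous value to the label of the interval it falls in.

    cuts must be sorted in increasing order; a value equal to a cut point
    is placed in the interval that starts at that cut point.
    """
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cuts, value)]

# Hypothetical cut points for the age of the household head (not from the study).
AGE_CUTS = [35, 50, 65]
AGE_LABELS = ["young", "middle", "old", "very old"]

ages = [28, 35, 49, 50, 71]
categories = [discretize(age, AGE_CUTS, AGE_LABELS) for age in ages]
print(categories)
```

Keeping the cut points in one sorted list makes it easy to swap in cuts computed by another discretization method without touching the mapping function.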

Data Mining Tasks
• Classification: Decision Tree, Decision Rule
• Summarization: Association rules, Characteristic rules

Classification
• Step I: Training Data → Classification Algorithm → Rules/Tree/Formula
• Step II: Estimate the predictive accuracy of the model. If acceptable, go to Step III.
• Step III: New Data → Classification Rules → Label the class

Data
• Training Data: the data used for developing the model
• Test Data: the data used to estimate the evaluation parameter of the model
• New Data: condition attributes are known but the decision attribute is not known
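The three steps and the three kinds of data can be sketched with a deliberately simple learner. The learner below (one rule per value of a single attribute) is a stand-in chosen for brevity, not the algorithm used in the study, and all records are invented toy data; only the attribute name Chld comes from the deck.

```python
from collections import Counter, defaultdict

def train(rows, attribute, decision):
    """Step I: learn 'attribute value -> majority class' rules from training data."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[attribute]][row[decision]] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    default = Counter(row[decision] for row in rows).most_common(1)[0][0]
    return rules, default

def classify(row, attribute, rules, default):
    """Label one record; unseen attribute values fall back to the default class."""
    return rules.get(row[attribute], default)

def accuracy(rows, attribute, decision, rules, default):
    """Step II: share of test records whose predicted class matches the known one."""
    hits = sum(classify(r, attribute, rules, default) == r[decision] for r in rows)
    return hits / len(rows)

# Invented toy records (not from the survey).
training_data = [
    {"Chld": "y", "Decision": 0}, {"Chld": "y", "Decision": 0},
    {"Chld": "y", "Decision": 1}, {"Chld": "n", "Decision": 1},
    {"Chld": "n", "Decision": 1},
]
test_data = [
    {"Chld": "y", "Decision": 0}, {"Chld": "n", "Decision": 1},
    {"Chld": "n", "Decision": 0}, {"Chld": "y", "Decision": 0},
]
new_data = [{"Chld": "n"}]  # condition attribute known, decision unknown

rules, default = train(training_data, "Chld", "Decision")
test_accuracy = accuracy(test_data, "Chld", "Decision", rules, default)
# Step III: label the new data once the accuracy is judged acceptable.
new_labels = [classify(r, "Chld", rules, default) for r in new_data]
print(rules, test_accuracy, new_labels)
```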

Basis of Classification Algorithms
• Rough Sets
• Decision Tree Learning
• Statistics
• Neural Networks
• Genetic Algorithms

No single method is suitable for all types of domain.

Methodology: Machine Learning
• Rough Sets
• Decision Tree induction
• Rough set based Decision Tree induction (RDT), in two phases:
  - RS for dominant attributes selection
  - J4.8 for decision tree induction

Rough Sets
• 1980, Prof. Z. Pawlak, a Polish mathematician
• Indiscernible: similar objects (say patients, households, etc.)

Indiscernibility Relation

Id  H  M  T   F
1   n  y  h   y
2   y  n  h   y
3   y  y  vh  y
4   n  y  n   n
5   y  n  h   n
6   n  y  vh  y

Indiscernibility Relation (contd.)

U/IND(H) = {{1,4,6}, {2,3,5}}

U/IND(F) = {{1,2,3,6}, {4,5}}
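The two partitions above can be reproduced by grouping objects that share the same values on the chosen attributes. The sketch below uses the six-object table from the slide; the function name `partition` is an illustrative choice.

```python
from collections import defaultdict

# Decision table from the slide: Id -> values of attributes H, M, T, F.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h", "F": "y"},
    2: {"H": "y", "M": "n", "T": "h", "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n", "F": "n"},
    5: {"H": "y", "M": "n", "T": "h", "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}

def partition(table, attributes):
    """Return U/IND(attributes) as a sorted list of sorted equivalence classes."""
    classes = defaultdict(list)
    for obj_id, row in table.items():
        # Objects with identical values on all chosen attributes are indiscernible.
        classes[tuple(row[a] for a in attributes)].append(obj_id)
    return sorted(sorted(members) for members in classes.values())

print(partition(TABLE, ["H"]))            # [[1, 4, 6], [2, 3, 5]]
print(partition(TABLE, ["F"]))            # [[1, 2, 3, 6], [4, 5]]
print(partition(TABLE, ["H", "M", "T"]))  # [[1], [2, 5], [3], [4], [6]]
```

With all three condition attributes, objects 2 and 5 remain indiscernible even though their F values differ, which is what makes the concept rough.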

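Flu Patients

Lower approximation: $\underline{P}X = \bigcup \{\, Y \in U/IND(P) : Y \subseteq X \,\}$

Upper approximation: $\overline{P}X = \bigcup \{\, Y \in U/IND(P) : Y \cap X \neq \emptyset \,\}$

Boundary region: $BN_P(X) = \overline{P}X - \underline{P}X$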
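As a worked illustration of the lower approximation, upper approximation and boundary region, the sketch below takes X to be the objects with F = y (the flu patients) and P = {H, M, T}. The choice of P is an assumption made for the example; the table is the one shown on the slide.

```python
# Decision table from the slide: Id -> values of attributes H, M, T, F.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h", "F": "y"},
    2: {"H": "y", "M": "n", "T": "h", "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n", "F": "n"},
    5: {"H": "y", "M": "n", "T": "h", "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}

def approximations(table, attributes, concept):
    """Return (lower, upper) approximations of `concept` with respect to `attributes`."""
    blocks = {}
    for obj_id, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attributes), set()).add(obj_id)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper

flu = {obj_id for obj_id, row in TABLE.items() if row["F"] == "y"}
lower, upper = approximations(TABLE, ["H", "M", "T"], flu)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Objects 2 and 5 agree on H, M and T but differ on F, so they end up in the boundary region rather than in the lower approximation.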


Lower and Upper Approximation

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

21 22 23 24 25

26 27 28 29 30

Let the Bigger Square represent the domain of the universe

Small Squares represent the partitions of the universe for a given set of attributes P. All objects in a partition are indiscernible.

Oval represents the concept X to be defined

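Lower approximation: $\underline{P}X$ = {13,14,18,19}

Upper approximation: $\overline{P}X$ = {7,8,9,12,13,14,15,17,18,19,20,22,23,24}

In the second square, $\underline{P}X$ = {7} and $\overline{P}X$ = {7}; the two approximations coincide, so X is a crisp set.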

1 2 3 4

5 6 7 8

9 10 11 12

Important Terms
• Reduct (R): a minimal set of attributes that preserves the IND relation
• Decision relative reduct
• Core (C): intersection of all reducts
• Johnson's method for efficient computation of a single reduct
• GA based algorithm for computation of multiple reducts
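For a table as small as the six-object example shown earlier, decision relative reducts and the core can be found by exhaustive search. The sketch below is a brute-force illustration only; it is neither Johnson's method nor the GA based algorithm named above, and it assumes that a decision relative reduct is a minimal attribute subset that preserves the positive region of the decision attribute F.

```python
from itertools import combinations

# Decision table from the earlier slide: Id -> values of attributes H, M, T, F.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h", "F": "y"},
    2: {"H": "y", "M": "n", "T": "h", "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n", "F": "n"},
    5: {"H": "y", "M": "n", "T": "h", "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}

def positive_region(table, attributes, decision):
    """Objects whose indiscernibility class carries a single decision value."""
    blocks = {}
    for obj_id, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attributes), []).append(obj_id)
    positive = set()
    for members in blocks.values():
        if len({table[i][decision] for i in members}) == 1:
            positive.update(members)
    return positive

def decision_relative_reducts(table, conditions, decision):
    """Exhaustively list minimal subsets that keep the full positive region."""
    target = positive_region(table, conditions, decision)
    reducts = []
    for size in range(1, len(conditions) + 1):
        for subset in combinations(conditions, size):
            if any(set(r) <= set(subset) for r in reducts):
                continue  # contains a smaller reduct, so it is not minimal
            if positive_region(table, subset, decision) == target:
                reducts.append(subset)
    return reducts

reducts = decision_relative_reducts(TABLE, ["H", "M", "T"], "F")
core = set.intersection(*(set(r) for r in reducts))
print(reducts, core)
```

The search is exponential in the number of condition attributes, which is why heuristic methods are preferred on real datasets.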

Architecture of RDT Model

[Figure: Data → Reduct Computation Algorithm → Reduct → Remove attributes absent in reduct → Reduced Training Data → ID3 Algorithm → DT]
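The flow in the architecture (project the training data onto a reduct, then induce a tree with ID3) can be sketched as below. This is a minimal textbook-style ID3 written for illustration, not the implementation used by the authors; the data is the six-object table from the rough set slides, and the reduct (H, T) is supplied by hand as an assumption rather than computed here.

```python
import math
from collections import Counter

# Six-object table from the rough set slides, as a list of records.
ROWS = [
    {"H": "n", "M": "y", "T": "h", "F": "y"},
    {"H": "y", "M": "n", "T": "h", "F": "y"},
    {"H": "y", "M": "y", "T": "vh", "F": "y"},
    {"H": "n", "M": "y", "T": "n", "F": "n"},
    {"H": "y", "M": "n", "T": "h", "F": "n"},
    {"H": "n", "M": "y", "T": "vh", "F": "y"},
]

def entropy(rows, decision):
    """Shannon entropy of the decision attribute over the given rows."""
    counts = Counter(r[decision] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attribute, decision):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset, decision)
    return entropy(rows, decision) - remainder

def id3(rows, attributes, decision):
    """Return a nested dict {attribute: {value: subtree}} or a class label (leaf)."""
    classes = Counter(r[decision] for r in rows)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]  # pure node, or majority class
    best = max(attributes, key=lambda a: information_gain(rows, a, decision))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest, decision)
                   for value in sorted({r[best] for r in rows})}}

def rdt(rows, reduct, decision):
    """Remove attributes absent from the reduct, then induce the tree with ID3."""
    reduced = [{a: r[a] for a in list(reduct) + [decision]} for r in rows]
    return id3(reduced, list(reduct), decision)

tree = rdt(ROWS, ("H", "T"), "F")
print(tree)
```

Because attribute M is dropped before induction, it can never appear in the tree, which is the source of the smaller trees the approach aims for.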

Decision Tree

[Figure: decision tree with internal nodes CHLD, HAGE and LAND; branches labelled y/n and young, middle, old, very old; leaves labelled 0 or 1]

Dataset Source
• Primary survey data of 180 rural households from three villages, collected as part of a project by Dr. P. Adhiguru at the National Centre for Agricultural Economics and Policy Research (NCAP), India
• Three different production systems from Dharmapuri district of Tamil Nadu state
• Actual food intake was measured by the 24-hour recall method; the corresponding nutrient intake was worked out later

Attributes
• Attributes are the variables in the dataset that are used to describe the objects
• Every attribute is either qualitative or quantitative
• In a classification problem two types of attributes are considered:
  - Condition attributes: independent variables
  - Class or decision attribute: dependent variable

Food Groups
• Cereals and Millets
• Pulses
• Green leafy vegetables
• Fruits
• Milk
• Fats and oils
• Roots and Tubers
• Sugar

Nutrients
• Protein
• Energy
• Calcium
• Iron
• Vitamin A
• Vitamin C

Energy is used as a proxy for measuring food insecurity of the household

Morphological Attributes

HouseHold_Id
1. Land: whether the household has its own land
2. Hedu: highest education of the head
3. Hage: age of the head of the household
4. Chld: whether there are children in the family
5. Flsz: number of members in the family
6. PrWm: proportion of women to family size
7. Hstd: whether the household owns a homestead garden
8. Pear: proportion of earning to family size
PCENER: energy per capita per day in kcal
9. Decision: derived from PCENER

Average Calorie Intake
• In Tamil Nadu, average intake per consumer unit per day = 2347 kcal
• In Tamil Nadu, calorie intake of the lowest decile per consumption unit per day = 1551 kcal
• For All India, calorie intake of the lowest decile per consumption unit per day = 1954 kcal
• To identify the poorest of the poor, the lowest decile average figure was used
• If Energy < 1500, the decision attribute is labeled 0 (poorest of the poor, or vulnerable to food insecurity); otherwise it is labeled 1 (not vulnerable to food insecurity)
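The labeling rule for the decision attribute can be written directly as a function of per capita energy intake. The household identifiers and intake values below are hypothetical; only the 1500 kcal threshold and the 0/1 coding come from the slide.

```python
ENERGY_THRESHOLD_KCAL = 1500  # threshold stated on the slide

def label_household(pcener):
    """Return 0 (vulnerable to food insecurity) or 1 (not vulnerable)."""
    return 0 if pcener < ENERGY_THRESHOLD_KCAL else 1

# Hypothetical per capita energy intakes in kcal/day (not survey data).
households = {"H001": 1320.0, "H002": 1500.0, "H003": 2347.0}
labels = {hid: label_household(energy) for hid, energy in households.items()}
print(labels)
```

A household exactly at the threshold is labeled 1, because the rule uses a strict inequality.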

Revisiting Problem
• Most often, available funds are scarce.
• The food security program needs to be targeted at the most vulnerable group.
• Exhaustive surveys exclusively for this purpose would be very costly and time consuming.
• Simple concepts need to be learned to help identify target beneficiaries on the basis of morphological characteristics.

Concepts to be Learned from Rural Household Dataset
• Decision Tree: a hierarchical structure with a root node and subtrees as children
• Rules: a tree may be mapped to rules by traversing the paths from the root to the leaves
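Mapping a tree to rules amounts to collecting the tests along every root-to-leaf path. The sketch below assumes the tree is stored as nested dictionaries of the form {attribute: {value: subtree}}; that representation and the example tree are hypothetical and do not reproduce the tree learned in the study.

```python
def tree_to_rules(tree, path=()):
    """Return a list of (conditions, decision) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(path, tree)]  # reached a leaf: the path so far is the rule body
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules

# Hypothetical tree over two of the household attributes (illustration only).
TREE = {"Chld": {"n": 1, "y": {"Land": {"y": 1, "n": 0}}}}

rules = tree_to_rules(TREE)
for conditions, decision in rules:
    body = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {body} THEN Decision = {decision}")
```

The number of rules equals the number of leaves, which is why the deck reports the rule count alongside tree complexity.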

Software
• Rosetta for rough set analysis
• Weka for decision tree induction
• C++ programs for interfacing between the two packages
• Excel for evaluation of the classifiers

Description of Learning Algorithms

Algorithm  Description
RS         Rough set with full discernibility decision relative reduct
CJU        Continuous data, J4.8, unpruned DT
CJP        Continuous data, J4.8, pruned DT
DID3       RS based discretization, no reduct, ID3
RDT        RS based discretization, global reduct, ID3
DJU        Discretized using RS, J4.8, unpruned DT
DJP        Discretized using RS, J4.8, pruned DT
RJU        Discretized, global reducts, J4.8, unpruned DT
RJP        Discretized, global reducts, J4.8, pruned DT
DRJU       Discretized, dynamic reduct, J4.8, unpruned DT
DRJP       Discretized using RS, dynamic reduct, J4.8, pruned DT

DT and corresponding rules

Evaluation
• Experiment using 10-fold cross validation
• Accuracy on test data (A)
• Complexity (S)
• Number of rules (Nr)
• Number of attributes (Na)
• Cumulative score (CS)

$CS = \frac{1}{4}\left(A + \frac{1}{S} + \frac{1}{Nr} + \frac{1}{Na}\right)$
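The 10-fold cross validation protocol can be sketched as an index splitter: the records are shuffled once, cut into ten folds, and each fold serves as test data exactly once. Shuffling with a fixed seed and the absence of stratification are assumptions made for this sketch; the deck does not say how the folds were drawn. The figure of 180 records matches the number of surveyed households.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # assumed: one random shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(180, k=10))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train_idx)} training, {len(test_idx)} test records")
```

The reported accuracy would then be the average of the per-fold test accuracies.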

Evaluation of Simplified DT
• Accuracy = 73%
• Complexity = 43
• Number of rules = 9
• Number of attributes = 4
• 0: poorest and vulnerable to food insecurity
• 1: not vulnerable to food insecurity

Comparing Algorithms using CS

Id    A   S     Nr   Na   CS
RS    51  1003  149  6.7  0.17
CJU   69  173   26   8    0.21
CJP   73  40    10   7    0.25
DID3  60  262   79   7.3  0.19
RDT   59  269   82   6.8  0.19
DJU   67  188   56   7.1  0.21
DJP   73  43    16   4.2  0.26
RJU   68  177   55   6.4  0.21
RJP   72  43    17   4.0  0.27
DRJU  67  186   56   6.6  0.21
DRJP  73  43    9    4.0  0.28

A = accuracy (%), S = complexity, Nr = number of rules, Na = number of attributes, CS = cumulative score
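The cumulative score can be recomputed from the other four columns. The sketch below assumes CS = (A + 1/S + 1/Nr + 1/Na) / 4 with A expressed as a fraction (0.73 for 73%); that reading is an assumption, chosen because it reproduces the reported CS values to within 0.01. The small remaining differences are consistent with the inputs being rounded averages.

```python
def cumulative_score(accuracy_pct, complexity, n_rules, n_attributes):
    """CS = (A + 1/S + 1/Nr + 1/Na) / 4, with A converted from percent to a fraction."""
    a = accuracy_pct / 100.0
    return 0.25 * (a + 1.0 / complexity + 1.0 / n_rules + 1.0 / n_attributes)

# Rows of the comparison table: (A, S, Nr, Na, reported CS).
RESULTS = {
    "RS":   (51, 1003, 149, 6.7, 0.17),
    "CJU":  (69, 173, 26, 8, 0.21),
    "CJP":  (73, 40, 10, 7, 0.25),
    "DID3": (60, 262, 79, 7.3, 0.19),
    "RDT":  (59, 269, 82, 6.8, 0.19),
    "DJU":  (67, 188, 56, 7.1, 0.21),
    "DJP":  (73, 43, 16, 4.2, 0.26),
    "RJU":  (68, 177, 55, 6.4, 0.21),
    "RJP":  (72, 43, 17, 4.0, 0.27),
    "DRJU": (67, 186, 56, 6.6, 0.21),
    "DRJP": (73, 43, 9, 4.0, 0.28),
}

computed = {name: cumulative_score(a, s, nr, na)
            for name, (a, s, nr, na, _) in RESULTS.items()}
for name, value in computed.items():
    print(f"{name:5s} computed={value:.3f} reported={RESULTS[name][4]:.2f}")

best = max(computed, key=computed.get)
print("highest CS:", best)
```

Because accuracy enters directly while the other three terms enter as reciprocals, the score rewards models that are accurate, small, and built on few rules and attributes.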

[Figure: Nutrition Dataset. Four chart panels (Accuracy in %, Complexity, Rules, Attributes) comparing the algorithms RS, CJU, CJP, DID3, RDT, DJU, DJP, RJU, RJP, DRJU and DRJP]

DT(DRJP) - Nutrition Data

[Figure: decision tree with internal nodes CHLD, HAGE, FLSIZE and PEAR, numeric interval labels on the branches, and leaves labelled 0 or 1]

Accuracy = 73%, Complexity = 43, Attributes = 4, Rules = 9

Benefits
• Cost effective
• Timely
• Simple to understand and implement
• No scope for personal bias

Constraints
• Development or model building requires expertise
• Lack of synergy among disciplines
• Adequate sample of data
• Region specific
• Mindset towards conventional and traditional techniques

References

1. Adhiguru, P. and C. Ramasamy 2003. Agricultural-based Interventions for Sustainable Nutritional Security. Policy Paper 17. NCAP, New Delhi, India.
2. Han, J. and M. Kamber 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann.
3. Hand, D., Mannila, H. and P. Smyth 2001. Principles of Data Mining. PHI.
4. Minz, S. and R. Jain 2003. Rough Set based Decision Tree Model for classification. In Proc. of 5th Intl. Conference, DaWaK 03, LNCS 2737.
5. Minz, S. and R. Jain 2005. Refining decision tree classifiers using rough set tools. International Journal of Hybrid Intelligent Systems, 2(2):133-147.
6. Pawlak, Z. 2001. Drawing Conclusions from Data: The Rough Set Way. IJIS 16: 3-11.
7. Polkowski, L. and A. Skowron 2001. Rough Sets in Knowledge Discovery 1 and 2, Heidelberg, Germany: Physica-Verlag.
8. Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
9. Rosetta, Rough set toolkit at http://www.idi.ntnu.no/~aleks/rosetta/.
10. Witten, I. H. and E. Frank 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
11. Wroblewski, J. 1998. Genetic algorithms in decomposition and classification problems. In: Polkowski, L. and Skowron, A., Rough Sets in Knowledge Discovery 1 and 2, Heidelberg, Germany: Physica-Verlag 472-492.
12. Ziarko, W. 1993. Variable precision rough set model, Journal of Computer and System Sciences 46: 39-59.

Thank You
