Rough set based decision tree for identifying vulnerable and food insecure households
Rajni Jain1, S. Minz2 and P. Adhiguru1
1Sr. Scientist, NCAP, Pusa, New Delhi
2Associate Professor, Jawaharlal Nehru University
Outline
• Problem
• Knowledge Discovery Process
• Data Mining
• Classification Task of Data Mining
• Methodology: RDT
• Dataset for this Study
• Classifier Model
• Evaluation
Problem of Food Security
• Available funds are most often scarce.
• The food security program needs to be targeted at the most vulnerable group.
• Exhaustive surveys exclusively for this purpose would be very costly and time consuming.
• Simple concepts need to be learned so that target beneficiaries can be identified on the basis of morphological characteristics.
Knowledge Discovery in Dataset
Data → (Selection) → Target Data → (Preprocessing) → Pre-processed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge
• The selection phase defines the KDD problem by focusing on the subset of data attributes or data samples on which KDD is to be performed.
• Preprocessing includes removing noise and handling missing data; care must be taken not to induce any unwanted bias.
• Transformations may combine attributes or discretize continuous attributes.
• In the data mining step, many different learning and modeling algorithms are potential candidates.
Data Mining Tasks
• Classification: Decision Tree, Decision Rule
• Summarization: Association rules, Characteristic rules
Classification
Step I: Training Data → Classification Algorithm → Rules/Tree/Formula
Step II: Estimate the predictive accuracy of the model. If it is acceptable, go to Step III.
Step III: New Data → Classification Rules → Label the class
Data
• Training Data: the data used for developing the model
• Test Data: the data used to estimate the evaluation parameter of the model
• New Data: condition attributes are known but the decision attribute is not known
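The three steps and the three kinds of data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the attribute name `chld`, and the helper names (`train_rule_per_value`, `classify`, `accuracy`) are hypothetical, and the one-attribute majority rule stands in for a real classification algorithm.

```python
from collections import Counter

# Hypothetical toy records: (condition attributes, decision attribute).
labelled = [
    ({"chld": "y"}, 0), ({"chld": "y"}, 0), ({"chld": "n"}, 1),
    ({"chld": "n"}, 1), ({"chld": "y"}, 0), ({"chld": "n"}, 1),
    ({"chld": "y"}, 1), ({"chld": "n"}, 1),
]

def train_rule_per_value(records, attribute):
    """Step I: learn one rule per attribute value (majority decision)."""
    votes = {}
    for conditions, decision in records:
        votes.setdefault(conditions[attribute], Counter())[decision] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def classify(rules, conditions, attribute, default=1):
    """Step III: label an object whose decision attribute is unknown."""
    return rules.get(conditions[attribute], default)

def accuracy(rules, records, attribute):
    """Step II: fraction of records whose decision is predicted correctly."""
    hits = sum(classify(rules, c, attribute) == d for c, d in records)
    return hits / len(records)

training_data, test_data = labelled[:6], labelled[6:]
rules = train_rule_per_value(training_data, "chld")   # Step I
test_accuracy = accuracy(rules, test_data, "chld")    # Step II
new_label = classify(rules, {"chld": "y"}, "chld")    # Step III
print(rules, test_accuracy, new_label)
```

The test data is kept apart from the training data so that the accuracy estimate in Step II is not measured on the records the rules were learned from.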
Basis of Classification Algorithms
• Rough Sets
• Decision Tree Learning
• Statistics
• Neural Networks
• Genetic Algorithms
No single method is suitable for all types of domain.
Methodology: Machine Learning
• Rough Sets
• Decision Tree induction
• Rough set based Decision Tree induction (RDT), in two phases: RS for dominant attribute selection, then J4.8 for decision tree induction
Rough Sets
• 1980, Prof. Z. Pawlak, a Polish mathematician
• Indiscernible: similar objects (say patients, households, etc.)
Indiscernibility Relation
Id H M T F
1 n y h y
2 y n h y
3 y y vh y
4 n y n n
5 y n h n
6 n y vh y
Indiscernibility Relation - contd..
U/IND(H)={{1,4,6},{2,3,5}}
U/IND(F)={{1,2,3,6},{4,5}}
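The two partitions can be checked with a short Python sketch. The table is the one from the slide; the names `TABLE` and `partition` are illustrative choices and not part of any rough set toolkit.

```python
from collections import defaultdict

# Decision table from the slide: Id -> values of attributes H, M, T, F.
TABLE = {
    1: {"H": "n", "M": "y", "T": "h", "F": "y"},
    2: {"H": "y", "M": "n", "T": "h", "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n", "F": "n"},
    5: {"H": "y", "M": "n", "T": "h", "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}

def partition(table, attributes):
    """Return U/IND(attributes): objects grouped by equal attribute values."""
    classes = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj)
    # Order the classes by their smallest member for a stable result.
    return sorted(classes.values(), key=min)

print([sorted(c) for c in partition(TABLE, ["H"])])  # [[1, 4, 6], [2, 3, 5]]
print([sorted(c) for c in partition(TABLE, ["F"])])  # [[1, 2, 3, 6], [4, 5]]
```

Passing several attributes, for example `["H", "M", "T"]`, gives the finer partition in which only objects 2 and 5 remain indiscernible.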
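Flu Patients
$\underline{P}(X) = \bigcup \{ Y \in U/IND(P) : Y \subseteq X \}$
$\overline{P}(X) = \bigcup \{ Y \in U/IND(P) : Y \cap X \neq \emptyset \}$
$BN_P(X) = \overline{P}(X) - \underline{P}(X)$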
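As a worked instance of the lower approximation, upper approximation and boundary region, the sketch below takes X to be the flu patients (F = y) and P = {H, M, T} in the slide's table. The function names are illustrative assumptions.

```python
# Decision table from the slide: Id -> values of attributes H, M, T, F.
ROWS = {
    1: {"H": "n", "M": "y", "T": "h", "F": "y"},
    2: {"H": "y", "M": "n", "T": "h", "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n", "F": "n"},
    5: {"H": "y", "M": "n", "T": "h", "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}

def partition(attributes):
    """Equivalence classes of U/IND(attributes)."""
    classes = {}
    for obj, row in ROWS.items():
        classes.setdefault(tuple(row[a] for a in attributes), set()).add(obj)
    return list(classes.values())

def lower_approximation(attributes, concept):
    """Union of the classes that lie entirely inside the concept."""
    return set().union(*(c for c in partition(attributes) if c <= concept))

def upper_approximation(attributes, concept):
    """Union of the classes that overlap the concept."""
    return set().union(*(c for c in partition(attributes) if c & concept))

P = ("H", "M", "T")
flu = {obj for obj, row in ROWS.items() if row["F"] == "y"}
lower = lower_approximation(P, flu)
upper = upper_approximation(P, flu)
boundary = upper - lower

print(sorted(lower))     # [1, 3, 6]
print(sorted(upper))     # [1, 2, 3, 5, 6]
print(sorted(boundary))  # [2, 5]
```

Objects 2 and 5 have identical condition values but different flu status, so they fall in the boundary region and the concept is rough rather than crisp.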
Lower and Upper Approximation
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
Let the bigger square represent the universe.
The small squares represent the partitions of the universe for a given set of attributes P. All objects in a partition are indiscernible.
The oval represents the concept X to be defined.
$\underline{P}(X) = \{13,14,18,19\}$
$\overline{P}(X) = \{7,8,9,12,13,14,15,17,18,19,20,22,23,24\}$
In the other square, $\underline{P}(X) = \overline{P}(X) = \{7\}$, so X is a crisp set.
1 2 3 4
5 6 7 8
9 10 11 12
Important Terms
• Reduct (R): a minimal set of attributes that preserves the IND relation
• Decision relative reduct
• Core (C): intersection of all reducts
• Johnson's method for efficient computation of a single reduct
• GA based algorithm for computation of multiple reducts
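To make these terms concrete, the sketch below finds the decision relative reducts and the core of the small flu table by exhaustive search. This is an assumption made for illustration: brute force is only feasible for a handful of attributes, and it is neither Johnson's method nor the GA based algorithm. A subset of condition attributes is accepted when it keeps the same positive region (the objects classified without ambiguity) as the full attribute set and contains no smaller accepted subset.

```python
from itertools import combinations

# Flu table from the slides: H, M, T are condition attributes, F the decision.
ROWS = {
    1: {"H": "n", "M": "y", "T": "h", "F": "y"},
    2: {"H": "y", "M": "n", "T": "h", "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n", "F": "n"},
    5: {"H": "y", "M": "n", "T": "h", "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}
CONDITIONS = ("H", "M", "T")
DECISION = "F"

def positive_region(attributes):
    """Objects whose indiscernibility class carries a single decision value."""
    groups = {}
    for obj, row in ROWS.items():
        groups.setdefault(tuple(row[a] for a in attributes), []).append(obj)
    positive = set()
    for members in groups.values():
        if len({ROWS[o][DECISION] for o in members}) == 1:
            positive.update(members)
    return positive

def decision_relative_reducts():
    """Exhaustive search, smallest subsets first, so results are minimal."""
    target = positive_region(CONDITIONS)
    found = []
    for size in range(1, len(CONDITIONS) + 1):
        for subset in combinations(CONDITIONS, size):
            is_superset = any(set(r) <= set(subset) for r in found)
            if not is_superset and positive_region(subset) == target:
                found.append(subset)
    return found

all_reducts = decision_relative_reducts()
core = set.intersection(*(set(r) for r in all_reducts))
print(all_reducts)  # [('H', 'T'), ('M', 'T')]
print(core)         # {'T'}
```

T appears in every reduct, so it cannot be dropped without losing classification ability, whereas H and M can substitute for each other.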
Architecture of RDT Model
Data → Reduct Computation Algorithm → Reduct → Remove attributes absent in reduct → Reduced Training Data → ID3 Algorithm → DT
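A minimal sketch of this pipeline, assuming the reduct has already been computed: the attributes absent from the reduct are dropped and a small ID3-style learner is run on the reduced data. The flu table from the earlier slides is reused as stand-in training data, `REDUCT = ("H", "T")` is taken as given, and the `id3` function is a simplified illustration, not the Rosetta or Weka implementation.

```python
import math
from collections import Counter

# Stand-in training data: the flu table from the earlier slides.
ROWS = [
    {"H": "n", "M": "y", "T": "h", "F": "y"},
    {"H": "y", "M": "n", "T": "h", "F": "y"},
    {"H": "y", "M": "y", "T": "vh", "F": "y"},
    {"H": "n", "M": "y", "T": "n", "F": "n"},
    {"H": "y", "M": "n", "T": "h", "F": "n"},
    {"H": "n", "M": "y", "T": "vh", "F": "y"},
]
REDUCT = ("H", "T")  # assumed output of the reduct computation step
DECISION = "F"

def entropy(rows):
    counts = Counter(r[DECISION] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attribute):
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attributes):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = Counter(r[DECISION] for r in rows)
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]  # majority label; ties are arbitrary
    best = max(attributes, key=lambda a: information_gain(rows, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest)
                   for value in sorted({r[best] for r in rows})}}

# Remove attributes absent in the reduct, then induce the tree.
reduced = [{a: r[a] for a in REDUCT + (DECISION,)} for r in ROWS]
tree = id3(reduced, list(REDUCT))
print(tree)
```

Because attribute selection happens before induction, the tree can only test attributes that survive in the reduct, which is what keeps the resulting tree small.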
Decision Tree
[Figure: decision tree with test nodes CHLD, HAGE and LAND; CHLD branches y/n; HAGE branches young, middle, old and very old; leaves labelled 0 or 1]
Dataset Source
• Primary survey data of 180 rural households from three villages, collected as part of the project by Dr. P. Adhiguru at the National Centre for Agricultural Economics and Policy Research (NCAP), India
• Three different production systems from Dharmapuri district of Tamil Nadu state
• Actual food intake was measured by the 24-hour recall method; the corresponding nutrient intake was worked out later
Attributes
• Attributes are the variables in the dataset that are used to describe the objects
• Any attribute is either qualitative or quantitative
• In a classification problem two types of attributes are considered: condition attributes (independent variables) and the class or decision attribute (dependent variable)
Food Groups: Cereals and Millets, Pulses, Green leafy vegetables, Fruits, Milk, Fats and oils, Roots and Tubers, Sugar
Nutrients: Protein, Energy, Calcium, Iron, Vitamin A, Vitamin C
Energy is used as a proxy for measuring food insecurity of the household
Morphological Attributes
HouseHold_Id
1. Land: whether the household has its own land
2. Hedu: highest education of the head
3. Hage: age of the head of the household
4. Chld: whether there are children in the family
5. Flsz: number of members in the family
6. PrWm: proportion of women to family size
7. Hstd: whether the household owns a homestead garden
8. Pear: proportion of earning to family size
PCENER: energy per capita per day in kcal
9. Decision: derived from PCENER
Average Calorie Intake
• In Tamil Nadu, average intake per consumer unit per day = 2347 kcal
• In Tamil Nadu, calorie intake of the lowest decile per consumption unit per day = 1551 kcal
• For all India, calorie intake of the lowest decile per consumption unit per day = 1954 kcal
• To identify the poorest of the poor, the lowest decile average figure was used
• If Energy < 1500, the decision attribute is labelled 0 (poorest of the poor, or vulnerable to food insecurity); otherwise it is labelled 1 (not vulnerable to food insecurity)
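The labelling rule can be written as a one-line function. The 1500 kcal cut-off comes from the slide; the household identifiers and PCENER values in the example are hypothetical.

```python
ENERGY_THRESHOLD_KCAL = 1500  # cut-off stated on the slide

def decision_label(pcener):
    """Return 0 (vulnerable to food insecurity) below the cut-off, else 1."""
    return 0 if pcener < ENERGY_THRESHOLD_KCAL else 1

# Hypothetical per-capita energy values, for illustration only.
sample = {"hh_a": 1420.0, "hh_b": 1500.0, "hh_c": 2347.0}
labels = {household: decision_label(value) for household, value in sample.items()}
print(labels)  # {'hh_a': 0, 'hh_b': 1, 'hh_c': 1}
```

A household exactly at 1500 kcal is labelled 1, because the rule uses a strict inequality.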
Revisiting Problem
• Available funds are most often scarce.
• The food security program needs to be targeted at the most vulnerable group.
• Exhaustive surveys exclusively for this purpose would be very costly and time consuming.
• Simple concepts need to be learned so that target beneficiaries can be identified on the basis of morphological characteristics.
Concepts to be Learned from Rural Household Dataset
• Decision Tree: a hierarchical structure with a root node and sub-trees as children
• Rules: a tree may be mapped to rules by traversing the paths from the root to the leaves
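The mapping from a tree to rules can be sketched as a recursive traversal that collects one rule per root-to-leaf path. The nested-dict tree format and the toy tree below are hypothetical; the toy tree borrows attribute names from the dataset but is not a tree learned in the study.

```python
def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, decision) pairs, one per leaf."""
    if not isinstance(tree, dict):          # a leaf holds the decision label
        return [(conditions, tree)]
    rules = []
    (attribute, branches), = tree.items()   # each node tests one attribute
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

# Hypothetical tree, for illustration only.
toy_tree = {"CHLD": {"n": 1, "y": {"LAND": {"y": 1, "n": 0}}}}
rules = tree_to_rules(toy_tree)

for conditions, decision in rules:
    antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {antecedent} THEN decision = {decision}")
# IF CHLD = n THEN decision = 1
# IF CHLD = y AND LAND = y THEN decision = 1
# IF CHLD = y AND LAND = n THEN decision = 0
```

The number of rules equals the number of leaves, and the longest rule has as many conditions as the tree is deep.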
Software
• Rosetta for rough set analysis
• Weka for decision tree induction
• C++ programs for interfacing between the two software packages
• Excel for evaluation of the classifiers
Description of Learning Algorithms
Algorithm  Description
RS         Rough set with full discernibility decision relative reduct
CJU        Continuous data, J4.8, unpruned DT
CJP        Continuous data, J4.8 algorithm, pruned DT
DID3       RS based discretization, no reduct, ID3
RDT        RS based discretization, global reduct, ID3
DJU        Discretized using RS, J4.8, unpruned
DJP        Discretized using RS, J4.8, pruned
RJU        Discretized, global reducts, J4.8, unpruned DT
RJP        Discretized, global reducts, J4.8, pruned DT
DRJU       Discretized, dynamic reduct, J4.8, unpruned DT
DRJP       Discretized using RS, dynamic reduct, J4.8, pruned DT
DT and corresponding rules
Evaluation
• Experiment using 10-fold cross validation
• Accuracy on test data (A)
• Complexity (S)
• Number of rules (Nr)
• Number of attributes (Na)
• Cumulative Score (CS)
$CS = \frac{1}{4}\left(A + \frac{1}{S} + \frac{1}{N_r} + \frac{1}{N_a}\right)$
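A short sketch of the cumulative score follows. It assumes that accuracy enters the formula as a fraction between 0 and 1 rather than as a percentage, an assumption that reproduces the score of 0.28 reported for DRJP (A = 73%, S = 43, Nr = 9, Na = 4.0). The function name is illustrative.

```python
def cumulative_score(accuracy_pct, complexity, n_rules, n_attributes):
    """CS = (A + 1/S + 1/Nr + 1/Na) / 4, with A taken as a fraction (assumed)."""
    a = accuracy_pct / 100.0
    return 0.25 * (a + 1.0 / complexity + 1.0 / n_rules + 1.0 / n_attributes)

cs_drjp = cumulative_score(73, 43, 9, 4.0)
print(round(cs_drjp, 2))  # 0.28
```

The reciprocal terms reward smaller trees, fewer rules and fewer attributes, so between two models with equal accuracy the simpler one scores higher.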
Evaluation of Simplified DT
• Accuracy = 73%
• Complexity = 43
• Number of rules = 9
• Number of attributes = 4
0: poorest and vulnerable to food insecurity
1: not vulnerable to food insecurity
Id    A   S     Nr   Na   CS
RS    51  1003  149  6.7  0.17
CJU   69  173   26   8    0.21
CJP   73  40    10   7    0.25
DID3  60  262   79   7.3  0.19
RDT   59  269   82   6.8  0.19
DJU   67  188   56   7.1  0.21
DJP   73  43    16   4.2  0.26
RJU   68  177   55   6.4  0.21
RJP   72  43    17   4.0  0.27
DRJU  67  186   56   6.6  0.21
DRJP  73  43    9    4.0  0.28
Comparing Algorithms using CS
[Figure: Nutrition Dataset. Four panels comparing the algorithms RS, CJU, CJP, DID, RDT, DJU, DJP, RJU, RJP, DRJU and DRJP on Accuracy (%), Complexity, Rules and Attributes]
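DT(DRJP) - Nutrition Data
[Figure: decision tree with test nodes CHLD, HAGE, FLSIZE and PEAR; CHLD branches y/n; numeric branches labelled with intervals such as [41,51) and [45,54); leaves labelled 0 or 1]
Accuracy = 73%, Complexity = 43, Attributes = 4, Rules = 9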
Benefits
• Cost effective
• Timely
• Simple to understand and implement
• No scope for personal bias
Constraints
• Development or model building requires expertise
• Lack of synergy among disciplines
• Adequate sample of data
• Region specific
• Mindset towards conventional and traditional techniques
References
1. Adhiguru, P. and C. Ramasamy 2003. Agricultural-based Interventions for Sustainable Nutritional Security. Policy Paper 17. NCAP, New Delhi, India.
2. Han, J. and M. Kamber 2001. Data Mining: Concepts and Techniques. MK.
3. Hand, D., Mannila, H. and P. Smyth 2001. Principles of Data Mining. PHI.
4. Minz, S. and R. Jain 2003. Rough Set based Decision Tree Model for classification. In Proc. of 5th Intl. Conference, DaWaK 03, LNCS 2737.
5. Minz, S. and R. Jain 2005. Refining decision tree classifiers using rough set tools. International Journal of Hybrid Intelligent Systems, 2(2):133-147.
6. Pawlak, Z. 2001. Drawing Conclusions from Data-The Rough Set Way. IJIS 16: 3-11.
7. Polkowski, L. and A. Skowron 2001. Rough Sets in Knowledge Discovery 1 and 2, Heidelberg, Germany: Physica-Verlag.
8. Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
9. Rosetta, Rough set toolkit at http://www.idi.ntnu.no/~aleks/rosetta/.
10. Witten, I. H. and E. Frank 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. MK.
11. Wroblewski, J. 1998. Genetic algorithms in decomposition and classification problems. In: Polkowski, L. and Skowron, A., Rough Sets in Knowledge Discovery 1 and 2, Heidelberg, Germany: Physica-Verlag, 472-492.
12. Ziarko, W. 1993. Variable precision rough set model. Journal of Computer and System Sciences 46: 39-59.
?
Thank You