DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES

B. K. Sinha, Ex-Faculty, ISI, Kolkata. April 17, 2012


Page 1: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN HEALTH SCIENCES

B. K. Sinha

Ex-Faculty, ISI, Kolkata

April 17, 2012

Page 2: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

What is Data Integration?

• Integration of Multiple Indicators
• Existence of several different indicators
• Desired: an AGGREGATE or Over-all Measure, obtained by an objective and statistically sound approach
• Multiple Criteria Decision Making [MCDM]
• Advocated by Hwang & Yoon (1981): Multiple Attribute Decision Making: Methods and Applications: A State-of-the-Art Survey. Springer-Verlag, Berlin

Page 3: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

HEALTH ISSUES

• Air / Surface / Water Pollution : Different Sources & Their Effects
• US EPA : TRI Data Base
• Toxics Release Inventory [TRI] Data
• EPA’s 33/50 Program
• TRI Data for 17 Chemicals over the years 1987-1994, for 50 States & DC

Page 4: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

TOXICS RELEASE INVENTORY [TRI] : US EPA

TOXIC CHEMICALS : BENZENE, CADMIUM, CARBON TETRACHLORIDE, CHLOROFORM, CYANIDE, LEAD, MERCURY, NICKEL, TOLUENE, M-XYLENE, ...

• TRI Data expressed as %
• Less the Better, More the Worse

Page 5: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

NATURE OF DATA & PROBLEM

States VS Chemicals : TRI Data [Coded]

State   Benzene
I        7%
II      12%
III     17%
IV       9%
V       14%
VI       6%
VII     15%
VIII    16%

Q. Which State is the Least Hit by Benzene ? Ans. VI

Q. And the Worst Hit ? Ans. III

Single Chemical : NO PROBLEM AT ALL TO RANK THE STATES FROM BEST TO WORST

Page 6: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

ADD ONE MORE CHEMICAL...

States VS Chemicals : TRI Data

State   Benzene   Cadmium
I        7%       13%
II      12%        9%
III     17%        4%
IV       9%       11%
V       14%       10%
VI       6%       11%
VII     15%        9%
VIII    16%       11%

Q. Combine the Two Chemicals : Which State is Worst ? How to Combine ?

Page 7: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

AND ADD MORE...

• States \ Chemicals [TRI Data]

State   Be    Cd    Ca    Tr    Ch    Cy    …
I        7%   13%   21%    2%   34%   21%   …

• CONCEPT OF DATA MATRIX
• $X = ((X_{ij})),\ 1 \le i \le K;\ 1 \le j \le N$
• K Locations & N Data Sources
• DATA INTEGRATION FOR OVER-ALL TRI INDEX FOR GLOBAL COMPARISON

Page 8: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Data Matrix

States VS Chemicals : TRI Data

State   Be    Cd    Ca    Tr    Ch    Cy    L
I        7%   13%   21%    2%   34%   21%   17%
II      12%    9%   18%    3%   42%   28%   11%
III     17%    4%   23%    7%   22%   19%   23%
IV       9%   11%   17%    5%   25%   23%   19%
V       14%   10%   13%    8%   21%   19%   25%
VI       6%   11%   19%    5%   33%   21%   22%
VII     15%    9%   13%    4%   38%   19%   28%
VIII    16%   11%   10%    5%   33%   20%   25%

Page 9: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Application Areas

• Disease Prevalence Statistics
• Disease Symptom Statistics
• Health Statistics
• Demographic Statistics
• Human Development Index Statistics

Data Integration : Common Problem

Techniques are quite general

Page 10: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Nature of Data

• Locations versus Features : Quantitative data providing impacts of features on the locations based on similarity principle

• Purpose : Overall Ranking of Locations based on Combined Evidence from a Pool of Features

• Features may or may not have equal importance in the process of ‘combining evidence’

Page 11: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Aggregate Methods...

• Some kind of “aggregate” : pooling of TRI Data to a single value for each State for over-all comparison
• TRI Data
• Total TRI for I = 115 [over 7 features]
• Average TRI for I = 16.43%
• Compute Average for Each State & Compare the averages across all states
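
The totals and averages can be checked with a few lines of Python. This is only a minimal sketch: the dictionary name `TRI` and the ascending sort (less is better) are illustrative choices, while the numbers are the eight coded states from the data matrix.

```python
# TRI data matrix (percent) from the example: 8 states x 7 chemicals,
# in the column order Be, Cd, Ca, Tr, Ch, Cy, L.
TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Pool each state's seven values into a total and an average.
totals = {state: sum(row) for state, row in TRI.items()}
averages = {state: totals[state] / len(row) for state, row in TRI.items()}

# Less is better, so the best-placed state has the smallest average.
ranking = sorted(averages, key=averages.get)

for state in ranking:
    print(f"{state:>4}: total = {totals[state]:3d}, average = {averages[state]:.2f}%")
```

States with equal averages keep their original order here; how ties should be broken is not specified on the slide.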

Page 12: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Aggregate Methods...

• AM, GM, HM
• Use of Median as Representative of TRI
• TRI Data
• I : Median = 17%
• II : Median = 12%
• III : Median = 19%, etc.
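
A short sketch of these alternative aggregates for the first three states, using Python's standard `statistics` module. The helper name `summarize` is hypothetical; AM, GM and HM are read here as the arithmetic, geometric and harmonic means.

```python
import statistics

# TRI percentages for States I-III, copied from the data matrix.
TRI = {
    "I":   [7, 13, 21, 2, 34, 21, 17],
    "II":  [12, 9, 18, 3, 42, 28, 11],
    "III": [17, 4, 23, 7, 22, 19, 23],
}


def summarize(values):
    """Return the AM, GM, HM and median of one state's TRI values."""
    return {
        "AM": statistics.mean(values),
        "GM": statistics.geometric_mean(values),
        "HM": statistics.harmonic_mean(values),
        "Median": statistics.median(values),
    }


summaries = {state: summarize(row) for state, row in TRI.items()}

for state, s in summaries.items():
    print(state, {name: round(value, 2) for name, value in s.items()})
```

For positive data the three means always satisfy HM ≤ GM ≤ AM, so the choice of mean can change the ordering of states when their profiles differ in spread.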

Q. ARE ALL CHEMICALS EQUALLY HARMFUL ? Ans. Possibly NOT !

Q. Are all Features Equally Important ?

Page 13: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Concept of Weight...

• Subject Specialist’s Knowledge
• Choice of Weights : Rel. Importance
• Wts. [TRI] : 2.0 3.5 1.0 4.5 5.0 7.0 2.0
• Interpretation of weights
• Total of Weights = 25.0
• Rel. Wts : 2.0/25 = 8%, 3.5/25 = 14%, etc. for all chemicals
• Total of Rel. Wts. = 1 OR 100%
• Use Rel. Wts. to compute Weighted AM, GM, HM etc.
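
As an illustration, the relative weights and a weighted AM can be computed as follows. This is a sketch under the assumption that the seven weights apply to the columns in the order Be, Cd, Ca, Tr, Ch, Cy, L; the function name `weighted_am` is hypothetical.

```python
# Raw importance weights from the slide, one per chemical.
RAW_WEIGHTS = [2.0, 3.5, 1.0, 4.5, 5.0, 7.0, 2.0]

TRI = {
    "I":    [7, 13, 21, 2, 34, 21, 17],
    "II":   [12, 9, 18, 3, 42, 28, 11],
    "III":  [17, 4, 23, 7, 22, 19, 23],
    "IV":   [9, 11, 17, 5, 25, 23, 19],
    "V":    [14, 10, 13, 8, 21, 19, 25],
    "VI":   [6, 11, 19, 5, 33, 21, 22],
    "VII":  [15, 9, 13, 4, 38, 19, 28],
    "VIII": [16, 11, 10, 5, 33, 20, 25],
}

# Relative weights: each raw weight divided by the total, so they sum to 1.
total_weight = sum(RAW_WEIGHTS)
rel_weights = [w / total_weight for w in RAW_WEIGHTS]


def weighted_am(values, weights):
    """Weighted arithmetic mean; the weights are assumed to sum to 1."""
    return sum(w * x for w, x in zip(weights, values))


weighted = {state: weighted_am(row, rel_weights) for state, row in TRI.items()}

for state, value in weighted.items():
    print(f"{state:>4}: weighted AM = {value:.2f}%")
```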

Page 14: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Use of Ranks...

Convert Scores into Ranks for Each Item

TRI Data Matrix : Convert into Rank Matrix [Benzene, Cadmium, etc.]

Benzene :

State   TRI Score   Rank
I        7%         2
II      12%         4
III     17%         8
IV       9%         3
V       14%         5
VI       6%         1
VII     15%         6
VIII    16%         7

Page 15: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Rank Matrix...

States VS Chemicals : TRI Data, ranks

State   Be   Cd   Ca   Tr   Ch   Cy   L
I       2
II      4
III     8
IV      3    etc etc etc
V       5    for each chemical
VI      1
VII     6
VIII    7

Then use “aggregate” methods based on ranks
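
A sketch of the conversion to a rank matrix in plain Python. The slides do not say how tied scores are ranked (Cadmium, for example, has three states at 11%); the average-rank rule used in the hypothetical helper `midranks` below is an assumption, as is the use of rank sums as the aggregate.

```python
STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
X = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def midranks(values):
    """Rank ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


# Rank each chemical (column) separately, then transpose back to states x chemicals.
column_ranks = [midranks(list(col)) for col in zip(*X)]
rank_matrix = [list(row) for row in zip(*column_ranks)]

# One possible aggregate: the sum of ranks per state (smaller is better).
rank_sums = {state: sum(row) for state, row in zip(STATES, rank_matrix)}

for state in sorted(rank_sums, key=rank_sums.get):
    print(f"{state:>4}: rank sum = {rank_sums[state]}")
```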

Page 16: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Why Ranks...

• Aggregate methods applied to Raw Scores are sensitive to Outliers : too high or too low values
• One remedy for Extreme Values : Use of Trimmed Mean
• Ranking is recommended for Robust Results
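
The sensitivity to outliers can be seen in a small sketch. The helper `trimmed_mean` and its parameter `k` (number of values dropped at each end) are illustrative, and the value 90 is a hypothetical outlier that does not appear in the TRI data.

```python
def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and the k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for the number of values")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


state_I = [7, 13, 21, 2, 34, 21, 17]
# Hypothetical: the largest score (34) is replaced by an extreme value.
with_outlier = [7, 13, 21, 2, 90, 21, 17]

plain_before = sum(state_I) / len(state_I)
plain_after = sum(with_outlier) / len(with_outlier)

print("plain mean  :", round(plain_before, 2), "->", round(plain_after, 2))
print("trimmed mean:", trimmed_mean(state_I), "->", trimmed_mean(with_outlier))
```

The plain mean moves with the extreme value, while the trimmed mean is unchanged because the extreme value is among those dropped.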

Page 17: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Less Known Methods...

• TOPSIS METHOD
• ELECTRE METHOD [computation-intensive]
• Concepts : TOPSIS Method
• Features, Locations, Ideal Location, Anti-Ideal Location
• Distance from Ideal, Distance from Anti-Ideal
• Within Feature Variation
• Composite Index

Page 18: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

TOPSIS METHOD

Technique for Order Preference by Similarity to Ideal Solution

Uses Concepts of

• Ideal & Anti-Ideal Locations
• Distance from Ideal & Anti-Ideal Locations
• Weight of Features
• Sum of Squares for each feature

Page 19: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Philosophy for TOPSIS

• TOPSIS (Technique for Order Preference by Similarity to Ideal Solution)

• In the absence of a natural over-all summary measure for ranking, the next best course of action is to assign the top rank to the location that has the shortest distance from the ideal and the farthest distance from the anti-ideal.

Page 20: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Concepts : Ideal & Anti-Ideal States

States VS Chemicals : TRI Data

State        Be    Cd    Ca    Tr    Ch    Cy    L
I             7%   13%   21%    2%   34%   21%   17%
II           12%    9%   18%    3%   42%   28%   11%
III          17%    4%   23%    7%   22%   19%   23%
IV            9%   11%   17%    5%   25%   23%   19%
V            14%   10%   13%    8%   21%   19%   25%
VI            6%   11%   19%    5%   33%   21%   22%
VII          15%    9%   13%    4%   38%   19%   28%
VIII         16%   11%   10%    5%   33%   20%   25%

Ideal         6%    4%   10%    2%   21%   19%   11%
Anti-Ideal   17%   13%   23%    8%   42%   28%   28%
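
Since less is better for TRI data, the Ideal row is the column-wise minimum and the Anti-Ideal row the column-wise maximum. A minimal sketch (the variable names are illustrative):

```python
# Rows: States I-VIII; columns: Be, Cd, Ca, Tr, Ch, Cy, L (percent).
X = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

columns = list(zip(*X))
ideal = [min(col) for col in columns]       # best (smallest) value per chemical
anti_ideal = [max(col) for col in columns]  # worst (largest) value per chemical

print("Ideal     :", ideal)
print("Anti-Ideal:", anti_ideal)
```

For indicators where more is better, the roles of minimum and maximum would be swapped.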

Page 21: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Ideal & Anti-Ideal...

• Hypothetical Locations!
• Abs. Best / Worst States : Hypothetical
• Setting up the Limits for others
• Ranking of the others
• Better-Placed States ?
• Closer to Ideal : Distance from Ideal small
• AND ALSO Far from Anti-Ideal : Distance from Anti-Ideal large

Page 22: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Concepts of Distance...

Euclidean Distance

Ideal :        6%    4%   10%    2%   21%   19%   11%

Anti-Ideal :  17%   13%   23%    8%   42%   28%   28%

Squared Distance between Ideal & Anti-Ideal

= (6-17)^2 + (4-13)^2 + ... + (11-28)^2 = 1218

Page 23: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Computations...

• Distance between a Location and ID (Ideal) OR NID (Anti-Ideal)

I :     7%   13%   21%    2%   34%   21%   17%
ID :    6%    4%   10%    2%   21%   19%   11%
NID :  17%   13%   23%    8%   42%   28%   28%

Sq. Dis. [I vs ID]

(7 - 6)^2 = 1, (13 - 4)^2 = 81, ...

Sq. Dis. [I vs NID]

(7 - 17)^2 = 100, (13 - 13)^2 = 0, ...
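
The squared-distance components for State I can be reproduced with a short sketch; the function name `squared_components` is hypothetical.

```python
state_I = [7, 13, 21, 2, 34, 21, 17]
ideal = [6, 4, 10, 2, 21, 19, 11]         # ID
anti_ideal = [17, 13, 23, 8, 42, 28, 28]  # NID


def squared_components(row, reference):
    """Feature-by-feature squared differences between a location and a reference."""
    return [(x - r) ** 2 for x, r in zip(row, reference)]


sq_I_vs_ID = squared_components(state_I, ideal)
sq_I_vs_NID = squared_components(state_I, anti_ideal)

# Unweighted squared Euclidean distance between the two hypothetical locations.
sq_ID_vs_NID = sum(squared_components(ideal, anti_ideal))

print("I vs ID  :", sq_I_vs_ID)
print("I vs NID :", sq_I_vs_NID)
print("ID vs NID:", sq_ID_vs_NID)
```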

Page 24: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Squared Distance Components vs Ideal

Locations \ Features [Chemicals]

Locations     1     2     3     4     5     6     7

I             1    81   121     0   169     4    36

II           36    25    64     1   441    81     0

III         121     0   169    25     1     0   144

IV            9    49    49     9    16    16    64

V            64    36     9    36     0     0   196

VI            0    49    81     9   144     4   121

VII          81    25     9     4   289     0   289

VIII        100    49     0     9   144     1   196

Page 25: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Squared Distance Components vs Anti-Ideal

Locations \ Features [Chemicals]

Locations     1     2     3     4     5     6     7

I           100     0     4    36    64    49   121

II           25    16    25    25     0     0   289

III           0    81     0     1   400    81    25

IV           64     4    36     9   289    25    81

V             9     9   100     0   441    81     9

VI          121     4    16     9    81    49    36

VII           4    16   100    16    16    81     0

VIII          1     4   169     9    81    64     9

Page 26: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Choice of Weights...

• Wts. [TRI] : 2.0 3.5 1.0 4.5 5.0 7.0 2.0

• Rel. Wts : .08 .14 .04 .18 .20 .28 .08

Sum of Squares for each feature over all locations

Be : 7^2 + 12^2 + 17^2 + 9^2 + 14^2 + 6^2 + 15^2 + 16^2 = 1276

Page 27: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Computation of Feature-wise Sum of Squares

Features     Sum of Squares

1 [Be]       1276

2 [Cd]        810

3 [Ca]       2382

4 [Tr]        217

5 [Ch]       8092

6 [Cy]       3678

7 [L]        3818
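
The feature-wise sums of squares follow directly from the columns of the data matrix. A minimal sketch, with illustrative names:

```python
FEATURES = ["Be", "Cd", "Ca", "Tr", "Ch", "Cy", "L"]
X = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]

# For each feature (column), add the squares of its values over all locations.
sum_of_squares = {
    name: sum(value * value for value in column)
    for name, column in zip(FEATURES, zip(*X))
}

for name, ss in sum_of_squares.items():
    print(f"{name:>2}: {ss}")
```

Dividing each squared distance by its feature's sum of squares is what makes the later computations free of the unit of measurement.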

Page 28: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Formation of Composite Indices...

Ingredients

• Distances, Weights & Sum of Squares
• Composite Index [CI] : 2 Components, derived from Ideal & Anti-Ideal locations
• For Each Location, added over all Features : Sq. Distance x Wt. of Feature, divided by Sum of Squares of the feature

Page 29: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Two Components…

Page 30: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

u’s and v’s….min. & max.
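
In symbols, the two components and the composite index can be sketched as follows. This is a reconstruction from the verbal recipe and the State I computation, not a quotation: it assumes that $u_j$ denotes the column minimum (Ideal) and $v_j$ the column maximum (Anti-Ideal), as is appropriate when less is better, and $S_j$ is notation introduced here for the feature-wise sum of squares.

```latex
\[
u_j = \min_{1 \le i \le K} X_{ij}, \qquad
v_j = \max_{1 \le i \le K} X_{ij}, \qquad
S_j = \sum_{i=1}^{K} X_{ij}^{2}
\]
\[
L_2[i, \mathrm{IDR}] = \left[ \sum_{j=1}^{N} \frac{w(j)\,(X_{ij} - u_j)^2}{S_j} \right]^{1/2},
\qquad
L_2[i, \mathrm{NIDR}] = \left[ \sum_{j=1}^{N} \frac{w(j)\,(X_{ij} - v_j)^2}{S_j} \right]^{1/2}
\]
\[
CI_i = \frac{L_2[i, \mathrm{IDR}]}{L_2[i, \mathrm{IDR}] + L_2[i, \mathrm{NIDR}]}
\]
```

Here $w(j)$ is the relative weight of feature $j$, so that $\sum_j w(j) = 1$.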

Page 31: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Details of Computations...

• State I :

$L_2[I, \mathrm{IDR}] = [(7-6)^2 \times 0.08 / 1276 + \dots]^{1/2}$

$L_2[I, \mathrm{NIDR}] = [(7-17)^2 \times 0.08 / 1276 + \dots]^{1/2}$

• CI = Composite Index

$= L_2[I, \mathrm{IDR}] / \{L_2[I, \mathrm{IDR}] + L_2[I, \mathrm{NIDR}]\}$

• It is a RATIO between 0 and 1 : the smaller the ratio, the better the placement of the State in the over-all comparison across states.
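
The whole computation for the eight coded states can be sketched end to end. This is an illustrative implementation of the recipe on these slides, not a reference TOPSIS routine: the function name `topsis` and the variable names are hypothetical, the weights are the subject-specialist weights given earlier, and CI follows the convention used here (distance from the ideal in the numerator, so smaller is better).

```python
import math

STATES = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]
X = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]
RAW_WEIGHTS = [2.0, 3.5, 1.0, 4.5, 5.0, 7.0, 2.0]


def topsis(x, raw_weights):
    """Return (distance to ideal, distance to anti-ideal, CI) for each row of x."""
    total = sum(raw_weights)
    w = [v / total for v in raw_weights]        # relative weights, summing to 1
    cols = list(zip(*x))
    ideal = [min(c) for c in cols]              # less is better
    anti = [max(c) for c in cols]
    ss = [sum(v * v for v in c) for c in cols]  # feature-wise sum of squares

    def distance(row, reference):
        # Weighted, scaled L2 distance: sq. distance x weight / sum of squares.
        return math.sqrt(sum(w[j] * (row[j] - reference[j]) ** 2 / ss[j]
                             for j in range(len(row))))

    results = []
    for row in x:
        d_ideal = distance(row, ideal)
        d_anti = distance(row, anti)
        results.append((d_ideal, d_anti, d_ideal / (d_ideal + d_anti)))
    return results


results = topsis(X, RAW_WEIGHTS)

# Smaller CI means better placement, so rank 1 goes to the smallest CI.
order = sorted(range(len(STATES)), key=lambda i: results[i][2])
ranks = {STATES[i]: rank for rank, i in enumerate(order, start=1)}

print("State  L2[., IDR]  L2[., NIDR]     CI  Rank")
for state, (d_ideal, d_anti, ci) in zip(STATES, results):
    print(f"{state:>5}  {d_ideal:10.4f}  {d_anti:11.4f}  {ci:.4f}  {ranks[state]:4d}")
```

The sketch assumes that the rows are not all identical; otherwise both distances are zero and the ratio is undefined.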

Page 32: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Computational Details : Ideal

Locations / Features for Sq. Distance wrt Ideal x Weight / SS of Features

Locations   1 [Be]    2 [Cd]    3 [Ca]    4 [Tr]    5 [Ch]    6 [Cy]    7 [L]

I           .000063   .014000   .002032   0.00      .004177   .000305   .000754

II          .002257   .004321   .001075   .000829   .010900   .006166   0.00

III         .007586   0.00      .002838   .020737   .000025   0.00      .003017

IV          .000564   .008469   .000823   .007465   .000395   .001218   .001341

V           .004013   .006222   .000151   .029862   0.00      0.00      .004107

VI          0.00      .008469   .001360   .007465   .003559   .000305   .002535

VII         .005078   .004321   .000151   .003318   .007143   0.00      .006056

VIII        .006270   .008469   0.00      .007465   .003559   .000076   .004107

Page 33: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Computational Details : Anti-Ideal

Locations / Features for Sq. Distance wrt Anti-Ideal x Weight / SS of Features

Locations   1 [Be]    2 [Cd]    3 [Ca]    4 [Tr]    5 [Ch]    6 [Cy]    7 [L]

I           .006270   0.00      .000067   .029862   .001582   .003730   .002535

II

Page 34: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Final Ranking Table...

States   L2 [., IDR]   L2 [., NIDR]   CI   Rank
I
II
III
IV       etc etc etc
V
VI
VII
VIII

Page 35: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Choice of Weights...

• Internal & External Importance of Environmental factors
• Use of Shannon’s Entropy Measure
• Define $p_{ij} = X_{ij} / \sum_i X_{ij}$ = proportion
• Compute for each item $j$ : $e(j) = -\sum_i p_{ij} \ln p_{ij} / \ln(K)$
• Use $w(j) = (1 - e(j)) / \sum_r (1 - e(r))$
• Alternatively, use $w(j)$ proportional to $cv^2$ of item $j$, where the coeff. of variation [cv] is computed from the data matrix
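
Both data-driven weighting schemes can be sketched in plain Python. The function names are hypothetical. For the cv-based weights the population variance (divisor K) is used, which is an assumption: the slide does not say whether the divisor is K or K - 1, although the normalised weights are the same under either choice.

```python
import math
import statistics

# Rows: States I-VIII; columns: Be, Cd, Ca, Tr, Ch, Cy, L (percent).
X = [
    [7, 13, 21, 2, 34, 21, 17],
    [12, 9, 18, 3, 42, 28, 11],
    [17, 4, 23, 7, 22, 19, 23],
    [9, 11, 17, 5, 25, 23, 19],
    [14, 10, 13, 8, 21, 19, 25],
    [6, 11, 19, 5, 33, 21, 22],
    [15, 9, 13, 4, 38, 19, 28],
    [16, 11, 10, 5, 33, 20, 25],
]


def entropy_weights(x):
    """Weights w(j) = (1 - e(j)) / sum_r (1 - e(r)), with e(j) the scaled entropy."""
    k = len(x)  # number of locations
    divergences = []
    for col in zip(*x):
        total = sum(col)
        proportions = [v / total for v in col]
        # Terms with p = 0 contribute nothing (p ln p -> 0), so they are skipped.
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(k)
        divergences.append(1 - entropy)
    denominator = sum(divergences)
    return [d / denominator for d in divergences]


def cv2_weights(x):
    """Weights proportional to the squared coefficient of variation of each item."""
    cv2 = []
    for col in zip(*x):
        mean = statistics.mean(col)
        cv2.append(statistics.pvariance(col) / mean ** 2)
    denominator = sum(cv2)
    return [c / denominator for c in cv2]


entropy_w = entropy_weights(X)
cv_w = cv2_weights(X)

print("entropy weights:", [round(w, 3) for w in entropy_w])
print("cv^2 weights   :", [round(w, 3) for w in cv_w])
```

Under both schemes a feature that hardly varies across locations receives a small weight, since it contributes little to discriminating between them. If every feature were constant across locations, the denominators would be zero and neither scheme would apply.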


Page 36: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Extensions...

• Ranking depends critically on Choice of Distance Measure & Choice of Weights
• Distance Measure : Squared Distance [L2]
• Mean Deviation : L1-Norm

Page 37: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Reversal of Roles of Rows & Cols.

Page 38: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Pollution Data on 50 US States

Page 39: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

US Pollution Data [contd.]....

Page 40: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

US Pollution Data [contd.]….

Page 41: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Results of TOPSIS Analysis : Two Sets of Weights [Entropy & CV] & Two Distance Measures [L-1 & L-2]

Page 42: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Results …contd…..

Page 43: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Results…contd….

Page 48: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Questions…..

• Q1. Can the indicators be expressed in their original units of measurement, or do we need only % ?

• Ans. Yes, original units will do, since the formulae involve unit-free computations : each squared distance is divided by the sum of squares of the same feature. Also see US Original Pollution Data at the end.

• Q2. What about interdependence among the indicators ?

Page 49: DATA INTEGRATION TECHNIQUES WITH APPLICATIONS IN  HEALTH SCIENCES

Questions….

• Ans. The indicators are assumed to be essentially uncorrelated. If there is any functional dependence among them, only the smallest subset of them should be used.

• Q3. What about PCA ?

• Ans. That won’t lead to ranking of the locations. Also it will be difficult to interpret the linear combinations of the indicators.