
On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach

Sriraam Natarajan1, Srijita Das2, Nandini Ramanan2, Gautam Kunapuli1 and Predrag Radivojac2
1 University of Texas at Dallas
2 Indiana University Bloomington
{sriraam.natarajan,gautam.kunapuli}@utdallas.edu
{sridas,nramanan,predrag}@indiana.edu

Abstract

We consider the problem of active feature elicitation in which, given some examples with all the features (say, the full Electronic Health Record) and many examples with only some of the features (say, demographics), the goal is to identify the set of examples on which more information (say, lab tests) needs to be collected. The observation is that some set of features may be more expensive, personal or cumbersome to collect. We propose a classifier-independent, similarity-metric-independent, general active learning approach which identifies examples that are dissimilar to the ones with the full set of data and acquires the complete set of features for these examples. Motivated by four real clinical tasks, our extensive evaluation demonstrates the effectiveness of this approach.

1 Introduction

Acquiring meaningful data for learning has long been a cherished goal of artificial intelligence, and is especially relevant in data-scarce domains such as medicine. While there is a plethora of data regarding several diseases, in some cases it is crucial to obtain information that is particularly relevant to the learning task. The problem of choosing an example for which to obtain its class label has been addressed as active learning [Settles, 2012]. There have been several extensions of active learning that include presenting a set of features [Raghavan et al., 2006; Druck et al., 2009], getting labels over clusters [Hofmann and Buhmann, 1998] or preferences [Odom and Natarajan, 2016], and sequential decision making [Lopes et al., 2009], to name a few.

Our problem is motivated by a different set of medical problems: that of recruiting patients for a clinical study. Consider the following scenario of collecting data (cognitive score and fMRI, both structural and functional) for an Alzheimer's study. Given a potentially large cohort size, the first step could be to simply collect the demographic information for everyone. Now, given a small amount of complete data from a related study, say the Alzheimer's Disease Neuroimaging Initiative (ADNI), our goal is to recruit subjects who would

Figure 1: The active feature elicitation setting. The top part is the fully observed data and the bottom right (grey shaded area) is the unobserved feature set.

provide the most information for learning a robust, generalized model. This scenario is highlighted in Figure 1. The top part shows the part of the data that is fully observed (potentially from a related study). The bottom left quadrant shows the observed features of the potential cohorts, and the bottom right quadrant is the data that needs to be collected for the most useful potential recruits. Given the labels of the potential recruits, the goal is to identify the most informative cohorts that would aid the study.

Inspired by the success of active learning methods, we define the problem of active feature elicitation (AFE), where the goal is to select the best set of examples on whom the missing features can be queried to best improve the classifier performance. At a high level, our algorithm (also called AFE), at each iteration, identifies examples that are most different from the current set of fully observed examples. These are then queried for the missing features; their feature values are obtained and added to the training set. Then the models are updated and the process is repeated until convergence. This is a general-purpose framework: any distance metric that works well with the data and the model can be employed, as can any classifier that is capable of handling the specific intricacies of the data. Finally, the convergence criterion can be decided based on the domain.


We make a few key contributions. First, we identify and formally define the problem of actively acquiring features from a selected set of examples. Second, we show the potential of this approach in four real medical prediction tasks: Alzheimer's from fMRI and cognitive score, Parkinson's from potential risk factors, rare diseases based on a survey questionnaire, and predicting post-partum depression (PPD) in a non-clinical setting. Finally, we empirically demonstrate that AFE is particularly effective in recall while not sacrificing overall performance in these four real tasks.

2 Related Work

Our work is closely related to active learning, where the most informative examples are chosen based on a scoring metric to be labeled when learning with semi-supervised data. Active learning [Settles, 2012] relies on the fact that if an algorithm can only solicit labels for a limited number of examples, then it should choose them judiciously, since not all examples provide the same amount of information. Active learning has a long history of being successfully employed with a variety of classifiers such as logistic regression [Lewis and Gale, 1994; Lewis and Catlett, 1994], support vector machines [Tong and Koller, 2001b], Bayesian network learning [Tong and Koller, 2000; 2001a] and in sequential decision-making tasks such as imitation learning [Judah et al., 2014] and inverse reinforcement learning [Odom and Natarajan, 2016].

Active learning has also been used with missing features. Zheng and Padmanabhan [2002], for instance, considered a setting where an imputation method was used to fill the incomplete feature subset and used scoring methods to acquire the most informative example for labeling. Melville et al. [2004] used uncertainty sampling to acquire the maximally informative examples from the partially observed set of features at training time. There is also a different body of work where instances are chosen based on individual feature utilities [Melville et al., 2005; Lizotte et al., 2003].

Our work is heavily inspired by the work of Kanani and Melville [2008] on active feature acquisition, which addressed a similar problem where a few examples with the full set of features are present while the others are incomplete. Their work also scored these examples based on uncertainty sampling and then updated the model at prediction time. There are a few key differences between that work and ours. First, our model updates occur during training and not at test time. Second, we return a single model on the best set of training data, while their approach used two different models for testing; our approach explicitly grows the set of training examples for a single model iteratively. Finally, they employed uncertainty sampling on the observed feature sets (which is a baseline in our approach), while we explicitly compute the distance between the two sets of points.

This work was later extended by Thahir et al. [2012] for protein-protein interaction prediction, where an extra term was added to the utility function that explicitly computed the value of adding an example to the classifier set. While it is possible to compute the value of adding an example to the training set in our work, we will pursue this as a future research direction. The AFA framework was later generalized and rigorously analyzed by Saar-Tsechansky et al. [2009], where even class labels can be considered to be missing and acquired. We assume that these labels are observed and that full sets of features need to be acquired. A key difference to this general direction is that both the observed set of examples and the observed set of features are significantly smaller in our work compared to the general AFA setting, which is clearly demonstrated in our experiments.

Bilgic and Getoor [2007] took a different approach to a similar task, where they assumed different costs for misclassification and information acquisition. They proposed a probabilistic framework that explicitly modeled this dependency and developed an algorithm to identify the set of features that can be optimally identified. Using such a strategy for discovering sets of features that one could acquire for different sets of patients is an interesting direction. Feature elicitation is inspired by the preference framework of concept learning by Boutilier et al. [2009], where minimax regret is used for computing the utility of subjective features. The violated constraints are repeatedly added to the computation, which can potentially make the problem harder to solve.

To summarize, our work is inspired by the contributions of several of these related works but differs in its motivation: collecting more features by identifying the right set of examples during training time to improve the model. As mentioned, we differ from active feature acquisition in both motivation and execution: we collect a large number of features from a small set of examples during training and use distances to select the most diverse set of examples (genome sequencing would best exemplify such a scenario). The other important difference is in the number of observed features, which is assumed to be much smaller in our work. Moreover, our solution, which explicitly computes the relationship between the observed and unobserved data, is independent of the choice of classifier and distance function. One key assumption that we make is that all unobserved features are collected for the selected examples; identifying the relevant subset of features, along the lines of Bilgic and Getoor, is an exciting direction for future work.

3 Motivating Medical Tasks

We motivate active feature elicitation with four medical tasks:

1. Parkinson's: The Parkinson's Progression Markers Initiative (PPMI) is an observational study with the aim of identifying biomarkers that impact Parkinson's progression [Marek et al., 2011]. The data can be divided broadly into four categories: imaging data, clinical data, bio-specimens and demographics. Among these data types, while the other modalities are either costly or cumbersome to obtain, the total Montreal Cognitive Assessment score (MoCA) is a standard measure that can be used to select subjects for whom information from other modalities can improve classifier performance significantly.

2. Alzheimer's: The Alzheimer's Disease Neuroimaging Initiative (ADNI) aims to test whether serial MRI, PET, biological markers, and clinical and neuro-psychological assessments can measure the progression of mild cognitive impairment and early Alzheimer's disease. Given a


small number of subjects with all the measurements, demographic data can be used to select subjects for whom obtaining more information, such as the cognitive MMScore and imaging data, could maximize performance in predicting Alzheimer's progression.

3. Rare Diseases: A recent work [MacLeod et al., 2016] focused on predicting rare diseases from a survey questionnaire that consisted of questions in the following categories: demographics, technology use, disease information and healthcare provider information. The set of diseases in the study includes Ehlers-Danlos Syndrome (23%), Wilson's Disease (21.9%), Kallmann's Syndrome (9.9%), etc. Demographics can be used to identify the future participants of the survey, as this can avoid more personal questions such as technology use and provider details, along with the disease information itself.

4. Post-Partum Depression: This work collects demographic information along with several sensitive questions covering relationship troubles, social support, economic status, infant behavior and the CDC questions, to identify PPD in subjects outside the clinic [Natarajan et al., 2017]. As with the earlier cases, demographics can be used to recruit the subjects on whom more sensitive information can be collected.

These varied medical tasks demonstrate the need for an active feature elicitation approach that allows for collecting relevant information in an effective manner. While the presented motivating tasks are medical, one could imagine the use of such an approach in any domain where some features are either expensive or cumbersome to obtain.

4 Active Feature Elicitation

Let us denote the label of an example i as y^i, the set of fully observed features (i.e., the features that are observed for the entire data set) as X_o, the set of features that are partially observed as X_u, the set of fully observed examples as E_o = 〈〈X_o^1, X_u^1, y^1〉 ... 〈X_o^k, X_u^k, y^k〉〉 and the set of partially observed examples as E_u = 〈〈X_o^1, y^1〉 ... 〈X_o^ℓ, y^ℓ〉〉. The learning problem in our setting can be defined as follows:

Given: A data set with E_o and E_u.
To Do: Identify the best set of examples E_a ⊂ E_u for which to obtain more information X_u such that the classifier performance improves.

In the above definition, the notions of best and improve have been intentionally left vague. This definition allows for any notion of the best examples and of improvement of the classifier. In our work, to be precise, we take best to denote the set of examples with maximal difference to the observed set, and performance to be the log-likelihood of the classifier. The classifier we consider is the well-understood gradient boosting [Friedman, 2001].

Since our focus is on clinical (study) data, our hypothesis is that the best examples from which to obtain extra information are those that are significantly different from the remaining examples. In principle, any distance function could be used to determine the set of examples E_a from E_u that are significantly different from the ones in E_o. We use the mean Kullback-Leibler (KL) divergence between every example e_i = 〈X_o^i, y^i〉 in E_u and every example e_j = 〈X_o^j, X_u^j, y^j〉 in E_o to determine the set of examples in E_u that are different from the observed set E_o. To compute this mean KL-divergence at every iteration t, we use the current models M_u^t = P_t(y^i | X_o^i) and M_o^t = P_t(y^j | X_o^j, X_u^j), learned on the two different sets of data. More precisely, we compute the mean distance of an example X^i from all the observed examples 〈X_o^j, X_u^j〉 as MD_i = (1/|E_o|) Σ_{j=1}^{|E_o|} D_ij, where the distance D_ij is the asymmetric KL divergence:

    D_ij = KL( P(y^i | X_o^i) ‖ P(y^j | X_o^j, X_u^j) ).    (1)

A natural question to ask is: what is the need for two different distributions, even if both are conditionals on the target? Note that, for the set of examples E_o, all the features are assumed to be fully observed. Ignoring the informative features X_u^j when computing the distances can lead to a loss of information, and our experiments confirmed this. Hence, we employ the model learned over the full set of features for the fully observed example set E_o, which is typically smaller than the unobserved set in the initial iterations.

Now that the distances have been computed, we sort them to pick the n most distinct examples from E_u. These n examples are queried for their missing features and are then added to the training set before the model is retrained. Note that at each iteration, the model P(y^j | X_o^j, X_u^j) is updated after the examples are appropriately chosen and queried. The model P(y^i | X_o^i) remains unchanged because it is trained on X_o of the entire example set E_o ∪ E_u. The process is repeated until convergence or until a predetermined budget is exhausted.

This presents a generalized and unifying framework, which can be adapted in multiple ways:

1. As we discuss in Section 4.1, this formulation admits a large class of divergences and distance metrics for computing distances between examples in E_o and E_u. One could imagine the use of other classifiers, kernels or learned metrics [Kunapuli and Shavlik, 2012] as well.

2. The gradient boosting classifier can be replaced with any classifier. Our framework is classifier-agnostic, allowing the user to select the best one for the task at hand.

3. Various convergence criteria can also be used. For instance, one could simply preset the number of iterations, employ a tuning set to determine the change in performance from the previous iteration, or compute the difference between scores from successive iterations. We employ this final strategy: computing the difference between log-likelihoods of the training data in successive iterations. If the difference is smaller than ε, we terminate the algorithm. One could also imagine reducing the number of queries at every iteration (i.e., successively reducing n ← n/(n+Δ)) such that the number of examples selected at each iteration naturally comes down to 0; a minimal sketch of both stopping strategies follows this list.
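Below is a minimal sketch of the two stopping strategies just described, again assuming a scikit-learn-style classifier; the helper names (avg_log_likelihood, converged, next_budget) and the default thresholds are illustrative assumptions.

import numpy as np

def avg_log_likelihood(model, X, y, eps=1e-12):
    # Assumes y holds integer class indices aligned with model.classes_.
    p = np.clip(model.predict_proba(X)[np.arange(len(y)), y], eps, 1.0)
    return np.log(p).mean()

def converged(ll_prev, ll_curr, epsilon=1e-3):
    # Stop when the training log-likelihood changes by less than epsilon.
    return abs(ll_curr - ll_prev) < epsilon

def next_budget(n, delta=1.0):
    # n <- n / (n + Delta): successively reduces the per-iteration query count
    # toward 0, so a "while n >= 1" guard eventually fails.
    return n / (n + delta)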

We present the algorithm for active feature elicitation in Algorithm 1.


Algorithm 1 Active Feature Elicitation

 1: function ActiveFeatureElicitation(Eo, Eu, n, Δ)
 2:   t = 0                                    ▷ iteration counter
 3:   Mt = TrainInitialModel(Eo, Eu, Xo, Xu)
 4:   while n ≥ 1 do                           ▷ while not converged
 5:     MD = 0                                 ▷ initialize mean divergences
 6:     for i = 1 to |Eu| do
 7:       D = 0                                ▷ init divergence for unobserved ex. i
 8:       for j = 1 to |Eo| do
 9:         Dj = ComputeDistance(Ei, Ej, Mt)
10:       end for
11:       MDi = (Σ_{j=1}^{|Eo|} Dj) / |Eo|     ▷ average distance
12:     end for
13:     Equ = GetTopN(MD)
14:                          ▷ n most divergent partially observed examples
15:     Eqo = AppendNewFeature(Equ)
16:                          ▷ actively query to elicit missing features
17:     Eo = Eo ∪ Eqo                          ▷ add queried to observed
18:     Eu = Eu \ Equ                          ▷ remove queried from unobserved
19:     Mt = UpdateModel(Eo, Eu, Xo, Xu)
20:                          ▷ retrain or update classifier
21:     n = n / (n + Δ)                        ▷ check convergence / update budget
22:   end while
      return TrainFinalModel(Eo)
23: end function

The AFE algorithm takes as input the set of fully observed examples (E_o), the set of partially observed examples (E_u), the number of actively queried examples per step (n), and the step size (Δ). In this algorithm, sufficient decrease in the query size (n ← n/(n+Δ)) is used as the stopping criterion (lines 4, 21) as an example. This can be replaced by other task-relevant budgets or convergence criteria, as discussed previously.

After initializing the mean distances, AFE iterates through every partially observed example in E_u and computes its mean distance to all the fully observed examples in E_o, based on the divergence between the respective current models. The n most divergent (dissimilar) examples are selected, and features are actively obtained for these examples (AppendNewFeature). These examples are then added to E_o and removed from E_u. A new model can then be trained (or updated, depending on the choice of classifier), and the process is repeated. Note that M_t consists of two classifiers: one trained on E_o, which contains the new examples provided by the user after active feature elicitation with all the features, and the other trained on the entire data E_o ∪ E_u. After convergence, E_o holds the full set of training examples.

In our experiments, we employ gradient boosting as the classifier, KL-divergence to identify the unobserved examples whose features we would like to acquire, and the difference in average log-likelihoods between two iterations as our convergence criterion.
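Putting the pieces together, here is a minimal end-to-end sketch of the AFE loop under these experimental settings. It reuses the mean_kl_divergences sketch from above; query_oracle is a hypothetical stand-in for actually eliciting the missing features, and a fixed iteration cap replaces the log-likelihood test for brevity.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def afe(Xo_obs, Xu_obs, y_obs, Xo_unobs, y_unobs, query_oracle, n=5, max_iter=20):
    # M_u is trained once on the observed features X_o of all examples and,
    # per the description above, stays fixed across iterations.
    model_u = GradientBoostingClassifier().fit(
        np.vstack([Xo_obs, Xo_unobs]), np.concatenate([y_obs, y_unobs]))
    for _ in range(max_iter):
        if len(y_unobs) < n:
            break
        # M_o is retrained each iteration on the full feature set of E_o.
        model_o = GradientBoostingClassifier().fit(
            np.hstack([Xo_obs, Xu_obs]), y_obs)
        md = mean_kl_divergences(model_u, model_o, Xo_unobs, Xo_obs, Xu_obs)
        top = np.argsort(md)[-n:]       # the n most divergent examples in E_u
        Xu_new = query_oracle(top)      # elicit their missing features X_u
        Xo_obs = np.vstack([Xo_obs, Xo_unobs[top]])
        Xu_obs = np.vstack([Xu_obs, Xu_new])
        y_obs = np.concatenate([y_obs, y_unobs[top]])
        Xo_unobs = np.delete(Xo_unobs, top, axis=0)   # move queried examples
        y_unobs = np.delete(y_unobs, top)             # from E_u into E_o
    return GradientBoostingClassifier().fit(np.hstack([Xo_obs, Xu_obs]), y_obs)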

4.1 Other Model Divergences

The KL divergence is a special case of the Csiszar f-divergence [Csiszar, 1967], a generalized measure of the difference between two probability distributions, in our case P(y^i | X_o^i) and P(y^j | X_o^j, X_u^j). Given two distributions P and Q over some space Ω, for a convex function f (with f(1) = 0), the divergence of P from Q is defined as D_f(P ‖ Q) = ∫_Ω f(dP/dQ) dQ. Such f-divergences satisfy non-negativity, monotonicity and convexity, though they are not always symmetric. Several well-known distribution distance measures are special cases of the f-divergence. For example, the χ²-divergence might be well suited to histogram data [Kedem et al., 2012], while the Hellinger distance might benefit applications with highly skewed data [Cieslak et al., 2012].

Recently it was shown that families of divergences including the α- and β-divergences are also special cases of the f-divergence [Cichocki and Amari, 2010]. The latter includes generalizations of measures such as the Euclidean distance and the Itakura-Saito distance, which are appropriate for unsupervised and semi-supervised learning problems. See Deza and Deza [2013] for other useful distance functions. We leave the usefulness of various divergences for different machine learning problem types and applications to future work.
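As a sketch of this plug-and-play property, the divergences below operate on any pair of discrete predictive distributions p and q (rows of predict_proba, binary or multinomial, which also covers the multi-class setting of Section 4.2) and could replace the KL term D_ij above. These are standard textbook forms, not the authors' implementations.

import numpy as np

def kl(p, q, eps=1e-12):                 # f(t) = t log t
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

def chi_squared(p, q, eps=1e-12):        # f(t) = (t - 1)^2; suited to histogram data
    q = np.clip(q, eps, 1.0)
    return np.sum((p - q) ** 2 / q)

def hellinger(p, q):                     # bounded in [0, 1]; robust to skew
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))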

4.2 Multi-class and Other Extensions

As our approach is algorithm- and divergence-agnostic, it can be seamlessly extended to multi-class settings. As long as the underlying classification algorithm can produce (multinomial) distributions over the label space, p = P(y^i | X_o^i) and q = P(y^j | X_o^j, X_u^j), we can use any model divergence discussed in Section 4.1.

There are several other possible extensions of the proposed approach. The first is the necessity to move beyond active learning: while standard methods acquire a label for each example, in many situations where the goal is to understand why an event happens (such as clinical studies), it is necessary to obtain more tests/features. Also, given that the original model is learned from a small set of features, it will not necessarily be generalizable. Second, it is possible that some specific set of features is the most informative for a specific example. For instance, some subjects' predictions will benefit from one lab test, while a different test is a better indicator for someone else. Extending our framework to handle these different example/feature combinations is outside the current scope and is an interesting future direction.

5 Empirical Evaluation

We now present evaluation results on one standard UCI data set (PIMA [Smith et al., 1988]) and four real medical tasks to demonstrate the efficacy of our approach. It must be mentioned clearly that while healthcare is one domain, the data sets are varied: from online behavior to images to risk factors to surveys. The goal is to demonstrate the versatility of the approach on real problems. The ideas are not limited to healthcare, however: any problem where a small set of data is fully observed and the rest is partially observed can render itself a useful domain for the proposed approach.

1. Parkinson's prediction from clinical study: The task is to predict the occurrence of Parkinson's disease from different modalities. We focus on a smaller set of features, primarily motor and non-motor assessments, resulting in a set of 37 attributes including the class label. The observed feature is the MoCA test result, while the other 35 motor scores are treated as unobserved.

Figure 2: Recall (left), F1 (middle), g-mean (right) for (from top to bottom) ADNI, PPMI, Rare Disease, PPD and PIMA.

Data set     | # Pos | # Neg | # Features (FO) | # Features (PO) | # Examples (FO) | # Examples (PO)
PPMI         |  554  |  919  |        1        |       35        |        5        |      1174
ADNI         |   76  |  260  |        6        |       69        |       10        |       294
Rare Disease |   87  |  174  |        6        |       63        |       10        |       198
PPD          |   38  |  115  |        8        |       33        |        6        |       147
PIMA         |  268  |  500  |        4        |        4        |       10        |       681

Table 1: Data set details. FO = fully observed, PO = partially observed. # Pos and # Neg are the numbers of positive and negative examples; # Features (FO/PO) are the numbers of features in the fully and partially observed feature sets; # Examples (FO/PO) are the numbers of examples in the fully and partially observed example sets.

2. Alzheimer's prediction from ADNI: We assume that demographics are observed, while the cognitive score (MMScore) and fMRI image features are unobserved. We use the AAL atlas (http://www.slicer.org) to segment each image into 108 Regions of Interest (RoIs), and for each RoI we derive summary attributes: white matter, cerebral spinal fluid and gray matter intensities, along with regional variance, size and spread. While the original data set has three classes, Alzheimer's (AD), Cognitively Normal (CN) and Mildly Cognitively Impaired (MCI), we consider the binary task of predicting AD vs. the rest. The presence of MCI subjects makes this particular task challenging yet interesting, because these subjects may or may not eventually develop Alzheimer's. Identifying the right set of subjects to target for feature elicitation can considerably improve classifier performance.

3. Rare disease prediction from self-reported survey data: The task is to predict whether a subject has a rare disease [MacLeod et al., 2016]; by definition, a rare disease is hard to diagnose and affects less than 10% of the world's population. The data for this prediction task arises from survey questionnaires, and we assume that demographic data are fully observed. Other survey answers concerning technology use, disease information and healthcare details are treated as unobserved.

4. Post-partum depression prediction from online questionnaire data: Recently, Natarajan et al. [2017] employed online questionnaires to predict PPD from demographics, social support, relevant medical history, child-birth issues and screening data. We assume that demographics are observed and are used to select subjects on whom the rest of the data can be collected for learning.

We also test our algorithm on the well-studied PIMA Indians Diabetes data to demonstrate generality. Table 1 shows the details of these domains; a common characteristic across all domains is class imbalance, which makes it important that the most informative subjects are added to the training set.

Evaluation Methodology: All data sets are partitioned into 80% train and 20% test. Results are averaged over 10 runs with a fixed test set. At each active learning step, we solicit 5 new data points until convergence. As mentioned earlier, Friedman's [2001] gradient boosting was employed as the underlying classifier with the same settings across all methods, and KL-divergence is our distance metric. We report three evaluation metrics: recall (to measure the clinically relevant sensitivity), F1-score, and the geometric mean of sensitivity and specificity (g-mean), which together provide a reasonably robust evaluation in the presence of class imbalance. We considered AUC-ROC but, as pointed out by Davis and Goadrich [2006], it is not ideal for severely class-imbalanced data sets, and hence we settled on these metrics.
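For reference, a minimal sketch of the three reported metrics with scikit-learn; the gmean helper is an illustrative construction from sensitivity and specificity, assuming binary labels in {0, 1}.

import numpy as np
from sklearn.metrics import f1_score, recall_score

def gmean(y_true, y_pred):
    sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall on positives
    specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on negatives
    return np.sqrt(sensitivity * specificity)

# Usage: recall_score(y, yhat), f1_score(y, yhat), gmean(y, yhat)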

Baselines: In addition to the proposed AFE approach, we considered three baselines. (1) Randomly choosing points to query, which can potentially yield strong results closer to convergence; this method is denoted RND. (2) Uncertainty sampling on the partially observed example set using only the fully observed features; the top 5 instances with the highest entropy were queried for their unobserved features and added to the training set. This is denoted USObs (a minimal sketch follows). (3) In the third approach, we imputed all missing features using the mode as the feature value; uncertainty sampling was then employed by computing the entropy on the full feature set, following which the top 5 instances were chosen for querying. This baseline is denoted USAll. Other active-learning baselines could be considered (such as min-max), but these generally tend to be prohibitively expensive in large feature spaces.
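Below is a minimal sketch of the entropy-based selection underlying USObs; USAll is identical once mode-imputed X_u columns are appended to the input. The function name is an illustrative assumption.

import numpy as np

def uncertainty_top_n(model, X, n=5, eps=1e-12):
    P = np.clip(model.predict_proba(X), eps, 1.0)
    entropy = -(P * np.log(P)).sum(axis=1)   # predictive entropy per example
    return np.argsort(entropy)[-n:]          # indices of the n most uncertain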

Results: We aim to answer the following questions:

Q1: Does AFE perform better than choosing the examples randomly for active learning?
Q2: Does AFE outperform other active baselines that employ entropy to choose examples?
Q3: Does AFE robustly handle imbalanced data?
Q4: Can AFE be useful for a semi-supervised formalism?
Q5: Is AFE faithful to the motivation, and is the proposed solution effective in modeling clinical data?

The results across the five domains and all three metrics are presented in Figure 2. It can be observed that AFE outperforms RND on all domains across all metrics, specifically in recall in the first few iterations in 4 of the 5 domains, where choosing the most informative set of examples can have the maximal impact on classifier performance. As expected, the variance in recall due to random selection of examples is high. This can be seen in the left column, where the performance does not increase steadily in all domains. Similar observations can be made for the geometric mean. This allows us to answer Q1 affirmatively.

Observe that as we add more informative examples, the performance improves significantly for the proposed AFE approach over the rest of the methods. This demonstrates that the gains do not come at the beginning alone: adding more useful examples constructs a more robust training set that can lead the classifier to superior performance. Note that one of the reasons we average over 10 runs is to minimize the effect of sampling bias in the construction of the training set (particularly in the initial samples).


As expected, the variance is initially on the higher side, indicating the effect of sampling bias on smaller training sets, but it decreases as more examples are added. However, this effect is minimal for AFE, which chooses good training examples compared to the other methods. Understanding the effect of the initial choice of samples on the performance of the classifier is itself an interesting future research direction.

Similar results can be observed when comparing AFE to USObs and USAll, in that AFE consistently outperforms the two active learning baselines across all the domains on all the metrics. In general, using only the observed features still appears better than using mode imputation to fill in the missing values and then computing the entropy over the full feature set. We speculate that better imputation techniques may improve performance. However, the difference between AFE and USAll in recall and g-mean across all domains is statistically significant in several iterations. This suggests that other imputation techniques may marginally improve performance, but imputation alone appears unlikely to determine the final performance; a good selection of examples is necessary. This answers Q2 strongly in the affirmative.

Another natural question is how the variance in performance of the different methods behaves across the data sets. We generally observed that AFE had the smallest variance, both in the selection of the first few examples and in the last few iterations of the algorithm. The variance of AFE across all data sets was, on average, at most half the variance of the random baseline (RND) in the first few iterations, and as low as 10% of random selection's variance in later iterations. This is consistent across all metrics. When compared to USObs and USAll, the average variance of AFE was generally lower across all metrics and all data sets. While AFE's variance is significantly better than that of random selection, the differences to the uncertainty methods, while favorable, are not necessarily significant.

As shown in Table 1, all domains are imbalanced. The class prior ratio is particularly skewed in the ADNI and PPD data sets. In these two domains, it can be seen clearly that the proposed approach achieves a recall of over 0.8 after just a few early iterations. This demonstrates that AFE can identify the most important examples for increasing the clinically relevant sensitivity, effectively enabling us to answer Q3 positively as well. Our results in all domains show that this method achieves high recall without significantly sacrificing precision, making it an ideal choice for semi-supervised imbalanced data sets (Q4). The intuition here is that because AFE obtains high recall across all data sets, it provides an opportunity for selecting the right sets of examples that can be labeled or extended with more features. Although AFE is not semi-supervised, it provides an opportunity for developing methods that can learn from partially labeled data.

Our original motivation was to identify the set of subjects on whom to perform specific lab tests, given some basic information about the potential recruits. Our algorithm does not use any extra information about the potential recruits, beyond the observed features and their labels (whether they are cases or controls), to identify the best set of subjects from whom to elicit more information. Second, we make no assumptions about the underlying distributions of the data or the classifier employed while learning. Finally, we make effective use of the fully observed data (possibly from a related study) by using it to compute the distance to the potential cohorts. This demonstrates that AFE is indeed faithful to the original goals, and the results clearly show its efficacy, thus answering Q5 in the affirmative.

6 Conclusion

We considered the problem of eliciting new sets of features based on a small amount of fully observed data. We addressed this problem specifically in the context of medical domains with severe class imbalance. Our proposed active feature elicitation approach computes the similarity between the potentially interesting examples and the fully observed examples, and chooses the most significantly different examples for feature elicitation. These are then added to the fully observed set, and the process continues until convergence. Experiments on four real, high-impact medical tasks clearly demonstrate the effectiveness and efficiency of the proposed approach. Our approach has a few salient features: it is domain-, model- and distance-agnostic, i.e., representation-agnostic, in that any reasonable classifier and a compatible distance metric for a specific domain can be employed in a plug-and-play manner.

We currently elicit all missing features for every chosen example at every iteration. One could extend this work by identifying the sets of features that are most informative for each example (i.e., the most relevant lab test for each subject), along the lines of the work of Krishnapuram et al. [2005], which addressed a similar problem in a multi-view co-training setting; this could allow for the realization of the vision of personalized medicine. Another interesting future research direction is to identify groups of examples (sub-populations) that would provide the most information at each iteration. Extending the work to handle fully relational/graph data is another possible direction. Finally, rigorous evaluation of the approach on more clinical data sets can yield further insights into the proposed approach.

Acknowledgements

The authors acknowledge the support of Indiana University's Precision Health Initiative. SN and GK also gratefully acknowledge National Science Foundation grant no. IIS-1343940. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Precision Health Initiative or the United States government. Data used in the preparation of this article were obtained from the Parkinson's Progression Markers Initiative (PPMI) database. For up-to-date information on the study, visit www.ppmi-info.org. PPMI, a public-private partnership, is funded by the Michael J. Fox Foundation for Parkinson's Research and funding partners, including Abbvie, Avid, Biogen, Bristol-Myers Squibb, Covance, GE Healthcare, Genentech, GlaxoSmithKline, Lilly, Lundbeck, Merck, Meso Scale Discovery, Pfizer, Piramal, Roche, Servier, Teva, UCB, and Golub Capital.


References

[Bilgic and Getoor, 2007] M. Bilgic and L. Getoor. VOILA: efficient feature-value acquisition for classification. NCAI, pages 1225–1230, 2007.

[Boutilier et al., 2009] C. Boutilier, K. Regan, and P. Viappiani. Online feature elicitation in interactive optimization. ICML, pages 73–80, 2009.

[Cichocki and Amari, 2010] A. Cichocki and S. Amari. Families of alpha-, beta- and gamma-divergences: flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.

[Cieslak et al., 2012] D. A. Cieslak, T. R. Hoens, et al. Hellinger distance decision trees are robust and skew-insensitive. Data Min Knowl Discov, 24(1):136–158, 2012.

[Csiszar, 1967] I. Csiszar. Information measures of difference of probability distributions and indirect observations. Studia Sci Math Hungar, 2:299–318, 1967.

[Davis and Goadrich, 2006] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. ICML, pages 233–240, 2006.

[Deza and Deza, 2013] M. M. Deza and E. Deza. Encyclopedia of Distances. Springer, 2013.

[Druck et al., 2009] G. Druck, B. Settles, and A. McCallum. Active learning by labeling features. EMNLP, pages 81–90, 2009.

[Friedman, 2001] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Ann Stat, 29(5):1189–1232, 2001.

[Hofmann and Buhmann, 1998] T. Hofmann and J. M. Buhmann. Active data clustering. NIPS, pages 528–534, 1998.

[Judah et al., 2014] K. Judah, A. Fern, et al. Imitation learning with demonstrations and shaping rewards. AAAI, pages 1890–1896, 2014.

[Kanani and Melville, 2008] P. Kanani and P. Melville. Prediction-time active feature-value acquisition for cost-effective customer targeting. Workshop on Cost Sensitive Learning at NIPS, 2008.

[Kedem et al., 2012] D. Kedem, S. Tyree, et al. Non-linear metric learning. NIPS, pages 2573–2581, 2012.

[Krishnapuram et al., 2005] B. Krishnapuram, D. Williams, et al. Active learning of features and labels. Workshop on Learning with Multiple Views at ICML, 2005.

[Kunapuli and Shavlik, 2012] G. Kunapuli and J. Shavlik. Mirror descent for metric learning: a unified approach. ECML PKDD, pages 859–874, 2012.

[Lewis and Catlett, 1994] D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. ICML, pages 148–156, 1994.

[Lewis and Gale, 1994] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. SIGIR, pages 3–12, 1994.

[Lizotte et al., 2003] D. J. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-Bayes classifiers. UAI, pages 378–385, 2003.

[Lopes et al., 2009] M. Lopes, F. Melo, and L. Montesano. Active learning for reward estimation in inverse reinforcement learning. ECML PKDD, pages 31–46, 2009.

[MacLeod et al., 2016] H. MacLeod, S. Yang, et al. Identifying rare diseases from behavioural data: a machine learning approach. CHASE, pages 130–139, 2016.

[Marek et al., 2011] K. Marek, D. Jennings, et al. The Parkinson Progression Marker Initiative (PPMI). Prog Neurobiol, 95(4):629–635, 2011.

[Melville et al., 2004] P. Melville, M. Saar-Tsechansky, et al. Active feature-value acquisition for classifier induction. ICDM, pages 483–486, 2004.

[Melville et al., 2005] P. Melville, M. Saar-Tsechansky, et al. An expected utility approach to active feature-value acquisition. ICDM, pages 745–748, 2005.

[Natarajan et al., 2017] S. Natarajan, A. Prabhakar, et al. Boosting for postpartum depression prediction. CHASE, pages 232–240, 2017.

[Odom and Natarajan, 2016] P. Odom and S. Natarajan. Active advice seeking for inverse reinforcement learning. AAMAS, pages 512–520, 2016.

[Raghavan et al., 2006] H. Raghavan, O. Madani, and R. Jones. Active learning with feedback on features and instances. J Mach Learn Res, 7(Aug):1655–1686, 2006.

[Saar-Tsechansky et al., 2009] M. Saar-Tsechansky, P. Melville, and F. Provost. Active feature-value acquisition. Manag Sci, 55(4), 2009.

[Settles, 2012] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1), 2012.

[Smith et al., 1988] J. W. Smith, J. E. Everhart, et al. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proc Annu Symp Comput Appl Med Care, pages 261–265, 1988.

[Thahir et al., 2012] M. Thahir, T. Sharma, and M. K. Ganapathiraju. An efficient heuristic method for active feature acquisition and its application to protein-protein interaction prediction. BMC Proc, 6(Suppl 7):S2, 2012.

[Tong and Koller, 2000] S. Tong and D. Koller. Active learning for parameter estimation in Bayesian networks. NIPS, pages 647–653, 2000.

[Tong and Koller, 2001a] S. Tong and D. Koller. Active learning for structure in Bayesian networks. IJCAI, pages 863–869, 2001.

[Tong and Koller, 2001b] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J Mach Learn Res, 2(Nov):45–66, 2001.

[Zheng and Padmanabhan, 2002] Z. Zheng and B. Padmanabhan. On active learning for data acquisition. ICDM, pages 562–569, 2002.
