
Manifold for Machine Learning Assurance

Taejoon Byun
University of Minnesota
Minneapolis, Minnesota
[email protected]

Sanjai Rayadurgam
University of Minnesota
Minneapolis, Minnesota
[email protected]

ABSTRACT
The increasing use of machine-learning (ML) enabled systems in critical tasks fuels the quest for novel verification and validation techniques yet grounded in accepted system assurance principles. In traditional system development, model-based techniques have been widely adopted, where the central premise is that abstract models of the required system provide a sound basis for judging its implementation. We posit an analogous approach for ML systems using an ML technique that extracts, from the high-dimensional training data implicitly describing the required system, a low-dimensional underlying structure: a manifold. The manifold is then harnessed for a range of quality assurance tasks such as test adequacy measurement, test input generation, and runtime monitoring of the target ML system. The approach is built on the variational autoencoder, an unsupervised method for learning a pair of mutually near-inverse functions between a given high-dimensional dataset and a low-dimensional representation. Preliminary experiments establish that the proposed manifold-based approach, for test adequacy, drives diversity in test data; for test generation, yields fault-revealing yet realistic test cases; and for runtime monitoring, provides an independent means to assess trustability of the target system's output.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging; • Computing methodologies → Machine learning.

KEYWORDS
machine learning testing, neural networks, variational autoencoder

ACM Reference Format:
Taejoon Byun and Sanjai Rayadurgam. 2020. Manifold for Machine Learning Assurance. In New Ideas and Emerging Results (ICSE-NIER'20), May 23–29, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3377816.3381734

1 INTRODUCTION
Machine-learning enabled systems are being increasingly adopted in safety-critical applications. This brings into renewed focus the important problem of assurance of such systems and drives research into novel and effective techniques for V&V of ML systems.


In traditional development, model-based techniques are well-established and central to a variety of assurance activities. A model, by definition, abstracts away irrelevant system details, retaining only what is essential for its intended purposes. ML techniques produce such a model for a given task by learning the essential features from provided data. If a model useful for a task can be learned instead of being hand-coded, can models useful for quality assurance (such as behavioral models in traditional software engineering) also be similarly learned? We answer this question affirmatively here using a specific learning technique that appears to be a good fit for multiple assurance activities.

Two considerations influence our choice of the learning approach for such assurance models: the ability to (i) identify inputs of interest for exercising the target system, and (ii) flag outputs of concern when the target system is exercised. The former is useful for constructing various scenarios to assess the ML system, and the latter for judging its exhibited behavior; both are very important for assurance tasks. We investigated variational autoencoders, which learn to-and-fro maps between a high-dimensional input space and a low-dimensional manifold space (the learned implicit model of the data) such that their composition approximates the identity function on the inputs used for training. If we can carve out regions of interest in this low-dimensional manifold space (the model domain), we can then use the two learned maps to effectively address both considerations. This is the basis for the proposed manifold-based assurance techniques elaborated in the sequel.

In a set of preliminary experiments performed on image classification models, we find that manifold-based coverage has a stronger correlation than neuron coverage to semantic features in the data. We also present results from manifold-based test generation showing that realistic yet fault-revealing test inputs can be generated in volume, which can help in addressing a target system's weaknesses. Finally, we briefly discuss a manifold-based runtime monitor to assess output trustability for a prototype ML-based aircraft taxiing system.

2 PRELIMINARIES
Mathematically, a manifold is a topological space that is locally Euclidean (e.g., the surface of the Earth). In ML, the manifold hypothesis states that real-world data $X$ presented in a high-dimensional space $\mathbb{R}^{d_X}$ are expected to concentrate in the vicinity of a manifold $\mathcal{M}$ of a much lower dimension $d_{\mathcal{M}}$ embedded in $\mathbb{R}^{d_X}$ [1]. Manifold learning tries to capture this mapping so that a complex dataset can be encoded into a meaningful representation in a smaller dimension, serving several purposes such as data compression and visualization [3]. Among many manifold learning approaches, we built our techniques on top of the variational autoencoder, as it provides unique advantages over other methods, such as the capability of synthesizing new inputs with expected outputs from the manifold by interpolating among existing data points.




Figure 1: A VAE with κ = 2. The colored points comprise a 2D manifold of MNIST encoded in N([0, 0]^T, I_2). The color represents class labels from digit 0 (dark navy) to 9 (burgundy).

Variational autoencoder (VAE) is a latent-variable generative model which is capable of producing outputs similar to inputs by determining a latent-variable space Z and an associated probability density function (PDF) P(z). Its goal is to ensure that for each datapoint x in a given dataset X, there is at least one valuation of the latent variables z ∈ Z which causes the model to generate x̂ that is very similar to x. This is achieved by optimizing θ for a deterministic function f : Z × Θ → X such that the random variable f(z; θ) produces outputs similar to x ∈ X when z is sampled from P(z). VAE assumes that any probability distribution P(z) can be obtained by applying a (sufficiently sophisticated) function f_θ to a set of independent normally distributed variables z.

For modeling the unknown PDF of latent variables P(z|x), from which to run the decoder f_θ(z), we need an encoder function g(x; ϕ) which can take an x and return a distribution over the values of z that are likely to produce x. This g is called a probabilistic encoder and is modeled with a PDF Q(z|X). It is designed to produce two outputs, the mean g_{μz} and the variance g_{σz}^2 of the encoded z. Thus, g encodes each x ∈ X as a distribution, where the mean g_{μz} has the highest probability of being reconstructed to x.

Since P(z|x) was assumed to be a multivariate Gaussian, the posterior distribution Q_ϕ(z|x) must match P(z|x) so that we can relate P(x) to E_{z∼Q} P(x|z), the expected value of a generated input x given a latent variable z when z is sampled from the space encoded by Q. This is achieved by minimizing the VAE loss:

$$\mathcal{L}(\theta, \phi) = \int_X \Big( -\mathbb{E}_{Q_\phi(z|x)}\big[\log P_\theta(x|z)\big] + \mathrm{KL}\big[Q_\phi(z|x) \,\|\, P(z)\big] \Big)\, \mu_{gt}(dx),$$

where $\mu_{gt}(dx)$ is the ground-truth probability mass of a $dx$ on $X$, which leads to $\int_X \mu_{gt}(dx) = 1$. The first term is responsible for minimizing the reconstruction error between x and x̂, and the second term for making the posterior distribution of z match the prior distribution. Interested readers are referred to [6] for a VAE tutorial.
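To make the two terms concrete, the following is a minimal sketch of how such a loss is commonly computed, assuming a Gaussian encoder that outputs a mean and log-variance and a Bernoulli (binary cross-entropy) reconstruction term; the PyTorch framing and function names are our illustration, not the implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z ~ Q(z|x) differentiably: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_hat, mu, logvar):
    """VAE loss for a batch: reconstruction error plus KL divergence.

    x      : original inputs, scaled to [0, 1]
    x_hat  : decoder outputs (reconstructions of x)
    mu     : encoder means for z
    logvar : encoder log-variances for z
    """
    # First term: -E_{Q(z|x)}[log P(x|z)], a Bernoulli log-likelihood here.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Second term: KL[Q(z|x) || N(0, I)], which has a closed form for Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```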

As an example, Figure 1 illustrates the structure and the operation of a VAE, with the size of the manifold dimension κ set to 2 for a 2D visualization. The graph on the left illustrates a probabilistic encoder implemented as a neural network, and the graph on the right illustrates a probabilistic decoder implemented as another neural network. The learned manifold is illustrated with a cluster of points in a 2D plane that are encoded from the training data, centered around the Gaussian mean (0, 0) with a variance of I_2. This space is continuous, smooth, and locally resembles Euclidean space. Similar images are placed close to each other, as can be seen from the clear separation among inputs from different classes.

For any unseen input x, the encoder will produce a manifold representation z that, when passed to the decoder, produces a synthetic image x̂ similar to the original image x. This encoder-decoder pair is attached back-to-back during training, but can subsequently be taken apart and used separately for different purposes.

3 TEST ADEQUACY
Coverage criteria are used as proxy measures of software test adequacy. Structural coverage criteria, such as statement or branch coverage, are frequently used, as they can be measured on the source code, often the only software artifact available. Although their efficacy and correlation to fault-finding effectiveness have long been disputed [7], they are widely adopted in industry as they offer a straightforward and inexpensive gauge of test-suite goodness. In this spirit, several coverage criteria have been proposed to account for the structure of deep neural networks, such as neuron coverage [14], and their efficacy is also debated [12].

Alternatively, test adequacy can also be measured on other artifacts such as requirements or behavior models. These offer implementation-independent measures of test adequacy and the further benefit of traceability of tests to higher-level artifacts. One can view the manifold as a compact model of the input space described by the training data. Thus, coverage criteria defined on the manifold are in effect model-based coverage criteria for ML systems.

3.1 Manifold Combination Coverage
VAE assumes that each latent dimension is distributed independently. During training, each latent variable learns some independent semantic feature, although no control can be assumed over what human-understandable feature it learns. If we conveniently regard a latent variable in a VAE trained on the MNIST dataset as encoding the thickness of the hand-written digits, this variable will display a gradation of thickness as its value moves from the left side to the right side of the Gaussian PDF. By discretizing this continuous variable into representative categories, such as three levels of thickness, one can easily measure whether a test suite covers every category.

The key idea of manifold combination coverage is to apply category partitioning to the latent variables that comprise a manifold, and cover them combinatorially. Combination over the latent dimensions mandates that a test suite include inputs that correspond to every possible interaction among the latent features. Based on this intuition, we define k-section manifold combination coverage as follows; a t-way combination coverage can also be defined similarly:

Definition 3.1. k-section manifold combination coverage: For a manifold of κ latent variables, divide the range of each variable into k sections of equal probability mass. A test suite T achieves k-section manifold combination coverage if there exists one or more samples z = g_ϕ(x), x ∈ T, for each of the k^κ combinations of sections.
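A minimal sketch of how this criterion can be measured, assuming each test input has already been encoded to its latent mean z = g_ϕ(x) (as done in the experiment below); the section boundaries are equal-probability-mass quantiles of the unit Gaussian prior, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def mcc(latents, k):
    """k-section manifold combination coverage of an encoded test suite.

    latents : (n, kappa) array of latent means, one row per test input
    k       : number of equal-probability-mass sections per latent dimension
    Returns the fraction of the k**kappa section combinations covered.
    """
    n, kappa = latents.shape
    # Boundaries that split N(0, 1) into k sections of equal probability mass.
    bounds = norm.ppf(np.linspace(0.0, 1.0, k + 1)[1:-1])
    # Map each latent coordinate to its section index in {0, ..., k-1}.
    sections = np.digitize(latents, bounds)
    covered = {tuple(row) for row in sections}
    return len(covered) / k ** kappa
```

With the settings used in the experiment below (κ = 8, k = 3), this yields 3^8 = 6,561 coverage obligations in the denominator.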

3.2 Experiment
We performed a preliminary experiment to compare the utility of manifold combination coverage (MCC) with neuron coverage (NC) and neuron boundary coverage (NBC) [13] in measuring the diversity of a test dataset. We first present the coverage scores achieved by each criterion on different subsets of the EMNIST [4] dataset, which consists of 28,000 images per class.



Table 1: Coverage achieved by subsets of EMNIST dataset (%)

        Sub-suite by class label
Crit.    0     1     2     3     4     5     6     7     8     9     All   Train
MCC     19.6  14.1  24.6  19.6  29.9  17.6  20.1  22.7  19.5  26.9   83.6  88.8
NC      73.4  73.3  73.4  73.4  73.4  73.4  73.4  73.4  73.4  73.4   73.5  73.5
NBC      3.4   0.5   6.3   4.4   4.5   6.7   3.1   4.2   5.8   3.1   14.6   0.0

Next, we quantify the association of coverage obligations (such as a combination of manifold sections, or a neuron) to the class label, which is used as a representative semantic feature for this study. We use Cramér's V, a measure between 0 (no association) and 1 (complete association) of association between two nominal variables, to see how strongly the coverage obligations align with the semantic feature(s) of interest.
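For reference, Cramér's V is derived from the chi-square statistic of the obligation-by-label contingency table; the following sketch (not the exact analysis script) assumes each test contributes the identifier of the obligation it satisfies, e.g., its section-combination tuple.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(obligations, labels):
    """Cramér's V between two nominal variables, a value in [0, 1]."""
    rows = {v: i for i, v in enumerate(sorted(set(obligations)))}
    cols = {v: j for j, v in enumerate(sorted(set(labels)))}
    # Contingency table: rows are coverage obligations, columns class labels.
    table = np.zeros((len(rows), len(cols)))
    for o, y in zip(obligations, labels):
        table[rows[o], cols[y]] += 1
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
```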

For learning the MNIST [10] manifold, we trained a VAE with κ = 8. During encoding, we took only the means from the encoder and discarded the standard deviations to obtain a deterministic representation. For an instantiation of manifold combination coverage, we set k = 3, which yields 3^8 = 6,561 combinations. The choice of κ and k here was arbitrary, and further study is needed to understand the effect of these choices.

Table 1 shows the coverage achieved for each sub-suite (one per label) of the EMNIST dataset and for the MNIST training set (last column). As previously reported, high NC scores were achieved easily, failing to show any differences among the sub-suites. The Cramér's V score for NC was 0.13, indicating a very low association. NBC showed low coverage scores, but was able to pronounce some differences between the sub-suites. Its V score was 0.42, showing a medium association. MCC scored 14% to 30% per sub-suite, and the coverage on the whole set was similar to the coverage achieved by the MNIST training dataset, indicating that MCC is a good measure of completeness. Its V score was 0.83, indicating a high association.

4 TEST GENERATION
The goal of testing an image classifier is to find faults in a model which cause discordance between existing conditions and required conditions [16]. One way of finding faults is to run lots of test data through the model, check the outputs against expectations, and identify fault-revealing inputs. However, constructing such a test dataset can be difficult and expensive due to the high cost of data collection and laborious labeling.

Test generation attempts to solve this problem by synthesizing test cases (pairs of inputs and expected outputs) that are likely to trigger faults in the model [16]. However, generating realistic fault-revealing images is not a trivial task because images reside in a very high dimension. Most of the existing approaches either generate synthetic adversarial inputs [14], or rely on domain-specific metamorphic relations, such as changing the brightness of the image while keeping the object recognizable, for generating realistic inputs. While both of these approaches are useful, they also come with limitations: 1) the presence of adversarial inputs does not deliver any insight on the performance of the model on naturally occurring inputs, and 2) the diversity of the generated inputs is bounded by the metamorphic relations and the power of the tool which implements the translation. If one can reliably generate a diverse set of realistic inputs that are also fault-revealing, the testing process can be much accelerated, since those counter-examples can highlight the weaknesses of the model under test.

Figure 2: Manifold-based test generator. [Figure omitted: a conditional two-stage VAE serves as the test generator; a sampled latent vector u and a label y are decoded into a test input x, which is run on the model under test and collected as (x, y) when the prediction ŷ differs from y.]

To achieve this goal, we propose a novel manifold-based test generation framework, giving a concrete form to the idea also proposed by Yoo [15]. The key is to treat the manifold as a search space for finding points from which new inputs can be synthesized. A search can be applied on the manifold space to find fault-revealing inputs, and further, the regions of the manifold can be related to the performance of the model under test. For brevity, we illustrate the approach with random sampling as the search method.

4.1 Manifold-based Random Test Generation
A vanilla VAE can generate new inputs but not the expected output labels. To overcome the inefficiency of manually labeling synthesized images, we employ a conditional VAE [9], which can learn a manifold conditioned on the class label. Compared to the vanilla VAE, it has an extra input, the class-label condition y, in the input layer of both the encoder and the decoder. The loss function used for training is a slightly modified VAE loss, with the PDF of the latent variable P(z) conditioned on y as P(z|y). When generating a new input, a desired label y needs to be provided to the decoder along with a latent vector z. We also adopt the two-stage VAE [5], a recent innovation that overcomes the blurriness of inputs synthesized by VAEs, a problem previously misunderstood as an inherent limitation of the technique.

Figure 2 shows the manifold-based test generation framework. First, a two-stage VAE is trained one stage after the other, as suggested in [5]. The first VAE is trained to reconstruct the original input image x while learning the first-stage manifold P(z|x). The second VAE is trained to reconstruct the first-stage encoding z while learning the second-stage manifold P′(u|z). For both VAEs, the labels of the images are given to the encoder and the decoder as y during training so that we can condition the generated test inputs on desired outputs. Once the VAEs are trained, the encoders are discarded and only the decoders are used for generating new images. A new sample x can be synthesized by feeding a chosen vector u with a label y to the second-stage decoder, and in turn z and y to the subsequent first-stage decoder. Each element of the latent vector u is drawn independently from a unit Gaussian distribution, which is the prior assumed on the manifold.
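The sampling loop itself is simple; the following sketch assumes decode_stage2(u, y) and decode_stage1(z, y) wrap the two trained conditional decoders and model_under_test(x) returns class probabilities. These names are placeholders for illustration, not the authors' API.

```python
import numpy as np

def generate_fault_revealing(decode_stage2, decode_stage1, model_under_test,
                             kappa=32, n_classes=10, budget=100_000):
    """Randomly sample the second-stage manifold; keep fault-revealing cases."""
    failures = []
    for _ in range(budget):
        y = np.random.randint(n_classes)     # desired (expected) label
        u = np.random.randn(kappa)           # u ~ N(0, I), the assumed prior
        z = decode_stage2(u, y)              # second-stage decoder: u -> z
        x = decode_stage1(z, y)              # first-stage decoder:  z -> x
        y_hat = int(np.argmax(model_under_test(x)))
        if y_hat != y:                       # prediction disagrees with the
            failures.append((x, y, y_hat))   # conditioned label: a candidate
    return failures
```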

4.2 Experiment
We performed a preliminary experiment on the MNIST dataset to see if the proposed approach can generate realistic fault-revealing test cases. The model under test was a convolutional neural network with 594,922 trainable parameters which achieved 99.15% accuracy on the MNIST validation data. The test generator was a two-stage conditional VAE with 21,860,515 trainable parameters and κ = 32 latent dimensions, trained for five and a half hours on a desktop equipped with a GTX1080-Ti GPU.



Figure 3: Synthesized fault-revealing test cases

Figure 4: Confidence scores assigned by manifold monitor

Its FID score, which quantifies the realism of the synthesized images, was 6.64 for reconstruction and 8.80 for sampling new inputs, considered very good and comparable to that of state-of-the-art generative adversarial networks.

Obtaining 1,000 fault-revealing test cases on our target took about six minutes and 300,256 total test cases (99.67% accuracy with random testing). These test cases were then manually inspected, taking 28 minutes (1.7 sec/test), to filter out marginal cases; 663 inputs were determined to be valid fault-revealing cases. The rest were either too confusing even to humans, or were mislabeled. A sample of the fault-revealing cases is presented in Figure 3, ordered by the true label (automatically generated), with the predicted label and associated probability from the CNN under test. The results indicate that the fault-revealing inputs are indeed corner cases, peculiar hand-written digits, yet realistic enough to be recognized correctly by humans.

5 RUNTIME MONITORING
The need for runtime monitoring arises from the probabilistic nature of ML systems: even with a near-perfect model, corner cases almost always exist, and architectural mitigation is often the only solution. Out-of-distribution (OOD) detection attempts to solve this problem by detecting whether a new input falls outside the distribution of the problem domain [11]. In the spirit of manifold-based ML assurance, we studied an OOD detection method that leverages the VAE. As the VAE learns a distribution which can be handled analytically, the probability of seeing an input x can be computed as a joint probability of encountering the latent vector z = g_ϕ(x) in the manifold space. We defined a function

$$\mathrm{conf}(z) = \Big( \prod_{i=1}^{|z|} e^{-z_i^2/2} \Big)^{1/\kappa},$$

which is the joint probability of z under unit normal distributions, normalized to a value in the unit interval. The result can be seen as a confidence measure for an input x being in-distribution.
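Since the product of per-dimension Gaussian kernels collapses into an exponential of the mean squared latent coordinate, conf(z) is cheap to evaluate online; a direct transcription follows (a sketch, not the deployed monitor).

```python
import numpy as np

def conf(z):
    """conf(z) = (prod_i exp(-z_i^2 / 2))^(1/kappa), a value in (0, 1].

    Algebraically equal to exp(-mean(z_i^2) / 2), which avoids numerical
    underflow of the raw product when kappa is large.
    """
    z = np.asarray(z, dtype=float)
    return float(np.exp(-0.5 * np.mean(np.square(z))))
```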

We assessed the feasibility of this approach by integrating it into an autonomous aircraft taxiing system prototype. In such systems, detection must be performed in real time so that appropriate actions can be taken as needed. The VAE-based monitor was able to provide real-time confidence estimates with a moderate computational overhead, and showed that it can flag corner-case inputs. Figure 4 illustrates some exemplary cases: 1) normal conditions, 2) the end of the runway, 3) a runway region with skid marks, and 4) the edge of the runway. The confidence score roughly corresponded with the probability mass of each case in the training data.

6 FUTURE WORK
We proposed novel approaches, supported by early results, to leverage manifolds learned by VAEs for quality assurance of ML systems. Further work is required to explore and fully exploit their potential. To keep t-way combinatorial testing tractable, small t values are desirable; such an approach and its effectiveness need rigorous investigation. The efficacy of manifold coverage is likely to be enhanced when combined with measures that are designed to prioritize corner cases, such as surprise adequacy or uncertainty [2, 8]; this idea needs to be developed further and evaluated. The idea of searching the manifold space for generating interesting test inputs needs further development, and such approaches need to be empirically compared with each other and with random generation. The use of a VAE as a runtime monitor to detect potentially problematic behaviors needs further refinement; specifically, ways to assess confidence and evaluate the metrics' performance are needed. Finally, the dimensionality reduction achieved by this approach could benefit formal analysis techniques that may not otherwise be applicable at scale. Whether VAEs provide appropriate abstractions for formal verification needs investigation.

ACKNOWLEDGMENTS
This work was supported by AFRL and DARPA under contract FA8750-18-C-0099.

REFERENCES
[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (Aug 2013), 1798–1828.
[2] Taejoon Byun, Vaibhav Sharma, Abhishek Vijayakumar, Sanjai Rayadurgam, and Darren Cofer. 2019. Input Prioritization for Testing Neural Networks. In 2019 IEEE Intl. Conference On Artificial Intelligence Testing (AITest). 63–70.
[3] Lawrence Cayton. 2005. Algorithms for Manifold Learning. Univ. of California at San Diego Tech. Rep. 12, 1–17 (2005).
[4] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. 2017. EMNIST: An Extension of MNIST to Handwritten Letters. CoRR abs/1702.05373 (2017). arXiv:1702.05373
[5] Bin Dai and David P. Wipf. 2019. Diagnosing and Enhancing VAE Models. CoRR abs/1903.05789 (2019). arXiv:1903.05789
[6] Carl Doersch. 2016. Tutorial on Variational Autoencoders. ArXiv (2016). arXiv:1606.05908
[7] Alex Groce, Mohammad Amin Alipour, and Rahul Gopinath. 2014. Coverage and Its Discontents. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. ACM, 255–268.
[8] Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the 41st International Conference on Software Engineering (ICSE '19). IEEE Press, 1039–1049.
[9] Anders B. L. Larsen, Søren K. Sønderby, and Ole Winther. 2015. Autoencoding Beyond Pixels Using a Learned Similarity Metric. CoRR abs/1512.09300 (2015).
[10] Yann LeCun. 1998. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
[11] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 7167–7177.
[12] Zenan Li, Xiaoxing Ma, Chang Xu, and Chun Cao. 2019. Structural Coverage Criteria for Neural Networks Could Be Misleading. In ICSE-NIER '19. IEEE Press, Piscataway, NJ, USA, 89–92.
[13] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang. 2018. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE Intl. Conf. on Automated Software Engineering (ASE 2018). ACM, New York, NY, USA, 120–131.
[14] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In SOSP '17. ACM Press, New York, NY, USA, 1–18. arXiv:1705.06640
[15] Shin Yoo. 2019. SBST in the Age of Machine Learning Systems: Challenges Ahead. In Proceedings of the 12th International Workshop on Search-Based Software Testing. IEEE Press, 2–2.
[16] Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2019. Machine Learning Testing: Survey, Landscapes and Horizons. CoRR abs/1906.10742 (2019). arXiv:1906.10742