
Thesis Proposal: Automated Testing of Robotic and Cyberphysical Systems

Afsoon Afzal

School of Computer Science
Carnegie Mellon University

Pittsburgh, PA 15213

Thesis Committee:
Claire Le Goues (Chair), Carnegie Mellon University

Michael Hilton, Carnegie Mellon University
Eunsuk Kang, Carnegie Mellon University

John-Paul Ore, North Carolina State University

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2020 Afsoon Afzal


Keywords: testing cyberphysical systems; robotics testing; automated quality assurance; simulation-based testing; challenges of testing; automated oracle inference


Abstract

Robotics and cyberphysical systems are increasingly being deployed to settings where they are in frequent interaction with the public. Therefore, failures in these systems can be catastrophic, endangering human lives and causing extreme financial loss. Large-scale assessment of the quality of these systems before deployment can prevent such costly damage.

Because of the complexity and other distinctive features of these systems, testing, and more specifically automated testing, faces serious challenges. In this thesis proposal, I study the unique challenges of testing robotics and cyberphysical systems by conducting a number of qualitative, quantitative, and mixed-methods studies, and propose an end-to-end automated testing pipeline to provide tools and methods that can help roboticists in large-scale automated testing of their systems. My key insight is that we can use (low-fidelity) simulation to automatically test robotic and cyberphysical systems, and identify many potentially catastrophic failures in advance at low cost.

My thesis statement is: Robotics and cyberphysical systems have unique features, such as interacting with the physical world and integrating hardware and software components, which create challenges for automated, large-scale testing approaches. Software-in-the-loop (low-fidelity) simulation can facilitate automated testing for these systems. Machine learning approaches (e.g., clustering) can be used to create an automated testing pipeline, which includes automated oracles and automated test input generation.

To support this statement, I propose the following work. In the preliminary work, which is already completed, I conducted a qualitative study and interviewed robotics practitioners about their testing practices and challenges. I identified nine main challenges roboticists face while testing their systems. In a case study on the ARDUPILOT autonomous vehicle software, I investigated the potential impact of using low-fidelity software-based simulation on exposing failures in robotics systems, and showed that low-fidelity simulation can be an effective, low-cost approach to detecting bugs and errors in robotic systems.

I propose to further study features in robotics simulators that are the most important for automated testing, and the challenges of using these simulators, by conducting a large-scale survey with robotics practitioners.

I propose an approach to automatically generate oracles for cyberphysical systems using clustering, which can observe and identify common patterns of system behavior. These patterns can be used to distinguish erroneous behavior of the system and act as an oracle for an automated testing pipeline.

Finally, I propose to investigate automated test generation for these systems by, first, identifying a suitable metric for evaluating the quality of test suites and, second, automatically generating test suites that target under-tested behaviors.


Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Contributions
  1.3 Proposal Outline

2 Review of Literature and Background
  2.1 Related work
  2.2 Background

3 Challenges of Testing Robotic Systems
  3.1 Testing in robotics: practices and challenges
  3.2 Software-based simulation for testing
  3.3 Simulators for testing

4 Automated Oracle Inference
  4.1 Background
  4.2 Approach
  4.3 Evaluation
  4.4 Limitations and future work

5 Automated Test Generation
  5.1 Motivation
  5.2 Methodology

6 Proposed Timeline

7 Conclusion

Bibliography


1 Introduction

Robotic systems are systems that sense, process, and physically react to information from the real world [3]. In the past decade, robotic systems have become increasingly important in humans' everyday lives. In the past, the use of these systems was mostly limited to industrial settings, in isolation and under specific safe conditions, which prevented potential extreme damage to people. However, robotic systems are now frequently used in a variety of (unsafe) settings such as avionics, transportation, and medical operations. It is now more important than ever to ensure the safety and quality of these systems before deploying them, as failures in these systems can be catastrophic. In October 2018 and March 2019, two Boeing 737 MAX airplanes crashed due to the planes' aggressive dive caused by erroneous angle-of-attack sensor data, killing all 346 people aboard [1].

Automated quality assurance, or more specifically automated testing, is widely used in software development [24]. However, robotic systems have specific properties that make deployment of automated testing challenging in practice: 1) robots are comprised of hardware, software, and physical components, which can be unreliable and non-deterministic [43, 57, 73], 2) they interact with the physical world via inherently noisy sensors and actuators, and are sensitive to timing differences [73], 3) they operate within the practically boundless state space of reality, making emergent behaviors (i.e., corner cases) difficult to predict [43], and 4) the notion of correctness for these systems is often non-exact and difficult to specify [79]. As a result, in many cases integration and whole-system testing is performed manually, mostly in the field [11]. Field testing is an important part of robotics development, but it can result in expensive and dangerous failures. In addition, field testing is highly limited by the scale of the scenarios and environments to which it can be applied. For example, testing an autonomous drone under highly windy conditions requires either an expensive setup to artificially recreate high-speed wind or waiting for the condition to occur naturally. Neither option is practical, which leaves some features untested in practice.

The ultimate goal of this work is to make robotics and cyberphysical systems safe, and to improve the quality assurance of these systems. For this purpose, we first need to identify the testing challenges commonly faced by robotics developers. Even though many studies have investigated the state of testing (especially automated testing) of software systems in practice [29, 41, 47, 88, 98], little attention has been paid to automated testing of robotics and cyberphysical systems. As part of this proposed thesis, I conducted a qualitative study with robotics practitioners to better understand the robotics testing practices currently being used, and to identify challenges and bottlenecks preventing roboticists from automated testing [11]. I identified three themes of challenges in testing robotic systems: 1) real-world complexities, 2)



Figure 1.1: Automated testing pipeline for a cyberphysical system using simulation. An automated test input generation technique (Chapter 5) creates test inputs that will be executed on the CPS in simulation, which results in execution traces. These traces are provided to an automated oracle (Chapter 4) to be labeled as either correct or erroneous.

community and standards, and 3) component integration. I present this study in more detail in Section 3.1.

Even though hardware testing is an essential part of quality assurance for robotic systems [7, 104], it is (almost) orthogonal to the quality of the operating software; if the behavior of the hardware and the environment could be simulated perfectly, we would no longer need the actual hardware and a real environment to test the software of the system. However, to this day no simulator is able to perfectly simulate every aspect of the environment. In fact, most simulators provide low fidelity and are very limited [35, 36, 119]. In our qualitative study, we found that robotics practitioners find simulation ineffective and difficult to use [11]. In this thesis, I propose to conduct a quantitative study to inspect the features and limitations of robotics simulators, and identify the most pressing issues that prevent or discourage developers from using robotic simulators for performing automated tests.

Although low-fidelity software-in-the-loop (SITL) simulators are not perfect, they can still be very effective tools for detecting and preventing failures by allowing cheap, large-scale, automated testing. The ExoMars lander crash that occurred in 2016 is an interesting example that shows the potential effectiveness of simulation-based testing. The crash, which cost approximately $350 million, was later recreated in the simulation environment, which shows the role that simulation-based testing could play in preventing failures in the field [5]. In a similar case, a report issued by the National Transportation Safety Board (NTSB) on the Boeing 737 MAX crashes illustrates that the specific failure modes that could lead to an uncommanded dive (e.g., erroneous sensor data) were not properly simulated as part of the functional hazard assessment validation tests and, as a result, were missed by the safety assessment [2]. To illustrate the extent to which low-fidelity simulation-based testing may be used to detect failures in robotics systems, we conducted an empirical case study on a robotic system [107]. In this study, presented in Section 3.2, we showed that more than half of the bugs found in the system over time can manifest in low-fidelity SITL simulation. Specifically, we found that of all bugs, only 10% require particular environmental conditions (not available in low-fidelity simulation) to manifest, and 4% only manifest on physical hardware.

As a result, in the absence of high-fidelity simulators, low-fidelity SITL simulation can be used for systematic, large-scale automated testing of CPSs, as its low fidelity prevents the discovery of only a small number of bugs in the system. Figure 1.1 presents an automated testing pipeline for a CPS. In this thesis, I propose techniques to realize this pipeline by creating an automated oracle (Chapter 4) and automated test input generation techniques (Chapter 5) that can effectively expose faulty system behavior in simulation, before field testing.
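The pipeline in Figure 1.1 can be read as a simple loop: generate an input, execute it in software-in-the-loop simulation, and hand the resulting trace to an oracle. The following is a minimal, hypothetical sketch of such a driver; the function names (generate_input, run_in_simulation, oracle) are placeholders and not part of any existing tool.

```python
# Hypothetical driver for the pipeline in Figure 1.1; all callables are placeholders.
def testing_campaign(num_tests, generate_input, run_in_simulation, oracle):
    """Generate inputs, execute them in SITL simulation, and label the traces."""
    labeled = []
    for _ in range(num_tests):
        test_input = generate_input()          # automated test input generation (Chapter 5)
        trace = run_in_simulation(test_input)  # software-in-the-loop execution trace
        verdict = oracle(trace)                # automated oracle (Chapter 4): correct/erroneous
        labeled.append((test_input, trace, verdict))
    return labeled
```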


My proposed techniques will not require pre-defined specifications or models of the system.

SITL simulation-based testing, like any other testing method, requires an oracle that can differentiate between correct and incorrect system behavior [22]. This oracle can take many forms, from formal specifications to manual inspection (human judgment) [10, 63, 76, 78, 94, 122]. For large-scale, automated testing, we require an oracle that can automatically label executions of the system and detect failures. However, manually defining such an oracle for a robotics system requires extensive knowledge of the (usually very complex) system and consideration of all possible scenarios that the robot may face in advance [82]. For example, the oracle for a self-driving car may simply specify that the vehicle should not collide with any objects. Even though this oracle can detect failures in conditions where the vehicle hits a pedestrian or an object, it does not take into account cases where colliding with an object, such as a plastic bag in the air, does not pose any danger and should be allowed. In other words, a robotic system's correct behavior may vary based on conditions that are affected by the unpredictable environment, the system's configuration, timing, and randomness. An accurate oracle needs to consider all possible conditions and specify the correct behavior of the system in those conditions [69].

An automated approach to generating the oracle can take us one step closer to the pipeline for systematic, large-scale automated testing presented in Figure 1.1. Existing approaches in automated specification mining and invariant inference, formal verification, and statistical models have tackled this problem in the past [17, 25, 27, 37, 38, 39, 42, 44, 51, 58, 71, 72, 89, 90, 91, 115, 122]. In Chapter 2, I describe these techniques in more detail and discuss their advantages and limitations. Broadly, robotics and autonomous systems have features that limit the application of existing approaches. One such feature is the fact that many of these systems involve third-party components without access to the source code. This heavily limits the application of many existing tools, as they require complete access to the source code to extract a system's specifications. Another feature of robotic systems, as already mentioned, is their context-dependent behavior, which ultimately requires a method that is able to learn disjunctive models, depending on context. Finally, the amount of noise and randomness in the environment in which these systems operate requires special attention.

My key insight is that by observing many executions of the system as a black box, I can find patterns of correct or normal behavior of the system, and mark executions that do not comply with these patterns as abnormal or erroneous behavior. I propose techniques to generate these black-box models from a set of previously observed executions of the system, and to use them as an oracle that can differentiate between correct and erroneous behavior of the system. I evaluate the accuracy of these models by their ability to correctly label a set of traces. The details of this work are presented in Chapter 4.
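As a rough illustration of this insight (not the proposed technique itself), one way to operationalize such a black-box oracle is to summarize each learned behavioral pattern by a centroid and a distance threshold, and to flag any new trace that is far from every centroid. The distance metric, the threshold rule, and the cluster summary used below are illustrative assumptions.

```python
import numpy as np

def label_trace(trace, clusters, k=3.0):
    """Label a trace CORRECT if it is close enough to at least one learned pattern.

    `clusters` is a list of (centroid, mu, sigma) tuples, where centroid is a
    representative trace of one normal behavior and mu/sigma summarize the
    member-to-centroid distances observed during training.
    """
    for centroid, mu, sigma in clusters:
        # Euclidean distance over time-aligned traces is a placeholder metric;
        # a shape-based measure such as DTW could be substituted.
        dist = np.linalg.norm(np.asarray(trace) - np.asarray(centroid))
        if dist <= mu + k * sigma:
            return "CORRECT"
    return "ERRONEOUS"
```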

Finally, studies have shown that the quality of test inputs highly affects the exposure of faulty behaviors in testing [60, 62, 92]. An automated testing pipeline, such as the one presented in Figure 1.1, requires an automated method of generating effective test inputs. In this work, I propose to investigate different methods of automated test generation, including (but not limited to) random and evolutionary test generation methods, which target corner cases and rarely observed behaviors to expose more bugs and failures. However, to be able to compare different test suites, I first need to identify a test suite quality metric that can measure the ability of a test suite to reveal faults in the system. I propose to study the performance of several test suite quality metrics (e.g., code and model coverage) on robotic and cyberphysical systems, and use the most suitable metrics to


develop and compare automated test suite generation methods.

1.1 Thesis Statement

Robotics and cyberphysical systems have unique features, such as interacting with the physical world and integrating hardware and software components, which create challenges for automated, large-scale testing approaches. Software-in-the-loop (low-fidelity) simulation can facilitate automated testing for these systems. Machine learning approaches (e.g., clustering) can be used to create an automated testing pipeline, which includes automated oracles and automated test input generation.

1.2 Contributions

This proposal contains a set of qualitative and quantitative studies, in which I conduct interviews and surveys and use grounded theory to analyze the data. In addition, I conduct a case study on bugs in the popular open-source ARDUPILOT system. I use the same ARDUPILOT system, in addition to possibly more cases, to evaluate the performance of the proposed techniques, both in terms of automatically creating more accurate oracles and in generating more fault-revealing test inputs.

My thesis will contribute in the following ways:

1. It identifies the challenges of automated testing for robotics systems and discovers the practices currently being used in the field of robotics.

2. It shows that simulation-based testing can be an effective approach in identifying faults in these systems.

3. It identifies the challenges of using simulators for the purpose of (automated) testing, and the most prominent issues with currently available simulators.

4. It proposes a black-box approach to automatically infer oracles for these systems based on observed executions of the robot in the simulated environment.

5. It proposes methods to automatically generate tests that explore various system behaviors, specifically targeting rarely observed and under-tested behaviors to expose failures in rare conditions.

1.3 Proposal Outline

In this proposal I first provide an overview of the literature on topics related to this proposal, and background on related topics (Chapter 2). In Chapters 3, 4, and 5, I identify three research thrusts and describe the preliminary work done for each thrust as well as the proposed work. In Chapter 6, I propose a timeline for completing the ongoing research, and finally, in Chapter 7, I conclude this thesis proposal.


2 Review of Literature and Background

The following sections give an overview of related work and background that inform the proposed work.

2.1 Related work

Robotic systems Robots are systems that sense, process, and physically react to information from the real world [3]. Robotic systems are a subcategory of cyberphysical systems [66], which include non-robotics systems such as networking systems or power grids. However, robotic systems are subject to system constraints that do not apply to CPSs broadly (such as a need for autonomy, route planning, and mobility).

Robotic systems differ in several important dimensions [40, 43, 57, 73, 79, 101] as compared to conventional software: (1) Robots are comprised of hardware, software, and physical components, which can be unreliable and non-deterministic [43, 57, 73]. (2) Robots interact with the physical world via inherently noisy sensors and actuators, and are sensitive to timing differences [73]. (3) Robots operate within the practically boundless state space of reality, making emergent behaviors (i.e., corner cases) difficult to predict [43]. (4) For robotic systems, the notion of correctness is often non-exact and difficult to specify [79]. These characteristics introduce unique challenges for testing, such as the need to either abstract aspects of physical reality or conduct extensive testing in the real world.

Challenges of testing robotics and CPSs A number of studies have investigated software testing practices broadly, and the challenges facing these practices [29, 41, 47, 88, 98]. Runeson [98] conducted a large-scale survey on unit testing with 19 software companies, and identified unit test definitions, strengths, and problems. Causevic et al. [29] qualitatively and quantitatively study practices and preferences on contemporary aspects of software testing.

In a technical report, Zheng et al. [118] report on a study of verification and validation in cyberphysical systems. The paper finds that there are significant research gaps in addressing verification and validation of CPSs, and that these gaps potentially stand in the way of the construction of robust, reliable, and resilient mission-critical CPSs. The paper also finds that developers have a lack of trust in simulators, and one of the main research challenges it identifies is integrated simulation. Seshia et al. [101] introduce a combination of characteristics that define the challenges unique to the design automation of CPSs. Marijan et al. [79] speculate on a range of challenges involved in testing machine-learning-based systems.


Duan et al. [40] extract 27 challenges for verification of CPSs by performing a large-scale search on papers published from 2006 to 2018. Alami et al. [14] study the quality assurance practices of the Robot Operating System (ROS, https://ros.org) community by using qualitative methods such as interviews with ten participants, virtual ethnography, and community reach-outs. They learn that implementation and execution of QA practices in the ROS community are influenced by social and cultural factors and are constrained by sustainability and complexity. However, their results only apply to a specific robotics framework and cannot be generalized to non-ROS systems.

Luckcuck et al. [77] systematically surveyed the state of the art in formal specification and verification for autonomous robotics, and identified the challenges of formally specifying and verifying (autonomous) robotic systems. Their study focuses on formal specification as a method of quality assurance and does not provide information regarding other testing practices within the wider field of robotics.

Wienke et al. [112] conducted a large-scale survey to find out which types of failures currently exist in robotics and intelligent systems, what their origins are, and how these systems are monitored and debugged. Sotiropoulos et al. [103] performed a study of 33 bugs in academic code for outdoor robot navigation. The study found that for many navigation bugs, only a low-fidelity simulation of the environment is necessary to reproduce the bug. Koopman and Wagner [69] highlight the challenges of creating an end-to-end process that integrates the safety concerns of a myriad of technical specialties into a unified approach. Beschastnikh et al. [26] looked at several key features and debugging challenges that differentiate distributed systems from other kinds of software.

Anomaly and intrusion detection There are a number of studies on anomaly detection in cyberphysical systems [32, 48, 53, 56, 86, 93, 110, 122]. He et al. [55] propose an approach for creating autoregressive system identification (AR-SI) oracles for CPSs. Based on the assumption that many CPSs are designed to run smoothly when noise is under control, AR-SI automatically determines whether a trace is erroneous or correct by checking the smoothness of the system's behavior. Theisslet et al. [105] propose an approach that reports anomalies in multivariate time series, which point the expert to potential faults.

Chen et al. [30] build models by combining mutation testing and machine learning: they generate faulty versions (mutants) of the tested system and then learn SVM-based models using supervised learning over the resultant data traces corresponding to system execution. They evaluate on a model of a physical water sanitation plant. Ghafouri et al. [49] show that common supervised approaches in this context are vulnerable to stealthy attacks. An unsupervised technique [59] evaluated on the same treatment plant model trains a Deep Neural Net (DNN) to identify outliers. Ye et al. [117] use a multivariate quality control technique to detect intrusions by building a long-term profile of normal activities in information systems and using the norm profile to detect anomalies.

Other approaches target the detection of particular attack classes specifically. Choi et al. [33] present a technique that infers control invariants to identify external physical attacks against robotic vehicles. Alippi et al. [18] learn Hidden Markov Models of highly correlated sensor data that are then used to find sensor faults. Abbaspour et al. [6] train adaptive neural networks over



faults injected into sensor data to detect fault data injection attacks in an unmanned aerial vehicle.

Hutchison et al. [57] outline a framework for automated robustness testing of autonomy systems. Abdelgawad et al. [8] introduce a systematic model-based testing approach to evaluate the functionality and performance of a real-time adaptive motion planning system. Menghi et al. [84] propose an automated approach to translate CPS requirements specified in a logic-based language into test oracles specified in Simulink's simulation language for CPSs.

The oracle problem Fully automated testing for CPSs requires oracles that can determine whether a given CPS behaves correctly for a given set of inputs [22]. In typical research and practice, domain experts manually provide CPS oracles in the form of a set of partial specifications, or assertions [10, 63, 76, 78, 94, 122]. However, manually writing such specifications is tedious, complex, and error-prone [50, 82].

A number of techniques infer invariants or finite-state models describing correct software behavior by performing what is known as dynamic specification mining. Existing dynamic specification mining techniques can be classified into four categories based on the kind of models that they produce: data properties (a.k.a. invariants) [37, 42, 51, 89, 90], temporal event properties [25, 27, 72, 115], timing properties [91, 100], and hybrid models [17, 71, 91] that combine multiple types of model. These techniques are generally poorly suited to the CPS context. Most require source code access or instrumentation, and none are suitable for time series data. Techniques like Daikon [42] and its numerous successors (e.g., DySy [37], SymInfer [90], Nguyen et al. [89], or Dinv [51], among others) learn source- or method-level data invariants rather than models of correct execution behavior. Techniques like Texada [72] and Perracotta [115] do learn temporal properties between events but do not model or learn temporal data properties, a key primitive in CPS execution (Artinali [17] comes closest to this goal, learning event ordering and data properties within an event).

As another way of approaching the oracle problem for CPSs, studies have used metamorphic testing to observe the relations between the inputs and outputs of multiple executions of a CPS [74, 106, 120]. Lindvall et al. [74] exploit tests with the same expected output according to a given model to test autonomous systems. Zhou and Sun [120] use metamorphic testing to specifically detect software errors from the LiDAR sensor of autonomous vehicles. Tian et al. [106] introduce DeepTest, a testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles. As an oracle, they use metamorphic testing by checking that properties like the steering angle of an autonomous vehicle remain unchanged in different conditions such as different weather or lighting.

Automated test generation Automated test suite generation is an integral part of software test automation, which in turn significantly improves software quality [67]. Automated test suite generation for CPSs includes numerous model-based approaches [80, 85, 113]. These approaches require a model of the system in a particular format (e.g., Simulink or MATLAB models) to generate a set of test cases that reach the highest coverage of the model. In the absence of such models, search-based techniques have shown promise in automated test suite generation for CPSs [50, 54, 108, 109].

Search-Based Software Testing (SBST) is a method for automated test generation based on


optimization using meta-heuristics [15, 81]. SBST approaches require a fitness function that has a crucial impact on their performance [15, 99]. In a study on Java programs, Salahirad et al. [99] showed that fitness functions that thoroughly explore system structure should be used as primary generation objectives, supported by secondary fitness functions that explore orthogonal, supporting scenarios. Fraser et al. [45] propose a novel paradigm in which whole test suites are evolved with the aim of covering all coverage goals at the same time while keeping the total size as small as possible.

Arrieta et al. [19] propose a search-based approach that aims to cost-effectively optimize the test process of CPS product lines by prioritizing the test cases that are executed in specific products at different test levels. By applying SBST to automated driving controls, Gladisch et al. [50] show that SBST is capable of finding relevant errors and providing valuable feedback to the developers, but requires tool support for writing specifications. Bagschik et al. [20] propose generating traffic scenes in natural language as a basis for scenario creation for automated vehicles. Similarly, Gambi et al. [46] recreate real car crashes as physically accurate simulations in an environment that can be used for testing self-driving car software.

To take into account the uncertainty that is unavoidable in the behaviors of CPSs at various testing phases, including test generation, Ali et al. [16] propose uncertainty-wise testing, arguing that uncertainty (i.e., lack of knowledge) in the behavior of a CPS, in its operating environment, and in their interactions must be explicitly considered during the testing phase.

Mutation Testing Mutation testing is a fault-based testing technique that provides a testing criterion called the mutation score. The mutation score can be used to measure the effectiveness of a test set in terms of its ability to detect faults [60]. Achieving higher mutation scores improves fault detection significantly [92]. On safety-critical systems, mutation testing can be effective where traditional structural coverage analysis and manual peer review have failed [21].
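For reference, the mutation score of a test suite T is commonly defined as the fraction of non-equivalent mutants that T kills (the exact treatment of equivalent mutants varies between studies):

```latex
\text{mutation score}(T) = \frac{\lvert \{\text{mutants killed by } T\} \rvert}{\lvert \{\text{all mutants}\} \rvert - \lvert \{\text{equivalent mutants}\} \rvert}
```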

2.2 Background

ArduPilot The open-source ARDUPILOT project (http://ardupilot.org), written in C++, uses a common framework and collection of libraries to implement a set of general-purpose autopilot systems for use with a variety of vehicles, including, but not limited to, submarines, helicopters, multirotors, and aeroplanes. ARDUPILOT is extremely popular with hobbyists and professionals alike. It is installed in over one million vehicles worldwide and used by organizations including NASA, Intel, and Boeing, as well as many higher-education institutes around the world (http://ardupilot.org/about).

I use ARDUPILOT as one of my case studies because it is highly popular and open source, has a rich version-control history containing over 30,000 commits since May 2010, and follows consistent bug-fix commit description conventions. ARDUPILOT has been widely used in studies on CPSs as it represents a fairly complex open-source CPS [9, 55, 74, 111, 122], and contains 300,000 lines of code (measured using SLOC).

To facilitate rapid prototyping and reduce the costs of whole-system testing, ARDUPILOT offers a number of simulators for most of its vehicles (excluding submarines). In general, those


platforms simulate the dynamics of the vehicle under test, feed artificial sensor values to the controller, and relay the state of its actuators to the physics simulation. Hardware-in-the-loop (HIL) simulators are used to perform testing on a given flight controller hardware device by directly reading from and writing to it. In contrast, software-in-the-loop (SITL) simulators test a software implementation of the flight controller by running it on a general-purpose computer.
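As an illustration of how a test harness might observe such a SITL run, the sketch below connects to a locally running ArduPilot SITL instance over MAVLink using the pymavlink library and records a short position trace. The connection string, port, and choice of message type are assumptions about a typical local setup, not part of ARDUPILOT itself.

```python
from pymavlink import mavutil

# Connect to a SITL instance assumed to be listening on the default local TCP port.
conn = mavutil.mavlink_connection("tcp:127.0.0.1:5760")
conn.wait_heartbeat()  # block until the (simulated) autopilot is up

# Record a crude execution trace from position telemetry.
trace = []
for _ in range(100):
    msg = conn.recv_match(type="GLOBAL_POSITION_INT", blocking=True, timeout=10)
    if msg is None:
        break
    trace.append((msg.time_boot_ms, msg.lat, msg.lon, msg.relative_alt))

print(f"collected {len(trace)} telemetry samples")
```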

Robot Operating System (ROS) The Robot Operating System (ROS) [95] is a flexible framework for writing robot software (https://ros.org). It is a collection of tools, libraries, and conventions that aim to simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms. In ROS, nodes are processes that perform computation, and they communicate with each other by passing messages. A node sends a message by publishing it to a given topic, which is simply a string such as "odometry" or "map". A node that is interested in a certain kind of data will subscribe to the appropriate topic. There may be multiple concurrent publishers and subscribers for a single topic, and a single node may publish and/or subscribe to multiple topics. In general, publishers and subscribers are not aware of each other's existence.
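The publish/subscribe model can be illustrated with a minimal ROS 1 (rospy) node. The topic names below reuse the "odometry" and "map" examples from the text, and std_msgs/String is used purely for brevity; a real robot would use richer message types such as nav_msgs/Odometry.

```python
#!/usr/bin/env python
# Minimal illustrative ROS 1 node: subscribes to one topic and publishes to another.
import rospy
from std_msgs.msg import String

def on_odometry(msg):
    # Invoked whenever any node publishes to the "odometry" topic.
    rospy.loginfo("received odometry message: %s", msg.data)

def main():
    rospy.init_node("example_node")
    rospy.Subscriber("odometry", String, on_odometry)
    pub = rospy.Publisher("map", String, queue_size=10)

    rate = rospy.Rate(1)  # publish once per second
    while not rospy.is_shutdown():
        pub.publish(String(data="map update"))
        rate.sleep()

if __name__ == "__main__":
    main()
```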

ROS is a relatively young framework (first released in 2009), and is currently used by thousands of people around the world; ROS has more than 34,000 registered users on ROS Answers, the main Q&A platform for ROS users (http://download.ros.org/downloads/metrics/metrics-report-2019-07.pdf). It follows an annual release model that is both similar to and linked to Ubuntu, and to this day, there have been 12 official, released ROS distributions (e.g., Lunar, Kinetic, and Jade). ROS is designed with the purpose of encouraging collaborative robotics software development by allowing robotics developers to build upon each other's work [4].



3 Challenges of Testing Robotic Systems

This section describes parts of the completed work of this thesis proposal, as well as proposed work. As described in Chapter 1, robotics and cyberphysical systems have features such as non-deterministic behavior and noisy sensors that make them different from conventional software systems. These features can specifically create challenges for automated testing approaches. Even though many studies have focused on understanding the challenges of testing in conventional software systems, to our knowledge, no such study has been done on robotics.

To better understand the state of (automated) testing in robotics, my collaborators and I first conducted a series of qualitative and quantitative studies. We first identified the challenges of (automated) testing in the field of robotics, and the testing practices currently being used in this field, by conducting a series of qualitative studies with robotics practitioners. In this study, we identified nine challenges that robotics practitioners face while testing their robotic systems.

Secondly, we investigated the potential impact of using low-fidelity software-based simulation on exposing failures in robotics systems by conducting a case study. In this study, we showed that low-fidelity simulation can be an effective, low-cost approach to detecting bugs and errors in robotic systems.

Finally, as the features of SITL simulators highly impact the automated testing of these systems, I propose to conduct a large-scale survey to identify the features in robotics simulators that are the most important for automated testing, and the challenges of using these simulators. The results of this study can encourage simulator designers and developers to prioritize features that make a simulator more suitable for automated testing of robotic systems.

3.1 Testing in robotics: practices and challenges

As described in Chapter 2, unique features of robotics and cyberphysical systems, such as interaction with the real world through noisy sensors and actuators, introduce challenges to automated testing and validation of these systems. However, no prior studies have identified testing practices and challenges for robotic systems. Our goal in this study is to gain an in-depth understanding of existing testing practices and challenges within the robotics industry. We conducted a series of semi-structured interviews with robotics practitioners from a diverse set of companies. The results of this study are published in the International Conference on Software Testing, Verification and Validation (ICST 2020) [11].

Interviews are useful instruments for getting the story behind a participant's experiences, acquiring in-depth information on a topic, and soliciting unexpected types of information [83, 102]. We developed our interview script by performing a series of iterative pilots.


Table 3.1: Sample questions on the interview script.

Practice
• What are all the different types of testing you do?
• Can you describe your test running/writing process?
• How much of your testing is done for certification?
• Which types of tests find the most problems?

Testing Challenges
• What is difficult about writing tests?
• Have these difficulties ever made you give up on writing the tests at all?
• Is there any part of writing tests that is not difficult?
• What types of tests do you have the most difficulty running?
• In your experience, is there anything that helps with making it easier to run tests?
• For your tests that are not fully automated, why are they not?
• What tools/frameworks/techniques do you use to simplify running tests?
• Do you use simulation?

General
• What do you think is the most important bottleneck in the way of testing in robotics?
• How do you think the difficulties of testing in robotics differ from your other experiences in other software development domains?

We recruited our participants through a variety of means. Our goal was to select participants from a broad range of positions and to sample across a diversity of industries, company sizes, and experience. We interviewed 12 robotics practitioners with a variety of backgrounds and experiences, representing 11 robotics companies and institutions ranging from small startups to large multi-national companies. Our participants cover a variety of educational backgrounds such as computer science, mechanical engineering, math, and physics, and their institutional roles include developer, test engineer, project manager, and CTO.

Interviews and coding We conducted semi-structured interviews that lasted between 30 and 60 minutes over phone, video chat, or in person. We prepared an interview script with detailed questions providing insight into our research questions. A subset of the questions on the interview script is presented in Table 3.1. However, we only used the script to guide the interviews. We adjusted interview questions based on the experience of the participant to gain a deeper understanding of their testing practices and challenges. We took notes on interviewee responses and recorded the interviews with their consent to validate our notes. We then used a grounded, iterative approach to code our notes. We first labeled responses based on their relevance to our research questions. Then, we iteratively coded the notes based on common themes, discussed the codes, and redefined them.

Validation To validate the results of our study and conclusions, we sent a full draft of the results to our participants. We asked participants to inform us of their level of agreement with our conclusions and to provide their thoughts on our results. In total, six of the participants responded to our request. Four responded in total agreement with the results. The other two participants that responded provided specific feedback on our interpretation of their responses, and we incorporated their feedback into the final version of the paper.


Table 3.2: A summary of the challenges of designing, running, and automating tests for robotics that we identified based on participant responses.

C1 (Unpredictable corner cases): The challenge of attempting to anticipate and cover all possible edge cases within a large (and possibly unbounded) state space when designing tests.

C2 (Engineering complexity): A disproportionate level of engineering effort is required to build and maintain end-to-end test harnesses for robotic systems with respect to the benefit of those tests.

C3 (Culture of testing): The challenge of operating within a culture that places little value on testing and provides developers with few incentives to write tests.

C4 (Coordination, collaboration, and documentation): A lack of proper channels for coordination and collaboration among multiple teams (especially software and hardware teams), and a lack of documentation.

C5 (Cost and resources): The cost of running and automating tests in terms of human hours, resources and setup, and running time.

C6 (Environmental complexity): The inherent difficulties of attempting to account for the complexities of the real world when simulating, testing, and reproducing full-system behavior.

C7 (Lack of oracle): The challenge of specifying an oracle that can automatically distinguish between correct and incorrect behavior.

C8 (Software and hardware integration): Difficulties that arise when different software and hardware components of the system are integrated and tested.

C9 (Distrust of simulation): A lack of confidence in the accuracy and validity of results obtained by testing in simulation and synthetic environments, and a sole reliance on field testing.

Results After iteratively coding the interviews, we identified 12 testing practices that were mentioned by our participants. These practices include a wide range of manual and automated testing approaches, such as unit testing for automated testing of small individual code-level software components, and manual field testing of the system in a real-world environment that shares similarities with the deployed environment.

In addition, we identified nine challenges that roboticists face while designing, running, and automating tests for their systems, summarized in Table 3.2. Many of these challenges, such as unpredictable corner cases and the lack of an oracle, are well-known challenges of testing systems in other fields as well. However, features of robotic systems, such as interacting with the real world through sensors, make these challenges even harder to resolve. We identified three major themes


among the identified challenges: 1) real-world complexities, 2) community and standards, and 3) component integration. We supported these themes by showing quotes from our participants, and later speculated on the implications of each theme and provided suggestions for tackling their associated challenges.

One of the contributions of this study is identifying the challenges of testing created by real-world complexities. Software-based simulation could be a promising approach for abstracting away a number of these complexities and providing an opportunity to conduct large-scale tests on the system. However, as pointed out by C9, robotics practitioners generally distrust the results of executions in simulation because of the low fidelity of the simulated environment and its heavy abstraction of the real world.

3.2 Software-based simulation for testing

As described in Section 3.1, robotics practitioners generally distrust low-fidelity simulation, and find it ineffective in exposing robotics bugs. To evaluate the validity of this common belief, we conducted a case study on the ARDUPILOT system (presented in Section 2.2). This study is published in the International Conference on Software Testing, Verification and Validation (ICST 2018) [107].

We first identified potential bug-fixing commits within the version-control history of the ARDUPILOT project using a number of automated filters on the edited file types and keywords in the commit message. We manually inspected all 333 commits that passed this initial filtering stage and identified 228 bug-fixing commits.

After reaching a consensus on the list of bug-fixing commits, we manually inspected each commit to determine whether the bug can be reproduced in simulation and, if so, what the requirements are for triggering and detecting it. We labeled each historic bug based on the following seven questions:

1. Does triggering or observing the bug rely on physical hardware?

2. Is the bug only triggered when handling concurrent events?

3. Which kinds of input are required to trigger the bug?

4. At which stage in the execution does the bug manifest?

5. Is the bug only triggered under certain configurations?

6. Is the bug only triggered in the presence of certain environmental factors (e.g., wind, human presence)?

7. How does the bug affect the behavior of the system?

Answering these questions allows us to determine whether it is possible to trigger and manifest these bugs in low-fidelity SITL simulation. Requiring physical hardware, or the presence of certain environmental factors, for instance, makes triggering and manifesting a bug in low-fidelity simulation extremely difficult (if not impossible).

Results Our results, based on an empirical study of historical bugs in a real-world robotics system, dispute the common belief among robotics practitioners that the fidelity of simulators is


not sufficient to catch most of the bugs in robotics systems. We discovered that, contrary to our expectations, the majority of bugs can indeed be reproduced using SITL simulation approaches without the need for complex triggering mechanisms (e.g., environmental conditions, concurrent events, component failures). We found that only a small minority (approximately 10%) of bugs are dependent on particular environmental factors. Below, we discuss the findings of our analysis in terms of the proposed questions:

1. In total, only 10 of 228 bugs relied on the presence of physical hardware for their detection or observation.

2. We determined that only 13 out of 228 bugs (5%) require concurrent events in order to be triggered.

3. Mimicking continuous radio-controller (RC) inputs, usually provided by a human operator, is significantly more challenging than supplying the system with discrete, well-formed inputs (i.e., ground control system (GCS) commands, and preprogrammed missions). Encouragingly, our findings show that 165 of the 212 bugs do not rely on continuous input.

4. We observed that almost 80% of bugs (184 out of 228) occur during the normal operation of the robot.

5. We determined that 81 of 228 bugs depend on either a particular static (53) or dynamic (20) configuration, or a combination of both (8).

6. To our surprise, once again, we discovered that only 22 of 228 bugs depend on environmental factors.

7. We found that 205 bugs resulted in observable, behavioral changes to the program, and the rest result in corruption of log files or a system crash.

The findings of our study strongly support the idea of applying cheap, simulation-based testing approaches to the problem of detecting bugs in robotics systems. However, we also found that continuous events, in the form of radio-controller inputs, and specific configurations are required to trigger a large number of bugs. We believe that both of these challenges, whilst difficult, can be overcome by developing specialized testing methods and by leveraging and building upon existing knowledge in, e.g., testing of highly configurable systems [64].

3.3 Simulators for testing

Our case study on the ARDUPILOT system described in Section 3.2 showed that low-fidelity SITL simulation can be used for systematic, large-scale automated testing of CPSs to discover a large number of bugs in the system, and can take us one step closer to the automated testing pipeline in Figure 1.1.

However, in the qualitative study presented in Section 3.1, our participants referred to a number of challenges of using software-based simulators, and discussed them in depth. These challenges included the low fidelity of simulators, their hard-to-use interfaces, and the cost and effort required to set them up for a particular system.

I propose to conduct a large-scale survey with robotics practitioners to explore the challenges that prevent developers from using SITL simulators for automated testing, and to identify simulator


features that are the most useful for automated testing. In this study, I will design a survey mainly focusing on the following questions:

1. Do robotics developers perceive inherent value in simulation-based testing?

2. What are the most common problems associated with simulation-based testing?

3. How could software-based simulation platforms be improved to encourage developers to use simulation for software-based testing?

4. Are developers using software-based simulation for automated testing?

5. What prevents developers from using software-based simulation to automate their testing?

The questions on the survey will include both multiple-choice and open-ended questions. To analyze the open-ended questions, I will use an iterative coding approach to summarize common themes in the responses. I will distribute this survey among robotics practitioners through multiple platforms including Twitter, Reddit, and robotics forums and mailing lists.

The results of this study can encourage simulator developers and designers to implement and enhance features in SITL simulators that improve the applicability of automated testing.


4 Automated Oracle Inference

This section describes parts of the completed work of this thesis proposal. As presented in Figure 1.1, an oracle that can automatically distinguish between correct and incorrect behaviors of the system is essential to achieve an automated testing pipeline. However, generating an oracle is a well-known problem in software engineering [22] and, as presented in Section 3.1, is one of the challenges of testing robotics systems. Although many approaches have been proposed to address the oracle problem for pure software (cyber) systems [10, 63, 76, 78, 94, 122], they are not appropriate for robotics and cyberphysical systems (CPSs) for the following reasons:

1. CPSs often contain proprietary third-party components (such as cameras or other sensors) for which source code is unavailable, and so techniques should minimize or avoid relying on source code access.

2. CPSs are inherently non-deterministic due to noise in both their physical (e.g., sensors, actuators, feedback loops) and cyber components (e.g., timing, thread interleaving, random algorithms) and may react to a given command in a potentially infinite number of subtly different ways that are considered to be acceptable. That is, for a given input and operating environment, there is no single, discrete response that is correct, but rather an envelope of responses that are deemed correct. And so techniques should be robust to small, inherent deviations in behavior.

3. The CPS may respond differently to a given instruction based on its environment, configuration, and other factors (i.e., its operating context). For example, a flying copter may refuse to fly to the specified point if its battery is depleted. And so, techniques must be capable of capturing contextual behaviors for a given command.

To tackle the challenges mentioned above, I propose Mithra: a novel oracle learning approach, which identifies patterns of normal (common) behaviors of the system by applying a multi-step clustering approach to the telemetry logs collected over many executions of the system in simulation. Mithra uses the identified clusters as the core of its oracle and determines the correctness of the system's executions based on their similarity to the identified clusters. It tackles all three of the challenges mentioned above, as it does not require source code access, avoids overgeneralization and is robust to small deviations from expected behavior, and identifies contextual behaviors.

4.1 Background

Statistical machine learning approaches deal with the problem of finding a predictive function based on data. These approaches are applied to a collection of instances of the data, which acts


as the input to the learning process (i.e., training data). What the algorithm can learn from the training data varies between approaches [121].

Supervised learning methods take a collection of training data with given labels (e.g., "male" and "female"), and learn a predictive model y = f(x), which can predict the label (y) of a given input (x). Depending on the domain of the label y, supervised learning problems are further divided into classification and regression. Classification is the supervised learning problem with discrete classes of labels, while regression is applied to continuous labels. Support vector machines (SVM), decision trees, linear and logistic regression, and neural networks are all examples of supervised learning algorithms [70].

Unsupervised learning techniques work on unlabeled training data. Common unsupervised learning tasks include: (1) clustering, where the goal is to separate the n instances into groups, (2) novelty detection, which identifies the few instances that are very different from the majority, and (3) dimensionality reduction, which aims to represent each instance with a lower-dimensional feature vector while maintaining key characteristics of the training sample. Clustering approaches such as k-means [75], k-medoids [65], and hierarchical clustering [61] in general split the training data into k clusters, such that instances in the same cluster are similar, according to a similarity measure, and instances in different clusters are dissimilar. The number of clusters k may be specified by the user, or may be inferred from the training sample itself [114].
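As a generic illustration of clustering (unrelated to any specific system studied here), the snippet below groups toy feature vectors into two clusters with scikit-learn's k-means implementation; the data and the choice of k = 2 are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy two-group data, e.g., summary features extracted from execution traces.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
               rng.normal(5.0, 1.0, size=(50, 3))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per learned cluster
print(km.labels_[:5])        # cluster assignment of the first few instances
```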

The distance measure selected for learning approaches highly impacts their performance, and depends on the type of data, the goal of learning, and the distribution of the data. A common distance metric is the Euclidean distance (i.e., L2 norm) between two datapoints, which is inexpensive to compute. In this chapter, we use Dynamic Time Warping [23] and Eros [116] to measure the similarity between two time series based on their shape.
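As a concrete illustration of shape-based comparison, the following is a minimal sketch of the classic dynamic-programming formulation of DTW [23] for univariate series; it is a simplified stand-in for the multivariate implementations used in this chapter.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Two series with the same shape but shifted in time end up close under DTW.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
print(dtw_distance(x, y))  # 0.0: the shapes align perfectly after warping
```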

4.2 Approach

Given a CPS that accepts a vocabulary of discrete commands and produces a log of its telemetry (e.g., ArduCopter), the goal is to learn a set of contextual behaviors for each of those commands from a training set of telemetry logs. To do so, we created Mithra, which extracts the set of relevant execution traces for each command from a set of telemetry logs, and applies a novel, three-step clustering process to those traces. Using its learned behavioral clusters, Mithra constructs a classifier for each command that marks execution traces as either CORRECT or ERRONEOUS based upon their similarity to the contextual behaviors represented by those clusters.

Mithra applies time series clustering to a training set of execution traces collected from many normal command executions for a CPS to learn the set of unique, contextual behaviors for those commands. These traces can be collected by executing different scenarios or missions in simulated environments or in real-world execution, and should cover a diverse set of system behaviors.

Studies have shown that approaches for clustering and classifying time series that are based on comparing differences in shape are often superior in terms of performance to those that compare differences in time [13, 96]. As such, Mithra clusters execution traces based on their overall shape, using an approach inspired by three-step clustering of large time series datasets [12]. Figure 4.1 provides a high-level overview of Mithra's clustering approach. During the first step (preclustering), Mithra computes a set of initial clusters, referred to as preclusters, based on a lower-fidelity approximation of trace similarity. In the second step (purifying), each precluster is split into a smaller set of subclusters based on the Eros similarity of traces within that precluster. In the final step (merging), Mithra creates the set of behavioral clusters by merging subclusters with highly similar centroids based on their dynamic time warping (DTW) distance. Finally, Mithra computes µβ and σβ for each behavioral cluster β based on the distance from the traces within β to the centroid of that cluster, which Mithra uses to construct the decision boundary for β. Every step in this approach is designed to improve the overall accuracy of the behavioral clusters while keeping the approach scalable to a large number of traces.

Figure 4.1: An overview of Mithra's three-step clustering approach of preclustering, subclustering, and merging. Solid lines represent individual traces, and dashed lines represent cluster centroids.
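As a rough sketch of the final step above, the snippet below computes a centroid and the µβ and σβ statistics for one behavioral cluster. It is a simplification under stated assumptions: the centroid is taken to be the medoid under DTW distance (reusing dtw_distance from the earlier sketch), which is not necessarily how Mithra constructs its centroids.

```python
import numpy as np
from typing import List, Tuple

def cluster_boundary_stats(traces: List[np.ndarray]) -> Tuple[np.ndarray, float, float]:
    """Pick a centroid for one behavioral cluster and compute mu/sigma of member distances."""
    n = len(traces)
    # Pairwise DTW distances between all member traces.
    dists = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dists[i, j] = dists[j, i] = dtw_distance(traces[i], traces[j])
    medoid = int(np.argmin(dists.sum(axis=1)))        # member closest to all others
    member_dists = np.delete(dists[medoid], medoid)   # distances from members to the centroid
    mu, sigma = float(member_dists.mean()), float(member_dists.std())
    return traces[medoid], mu, sigma
```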

The behavioral clusters detected by Mithra for each command are representative of the qualitatively different modes of behavior observed for that command. These include both behaviors that are frequently observed and assumed to be correct (e.g., clusters with more than one hundred traces) and behaviors that are rarely observed and suspected to be erroneous (e.g., clusters with fewer than five traces).

Given a previously unseen execution trace τ for a command, Mithra first finds the behavioral cluster β∗τ ∈ BC that most closely resembles τ, based on the DTW distance between τ and the centroid of each cluster:

\beta^*_\tau = \arg\min_{\beta \in BC} DTW(\tau, c_\beta)

Mithra then uses β∗τ to predict the label ℓτ for that trace as:


\ell_\tau =
\begin{cases}
\text{ERRONEOUS} & \text{if } |\beta^*_\tau| < \rho \\
\text{ERRONEOUS} & \text{if } DTW(\tau, c_{\beta^*_\tau}) > \mu_{\beta^*_\tau} + \theta\,\sigma_{\beta^*_\tau} \\
\text{CORRECT} & \text{otherwise}
\end{cases}

where |β| is the number of traces within β, ρ ∈ Z+ is the rarity threshold, and θ ∈ R+ is the acceptance rate. If β∗τ contains fewer than ρ traces, it is assumed to represent a rare and thus erroneous behavior, and so τ is marked as ERRONEOUS. In the more likely case where β∗τ contains at least ρ traces, β∗τ itself is assumed to represent a common and thus correct behavior. In that case, Mithra uses the precomputed DTW distance to determine whether τ lies within the decision boundary of β∗τ, and if so, labels it as CORRECT. The acceptance rate θ is used to alter the extent of the decision boundary and provides the user with a means of controlling the precision-recall tradeoff of the classifier to their preferences.
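The decision rule above translates almost directly into code. The sketch below is illustrative rather than Mithra's actual implementation: BehavioralCluster and classify_trace are hypothetical names, dtw_distance reuses the earlier sketch, and the defaults for ρ and θ merely echo values mentioned in this chapter (clusters with fewer than five traces treated as rare; θ = 1.5 in the evaluation).

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class BehavioralCluster:
    centroid: np.ndarray  # representative trace c_beta
    mu: float             # mean DTW distance of member traces to the centroid
    sigma: float          # standard deviation of those distances
    size: int             # number of traces in the cluster

def classify_trace(trace: np.ndarray, clusters: List[BehavioralCluster],
                   rho: int = 5, theta: float = 1.5) -> str:
    """Label a trace against the closest behavioral cluster, per the decision rule above."""
    # Closest cluster by DTW distance between the trace and each cluster centroid (the argmin).
    best = min(clusters, key=lambda c: dtw_distance(trace, c.centroid))
    dist = dtw_distance(trace, best.centroid)
    if best.size < rho:                       # rare behavior, suspected erroneous
        return "ERRONEOUS"
    if dist > best.mu + theta * best.sigma:   # outside the cluster's decision boundary
        return "ERRONEOUS"
    return "CORRECT"
```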

4.3 Evaluation

We evaluated Mithra on the ArduCopter system described in Section 2.2, which has been widely used in studies on CPSs as it represents a fairly complex open-source CPS. Evaluating Mithra on a CPS requires a set of traces to be used as training data, and a set of ground-truth ERRONEOUS and CORRECT traces to be used as evaluation (test) data. As a source of training data for Mithra, we recorded traces for 2500 randomly generated missions in simulation. To provide idempotency, we use Docker (https://docker.org) to run each mission in simulation in its own sandboxed container.

As the source of evaluation data, we obtained 24 bugs for the ArduPilot system: 11 are real-life bugs, and 13 are inspired by real-life bugs. For each bug scenario, we used a hand-written mission template, tailored to that scenario, to randomly generate 10 missions that trigger and manifest the bug. After running each mission, we collect line coverage of the execution to ensure that the executed mission does in fact execute the lines of interest (i.e., the faulty lines). Finally, we use the generated missions to construct a balanced evaluation set of correct and erroneous traces.

We publicly release our dataset of real-world ArduPilot bugs, mission templates, and generated traces (https://bit.ly/39vlygI) to allow others to reproduce and extend our experiments. Additionally, we believe that this dataset can serve as a valuable benchmark to the community for future CPS fault detection studies.

Our research questions and results are as follows:

RQ1 (Accuracy) How accurately does Mithra's clustering method distinguish between correct and erroneous traces?
To assess the accuracy of Mithra, we classify every trace in the evaluation set. As the acceptance rate is increased, recall decreases and precision increases, resulting in a more conservative model that detects fewer erroneous traces overall, but ensures that traces marked as erroneous are more likely to be truly erroneous. Naturally, the overall accuracy of Mithra remains fairly steady as the acceptance rate is increased, demonstrating the tradeoff between the number of false negatives and false positives. Mithra (θ = 1.5) successfully detects 56% of erroneous traces and marks 75% of truly correct traces as correct.

RQ2 (Conceptual Validation) How do Mithra's individual steps influence the overall accuracy of the classifier?
Each of the three steps of Mithra's clustering approach is designed to improve the accuracy of its detected clusters. To evaluate the individual impact of those steps, we use the output produced by each step (i.e., preclusters, subclusters, and behavioral clusters) as the input to the classifier, which we use to measure the performance of each step. Mithra's precision and accuracy are significantly improved by its identification of subclusters (i.e., purifying). Merging effectively reduces the number of reported clusters while ensuring that information is preserved. By first applying preclustering to a low-resolution version of its training data, Mithra significantly improves its overall precision and accuracy.

RQ3 (Effectiveness) How does the classification accuracy of Mithra compare to AR-SI [55], a state-of-the-art oracle learning approach for CPSs?
To compare our approach with the state of the art, we implement He et al.'s approach for creating autoregressive system identification (AR-SI) oracles for CPSs [55], and compare its performance on the evaluation data against Mithra. The median precision, recall, and accuracy of AR-SI are 62.2%, 39.0%, and 57.8% respectively, compared to Mithra's 74.7%, 56%, and 69.3%. Mithra significantly outperforms AR-SI in terms of precision, recall, and accuracy, and identifies more erroneous traces while reporting fewer false positives.

Overall, we showed that Mithra achieves the highest median accuracy of 69.3%, successfully detecting 56% of erroneous traces and labeling 75% of truly correct traces as correct. Mithra significantly outperforms AR-SI on accuracy, precision, and recall. In addition, we showed that Mithra labels traces more consistently than AR-SI and is less affected by randomness.

4.4 Limitations and future work

Our approach, like others, treats anomalous behavior as erroneous [17, 31, 42, 72, 91, 117]. However, anomalous behavior also includes corner cases and rare behaviors that are not observed during training, which are not necessarily erroneous. Although labeling exceptional-yet-correct behaviors as erroneous leads to more false positives, it also informs developers of under-tested functionality, and is in general valuable to the testing process. In addition, Mithra reports frequently-observed-but-erroneous behaviors as correct. We believe that frequently observed behaviors are unlikely to be erroneous, as developers would be able to easily spot them.

Like all dynamic specification mining techniques, the performance of our approach depends on the data it is provided [17, 42, 72, 91]. If the provided traces do not offer sufficient coverage of the unique behaviors of the robot, our approach will fail to identify those behaviors. In the next chapter, I propose approaches to generate high-quality tests, which can eventually be used to generate training traces with higher coverage of the unique behaviors of the system.


In the future, I propose to apply Mithra to at least one more CPS, possibly a ROS system, to minimize the threat of solely evaluating on a single system and to show Mithra's effectiveness on ROS systems.


5 Automated Test Generation

In this section I present ongoing and proposed work. One of the steps in the automated testing pipeline presented in Figure 1.1 is automated test input generation. Overall, my goal is to identify the greatest number of faults in the system using automated testing in a simulated environment. Automated test suite generation is an integral part of the test automation process, significantly affecting how well the testing process can find all the flaws in the system (i.e., software test quality) [67]. Therefore, an automated software testing pipeline should include automated test suite generation.

5.1 Motivation

The input space of a CPS is massive (e.g., the system's configuration, environment, commands, and parameters), making it infeasible to cover exhaustively [97]. Therefore, searching the input space to automatically construct meaningful test cases is one of the most important needs in the CPS testing domain [50].

Automated test input generation approaches can broadly be divided into two categories: 1) model-based testing, which provides techniques for automatic test case generation using models extracted from software artifacts, and 2) search-based software testing (SBST), which is a method for automated test generation based on optimization using meta-heuristics [81]. As mentioned in Section 2.1, many model-based approaches have been proposed for automated test generation for CPSs [80, 85, 113]. However, these approaches require a model of the system (e.g., Simulink models or finite state machines), which may not be available for many CPSs. SBST approaches, on the other hand, do not require models of the system, and have been shown to be effective in finding errors in CPSs (e.g., automated driving control applications [50]).

However, an evolutionary (search-based) approach to test generation requires a fitness function that can measure the quality of generated tests and evolve them towards higher quality. The fitness function can highly affect the SBST process and its effectiveness [15]. I hypothesize that conventional metrics such as code and branch coverage may not be the best measures of test quality in CPSs, as these systems have special structures and features. For example, due to the inevitable non-determinism and noise in these systems, uncertainty-wise model-based testing solutions have been proposed [16], which explicitly aim to generate tests with higher coverage of known uncertainties. In addition, scenarios that execute faulty code may not result in faulty behavior unless a specific condition arises. This phenomenon is called Failed Error Propagation (FEP), and it reduces the effectiveness of testing [34]. Therefore, generating tests with high code coverage may not lead to the manifestation of more failures in systems with high levels of FEP.


Table 5.1: Example measurements based on three metrics (M) on three test suites (TS), and the ground-truth test suite quality (Q). The measurements are normalized to µ = 0.5, σ² = 0.16. In this example, metric M2 has the lowest distance from the ground-truth test suite quality measure, and is the most suitable to be used as a metric for test suite quality.

Metrics                 TS1    TS2    TS3
Metric M1               1.00   0.50   0.00
Metric M2               0.00   1.00   0.50
Metric M3               1.70   0.20   0.23
Test Suite Quality Q    0.23   1.70   0.20

I propose to identify a scalable, cost-efficient fitness measure for the test suite quality of CPSs, and to investigate different evolutionary (search-based) automated test input generation methods that specifically target corner cases and rarely observed behaviors to expose more bugs and failures. I propose to answer two research questions:

RQ1: What metrics can best evaluate the quality of test suites in CPSs?

RQ2: What is the performance of different automated test generation approaches at exposing faulty behaviors in CPSs?

By responding to these research questions, I will identify quality measures for CPS test suites, examine the severity of FEP in CPSs, evaluate test generation approaches on these systems, and finally propose test generation approaches capable of generating higher quality test suites.

5.2 Methodology

RQ1 (Quality Metrics) In this work, I define the quality or effectiveness of a test suite as its ability to reveal faults in the system [67]. The first research question focuses on identifying the ability of different metrics to measure the quality of CPS test suites. The goal of this RQ is to identify metrics that can best represent a test suite's quality. In other words, given a set of test suites TS and a ground-truth test suite quality measure Q, we look for a metric M∗ from the set of metrics M such that:

M^* = \arg\min_{M_c \in M} \delta(Q, M_c, TS)

where δ measures the distance between two metrics on a set of test suites. A common distance measure between two metrics is the Euclidean distance (i.e., L2 norm):

\delta(M_1, M_2, TS) = \sqrt{\sum_{TS_a \in TS} \big(M_1(TS_a) - M_2(TS_a)\big)^2}

The Euclidean distance is inexpensive to compute, and the distance between any two test suites is not affected by the addition of a new test suite to the study. However, the distances can be greatly affected by differences in scale among the metrics [28]. Therefore, we first need to normalize the output of all metrics. In this study, I will use Z-score normalization [52]. Z-score normalization can be applied to raw scores gathered from different metrics, and takes into account both the mean value and the variability in a set of raw scores.


Table 5.1 presents an example of normalized measurements of three metrics on three test suites. In this case, metric M2 has the lowest Euclidean distance from the ground-truth test suite quality Q, and is the most suitable to be used as a metric to measure test suite quality in this system.
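As a concrete illustration of this selection procedure, the sketch below z-score normalizes each metric's measurements over the test suites and returns the metric with the smallest Euclidean distance to the ground-truth quality; the numbers reuse the values from Table 5.1 purely for illustration, and the result (M2) matches the example.

```python
import numpy as np

def zscore(v: np.ndarray) -> np.ndarray:
    """Z-score normalization: zero mean and unit variance across the test suites."""
    return (v - v.mean()) / v.std()

def best_metric(metrics: dict, quality: np.ndarray) -> str:
    """Return the metric with the smallest Euclidean distance to the ground-truth quality Q."""
    q = zscore(quality)
    return min(metrics, key=lambda name: np.linalg.norm(zscore(metrics[name]) - q))

# Measurements of three candidate metrics over three test suites (values from Table 5.1).
metrics = {
    "M1": np.array([1.00, 0.50, 0.00]),
    "M2": np.array([0.00, 1.00, 0.50]),
    "M3": np.array([1.70, 0.20, 0.23]),
}
quality = np.array([0.23, 1.70, 0.20])  # ground-truth test suite quality Q

print(best_metric(metrics, quality))  # -> M2, as in Table 5.1
```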

To conduct this study I need to assemble three pieces: first, I need a set of metrics M to evaluate. Second, I need to generate a set of test suites TS. Third, I need to be able to measure the ground-truth test suite quality Q. For the set of metrics, I will use a number of different metrics proposed in the literature, such as statement and model coverage and MC/DC, and possibly combinations of these metrics as suited. As test suites, I will use random or off-the-shelf test generation approaches to generate test suites. Finally, to be able to measure the quality of test suites for the ground truth, I require a set of CPS bugs and an accurate oracle that can correctly label whether an execution trace is erroneous or correct.

To create the set of bugs in a CPS we can use the same approach as the one described in Chapter 4: I can find historical bugs in the system. However, finding these bugs requires a lot of manual labor. Alternatively, we can inject bugs into the system. This process can be automated and is much faster. Even though injected faults may not be good representatives of real bugs in CPSs, and introduce a threat to the validity of the evaluation [68, 87], they can also be effective measures of the quality of test cases [21]. For this part of the work, I will use mutation score [60] as the ground-truth metric of test suite quality, where a test suite's score is based on the number of injected bugs it can expose.
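For reference, the sketch below shows the mutation-score computation assumed here: the fraction of injected bugs (mutants) that at least one test in the suite exposes. The run and is_erroneous callables are placeholders for executing a test against a mutant in simulation and for whichever ground-truth labeling option is ultimately chosen.

```python
from typing import Callable, Sequence

def mutation_score(mutants: Sequence[str], tests: Sequence[str],
                   run: Callable[[str, str], object],
                   is_erroneous: Callable[[object], bool]) -> float:
    """Fraction of injected bugs (mutants) exposed by at least one test in the suite."""
    killed = 0
    for mutant in mutants:
        # A mutant is "killed" if some test produces an execution labeled erroneous.
        if any(is_erroneous(run(mutant, test)) for test in tests):
            killed += 1
    return killed / len(mutants)
```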

To compute the mutation score for a given test suite, I require an accurate oracle that can act as ground truth and determine whether a test execution is faulty (i.e., exposes the injected bug). I can use the oracle created in Chapter 4 to label the test outputs. However, these oracles are not 100% accurate and are biased towards a specific approach. The evaluation of automated test generation should be done independently of how good the oracles are, as these two steps are orthogonal and can be replaced by another (better) approach at any time.

I have the following options to approach the problem of labeling test outputs as ground truth for computing the mutation score:

1. I can take an approach similar to the evaluation of test oracles in Chapter 4, where as ground truth I labeled all traces that executed the faulty code as erroneous, and correct otherwise. However, as mentioned, this is a simplifying assumption and a threat to the validity of the evaluation. I can limit the set of CPS defects to those that manifest if the faulty code is executed. In this case, I can automatically check whether executing a test covers the faulty line and consider it a fault-revealing test. The disadvantage of this approach is that it limits the CPS faults that can be used in the evaluation and is biased towards test suites with high code coverage.

2. For the CPS under test, I can use a set of specifications to determine the correctness of the system's behavior. If such specifications are already available I will use the provided specifications; otherwise, I will manually write these specifications or ask experts to write them. This approach can only be applied to fairly simple systems. Accurately specifying the behavior of a complex CPS is difficult [50]. Therefore, if I select this approach for generating ground-truth data, I have to limit the evaluation to fairly simple systems. However, this approach has the advantage of being consistent, reproducible, and unbiased.

3. I can ask a number of experts to manually label the system's executions and create the ground-truth data. The advantage of this approach is that it can be applied to complex systems. However, it requires the manual attention of experts, which may be biased and unreliable. In addition, since this is a very time-consuming process, it can only be used for a very small number of traces, and will result in very small test suites.

Depending on the CPSs available for evaluation, I will select one of the above options, and measure the quality of test suites generated by different approaches.

Note that by responding to RQ1 with this study, I can also examine the severity of FEP in CPSs. In other words, I can study how often the execution of faulty lines results in a failure. Understanding the severity of FEP in CPSs can guide future quality assurance tools and techniques.

RQ2 (Test Generation Performance) In the second research question, I will evaluate the performance of different test generation approaches on CPSs based on the metric(s) identified in RQ1, and propose novel approaches for generating higher quality test suites for CPSs. I will focus on search-based approaches since their application to CPSs has been promising [19, 20, 50, 54, 108, 109], and they do not require an accurate model of the system, the way model-based approaches do [80, 85, 113].

I will start by evaluating evolutionary methods such as genetic algorithms to generate test suites, and later extend those methods to generate higher quality test suites. Since the fitness function of SBST approaches has a significant impact on their performance [15, 99], I will substitute better fitness functions informed by the answer to RQ1. For instance, if the study in RQ1 shows that code coverage is a strong metric for measuring the quality of tests, I will propose a test generation approach that aims to maximize code coverage.
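As a rough illustration of the kind of search loop I have in mind, the following is a minimal genetic-algorithm sketch with a pluggable fitness function (which would be chosen based on the answer to RQ1); the mission encoding, operators, and parameter values are placeholders, not a committed design.

```python
import random
from typing import Callable, List

Mission = List[float]  # placeholder encoding of a test input (e.g., command parameters)

def evolve(fitness: Callable[[Mission], float],
           random_mission: Callable[[], Mission],
           pop_size: int = 20, generations: int = 50,
           mutation_rate: float = 0.1) -> Mission:
    """Minimal generational GA: truncation selection, one-point crossover, Gaussian mutation."""
    population = [random_mission() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children: List[Mission] = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + random.gauss(0.0, 1.0) if random.random() < mutation_rate else g
                     for g in child]                   # Gaussian mutation on each gene
            children.append(child)
        population = children
    return max(population, key=fitness)

# Example usage with an invented fitness (a real one would score a simulated execution).
best = evolve(fitness=lambda m: -sum(x * x for x in m),
              random_mission=lambda: [random.uniform(-10.0, 10.0) for _ in range(5)])
```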

I envision adapting ideas from model-based test generation approaches such as uncertainty-wise testing [16], as well as fuzzing techniques, to generate a more diverse set of test inputs (e.g., command parameters, configurations, or simulation environments) that specifically target corner cases and points of failure.

Time permitting, I will use the higher quality tests to refine the automated oracle clusters of Chapter 4. I believe that having a more diverse set of traces as training data for Mithra can result in higher accuracy of the behavioral clusters, which can lead to a more accurate oracle.


6 Proposed Timeline

I propose the following schedule with an expected defense date of May 2021. At the time of writing this thesis proposal I have completed and published the qualitative study described in Section 3.1, and the case study on ARDUPILOT bugs described in Section 3.2. I prepared the survey in Section 3.3, and have collected responses to the survey. However, I need to analyze the collected data and prepare the article to be submitted to a robotics venue.

I have implemented the automated oracle learning approach, Mithra, described in Section 4, and submitted it to FSE 2020. I have made initial investigations into the work proposed in Section 5.

• January-March 2020

Complete analysis of the data collected from the survey on robotics simulators and submit the results to IROS (deadline March 1).

Apply Mithra on a ROS system, and prepare the results to be submitted to FSE (deadline March 5).

• March-June 2020

Finish the proposal document, and carry out the thesis proposal.

Select a number of robotic systems as samples and prepare the engineering infrastructure needed to be able to run them in simulation.

Prepare infrastructure to generate tests for the selected systems.

Write specifications for the selected systems that can be used as oracles for the tests.

• June-July 2020

Prepare the infrastructure to be able to collect a number of quality metrics on a test suite, such as code coverage.

Prepare infrastructure to create mutants of the system with faulty behavior.

Create a diverse set of test suites to be executed on the systems.

• July-September 2020

Collect the results of comparing different quality metrics for test suites.

Write up the results to be submitted to a software engineering venue (possibly ICSE 2021).

• September-December 2020

Literature review and initial study on test generation approaches for robotics and cyberphysical systems.


Design and implementation of test generation for the selected systems.

• December 2020-March 2021

Conduct experiments by generating test suites with the purpose of exposing under-tested behaviors and detecting a higher number of failures.

Collect the results of the experiments and prepare an article to be submitted to a software engineering venue (possibly FSE 2021).

• March-May 2021

Finishing up all work in progress, and resubmitting possibly rejected papers.

Writing up the thesis document.

Prepare for defense.

• May 2021

Defend and graduate.


7 Conclusion

In summary, this thesis proposal covers a number of empirical and non-empirical studies with the purpose of improving automated testing of robotic and cyberphysical systems. The empirical studies identify the state and challenges of testing CPSs, while the non-empirical studies propose an automated testing pipeline that improves the quality assurance of these systems in simulated environments.

My work encourages more research and studies to be conducted in this field by identifying the specific challenges of testing CPSs in simulation and illustrating the advantages of automated simulation-based testing. In addition, the automated testing pipeline I propose, which includes automated oracle inference and automated test generation, improves the current state of automated testing, and enables future studies on improving different pieces of the proposed techniques (e.g., oracle refinement). The implementations of all the techniques discussed in this thesis proposal will be publicly released to be used by researchers in the future.


Bibliography

[1] Boeing issues 737 MAX fleet bulletin on AoA warning after Lion Air crash. https://theaircurrent.com/aviation-safety/boeing-nearing-737-max-fleet-bulletin-on-aoa-warning-after-lion-air-crash/. Accessed: 2020-01-17.

[2] Assumptions used in the safety assessment process and the effects of multiple alerts and indications on pilot performance. https://www.ntsb.gov/investigations/AccidentReports/Reports/ASR1901.pdf. Accessed: 2020-01-17.

[3] Robotics and cyber-physical systems. https://www.cis.mpg.de/robotics/. Accessed: 2019-12-15.

[4] About ROS. https://www.ros.org/about-ros/. Accessed: 2020-01-16.

[5] Schiaparelli landing investigation makes progress. https://www.esa.int/Science_Exploration/Human_and_Robotic_Exploration/Exploration/ExoMars/Schiaparelli_landing_investigation_completed, 2016. Accessed: 2020-01-17.

[6] Alireza Abbaspour, Kang K Yen, Shirin Noei, and Arman Sargolzaei. Detection of fault data injection attack on UAV using adaptive neural network. Procedia Computer Science, 95:193–200, 2016.

[7] Sara Abbaspour Asadollah, Rafia Inam, and Hans Hansson. A survey on testing for cyber physical system. In International Conference on Testing Software and Systems, pages 194–207. Springer, 2015.

[8] Mahmoud Abdelgawad, Sterling McLeod, Anneliese Andrews, and Jing Xiao. Model-based testing of a real-time adaptive motion planning system. Advanced Robotics, 31(22):1159–1176, 2017.

[9] Mikhail Afanasov, Aleksandr Iavorskii, and Luca Mottola. Programming support for time-sensitive adaptation in cyberphysical systems. ACM SIGBED Review, 14(4):27–32, 2018.

[10] Sheeva Afshan, Phil McMinn, and Mark Stevenson. Evolving readable string test inputs using a natural language model to reduce human oracle cost. In International Conference on Software Testing, Verification and Validation, ICST '13, pages 352–361, 2013.

[11] Afsoon Afzal, Claire Le Goues, Michael Hilton, and Christopher Steven Timperley. A study on challenges of testing robotic systems. In Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), ICST'20. IEEE, 2020.


[12] Saeed Aghabozorgi and Teh Ying Wah. Clustering of large time series datasets. Intelligent Data Analysis, 18(5):793–817, 2014.

[13] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. Time-series clustering – a decade review. Information Systems, 53:16–38, 2015.

[14] Adam Alami, Yvonne Dittrich, and Andrzej Wasowski. Influencers of quality assurance in an open source community. In International Workshop on Cooperative and Human Aspects of Software Engineering, CHASE'18, pages 61–68. IEEE, 2018.

[15] Shaukat Ali, Lionel C Briand, Hadi Hemmati, and Rajwinder Kaur Panesar-Walawege. A systematic review of the application and empirical investigation of search-based test case generation. Transactions on Software Engineering, 36(6):742–762, 2009.

[16] Shaukat Ali, Hong Lu, Shuai Wang, Tao Yue, and Man Zhang. Uncertainty-wise testing of cyber-physical systems. In Advances in Computers, volume 107, pages 23–94. Elsevier, 2017.

[17] Maryam Raiyat Aliabadi, Amita Ajith Kamath, Julien Gascon-Samson, and Karthik Pattabiraman. ARTINALI: dynamic invariant detection for cyber-physical system security. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, pages 349–361, 2017.

[18] Cesare Alippi, Stavros Ntalampiras, and Manuel Roveri. Model-free fault detection and isolation in large-scale cyber-physical systems. Transactions on Emerging Topics in Computational Intelligence, 1(1):61–71, 2016.

[19] Aitor Arrieta, Shuai Wang, Goiuria Sagardui, and Leire Etxeberria. Search-based test case prioritization for simulation-based testing of cyber-physical system product lines. Journal of Systems and Software, 149:1–34, 2019.

[20] Gerrit Bagschik, Till Menzel, and Markus Maurer. Ontology based scene creation for the development of automated vehicles. In Intelligent Vehicles Symposium, IV'18, pages 1813–1820. IEEE, 2018.

[21] Richard Baker and Ibrahim Habli. An empirical evaluation of mutation testing for improving the test quality of safety-critical software. IEEE Transactions on Software Engineering, 39(6):787–805, 2012.

[22] Earl T Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. Transactions on Software Engineering, 41(5):507–525, 2015.

[23] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, 1994.

[24] Antonia Bertolino. Software testing research: Achievements, challenges, dreams. In Future of Software Engineering, FOSE'07, pages 85–103. IEEE Computer Society, 2007.

[25] Ivan Beschastnikh, Yuriy Brun, Michael D. Ernst, and Arvind Krishnamurthy. Inferring models of concurrent systems from logs of their behavior with CSight. In 36th International Conference on Software Engineering, ICSE '14, Hyderabad, India - May 31 - June 07, 2014, pages 468–479, 2014.


[26] Ivan Beschastnikh, Patty Wang, Yuriy Brun, and Michael D. Ernst. Debugging distributed systems. Queue, 14(2):91–110, 2016.

[27] Alan W Biermann and Jerome A Feldman. On the synthesis of finite-state machines from samples of their behavior. Transactions on Computers, 100(6):592–597, 1972.

[28] Dibya Jyoti Bora and Anil Kumar Gupta. Effect of different distance measures on the performance of k-means algorithm: an experimental study in MATLAB. Computer Science and Information Technologies, 5(2):2501–2506, 2014.

[29] Adnan Causevic, Daniel Sundmark, and Sasikumar Punnekkat. An industrial survey on contemporary aspects of software testing. In International Conference on Software Testing, Verification and Validation, ICST'10, pages 393–401. IEEE, 2010.

[30] Yuqi Chen, Christopher M Poskitt, and Jun Sun. Learning from mutants: Using code mutation to learn and monitor invariants of a cyber-physical system. In 2018 IEEE Symposium on Security and Privacy (SP), pages 648–660. IEEE, 2018.

[31] Yuqi Chen, Christopher M. Poskitt, and Jun Sun. Learning from mutants: Using code mutation to learn and monitor invariants of a cyber-physical system. In Symposium on Security and Privacy, S&P '18, pages 648–660, 2018.

[32] Long Cheng, Ke Tian, and Danfeng Daphne Yao. Orpheus: Enforcing cyber-physical execution semantics to defend against data-oriented attacks. In Computer Security Applications Conference, pages 315–326, 2017.

[33] Hongjun Choi, Wen-Chuan Lee, Yousra Aafer, Fan Fei, Zhan Tu, Xiangyu Zhang, Dongyan Xu, and Xinyan Xinyan. Detecting attacks against robotic vehicles: A control invariant approach. In Conference on Computer and Communications Security, CCS '18, pages 801–816, 2018.

[34] David Clark, Robert M Hierons, and Krishna Patel. Normalised squeeziness and failed error propagation. Information Processing Letters, 149:6–9, 2019.

[35] Daniel Cook, Andrew Vardy, and Ron Lewis. A survey of AUV and robot simulators for multi-vehicle operations. In IEEE/OES Autonomous Underwater Vehicles (AUV), pages 1–8. IEEE, 2014.

[36] Jeff Craighead, Robin Murphy, Jenny Burke, and Brian Goldiez. A survey of commercial & open source unmanned vehicle simulators. In Proceedings of the International Conference on Robotics and Automation, ICRA'07, pages 852–857. IEEE, 2007.

[37] Christoph Csallner, Nikolai Tillmann, and Yannis Smaragdakis. DySy: Dynamic symbolic execution for invariant inference. In Proceedings of the 30th International Conference on Software Engineering, ICSE '08, pages 281–290. ACM, 2008.

[38] Ankush Desai, Tommaso Dreossi, and Sanjit A Seshia. Combining model checking and runtime verification for safe robotics. In International Conference on Runtime Verification, pages 172–189. Springer, 2017.

[39] Ankush Desai, Shaz Qadeer, and Sanjit A Seshia. Programming safe robotics systems: Challenges and advances. In International Symposium on Leveraging Applications of Formal Methods, pages 103–119. Springer, 2018.


[40] Pengfei Duan, Ying Zhou, Xufang Gong, and Bixin Li. A systematic mapping study on the verification of cyber-physical systems. IEEE Access, 6:59043–59064, 2018.

[41] Emelie Engstrom and Per Runeson. A qualitative survey of regression testing practices. In International Conference on Product Focused Software Process Improvement, pages 3–16. Springer, 2010.

[42] Michael D Ernst, Jeff H Perkins, Philip J Guo, Stephen McCamant, Carlos Pacheco, Matthew S Tschantz, and Chen Xiao. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1-3):35–45, 2007.

[43] Lukas Esterle and Radu Grosu. Cyber-physical systems: challenge of the 21st century. e & i Elektrotechnik und Informationstechnik, 133(7):299–303, 2016.

[44] Marie Farrell, Matt Luckcuck, and Michael Fisher. Robotics and integrated formal methods: necessity meets opportunity. In International Conference on Integrated Formal Methods, pages 161–171. Springer, 2018.

[45] Gordon Fraser and Andrea Arcuri. Whole test suite generation. Transactions on Software Engineering, 39(2):276–291, 2012.

[46] Alessio Gambi, Tri Huynh, and Gordon Fraser. Generating effective test cases for self-driving cars from police reports. In Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE'19, pages 257–267, 2019.

[47] Vahid Garousi and Junji Zhi. A survey of software testing practices in Canada. Journal of Systems and Software, 86(5):1354–1376, 2013.

[48] Amin Ghafouri, Aron Laszka, Abhishek Dubey, and Xenofon Koutsoukos. Optimal detection of faulty traffic sensors used in route planning. In International Workshop on Science of Smart City Operations and Platforms Engineering, pages 1–6, 2017.

[49] Amin Ghafouri, Yevgeniy Vorobeychik, and Xenofon Koutsoukos. Adversarial regression for detecting attacks in cyber-physical systems. In International Conference on Artificial Intelligence, ICOAI '18, pages 3769–3775, 2018.

[50] Christoph Gladisch, Thomas Heinz, Christian Heinzemann, Jens Oehlerking, Anne von Vietinghoff, and Tim Pfitzer. Experience paper: Search-based testing in automated driving control applications. In International Conference on Automated Software Engineering, ASE'19, pages 26–37. IEEE, 2019.

[51] Stewart Grant, Hendrik Cech, and Ivan Beschastnikh. Inferring and asserting distributed system invariants. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 1149–1159, 2018.

[52] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. 2011.

[53] Yoshiyuki Harada, Yoriyuki Yamagata, Osamu Mizuno, and Eun-Hye Choi. Log-based anomaly detection of CPS using a statistical method. In International Workshop on Empirical Software Engineering in Practice, IWESEP '17, pages 1–6, 2017.


[54] Florian Hauer, Alexander Pretschner, Maximilian Schmitt, and Markus Groetsch. Industrial evaluation of search-based test generation techniques for control systems. In International Symposium on Software Reliability Engineering Workshops, ISSREW'17, pages 5–8. IEEE, 2017.

[55] Zhijian He, Yao Chen, Enyan Huang, Qixin Wang, Yu Pei, and Haidong Yuan. A system identification based oracle for control-CPS software fault localization. In International Conference on Software Engineering, ICSE '19, pages 116–127, 2019.

[56] Michael W Hofbaur and Brian C Williams. Mode estimation of probabilistic hybrid systems. In International Workshop on Hybrid Systems: Computation and Control, pages 253–266, 2002.

[57] Casidhe Hutchison, Milda Zizyte, Patrick E. Lanigan, David Guttendorf, Michael Wagner, Claire Le Goues, and Philip Koopman. Robustness testing of autonomy software. In International Conference on Software Engineering - Software Engineering in Practice, ICSE-SEIP '18, pages 276–285, 2018.

[58] Felix Ingrand. Recent trends in formal validation and verification of autonomous robots software. In International Conference on Robotic Computing, IRC'19, pages 321–328. IEEE, 2019.

[59] Jun Inoue, Yoriyuki Yamagata, Yuqi Chen, Christopher M Poskitt, and Jun Sun. Anomaly detection for a water treatment system using unsupervised machine learning. In International Conference on Data Mining Workshops, ICDMW '17, pages 1058–1065, 2017.

[60] Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649–678, 2010.

[61] Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.

[62] Rene Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In International Symposium on Foundations of Software Engineering, FSE'14, pages 654–665, 2014.

[63] Aaron Kane, Thomas Fuhrman, and Philip Koopman. Monitor based oracles for cyber-physical system testing: Practical experience report. In International Conference on Dependable Systems and Networks, DSN '14, pages 148–155, 2014.

[64] Christian Kastner, Alexander Von Rhein, Sebastian Erdweg, Jonas Pusch, Sven Apel, Tillmann Rendel, and Klaus Ostermann. Toward variability-aware testing. In International Workshop on Feature-Oriented Software Development, pages 1–8. ACM, 2012.

[65] Leonard Kaufmann and Peter Rousseeuw. Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405–416, 1987.

[66] Siddhartha Kumar Khaitan and James D McCalley. Design techniques and applications of cyberphysical systems: A survey. IEEE Systems Journal, 9(2):350–365, 2014.

[67] Manju Khari. Empirical evaluation of automated test suite generation and optimization. Arabian Journal for Science and Engineering, pages 1–17, 2019.


[68] Nobuo Kikuchi, Takeshi Yoshimura, Ryo Sakuma, and Kenji Kono. Do injected faults cause real failures? A case study of Linux. In International Symposium on Software Reliability Engineering Workshops, ISSRE'14, pages 174–179. IEEE, 2014.

[69] Philip Koopman and Michael Wagner. Autonomous vehicle safety: An interdisciplinary challenge. IEEE Intelligent Transportation Systems Magazine, 9(1):90–96, 2017.

[70] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A review of classification techniques. Emerging Artificial Intelligence Applications in Computer Engineering, 160:3–24, 2007.

[71] Caroline Lemieux. Mining temporal properties of data invariants. In International Conference on Software Engineering, ICSE '15, pages 751–753, 2015.

[72] Caroline Lemieux, Dennis Park, and Ivan Beschastnikh. General LTL specification mining. In 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015, pages 81–92, 2015.

[73] Husheng Li. Communications for control in cyber physical systems: theory, design and applications in smart grids, chapter 1 - Introduction to cyber physical systems. Morgan Kaufmann, 2016.

[74] Mikael Lindvall, Adam Porter, Gudjon Magnusson, and Christoph Schulze. Metamorphic model-based testing of autonomous systems. In International Workshop on Metamorphic Testing, MET '17, pages 35–41, 2017.

[75] S. Lloyd. Least squares quantization in PCM. Transactions on Information Theory, 28(2):129–137, 2006.

[76] Pablo Loyola, Matt Staats, In-Young Ko, and Gregg Rothermel. Dodona: automated oracle data set selection. In International Symposium on Software Testing and Analysis, ISSTA '14, pages 193–203, 2014.

[77] Matt Luckcuck, Marie Farrell, Louise A Dennis, Clare Dixon, and Michael Fisher. Formal specification and verification of autonomous robotic systems: A survey. ACM Computing Surveys, 52(5):100, 2019.

[78] Haroon Malik, Hadi Hemmati, and Ahmed E Hassan. Automatic detection of performance deviations in the load testing of large scale systems. In International Conference on Software Engineering, ICSE '13, pages 1012–1021, 2013.

[79] Dusica Marijan, Arnaud Gotlieb, and Mohit Kumar Ahuja. Challenges of testing machine learning based systems. In International Conference On Artificial Intelligence Testing, AITest'19, pages 101–102. IEEE, 2019.

[80] Reza Matinnejad, Shiva Nejati, Lionel C Briand, and Thomas Bruckmann. Automated test suite generation for time-continuous Simulink models. In International Conference on Software Engineering, ICSE'16, pages 595–606, 2016.

[81] Phil McMinn. Search-based software testing: Past, present and future. In International Conference on Software Testing, Verification and Validation Workshops, ICST'11, pages 153–163. IEEE, 2011.

[82] Phil McMinn, Mark Stevenson, and Mark Harman. Reducing qualitative human oracle costs associated with automatically generated test data. In International Workshop on Software Test Output Validation, pages 1–4, 2010.

[83] Carter McNamara. General guidelines for conducting interviews, 1999.

[84] Claudio Menghi, Shiva Nejati, Khouloud Gaaloul, and Lionel Briand. Generating automated and online test oracles for Simulink models with continuous and uncertain behaviors. 2019.

[85] Swarup Mohalik, Ambar A Gadkari, Anand Yeolekar, KC Shashidhar, and S Ramesh. Automatic test case generation from Simulink/Stateflow models using model checking. Software Testing, Verification and Reliability, 24(2):155–180, 2014.

[86] Patric Nader, Paul Honeine, and Pierre Beauseroy. lp-norms in one-class classification for intrusion detection in SCADA systems. Transactions on Industrial Informatics, 10(4):2308–2317, 2014.

[87] Roberto Natella, Domenico Cotroneo, Joao A Duraes, and Henrique S Madeira. On fault representativeness of software fault injection. IEEE Transactions on Software Engineering, 39(1):80–96, 2012.

[88] SP Ng, Tafline Murnane, Karl Reed, D Grant, and TY Chen. A preliminary survey on software testing practices in Australia. In Australian Software Engineering Conference, pages 116–125. IEEE, 2004.

[89] ThanhVu Nguyen, Deepak Kapur, Westley Weimer, and Stephanie Forrest. Using dynamic analysis to generate disjunctive invariants. In 36th International Conference on Software Engineering, ICSE '14, Hyderabad, India - May 31 - June 07, 2014, pages 608–619, 2014.

[90] ThanhVu Nguyen, Matthew B. Dwyer, and Willem Visser. SymInfer: inferring program invariants using symbolic states. In International Conference on Automated Software Engineering, ASE'17, pages 804–814, 2017.

[91] Tony Ohmann, Michael Herzberg, Sebastian Fiss, Armand Halbert, Marc Palyart, Ivan Beschastnikh, and Yuriy Brun. Behavioral resource-aware model inference. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014, pages 19–30, 2014.

[92] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. Are mutation scores correlated with real fault detection? A large scale empirical study on the relationship between mutants and real faults. In International Conference on Software Engineering, ICSE'18, pages 537–548. IEEE, 2018.

[93] Fabio Pasqualetti, Florian Dorfler, and Francesco Bullo. Cyber-physical attacks in power networks: Models, fundamental limitations and monitor design. In Conference on Decision and Control and European Control Conference, pages 2195–2201, 2011.

[94] Fabrizio Pastore, Leonardo Mariani, and Gordon Fraser. CrowdOracles: Can the crowd solve the oracle problem? In International Conference on Software Testing, Verification and Validation, ICST '13, pages 342–351, 2013.


[95] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y Ng. ROS: an open-source robot operating system. In ICRA Workshop on Open Source Software, volume 3, page 5. Kobe, Japan, 2009.

[96] Chotirat Ann Ratanamahatana and Eamonn Keogh. Three myths about dynamic time warping data mining. In International Conference on Data Mining, ICDM '05, pages 506–510, 2005.

[97] Danda B Rawat, Joel JPC Rodrigues, and Ivan Stojmenovic. Cyber-physical systems: from theory to practice. CRC Press, 2015.

[98] Per Runeson. A survey of unit testing practices. IEEE Software, 23(4):22–29, 2006.

[99] Alireza Salahirad, Hussein Almulla, and Gregory Gay. Choosing the fitness function for the job: Automated generation of test suites that detect real faults. Software Testing, Verification and Reliability, 29(4-5):e1701, 2019.

[100] Lukas Schmidt, Apurva Narayan, and Sebastian Fischmeister. TREM: a tool for mining timed regular specifications from system traces. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, Urbana, IL, USA, October 30 - November 03, 2017, pages 901–906, 2017.

[101] Sanjit A Seshia, Shiyan Hu, Wenchao Li, and Qi Zhu. Design automation of cyber-physical systems: Challenges, advances, and opportunities. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(9):1421–1434, 2016.

[102] Forrest Shull, Janice Singer, and Dag I. K. Sjøberg, editors. Guide to Advanced Empirical Software Engineering. 2008.

[103] Thierry Sotiropoulos, Helene Waeselynck, and Jeremie Guiochet. Can robot navigation bugs be found in simulation? An exploratory study. In Software Quality, Reliability and Security, QRS '17, pages 150–159, 2017.

[104] Mohammad Tehranipoor and Cliff Wang. Introduction to hardware security and trust. Springer Science & Business Media, 2011.

[105] Andreas Theissler. Detecting known and unknown faults in automotive systems using ensemble-based anomaly detection. Knowledge-Based Systems, 123:163–173, 2017.

[106] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, pages 303–314, 2018.

[107] Christopher Steven Timperley, Afsoon Afzal, Deborah S Katz, Jam Marcos Hernandez, and Claire Le Goues. Crashing simulated planes is cheap: Can simulation detect robotics bugs early? In Proceedings of the 11th International Conference on Software Testing, Verification and Validation (ICST), pages 331–342. IEEE, 2018.

[108] Cumhur Erkan Tuncali. Search-based Test Generation for Automated Driving Systems: From Perception to Control Logic. PhD thesis, Arizona State University, 2019.

[109] Cumhur Erkan Tuncali, Theodore P Pavlic, and Georgios Fainekos. Utilizing S-TaLiRo as an automatic test generation framework for autonomous vehicles. In International Conference on Intelligent Transportation Systems, ITSC'16, pages 1470–1475. IEEE, 2016.

[110] Vandi Verma, Geoff Gordon, Reid Simmons, and Sebastian Thrun. Real-time fault diagnosis [robot fault diagnosis]. Robotics & Automation Magazine, 11(2):56–66, 2004.

[111] Westley Weimer, Stephanie Forrest, Claire Le Goues, and Miryung Kim. Cooperative, trusted software repair for cyber physical system resiliency. Technical report, University of Virginia, Charlottesville, United States, 2018.

[112] Johannes Wienke and Sebastian Wrede. Results of the survey: failures in robotics and intelligent systems. arXiv preprint arXiv:1708.07379, 2017.

[113] Andreas Windisch. Search-based testing of complex Simulink models containing Stateflow diagrams. In International Conference on Software Engineering, ICSE'09, pages 395–398. IEEE, 2009.

[114] Rui Xu and Don Wunsch. Clustering, volume 10. John Wiley & Sons, 2008.

[115] Jinlin Yang, David Evans, Deepali Bhardwaj, Thirumalesh Bhat, and Manuvir Das. Perracotta: Mining temporal API rules from imperfect traces. In Proceedings of the 28th International Conference on Software Engineering, ICSE '06, pages 282–291, 2006.

[116] Kiyoung Yang and Cyrus Shahabi. A PCA-based similarity measure for multivariate time series. In International Workshop on Multimedia Databases, MMDB '04, pages 65–74, 2004.

[117] Nong Ye, Syed Masum Emran, Qiang Chen, and Sean Vilbert. Multivariate statistical analysis of audit trails for host-based intrusion detection. Transactions on Computers, 51(7):810–820, 2002.

[118] Xi Zheng, Christine Julien, Miryung Kim, and Sarfraz Khurshid. On the state of the art in verification and validation in cyber physical systems. The University of Texas at Austin, The Center for Advanced Research in Software Engineering, Tech. Rep. TR-ARiSE-2014-001, 1485, 2014.

[119] Vivienne Jia Zhong, Rolf Dornberger, and Thomas Hanne. Comparison of the behavior of swarm robots with their computer simulations applying target-searching algorithms. International Journal of Mechanical Engineering and Robotics Research, 2018.

[120] Zhi Quan Zhou and Liqun Sun. Metamorphic testing of driverless cars. Communications of the ACM, 62(3):61–67, February 2019.

[121] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.

[122] Ehsan Zibaei, Sebastian Banescu, and Alexander Pretschner. Diagnosis of safety incidents for cyber-physical systems: A UAV example. In International Conference on System Reliability and Safety, ICSRS '18, pages 120–129, 2018.
