Machine Learning in Software Engineering

NEW TRENDS IN LEARNING FOR SOFTWARE ENGINEERING

Alaa Hamouda

Department of Computer Engineering, Engineering Faculty, Al-Azhar University, Egypt


Agenda

• Introduction

• Software Engineering Phases

• Machine Learning Overview

• Applications of ML in SWE with each process:

– Project Planning

– Requirements

– Design

– Implementation

– Testing

– Maintenance

• Conclusion


Problem Definition

• There is a need to meet the challenge of developing and maintaining large and complex software systems.

• Machine learning methods have been playing an increasingly important role in many software development and maintenance tasks.


SWE Phases


Overview of ML

• Machine learning methods fall into the following broad categories: supervised learning and unsupervised learning. Supervised learning deals with learning a target function from labeled examples. Unsupervised learning attempts to learn patterns and associations from a set of objects that do not have attached class labels.

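• Supervised learning can be divided into eager classifiers, which build a model from the training data before any query arrives (e.g., decision trees), and lazy classifiers, which defer generalization until a query must be classified (e.g., k-nearest neighbors).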
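The eager/lazy distinction can be sketched with two toy classifiers. This is a minimal illustration, not a method from the talk: the data, the one-feature decision stump, and the 1-nearest-neighbor rule are all assumptions chosen for brevity.

```python
# Hypothetical labeled examples: (feature value, class label).
TRAIN = [(1.0, "no"), (2.0, "no"), (3.0, "no"),
         (6.0, "yes"), (7.0, "yes"), (8.0, "yes")]

def train_stump(examples):
    """Eager learner: fit a threshold once; the data is not needed afterwards.

    Simplification: assumes the "yes" class lies above the threshold.
    """
    values = sorted(x for x, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != (label == "yes") for x, label in examples)

    return min(candidates, key=errors)

def predict_stump(threshold, x):
    return "yes" if x > threshold else "no"

def predict_1nn(examples, x):
    """Lazy learner: no training step; scan the stored examples per query."""
    nearest = min(examples, key=lambda example: abs(example[0] - x))
    return nearest[1]

threshold = train_stump(TRAIN)          # model built before any query
print(predict_stump(threshold, 5.0))    # uses only the learned threshold
print(predict_1nn(TRAIN, 5.0))          # consults the training data itself
```

The eager learner pays its cost at training time and answers queries cheaply; the lazy learner does the reverse.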


[Table: the loan data (reproduced); class label: approved or not]

[Figure: a decision tree built from the loan data, showing decision nodes and leaf nodes (classes)]


Project Planning

• Reported statistics put the failure rate of software projects at 70%.

• Cost overruns of 189% have been reported.

• Research shows that inaccurate estimation is the root cause of most software project failures.


Size Estimation

• Size --> Effort --> Cost

• Twenty-eight of the sixty collected publications (almost 47%) deal with how to build models that predict or estimate some property of the software development process or its artifacts.


Function Point


• Internal Logical File (ILF): a file accessed and maintained by the application under development.

• External Interface File (EIF): a file accessed by the processing logic but maintained by another application.

• External Input (EI): an elementary process that processes data coming from outside the application boundary.

– Maintains an ILF.

• External Output (EO): an elementary process that sends data outside the application boundary.

– Presents information to the user through processing logic in addition to retrieval of data.

• External Inquiry (EQ): an elementary process that sends data outside the application boundary.

– Presents information to the user through retrieval of data from an ILF/EIF.

– No data manipulation or processing logic.
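As a rough sketch of how the five component types combine into a size figure, the unadjusted function point count is a weighted sum of the component counts. The weights below are the commonly published IFPUG values for average complexity; the component counts are made up for illustration, and a real count would rate each component as low, average, or high.

```python
# Commonly published IFPUG weights for average-complexity components:
# external inputs (EI), external outputs (EO), external inquiries (EQ),
# internal logical files (ILF), external interface files (EIF).
AVERAGE_WEIGHTS = {"EI": 4, "EO": 5, "EQ": 4, "ILF": 10, "EIF": 7}

def unadjusted_function_points(counts, weights=AVERAGE_WEIGHTS):
    """Weighted sum of component counts (all rated 'average' in this sketch)."""
    unknown = set(counts) - set(weights)
    if unknown:
        raise ValueError(f"unknown component types: {sorted(unknown)}")
    return sum(weights[kind] * number for kind, number in counts.items())

# Hypothetical application: 6 inputs, 4 outputs, 3 inquiries, 2 ILFs, 1 EIF.
example_counts = {"EI": 6, "EO": 4, "EQ": 3, "ILF": 2, "EIF": 1}
ufp = unadjusted_function_points(example_counts)
print(ufp)
```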

Size Estimation (Cont’d)

Input:

• Function points

• Project domains

• Number of component types:

– Number of menu components

– Number of input components

– Number of output components

ML Algorithm:

• Neural network

Output:

• LOC to be fed to the cost estimation stage


Effort Estimation

Input:

• Lines of code (LOC, generated by the size estimation stage)

• Scale factors

• Cost drivers

Algorithm:

• Fuzzy inference engine

Output:

• Estimated effort (e.g., man-hours)


Inputs (Scale Factors)

Factor                           Explanation
Precedentedness (PREC)           Reflects the previous experience of the organization.
Development Flexibility (FLEX)   Reflects the degree of flexibility in the development process.
Risk Resolution (RESL)           Reflects the extent of risk analysis carried out.
Team Cohesion (TEAM)             Reflects how well the development team members know each other and work together.
Process Maturity (PMAT)          Reflects the process maturity of the organization.

Factor                           Explanation
LOC                              Lines of code

Inputs (Cost Drivers)

Attribute   Type        Description
RELY        Product     Required system reliability
CPLX        Product     Complexity of system modules
DOCU        Product     Extent of documentation required
DATA        Product     Size of database used
RUSE        Product     Required percentage of reusable components
TIME        Computer    Execution time constraint
PVOL        Computer    Volatility of development platform
STOR        Computer    Memory constraints
ACAP        Personnel   Capability of project analysts
PCON        Personnel   Personnel continuity
PCAP        Personnel   Programmer capability
PEXP        Personnel   Programmer experience in project domain
AEXP        Personnel   Analyst experience in project domain
LTEX        Personnel   Language and tool experience
TOOL        Project     Use of software tools
SCED        Project     Development schedule compression
SITE        Project     Extent of multisite working and quality of inter-site communications
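The scale factors and cost drivers above appear to follow COCOMO II. For orientation, the crisp (non-fuzzy) COCOMO II effort equation is sketched below; the talk feeds these same inputs to a fuzzy inference engine instead, which is not reproduced here. The constants A and B are the published COCOMO II.2000 calibration values, and the ratings in the example are illustrative assumptions.

```python
import math

# Calibration constants as published for COCOMO II.2000 (assumed here).
A = 2.94
B = 0.91

def cocomo2_effort(kloc, scale_factors, cost_drivers):
    """Effort in person-months: A * size^E * product of cost-driver multipliers,

    where E = B + 0.01 * (sum of scale factor ratings).
    """
    exponent = B + 0.01 * sum(scale_factors.values())
    multiplier = math.prod(cost_drivers.values())
    return A * kloc ** exponent * multiplier

# Illustrative ratings only (hypothetical project).
scale_factors = {"PREC": 3.72, "FLEX": 3.04, "RESL": 4.24, "TEAM": 3.29, "PMAT": 4.68}
cost_drivers = {"RELY": 1.10, "CPLX": 1.00, "TOOL": 1.00}

effort = cocomo2_effort(20.0, scale_factors, cost_drivers)
print(f"Estimated effort: {effort:.1f} person-months")
```

Scale factors act on the exponent, so their influence grows with project size, whereas cost drivers scale the estimate linearly.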

Using Fuzzy Logic


Effort Estimation directly from UCP

In the previous method:

• Function points, FP (size) --> LOC (size) --> Effort

Another method:

• Use case points, UCP (size) --> Effort (directly)


Use Case Point Calculation
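The calculation can be sketched in Karner's original form, which is an assumption here since the slide's figure did not survive: actors and use cases are weighted by complexity, then adjusted by a technical complexity factor (TCF) and an environmental complexity factor (ECF). The counts and factor totals in the example are hypothetical.

```python
# Karner's weights for simple / average / complex items.
ACTOR_WEIGHTS = {"simple": 1, "average": 2, "complex": 3}
USE_CASE_WEIGHTS = {"simple": 5, "average": 10, "complex": 15}

def use_case_points(actors, use_cases, technical_total, environmental_total):
    """UCP = (UAW + UUCW) * TCF * ECF.

    technical_total and environmental_total are the weighted sums of the
    technical and environmental factor ratings.
    """
    uaw = sum(ACTOR_WEIGHTS[kind] * n for kind, n in actors.items())
    uucw = sum(USE_CASE_WEIGHTS[kind] * n for kind, n in use_cases.items())
    tcf = 0.6 + 0.01 * technical_total
    ecf = 1.4 - 0.03 * environmental_total
    return (uaw + uucw) * tcf * ecf

# Hypothetical project.
ucp = use_case_points(
    actors={"simple": 2, "average": 1, "complex": 1},
    use_cases={"simple": 3, "average": 4, "complex": 2},
    technical_total=30,
    environmental_total=10,
)
HOURS_PER_UCP = 20  # Karner's suggested productivity; projects should calibrate
print(f"UCP = {ucp:.2f}, effort = {ucp * HOURS_PER_UCP:.0f} person-hours")
```

The fixed hours-per-UCP productivity figure is the weak point of this model, which is what motivates learning the mapping from UCP to effort instead.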


Productivity


Project Complexity

• Level 1: The project team is familiar with this type of project and has developed similar projects in the past. The number and type of interfaces are simple. The project will be installed under normal conditions, where high security or safety factors are not required. In addition, around 20% of the design or implementation of a Level 1 project is reused from earlier, similar projects.

• Level 2: Similar to Level 1, except that only about 10% of the project is reused.

• Level 3: The technology, interfaces, and installation conditions are normal, but no part of the project has been previously designed or implemented.

• Level 4: The project must be installed on a complicated topology or architecture, such as a distributed system. In addition, the number of variables and interfaces is large.

• Level 5: Similar to Level 4, but with additional constraints such as a special type of security or high safety factors.


Effort Estimation (Cont’d)

The results show that the proposed ANN model outperforms:

• Regression models by 8%

• UCP models by 50%


Requirements Analysis

[Figure: business analysis and system analysis]


Lexicon   Phase I   Phase II
User      Noun      Actor
fills     Verb      Action
the       Article   -------
form      Noun      Object
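The two phases in the table can be sketched as a small mapping step. Phase I (part-of-speech tagging) is taken as given, with the tags supplied by hand. The Phase II rule used here, where a noun before the verb is the actor and a noun after it is the object, is an assumption that happens to fit the example sentence; the actual system's rules are not shown in the slides.

```python
def assign_roles(tagged_words):
    """Map Phase I part-of-speech tags to Phase II requirement roles.

    Assumed positional rule: nouns before the first verb are actors,
    nouns after it are objects; other words get no role (None).
    """
    roles = []
    seen_verb = False
    for word, tag in tagged_words:
        if tag == "Verb":
            seen_verb = True
            roles.append((word, "Action"))
        elif tag == "Noun":
            roles.append((word, "Object" if seen_verb else "Actor"))
        else:
            roles.append((word, None))
    return roles

# Phase I output for "User fills the form", tagged by hand.
sentence = [("User", "Noun"), ("fills", "Verb"), ("the", "Article"), ("form", "Noun")]
for word, role in assign_roles(sentence):
    print(word, role or "-------")
```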


Requirements

• Reverse engineering applies where legacy systems that are critical to the operation of the organization using them must still be maintained.

• Most legacy systems were developed before software engineering techniques were widely used. Thus they may be poorly structured, and their documentation may be out of date or non-existent.

• To make maintenance of a legacy system possible, the first task is to recover its design or specification from its source or executable code.


Design

1. Finding fault-prone components when selecting components for reuse

2. UI Design


Components Re-use

• Software quality classification models can be used to indicate which program modules are fault-prone (FP) and which are not fault-prone (NFP).

• These models can be used to select the best candidate modules for reuse.


Components Re-use

Attribute   Description
U_1         Number of unique operators
N_1         Total number of operators
U_2         Number of unique operands
N_2         Total number of operands
V(G)        McCabe’s cyclomatic complexity
N_L         Number of logical operators
LOC         Lines of code
ELOC        Executable lines of code
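The four operator and operand counts are the base measures of Halstead's software science. As a hedged illustration of how such raw attributes are often turned into derived features, the sketch below computes Halstead's vocabulary, length, volume, difficulty, and effort. Whether the classification models in the talk use these derived values or only the raw counts is not stated; the module counts are hypothetical.

```python
import math

def halstead_metrics(u1, n1, u2, n2):
    """Derive Halstead metrics from the base counts.

    u1/u2: unique operators/operands; n1/n2: total operators/operands.
    """
    vocabulary = u1 + u2
    length = n1 + n2
    volume = length * math.log2(vocabulary)
    difficulty = (u1 / 2) * (n2 / u2)
    effort = difficulty * volume
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
    }

# Hypothetical module: 4 unique operators used 10 times,
# 4 unique operands used 6 times.
metrics = halstead_metrics(u1=4, n1=10, u2=4, n2=6)
print(metrics)
```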

User Interface Design

• Learnability is an important aspect of usability

• Users lose up to 40% of their time to “frustrating experiences” with computers. Among the most common causes of these frustrations are missing, hard-to-find, and unusable features of the software.

• Nielsen characterizes a highly learnable system as one “allowing users to reach a reasonable level of usage proficiency within a short time”.

• A web usage map is mined using label sequential rules.


Implementation

• Implementation is a core process in the software engineering life cycle.

• One of the challenges in this phase is modularization (or remodularization).

• Genetic algorithms have been successfully used to address this problem.

• The objective is to improve the modularization quality (MQ). All versions of MQ combine cohesion and coupling into a single weighted fitness function.
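One widely cited version of such a fitness function is the cluster-factor formulation sometimes called TurboMQ, sketched below for an unweighted dependency graph. Which MQ variant the cited work uses is not stated, so this is an assumed example; a genetic algorithm would search over the module-to-cluster assignments to maximize this score.

```python
def turbo_mq(edges, cluster_of):
    """Sum of cluster factors CF_i = 2*mu_i / (2*mu_i + eps_i).

    mu_i counts dependencies inside cluster i (cohesion); eps_i counts
    dependencies entering or leaving it (coupling). Clusters with no
    internal dependencies contribute 0.
    """
    intra = {}
    inter = {}
    for source, target in edges:
        a, b = cluster_of[source], cluster_of[target]
        if a == b:
            intra[a] = intra.get(a, 0) + 1
        else:
            inter[a] = inter.get(a, 0) + 1
            inter[b] = inter.get(b, 0) + 1
    score = 0.0
    for cluster in set(cluster_of.values()):
        mu = intra.get(cluster, 0)
        eps = inter.get(cluster, 0)
        if mu > 0:
            score += 2 * mu / (2 * mu + eps)
    return score

# Hypothetical dependency graph over four modules.
edges = [("a", "b"), ("b", "a"), ("c", "d"), ("b", "c")]
cohesive = {"a": 0, "b": 0, "c": 1, "d": 1}   # groups tightly linked modules
scattered = {"a": 0, "c": 0, "b": 1, "d": 1}  # splits them apart
print(turbo_mq(edges, cohesive), turbo_mq(edges, scattered))
```

The cohesive partition scores higher because most dependencies fall inside clusters rather than across them.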


Implementation (Cont’d)

• Clustering has also been applied to package coupling, to reduce overall package size and to explore the relationship between design and code level software structure.

• Additional objectives might include closeness to original module structure, business goals, technical constraints, testability, and other metrics that may be important in finding a good module structure.



• Refactoring means rewriting existing source code to improve its readability, reusability, or structure without changing its meaning or behavior.

• Project managers want to know which locations are likely to demand refactoring: refactoring improves the understandability of the code, but it also costs development time.



• Researchers screen evolution data from the versioning systems of open source projects.

• ArgoUML and the Spring framework are examples developed in Java; they consist of about 5,000 and 10,000 classes, respectively.

• In Java, each class is usually placed in a separate file, so the researchers treat files as equivalent to classes and focus on files in their analysis.



The features used can be divided into different categories:

• Size

This category contains size measures, such as lines of code, from an evolution perspective: linesAdded, linesModified, or linesDeleted relative to the total LOC (lines of code) of a file.


Implementation (Cont’d)

• Team

The number of authors of a file influences the way software is developed. The more authors work on the changes, the higher the expected probability of rework and mistakes.

• Complexity of existing solution

According to the laws of software evolution, software continuously becomes more complex. Changes become harder to add because the software is harder to understand and the contracts between existing parts have to be retained. The researchers therefore investigate the changeCount in relation to the number of changes during the entire history of each file.



• New Requirements

In software development projects, new classes are usually added to object-oriented systems when new requirements have to be satisfied. The researchers use the information whether a file was newly introduced during the prediction period.

• Relational Aspects

Among the most important features in this category are couplings, such as the number of changes/revisions in which a file was committed together with other files.
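The feature categories above can be sketched as a small extraction step over commit records. The record layout, the sample data, and the names authorCount and coChangedRevisions are assumptions for illustration; linesAdded, linesModified, linesDeleted, and changeCount follow the names used in the slides. The "new requirements" feature is omitted for brevity.

```python
def file_features(commits, total_loc):
    """Aggregate per-file evolution features from a list of commits.

    Each commit is {"author": str, "changes": {path: (added, modified, deleted)}}.
    Line counts are reported relative to the file's total LOC.
    """
    stats = {}
    for commit in commits:
        for path, (added, modified, deleted) in commit["changes"].items():
            entry = stats.setdefault(path, {
                "added": 0, "modified": 0, "deleted": 0,
                "authors": set(), "changes": 0, "co_changed": 0,
            })
            entry["added"] += added
            entry["modified"] += modified
            entry["deleted"] += deleted
            entry["authors"].add(commit["author"])
            entry["changes"] += 1
            if len(commit["changes"]) > 1:  # committed together with other files
                entry["co_changed"] += 1
    return {
        path: {
            "linesAdded": entry["added"] / total_loc[path],
            "linesModified": entry["modified"] / total_loc[path],
            "linesDeleted": entry["deleted"] / total_loc[path],
            "authorCount": len(entry["authors"]),
            "changeCount": entry["changes"],
            "coChangedRevisions": entry["co_changed"],
        }
        for path, entry in stats.items()
    }

# Hypothetical history of two files.
commits = [
    {"author": "ann", "changes": {"A.java": (10, 5, 0), "B.java": (2, 0, 1)}},
    {"author": "bob", "changes": {"A.java": (0, 5, 5)}},
]
features = file_features(commits, total_loc={"A.java": 100, "B.java": 50})
print(features["A.java"])
```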

• With the described features, the number of refactorings is predicted.


• Decision tree and neural network are used as classifiers.

• The F-measure was about 65%.

• Several features, such as lines activity rate and number of lines altered per commit, clearly provide much information for the assessment of refactorings.

• The structure of the system is also crucial for refactorings: the number of co-changed files and the number of files introduced during maintenance are relevant features.
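For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts are invented so that the result lands near the reported 65% and are not taken from the study.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical outcome: 13 refactoring-prone files found, 7 false alarms, 7 missed.
score = f_measure(true_positives=13, false_positives=7, false_negatives=7)
print(f"F-measure: {score:.2f}")
```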


Testing

• Software quality models help ensure the reliability of the delivered products.

• Early detection of fault-prone software components enables verification experts to concentrate their time and resources on the problem areas of the software system under development.

• Accurate prediction of fault-prone modules enables verification and validation activities to focus on the critical software components.


Testing (Cont’d)

• Decision trees correctly predicted 79.3% of high development effort fault-prone modules (detection rate), while the trees generated from the best parameter combinations correctly identified 88.4% of those modules on average.


Maintenance

• Software maintenance is widely recognized to be the most expensive and time-consuming aspect of the software process.

• A relevance relation maps a tuple of system elements to a value indicating how related they are.

• Software change repositories reflect the history of a system, including actions that create new relationships and strengthen existing relationships in the software.



• Software entities include documents, source files, routines, modules, variables, and even the entire software system.

• A relevance relation is a predictor that maps tuples of two or more software entities to a value r quantifying how relevant, that is, connected or related, the entities are to each other.

• r shows the strength of relevance among the entities.
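A very simple stand-in for a relevance relation over pairs of files can be computed directly from a change repository: count how often two files change together, normalized by how often either changes. This co-change ratio is an illustrative assumption, not the learned predictor described in the talk, and the change sets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cochange_relevance(change_sets):
    """Map each file pair to r in (0, 1]: co-changes / changes touching either file."""
    file_counts = Counter()
    pair_counts = Counter()
    for files in change_sets:
        unique = sorted(set(files))
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        pair: together / (file_counts[pair[0]] + file_counts[pair[1]] - together)
        for pair, together in pair_counts.items()
    }

# Hypothetical history: each inner list is the set of files in one change.
history = [
    ["parser.c", "lexer.c"],
    ["parser.c", "lexer.c", "main.c"],
    ["main.c"],
]
relevance = cochange_relevance(history)
for pair, r in sorted(relevance.items()):
    print(pair, round(r, 2))
```

Pairs that never changed together are absent from the result, which corresponds to r = 0.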


Maintenance Effort Prediction

• If the predictions are based on formal software development effort prediction models, such as the estimation part of Function Point Analysis, essential differences between software development and software maintenance are neglected.

• The focus of software development is the creation of software, but the focus of software maintenance is more the change of software.

• The development of a software application typically is a one-of-a-kind project, but the maintenance activities on an application usually comprise a large number of tasks carried out over a long period of time in a relatively stable environment.


Maintenance Effort Prediction

• Researchers collected data on:

– 109 randomly selected maintenance tasks

– 70 applications

– The size of the applications varied from a few thousand lines of code (LOC) to about 500,000 LOC.

– The age of the applications varied from less than a year to more than 20 years.

– The functions of the applications included payroll, order entry, billing and invoicing, inventory control, service management, and personnel administration.



The following data was collected for each maintenance task:

• Type of maintenance task, i.e., corrective or perfective.

• Priority of task, i.e., high, medium or low priority.

• Maintainer’s knowledge and confidence about how to solve the task immediately after having read or heard the task specification.

• Years of experience as maintainer, and on the maintained application.

• Education level of the maintainer.

• Work-hours (effort) spent on the task.

• Task size and programming language.

• Age and size of the changed application.


Most important features:

• Cause: Corrective maintenance = 0, otherwise = 1.

• Change: More than 50% of the effort is believed to be spent on updating code, as opposed to inserting or deleting code = 0, otherwise = 1.

• Mode: More than 50% of the effort is believed to be spent on developing new modules (new module mode) = 0, otherwise (embedded mode) = 1.

• Confidence: The maintainer believes they know how to solve the task when the task specification is first read or heard = 0 (high confidence), otherwise = 1 (medium or low confidence).
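The binary coding above can be sketched as a small encoding function that turns a task record into the input vector for a prediction model. The record field names and the sample task are hypothetical; only the 0/1 coding follows the slides.

```python
def encode_task(task):
    """Encode a maintenance task as [cause, change, mode, confidence] (0/1 each)."""
    return [
        0 if task["type"] == "corrective" else 1,    # Cause
        0 if task["update_share"] > 0.5 else 1,      # Change: mostly updating code
        0 if task["new_module_share"] > 0.5 else 1,  # Mode: mostly new modules
        0 if task["confidence"] == "high" else 1,    # Confidence
    ]

# Hypothetical task: a bug fix that mainly updates existing code,
# where the maintainer was not immediately sure how to solve it.
task = {
    "type": "corrective",
    "update_share": 0.7,
    "new_module_share": 0.1,
    "confidence": "medium",
}
print(encode_task(task))
```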

Less influential features:

• Type of language

• Maintainer experience

• Task priority

• Application age

• Application size



• Neural network and regression were used as approaches for effort prediction.

• The prediction accuracy was acceptable (error of 60%).

• A recommended use of an effort prediction model is, therefore, to support the expert predictions.

• Another important use of a formal prediction model may be to support the collection and analysis of maintenance data in order to enable improvement of the maintenance process and product.


Open Problems

• Most of the presented work is immature, and many related issues remain open.

• Machine learning can help in the requirements engineering phase by supporting the development of knowledge-based systems and ontologies that manage requirements and model problem domains.


Open Problems (Cont’d)

• One of the most difficult problems is the problem of transforming requirements into architectures. Much research is needed in this area to address the ever increasing complexity of functional and non-functional requirements.


Open Problems (Cont’d)

• One area that has received some attention is the use of automated algorithms with machine learning to make repair assignments.

• In any case, more studies with respect to the appropriate criteria for selecting assignment policy, reward mechanisms and management goals need to be undertaken.


Conclusion

• The existing work certainly proves that the field of software engineering is a fertile ground for the application of machine learning methods.

• It is clear that there is an increased interest in the niche area of machine learning and software engineering.


Conclusion (cont’d)

• The strength of machine learning methods lies in the fact that they have sound mathematical and logical justifications.

• The power of machine learning methods does not come from a particular induction method, but instead from proper formulation of the problems and from crafting the representation to make learning tractable.


Conclusion (cont’d)

• Machine learning can play a useful role in the different phases of software engineering: project planning, requirements analysis, design, implementation, testing, and even maintenance.

• Interest in applying machine learning to software engineering tasks is expected to increase significantly, especially with the growing interest in empirical software engineering.


Thank you very much
