
Application of neural networks to the detection of fraud in workers’ compensation insurance

Inês Bruno de Oliveira

Application to a Portuguese insurer

Project Work proposal presented as partial requirement for obtaining the Master’s degree in Statistics and Information Management


Title: Application of neural networks to the detection of fraud in WCI

Subtitle: Application to a Portuguese insurer

Inês Bruno de Oliveira, MEGI

2017


NOVA Information Management School

Instituto Superior de Estatística e Gestão de Informação

Universidade Nova de Lisboa

APPLICATION OF NEURAL NETWORKS TO THE DETECTION OF FRAUD

IN WORKERS’ COMPENSATION INSURANCE

Application to a Portuguese insurer

by

Inês Oliveira

Project Work report presented as partial requirement for obtaining the Master’s degree in Statistics and Information Management, with a specialization in Risk Analysis and Management

Advisor: Prof. Rui Alexandre Henriques Gonçalves

November 2017


ACKNOWLEDGEMENTS

This work was completed thanks to a special group of people. First, Professor Rui: thank you for your guidance, patience, and optimism, and a very special thank you for all those late e-mails.

I must thank my parents, my brother and Bruno for all the understanding, motivation, and support

which were impressive in the most challenging moments of this work.

I shall thank Raquel and my close friends who stood by me during the late nights this work has taken.

Lastly, a special thank you to all employees involved in the gathering of the data and meetings in the

insurance company that made this work possible.


ABSTRACT

Insurance relies on a complex trust-based relationship in which a policyholder pays in advance to be protected in the future. In Portugal, workers’ compensation insurance is mandatory, which may restrict the course of action of both parties. Insurers face significant losses, not only in their core business, but also from the swindles of claimants and policyholders. Insureds may not find in the market what they really want to acquire, which may encourage fraudulent actions. Traditional fraud detection methods no longer adequately protect institutions in a world of increasingly sophisticated fraud techniques. This work focuses on creating an artificial neural network which will learn from insurance data and evolve continuously over time, anticipating fraudulent behaviours or actors and contributing to institutions’ risk-protection strategies.

KEYWORDS

Risk; Fraud; Insurance; Neural Networks


INDEX

1. Introduction
1.1. Background and Theoretical Framework
1.2. Study Relevance
1.3. Problem Identification and Study Objectives
2. Literature review
2.1. Operational risk
2.2. Fraud
2.3. Fraud in Insurance business
2.4. Fraud in workers’ compensation insurance
2.4.1. Legal Context
2.5. Analytical models and Data Mining
2.6. Neural Networks
3. Methodology
3.1. Research Methodology
3.2. Data Collection Process
3.2.1. Meetings
3.2.2. Extracting data
3.2.3. Data Transformation
3.3. Data Overview
3.4. Assumptions
3.5. Model
4. Results and discussion
4.1. Measures
5. Conclusions
6. Limitations and recommendations for future works
7. Bibliography
8. Annexes

LIST OF FIGURES


Figure 1 - Biological Neuron
Figure 2 - Artificial Neuron
Figure 3 - Structure of risk management
Figure 4 - Fraud Triangle
Figure 5 - Overfitting
Figure 6 - Hinton Diagram
Figure 7 - Data treatment
Figure 8 - Software Connections
Figure 9 - ROC index for 15 and 50 hidden units
Figure 10 - ROC plot for the 4th model
Figure 11 - ROC index for BackProp model
Figure 12 - ROC index comparison between pilot and no preliminary training
Figure 13 - ROC curve for control 2
Figure 14 - Article-based ROC plot
Figure 15 - ROC plot for control 3
Figure 16 - Known activation functions
Figure 17 - Example of a distribution of a variable
Figure 18 - Correlation between two variables
Figure 19 - Fraud classification
Figure 20 - Misclassification Rate
Figure 21 - Cumulative Lift
Figure 22 - Cumulative % Response
Figure 23 - Cumulative Lift
Figure 24 - Cumulative % Response
Figure 25 - Misclassification Rate
Figure 26 - Cumulative Lift
Figure 27 - Cumulative % Response
Figure 28 - Misclassification Rate
Figure 29 - Misclassification Rate
Figure 30 - Cumulative % Response
Figure 31 - Cumulative Lift
Figure 32 - Misclassification Rate
Figure 33 - Cumulative Lift
Figure 34 - Cumulative % Response
Figure 35 - Misclassification Rate
Figure 36 - Cumulative % Response


Figure 37 - Cumulative Lift
Figure 38 - Cumulative Lift
Figure 39 - Cumulative % Response
Figure 40 - ROC plot
Figure 41 - Cumulative Lift
Figure 42 - Cumulative % Response
Figure 43 - Cumulative Lift
Figure 44 - Misclassification Rate
Figure 45 - Cumulative % Response
Figure 46 - Cumulative Lift
Figure 47 - Cumulative % Response
Figure 48 - Cumulative Lift
Figure 49 - Misclassification Rate
Figure 50 - Cumulative % Response
Figure 51 - Cumulative Lift
Figure 52 - Cumulative % Response
Figure 53 - Misclassification Rate
Figure 54 - Cumulative % Response
Figure 55 - Cumulative Lift
Figure 56 - ROC plot
Figure 57 - Cumulative Lift
Figure 58 - Cumulative % Response
Figure 59 - Cumulative Lift
Figure 60 - Cumulative % Response
Figure 61 - Misclassification Rate
Figure 62 - Cumulative % Response
Figure 63 - Cumulative Lift
Figure 64 - Cumulative Lift
Figure 65 - ROC chart
Figure 66 - Cumulative % Response
Figure 67 - Cumulative Lift
Figure 68 - Cumulative % Response
Figure 69 - Misclassification Rate
Figure 70 - Cumulative Lift
Figure 71 - Cumulative % Response
Figure 72 - Algorithms overview


LIST OF ABBREVIATIONS AND ACRONYMS

ANN Artificial Neural Network
RMSE Root mean square error
WCI Workers’ compensation insurance
Bus Emp Business Employee
Stat Emp Statistical Employee
Cla Emp Claim Employee


1. INTRODUCTION

1.1. BACKGROUND AND THEORETICAL FRAMEWORK

As background to this study, we will introduce the concept of operational risk and its management and present an initial notion of fraud. In addition, we will present some concepts of artificial intelligence.

Management of operational risk is not a new practice; it has always been important for financial institutions to try to prevent fraud, maintain the integrity of internal controls (over processes and systems), reduce errors in transaction processing and even protect against terrorism. However, its importance has steadily increased and is now comparable to that of credit and market risk management. Operational risk clearly differs from other financial risks in that it is not taken in expectation of a reward; nevertheless, it exists in the natural course of corporate activity (Basel, 2003; Solvency II, 2009).

Solvency II is the regulatory and supervisory framework for insurers. Its purpose is to establish a common risk management system and common risk measurement principles for every insurance and reinsurance company in the European Union. The framework was developed to mitigate some inefficiencies of the former Solvency I (Kaļiņina & Voronova, 2014; Vandenabeele, 2014). It is based on three pillars: quantitative requirements, qualitative requirements and supervision. At the regulatory level, insurance fraud falls under Pillar I, within the operational risk component of the Solvency Capital Requirement calculation (Francisco, 2014). As a sustainable company, an insurer aims to eliminate its risk by managing, planning, evaluating and controlling its processes, practices that together constitute the mechanism known as risk management (Kaļiņina & Voronova, 2014).

Internal and external fraud are included in the categorization of operational risk by the Operational Risk Insurance Consortium and defined, respectively, as intentional misconduct and as unauthorized activities by internal or external parties (Patel, 2010). Currently, fraud detection methods can be divided into four categories: business rules, auditing, networks, and statistical models and data mining (Shao & Pound, 1999). Data mining is favoured for its cost efficiency: it finds evidence of fraud by applying mathematical algorithms to the available data (Phua, Lee, Smith, & Gayler, 2010). One example of such mathematical algorithms is neural networks, a model inspired by biological neural networks (see the resemblance in the figures below). Artificial Neural Networks (ANNs) are systems of interconnected neurons which exchange messages between each other.

Figure 1 - Biological Neuron; Figure 2 - Artificial Neuron
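To make the analogy concrete, here is a minimal sketch of a single artificial neuron (our illustration, not taken from the cited works): it computes a weighted sum of its inputs plus a bias and passes the result through an activation function.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, squashed into (0, 1) by a sigmoid
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only: three inputs, their weights, and a bias term
print(neuron([0.5, 0.1, 0.9], weights=[0.4, -0.2, 0.7], bias=-0.3))
```

A network, such as those used later in this work, chains many such units into layers and learns the weights from data.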


There are many applications of ANNs in today’s business: (i) financial institutions are improving their decision-making processes by enhancing the interpretation of behavioural scoring systems and developing superior ANN models of credit card risk and bankruptcy; (ii) securities and trading houses are improving forecasting techniques and trading strategies; (iii) insurers are improving their underwriting techniques; (iv) manufacturers are improving their product quality through predictive process control systems (Li, 1994). In fact, neural networks (such as Generative Adversarial Networks) are being used to adjust datasets in order to make them usable for further investigation (by neural algorithms, for example). This supports the present-day industrialization of neural networks (Douzas & Bação, 2017).
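One such adjustment is rebalancing: fraud datasets typically contain very few positive cases. As a deliberately simple stand-in for GAN-based generation of synthetic minority samples (our illustration, not the method of the cited paper), plain random oversampling duplicates minority-class rows until the classes are balanced:

```python
import numpy as np

def random_oversample(X, y, rng=None):
    # Duplicate minority-class rows until all classes are equally frequent.
    # A crude stand-in for GAN-based resampling, which would generate
    # synthetic minority samples instead of plain copies.
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts_X, parts_y = [], []
    for cls, n in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=n_max - n, replace=True)
        keep = np.concatenate([idx, extra])
        parts_X.append(X[keep])
        parts_y.append(y[keep])
    return np.concatenate(parts_X), np.concatenate(parts_y)

# Hypothetical, heavily imbalanced fraud labels (about 2% positives)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.02).astype(int)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # both classes now equally represented
```

In practice, synthetic generation (SMOTE-like methods or GANs) is preferred over plain duplication, which adds no new information and can encourage overfitting.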

1.2. STUDY RELEVANCE

In this section, we intend to clarify the reasons why we should be concerned about fraud and its detection methods, considering the perspectives of insureds and insurers and the consequences for financial institutions. This concerns, in particular, the Portuguese insurance sector, which is undergoing a deep change in its administration system. Indeed, a close relationship between insurers and banks had always been beneficial, but the 2008 financial crisis, which affected both financial markets, is proof that this may no longer be true (APS, 2016a).

According to Pimenta (2009), total insurance fraud in Portugal is estimated at about 1.5% to 2.0% of national GDP, which is congruent with international figures: in Australia the numbers are around 1.3% of GDP, in Canada 2.1%, in France 2.0%, in Ireland 4%, in the USA 6% and in Germany 9%. In fact, the total cost of insurance fraud (non-health insurance) in the US is estimated at more than $40 billion per year, and in the UK (overall) fraud costs £73 billion a year (Baesens & Broucke, 2016).

LexisNexis (2012) offers, in its study, an interesting view on cost control: hurricane Andrew (which devastated the south-eastern USA in 1992) resulted in $16 billion in damages; hurricane Katrina (Gulf coast, 2005) caused $41 billion in losses. These natural disasters have similarities with operational risk: we have methods to detect them and, to some extent, prevent them (early warnings), but still, when they do happen, they can cause anywhere from negligible to substantial losses. Now note that, in the same country, the total cost of insurance fraud is $96.8 billion a year. Clearly, early identification could reduce costs relating to claims that are later found to be fraudulent; as it stands, recoveries are limited to only about 3% of fraud occurrences (LexisNexis, 2012). However, insurance fraud is still seen as a victimless crime with little or no consequences. Indeed, studies point out that 68% of people say “getting away with it” is the cause of fraudulent claiming; 12% say it is okay to present claims for objects that were not lost or damaged or for personal injuries that did not occur, while 7% say they have already done it. These numbers sum up society’s standpoint on insurance fraud. These swindles are not so victimless, as they cost policyholders an additional $200 to $300 a year in premiums; this is how insurers recover the money they lose to fraud (taken as 10% of losses) (Lesch & Brinkmann, 2012; Maio, 2013). Although investigating fraud is believed to be useful, insurers consider it standard practice to assume a certain amount of loss. Despite the numbers (currently 10% to 20% of all claims may be fraudulent and 5% are due to external fraud), insurers are reluctant to stall claims for investigation, as they fear mistakenly targeting a legitimate claim and an honest policyholder. This fear is, in fact, very valid, as


estimates of the incidence of suspicious claims are at least 10 times the number of claims actually prosecuted for fraud (Dionne & Wang, 2013; SAS, 2012). Maybe because of this, only market-leading companies properly report the abuses they suffer. Small companies fear damage to their reputation, cannot afford to lose mistakenly investigated clients and do not possess the means to properly separate fraudulent from non-fraudulent claims. Still, 62% of international companies with more than 5,000 employees declare having been victims of fraud (Pimenta, 2009).

A study conducted by PwC (2016), covering 5,428 companies from 40 countries, showed that 43% of those companies had suffered at least one economic crime in the previous two years. The sector with the highest percentage of self-reported fraud victims is insurance, at 57%. A daunting fact is the contamination effect of fraud, as it affects multiple segments such as automobile (more than one-third of injury claims in the 1990s had elements of fraud), health (where losses to fraud are about 3%), personal accidents, property risk and so on (Amado, 2015; Tennyson, 2008). According to the European Insurance Committee, fraud takes up 5% to 10% of the claim amounts paid in non-life insurance (Baesens & Broucke, 2016).

Regarding workers’ compensation insurance, the Portuguese sector follows the trend: it is deeply out of balance and in need of restructuring. Several factors have led to these results, such as (i) significant modifications to Portuguese legislation, which have increased insurers’ responsibilities; (ii) a relevant increase in life expectancy, resulting in a greater burden on insurers due to annuity payments, resembling what happens in social welfare institutions; (iii) decreasing interest rates, forcing re-evaluations of future responsibilities and resulting in greater provisions by insurance companies; (iv) increasing claim costs due to fraudulent behaviours (of policyholders, insured persons and even providers). Although improving, results continue to be catastrophic: in 2015, the sector generated technical results of -€88 million (APS, 2016a). The size of workers’ compensation insurance in Portugal is significant compared with other sectors: 238 thousand claims occurred in 2015 (1.5% more than in 2014), with

an average cost per claim of €2,484. The claims ratio (claims cost over premiums) remains unfavourable: 108.3% in 2015, though an improvement on the 115.3% recorded in 2014 (APS, 2016b).
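Taking these figures at face value (a back-of-the-envelope illustration on our part, not an APS computation, and assuming the average cost captures the full claims cost entering the ratio):

\[
\text{claims ratio} = \frac{\text{claims cost}}{\text{premiums}} \approx \frac{238\,000 \times 2\,484\,\text{€}}{\text{premiums}} = 1.083 \quad\Rightarrow\quad \text{premiums} \approx 546\,\text{M€},
\]

that is, for every euro of premium earned, the sector paid out roughly €1.08 in claims.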

In Portugal, CTFRAUDE was formed in 2006 by the Portuguese Association of Insurers; it came, however, late for such an important risk, as fraud represents between 5% and 20% of each institution’s total risk (APS, 2016a; Dionne & Wang, 2013).

Alarming as well are the results of a European survey that revealed huge gaps in the way companies fight fraud. Despite the non-stationary nature of fraud, which means detection techniques must be continuously refined to react to novel fraudulent behaviour, only 13% of the insurers questioned use a broad set of tools including business rules, business analytics and advanced analytics, and only 21% track fraud levels in real time (Paasch, 2008; SAS, 2012). These results may explain the European estimates: fraud is estimated to vary between 2% and 10% of paid claims, and €8 to €12 billion are estimated to be lost to fraud across the European Union (Fernandes, 2015). A SAS (Statistical Analysis System) study showed that, of the insurance companies using business analytics, 57% saw their number of detected fraud cases grow by 4%, while only 16% of the companies not using business analytics achieved this detection level (SAS, 2012).


There is no official data on workers’ compensation insurance fraud in Portugal, but there are signs that it is increasing; Alda Correia (member of CTFRAUDE of APS) stated, in an interview, that the numbers amounted to 1% in 2007 and 2% in 2012, with a resulting increase in the premiums paid by insureds (Pimenta, 2009; Ribeiro, 2012). It is evident that there is a gap in the insurance fraud literature which needs investigation. On the other hand, it is known that fraud detection in financial institutions requires an extensive amount of data as well as data processing resources, which have only become commonly available in the last decade.

This work aims to identify fraud accurately when it happens and to prevent its materialization efficiently in the future. This will not only reduce significant losses; it will also encourage innovation in the fraud prevention research area and attract more potential investors. In the next section, we demonstrate how this work will contribute to this field and what we aim to accomplish with it.

1.3. PROBLEM IDENTIFICATION AND STUDY OBJECTIVES

The fraud vulnerability of workers’ compensation insurance is, by now, clearly evident. An aggressive strategy should be put in place, without disrupting relationships with clients by investigating them too frequently or too deeply. Insurance companies have to adopt new methodologies to detect fraud and to reduce the time consumed by the manual processing of fraudulent claims (Phua et al., 2010). The undoubtedly severe losses due to fraud schemes should be sufficient reason to set in place more efficient fraud detection and prevention methods, aiming to stop suspicious claims even before filing.

This work proposes a new model with continuous learning features; that is, every time it is used, it improves itself before the next test. The main goal of this project is to improve fraud detection and mitigation methods with neural networks, using data from a Portuguese insurance company, with the intention of generalizing the conclusions and making them useful for other companies in the market.
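As a sketch of what such continuous learning can look like (an illustration under our own assumptions, using scikit-learn rather than the tooling of this project), a feed-forward network can be updated incrementally with partial_fit each time newly investigated claims are labelled:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Hypothetical claim features and fraud labels (0 = legitimate, 1 = fraud)
X_initial = rng.normal(size=(500, 10))
y_initial = rng.integers(0, 2, size=500)

model = MLPClassifier(hidden_layer_sizes=(15,), solver="adam", random_state=0)
model.partial_fit(X_initial, y_initial, classes=[0, 1])  # first call fixes the label set

# Later, as new claims are investigated and labelled, the network keeps learning
X_new = rng.normal(size=(50, 10))
y_new = rng.integers(0, 2, size=50)
model.partial_fit(X_new, y_new)

print(model.predict_proba(X_new[:3]))  # updated fraud scores for recent claims
```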

From the gathered information, a neural network model will be derived to detect fraud. The first specific goal is the identification of the significant variables for fraud detection in workers’ compensation insurance claims; this requires a statistical analysis of the data and a comparison with variables already tested in the fraud detection research area. The next step is the identification of the most appropriate types of neural networks to use: an extensive literature review is conducted to study and summarize the actual and potential applications of the several types of neural networks. This is crucial for the project, as there are several types of neural networks and also quite a few ANN algorithms. Lastly, we select the model best suited to the problem under study, based on the two previous specific goals.

In the next section, we explain how we intend to achieve the defined goals by identifying the processes and procedures aimed at obtaining the information required to create the new model.


2. LITERATURE REVIEW

2.1. OPERATIONAL RISK

The 2008 financial crisis frightened people, including borrowers, lenders, politicians and regulators, and made them consider the consequences of their actions. The unsustainability of bold investments, greed, and human failures of risk management and governance led to worldwide problems (Thirlwell, 2010). Financial institutions should have well-defined risk management models, which should include the assessment, analysis and elimination of risk (Kaļiņina & Voronova, 2014). As seen in figure 3, the risk culture and risk appetite of an institution should be taken into consideration when devising risk management strategies. Risk appetite is the measure of an organization’s acceptance of risk (its profits and losses). This appetite is one of the components of risk culture, alongside stress testing, risk identification, risk assessment, risk measurement, reporting, risk monitoring and the connection with business strategy (Kaļiņina & Voronova, 2014).

Figure 3 - Structure of risk management

Risk can be categorized by typology of exposure into several categories, such as market risk and credit risk (both within financial risk), liquidity risk, legal and regulatory risk, business and strategic risk, operational risk and reputational risk (Basel, 2003; Solvency II, 2009). Operational risk remains somewhat forgotten, even though, according to Thirlwell (2010), the fundamental elements of the crisis were “dramatic failures of corporate governance and risk management” and a “systemic breakdown in accountability and ethics”, both clearly and directly linked to human behaviour. Although there are several definitions of operational risk, the one adopted in the European Union defines it as the risk of loss arising from inadequate or failed internal processes, personnel or systems, or from external events. In other words, operational risk is a change in value caused by actual losses, incurred through inadequate or failed internal processes, people and systems, or through external events (including legal risk), differing from the expected losses (European Banking Authority, 2016; Kaļiņina & Voronova, 2014). This type of risk is becoming increasingly relevant due to the unexpected major losses it has originated, though it


has been considered a low-probability risk (Basel, 2003). This increase suggests financial institutions have significantly reduced their risk management activity for operational risk, which is unadvisable, as it lies at the heart of all risks: it involves everybody. Recall that managing people is one of the four elements of the Basel definition of operational risk (Dionne & Wang, 2013; Thirlwell, 2010).

There are different standpoints regarding operational risk and its impacts. Lewis & Lantsman (2005) state that operational risk is not related to market forces (it is idiosyncratic), which means it does not propagate through companies. However, in unauthorized activities like fraud (included in operational risk) there is clearly a propagation point at which the exchange of information may shift the proportions of risk the company is exposed to. Because of this, a new top-down approach to operational risk is being used, which should decrease costs in financial and human resources (Gonçalves, 2011). Lastly, the highlight of operational risk management is that it results in better, more conscious decision making. For that, a set of processes should be in place, such as loss analysis, risk and control assessment, monitoring of risk indicators and scenario analysis. On the other hand, managing this type of risk is contingent on understanding human (and organisational) behaviour (Thirlwell, 2010).

Fraud, whether internal or external, is a specific part of operational risk, and its prevention and detection are the motivation of this work. In the next section we explain its details and debate its importance.


2.2. FRAUD

Solvency II states seven risk types within operational risk: external fraud; internal fraud; employment practices and workplace safety; clients, products and business practices; damage to physical assets; business disruptions and system failures; and execution, delivery and process management (Dionne & Wang, 2013).

The Oxford Dictionary defines fraud as “wrongful or criminal deception intended to result in financial or personal gain”, which seems incomplete: deception does not necessarily mean fraud; some characteristics need to be fulfilled: it should be uncommon, time-evolving and damage-making. Also, the individual should be aware he/she is committing fraud (it is not enough to make a false statement; to be classified as fraud, the person has to know the statement is false at the moment it is pronounced) (Pimenta, 2009; Vlasselaer et al., 2015). Fraud has a certain seasonality: during economic growth it is mitigated, during recession it is stimulated (as necessity increases, moral standards degenerate, while risk aversion may be boosted); sometimes this is enough to catch the attention of managers, who consequently intensify operational risk management (Dionne & Wang, 2013). According to a KPMG study, during the financial crisis (2007-2011) there was a 74% increase in internal fraud at financial institutions. Other studies support these results: Allen and Balli (2007) state that operational losses are related to business cycles, and Chernobai, Jorion, & Yu (2011) found a dependence between crises and operational risk (Dionne & Wang, 2013; “Record Year of Fraud Reports in 2011 - ABC News,” 2012).

The basis of fraud is human behaviour. According to a hypothesis formulated by Donald Cressey, people with financial problems know those problems can be resolved by violating a position of financial trust; furthermore, they are able to adjust their conceptions of themselves as trusted or entrusted persons. The fraud triangle (figure 4) comprehends the underlying motives and drivers of fraud (Baesens, Vlasselaer, & Verbeke, 2015; Soares, 2008).

Figure 4 - Fraud Triangle


Pressure is the main motivation for fraudulent behaviour; it arises from a problem which does not seem resolvable in a licit way, and it usually results in financial fraud (desire for a better lifestyle, gambling addiction, debts). The next driver of fraud is Opportunity: the precondition for committing fraud, the sense that the person will not be discovered, thereby relieving the pressure. These opportunities arise from weaknesses within institutions, such as lack of internal control, miscommunication, mismanagement or absence of internal auditing. Lastly, Rationalization: this driver explains why fraudsters do not refrain from committing fraud, as they justify the act to themselves as a once-in-a-lifetime act; this way, their conduct appears, to themselves, acceptable. Each of these three elements adjusts over time, resulting in an evolution of fraud probability (Baesens et al., 2015; Soares, 2008).

There is a variety of types of fraud, such as credit card fraud, counterfeiting, corruption, product warranty fraud, healthcare fraud, money laundering, identity theft, plagiarism and insurance fraud. The latter will be explored in the next section.


2.3. FRAUD IN INSURANCE BUSINESS

There are records evidencing the intentional sinking of ships (ship scuttling) in order to claim the insurance money dating from between 1100 BC and 146 BC. These acts could lead to death penalties, a fact thought to have originated life insurance. Also, with the growth of the railway and automobile industries there was an expansion in insurance contracts and, consequently, in insurance fraud (Niemi, 1995; Viaene, Derrig, Baesens, & Dedene, 2002). Insurance fraud is about the intention of an insured person (or policyholder or provider) to deceive the insurance company and profit from it, or vice-versa, based on a contract. The insurer may sell contracts with non-existing coverages or fail to submit premiums, and the insured may exaggerate claims, present forged medical expenses or fake damages (Derrig, 2002; Lesch & Brinkmann, 2012). The fact that there is a contract between fraudster and victim is an important differentiation from other types of fraud. Here we can invoke the concept of moral hazard, which recognizes that the contract reduces the insureds’ incentives to prevent losses, and may even lead to exaggerated or faked damages, increasing the probability of recovering the premium paid to insurers (Francisco, 2014; Tennyson, 2008).

At the regulatory level, insurance fraud sits in Pillar I of Solvency II (the operational risk component of the Solvency Capital Requirement calculation). It is of great importance for the sector that insurance fraud be controlled, as fraudulent behaviours cause additional costs for all policyholders (fraud costs are incorporated in underwriting) and fraud weakens the system, consuming resources that should be addressing real claims instead. Indeed, insurance fraud costs around $96.8 billion worldwide. These costs lead not only to inequities but also to inefficiencies: fraud distorts insurance purchase decisions, loss prevention and claiming incentives (Baesens et al., 2015; Tennyson, 2008). In fact, insurance fraud numbers may be increasing due to the fraud opportunities expected in the context of a growing insurance market; according to Tennyson (2008), intensive supervision may be allied to this (Niemi, 1995). This increase in monitoring processes affects the policyholder-insurer relationship, which can compromise premiums and profits. There is a trade-off in initiating or not an investigation: it must be weighed in terms of the resources (people and capital) assigned to the investigation, the costs of legal processes, and the benefits to the company. A more thorough investigation directly impacts the costs of fraudulent claims (through claim denial or a reduction in the claim payment) while potentially discouraging others from committing fraud (Francisco, 2014; Tennyson, 2008). The outcome depends on the fraudster and his perceived probability of fraud success, degree of risk aversion, motive, discount rate and reputational risk aversion. Once fraud or its suspicion is detected, it is in the best interest of the insurance company to come to an agreement with the fraudster, aiming to minimize the costs (not only of the claim but also of legal actions). If an investigation concludes fraud actually happened then, according to Tennyson (2008), all policies of that policyholder should be terminated and any open claim rigorously investigated so as to pay the minimum possible. However, because of the significance of the client or of his contributions to the company’s portfolio, insurers may prefer not to terminate all contractual relations with a fraudulent client and, instead, to manage his fraud temptation; refusal to pay the costs and getting the money back is usually enough to settle the case.
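The investigation trade-off described above can be summarised in a stylised decision rule (our formalisation, not a formula from the cited authors): investigate a claim when

\[
p \cdot S > c,
\]

where \(p\) is the estimated probability that the claim is fraudulent, \(S\) the expected savings from denial or reduced payment (plus any deterrence value), and \(c\) the full cost of investigating, including legal expenses and the expected cost of mistakenly targeting an honest policyholder.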


Again, insurance fraud depends mostly on human behaviour, and Tennyson (2008) offers some interesting conclusions on this matter. Fraud campaigns need to be more intense, as fraud acceptance is shared between peers; “fraud breeds fraud” is a common saying in the finance industry, and if fraud tolerance starts increasing, so do fraud costs (Paasch, 2008). The author also states that penalties are not as effective as high detection probabilities in preventing fraud. Still, even these will have a non-significant effect on people who have had negative interactions with insurers or hold a bad opinion of them. Internal reward mechanisms are the most important decision-making criteria on whether an individual will deceive, so it is on this point that campaigns should focus. Curiously, studies point out that highly educated women and the elderly are less tolerant towards insurance fraud (Lesch & Brinkmann, 2012). Regarding acceptance of insurance fraud, a person may be:

• Moralist: least tolerant towards insurance fraud; agrees with strong penalties

• Realist: considers fraud dishonest but finds it justifiable in some cases

• Conformist: accepts insurance fraud as commonplace

• Critic: most tolerant towards insurance fraud

Insurance offenders may be classified into three types:

• Freeloaders: law-abiding individuals who take advantage of real situations

• Amateurs: individuals with intent to commit fraud, who make false claims and take out multiple insurance contracts

• Criminals: individuals, usually with criminal records, who belong to broad criminal organisations and commit organized fraud

Although the number of insurance criminals is increasing, broadly speaking insurance fraud is committed on a day-to-day basis by freeloaders and amateurs (Brites, 2006; Clarke, 1990; Tennyson, 2008).

Insurance fraud can be categorized, but it must first be clear that fraud in this context should be reserved for intentional acts, provable beyond reasonable doubt (Derrig, 2002):

• Criminal or hard fraud: intentional and against the law, aiming at a financial profit for the fraudster through deception in the claim

• Suspected criminal fraud

• Soft fraud or systematic abuse: whenever one of the characteristics of criminal fraud is not met

• Suspected soft fraud

There is a wide discrepancy between suspected fraud and proven fraud; to be exact, there is a 25% gap between their respective volumes (Derrig, 2002; Francisco, 2014). Soft fraud is characterized by built-up (padded) claims, while criminal fraud should be incontestable enough to prosecute. Soft fraud is more


common than hard fraud, but the costs associated with it are smaller. Therefore, fraud campaigns usually emphasize criminal fraud and usually use the number of cases brought to conviction as a measurable indicator. However, being the most common, soft fraud is the type that most erodes the trust relationship between parties; therefore, in order to anticipate fraud opportunities and reduce this type of fraud, it is advisable to update policyholders on the latest fraud discoveries and provide moral cues (SAS, 2012; Tennyson, 2008). In short, campaigns will only work if focused on the right motives of fraudsters. The ultimate reason for the growth of the insurance business is people’s need to insure items that generate value, like properties (domestic goods, cars) and activities (travel, business). This growth turned insurance companies into distant entities: without claims, the only contact policyholders have with them is the payment of premiums, which fosters the view of fraud against them as a victimless crime. As insurers determine their own scope (exceptions must be made for statutory insurance), they do not publicly emphasize that they are victims of fraud. This conveys to the public that insurance companies are robust enough to absorb fraud’s costs and removes the ethical paradox of fraud. Under some circumstances, one can understand the rationalization of committing fraud: there is a lack of involvement of potential policyholders in the creation of the policies.

When insureds file a claim, they are tempted to exaggerate it to recover part of the premiums paid or to recover deductibles, and some still see insurance as a potentially profitable financial investment. In fact, for some decades insurers were not concerned with this issue and may have made some mistakes by overlooking some policies (Lesch & Brinkmann, 2012; Niemi, 1995). One must remember that insurance contracts are agreements to pay for accidental damages, and the business of insurance is to pay claims in a timely and efficient manner; thus, when fighting fraud, the insurance company must focus on its direct goals (reducing losses in ongoing investigations while maintaining the brand’s marketing as customer-friendly) and indirect goals (fraud prevention) (Derrig, 2002; Niemi, 1995). The costs of insurance fraud are paid by policyholders, with whom the insurer still has a contractual relationship to maintain, as long as the premiums do not become excessive and stay competitive; obviously, these clients may switch to other companies if faced with an increasing amount of bureaucracy and monitoring processes. In terms of prevention, France and the USA are the most advanced in the field, and they suggest a significant effort should be put into preventing losses rather than preventing insurance fraud (Niemi, 1995). To protect themselves efficiently, insurers should start to re-think themselves as service companies: they provide protection (in wide terms) should an event outside the control of either party occur. On the other hand, insurers should increase the quality of the customer experience by redefining insurance’s role, which should strengthen the relation of trust and therefore reduce insurance fraud (Lesch & Brinkmann, 2012).

The Portuguese association of insurers (APS) congregates 99% of the insurance and reinsurance companies operating in the Portuguese market. In 2006, it created CTFRAUDE, a technical commission formed by members of each insurer, where experiences are shared and solutions proposed. It is meant to play a role in the prevention strategies of the various companies, in the standardization of expert reports in insurance, and in facilitating the relationship between insurers and law enforcement (APS, 2016a; Correia, 2010). In Portugal, the internal resolution of fraud-related problems without disclosing them to the public is customary, and detection still relies mostly on experts’ knowledge and imagination to discover deception revealed by a tip, by chance, by internal or external auditing, or even by a legal notice. Pimenta (2009) states two valuable paradoxes. The first is the fraud control paradox: the implementation of fraud detection mechanisms initially translates into an increase in fraud numbers, because they are


revealed in the statistics. The second is that experts discovering ever more types of fraud leads to even more fraud, because fraudsters also have access to this information. Indeed, a fraudster may have valuable characteristics (he/she may have a successful career), such as creativity and flexibility, and these can be used for good as much as for evil (Fernandes, 2015; Pimenta, 2009).


2.4. FRAUD IN WORKERS’ COMPENSATION INSURANCE

Insurance fraud is a wide concept, comprising many different sub-types of (insurance) risk. Some are worth mentioning, like casualty fraud (people not owning the insured property, or building up claims about it) or personal accident fraud (faking an accident to obtain payment of medical bills). Others deserve more detail, as they have a significant dimension (Lesch & Brinkmann, 2012). Healthcare fraud has some magnitude in the literature due to its frequency: it is widespread, although contained in a sense, as it usually takes the form of claim build-up rather than organized fraud (Tennyson, 2008). Fraud in healthcare insurance has been the fastest-growing criminal activity of the last decade in the USA, now considered more profitable than credit card fraud (Fisher, 2008). Amado (2015) used social networks to build a model capable of predicting the participants most likely to commit fraud. Automobile fraud has a significant body of related studies, because its claims often need expert intervention, which implies a multitude of resources. Even so, it is known to be hard to prove, and in most cases claims are only considered suspicious. One identified cause of this is consumer privacy, which makes it hard to identify claims as fraudulent; another is the many types of claims in auto insurance, which make it hard to find standard characteristics of fraudsters (Francisco, 2014; Tennyson, 2008).

Workers’ compensation insurance (WCI) is a statutory insurance product whose portfolio is significantly wide (after years of catastrophic numbers, it has been growing since 2014 in terms of pricing and accident rates) and whose fraud rates are considerable. Recall that estimates of insurance fraud in Portugal amount to roughly 1.5% to 2.0% of national GDP and represent between 5% and 20% of an institution’s total risk. Officially, APS has no numbers on WCI fraud in Portugal; however, CTFRAUDE states that, in 2012, 2% of claims were fraudulent.

As WCI is a very standard product, varying little among insurers, the same holds for its fraudulent activity. On the policyholder’s side, fraud occurs when a company understates its number of employees or its payroll, or even declares a business activity different from (and less risky than) the actual one.
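The incentive is visible in the premium mechanics: in a stylised formulation (ours, not from the cited sources), the WCI premium is roughly proportional to the payroll and to a rate reflecting the declared activity’s risk class,

\[
\text{premium} \approx \text{payroll} \times \text{rate}(\text{activity class}),
\]

so understating the payroll (or the headcount behind it) or declaring a lower-risk activity class directly lowers the premium paid.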

Regarding claims, fraudsters usually get away with deliberate or faked injuries, multiple claims, fabricated treatment, non-work-related or prior injuries, misrepresentation of wage loss, and the like (APS, 2016a; Derrig, 2002; Galewitz, 2009). In fighting WCI fraud, insurers need to set strategies carefully since, unlike in most insurance contracts, the individual filing the claim may not be the policyholder who pays the premiums. That is to say, the claimant bears no costs and has some incentive to file claims.

A strategy insurers may use to lower the number of claims is to increase the probability of claim denial; the reasoning runs as follows: increasing claim-denial rates means decreasing acceptance rates, hence decreasing the expected benefit from a claim and therefore decreasing the claim-filing probability.
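Stated as a stylised expected-value argument (our formalisation, not from the cited studies), a claimant files when the expected benefit exceeds the, largely non-monetary, cost of filing:

\[
(1 - d)\,B > c_{\text{filing}},
\]

where \(d\) is the denial rate and \(B\) the benefit if the claim is accepted; raising \(d\) lowers the expected benefit \((1-d)B\) and, with it, the filing probability.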

However, this strategy is questionable, as an aggressive denial approach may affect the premiums paid by policyholders, which are often based on the client’s loss experience. There are some interesting studies highlighting moral hazard and information

asymmetry in the insurer-insured relationship (Biddle, 2001). Krueger (1990) studied the relation between benefit generosity and claim frequency, which may be due to moral hazard. Butler & Worrall (1991) studied the effect of changes in the expected benefits of WCI, assuming that medical payments depend only on the true rate and severity of injuries. Butler, Durbin & Helvacian (1996) found

a positive correlation between benefit generosity and soft-tissue injury claims, known to be easy to fake and difficult to verify. Smith (1990) studied non-work-related injuries that occurred on weekends and were filed as claims (usually on Mondays) in order to draw on WCI. Park, Krebs, & Mirer (1996) studied the


amount of WCI-eligible injuries for which no claim was filed, and Gardner, Kleinman, & Butler (2000) showed that the claim-filing decision is positively dependent on past successfully filed claims by co-workers.

In Portugal, insurance businesses are supervised by ASF (the Insurance and Pension Funds Supervisory Authority), in charge of prudential and behavioural regulation. FAT is the workers’ compensation fund, which ensures protection for those who, due to the risks associated with their activity, cannot find insurance, for policyholders of insolvent companies, or even for insurers facing government-mandated pension raises (ASF, n.d.).

2.4.1. Legal Context

In Portugal, legislation on work accidents arose in the industrial revolution era, when the growing use of machines in productive processes boosted the number and severity of accidents. Initially, the burden of proof rested with the injured person, which left many accidents without repair (guilt was not proved). A new type of responsibility then arose: it is enough that a damage takes place for an obligation to repair it to exist. It rests on the principle that the one who takes the benefit of the labour, and of the risks it creates, should support its harming consequences (responsibility for the risk). This was written in Law 83, of July 24, 1913, which changed over time1. Currently, it is reflected in Portuguese Law 98/2009, of September 4, which regulates the repair regimes for work accidents and professional illnesses, including professional rehabilitation and reintegration. The evolution of the legal regime is characterized by an increase of injured persons’ rights and employers’ responsibilities (and, consequently, insurers’ responsibilities) (APS, 2016b; Martinez, 2015; Vieira Gomes, 2013).

The notion of a work-related accident may be mistakenly seen as simple. It is so complex that some countries, like Austria, do not have a legal definition of it. The notion of work accident was born linked to the professional risk theory, a specific risk different from the general risk of life that every human being takes. There are two theories worth mentioning: professional risk theory and economic risk theory. The first assumes a cause-effect relation (causal link) between the work executed and the accident: there is only a work accident if it is due to the core risk of the professional activity. The second considers that accidents that take place in the workplace during working hours, even if they do not occur directly due to the main activity’s risk, should be repaired. The current Portuguese legal regime follows the latter (Vieira Gomes, 2013). Currently, a work accident is defined (in accordance with Law 98/2009, of September 4) as an accident occurring in the workplace and during working time which directly or indirectly causes bodily damage, functional disturbance or illness resulting in a decrease of working or earning capacity, or death. There are, of course, some extensions to this concept (APS, 2016b).

All insurers must obey the same law (Portuguese law) and, in addition, they must present to the client the same general conditions regarding WCI, which are written by the state. The absence of insurance for accidents at work is punishable and can imply a penalty. There is some more legislation worth mentioning: Ordinance 256/2011, of July 5, approves the uniform part of the workers’ compensation policy’s general conditions for dependent workers and Regulatory Rule 3/200-R, of January 8, applies to independent workers; Law 102/2009, of September 4, regulates the health and

1 Law 1942, of July 27, 1936; Law 2127, of August 3, 1965; and Law 100/97, of September 13, were the ones which followed. The last enhanced the legal regime for self-employed people.

safety at work; Law 7/2009, of February 12, approves the revision of the Labour Code, which had been approved by Law 99/2003, of August 17 (regulated by Law 35/2004, of July 29); and Decree-Law no. 142/99, of April 30, creates the workers’ compensation fund (FAT) (APS, 2016b). Regarding fraud-related legislation, Regulatory Rule no. 10/2009-R, of June 25, aims to establish the general principles between insurers, insureds, policyholders and third parties with respect to complaints and anti-fraud policies. There is also a definition of fraud against insurers as “Intentional acts or omissions, even if in an attempted manner, with a view to obtain an unlawful advantage for themselves or for a third party, in connection with the conclusion or execution of insurance contracts or underwriting, in particular those involving a coverage or undue payment” (Francisco, 2014). Furthermore, article 131-F of Decree-Law no. 2/2009, of January 5, states that insurance companies should have an anti-fraud strategy with prevention, detection and reporting of fraud situations, applied at any stage of the policy’s lifecycle and not only when there is a claim (Brites, 2006; Correia, 2010). Fraud against insurers is not a named crime in the criminal code; however, one can find multiple related articles, such as articles 217 and 218 on simple and qualified swindling, article 219 on insurance-related swindling, article 256 on document falsification, article 258 on technical notation falsification, article 260 on false medical certificates, article 299 on criminal association, and article 366 on crime simulation. Although useful, article 219 does not cover several schemes the criminal can use, such as coercing a third person to present the claim; it does not have a structure identical to the swindling crime and it does not cover damage and physical integrity crimes (Brites, 2006; Correia, 2010; Francisco, 2014).

2.5. ANALYTICAL MODELS AND DATA MINING

In this section, we will discuss the basic notions of some detection algorithms currently used by institutions. Initially, we shall identify the basic steps of the fraud cycle, from fraud detection to fraud prevention – this is the correct order. Most failures in fighting fraud are due to errors in the detection phase; if there are failures here, there is a great probability they will propagate. Fraud detection methods allow diagnosing fraudulent activities, whereas prevention aims to avoid fraud (Baesens et al., 2015; Hartley, 2016). Certainly, one should not work without the other. Firstly, the problem must exist and fraudulent activities must be caught, which happens by applying a myriad of detection methods to the institution’s real data. Then, some suspicions arise according to the company’s risk culture. An investigation must follow, usually by an expert, leading to the confirmation phase: there was a fraud. This is not a straightforward process but more like an (in)finite loop, because when fraud is confirmed some adjustments may be made to the detection model and the new fraud risks are applied to the data. However, this does not compromise the (next) prevention phase, characterised by the avoidance of fraud being committed in the future (Baesens et al., 2015).

Detection failures are repeatedly associated with their input: the company’s data. Tailor-made solutions and difficulties in integrating external facts lead to unreliable and incomplete data, which is the number one problem for risk managers in a financial institution. The company’s database, while adapted to each business, should have the capacity to evaluate potential losses and potential businesses, identify risk sources, point out control weaknesses, and present historical records regarding losses. Instead, detection fails as soon as there is the need to adapt the selection of insureds, policies or even network providers (Gonçalves, 2011; Hartley, 2016; Lesch & Brinkmann, 2012).

There is a multitude of analytical fraud detection methods, which cannot be ranked in isolation; their validation should take into consideration their purpose and the company’s risk culture. However, there are some common indicators one should look for when examining them, such as statistical accuracy, ease of interpretability, operational efficiency, economical cost and flexibility with regulatory compliance. Current literature points out four main types of detection methods: business rules, auditing, social networks, and statistical and data-mining models – these last two are usually connected (Baesens et al., 2015; Shao & Pound, 1999).

Business rules are guidelines and limitations for the company’s processes and status to meet its goals, comply with regulations and process data. They are usually rules for the integration of external-source data, data validation, anomaly detection or trigger warnings regarding abnormal situations, aiming to decrease insurance risk (Baesens et al., 2015; Francisco, 2014; Herbest, 1996). For a better understanding, Baesens et al. (2015) gave some business rules as examples, such as “If the insured requires immediate assistance (e.g., to prevent the development of additional damage), arrangements will be made for a single partial advanced compensation” – but this comes with the potential risk of the partial compensation paid being higher than the real loss (Copeland, Edberg, Panorska, & Wendel, 2012; Vlasselaer et al., 2015).

Auditing concerns the validation of reports to ensure honesty from all stakeholders. This approach is very costly and time-consuming, as experts are hired to review claims case by case. Therefore, it is only applied to small samples (Copeland et al., 2012; Francisco, 2014; Schiller, 2006).

Social networks consist of the implementation of networks connecting suspicious or fraudulent entities by modelling connections between the entities in claims (Baesens et al., 2015; Jans, Van Der Werf, Lybaert, & Vanhoof, 2011). On the pros list, the analysis can be fully automatic and continuously updated. Also, containing not only information on entities but relations between them may contribute to revealing wider fraud plans or criminal organizations (Baesens et al., 2015; SAS, 2012).

Although the last techniques are considered useful in fraud detection, their use is becoming complementary: a simple and cost-effective way to detect fraud in existing claims is through statistical approaches, which are proven to be more efficient than analysing individual claims. These enhance precision (with limited resources and large volumes of information, detection power increases with data-driven techniques) and operational and cost efficiency (data-driven approaches are easily manoeuvred to comply with time constraints without needing too many human resources). Some methods are worth mentioning, such as outlier detection, which can detect fraud different from the historical cases observed (the analysis can be completed with histograms, box plots or z-scores), or clustering (the discovery of abnormal individuals concerning all or some characteristics). To compare groups, one can also use profiling (modelling historical behaviour). Many types of comparisons (real data versus expected values) can also be made through correlation and regression techniques to evaluate datasets (Bolton & Hand, 2002; Copeland et al., 2012; Vlasselaer et al., 2015). Data mining, also known as knowledge discovery, is referenced as the ability to translate data into information using powerful algorithms (based on ever-increasing processing power and storage capacity). Industries value these characteristics as their datasets are becoming increasingly greater both in dimension and in complexity (Bação & Painho, 2003).

Artis, Ayuso, & Guillen (2002) worked on discrete choice models to classify claims as fraudulent or non-fraudulent, which worked in 75% of the cases. Brockett, Derrig, Golden, Levine, & Alpert (2002) proved that fraud feature data have a unique, non-linear scoring model that separates claims. Viaene, Derrig, Baesens, & Dedene (2002) contributed to the claim classification problem. A problem with these approaches is that they rely on red-flag alerts, which generate too many false positives and, consequently, low efficiency and high costs. Fraud has a dynamic nature; it evolves and so should its detection methods (Copeland et al., 2012; Hartley, 2016; Lesch & Brinkmann, 2012). The most common issue found in the fraud detection literature is the volume of false positives. Assume we need a model that correctly identifies 99% of the valid cases as valid and 99% of the fraudulent cases as fraudulent; this means false-negative and false-positive rates of 1%, which seems acceptable. Assume also that the fraud ratio is maintained at 2%. This means that, for every 100 claims flagged as fraudulent, 33 are in fact legitimate. That is a waste of resources and also contributes to a deterioration of the relationship with clients. On the other hand, misclassifying fraudulent cases as legitimate may translate into severe future losses for the company (Paasch, 2008).
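To make the arithmetic behind this example explicit (a worked calculation under the stated assumptions of 10,000 claims, a 2% fraud ratio and 1% error rates):

\text{flagged} = 0.99 \times 200 + 0.01 \times 9800 = 198 + 98 = 296, \qquad \frac{98}{296} \approx 33\%

that is, roughly a third of the flagged claims are legitimate, even with a seemingly excellent classifier.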

Due to this, severe requirements are set for data mining techniques; they are used when there is a well-defined problem not solved by queries and reporting, and their use usually represents a financial advantage and a more efficient use of resources (Gonçalves, 2011; Phua et al., 2010). As fraud is dynamic and fraudsters use diversified swindles, so should detection methods be diverse. Detection techniques should adapt to new data and evolve with it, which is the main advantage of data mining techniques. Data mining is characterized by detecting patterns in databases using statistical and mathematical techniques, machine learning and artificial intelligence, obtaining and analysing information more efficiently (Bação & Painho, 2003; Bolton & Hand, 2002; Ngai, Hu, Wong, Chen, & Sun, 2011).

Kose et al. (2015) differentiate data mining techniques into proactive and reactive. The former are online techniques showing the risk degree and the relevant justification of a claim before even accepting it. The latter detect fraud retrospectively, after accepting the claims. A more common distinction in the literature is between supervised and unsupervised learning techniques. Supervised methods (predictive analytics) learn from historical information to differentiate fraudulent activities. A downside is the necessity of observed data with correctly labelled samples (fraudulent or legitimate claims), which also reduces the power to detect considerably different or new fraud types (Baesens et al., 2015; Bolton & Hand, 2002; Copeland et al., 2012). These methods are used to construct a model which will produce a classification (or score) for new claims (Bolton & Hand, 2002). Supervised methods are currently being tested for fraud detection in healthcare and use technologies like neural networks, decision trees and Bayesian networks (Albashrawi, 2016; Copeland et al., 2012). Unsupervised techniques (descriptive techniques), although learning from historical data, do not need it differentiated between fraudulent and non-fraudulent to find behaviour different from normal. A disadvantage is that they are more prone to deception. Outlier detection, clustering and profiling are usually used to detect abnormal behaviour deviating from a baseline distribution (Baesens et al., 2015; Copeland et al., 2012). Using Benford’s Law2, for example, unsupervised techniques identify potentially fraudulent activities which then need to be validated by experts. The goal is to be more resource-efficient because fewer claims will need investigation. By not requiring labelled data, unsupervised techniques can detect new and evolving types of fraud, but their effectiveness is relatively untested in the literature (Copeland et al., 2012). In short, unsupervised methods perform better at detecting new types of fraud but are more unreliable because they learn from unlabelled data. The two can be used to complement each other. A third type may be worth mentioning: semi-supervised methods use a mixed database in which some claims are labelled and some are not (Baesens et al., 2015; Copeland et al., 2012).
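As a minimal sketch of how such an unsupervised screen might work in practice, the following Python fragment compares the first-digit distribution of a set of claim amounts against the frequencies predicted by Benford’s Law; the function name, the toy amounts and the squared-deviation statistic are illustrative assumptions, not the procedure of any study cited here.

    import math
    from collections import Counter

    def benford_deviation(amounts):
        # First significant digit of each non-zero amount.
        digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a]
        counts = Counter(digits)
        n = len(digits)
        deviation = 0.0
        for d in range(1, 10):
            expected = math.log10(1 + 1 / d)   # Benford frequency of digit d
            observed = counts.get(d, 0) / n
            deviation += (observed - expected) ** 2
        return deviation                        # large value -> flag for expert review

    # Toy claim amounts; a high deviation would route the data set to experts.
    print(benford_deviation([1200.50, 1890.00, 310.75, 990.00, 1475.20]))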

There is a wide range of data mining techniques used in fraud detection. So far, the logistic regression model (which measures the relation between a categorical variable – fraud or not fraud – and the other, independent variables) is the most used in detecting financial fraud (Albashrawi, 2016). Process mining detects fraud by mining event logs from a process perspective (process models which explain the paths followed), an organizational perspective (which explains the people involved and where compliance with the four eye principle3 can be verified) and a case perspective (which explains the action itself) (Jans et al., 2011). Interactive machine learning is characterized by the ability to include experts’ insights directly in the model-building process, which is particularly important if the hypothesis explored is subject to change. Kose et al. (2015) explained the importance of a bottom-up analysis of each individual and his relation with the insurance segment, and of a top-down approach to automate the experts’ method of identifying the relevant evidence. The authors enhanced the efficiency of the model by requiring expert analysis only in the most suspicious cases, although the model alone is not viable for deciding whether a single claim is fraudulent.

2 Benford’s Law (or first-digit law) states that the first significant digit of many data sets follows a known frequency pattern, 1 being more probable than 2, which is more probable than 3, and so on. If the analysed data set does not conform to the predictions of Benford’s Law, there is a possibility of fraudulent activity (Baesens et al., 2015; Copeland et al., 2012; Paasch, 2008).

3 The four eye principle states that two individuals (usually, the CEO and the CFO) must approve the

same action before it can be taken.

The quality of expert feedback should be enough to compensate the lack

of data. In fact, for machine learning techniques to succeed, they require quality training data, which is, as frequently stated in the literature, difficult to obtain. Thus, the interaction between data and experts is significant (Stumpf et al., 2009). Genetic algorithms (GAs) are (global) search techniques aiming to find solutions to optimization problems. GAs follow Darwinian theory4 and always need: a fitness function to evaluate solutions (GAs search from populations, not single individuals); a representation of the solution (the genome); and the set of operations allowed on the genomes. GAs operate on probabilities rather than deterministic rules, so the initial population is randomly chosen and then measured in terms of the objective function, with each individual assigned a probability of being in the next generation. In minimization problems, for example, the highest probability is assigned to the point with the lowest objective function value (Paasch, 2008; Sexton, Dorsey, & Johnson, 1999). Regarding optimization problems, simulated annealing (SA) should also be mentioned. When generating candidates, the main difference from GAs is that SA generates only one candidate solution at a time to be evaluated by the fitness function (Sexton et al., 1999). GAs and SA can (and usually do) cooperate. It is common in the literature to find them incorporated in other techniques like neural networks or support vector machines (SVMs). These are similar but differ in the measure of error used: neural networks often use the root mean square error (minimizing it), while SVMs use the generalization error (minimizing its upper bound) (Bhattacharyya, Jha, Tharakunnel, & Westland, 2011). For a better understanding of artificial neural networks, the next section should be consulted.
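As the paragraph above lists the ingredients a GA always needs, the following minimal sketch puts them together for a toy minimization problem; the objective function, population size and probabilities are illustrative assumptions, not the configuration of any cited study.

    import random

    def fitness(genome):
        # Toy objective to minimize: sum of squares (lower is better).
        return sum(x * x for x in genome)

    def evolve(pop_size=20, genome_len=5, generations=50,
               crossover_p=0.8, mutation_p=0.05):
        # GAs search from populations: start from a random one.
        pop = [[random.uniform(-5, 5) for _ in range(genome_len)]
               for _ in range(pop_size)]
        for _ in range(generations):
            # Selection: lower objective -> higher probability of surviving.
            scores = [1.0 / (1.0 + fitness(g)) for g in pop]
            pop = [random.choices(pop, weights=scores)[0] for _ in range(pop_size)]
            # Crossover: mix parameters from two parent genomes.
            next_pop = []
            for i in range(0, pop_size, 2):
                a, b = pop[i][:], pop[i + 1][:]
                if random.random() < crossover_p:
                    cut = random.randrange(1, genome_len)
                    a[cut:], b[cut:] = b[cut:], a[cut:]
                next_pop += [a, b]
            # Mutation: small probability of replacing a gene at random.
            for g in next_pop:
                for j in range(genome_len):
                    if random.random() < mutation_p:
                        g[j] = random.uniform(-5, 5)
            pop = next_pop
        return min(pop, key=fitness)

    print(evolve())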

In terms of literature, Carminati, Caron, Maggi, Epifani, & Zanero (2014) built a decision support system with both supervised and unsupervised rules aiming to give a probability (of fraud) score. Phua et al. (2010) surveyed the literature on data mining approaches to fraud problems. Kose et al. (2015) developed a framework for healthcare which includes the relation between individuals and claims. He, Graco, & Yao (1998) used genetic algorithms and a k-nearest neighbour approach for fraud detection in medical claims. Welch, Reeves, & Welch (1998) used genetic algorithms to develop a decision support system for auditors. Dharwa & Patel (2011) constructed a model for online transaction fraud using data mining techniques to model the risk.

4 In Darwinian philosophy, stronger individuals adapt and survive while weaker ones do not.

2.6. NEURAL NETWORKS

Neural networks can be seen in one of two ways: first, as a generalization of statistical models; second, as a technique that mimics the functionality of the human brain (Baesens et al., 2015; Ngai et al., 2011). Neural networks learn by example rather than by explicit programming (Mo, Wang, & Niu, 2016). Three elements are particularly important in any artificial neural network model: the structure of the nodes, the topology of the network, and the learning algorithm used to find the weights (Rojas, 1996). The application of neural networks includes two phases: learning and testing. The learning phase refers to the changes in the connections between the neurons necessary to correctly adapt the model to our data and summarise its internal principles, and the testing phase refers to the application of the model built (Guo & Li, 2008; Paasch, 2008).

ANNs are usually used for regression and classification problems such as pattern recognition, forecasting, and data comprehension (Gershenson, 2003). Currently, neural networks are being widely used in credit risk prediction and detection, particularly in online banking, where data has comprehensive complexities. Therefore, cost-sensitive neural networks are being derived based on scoring methods for online banking. It was found that neural networks still outperformed other methods in both accuracy and efficiency, and cost-sensitive learning has proven to be a good solution to the class imbalance problem (Vlasselaer et al., 2015).

These algorithms are known for their remarkable ability to derive meaning from complicated or imprecise data, which can be used to extract patterns and detect trends too complex to be noticed by humans or other computing techniques. They have key characteristics valuable when working with classification problems: they are adaptive (learning techniques and self-organizing rules let them self-adapt to requirements); they can generate robust models (a large number of processing units enhanced by extensive interconnectivity between neurons); the classification process can be modified if new training weights are set; and, lastly, they have the ability to perform tasks involving non-linear relationships (Ngai et al., 2011; Paasch, 2008).

Although relatively subtle, one can find some critiques of ANNs in the literature, such as Paasch (2008), who describes them as excessively complex compared with statistical techniques. As they work like black boxes, they lack interpretability, because some intermediary determinations are meaningless to humans (Paasch, 2008; Vlasselaer et al., 2015). Cases of overfitting are also mentioned in the literature, meaning the model is too specific to the dataset used in training and does not generalise to others. This can happen if the training data is not representative of the sample or is insufficient for a clear representation. Other factors that can make a model vulnerable are poorly chosen network parameters, such as the momentum or learning rate, the wrong topology, or a poor choice of input factors (Mo et al., 2016; Paasch, 2008). Some of these downsides can be addressed by adjusting the ANN using gradient descent or genetic algorithms, for example. This can happen while determining the connection weights or the network topology, or even when defining the learning rules. These ANNs are referred to as innovative and the literature about them is relatively poor. However, Paasch (2008) quoted some studies showing that ANNs adjusted by GAs do not perform better than those trained with backpropagation, for example. A hybrid approach is also discussed: a GA is used to isolate a global minimum (or maximum) and backpropagation is then used in the local search.

The choice of inputs for an ANN is usually determined by experts because there is no common ground on what rules work best. Guo & Li (2008) used adjusted inputs for their model: the inputs could only be 1 or 0. To do this, the authors adapted the available data by transforming each value into a confidence value, whether it was discrete or continuous.

The topology of the neural network can determine the failure or success of the model. However, there are no pre-determined rules to define it: it can be single-layer or multilayer, and it can have feedback loops or several hidden layers. For instance, the multilayer perceptron (MLP) is a standard multilayer feed-forward network (processing feeds forward from the input nodes to the output nodes) with one input layer, one output layer, and one or more hidden layers (Paasch, 2008). Because of the importance of the topology, it would make sense to use a GA to determine it, but research showed that this does not give better results (Paasch, 2008).

In the development of an ANN model, there are multiple decisions that can compromise its performance; experts mostly make these because, currently, there is not enough knowledge to automate them. These choices involve all the parameters necessary for the ANN to learn and perform, for instance, population size, number of generations, mutation and crossover probabilities and the selection method. In terms of hidden layers, the decision concerns not only the number of layers but also the number of neurons. Their role is to derive the characteristics of the inputs, combine them and pass them to the output neurons (Baesens et al., 2015). Guo & Li (2008) suggested using a specific GUI of their system to select the number of hidden layers and the nodes in each layer, ending up with a multilayer network with a smaller average error. In his research, Paasch (2008) used a model with only one hidden layer including 25 hidden neurons. Theoretically, one hidden layer is enough to approximate any function to any desired degree of accuracy on a compact interval. In fact, the vanishing gradient problem states that the chain rule used to update the hidden unit weights has the effect of multiplying n small numbers to compute the gradients of the front layers in an n-layer network. Therefore, the gradient decreases exponentially with n. Consequently, when more than two hidden layers are used, the algorithm becomes slower. Also, a nonlinear transformation in the hidden layer and a linear transformation in the output layer are, academically, recommended. To choose the number of hidden neurons, a trial-and-error process using the data from the training phase is the most suitable. In the literature, the number of hidden neurons usually goes from 6 to 12 (Baesens et al., 2015).
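A small numerical illustration of the vanishing gradient effect, assuming sigmoid activations (whose derivative never exceeds 0.25), shows how quickly the gradient factor reaching the front layers shrinks as layers are added:

    # Each additional sigmoid layer multiplies the gradient by at most 0.25,
    # so the gradient of the front layers decays exponentially with depth.
    MAX_SIGMOID_DERIVATIVE = 0.25
    for n_layers in (1, 2, 4, 8):
        print(n_layers, MAX_SIGMOID_DERIVATIVE ** n_layers)
    # 1 -> 0.25, 2 -> 0.0625, 4 -> 0.00390625, 8 -> 0.0000152...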

Some parameters, like the population size or the number of generations, vary highly according to the characteristics of the data collected. Additionally, a momentum is often added to a neural network by adding a fraction of the previous change to every new change. However, one can see in the literature that a momentum neither allows the neural network to outperform others nor lowers the average error (Paasch, 2008). In terms of the learning rate, higher learning rates mean lower network errors, but a learning rate that is too high may compromise the global search. The learning rate may be set to count down with the number of entries in the training data, or may even change in each layer of the model. These variants are relatively untested, though, the most common choice being a manually fixed learning rate (Guo & Li, 2008; Paasch, 2008).

The fitness function evaluates the features of the data and ranks the potential hypotheses by probability of legitimacy. In the fraud literature, the most used is the sigmoid function, as it gives values in the [0,1] interval, which overlaps with the domain of probability scores (Baesens et al., 2015; Guo & Li, 2008; Paasch, 2008).
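For reference, the logistic sigmoid commonly meant here is

\sigma(x) = \frac{1}{1 + e^{-x}}

which maps any real-valued activation into (0, 1) and can therefore be read directly as a probability-like fraud score.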

The output layer also deserves some attention regarding the number of neurons it should contain. If a more qualitative output is intended (like a fraud score and fraud consequences simultaneously), then several output neurons should be used. However, in the literature, one more often finds the use of one single output neuron. This output shall focus only on giving a fraud score for each claim: scores higher than a pre-defined threshold shall be considered fraudulent and the others legitimate (Baesens et al., 2015; Paasch, 2008).

So far, we highlighted some decision-making mechanisms throughout the steps of neural networks: introducing the input sample (activation of the input neurons); propagating the input to create the output; comparing the network output with the desired one; correcting the network. Now we shall discuss how this propagation works: the definition of the weights is a crucial part of the development of the model. There is no absolute truth on deciding which are the perfect weights, but there are some methods that can help (Kose et al., 2015).

The Delphi method relies on a panel of experts and a director. The director constructs questionnaires for the experts and then summarises their feedback and thoughts; a second round of questionnaires is then built. This way, the panel will converge toward the right answer (Kose et al., 2015). The Rank Order Centroid method is based on a ranking provided by experts, as it is easier to rank attributes than to quantify them. The Ratio method is also based on ranking, but ranked weights are assigned from 10 and in multiples of 10, and then normalised (Kose et al., 2015). In their study, Kose et al. (2015) also used a hybrid solution of these: a binary pairwise comparison method (determining which attributes define fraudulent activities more significantly and then ranking them). As in the last method, the experts analyse the forecasts and reweight until the results are satisfactory.
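As a sketch of the Rank Order Centroid idea, the weight of the attribute ranked i-th out of n is commonly computed with the centroid formula below; this general formula is an assumption here, since Kose et al. (2015) do not fully specify their variant.

    def roc_weights(n):
        # Weight of the attribute ranked i (1 = most important) out of n:
        # w_i = (1/n) * sum_{k=i..n} (1/k); the weights sum to 1.
        return [sum(1.0 / k for k in range(i, n + 1)) / n
                for i in range(1, n + 1)]

    print(roc_weights(4))   # [0.5208..., 0.2708..., 0.1458..., 0.0625]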

One can see in the literature that neural networks which achieve significant results usually have weights estimated by an iterative optimization algorithm after randomly assigning initial weights (Baesens et al., 2015). It is common to use a gradient descent method like backpropagation, or alternatives such as simulated annealing or genetic algorithms. Gradient descent methods use the gradient of the error function to descend the error surface. The gradient is the slope of the error surface, which indicates how sensitive the error is to changes in the weights (Paasch, 2008).

Backpropagation is a procedure used in the learning phase of neural networks to determine the weights, which is particularly hard in the hidden layers since there is no output to compare against. It is a gradient descent method which, after receiving all inputs, derives an output error (Kriesel, 2005). It usually follows these steps:

1. Use the training sample as input to the network.

2. Compare the network’s output with the desired one.

3. Calculate the error of each output neuron.

4. Calculate the local error5 of each neuron.

5. Adjust the weights of each neuron to lower the local error.

6. Assign responsibility (for this error) to the neurons of the previous level, especially the stronger ones.

7. Repeat steps 3, 4, 5 and 6 for the neurons in the previous level.

5 Difference between the desired output of the neuron and the calculated one.

This way, the weights of the network will be updated for the sample of the training set. The update will occur according to the following equation:

\Delta w_{ij}(n) = -\eta_L \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(n-1)

where E is the cost function that measures the network’s error, n is the training epoch, \eta_L is the learning rate for a specific layer, and \alpha is the momentum term (Paasch, 2008).
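A minimal sketch of this update rule for a single weight, assuming the gradient of E with respect to the weight is already available:

    def update_weight(w, grad, prev_delta, learning_rate=0.1, momentum=0.9):
        # One gradient descent step with momentum for a single weight.
        # grad is dE/dw at the current epoch; prev_delta is the change
        # applied in the previous epoch.
        delta = -learning_rate * grad + momentum * prev_delta
        return w + delta, delta

    w, d = 0.5, 0.0
    for grad in (0.8, 0.6, 0.3):     # illustrative gradient sequence
        w, d = update_weight(w, grad, d)
    print(w)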

Backpropagation often finds the solution of the optimization problem in the neighbourhood of its starting point, as it is designed for local search; thus, the definition of its starting value is extremely important. The downside of local search algorithms is the probability of finding a local minimum rather than the global one. To minimize this, some studies in the literature have used modifications of backpropagation with more advanced gradient descent techniques such as Delta-Bar-Delta learning, the Steepest Descent method, Quickprop, the Gauss-Newton method and the Levenberg-Marquardt method (Guo & Li, 2008; Paasch, 2008; Sexton et al., 1999).

Other local search algorithms have also been used to compete with backpropagation, such as tabu search. Tabu search allows the acceptance of worse weights if no improving move is available. If a potential solution has been previously considered, it is marked as "tabu" (forbidden) so that the algorithm does not consider that possibility again. Sexton et al. (1999) found superior solutions when optimizing the neural network with this method rather than with backpropagation. Simulated annealing may be a solution to avoid local minima by adding a user-determined random value to the weights and keeping track of the resulting error, comparing it with the initial point. A difficulty of SA is that there are no rules for adjusting the configuration, resulting in very irregular estimates which diverge significantly on out-of-sample data. This is due to the fact that the SA algorithm has an inclination to choose large values in absolute terms (negative or positive) (Paasch, 2008; Sexton et al., 1999). Genetic algorithms have also proven to be useful in neural network training: the phenotype may be the set of weights of the network and the genotype is the genome representation of these weights. The initial weights are chosen randomly, with the probability of selection equal to the assigned probability value (Paasch, 2008; Sexton et al., 1999). Thus, the weights generating the lowest sum of squared errors are the most likely to be represented in the new population. Once all points have been assigned a probability, a new population of points is drawn from the original population. The weights in the new population are then randomly paired for the crossover operation, which results in each new point having parameters from both parent points. Mutation follows: each weight has a small probability of being replaced with a value randomly chosen from the parameter space (Paasch, 2008; Sexton et al., 1999). Paasch (2008) studied in depth the use of GAs in neural networks, reaching some interesting conclusions about the most suitable GA parameters to choose. In terms of selection (the probability of a weight continuing in the population), the best results were achieved using rank order selection – the weights are ranked and the probability of remaining in the population is a function of their position in the ranking. In terms of crossover probability – the probability of mixing two weights to keep them in the population – the

author found no impact on the results. Regarding elitism – copying a small proportion of the fittest candidates, unchanged, into the next generation – it was expected to make a difference in terms of time, because it would avoid wasting time rediscovering weights already determined. Despite the fact that it reduced the network error, the conclusion was that it does not make a significant difference. Sexton et al. (1999) used backpropagation, simulated annealing and genetic algorithms to determine weights in ANNs and concluded, in their research, that GAs outperform the other two.

Noise is characterised as random error or variance in featured data. Most real datasets have noise, even if it is not known, as they are usually archived across several databases. When training an ANN, noise can also emerge from the random allocation of initial weights, which may produce different results for different weights (Kalapanidas, Avouris, Cracium, & Niagu, 2003). As a solution, Paasch (2008) suggests using the average of the results of several training runs. The weights depend on the chosen algorithm (and, consequently, on the required inputs), so this may also impact noise. The ultimate consequence of noise is that it may propagate through the network and can even be modelled by the training algorithm. If this happens, the model may become too dependent on the training set and not perform well on the validation set: this is called overfitting. To overcome this, Baesens et al. (2015) suggest keeping the weights small in an absolute sense (which avoids fitting the noise) and adding a weight-size term to the objective function.
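A common form of such a weight-size penalty (a sketch; the exact term used by Baesens et al. (2015) may differ) is the L2-regularized objective

E_{\text{reg}} = E + \lambda \sum_{i,j} w_{ij}^{2}

where \lambda controls how strongly large weights are punished.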

Figure 5 - Overfitting

A frequent question at this point is when to stop the training phase. It happens when the user is satisfied with the resulting error. Other stop criteria are to stop the training when the change in the weights is less than a specified threshold or when the number of iterations reaches a predetermined number (Guo & Li, 2008).
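These stop criteria can be sketched as a simple loop guard; the threshold and budget values below are illustrative assumptions:

    MAX_ITERATIONS = 10000
    WEIGHT_CHANGE_THRESHOLD = 1e-6
    TARGET_ERROR = 0.01

    def should_stop(iteration, weight_change, error):
        # Stop when the user-accepted error is reached, the weight change
        # falls below a threshold, or the iteration budget is exhausted.
        return (error <= TARGET_ERROR
                or weight_change < WEIGHT_CHANGE_THRESHOLD
                or iteration >= MAX_ITERATIONS)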

It is common to find researchers in the literature discussing the overly complex processes of neural networks and how difficult it is to understand their behaviour. A way to soothe this is with a Hinton diagram. It allows visualising the weights between the inputs and the hidden neurons as squares, the size of each square being proportional to the magnitude of the weight and its colour representing the sign of the weight. However, when few weights are connected to a variable in the hidden neurons, it is not very useful (Baesens et al., 2015).

Another option to understand how the ANN works is the backward variable selection procedure, which is characterised by removing each variable in turn and re-estimating the network. This will give N networks, each having N – 1 variables. The process allows understanding which variables are significant to the network and which are not. A reported problem with this procedure is that it still does not allow understanding the relation between inputs and outputs, which can be explored by decompositional and pedagogical techniques (Baesens et al., 2015).
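A sketch of the backward selection loop, assuming hypothetical helpers train(...) and validation_error(...) for fitting and scoring a network (they are not part of any library mentioned here):

    def backward_variable_selection(data, variables, train, validation_error):
        # Re-estimate the network N times, each time leaving one of the
        # N variables out, and record the error of each reduced model.
        results = {}
        for v in variables:
            reduced = [x for x in variables if x != v]   # N - 1 variables
            model = train(data, reduced)
            results[v] = validation_error(model, data, reduced)
        # Variables whose removal barely changes the error are candidates
        # to be dropped as non-significant.
        return results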

Figure 6 - Hinton Diagram

In terms of the output of the network (the results of the model), one should analyse the confusion matrix, that is, the volumes of misguided classifications. Let TP be the percentage of true positives (legitimate claims classified as legitimate), FP the percentage of false positives (fraudulent claims classified as legitimate), FN the percentage of false negatives (legitimate claims classified as fraudulent) and TN the percentage of true negatives (fraudulent claims classified as fraudulent); note that, in this convention, the legitimate claim is the positive class (Guo & Li, 2008). The sensitivity of the network is measured by the volume of true positives and the specificity is quantified by the amount of true negatives (Guo & Li, 2008).
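Using the convention above (legitimate claim = positive class), sensitivity and specificity follow directly from the four counts, as in this sketch with illustrative numbers:

    def sensitivity_specificity(tp, fp, fn, tn):
        # tp: legitimate classified legitimate; fn: legitimate classified fraudulent
        # tn: fraudulent classified fraudulent; fp: fraudulent classified legitimate
        sensitivity = tp / (tp + fn)   # share of legitimate claims recognised
        specificity = tn / (tn + fp)   # share of fraudulent claims recognised
        return sensitivity, specificity

    # Counts consistent with the 99%/99% example given earlier (10,000 claims).
    print(sensitivity_specificity(tp=9702, fp=2, fn=98, tn=198))   # (0.99, 0.99)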

In the literature, there are some studies using ANNs or modifications of them. Vlasselaer et al. (2015) used network-based extensions to study fraud in credit card transactions. On the same subject, Ghosh & Reilly (1994) improved their model using neural networks. One more method using neural networks (auto-associative, in this case) is CARDWATCH, a database mining system used for credit card fraud detection, which provides an interface to a variety of commercial databases and has a comfortable graphical user interface (Aleskerov, 1997). Guo & Li (2008) adapted a neural network by adjusting its inputs: they transformed the data into confidence values before introducing them into the network, which allowed them to have a smaller error. Copeland et al. (2012) used a two-class neural network to improve fraud detection in healthcare insurance, and Brockett, Xia, & Derrig (1998) did the same for automobile bodily injury claims. Kose et al. (2015) studied the relation between the participants in insurance fraud (attributes, claims and individuals) using ANNs and Brause et al. (1999) maximised the confidence of fraud decisions with them.

3. METHODOLOGY

3.1. RESEARCH METHODOLOGY

In this work, the paradigm applied is constructivism, where knowledge is constructed, not discovered; more accurately, interpretivism, since this theoretical perspective sees the world as too complex to reduce to a few laws. We will derive from our data the knowledge necessary to reach meaningful conclusions and then apply it to other data. The purpose is to have all the information regarding the data collected: the variables and the results (fraud/no fraud) – the training set. From here we will construct an algorithm that replicates these results. This approach is known as qualitative induction. From the data, we will develop a process and test it with all samples of the data; if it does not work for some samples, then it will be discarded and a new one will be developed (or the last one will be improved). This induction process is rigorous: it does not choose the samples that fulfil the hypothesis; it chooses deviant cases to try to disprove the algorithm. However, some acceptance threshold must be in place in order not to discard too many modifications; otherwise, we could end up with one algorithm that works perfectly for the collected data but is so restrictive that it has no application to other work (Gray, 2014). The strategy for the data is based on grounded theory, as we will compare the samples of our data with each other through the algorithms and derive new ones from them.

The data collection process is difficult due to security, privacy and cost issues, so most researchers generate synthetic data (Guo & Li, 2008). This was not an option for this work. The object of this study is a sample of a database containing both fraudulent and non-fraudulent claims belonging to a Portuguese insurance company.

Once the data is collected, it needs investigation: we start by conducting an initial exploratory analysis. The next stage consists of constructing and improving the procedure to get the results. We make comparisons with other processes already in place to detect fraud and with other methods using neural networks. We need to find the weights of the network that produce the desired output for each sample of data, which translates into using supervised learning. There are several algorithms for this (the training process), so we choose from the documented ones with the best feedback and test them with our data. In parallel with the learning process, we decide whether to use layered networks and how many neurons per layer; the activation function also needs to be set, although we already anticipate testing the sigmoid function, which is frequently cited in the literature for its efficiency (it is very close to one for large positive numbers and very close to zero for large negative numbers).

Error also intervenes in our methodology, as it should be considered from the data collection stage onwards. Let the Positive Error (PE) be the difference between a neural network’s output and the ideal output for genuine claims, focusing on those genuine claims mistakenly classified as fraud, and the Negative Error (NE) be the same difference but focusing on those fraudulent claims mistakenly classified as genuine. For fraud detection, the loss from NE is far more expensive than that from PE. Up-to-date studies also show that neural networks should treat NE as more costly (higher impact) than PE. With our model, we aim to have high accuracy, a high detection rate, and a low false positive rate (Wei et al., 2013).
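A sketch of a cost-sensitive error along these lines; the 5:1 cost ratio is an illustrative assumption, not a figure from Wei et al. (2013):

    NE_COST = 5.0   # missing a fraud (negative error) is weighted more heavily...
    PE_COST = 1.0   # ...than flagging a genuine claim (positive error)

    def cost_sensitive_error(outputs, labels):
        # labels: 1 = fraud, 0 = genuine; outputs: network scores in [0, 1].
        total = 0.0
        for out, lab in zip(outputs, labels):
            diff = (out - lab) ** 2
            total += NE_COST * diff if lab == 1 else PE_COST * diff
        return total / len(labels)

    print(cost_sensitive_error([0.2, 0.9, 0.4], [1, 1, 0]))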

3.2. DATA COLLECTION PROCESS

The data preparation process takes, on average, roughly 80% of the project time (Guo & Li, 2008). For some, this value can reach 85%, which is common when the data to prepare is stored across many warehouses. Our data is split across several channels: the claim system, the policy and entity system, and some fraud detection systems compiled in the company’s statistics system (Copeland et al., 2012).

Figure 7 - Data treatment

In this case, the data collection process is time-consuming mainly because there is no unified database. For claims, there is Claim Software 1, where we can find information on contacts between the company and the claimant, the claimant’s clinical status, the opinions of the company’s experts (doctors, nurses and managers) and the payments in course. The information on policies and entities is contained in Production Software 2; here we find the history of the insurer-insured relation: the premiums paid, the insured people, the several policies arranged, the number of claims, the policyholder’s contacts and so on. There is another platform, Archive Software 3, where there are archives of side investigations, like preliminary inquiries and fraud investigations. These are made by a team of experts independent of the team working on the claim, in order to obtain impartial statements. Note that in Production Software 2 we can find the claims of a certain policy or entity, whereas in Claim Software 1 we cannot get those details; if there was a fraud investigation, we can find the report in Archive Software 3. All these different platforms make the data collection process significantly more difficult. At the time this dissertation is being written, there is a team developing a way to access the archives (of Archive Software 3) through Claim Software 1, but there is no estimated date for its completion.

Figure 8 - Software Connections

3.2.1. Meetings

In October 2016, there was an initial meeting with a company expert (from now on, Bus. Emp. 1) about data availability and accuracy. In November 2016, an interview took place with a data collection expert (from now on, Stat. Emp. 2), who explained how data is usually collected and divided. Accident data do not have a fraud/non-fraud attribute; they do have a reason justifying the closing of a process (in the claims system), which may be fraud. The expert mentioned some reasons which could identify suspicious claims:

• Fraud/False statements

• Policy non-existent

• Work accident non-existent

• Causal nexus non-existent

• Employee not covered

Stat. Emp. 2 explained that I should meet with a claim expert (from now on, Cla. Emp. 7) to figure out which reasons, and in what percentage, are more related to fraud. A third meeting was held with Bus. Emp. 4 to discuss clearings (permissions) to obtain the data, and another one to introduce Stat. Emp. 6, the supervisor of Stat. Emp. 2, who showed interest in helping with data accuracy. Stat. Emp. 6 and Stat. Emp. 2’s department is the one focused on the statistics of the whole company, so they can help by giving information about the processes in use by all segments. They expressed their interest in giving a small presentation about fraud in the company and in explaining how Risk Software 5 works; it is the fraud detection system in place in the motor segment.

Page 39: Application of neural networks to the detection of …algorithms for example). This supports the industrialization of neural networks in the current days (Douzas & Bação, 2017) 1.2.

37

The interview with Bus. Emp. 1 and Bus. Emp. 3 was about data collection, more specifically its process. They explained that it is not viable to ask for all the attributes and then select the useful ones. A preliminary scan shall be done so as not to waste resources getting irrelevant data. The process arranged was that we should get familiar with Claim Software 1 (the output system for claims) and Production Software 2 (the output system for policies and entities) to understand what (kind of) attributes are more likely to be useful. After that, we shall give them a list to see if there is (resource) availability to extract them. Bus. Emp. 1 pointed out that, before meeting with Cla. Emp. 7, one should become acquainted with Claim Software 1, thus understanding more about closing reasons and comments, and about which attributes can give hints on what triggered the manager to classify the claim as fraud.

Focusing on the statistical point of view, another meeting was scheduled with Stat. Emp. 2 and Stat. Emp. 6. Both belong to the Statistical Studies Department, at different hierarchical positions: Stat. Emp. 2’s key role in this meeting was to give details on some ongoing projects and some data specifics. Stat. Emp. 6 has a broader role in the company, so he has more information on projects across several segments and on the company’s strategies for the future.

The meeting started with a description, by Stat. Emp. 6, of the approaches the company is taking. There are prevention and detection strategies in place: some more structural (organizational and cultural) and others more analytical (pattern analysis and toxic clients’ history). In this last category, some approaches were mentioned, such as business rules (informally, some were given, for instance: insured amount, unusual patterns, cost of past claims, purpose, and alerts by different methods) and predictive methods – in this field the company is using advanced analytics (using a database with usual, suspect and fraudulent claims). About networks, some considerations were also made: the company has used (and still does, in some projects) social networks but has come across some difficulties concerning which type should be used: the cost of investigating this may be too high and, in the end, there is no proof it will work. Additionally, databases were discussed due to the specificities they should have to fit the networks.

A macro vision of the company was debated: each department has its own rules, processes and procedures, which makes the industrialization of fraud prevention and detection methods very difficult. Currently in development is a project which consists of merging the central fraud records into one single repository. Another ongoing project is to merge certain databases to create alerts not only on entities (but also on vehicle registrations).

A different database, unknown until then, was introduced – Risk Software 4 – which has data about operational risk (where fraud is included). This database is the property of another department (the risk management department) and, when interest was shown in getting familiar with it, some barriers arose: it is too extensive and classified to analyse – they suggested using the other databases (which are the sources from which Risk Software 4 gets its information). Furthermore, another piece of good news was released: there is an (analytical) database for archives (namely the fraud investigation reports) with reasonable information (which in Archive Software 3 was only available in report (PDF) form).

Another hot topic was Risk Software 5, the software based on scoring models used for fraud detection in the automobile segment. Its purchase was an investment and the company is still perfecting its usage with the developers. It is being used as a control and, if it matches the

expectations, it will be implemented in other segments. In fact, the workers’ compensation segment is planned to implement Risk Software 5 if all works as expected.

Furthermore, the discussion continued; some questions were answered and some debates took place. It is worth mentioning that Claim Software 1 has alerts for when a claim is made on a policy with fraud-related past issues. There are projects in place to implement information transfers and other types of alerts (Risk Software 5 also demands this). At this point, a very interesting debate took place on the definition of fraud, which, as mentioned in previous sections, can be very different from distinct standpoints.

Concluding the meeting, Stat. Emp. 6 gave some insights for this project. He highlighted the study of the impact of the model in progress – the impact study will determine whether the company will use it or not. The accuracy of the model is also of high importance (the company wants to mark the right cases), as this has been hard to achieve in other models in use. Some calculations were suggested: the confusion matrix should be produced and analysed; the expected claim cost should be analysed (comparing provisions with what is being paid; note also that, outside the moments when provisions are made, there is no theoretical value to use) to help in the calculation of probability; that is, a case is considered fraud not only when it is probable (above a percentage threshold) but when it is probable and significant (above an amount). Legal restrictions were also suggested to be analysed.

Although Risk Software 5 was discussed, there was no time to examine it, so another meeting (with Stat. Emp. 2) was planned.

A final appointment was held with the director of the work-related accidents department to present the project and clarify permissions on data collection. The anonymity of the data and of the company was also a subject on the table: a decision was made not to release the name of the company to the public, while (anonymised) data is made available for the project. An interesting side note was that the director mentioned his participation, some years ago, as a member of a fraud committee of APS, so full availability was promised, as fraud is a subject of interest not only for both parties but also for the whole segment.

Lastly, in December there was a meeting with Stat. Emp 2 about Risk software 5, fraud detection software used by the company that is based on business rules; with these, it compiles cases which must be further analysed by the experts. Defining the business rules is time-consuming, and they evolve over time: the more cases are analysed, the greater the expertise that drives upgrades to the rules. The rules should be broad enough to include all possible participants. Each rule is assigned a KFI (key fraud indicator) and a threshold. A case is highlighted for analysis if its overall score exceeds the predetermined value; if its KFI for any single rule exceeds that rule's threshold; or if its combination of KFIs for two rules exceeds a predetermined value (even if each is below its individual threshold). Once a case (highlighted for analysis) is being investigated, it is possible to construct a network of that participant, which is the way to detect non-isolated fraudulent claims. A downside of this software is that, besides needing the experts to define the business rules, KFIs and thresholds, it always needs them in the investigation phase as well. It also carries a known difficulty of social network analysis, the stopping criterion: when should the expert stop before deciding there is no fraud?
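A minimal sketch of the three highlighting conditions may clarify the mechanism; the rule names, KFI scores and thresholds below are hypothetical illustrations, not the actual configuration of Risk software 5.

```python
# Hypothetical rules: each maps to its individual KFI threshold.
RULES = {"late_report": 0.6, "no_witnesses": 0.5, "monday_accident": 0.4}
OVERALL_THRESHOLD = 1.2   # threshold on the summed score of all rules
PAIR_THRESHOLD = 0.8      # threshold on the combination of any two KFIs

def highlight(case_kfis: dict) -> bool:
    """case_kfis maps rule name -> KFI score for one claim."""
    # Condition 1: overall score above the predetermined value.
    if sum(case_kfis.values()) > OVERALL_THRESHOLD:
        return True
    # Condition 2: any single rule above its own threshold.
    if any(score > RULES[rule] for rule, score in case_kfis.items()):
        return True
    # Condition 3: any pair of KFIs jointly above the pair threshold,
    # even if each one is below its individual threshold.
    scores = list(case_kfis.values())
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if scores[i] + scores[j] > PAIR_THRESHOLD:
                return True
    return False
```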


3.2.2. Extracting data

Although the meetings with the experts had started in October, it was only possible to have the first sample extraction in March. From that point on, a cycle of extraction, error detection and new extraction was set in place. The first extraction had about 400 variables. Yet, some of them were not clear (not all had descriptions) and the quality of others was questioned. Lastly, some variables (which had been discussed in the meetings) were not present. The reason, it was later discovered, was that including them would duplicate cases. So, some time was dedicated to discussing how to approach these variables. The problem was that a case could have several values for such a variable, which would duplicate the process as many times as the variable had distinct values. The solution was to do a parallel extraction just for those variables, resulting in multiple variables instead of multiple (duplicated) cases. The following extraction had about 384 variables. A different problem that had to be addressed concerned an identifying variable of the individual that was not present in some processes. Because of this, some variables could not be extracted (there was no variable to link to the other data warehouse), so they are missing from the database. Given these constraints, the next extraction had 410 variables. A distinct problem discussed was the presence of cases from the insurer itself. Workers' compensation insurance is, in fact, mandatory for all companies. A decision was made to disregard the company's own processes (claims) because the most relevant variables regarding these claims, like age or income, are masked in the company's databases due to confidentiality issues. An intense series of (brief) meetings then started, to clarify questions arising from the analysis of the data: data quality concerns or simply the clarification of variables. The resulting sample extraction had 292 variables. A final problem arose: the database was too large to extract. The solution was to split the data into several databases and then extract them. The final file had 292 variables and about 50 000 claims.
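The parallel-extraction idea can be illustrated with a minimal pandas sketch, assuming a hypothetical multi-valued variable (here called injury_code):

```python
import pandas as pd

# Hypothetical long-format output of the parallel extraction: one row
# per (claim, value) pair for a multi-valued variable.
long = pd.DataFrame({"claim_id": [1, 1, 2, 2, 2],
                     "injury_code": ["A", "B", "A", "C", "D"]})

# Rank the values within each claim, then pivot so each claim keeps a
# single row with several injury_code_* columns instead of duplicate rows.
long["rank"] = long.groupby("claim_id").cumcount()
wide = long.pivot(index="claim_id", columns="rank", values="injury_code")
wide.columns = [f"injury_code_{i}" for i in wide.columns]
```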

3.2.3. Data Transformation

An initial data cleansing was performed by deleting some artificial claims. These are not system errors; they are claims intentionally created to perform various actions (a payment, a test, the inclusion of an individual) that are not possible within the real claim. Through several filters, it was possible to identify 161 claims that should not be considered. After this process, one could start adjusting the final database.

It is known in the literature that, for the type of analysis we want to prepare for, numeric variables work better. Therefore, we started converting the variables into numeric form by dividing them into classes. A preliminary activity was to identify the most common values of a variable, which allowed setting preliminary intervals. In some cases (e.g. if there were doubts about data quality) the company's experts were consulted to confirm the intervals (they had a clearer view of what the normal interval setting would be). This activity was time-consuming because the data table had variables in different formats: number, text and date. Furthermore, the variables in text format could be real text (written words) or numbers saved as text.

A brief case study is the transformation of the variable holding the total number of treatment days in a process. This variable was saved in text format, so no plot could be drawn. The first step was to convert it to a number in Excel (exporting, converting, importing). Then, one can see the distribution of the values (annex 1). One could state that the most common values were under 45 days. But how common? And why? These conclusions need feedback from experts. The variable was finally divided into 5 classes: missing value, low value (0-59), normal value (60-364), high value (365-729) and excessive (730 or more). This way, there is now a variable with 5 different values (saved in number format) usable in the next analysis.
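A minimal pandas sketch of this discretisation, with a hypothetical column name, could look as follows:

```python
import pandas as pd

# Hypothetical treatment-days column, already converted from text.
df = pd.DataFrame({"treatment_days": [12, 80, 400, 900, None]})

# Bin into the five classes described above; missing values get class 0.
classes = pd.cut(df["treatment_days"],
                 bins=[-1, 59, 364, 729, float("inf")],
                 labels=[1, 2, 3, 4])   # low, normal, high, excessive
df["treatment_days_class"] = classes.cat.add_categories([0]).fillna(0)
```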

Another interesting analysis concerned the variables saved in date format. We shall recap the transformation of the variable holding the date of reception of the accident certificate. A correlation analysis (annex 2) confirmed the high correlation between the date of the accident and the date of reception of the certificate. So, a more interesting variable to analyse could be the difference between these two dates (in days). This rationale was used for some other date variables and run by the team. The new variable (in number format) can then be divided into classes according to the number of days elapsed (in this case, 5 intervals were considered).
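For illustration, such a date-difference feature could be engineered as in the following sketch (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical pair of date columns for a few claims.
df = pd.DataFrame({
    "accident_date":              pd.to_datetime(["2015-01-10", "2015-06-01"]),
    "certificate_reception_date": pd.to_datetime(["2015-01-15", "2015-08-20"]),
})

# Replace the two correlated dates by their difference in days,
# then discretise the new numeric variable into classes.
df["certificate_delay_days"] = (df["certificate_reception_date"]
                                - df["accident_date"]).dt.days
df["certificate_delay_class"] = pd.cut(df["certificate_delay_days"],
                                       bins=5, labels=False)
```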

Some variables were discarded after analysis. For instance, the variable holding the number of days the individual was absent from work (and for which the insurer had to indemnify) was too correlated with the variable holding the number of days of incapacity the individual had due to the accident. Thus, by performing several correlation analyses, some variables were disregarded. Also, variables that merely identified the claim (used to join databases and define claims) were discarded.

At this point, the data cleaning process was thought to be complete. However, duplicated cases were found to still exist in the database. This was due to some variables that could have different values for the same claim. For example, the claimant's age group changed over the years, so new values were added. Also, the district could change (an individual can have different addresses in the system). After analysis, some variables were adapted to show only the last value (like the age group) and some were removed. This was a time-consuming process but allowed the database to be reduced and enhanced. At this point, there are 125 variables whose format is (almost exclusively) numeric and which are divided into classes (predominantly, fewer than 10 classes per variable).
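Keeping only the last value per claim can be sketched as follows, assuming a hypothetical history table:

```python
import pandas as pd

# Hypothetical history with repeated claims whose attributes changed.
history = pd.DataFrame({
    "claim_id":      [1, 1, 2],
    "snapshot_date": pd.to_datetime(["2015-02-01", "2015-09-01", "2015-03-15"]),
    "age_group":     ["30-39", "40-49", "20-29"],
})

# Keep only the last observed value per claim, collapsing duplicates
# back to a single row per claim.
deduplicated = (history.sort_values(["claim_id", "snapshot_date"])
                       .groupby("claim_id", as_index=False).last())
```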

3.3. DATA OVERVIEW

Due to time and availability constraints, the meeting with Cla. Emp. 7 had to be postponed. However, it was possible to meet with Cla. Emp. 8, a very experienced claims manager used to working with fraudulent claims in the business. With him, it was possible to have a preliminary discussion about the fraud criteria. When a claim is closed, it is necessary to record a closure motive. There is one highly related to fraud: "Fraud/False statements". However, not all fraudulent claims have this as their closure motive. So, despite the fact that this motive is related to fraud, it is just one of several rules used to identify a claim as fraudulent or not.

The first step was to identify all claims (22) with this closure motive and mark them fraudulent. The remaining claims were divided by their structure, i.e., self-employed workers and dependent workers. 50% of the self-employed claims (a randomly generated sample) were marked fraudulent and the remaining non-fraudulent. Another closure motive was used, as it has often been used to close fraudulent claims: policy inexistence. 10% of the remaining claims (a random sample) with this closure motive were marked fraudulent.

Using these rules, about 1600 claims were considered fraudulent, which corresponds to nearly 3% fraudulent claims. Consult annex 3 for the target construction overview. This is a more realistic sample than if we had considered only the claims with fraud as closure motive (which would give 0.045% fraudulent claims).
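A minimal sketch of this target construction, with hypothetical column names and motive labels, is shown below:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical claims table; labels are illustrative, not the real schema.
df = pd.DataFrame({
    "closure_motive": rng.choice(
        ["Fraud/False statements", "Policy inexistence", "Other"],
        size=1000, p=[0.01, 0.05, 0.94]),
    "structure": rng.choice(["self-employed", "dependent"],
                            size=1000, p=[0.1, 0.9]),
})

df["fraud"] = 0
# Rule 1: closure motive "Fraud/False statements" -> fraudulent.
df.loc[df["closure_motive"] == "Fraud/False statements", "fraud"] = 1
# Rule 2: a 50% random sample of the remaining self-employed claims.
self_emp = df.index[(df["fraud"] == 0) & (df["structure"] == "self-employed")]
df.loc[rng.choice(self_emp, size=len(self_emp) // 2, replace=False), "fraud"] = 1
# Rule 3: a 10% random sample of remaining claims closed for policy inexistence.
no_policy = df.index[(df["fraud"] == 0)
                     & (df["closure_motive"] == "Policy inexistence")]
df.loc[rng.choice(no_policy, size=len(no_policy) // 10, replace=False), "fraud"] = 1
```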

Now, with the Fraud variable engineered, there is a target variable, so one can use supervised learning. A Suspicion variable was also constructed, although it was not used in the model because it would be too correlated with Fraud. Its rationale was built upon the inquiry reports. Claims with an inquiry (except simple diligence or payment-related ones) were considered suspect. In addition, a file provided by P2, containing cases with a preliminary inquiry request, was also used to mark cases as suspect. Lastly, cases that had a fraud-related red flag6 (either on the claimant or the policyholder) were also considered suspect.

3.4. ASSUMPTIONS

The first decision made was which data to collect. After all the meetings with the team, it was clear that a process of cleaning, organizing and enhancing the quality of the database was underway. 2015 was a year when data quality was more reliable and when systems/products became consistent. Therefore, data from before 2015 was not recommended. One should also not analyse too recent data, as many changes can happen in a claim (even if closed) during the following year. Thereupon, the assumption for collecting data was that the year of the accident was 2015 (independently of the date the claim was opened in the system).

Perhaps the most important assumption used in this project is the target variable: fraud. It was built (see Data Overview) and not extracted, although independently and before any model was tested or even considered. This could misrepresent the data and distort conclusions. The method used to construct fraud, although suggested by the claims expert, may not be the most suitable. Recall the rationale for constructing the fraud variable in annex 3.

The database was based on claims, so the claim ID was a key indicator. Thus, claims with no ID were removed. The company's own claims were all removed from the database, which may also distort the numbers; operational risk is real, and there can also be fraud within the group's employers. Lastly, the administrative claims were removed (although they are related to real claims).

3.5. MODEL

The next phase relates to the model itself. Considering the data characteristics mentioned above and the algorithms' details stated in the Literature Review, we must choose the best-suited model possible. The plan includes testing several neural networks with different characteristics and reaching conclusions based on the results.

The software used for this was SAS Enterprise Miner. It offers several advantages, such as a flow-based drag-and-drop interface and good handling of large datasets. Other candidates were R (an early adopter in explanatory and predictive modelling, but it can be slow with big datasets) and Python (easy to learn, but with no user interface and no official support). Yet, the true reason this software was chosen is that it is the data mining software used by the insurance company. This way, one can build a model using software that the final users trust and comprehend. The data was also compiled in SAS tables, so gathering the data in this template made sense.

6 A red flag is an alert in Production software 2 about some irregularities in the entity.


A total of 21 models were tested, differing in the properties pruned and in the number of variables used. The models could have 124 variables (the total), 9 variables (selected by applying a series of best practices regarding data transformation from the SAS literature) or 64 variables (selected by the experts based on experience). Four properties were pruned: the number of hidden units (neurons of the hidden layer), which could be 3, 15 or 50; the target layer activation function, which could be the default one or the logistic; the training technique, which could be the default one or Backprop; and the use (or not) of preliminary training. Lastly, a final model was tested, based on a very famous article by LeCun, Bottou, B. Orr, & Muller (1998), which translated into adapting the learning rate and the momentum and using the tanh activation function.
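The resulting experimental grid (3 variable sets, each with the control plus six adjusted variants, assuming the listing above) can be enumerated as in this sketch:

```python
# Hypothetical enumeration of the 21 tested configurations:
# 3 variable sets x 7 model variants (control plus six adjustments).
VARIABLE_SETS = {"all": 124, "best_practices": 9, "expert": 64}
VARIANTS = ["control", "hidden_15", "hidden_50", "logistic_target",
            "backprop", "no_preliminary", "article_based"]

experiments = [(vs, variant) for vs in VARIABLE_SETS for variant in VARIANTS]
assert len(experiments) == 21
```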


4. RESULTS AND DISCUSSION

Following the rationale explained in the methodology, a variety of models was developed. Two characteristics were taken into consideration: firstly, the number of input variables of the model; secondly, the properties of the model itself.

As previously mentioned, at this point the database has 124 variables (plus the target), so one of the models had all of these as inputs. According to the SAS Enterprise Miner literature, there are some best practices for selecting the input variables of neural networks, so another model was developed using these. Finally, a third model had expert-recommended variables as inputs (approximately half of the original ones).

As explained in section 2.6, there are several properties of a neural network that can be pruned. This led to tests regarding the number of hidden units allowed, the activation function, the training technique and the preliminary training. The last model was constructed based on the article by LeCun, Bottou, B. Orr, & Muller (1998), which suggests a myriad of adjustments to the model.

A decision was made to use the multilayer perceptron architecture rather than the common perceptron. The addition of a hidden layer, which the input layer feeds into instead of connecting directly to the output layer, makes possible the generation of nonlinear input-output mappings.7
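A minimal sketch of such a multilayer perceptron (an illustration, not the SAS Enterprise Miner implementation) shows how the hidden layer enables nonlinear mappings; the dimensions are hypothetical:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Minimal multilayer perceptron: one hidden tanh layer feeding a
    logistic output, which enables nonlinear input-output mappings."""
    hidden = np.tanh(x @ W1 + b1)           # hidden layer (e.g. 3 units)
    logit = hidden @ W2 + b2                # target layer net input
    return 1.0 / (1.0 + np.exp(-logit))     # fraud probability

# Hypothetical dimensions: 124 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(124, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
p = mlp_forward(rng.normal(size=(5, 124)), W1, b1, W2, b2)
```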

The first model tested was the control: the 124 variables were used and no modifications were made to the standard properties of the model. With 61 007 estimated weights and an average square error of 0.030469, this model is used for comparison with the others tested with different properties (but the same number of variables). This control model has a relatively high misclassification rate (0.032597), which is not encouraging; nonetheless, it has the greatest cumulative % response of the group (9.35), highlighting its performance in predicting the probability of response. Some of the outputs of this model can be found in Annex 4, like the cumulative lift curve, which provides a comparison of how much better each model performs with regard to the average captured response. In the top decile (depth=10), the model is a 2.796 times better predictor than the response obtained if no model were used.

The following step was to start pruning some properties of the model according to the neural networks literature, while still using the same input variables. The first property adjusted was the number of hidden units, i.e., the number of elements of the hidden layer. In SAS Enterprise Miner, the permissible values are between 1 and 64. In the control model, the number of hidden units was 3. Two adjustments were made: one version was tested with 15 hidden units and the other with 50 (consult Annex 5 and Annex 6). The model with 15 hidden units had 244 031 estimated weights and the model with 50 had 777 851. The first had an average square error of 0.030305 and the second of 0.030235. By increasing the number of hidden units, the capacity of the network increases, since there are more possible combinations available (Rojas, 1996). Regarding the Kolmogorov-Smirnov statistic and the cumulative lift, both models show the same values. The Kolmogorov-Smirnov statistic ranges between 0 and 1 and is a function of the distance between the cumulative score distributions of non-fraudsters and fraudsters.

7 This overcomes the Minsky-Papert objection, which states that without hidden layers the model is capable only of making linearly separable discriminations; that is, when the patterns to be learned are not linearly separable, the model fails.

The ROC index plots for the models with 3, 15 and 50 hidden nodes are very similar; the ROC index (area under the ROC curve) value is 0.81 for all.

Figure 9 - ROC index for 15 and 50 hidden units

The next property adjusted was the target layer activation function. By default, the target layer activation function depends on the level of the variable (ordinal, nominal or interval) and is the logistic, mlogistic or identity function, respectively. These standard functions were tested against a model using exclusively the logistic function (see Annex 8). This results in a model with 45 763 estimated weights, an average square error of 0.031323 and a ROC index of 0.78 (lower than the control model's, as one can estimate from the plot). This model also presented the lowest average profit (0.96713) of the group and the lowest cumulative % response (7.29).

Figure 10 - ROC plot for the 4th model

The tuning that followed concerned the training technique. Several choices could have been made, for example the Quasi-Newton algorithm, QuickProp and so on. Considering the information collected in the Neural Networks section, the Backprop method was chosen. It is important to mention that, in this case, Backprop refers to the backpropagation of errors and not to the backpropagation algorithm. Backpropagation of errors is a technique for computing the gradient of the error function with respect to the weights and biases of the network. It is an application of the chain rule of elementary calculus, and this derivative is needed in any first-order nonlinear optimization technique. The backpropagation algorithm, in turn, is a procedure for updating the weights based on the gradient. In summary, Backprop is a training method that uses the backpropagation algorithm to compute the gradient (Bertsekas, 1995; Werbos, 1994). Backpropagation (the generalized delta rule) made it possible to determine the pre-transformed net activation of the hidden units by using the continuous sigmoid function in place of the discontinuous step function between the input and hidden layers. As Backprop is known for being efficient with large datasets, good results were expected (see Annex 8). The standard training technique in the software depends on the number of weights applied during execution; due to the dimension of the dataset, conjugate gradient training is the default here. This model presented a significant ROC index (0.81) but a low cumulative % response (9.15). The Kolmogorov-Smirnov statistic is similar to that of the other models in this segment (0.59) and the cumulative lift plot behaves as expected, that is, decreasing. The cumulative lift value, which in this case is 2.788 (with depth=10), represents the cumulative percentage of fraudsters per decile divided by the overall population percentage of fraudsters (SAS Institute Inc., 2016).
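For illustration, one backpropagation step for a one-hidden-layer network with a squared-error criterion can be sketched as follows; this is a didactic sketch of the technique, not the SAS implementation:

```python
import numpy as np

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One gradient step for a one-hidden-layer MLP with a logistic
    output and squared error, derived via the chain rule."""
    h = np.tanh(x @ W1 + b1)                  # hidden activations
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # predicted probability
    # Chain rule: d(error)/d(logit) for squared error with logistic output.
    d_logit = (p - y) * p * (1.0 - p)
    dW2, db2 = h.T @ d_logit, d_logit.sum(axis=0)
    d_hidden = (d_logit @ W2.T) * (1.0 - h ** 2)   # tanh derivative
    dW1, db1 = x.T @ d_hidden, d_hidden.sum(axis=0)
    for w, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        w -= lr * g                            # gradient descent update
    return ((p - y) ** 2).mean()               # average square error
```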

Another property tested was preliminary training, which is used to obtain starting values for the weights, to accelerate convergence and to reduce the odds of ending up in a poor local optimum. Preliminary training consists in finding initial weight estimates by performing several iterations on the training dataset using different initializations; the final weights of the best-trained network are used as the initial ones. The default number of iterations is five, but more can be computed. However, too many preliminary runs can result in a computationally expensive model. Disabling the preliminary training option resulted in a model with 61 007 estimated weights and an average square error of 0.030255. This is lower than the control model's value, which can mean one of two things: either preliminary training is not necessary for this data or, on the contrary, the (default) 5 iterations of preliminary training are not enough (Bertsekas, 1995).
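The preliminary training mechanism can be sketched as a simple multi-restart procedure; train_fn and init_fn are hypothetical stand-ins for the actual training and initialization routines:

```python
import numpy as np

def preliminary_training(train_fn, init_fn, n_restarts=5, n_iter=10):
    """Sketch of preliminary training: run a few iterations from several
    random initializations and keep the best weights as starting values.
    train_fn(weights, n_iter) -> (weights, error); init_fn() -> weights."""
    best_weights, best_error = None, np.inf
    for _ in range(n_restarts):
        weights, error = train_fn(init_fn(), n_iter)
        if error < best_error:
            best_weights, best_error = weights, error
    return best_weights   # used to initialize the full training run
```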

Still, without preliminary training, the ROC index is 0.81 (see Annex 9). This model has the greatest Kolmogorov-Smirnov statistic of the group (0.594), which was expected given its considerable ROC value, and also the greatest cumulative lift value (2.831).

Figure 12 - ROC Index comparison between pilot and no preliminary training

A last modification was made to this model. It was built by following the practical rules of a recognised article in the neural network literature: LeCun et al. (1998) give some practical tricks to build a model that better fits the data. One of the guidelines corresponds to the use of the hyperbolic tangent rather than the standard logistic function, as it converges faster. As the target is binary, the authors note that the standard functions may result in instabilities (trying to get the output as close as possible to the target values, which can only be achieved asymptotically) and low confidence levels (weights may force the outputs to the tails of the sigmoid). Another suggested strategy concerns the learning rate: the authors suggest it should be proportional to the square root of the number of inputs. So, considering 124 input variables, a learning rate of 1.12 was chosen. There is also a suggestion to add momentum, as it can increase speed when the cost surface is highly nonspherical. In the practical examples given by the authors, the momentum was higher than the learning rate, so in this case a momentum of 1.5 was chosen. Finally, the training technique suggested was BackProp. Interesting results arose from these combinations (Annex 10). The final model had an average square error of 0.284633 and a misclassification rate of 0.03287. Although with the training sample this model was considered promising (with a ROC index of 0.89), with the real data (the validation set used) it gave a ROC index of 0.26, which is considerably low. Moreover, its average profit is not satisfactory and the values of cumulative % response and cumulative lift are also disappointing, the latter being equal to zero. This seems a typical case of overfitting, as the model worked outstandingly on the training data and poorly on the validation dataset (Stumpf et al., 2009).

From this point on, the second part of the methodology was implemented. This second model used a series of best practices by SAS, which included two extra procedures: variable transformation and variable selection. Transforming data can lead to a better model by, for example, correcting skewed distributions; however, as a prior transformation of the variables (splitting into classes) had been made, this step was omitted. The next step is variable selection, in which low R-squared variables are rejected. The results were interesting, as this selection reduced the number of variables to 9 (4 of which have imputed values due to missing ones). The Variable Selection node creates "binned" variables from interval-scaled inputs and grouped variables from nominal inputs. Sometimes a binned input is more strongly correlated with the target variable than the original input, indicating a non-linear relationship between the input and the target. The grouped variables are created by collapsing or grouping the categories of a nominal input. With fewer categories, the grouped variables are easier to use in modelling than the original ungrouped variables (Bertsekas, 1995; Rojas, 1996).
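The R-squared screening idea behind this selection can be sketched as follows (a rough illustration, not the Variable Selection node's actual algorithm):

```python
import numpy as np
import pandas as pd

def r2_screen(df: pd.DataFrame, target: str, min_r2: float = 0.005):
    """Keep inputs whose simple R-squared against the target exceeds a
    cutoff; a rough sketch of R-squared based variable selection."""
    y = df[target].to_numpy(dtype=float)
    kept = []
    for col in df.columns.drop(target):
        x = df[col].to_numpy(dtype=float)
        r = np.corrcoef(x, y)[0, 1]           # simple correlation
        if np.nan_to_num(r) ** 2 > min_r2:    # R^2 of a one-input model
            kept.append(col)
    return kept
```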

The selected variables include attributes of the policy, the claimant, the policyholder and the claim itself. Some of the variables were expected (given the target variable or the experts' hints), but others were a surprise. In order to examine these variables, a regression node was added. The category of the insurance revealed the greatest relative worth, with self-employed worker insurance being the attribute with the strongest relationship with the target variable. The claimant's nationality was also selected as a key variable to predict the target, with claims involving African claimants being the ones most related to fraud. Regarding the claim itself, as expected, the closure motive related to fraud was considered important to predict fraud, and the duration of the claimant's incapacity treatment also revealed a relationship with the target. Two variables linked to the channels through which new business arrives were also considered important; the intermediaries and the banking channel were the ones most linked to fraud. Although these variables had attributes by which one could geographically differentiate claims, this differentiation was not considered significant. Regarding the policyholder, one can establish a relationship between his/her marital status and the target variable, with being divorced (although not legally) the condition most related to fraud. Regarding the policy itself, the business (invoice) volume of the company was considered relevant to predict the target, while the company's seniority revealed a negative relationship with fraud, with younger companies being more related to fraudulent behaviours.

As with the complete database, a control model (say, control 2) was simulated with the selected variables to build a comparison with the others (Annex 11). This control 2 has 721 estimated weights and an average square error of 0.130454 (much higher than that of the control run before). This network has a ROC index of 0.98. It is interesting that one of the controls has a significantly higher average square error while the other has a higher ROC index. This suggests that this model excels in classifying observations considering the true positive rate and false positive rate, which is relevant for a classification problem. This network also outperforms the first control in terms of cumulative lift (9.830), KS statistic (0.948) and cumulative % response (13.92).

Figure 13 - ROC curve for control 2

Replicating the procedures of the first group of models, some properties were then adjusted, for example the number of hidden neurons: tests were conducted with 15 and 50 units. The first led to an average square error of 0.017152 and the second of 0.016993, which are in fact very similar, as are the ROC index (0.98 for both models) and the cumulative lift (9.830). On the other hand, the misclassification rate is lower for the model with fewer hidden nodes (0.03171 versus 0.031779).

The model using the logistic function as the target layer activation function was also tested, resulting in 723 estimated weights and an average squared error of 0.020955 (higher than the previous ones). With this adjustment, the outputs of the target layer become $\sigma(\eta) = 1/(1 + e^{-\eta})$, where $\eta$ is the net input of the target layer; this translates into an easily calculated derivative, $\sigma'(\eta) = \sigma(\eta)(1 - \sigma(\eta))$ (Mo et al., 2016). As one can see in Annex 12, this model has interesting results: the misclassification rate shows sudden deviations; however, the ROC index is the same as in the other models mentioned (0.98). Regarding cumulative % response, cumulative lift and KS statistic, this model did not outperform any other. This means that, in this case, the logistic function did not add any value to the model.

Next, using Backprop as the training technique, a model was built with 575 estimated weights, resulting in an average square error of 0.01707, a lower value than the last model's. Still, the ROC index remains constant. The misclassification rate in this case is 0.03246, compared with the 0.03171 of the last model. The misclassification rate is based on the same classification counts that underlie the ROC index, so, although both models have the same rounded ROC value, the model using Backprop performs better than the one using the logistic function. Furthermore, the Backpropagation model has a higher cumulative % response, presenting a better prediction of the probability of response.

In general, one never knows at which local minimum training stopped, nor whether it is similar to the local minimum of the error function (different training sets have different local minima). Disregarding preliminary training produced a model with 575 final estimated weights and an average square error of 0.016929, the lowest value so far (with 9 variables), which may indicate that the preliminary training performed in the control 2 model stabilized in a local optimum unlike that of the error function. Annex 13 shows a misclassification rate with more fluctuations than usual, which was expected, as the acceleration of convergence enabled by preliminary training was forgone. The ROC index value is 0.98 (similar to the last models). This model presents a cumulative lift of 9.809, considering depth=10. Clearly, the lift curve always decreases as one considers bigger deciles, until it reaches 1. Using no model, or a random ordering, the fraudsters would be spread equally across the entire range and the lift value would always equal 1.

The final test of the model was made based on the 1998 article "Efficient BackProp". The reason this article is famous is the practical advice it offers while still drawing on theoretical knowledge. Similar to the model using all variables, a total of 3 adjustments were made: the target layer activation function was set to the hyperbolic tangent, the learning rate was set to 0.3 (√9/10) and the momentum was set to 0.5. This resulted in a model with 721 estimated weights and an average square error of 0.430622, which is by far the highest value of this sequence of models. This raises the suspicion that these adjustments are not the most suitable for the data, which is supported by the ROC index value of 0.92 (lower than the former models'). The result plots in Annex 15 also show worse performance than the former ones, like the cumulative lift plot, which shows (for depth 10) a value of 6.989, lower than that of all the other models in this group. Regarding cumulative % response and the KS statistic as well, this network performs worse.

Figure 14 - Article based ROC plot

In this segment, the model with the lowest average square error is the one with no preliminary training, although not much different from the control's. Overall, the model which performed best in average square error (lowest value), misclassification rate (lowest value) and ROC index (highest value) was the one whose preliminary training was disabled.

A final sequence of models was built with a different set of variables. The reasoning was to use variables that the users (insurance experts) would trust and recognise as viable and feasible due to their experience in the field. The Attention Investment theory is worth mentioning here, as it points to the importance of costs, benefits and risk in weighing users'/experts' feedback. That is, whether or not a statistical analysis (or algorithm) seems to work well, its use depends on the users' feedback. This exercise resulted in the selection of a model using 64 variables which, for confidentiality reasons, cannot be listed. Just as with the former models, a control model was run to be used for comparisons. The control model has 45 322 estimated weights and an average square error of 0.030413, which is in the range of the group. Interesting results can be found in Annex 16.

Figure 15 - ROC plot for control 3

The ROC plot seems interesting, although one could expect better of an expert-based model; the ROC index value is 0.81, and it is perfectly clear that for the (first) true positive rate of 1 there is a false positive rate of 0.4. The misclassification rate is 0.032324, the lowest of the segment. The cumulative plot shows good results, presenting the greatest cumulative lift (for depth=10). Moreover, the values of average profit (0.967676) and the KS statistic (0.595) are higher than any other in the segment.

Just as for the former comparison models, some properties were adjusted. One of them is the number of hidden neurons, which corresponds to the number of units of the hidden layer. Using 15 hidden neurons results in an enormous model with 241 711 estimated weights and an average square error of 0.030365 (lower than the control's). The ROC index value is 0.81 (note the similarities between this ROC plot and the control's). The misclassification rate is 0.03287 (higher than the control's). The KS statistic does not differ from the control's, but the cumulative lift and cumulative % response (for depth=10) show a very slight decrease. As the difference from the control model is only the addition of 12 hidden neurons, similar results were expected, apart from the number of weights.

The model with 50 hidden neurons has a lower average square error (0.030349) and a lower misclassification rate (0.032801), although not as low as the control's. The ROC index value is 0.81. The values of the KS statistic, cumulative lift and cumulative % response are lower in this network than in the one with fewer neurons or in the control.

The number of hidden neurons directly affects the weights connected to them and, consequently, interferes with the error of the output nodes. Indeed, the output nodes cannot reach the target if they are fed poorly (Kröse & van der Smagt, 1996).

As it maps the output to the [0,1] range, the logistic function is often used to predict probabilities, and it is also used for binary targets. There are signs that the activation function of the hidden neurons is not significant; therefore, the target layer activation function was tested. The default target layer activation function is the identity, but the logistic is widely used.


Figure 16 - Known activation functions

Unfortunately, this model has the same ROC index as the former ones (0.81), adding no value in this indicator. The average square error is 0.030595 and the misclassification rate is 0.032665, which is only outperformed by the 50-hidden-unit model. In fact, the misclassification plot reveals very interesting results (see Annex 19). Regarding cumulative % response, this model presents a poor result, greater only than the article-based one's. Its KS statistic is slightly lower than the control's, which shows that the control's ROC curve performs better.

Using Backprop as the training technique results in a backprop network: a feedforward neural network using the generalised delta rule. Although classified as slow, difficult to use and unreliable, it is by far the most widely used training technique (SAS Institute Inc., 2016). Using this technique, a model with 45 316 estimated weights was built. It has an average square error of 0.030348 (the lowest of this segment) and a misclassification rate of 0.032665 (within the range of the group). The ROC index value is, again, 0.81, but the KS statistic is the lowest of this segment (0.59).

By default, preliminary training is performed, as it enhances model performance and helps avoid local minima. Hence, one expects the test with no preliminary training to perform worse than the control model. It does in terms of average square error, misclassification rate and cumulative lift (0.030500, 0.032392 and 2.739). Yet, it has the same ROC index and a higher cumulative % response (in fact, the greatest of this segment) (SAS Institute Inc., 2017). As can be seen in Annex 21, the misclassification rate plot looks similar to the control's, despite ending at a higher final rate.

Based on the "Efficient BackProp" article, a last model was tested, this time using a learning rate of 0.8, a momentum of 1 and the hyperbolic tangent as the target activation function. The results were significantly worse than expected, as one can see in Annex 22. A model with 45 326 estimated weights and an average squared error of 0.317015 is not even comparable with the former ones. Its misclassification rate is 0.03287 (the same as the 15-hidden-unit model's) and its ROC index is 0.22, which shows the model is not suitable for this data. Indeed, the shape of the plot starkly confirms the difference in the results. In terms of cumulative lift, this model does not reveal significant differences compared to using no model at all, which makes it unsuitable for this data. The cumulative % response is null, revealing a poor prediction of the probability of response.


4.1. MEASURES

The assessment measures in SAS usually depend on the target variable: if it is interval, the measure used is the average square error; if it is categorical, the misclassification rate is applied (SAS Institute Inc., 2016). Still, the average squared error is commonly used (particularly in supervised learning). The average squared error is the average of the squared differences between the values estimated by the model and the real ones. The average absolute deviation may also be used (taking the absolute value of the difference instead of its square). The lower these measures are, the better the model performs (Baesens & Broucke, 2016; Phua et al., 2010). In the best practices segment (9 variables), the model with the lowest average square error is the one with no preliminary training, which may indicate that the choice of initial weights was overfitting the training data. Among the models using the expert-selected variables (64), the one with the ASE closest to zero is the one using Backpropagation as the training technique, which supports the literature presenting this as the most commonly used technique. Finally, the overall model (with all variables) with the lowest error is the one with 50 hidden units (neurons), which is corroborated by the literature stating that the number of hidden neurons should be on the same level as the number of inputs: more inputs, more neurons. As this model is the one with the most inputs, it makes sense that more hidden units would be significant (Kriesel, 1996).

The misclassification rate (or classification error) is the percentage of incorrectly classified observations, and it is the most commonly used measure in the literature to assess the performance of an ANN (Baesens et al., 2015). In fact, misclassification has a cost associated with it. According to Paasch (2008), the misclassification cost must be an important factor to consider in any company's strategy regarding fraud detection. The question arises: should the company miss some frauds or risk a series of false positives? The author defines the classification cost as the misclassification rate times the cost associated with it. In the models tested using the variables chosen by the experts, the one with the lowest misclassification rate was the control, that is, the one with no properties pruned (different from the one with the lowest average square error). However, in the best practices segment, the one with the lowest misclassification rate is the same as the one with the lowest average square error: the model with no preliminary training. This, indeed, was the predictable result (contradicting the former segment), because the difference between the real target and the estimated one should move in the same direction as the proportion of errors. The model with all variables follows the same tendency: the lowest misclassification rate belongs to the model with 50 hidden units (just as the lowest average square error). Frequently used, indeed, but the misclassification rate may have some issues. It is known, and was stated before, that the fraud probability is quite low, so the sample with target equal to zero is massive compared with that with target equal to one. For this reason, the misclassification rate may not be appropriate to assess model performance, because a near-minimal rate can be obtained simply by classifying all observations as non-fraudulent (Paasch, 2008).

The average profit is frequently used to assess classification problems. It depends on the profit matrix, which by default is the identity matrix (1s on the diagonal and zeros elsewhere) and which assumes true positives are as important as true negatives. With this matrix, the average profit is the sum of true positives and true negatives divided by the number of classified observations. A model with good performance should have a high average profit. In the expert-selected variables segment, the model with the highest average profit is the control; in the best practices segment, it is the one with no preliminary training; and in the global segment, the one with 50 hidden units.
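For reference, the three measures discussed so far can be computed as in this minimal sketch for a binary target:

```python
import numpy as np

def assessment_measures(y_true, p_hat, cutoff=0.5):
    """Average square error, misclassification rate and (identity-profit)
    average profit for a binary target, as described above."""
    y_pred = (p_hat >= cutoff).astype(int)
    ase = np.mean((p_hat - y_true) ** 2)
    misclassification = np.mean(y_pred != y_true)
    average_profit = np.mean(y_pred == y_true)   # identity profit matrix
    return ase, misclassification, average_profit
```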


The cumulative lift is a data mining measure which relates to the power of the model in flagging cases as suspicious. It is the ratio of the percentage of captured responses within each decile to the baseline percentage response (Baesens et al., 2015). For a given decile, the higher the cumulative lift, the better. In the best practices segment, all models have the same cumulative lift (at depth=10) except the article-based one and the model with no preliminary training. In this segment, the values of cumulative lift are, overall, greater than in the other segments. Even so, all segments have cumulative lifts greater than one, the value that would be expected with no model at all. In the segment with expert-selected variables, the model with the greatest cumulative lift is the control, and in the global segment the best choice would be the model with no preliminary training.
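A minimal sketch of the cumulative lift at a given decile depth, assuming NumPy arrays of binary targets and model scores:

```python
import numpy as np

def cumulative_lift(y_true, p_hat, depth=10):
    """Cumulative lift at a decile depth: fraudster rate among the top
    depth% of model scores divided by the overall fraudster rate."""
    order = np.argsort(-p_hat)                       # highest scores first
    top = order[: max(1, len(p_hat) * depth // 100)]
    return y_true[top].mean() / y_true.mean()
```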

ROC stands for Receiver Operating Characteristic, and the ROC curve is commonly used to visualise the performance of a model. The ROC index, related to the Gini index and the AUC (area under the curve), summarises the plot in a single number. The ROC curve plots sensitivity against (1 - specificity). The perfect model has a sensitivity of 1 and a specificity of 1; the closer the ROC curve is to this point, the better, or, equivalently, the closer the ROC index is to 1. A ROC index higher than 0.5 is interpreted as a fraudulent claim having a higher probability of getting a higher score than a non-fraudulent one (Baesens et al., 2015). There are studies in the literature which argue for the value of the AUC, as it is a better measure than accuracy (Bhattacharyya et al., 2011; Ling, Huang, & Zhang, 2003). The segment with expert-selected variables shows the same ROC index for all tests except the one following the LeCun et al. (1998) article: 0.81 for all and 0.22 for the last. 0.81 is a relatively high value, but not as high as 0.98, which is the ROC index of all tests in the best practices segment (except the article-based one). The article-based model has an AUC lower than 0.5, which means it does not outperform a model that randomly classifies observations (Kose et al., 2015). The segment with all variables follows the same trend: the lowest ROC index is the article-based model's (0.26) and all the others are very similar (0.81, or 0.78 in the case of the model using the logistic function). With the ROC index alone, it is not possible to identify the best model in each category. The Kolmogorov-Smirnov distance can be read off the ROC plot, as it is the maximum vertical separation between the sensitivity curve and the baseline (the diagonal, where sensitivity equals 1 - specificity). The KS statistic ranges between 0 and 1, and the greater it is, the better in a classification problem. In the best practices segment, the models with more hidden neurons have a higher KS statistic, but in the segment with expert-selected variables the greatest KS statistic belongs to the control model. In the global segment, the greatest value of this measure belongs to the model with no preliminary training.
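The KS statistic can be computed directly from the score distributions, as in this minimal sketch:

```python
import numpy as np

def ks_statistic(y_true, p_hat):
    """KS distance: maximum separation between the cumulative score
    distributions of fraudsters and non-fraudsters."""
    thresholds = np.unique(p_hat)
    tpr = [(p_hat[y_true == 1] >= t).mean() for t in thresholds]
    fpr = [(p_hat[y_true == 0] >= t).mean() for t in thresholds]
    return max(abs(s - f) for s, f in zip(tpr, fpr))
```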

LeCun et al. (1998) is a famous article due to the practical tricks it gives for constructing neural networks. Although the authors include some theoretical demonstrations, it is the real-world advice that keeps the article being quoted. To date, the article has been cited 123 times, 36 of them in 2017 and 27 in 2016. In building knowledge, one should not only try new and different things but also learn from and test others' wisdom.

It is very interesting to notice that in all models the number of estimated weights equals the number of degrees of freedom. Obviously, in the models with different numbers of parameters (single neurons) the weights were expected to differ, and this would impact the number of degrees of freedom, but the one-to-one relation between the weights and the degrees of freedom is curious. The intuition is that the more weights there are, the more different sorts of functions the model can fit, and thus the higher the degrees of freedom. In the models where no units were added (for example, the control model and the one with the logistic activation function) one would not expect a different number of weights; the fact that it differs can mean that different target layer activation functions consider different weights significant. A higher number of weights results in a better description of the data but consequently reduces the model's prediction capability. Also, for a fixed number of units, deep networks reveal fewer degrees of freedom, so these results point to not-so-deep networks. The number of degrees of freedom is referred to as the plasticity of the system, that is, its capability of approximating the training data (also referred to as adaptability). Increasing the plasticity helps reduce the training error but may cause the network to overfit. Decreasing the plasticity excessively can lead to large training and test errors. One can compare the way the degrees of freedom govern the approximation to the training data with the degree of a polynomial used to approximate experimental data (Rojas, 1996).

In Annex 23, there is a summary table with the most important properties mentioned in the several

models for comparison.


5. CONCLUSIONS

This section summarises the key conclusions of this study, which entailed the development of a neural network fraud detection methodology as an alternative to traditional methods, which are no longer sufficient to tackle a growing number of sophisticated fraud cases. Aiming to improve fraud detection and mitigation methods with neural networks using data from a Portuguese insurance company, it is recalled that the three specific goals of this study were: (i) identification of the significant variables for fraud detection in workers' compensation insurance claims, (ii) identification of the most appropriate types of neural networks to use, and (iii) selection of the best-suited model for the problem under study. All goals have been met, as described below.

This work was based on data from a Portuguese insurance company, and focused on learning from it

using neural networks and drawing conclusions in terms of what should be the key variables when

triggering a suspicion, and also which network model should be used to continuously assess

detection of fraud in current claims. A review of the literature was conducted, data was extracted

and treated, and different models of artificial neural networks were tested.

Machine Learning algorithms are increasingly being used despite their common black-box characterisation. As experts become familiar with them, it makes perfect sense to develop these techniques and apply them to fraud detection, as they can establish non-linear connections in the data to reach conclusions, and they can evolve each time they are used.

A large amount of data is crucial in any statistical model, especially in non-linear models. Processing data for this project took about 80% of the time spent. The final data set did not contain all possible variables in the system, because some of them were duplicated or unreliable. Neural networks are known for not having an explicit functional form, behaving like nonparametric regression models, which is important as it allows building models even if the relationship between inputs and outputs is unknown. Furthermore, their great flexibility allows modelling any input-output relationship, no matter how complex, disregarding interpretability but acting as high-performance analytical tools.

Neural networks ignore observations that contain missing values; thus, for these, we used imputed variables to fill the gaps left by the missing ones. The rule used was to reject the variables with low R-squared values. This technique allowed a reduction of the number of input variables to nine, which is more suitable for analysis in a neural network study. These nine variables do reflect some of the thoughts the experts transmitted. The category of the insurance is of high importance in the case of a self-employed worker, who will definitely try to recover the premium paid. His/her motives are clear: the premium is paid a priori, so the more he/she can recover, the better. On the other hand, a dependent worker, who does not pay the premium himself/herself, will be satisfied having his/her medical expenses paid. The motivation of the employed worker to increase claim frequency or to build up claims is more likely related to his/her employer and not to the insurance company: the worker may want to take more time off work or even undermine his/her employer, rather than recover the premium paid, as is the case for a self-employed worker. Interestingly, most of the variables selected are known a priori (before a claim happens), which is an important aspect for prevention: companies may build a prevention filter so that, if a policy meets the criteria, it issues an alert to management, resulting in automated red flags. Surprisingly, the previously known variables refer mostly to the policy's characteristics rather than the insured person's or the policyholder's. The claimant's nationality is also important to consider in fraud detection, and it depends solely on the

insured person. This indicator could have 9 different outputs, such as Portuguese, European, non-European, Brazilian, and so on. In the case of the self-employed worker, this variable is known in advance, but in the case of dependent workers this value may not be known at first, because some companies, instead of presenting the insured persons in advance, prefer to send monthly updates of insured workers. This type of insurance is subscribed mostly by companies which do not have a stable number of employees. The importance of this indicator points toward the need to collect insured people's nationality in advance and to predict future claim costs using an adjusted fraud probability. Portuguese legislation forces insurers to underwrite a contract with a single rate, meaning the insurer may not apply different rates to different employees of a company. Therefore, knowing the nationalities of the insured persons could lead to a higher premium for the contract, but not one differentiated by insured person. In the cases where the workers sheet is not stable, the rate must remain constant throughout the year, but the capital insured may vary. Hence, the insurer adjusts the premium the policyholder pays by calculating the differences in the capital insured throughout the year, not the rate. There are two commercial structure levels with much significance in this variable selection process and which are known in advance. These commercial structure levels distinguish between brokers, agents and intermediaries, and they differ by whether they are situated in the north, centre, islands or south of the country. These variables are easily collected and may indicate that the different channels in force have different risk cultures. Either the insurer initiates training to instil the same risk culture in all channels, or it starts penalizing the channels with the riskier portfolios.

Regarding the policyholder, there are some characteristics that should be assessed. The policyholder's marital status revealed a high correlation with the probability of a fraudulent claim. This is an interesting finding that could easily be taken into consideration regarding pricing and red flag alerts. The following two variables are characteristics of the insured company: its business volume and its seniority. These numbers are known at the time of the contract; however, as they are provided by the policyholder and there is sometimes no proof of their reliability, it may be wise to first raise awareness of their importance in the business channels instead of using them right away in the fraud prevention strategy. Alternatively, insurers may use institutional databases such as Dun & Bradstreet to confirm this information. The key variables that cannot be collected in advance relate to the claim itself (process closure motive or incapacity treatment duration). These, although not usable to alert management to fraud probability, may be helpful in establishing the risk profile of claimants: if a claimant has already triggered claims with these characteristics in the past, it makes sense to investigate all aspects of a new claim carefully, as the claimant has a higher probability of committing fraud. The variables reviewed made perfect sense to the experts and were considered significant for fraud detection in workers' compensation insurance; therefore, the first goal of this project is achieved. However, they were considered slightly restrictive, and that is the reason why another model with more inputs (also well accepted by the experts and suggested by them) was tested. In fact, a total of 21 models were tested, combining the number of input variables with different properties pruned: 3 segments with different numbers of input variables and 7 different properties analysed.

From an academic point of view, it is always important to have a control group for comparison analysis. In this case, we have one for each segment of models (3 control models), which we did not adapt in any way to our data. Interestingly, in the expert-selected segment the control model outperforms the other models in terms of misclassification rate and average profit. Regarding the ROC index, it also has the highest value (although there is a draw). Some characteristics of this model include the multilayer perceptron architecture with 3 hidden units, default activation functions (which depend on the measurement levels of the variables), the default training technique (which depends on the number of weights), a learning rate of 0.1, no momentum and enabled preliminary training. In this segment, the control model is outperformed only in terms of average squared error, and it is by the backpropagation model, by 21 percentage points. That model does not outperform in any other segment (or property). The backpropagation property refers to the training technique applied, whose detail is explained in the Results and discussion section. Although frequently mentioned in the literature, this model is the least favoured of the best-performing models. Indeed, we had the option of choosing more recent versions of this technique, such as RPROP (which uses a separate learning rate for each weight and adapts all the learning rates during training) or QPROP (a Newton-like method that uses a diagonal approximation to the Hessian matrix). These techniques were not selected because of their lack of references in the literature: they have not been tested enough to be considered reliable.
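To make the RPROP idea concrete, the following is a minimal sketch of its sign-based, per-weight step-size update (the iRPROP- variant, with hypothetical hyperparameters); it is an illustrative simplification, not the implementation used by the software in this project:

    import numpy as np

    def rprop_update(w, grad, prev_grad, step,
                     eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
        """One RPROP update: every weight keeps its own step size, adapted by
        the sign agreement between the current and the previous gradient."""
        sign_change = grad * prev_grad
        # Gradient kept its sign: grow the step; sign flipped: shrink it.
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        # On a sign flip, the gradient is zeroed so no move is made this iteration.
        grad = np.where(sign_change < 0, 0.0, grad)
        w = w - np.sign(grad) * step   # move by the step size, not the gradient magnitude
        return w, grad, step           # the returned grad becomes prev_grad next time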

A model that achieves significantly better results is the one with 50 hidden units (the number of neurons in the hidden layer). This is the champion model in the global segment (the model with all input variables) because it outperforms all the others in all measures. Considering average squared error as the measure, it is the model with the best results in all segments, followed by the one with no preliminary training. Recall that preliminary training is used to calculate initialization weights better suited to the data, but can lead to overfitting. In fact, considering misclassification rate, ROC index and average profit, the model with no preliminary training achieves the best results in all segments, performing worse only in average squared error within the best practices segment (even so, within that segment, it remains the champion). The model using 15 hidden units turns out to be a disappointment, as it did not distinguish itself in any measure, which allows us to conclude that adding 12 hidden units is not significant for these data. Another model without outstanding results is the one using the logistic function as the target layer activation function. Despite often being mentioned in the literature – due to its analytical tractability (an accessible derivative) and its interpretability (it is commonly used in statistics) – it did not perform extraordinarily. Lastly, much expectation was placed on the article-based models, but they did not reveal good results in any of the segments. In fact, most of the time they performed worse than the other models (regarding the ROC index, for example, the article-based model always reached the lowest score). These conclusions on which network models were the most appropriate completed the second goal of this work.
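For reference, the accessible derivative mentioned above is what makes the logistic activation analytically tractable:

    \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)

so backpropagation can obtain the gradient of this activation directly from the unit's own output, with no additional function evaluation.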

In addition, one should select the best-suited model for each tested segment, completing the final goal of this investigation. There is no doubt that, for the best practices segment, the best-performing model is the one with no preliminary training, and that, for the global segment, it is the one with 50 hidden units. For the expert-selected segment, the decision is not so evident. On the one hand, the model with the lowest average squared error is the one using backpropagation. On the other hand, the lowest misclassification rate, the highest ROC index and the highest average profit belong to the control model. Considering this, the control model should be the champion in this segment. The time then comes to select the overall champion model. The model with no preliminary training from the best practices segment should be selected, as it has the lowest misclassification rate, the highest ROC index and the highest average profit; however, it does not have the lowest average squared error. A final assessment of this model should be done using the test data set. Recall that, at the beginning of the study, the data set was split into training, validation and test samples in a stratified way, to ensure fraudsters and non-fraudsters were equally distributed. For the final test data set, the average squared error is 0.016409, the misclassification rate is 0.032592, the average profit is 0.967408 and the ROC index is 0.98.
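These measures can be illustrated with a short sketch. The arrays below are hypothetical stand-ins for the scored test set, and the profit computation assumes the simple profit matrix implied by the figures above (one unit per correct decision, so that average profit equals one minus the misclassification rate, consistent with 0.967408 = 1 − 0.032592):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Hypothetical scored test set: y_true holds the fraud labels and p_hat the
    # network's predicted fraud probabilities (placeholders, not project data).
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)
    p_hat = np.clip(y_true * 0.8 + rng.normal(0.1, 0.2, size=1000), 0, 1)
    y_pred = (p_hat >= 0.5).astype(int)              # default decision threshold

    ase = np.mean((p_hat - y_true) ** 2)             # average squared error
    misclassification = np.mean(y_pred != y_true)    # misclassification rate
    avg_profit = 1 - misclassification               # one unit per correct decision
    roc_index = roc_auc_score(y_true, p_hat)         # ROC index (area under the curve)

    # The stratified split mentioned above can be reproduced with, for example,
    # sklearn.model_selection.train_test_split(X, y, stratify=y, test_size=0.3).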

As mentioned at the beginning of this project, Portuguese workers' compensation insurance is a very particular type of insurance, as it must comply with Portuguese law. This fact restricts insurance companies' ability to exercise creative talent and to maintain a diversified offer. Nonetheless, insurers may focus on different characteristics in order to place different (more attractive) products on the market. In this light, it was expected that the selected model would be specific to this particular segment, workers' compensation. On the other hand, the inputs used are sufficiently generalised to justify widespread use of the model among different insurers. The caveat is that the target variable used here is very specific to the insurance company examined and, in particular, to its risk culture. On the positive side, any insurer may build its own target variable (if one does not already exist) according to its own strategies. In short, although this study may need some modifications to be used internationally, it is perfectly capable of being used across the several Portuguese insurance companies. A unified response against fraud may, indeed, contribute to the financial recovery of the segment and its claim rates.

Selecting a model that uses reliable variables, most of them known even before a claim arises, is of tremendous importance, as it allows the implementation of a prevention method and not only of detection. The variables considered are no strangers to the experts, but having a model confirm them as meaningful gives the company more confidence to use them strategically to improve its results. It is clear by now that fraud is a dynamic process. The players and the scenarios will keep changing over time, so it is highly important for any company to have several methods in place to prevent (or detect) fraud. Still, this model can be used on new data (new policies, new claims) and it will continue to learn and adapt itself. Once trained, a neural network is among the fastest-executing predictive models, allowing real-time evaluation and monitoring of claims and policies. Even when not all information is known, we have discussed methods to overcome these obstacles. In conclusion, this study not only gave relevant insights into the company's data but also provided a capable model that can already be applied successfully and that will, moreover, evolve over time. Certainly, some adjustments should be made. The General Data Protection Regulation (GDPR), applicable from 2018, will make companies take a closer look at their data, its storage and its quality. The employment of several different software systems does not smooth the implementation of prevention strategies. Data quality is most definitely a concern when feeding any machine learning model, as progress with this kind of algorithm is only useful when data quality is not a problem. The focus should be on learning from the information the data allows and the knowledge it gives, rather than on exhausting resources to assure the dataset's quality. Businesses must be prepared to face these realities if they want any chance of a successful prevention system. With this algorithm it is possible to prevent losses, to anticipate costs and even to enhance the pre-contractual risk analysis: with awareness of just a few key characteristics, one can anticipate fraud attempts and incorporate them in the pricing. This way, not only are potential fraudsters detected, but the business ratio also starts to stabilize. The company profits and the business benefits.
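As an illustration of the real-time evaluation described above, the following minimal sketch scores an incoming claim with an already-trained network; the feature values and the use of scikit-learn's MLPClassifier are assumptions for the example, not the project's SAS implementation:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    np.random.seed(0)
    # Hypothetical training data: four engineered claim features, binary fraud label.
    X_train = np.random.rand(500, 4)
    y_train = (X_train[:, 0] + X_train[:, 3] > 1.2).astype(int)

    # A small multilayer perceptron, trained once offline.
    model = MLPClassifier(hidden_layer_sizes=(3,), max_iter=1000, random_state=0)
    model.fit(X_train, y_train)

    # Real-time use: a forward pass per claim is cheap, so each new claim can be
    # scored the moment it is reported.
    incoming_claim = np.array([[0.7, 0.1, 0.4, 0.9]])
    fraud_probability = model.predict_proba(incoming_claim)[0, 1]
    if fraud_probability > 0.5:                      # hypothetical alert threshold
        print(f"Red-flag alert: fraud probability {fraud_probability:.2f}")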


6. LIMITATIONS AND RECOMMENDATIONS FOR FUTURE WORK

There are always limitations, in terms of resources and time, that constrain investigation works like this one. In this case, the greatest limitation concerned data. In general, companies are not substantially concerned with (historical) data quality, which makes back-office research much more complex and time-consuming. As new regulations are introduced, some precautions arise that may contribute to cleaner databases. For example, the GDPR (General Data Protection Regulation), applicable from 2018, is already making companies analyse their databases in order to comply with the regulation in the future.

From a macro view, a limitation worth mentioning is the fact that the data used belongs solely to one insurer. Using data from different insurers (which have different fraud protection strategies) would have allowed more robust conclusions. Furthermore, this work focused on the workers' compensation insurance segment, which should not be the only segment used to draw fraud prevention approaches. There is no significant literature on workers' compensation insurance fraud detection yet. Indeed, across the whole financial (banking and insurance) business, practical fraud literature is scarce. The exchange of ideas on fraud detection is limited, as it does not make sense to give fraudsters information on how companies are fighting them. This "snowball" of silence and inaction makes individuals believe fraud is not a relevant problem. In fact, the low probability of fraud in insurance claims discourages studying it. Out of 100 claims, a scientist may always say none is fraudulent and be wrong only roughly twice. Will any algorithm be as good as this?
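The point can be made explicit: at a hypothetical fraud prevalence of 2%, the trivial classifier that flags nothing already reaches 98% accuracy while catching no fraud at all, which is one reason measures such as the ROC index and average profit are more informative here than raw accuracy. A minimal sketch, assuming that prevalence:

    # Trivial "always non-fraudulent" classifier at a hypothetical 2% fraud prevalence.
    prevalence = 0.02
    accuracy = 1 - prevalence      # 0.98: wrong only on the rare fraudulent claims
    frauds_caught = 0              # yet it never detects a single fraud (recall = 0)
    print(f"Accuracy: {accuracy:.0%}, frauds caught: {frauds_caught}")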

In this work, about 70% of the time was spent extracting and cleaning the database, which held data from too many different systems. As only one year of data was used for learning, the number of observations may be a matter of concern. As explained, some techniques were used to adapt the database to the algorithms tested: for example, there were missing values to be completed, repeated variables to be eliminated and continuous variables with too many distinct values to be recoded into interval ones. Indeed, the fact that there was no single trustworthy variable classifying a claim as fraudulent (or not) is a great limitation of this work.
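A minimal sketch of these three preparation steps, assuming a hypothetical claims table whose column names are invented for the example:

    import pandas as pd

    # Hypothetical claims extract; the column names are placeholders.
    claims = pd.DataFrame({
        "treatment_days": [10, None, 45, 3, None, 120],
        "claim_cost":     [500.0, 2300.0, 9800.0, 150.0, 700.0, 15000.0],
        "claim_cost_dup": [500.0, 2300.0, 9800.0, 150.0, 700.0, 15000.0],
    })

    # 1) Complete missing values (here, median imputation).
    claims["treatment_days"] = claims["treatment_days"].fillna(
        claims["treatment_days"].median())

    # 2) Eliminate repeated variables (perfectly duplicated columns).
    claims = claims.drop(columns=["claim_cost_dup"])

    # 3) Recode a continuous variable into interval classes.
    claims["cost_band"] = pd.cut(
        claims["claim_cost"], bins=[0, 1000, 5000, float("inf")],
        labels=["low", "medium", "high"])

    print(claims)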

Technically, in the process of choosing the champion model, the ROC index was only presented with two decimal places, which is not very precise.

A severe limitation of this type of algorithm is that it needs to be updated and retested to remain effective: it should keep learning and be pruned (dropping or adding relevant variables) over time. With this limitation in focus, the use of SAS software was a restriction. One should use tools that final users and clients trust and comprehend.

There were some ideas, conceived at the beginning or discovered during the research, that would have been very interesting to try, but time constraints made that impossible. An example is the comparison with a "suspicion" variable, which would identify the suspicious cases to analyse and compare them with the ones accurately classified as fraudulent.

In terms of the statistical analysis of the data, several methods identified in the literature could have been used, such as cluster analysis (with the Hopkins statistic and the Dunn index) and factor analysis. Correlation analyses could have used different correlation coefficients, for example the Kendall tau rank correlation coefficient or Spearman's (see the sketch below). Looking forward, a great improvement would be to assemble unified databases of workers' compensation insurance data in Portugal, to facilitate future investigations.
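A minimal sketch of computing these rank correlation coefficients, assuming two hypothetical variables:

    import numpy as np
    from scipy.stats import kendalltau, spearmanr

    # Hypothetical variables, e.g. treatment duration versus claim cost.
    x = np.array([3, 10, 45, 7, 120, 30])
    y = np.array([150.0, 500.0, 9800.0, 700.0, 15000.0, 2300.0])

    tau, tau_p = kendalltau(x, y)      # Kendall tau rank correlation
    rho, rho_p = spearmanr(x, y)       # Spearman rank correlation

    print(f"Kendall tau = {tau:.2f} (p = {tau_p:.3f})")
    print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")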


In relation to the algorithm itself (neural networks), an interesting idea to explore would be Paasch's (2008) proposal of applying a different cost to a false positive than to a false negative, introducing a new misclassification rate calculation. Another idea was to create Boolean weights (if the previous output of the neuron is greater than the neuron's threshold, the output is one, and zero otherwise), which could have shown interesting results, as the target variable was also Boolean.
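A minimal sketch of such a cost-weighted misclassification measure, assuming hypothetical costs (a missed fraud costing, say, ten times more than a false alarm):

    import numpy as np

    def weighted_misclassification(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
        """Misclassification recalculated with separate false positive and
        false negative costs, in the spirit of Paasch (2008)."""
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        return (cost_fp * fp + cost_fn * fn) / len(y_true)

    y_true = np.array([0, 0, 1, 0, 1, 0])
    y_pred = np.array([0, 1, 1, 0, 0, 0])    # one false alarm, one missed fraud
    print(weighted_misclassification(y_true, y_pred))   # (1*1 + 10*1) / 6 ≈ 1.83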

Although the first line of defence is the underwriting team, these models should be used to complement the team, helping it identify fraudsters. Other techniques, such as text mining (to search the comments and observations in claim reports or insurance proposals) or social media analysis (to detect relations between fraudsters), should also be considered.

For future research, it would be interesting to study the real-time applications of these algorithms,

gaining pre-claim information about probable fraudsters.


7. BIBLIOGRAPHY

Albashrawi, M. (2016). Detecting Financial Fraud Using Data Mining Techniques: A Decade Review from 2004 to 2015. Journal of Data Science, 14, 553–570.

Allen, L., & Balli, T. (2007). Cyclicality in catastrophic and operational risk measurements. Journal of Banking and Finance, 31, 1191–1235.

Amado, M. (2015). Sistema para deteção de fraude na indústria seguradora: A aplicação de redes ao ramo da saúde.

APS. (2016a). Panorama do mercado Segurador 2015/2016, 10.

APS. (2016b). Regime Jurídico de Acidentes de Trabalho Anotado.

Artis, M., Ayuso, M., & Guillen, M. (2002). Detection of Automobile Insurance Fraud With Discrete Choice Models and Misclassified Claims. Journal of Risk and Insurance, 325–340.

ASF. (n.d.). Seguro Acidentes Trabalho.

Bação, F., & Painho, M. (2003). Aspectos Metodológicos da utilização do Data Mining no âmbito da Geografia. Finisterra, 75.

Baesens, B., & Broucke, S. vanden. (2016). Fraud Analytics using predictive, descriptive and social network learning techniques.

Baesens, B., Vlasselaer, V. Van, & Verbeke, W. (2015). Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. (WILEY, Ed.).

Basel Committee on Banking Supervision. (2003). Sound Practices for the Management and Supervision of Operational Risk.

Bertsekas, D. P. (1995). Nonlinear Programming. MA: Athena Scientific.

Bhattacharyya, S., Jha, S., Tharakunnel, K., & Westland, J. C. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50, 602–613.

Biddle, J. (2001). Do High Claim-Denial Rates Discourage Claiming? Evidence From Workers Compensation Insurance. The Journal of Risk and Insurance, 68(4), 631–658.

Bolton, R. J., & Hand, D. J. (2002). Statistical Fraud Detection: A Review. Statistical Science, 17(3), 235–255.

Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural Data Mining for Credit Card Fraud Detection. In IEEE International Conference on Tools with Artificial Intelligence ICTAI-99 (pp. 103–106).

Brites, J. C. (2006). Fraude em Seguros. Bolsa dos Seguros: Revista de Seguros de Pensões.

Brockett, P. L., Derrig, R. A., Golden, L. L., Levine, A., & Alpert, M. (2002). Fraud Classification Using Principal Component Analysis of RIDITs. Journal of Risk and Insurance, 341–372.

Brockett, P. L., Xia, X., & Derrig, R. A. (1998). Using Kohonen’s self organising feature map to uncover automobile bodily injury claims fraud. The Journal of Risk and Insurance, 65, 245–274.

Butler, R., Durbin, D. L., & Helvacian, N. (1996). Increasing Claims for Soft Tissue in Workers' compensation: Cost shifting and moral hazard. Journal of Risk and Uncertainty.


Butler, R., & Worrall, J. D. (1991). Claims Reporting and Risk Bearing Moral hazard in Workers’ Compensation. Journal of Risk and Insurance.

Carminati, M., Caron, R., Maggi, F., Epifani, I., & Zanero, S. (2014). BankSealer: An Online Banking Fraud Analysis and Decision Support System. ICT Systems Security and Privacy Protection, 428, 380–394. https://doi.org/10.1007/978-3-642-55415-5_32

SAS Institute Inc. (2016). Getting Started with SAS® Enterprise Miner™ 14.2. Cary, NC: SAS Institute Inc. Retrieved from http://support.sas.com/documentation/cdl/en/emgsj/70152/PDF/default/emgsj.pdf

Chernobai, A., Jorion, P., & Yu, F. (2011). The determinants of operational risk in U.S. financial institutions. Journal of Financial and Quantitative Analysis, 68, 1683–1725.

Clarke, M. (1990). The control of insurance fraud: a comparative view. British Journal of Criminology, 30(1), 1–23.

Copeland, L., Edberg, D., Panorska, A. K., & Wendel, J. (2012). Applying Business Intelligence Concepts to Medicaid Claim Fraud Detection. Journal of Information Systems Applied Research, 5(1).

Correia, A. (2010). Fraude: Introdução.

Derrig, R. A. (2002). Insurance fraud. The Journal of Risk and Insurance, 69(3), 271–287.

Dharwa, J. N., & Patel, A. R. (2011). A Data Mining with Hybrid Approach Based Transaction Risk Score Generation Model (TRSGM) for Fraud Detection of Online Financial Transaction. International Journal of Computer Applications, 16.

Dionne, G., & Wang, K. C. (2013). Does insurance fraud in automobile theft insurance fluctuate with the business cycle? Journal of Risk and Uncertainty, 47(1), 67–92. https://doi.org/10.1007/s11166-013-9171-y

Douzas, G., & Bação, F. (2017). Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Systems With Applications, 91.

European Banking Authority. (2016). Operational risk. Retrieved from http://www.eba.europa.eu/regulation-and-policy/operational-risk

Fernandes, F. S. (2015). Como as seguradoras combatem as fraudes. Jornal de Negócios.

Fisher, E. (2008). The Impact of Health Care Fraud on the United States Healthcare System.

Francisco, C. (2014). Utilização de redes para a detecção de casos de fraude em apólices de seguro automóvel. Caso de estudo em seguradoras portuguesas.

Galewitz, P. (2009). Workers’ Compensation Premium Fraud Claims Many Victims. Florida Underwriter, (April), 2007–2010.

Gardner, H., Kleinman, N. L., & Butler, R. (2000). Workers' compensation and family and medical leave act contagion. Journal of Risk and Uncertainty.

Gershenson, C. (2003). Artificial Neural Networks for Beginners. Networks, 8.

Ghosh, S., & Reilly, D. L. (1994). Credit card fraud detection with a neural network. System Sciences, 3.


Gonçalves, R. (2011). Sistemas de Informação para gestão de risco operacional em instituições financeiras.

Guo, T., & Li, G. (2008). Neural Data Mining for Credit Card Fraud Detection. In Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming (pp. 12–15).

Hartley, D. (2016). Business Analytics. Retrieved from http://businessanalytics.pt/analiticas-avancadas-outra-linha-de-defesa-na-batalha-contra-a-fraude-nos-seguros/

He, H., Graco, W., & Yao, X. (1998). Application of genetic algorithm and k-nearest neighbour method in medical fraud detection. In Asia-Pacific Conf. on Simulated Evolution and Learning, SEAL (pp. 74–81).

Herbst, H. (1996). Business Rules in Systems Analysis: A Meta-Model and Repository System.

Jans, M., Van Der Werf, J. M., Lybaert, N., & Vanhoof, K. (2011). A business process mining application for internal transaction fraud mitigation. Expert Systems with Applications, 38(10), 13351–13359. https://doi.org/10.1016/j.eswa.2011.04.159

Kalapanidas, E., Avouris, N., Cracium, M., & Niagu, D. (2003). Machine Learning Algorithms: A Study on Noise Sensitivity. In First Balkan Conference in Informatics.

Kaļiņina, D., & Voronova, I. (2014). Risk Management Improvement under the Solvency II Framework. Economics and Business, 24(24), 29. https://doi.org/10.7250/eb.2013.004

Kose, I., Gokturk, M., & Kilic, K. (2015). An interactive machine-learning-based electronic fraud and abuse detection system in healthcare insurance. Applied Soft Computing Journal, 36, 283–299. https://doi.org/10.1016/j.asoc.2015.07.018

Kriesel, D. (1996). A Brief introduction to Neural Networks. Books.

Kriesel, D. (2005). A Brief Introduction to Neural Networks.

Kröse, B., & van der Smagt, P. (1996). An Introduction to Neural Networks. Retrieved from http://www.fwi.uva.nl/research/neuro/

Krueger, A. (1990). Incentive Effects of Workers’ compensation Insurance. Journal of Public Economics.

LeCun, Y., Bottou, L., B. Orr, G., & Muller, K.-R. (1998). Efficient BackProp. Retrieved from http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

Lesch, W. C., & Brinkmann, J. (2012). Consumer Insurance Fraud/Abuse as Co-creation and Co-responsibility: A New Paradigm. https://doi.org/10.1007/s10551-012-1226-5

Lewis, C. M., & Lantsman, Y. (2005). What is a Fair Price to Transfer the Risk of Unauthorised Trading? A Case Study on Operational Risk.

LexisNexis. (2012). FraudFocus® Enhanced: Intelligently interprets the information in LexisNexis®.

Li, E. Y. (1994). Artificial neural networks and their business applications. Information & Management, 27(5), 303–313. https://doi.org/10.1016/0378-7206(94)90024-8

Ling, C. X., Huang, J., & Zhang, H. (2003). AUC: A statistically consistent and more discriminating measure than accuracy. IJCAI International Joint Conference on Artificial Intelligence, 519–524.

Maio, L. S. C. G. da C. (2013). Fraude nos seguros: A tolerância à fraude no seguro automóvel.

Martinez, P. R. (2015). Direito do trabalho.

Mo, H., Wang, J., & Niu, H. (2016). Exponent back propagation neural network forecasting for financial cross-correlation relationship. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2015.12.045

Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3), 559–569. https://doi.org/10.1016/j.dss.2010.08.006

Niemi, H. (1995). Insurance fraud.

Paasch, C. A. W. (2008). Credit card fraud detection using artificial neural networks tuned by genetic algorithms. Hong Kong University of Science and Technology.

Park, R. M., Krebs, J. M., & Mirer, F. (1996). Occupational Disease Surveillance using disability insurance at an Automobile Stamping and Assembly Complex. Journal of Occupational and Environmental Medicine.

Patel, S. (2010). Quantifying Operational risk. Retrieved from https://www.casact.org/education/reinsure/2010/handouts/CS14-PatelAppendix.pdf

Phua, C., Lee, V., Smith, K., & Gayler, R. (2010). A Comprehensive Survey of Data Mining-based Fraud Detection Research.

Pimenta, C. (2009). Esboço de Quantificação da Fraude em Portugal.

PWC. (2016). PwC: Building relationships, creating value. Retrieved November 13, 2016, from https://www.pwc.com/

Record Year of Fraud Reports in 2011 - ABC News. (2012). Retrieved from http://abcnews.go.com/Business/record-year-fraud-reports-2011/story?id=15953781

Ribeiro, A. I. (2012). Falsos acidentes triplicaram em cinco anos. Retrieved November 29, 2016, from https://www.dinheirovivo.pt/empresas/falsos-acidentes-triplicaram-em-cinco-anos/

Rojas, R. (1996). Neural Networks: A Systematic Introduction. Springer.

SAS. (2012). Combating Insurance Claims Fraud How to Recognize and Reduce Opportunistic and Organized Claims Fraud.

SAS Institute Inc. (2016). SAS® Enterprise Miner™ 14.2: Reference Help. Cary, NC.

SAS Institute Inc. (2017). Neural Network Modeling Course Notes. Cary, NC.

Schiller, J. (2006). The Impact of Insurance Fraud Detection System.

Sexton, R. S., Dorsey, R. E., & Johnson, J. D. (1999). Optimization of neural networks : A comparative analysis of the genetic algorithm and simulated annealing, 114.

Shao, J., & Pound, C. J. (1999). Extracting Business Rules from Information Systems.


Smith, R. S. (1990). Mostly on Mondays: Is workers compensation covering off-the-job injuries? Benefits, Costs and Cycles in Workers Compensation.

Soares, M. (2008). Contributo do Data Mining na Detecção e Prevenção de Fraude.

Solvency II. (2009). Directive 2009/138/EC of the European Parliament and of the Council of 25 November 2009 on the taking-up and pursuit of the business of Insurance and Reinsurance (Solvency II). Official Journal of the European Union, L 335, 1–155.

Stumpf, S., Rajaram, V., Li, L., Wong, W. K., Burnett, M., Dietterich, T., … Herlocker, J. (2009). Interacting meaningfully with machine learning systems: Three experiments. International Journal of Human Computer Studies, 67(8), 639–662. https://doi.org/10.1016/j.ijhcs.2009.03.004

Tennyson, S. (2008). Moral, Social, and Economic Dimensions of Insurance Claims Fraud. Social Research, 75(4), 1181–1205.

Thirlwell, J. (2010). Operational risk: Cinderella or Prince Charming? Financial Times Prentice Hall.

Vandenabeele, T. (2014). Solvency II in a nutshell. Milliman Market Update.

Viaene, S., Derrig, R. A., Baesens, B., & Dedene, G. (2002). A Comparison of State-of- the-Art Classification Techniques for Expert Automobile Insurance Fraud Detection. Journal of Risk and Insurance, 373–421.

Vieira Gomes, J. M. (2013). O acidente de trabalho. In O acidente de trabalho (pp. 19–47).

Vlasselaer, V. Van, Bravo, C., Caelen, O., Eliassi-Rad, T., Akoglu, L., Snoeck, M., & Baesens, B. (2015). APATE: A novel approach for automated credit card transaction fraud detection using network-based extensions. Decision Support Systems, 75, 38–48. https://doi.org/10.1016/j.dss.2015.04.013

Wei, W., Li, J., Cao, L., Ou, Y., & Chen, J. (2013). Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web, 16(16). https://doi.org/10.1007/s11280-012-0178-0

Welch, O. J., Reeves, T. E., & Welch, S. T. (1998). Using a genetic algorithm-based classifier system for modeling auditor decision behavior in a fraud setting. International Journal of Intelligent Systems in Accounting Finance & Management, 7, 173–186.

Werbos, P. J. (1994). The Roots of Backpropagation. John Wiley & Sons.


8. ANNEXES

Annex 1

Figure 17 - Example of a distribution of a variable


Annex 2

Figure 18 - Correlation between two variables


Annex 3

Figure 19 - Fraud classification


Annex 4

Outputs of control model with all variables

Figure 20 - Misclassification Rate

Figure 21 - Cumulative Lift

Figure 22 - Cumulative % Response


Annex 5

Output of model with 15 hidden units

Figure 23 - Cumulative Lift

Figure 24 – Cumulative % Response

Figure 25 - Misclassification Rate


Annex 6

Outputs of model with 50 hidden units

Figure 26 - Cumulative Lift

Figure 27 - Cumulative % Response

Figure 28 - Misclassification Rate


Annex 7

Output of logistic function

Figure 29 - Misclassification Rate

Figure 30 - Cumulative % Response

Figure 31 - Cumulative Lift


Annex 8

Output of the model using Backpropagation

Figure 32 - Misclassification Rate

Figure 33 - Cumulative Lift

Figure 34 - Cumulative % Response


Annex 9

Output of the model with no preliminary training

Figure 35 - Misclassification Rate

Figure 36 - Cumulative % Response

Figure 37 - Cumulative Lift


Annex 10

Output of the article-based algorithm

Figure 38 - Cumulative Lift

Figure 39 - Cumulative % Response

Figure 40 - ROC plot


Annex 11

Output of Control Model 2 (Best Practices Group)

Figure 41 - Cumulative Lift

Figure 42 - Cumulative % Response


Annex 12

Output of the 15-hidden neurons algorithm

Figure 43 - Cumulative Lift

Figure 44 - Misclassification Rate

Figure 45 - Cumulative % Response


Annex 13

Output of the 50-hidden neurons algorithm

Figure 46 - Cumulative Lift

Figure 47 - Cumulative % Response


Annex 14

Output of the algorithm with no preliminary training

Figure 48 - Cumulative Lift

Figure 49 - Misclassification Rate

Figure 50 - Cumulative % Response


Annex 15

Output of the article-based algorithm

Figure 51 - Cumulative Lift

Figure 52 – Cumulative % Response


Annex 16

Output of Control Model 3 (Expert-Selected Variables)

Figure 53 - Misclassification Rate

Figure 54 – Cumulative % Response

Figure 55 - Cumulative Lift


Annex 17

Output of the 15-hidden neurons algorithm

Figure 56 - ROC plot

Figure 57 - Cumulative Lift

Figure 58 - Cumulative % Response


Annex 18

Output of the 50-hidden neurons algorithm

Figure 59 - Cumulative Lift

Figure 60 - Cumulative % Response


Annex 19

Output of the algorithm using the logistic function

Figure 61 - Misclassification rate

Figure 62 – Cumulative % Response

Figure 63 - Cumulative Lift


Annex 20

Output of the model using Backpropagation

Figure 64 - Cumulative Lift

Figure 65 - ROC chart

Figure 66 - Cumulative % Response


Annex 21

Output of the algorithm using no preliminary training

Figure 67 - Cumulative Lift

Figure 68 – Cumulative % Response

Figure 69 - Misclassification rate


Annex 22

Output of the article-based algorithm

Figure 70 - Cumulative Lift

Figure 71 – Cumulative % Response


Annex 23

Figure 72 - Algorithms overview