
Green Data Science
Using Big Data in an “Environmentally Friendly” Manner

Wil M. P. van der Aalst
Eindhoven University of Technology, Department of Mathematics and Computer Science,

PO Box 513, NL-5600 MB Eindhoven, The Netherlands

Keywords: Data Science, Big Data, Fairness, Confidentiality, Accuracy, Transparency, Process Mining.

Abstract: The widespread use of “Big Data” is heavily impacting organizations and individuals for which these data are collected. Sophisticated data science techniques aim to extract as much value from data as possible. Powerful mixtures of Big Data and analytics are rapidly changing the way we do business, socialize, conduct research, and govern society. Big Data is considered as the “new oil” and data science aims to transform this into new forms of “energy”: insights, diagnostics, predictions, and automated decisions. However, the process of transforming “new oil” (data) into “new energy” (analytics) may negatively impact citizens, patients, customers, and employees. Systematic discrimination based on data, invasions of privacy, non-transparent life-changing decisions, and inaccurate conclusions illustrate that data science techniques may lead to new forms of “pollution”. We use the term “Green Data Science” for technological solutions that enable individuals, organizations and society to reap the benefits from the widespread availability of data while ensuring fairness, confidentiality, accuracy, and transparency. To illustrate the scientific challenges related to “Green Data Science”, we focus on process mining as a concrete example. Recent breakthroughs in process mining resulted in powerful techniques to discover the real processes, to detect deviations from normative process models, and to analyze bottlenecks and waste. Therefore, this paper poses the question: How to benefit from process mining while avoiding “pollutions” related to unfairness, undesired disclosures, inaccuracies, and non-transparency?

1 INTRODUCTION

In recent years, data science emerged as a new and important discipline. It can be viewed as an amalgamation of classical disciplines like statistics, data mining, databases, and distributed systems. We use the following definition: “Data science is an interdisciplinary field aiming to turn data into real value. Data may be structured or unstructured, big or small, static or streaming. Value may be provided in the form of predictions, models learned from data, or any type of data visualization delivering insights. Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects.” (Aalst, 2016).

Related to data science is the overhyped term “Big Data” that is used to refer to the massive amounts of data collected. Organizations are heavily investing in Big Data technologies, but at the same time citizens, patients, customers, and employees are concerned about the use of their data. We live in an era characterized by unprecedented opportunities to sense, store, and analyze data related to human activities in great detail and resolution. This introduces new risks and intended or unintended abuse enabled by powerful analysis techniques. Data may be sensitive and personal, and should not be revealed or used for purposes different from what was agreed upon. Moreover, analysis techniques may discriminate against minorities even when attributes like gender and race are removed. Using data science technology as a “black box” making life-changing decisions (e.g., medical prioritization or mortgage approvals) triggers a variety of ethical dilemmas.

Sustainable data science is only possible when citizens, patients, customers, and employees are protected against irresponsible uses of data (big or small). Therefore, we need to separate the “good” and “bad” of data science. Compare this with environmentally friendly forms of green energy (e.g., solar power) that overcome problems related to traditional forms of energy. Data science may result in unfair decision making, undesired disclosures, inaccuracies, and non-transparency. These irresponsible uses of data can be viewed as “pollution”. Abandoning the systematic use of data may help to overcome these problems. However, this would be comparable to abandoning the use of energy altogether. Data science is used to make products and services more reliable, convenient, efficient, and cost effective. Moreover, most new products and services depend on the collection and use of data. Therefore, we argue that the “prohibition of data (science)” is not a viable solution.

In this paper, we coin the term “Green Data Science” (GDS) to refer to the collection of techniques and approaches trying to reap the benefits of data science and Big Data while ensuring fairness, confidentiality, accuracy, and transparency. We believe that technological solutions can be used to avoid pollution and protect the environment in which data is collected and used. Section 2 elaborates on the following four challenges:

• Fairness – Data Science without prejudice: How to avoid unfair conclusions even if they are true?

• Confidentiality – Data Science that ensures confidentiality: How to answer questions without revealing secrets?

• Accuracy – Data Science without guesswork: How to answer questions with a guaranteed level of accuracy?

• Transparency – Data Science that provides transparency: How to clarify answers such that they become indisputable?

Concerns related to privacy and personal data protection triggered legislation like the EU’s Data Protection Directive. Directive 95/46/EC (“on the protection of individuals with regard to the processing of personal data and on the free movement of such data”) of the European Parliament and the Council was adopted on 24 October 1995 (European Commission, 1995). The General Data Protection Regulation (GDPR) is currently under development and aims to strengthen and unify data protection for individuals within the EU (European Commission, 2015). GDPR will replace Directive 95/46/EC, is expected to be finalized in Spring 2016, and will be much more restrictive than earlier legislation. Sanctions include fines of up to 4% of the annual worldwide turnover. GDPR and other forms of legislation limiting the use of data may prevent the use of data science even in situations where data is used in a positive manner. Prohibiting the collection and systematic use of data is like turning back the clock. Next to legislation, positive technological solutions are needed to ensure fairness, confidentiality, accuracy, and transparency. By just imposing restrictions, individuals, organizations and society cannot exploit data (science) in a positive way.

The four challenges discussed in Section 2 are quite general. Therefore, we focus on a concrete subdiscipline in data science in Section 3: Process Mining (Aalst, 2011). Process mining seeks the confrontation between event data (i.e., observed behavior) and process models (hand-made or discovered automatically). Event data are related to explicit process models, e.g., Petri nets or BPMN models. For example, process models are discovered from event data or event data are replayed on models to analyze compliance and performance. Process mining provides a bridge between data-driven approaches (data mining, machine learning and business intelligence) and process-centric approaches (business process modeling, model-based analysis, and business process management/reengineering). Process mining results may drive redesigns, show the need for new controls, trigger interventions, and enable automated decision support. Individuals inside (e.g., end-users and workers) and outside (e.g., customers, citizens, or patients) the organization may be impacted by process mining results. Therefore, Section 3 lists process mining challenges related to fairness, confidentiality, accuracy, and transparency.

In the long run, data science is only sustainable if we are willing to address the problems discussed in this paper. Rather than abandoning the use of data altogether, we should find positive technological ways to protect individuals.

2 FOUR CHALLENGES

Figure 1 sketches the “data science pipeline”. Individuals interact with a range of hardware/software systems (information systems, smartphones, websites, wearables, etc.) (1). Data related to machine and interaction events are collected (2) and preprocessed for analysis (3). During preprocessing, data may be transformed, cleaned, anonymized, de-identified, etc. Models may be learned from data or made/modified by hand (4). For compliance checking, models are often normative and made by hand rather than discovered from data. Analysis results based on data (and possibly also models) are presented to analysts, managers, etc. (5) or used to influence the behavior of information systems and devices (6). Based on the data, decisions are made or recommendations are provided. Analysis results may also be used to change systems, laws, procedures, guidelines, responsibilities, etc. (7).

Figure 1: The “data science pipeline” facing four challenges.

Figure 1 also lists the four challenges discussed in the remainder of this section. Each of the challenges requires an understanding of the whole data pipeline. Flawed analysis results or bad decisions may be caused by different factors such as a sampling bias, careless preprocessing, inadequate analysis, or an opinionated presentation.

2.1 Fairness - Data Science without Prejudice: How to Avoid Unfair Conclusions Even if They are True?

Data science techniques need to ensure fairness: Automated decisions and insights should not be used to discriminate in ways that are unacceptable from a legal or ethical point of view. Discrimination can be defined as “the harmful treatment of an individual based on their membership of a specific group or category (race, gender, nationality, disability, marital status, or age)”. However, most analysis techniques aim to discriminate among groups. Banks handing out loans and credit cards try to discriminate between groups that will pay their debts and groups that will run into financial problems. Insurance companies try to discriminate between groups that are likely to claim and groups that are less likely to claim insurance. Hospitals try to discriminate between groups for which a particular treatment is likely to be effective and groups for which this is less likely. Hiring employees, providing scholarships, screening suspects, etc. can all be seen as classification problems: The goal is to explain a response variable (e.g., person will pay back the loan) in terms of predictor variables (e.g., credit history, employment status, age, etc.). Ideally, the learned model explains the response variable as well as possible without discriminating on the basis of sensitive attributes (race, gender, etc.).

To explain discrimination discovery and discrimination prevention, let us consider the set of all (potential) customers of some insurance company specializing in car insurance. For each customer we have the following variables:

• name,

• birthdate,

• gender (male or female),

• nationality,

• car brand (Alfa, BMW, etc.),

• years of driving experience,

• number of claims in the last year,

• number of claims in the last five years, and

• status (insured, refused, or left).

The status field is used to distinguish current customers (status=insured) from customers that were refused (status=refused) or that left the insurance company during the last year (status=left). Customers that were refused or that left more than a year ago are removed from the data set.

Techniques for discrimination discovery aim to identify groups that are discriminated against based on sensitive variables, i.e., variables that should not matter. For example, we may find that “males have a higher likelihood to be rejected than females” or that “foreigners driving a BMW have a higher likelihood to be rejected than Dutch BMW drivers”. Discrimination may be caused by human judgment or by automated decision algorithms using a predictive model. The decision algorithms may discriminate due to a sampling bias, incomplete data, or incorrect labels. If earlier rejections are used to learn new rejections, then prejudices may be reinforced. Similar “self-fulfilling prophecies” can be caused by sampling or missing values.

Even when there is no intent to discriminate, discrimination may still occur. Even when the automated decision algorithm does not use gender and uses only non-sensitive variables, the actual decisions may still be such that (fe)males or foreigners have a much higher probability to be rejected. The decision algorithm may also favor more frequent values for a variable. As a result, minority groups may be treated unfairly.


Figure 2: Tradeoff between fairness and accuracy.

Discrimination prevention aims to create automated decision algorithms that do not discriminate using sensitive variables. It is not sufficient to remove these sensitive variables: Due to correlations and the handling of outliers, unintentional discrimination may still take place. One can add constraints to the decision algorithm to ensure fairness using a predefined criterion. For example, the constraint “males and females should have approximately the same probability to be rejected” can be added to a decision-tree learning algorithm. Next to adding algorithm-specific constraints used during analysis, one can also use preprocessing (modify the input data by resampling or relabeling) or postprocessing (modify models, e.g., relabel mixed leaf nodes in a decision tree). In general there is often a trade-off between maximizing accuracy and minimizing discrimination (see Figure 2). By rejecting fewer males (better fairness), the insurance company may need to pay more claims.

Discrimination prevention often needs to use sensitive variables (gender, age, nationality, etc.) to ensure fairness. This creates a paradox, e.g., information on gender needs to be used to avoid discrimination based on gender.

The first paper on discrimination-aware data mining appeared in 2008 (Pedreshi et al., 2008). Since then, several papers mostly focusing on fair classification appeared: (Calders and Verwer, 2010; Kamiran et al., 2010; Ruggieri et al., 2010). These examples show that unfairness during analysis can be actively prevented. However, unfairness is not limited to classification and more advanced forms of analytics also need to ensure fairness.
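The preprocessing idea can be made concrete with a small sketch. The following Python fragment is a minimal illustration, not one of the cited algorithms: the record structure and the uniform relabeling strategy are assumptions. It measures the difference in rejection rates between two groups and relabels records until the gap disappears.

```python
# Minimal sketch of discrimination discovery and prevention by relabeling.
# Records are dicts with hypothetical fields: "gender" and "status".

def rejection_rate(records, gender):
    group = [r for r in records if r["gender"] == gender]
    return sum(r["status"] == "refused" for r in group) / len(group)

def discrimination(records):
    # Demographic parity difference: P(refused | male) - P(refused | female).
    return rejection_rate(records, "male") - rejection_rate(records, "female")

def relabel(records, max_diff=0.01):
    # Flip labels of refused males until the rejection-rate gap is small.
    # A real "massaging" approach would flip the records closest to the
    # decision boundary rather than arbitrary ones.
    records = [dict(r) for r in records]  # work on a copy
    while discrimination(records) > max_diff:
        victim = next(r for r in records
                      if r["gender"] == "male" and r["status"] == "refused")
        victim["status"] = "insured"
    return records

customers = [
    {"gender": "male", "status": "refused"},
    {"gender": "male", "status": "refused"},
    {"gender": "male", "status": "insured"},
    {"gender": "female", "status": "refused"},
    {"gender": "female", "status": "insured"},
    {"gender": "female", "status": "insured"},
]
print(discrimination(customers))           # 1/3: males are rejected more often
print(discrimination(relabel(customers)))  # ~0 after relabeling the input data
```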

2.2 Confidentiality - Data Science that Ensures Confidentiality: How to Answer Questions without Revealing Secrets?

The application of data science techniques should not reveal certain types of personal or otherwise sensitive information. Often personal data need to be kept confidential. The General Data Protection Regulation (GDPR) currently under development (European Commission, 2015) focuses on personal information:

“The principles of data protection should apply to any information concerning an identified or identifiable natural person. Data including pseudonymized data, which could be attributed to a natural person by the use of additional information, should be considered as information on an identifiable natural person. To determine whether a person is identifiable, account should be taken of all the means reasonably likely to be used either by the controller or by any other person to identify the individual directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the individual, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration both available technology at the time of the processing and technological development. The principles of data protection should therefore not apply to anonymous information, that is information which does not relate to an identified or identifiable natural person or to data rendered anonymous in such a way that the data subject is not or no longer identifiable.”

Confidentiality is not limited to personal data. Companies may want to hide sales volumes or production times when presenting results to certain stakeholders. One also needs to bear in mind that few information systems hold information that can be shared or analyzed without limits (e.g., the existence of personal data cannot be avoided). The “data science pipeline” depicted in Figure 1 shows that there are different types of data having different audiences. Here we focus on: (1) the “raw data” stored in the information system (stage 2 in Figure 1), (2) the data used as input for analysis (stage 3), and (3) the analysis results interpreted by analysts and managers (stage 5). Whereas the raw data may refer to individuals, the data used for analysis is often (partly) de-identified, and analysis results may refer to aggregate data only. It is important to note that confidentiality may be endangered along the whole pipeline and includes analysis results.

Consider a data set that contains sensitive information. Records in such a data set may have three types of variables:

• Direct Identifiers: Variables that uniquely identify a person, house, car, company, or other entity. For example, a social security number identifies a person.

• Key Variables: Subsets of variables that together can be used to identify some entity. For example, it may be possible to identify a person based on gender, age, and employer. A car may be uniquely identified based on registration date, model, and color. Key variables are also referred to as implicit identifiers or quasi-identifiers.

• Non-identifying Variables: Variables that cannot be used to identify some entity (directly or indirectly).

Confidentiality is impaired by unintended or malicious disclosures. We consider three types of such disclosures:

• Identity Disclosure: Information about an entity (person, house, etc.) is revealed. This can be done through direct or implicit identifiers. For example, the salaries of employees are disclosed unintentionally or an intruder is able to retrieve patient data.

• Attribute Disclosure: Information about an entity can be derived indirectly. If there is only one male surgeon in the age group 40-45, then aggregate data for this category reveals information about this person.

• Partial Disclosure: Information about a group of entities can be inferred. Aggregate information on male surgeons in the age group 40-45 may disclose an unusual number of medical errors. These cannot be linked to a particular surgeon. Nevertheless, one may conclude that surgeons in this group are more likely to make errors.

De-identification of data refers to the process of removing or obscuring variables with the goal to minimize unintended disclosures. In many cases re-identification is possible by linking different data sources. For example, the combination of wedding date and birth date may allow for the re-identification of a particular person. Anonymization of data refers to de-identification that is irreversible: re-identification is impossible. A range of de-identification methods is available: removing variables, randomization, hashing, shuffling, sub-sampling, aggregation, truncation, generalization, adding noise, etc. Adding some noise to a continuous variable or the coarsening of values may have a limited impact on the quality of analysis results while ensuring confidentiality.
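As an illustration, the following sketch (Python; the field names follow the insurance example and are otherwise hypothetical) applies three of the listed methods: hashing a direct identifier, generalizing a birthdate, and adding noise to a continuous variable. It also counts how many records remain unique on the key variables, a simple re-identification risk check.

```python
import hashlib
import random
from collections import Counter

def de_identify(record, salt="s3cret"):
    """Sketch of de-identification: hash, generalize, and perturb."""
    out = dict(record)
    # Replace the direct identifier by a salted hash (pseudonymization;
    # without the salt, re-identification by dictionary attack is easy).
    out["name"] = hashlib.sha256((salt + record["name"]).encode()).hexdigest()[:12]
    # Generalize the birthdate "YYYY-MM-DD" to the year only (coarsening).
    out["birthdate"] = record["birthdate"][:4]
    # Add noise to a continuous variable (years of driving experience).
    out["experience"] = max(0, record["experience"] + random.gauss(0, 1))
    return out

def unique_on_keys(records, keys=("gender", "birthdate", "car_brand")):
    """Count records that are unique on the key (quasi-identifier) variables."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)

customers = [
    {"name": "Alice", "birthdate": "1980-05-17", "gender": "female",
     "car_brand": "BMW", "experience": 20},
    {"name": "Bob", "birthdate": "1975-11-02", "gender": "male",
     "car_brand": "Alfa", "experience": 25},
]
released = [de_identify(c) for c in customers]
print(unique_on_keys(released), "records are still unique on the key variables")
```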


Figure 3: Tradeoff between confidentiality and utility.

There is a trade-off between minimizing the disclosure of sensitive information and the usefulness of analysis results (see Figure 3). Removing variables, aggregation, and adding noise can make it hard to produce any meaningful analysis results. Emphasis on confidentiality (like security) may also reduce convenience. Note that personalization often conflicts with fairness and confidentiality. Disclosing all data supports analysis, but jeopardizes confidentiality.

Access rights to the different types of data and analysis results in the “data science pipeline” (Figure 1) vary per group. For example, very few people will have access to the “raw data” stored in the information system (stage 2 in Figure 1). More people will have access to the data used for analysis and the actual analysis results. Poor cybersecurity may endanger confidentiality. Good policies ensuring proper authentication (Are you who you say you are?) and authorization (What are you allowed to do?) are needed to protect access to the pipeline in Figure 1. Cybersecurity measures should not complicate access, data preparation, and analysis; otherwise people may start using illegal copies and replicate data.

See (Monreale et al., 2014; Nelson, 2015; President’s Council, 2014) for approaches to ensure confidentiality.


2.3 Accuracy - Data Science without Guesswork: How to Answer Questions with a Guaranteed Level of Accuracy?

Increasingly, decisions are made using a combination of algorithms and data rather than human judgement. Hence, analysis results need to be accurate and should not deceive end-users and decision makers. Yet, there are several factors endangering accuracy.

First of all, there is the problem of overfitting the data leading to “bogus conclusions”. There are numerous examples of so-called spurious correlations illustrating the problem. Some examples (taken from (Vigen, 2015)):

• The per capita cheese consumption strongly correlates with the number of people who died by becoming tangled in their bedsheets.

• The number of Japanese passenger cars sold in the US strongly correlates with the number of suicides by crashing of motor vehicle.

• US spending on science, space and technology strongly correlates with suicides by hanging, strangulation and suffocation.

• The total revenue generated by arcades strongly correlates with the number of computer science doctorates awarded in the US.

When using many variables relative to the number of instances, classification may result in complex rules overfitting the data. This is often referred to as the curse of dimensionality: As dimensionality increases, the number of combinations grows so fast that the available data become sparse. With a fixed number of instances, the predictive power reduces as the dimensionality increases. Using cross-validation, most findings (e.g., classification rules) will get rejected. However, if there are many findings, some may survive cross-validation by sheer luck.

In statistics, Bonferroni’s correction is a method (named after the Italian mathematician Carlo Emilio Bonferroni) to compensate for the problem of multiple comparisons. Normally, one rejects the null hypothesis if the likelihood of the observed data under the null hypothesis is low (Casella and Berger, 2002). If we test many hypotheses, we also increase the likelihood of a rare event. Hence, the likelihood of incorrectly rejecting a null hypothesis increases (Miller, 1981). If the desired significance level for the whole collection of null hypotheses is α, then the Bonferroni correction suggests that one should test each individual hypothesis at a significance level of α/k, where k is the number of null hypotheses. For example, if α = 0.05 and k = 20, then α/k = 0.05/20 = 0.0025 is the required significance level for testing the individual hypotheses.

Next to overfitting the data and testing multiple hypotheses, there is the problem of uncertainty in the input data and the problem of not showing uncertainty in the results.

Uncertainty in the input data is related to the fourth “V” in the four “V’s of Big Data” (Volume, Velocity, Variety, and Veracity). Veracity refers to the trustworthiness of the input data. Sensor data may be uncertain, multiple users may use the same account, tweets may be generated by software rather than people, etc. These uncertainties are often not taken into account during analysis, assuming that things “even out” in larger data sets. This need not be the case, and the reliability of analysis results is affected by unreliable or probabilistic input data.

When we say, “we are 95% confident that the true value of parameter x is in our confidence interval [a,b]”, we mean that 95% of the hypothetically observed confidence intervals will hold the true value of parameter x. Averages, sums, standard deviations, etc. are often based on sample data. Therefore, it is important to provide a confidence interval. For example, given a mean of 35.4 the 95% confidence interval may be [35.3, 35.6], but the 95% confidence interval may also be [15.3, 55.6]. In the latter case, we will interpret the mean of 35.4 as a “wild guess” rather than a representative value for the true average. Although we are used to confidence intervals for numerical values, decision makers have problems interpreting the expected accuracy of more complex analysis results like decision trees, association rules, process models, etc. Cross-validation techniques like k-fold checking and confusion matrices give some insights. However, models and decisions tend to be too “crisp” (hiding uncertainties). Explicit vagueness or more explicit confidence diagnostics may help to better interpret analysis results. Parts of models should be kept deliberately “vague” if analysis is not conclusive.
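The contrast between a tight and a wide interval is easy to demonstrate. The sketch below (Python, standard library only) uses the normal approximation mean ± 1.96·s/√n; for small samples a t-quantile would be more appropriate.

```python
import math
import statistics

def confidence_interval_95(sample):
    """95% confidence interval for the mean, using the normal approximation
    mean +/- 1.96 * s / sqrt(n)."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - 1.96 * sem, mean + 1.96 * sem

# Two samples with (almost) the same mean but very different spread:
tight = [35.2, 35.4, 35.5, 35.3, 35.6, 35.4, 35.3, 35.5]
wide = [15.0, 55.0, 25.0, 45.0, 35.4, 20.0, 50.0, 37.8]

print(confidence_interval_95(tight))  # narrow interval: the mean is informative
print(confidence_interval_95(wide))   # wide interval: the mean is a "wild guess"
```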

2.4 Transparency - Data Science that Provides Transparency: How to Clarify Answers Such that They Become Indisputable?

Data science techniques are used to make a variety of decisions. Some of these decisions are made automatically based on rules learned from historic data. For example, a mortgage application may be rejected automatically based on a decision tree.


According to Bonferroni’s principle we need to avoid treating random observations as if they are real and significant (Rajaraman and Ullman, 2011). The following example, inspired by a similar example in (Rajaraman and Ullman, 2011), illustrates the risk of treating completely random events as patterns.

A Dutch government agency is searching for terrorists by examining hotel visits of all of its 18 million citizens (18 × 10^6). The hypothesis is that terrorists meet multiple times at some hotel to plan an attack. Hence, the agency looks for suspicious “events” {p1, p2} † {d1, d2} where persons p1 and p2 meet on days d1 and d2. How many of such suspicious events will the agency find if the behavior of people is completely random? To estimate this number we need to make some additional assumptions. On average, Dutch people go to a hotel every 100 days and a hotel can accommodate 100 people at the same time. We further assume that there are (18 × 10^6) / (100 × 100) = 1800 Dutch hotels where potential terrorists can meet.

The probability that two persons (p1 and p2) visit a hotel on a given day d is (1/100) × (1/100) = 10^-4. The probability that p1 and p2 visit the same hotel on day d is 10^-4 × 1/1800 = 5.55 × 10^-8. The probability that p1 and p2 visit the same hotel on two different days d1 and d2 is (5.55 × 10^-8)^2 = 3.086 × 10^-15. Note that different hotels may be used on both days. Hence, the probability of suspicious event {p1, p2} † {d1, d2} is 3.086 × 10^-15.

How many candidate events are there? Assume an observation period of 1000 days. Hence, there are 1000 × (1000 - 1) / 2 = 499,500 combinations of days d1 and d2. Note that the order of days does not matter, but the days need to be different. There are (18 × 10^6) × (18 × 10^6 - 1) / 2 = 1.62 × 10^14 combinations of persons p1 and p2. Again the ordering of p1 and p2 does not matter, but p1 ≠ p2. Hence, there are 499,500 × 1.62 × 10^14 = 8.09 × 10^19 candidate events {p1, p2} † {d1, d2}.

The expected number of suspicious events is equal to the product of the number of candidate events {p1, p2} † {d1, d2} and the probability of such events (assuming independence): 8.09 × 10^19 × 3.086 × 10^-15 = 249,749. Hence, there will be around a quarter million observed suspicious events {p1, p2} † {d1, d2} in a 1000 day period!

Suppose that there are only a handful of terrorists and related meetings in hotels. The Dutch government agency will need to investigate around a quarter million suspicious events involving hundreds of thousands of innocent citizens. Using Bonferroni’s principle, we know beforehand that this is not wise: there will be too many false positives.

Example: Bonferroni’s principle explained using an example taken from (Aalst, 2016). To apply the principle, compute the number of observations of some phenomena one is interested in under the assumption that things occur at random. If this number is significantly larger than the real number of instances one expects, then most of the findings will be false positives.
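The arithmetic in the example is easy to verify programmatically; the following Python sketch recomputes the expected number of suspicious events under the stated assumptions.

```python
# Recompute the Bonferroni hotel example under the stated assumptions.
citizens = 18e6          # population size
p_hotel = 1 / 100        # probability a person is in a hotel on a given day
capacity = 100           # people per hotel
hotels = citizens * p_hotel / capacity       # 1800 hotels
days = 1000              # observation period

# Probability that two given persons are in the same hotel on one day,
# and then again on two specific (different) days:
p_same_hotel_one_day = p_hotel * p_hotel / hotels   # ~5.55e-8
p_event = p_same_hotel_one_day ** 2                 # ~3.086e-15

# Number of candidate events {p1, p2} x {d1, d2}:
person_pairs = citizens * (citizens - 1) / 2        # ~1.62e14
day_pairs = days * (days - 1) / 2                   # 499,500

expected = person_pairs * day_pairs * p_event
print(round(expected))   # ~250,000 "suspicious" events, all false positives
```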

Other decisions are based on analysis results (e.g., process models or frequent patterns). For example, when analysis reveals previously unknown bottlenecks, this may have consequences for the organization of work and changes in staffing (or even layoffs). Automated decision rules (stage 6 in Figure 1) need to be as accurate as possible (e.g., to reduce costs and delays). Analysis results (stage 5 in Figure 1) also need to be accurate. However, accuracy is not sufficient to ensure acceptance and proper use of data science techniques. Both decisions (stage 6) and analysis results (stage 5) also need to be transparent.

Figure 4 illustrates the notion of transparency. Consider an application submitted by John evaluated using three data-driven decision systems. The first system is a black box: It is unclear why John’s application is rejected. The second system reveals its decision logic in the form of a decision tree. Applications from females and younger males are always accepted. Only applications from older males get rejected. The third system uses the same decision tree, but also explains the rejection (“because male and above 50”).


Figure 4: Different levels of transparency.

Clearly, the third system is most transparent. When governments make decisions for citizens it is often mandatory to explain the basis for such decisions.
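A minimal sketch of the third, most transparent system: a hard-coded decision tree (mirroring Figure 4; the threshold of 50 follows the example in the text) that returns every decision together with the explanation justifying it.

```python
# Sketch of a transparent data-driven decision system: every decision is
# returned together with the path through the decision tree that produced it.

def decide(gender, age):
    """Return (decision, explanation) for an application."""
    if gender == "female":
        return "accept", "applications from females are always accepted"
    if age <= 50:
        return "accept", "male and 50 or younger"
    return "reject", "male and above 50"

decision, explanation = decide("male", 57)
print(f"Decision: {decision} ({explanation})")
```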

Deep learning techniques (like many-layered neural networks) use multiple processing layers with complex structures or multiple non-linear transformations. These techniques have been successfully applied to automatic speech recognition, image recognition, and various other complex decision tasks. Deep learning methods are often looked at as a “black box”, with performance measured empirically and no formal guarantees or explanations. A many-layered neural network is not as transparent as, for example, a decision tree. Such a neural network may make good decisions, but it cannot explain a rule or criterion. Therefore, such black box approaches are non-transparent and may be unacceptable in some domains.

Transparency is not restricted to automated decision making and explaining individual decisions; it also involves the intelligibility, clearness, and comprehensibility of analysis results (e.g., a process model, decision tree, regression formula). For example, a model may reveal bottlenecks in a process, possible fraudulent behavior, deviations by a small group of individuals, etc. It needs to be clear for the user of such models (e.g., a manager) how these findings were obtained. The link to the data and the analysis technique used should be clear. For example, filtering the input data (e.g., removing outliers) or adjusting parameters of the algorithm may have a dramatic effect on the model returned.

Storytelling is sometimes referred to as “the last mile in data science”. The key question is: How to communicate analysis results with end-users? Storytelling is about communicating actionable insights to the right person, at the right time, in the right way. One needs to know the gist of the story one wants to tell to successfully communicate analysis results (rather than presenting the whole model and all data). One can use natural language generation to transform selected analysis results into concise, easy-to-read, individualized reports.

To provide transparency there should be a clear link between data and analysis results/stories. One needs to be able to drill down and inspect the data from the model’s perspective. Given a bottleneck, one needs to be able to drill down to the instances that are delayed due to the bottleneck. This is related to data provenance: it should always be possible to reproduce analysis results from the original data.

The four challenges depicted in Figure 1 are clearly interrelated. There may be trade-offs between fairness, confidentiality, accuracy and transparency. For example, to ensure confidentiality we may add noise and de-identify data, thus possibly compromising accuracy and transparency.

3 EXAMPLE: GREEN PROCESS MINING

The goal of process mining is to turn event data into insights and actions (Aalst, 2016). Process mining is an integral part of data science, fueled by the availability of data and the desire to improve processes. Process mining can be seen as a means to bridge the gap between data science and process science. Data science approaches tend to be process agnostic whereas process science approaches tend to be model-driven without considering the “evidence” hidden in the data. This section discusses challenges related to fairness, confidentiality, accuracy, and transparency in the context of process mining. The goal is not to provide solutions, but to illustrate that the more general challenges discussed before trigger concrete research questions when considering processes and event data.

3.1 What is Process Mining?

Figure 5 shows the “process mining pipeline” and can be viewed as a specialization of Figure 1. Process mining focuses on the analysis of event data and analysis results are often related to process models. Process mining is a rapidly growing subdiscipline within both Business Process Management (BPM) (Aalst, 2013a) and data science (Aalst, 2014). Mainstream Business Intelligence (BI), data mining and machine learning tools are not tailored towards the analysis of event data and the improvement of processes. Fortunately, there are dedicated process mining tools able to transform event data into actionable process-related insights. For example, ProM (www.processmining.org) is an open-source process mining tool supporting process discovery, conformance checking, social network analysis, organizational mining, clustering, decision mining, prediction, and recommendation (see Figure 6). Moreover, in recent years, several vendors released commercial process mining tools. Examples include: Celonis Process Mining by Celonis GmbH (www.celonis.de), Disco by Fluxicon (www.fluxicon.com), Interstage Business Process Manager Analytics by Fujitsu Ltd (www.fujitsu.com), Minit by Gradient ECM (www.minitlabs.com), myInvenio by Cognitive Technology (www.my-invenio.com), Perceptive Process Mining by Lexmark (www.lexmark.com), QPR ProcessAnalyzer by QPR (www.qpr.com), Rialto Process by Exeura (www.exeura.eu), SNP Business Process Analysis by SNP Schneider-Neureither & Partner AG (www.snp-bpa.com), and PPM webMethods Process Performance Manager by Software AG (www.softwareag.com).

[Figure 5 annotates the pipeline of Figure 1 for process mining: people and devices generate a variety of events; data in databases, files, and logs has a temporal dimension; event data are stored, e.g., in XES format; models are process models (e.g., BPMN, UML AD/SDs, Petri nets, workflow models); the techniques cover process discovery, conformance checking, and performance analysis; results include process models annotated with frequencies, times, and deviations; operational support includes predictions, recommendations, decisions, and alerts.]

Figure 5: The “process mining pipeline” relates observed and modeled behavior.

3.1.1 Creating and Managing Event Data

Process mining is impossible without proper event logs (Aalst, 2011). An event log contains event data related to a particular process. Each event in an event log refers to one process instance, called case. Events related to a case are ordered. Events can have attributes. Examples of typical attribute names are activity, time, costs, and resource. Not all events need to have the same set of attributes. However, typically, events referring to the same activity have the same set of attributes. Figure 6(a) shows the conversion of a CSV file with four columns (case, activity, resource, and timestamp) into an event log.
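A minimal sketch of such a conversion (Python, standard library only; the column names follow Figure 6(a), the file name is hypothetical): events are read from a CSV file, grouped by case, and ordered by timestamp.

```python
import csv
from collections import defaultdict
from datetime import datetime

def read_event_log(path):
    """Convert a CSV file with columns case, activity, resource, timestamp
    into an event log: a mapping from case id to its ordered list of events."""
    log = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            row["timestamp"] = datetime.fromisoformat(row["timestamp"])
            log[row["case"]].append(row)
    for events in log.values():
        events.sort(key=lambda e: e["timestamp"])  # events of a case are ordered
    return log

log = read_event_log("events.csv")  # hypothetical input file
for case, events in log.items():
    print(case, [e["activity"] for e in events])
```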

Most process mining tools support XES (eXtensible Event Stream) (IEEE Task Force on Process Mining, 2013). In September 2010, the format was adopted by the IEEE Task Force on Process Mining and became the de facto exchange format for process mining. The IEEE Standards Organization is currently evaluating XES with the aim to turn XES into an official IEEE standard.

To create event logs we need to extract, load, transform, anonymize, and de-identify data in a variety of systems (see stage 3 in Figure 5). Consider for example the hundreds of tables in a typical HIS (Hospital Information System) like ChipSoft, McKesson and EPIC or in an ERP (Enterprise Resource Planning) system like SAP, Oracle, and Microsoft Dynamics. Non-trivial mappings are needed to extract events and to relate events to cases. Event data needs to be scoped to focus on a particular process. Moreover, the data also needs to be scoped with respect to confidentiality issues.

3.1.2 Process Discovery

Process discovery is one of the most challenging process mining tasks (Aalst, 2011). Based on an event log, a process model is constructed, thus capturing the behavior seen in the log. Dozens of process discovery algorithms are available. Figure 6(c) shows a process model discovered using ProM’s inductive visual miner (Leemans et al., 2015). Techniques use Petri nets, WF-nets, C-nets, process trees, or transition systems as a representational bias (Aalst, 2016). These results can always be converted to the desired notation, for example BPMN (Business Process Model and Notation), YAWL, or UML activity diagrams.

3.1.3 Conformance Checking

Using conformance checking, discrepancies between the log and the model can be detected and quantified by replaying the log (Aalst et al., 2012). For example, Figure 6(d) shows an activity that was skipped 16 times. Some of the discrepancies found may expose undesirable deviations, i.e., conformance checking signals the need for a better control of the process. Other discrepancies may reveal desirable deviations and can be used for better process support. Input for conformance checking is a process model having executable semantics and an event log.

3.1.4 Performance Analysis

By replaying event logs on process models, we can compute frequencies and waiting/service times. Using alignments (Aalst et al., 2012) we can relate cases to paths in the model.


Figure 6: Six screenshots of ProM while analyzing an event log with 208 cases, 5987 events, and 74 different activities. First, a CSV file is converted into an event log (a). Then, the event data can be explored using a dotted chart (b). A process model is discovered for the 11 most frequent activities (c). The event log can be replayed on the discovered model. This is used to show deviations (d), average waiting times (e), and queue lengths (f).


Since events have timestamps, we can associate the times in-between events along such a path to delays in the process model. If the event log records both start and complete events for activities, we can also monitor activity durations. Figure 6(e) shows an activity that has an average waiting time of 18 days and 16 hours. Note that such bottlenecks are discovered without any modeling.
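As an illustration of such a measurement, the sketch below computes the average time between two activities across all cases. It reuses the event-log dictionary `log` produced by the `read_event_log` sketch in Section 3.1.1; the activity names are placeholders.

```python
from datetime import timedelta

def avg_waiting_time(log, from_activity, to_activity):
    """Average delay between from_activity and to_activity over all cases
    in which both occur (taking the first occurrence of each)."""
    delays = []
    for events in log.values():  # log as produced by read_event_log above
        times = {}
        for e in events:  # events are ordered; keep the first occurrence
            times.setdefault(e["activity"], e["timestamp"])
        if from_activity in times and to_activity in times:
            delays.append(times[to_activity] - times[from_activity])
    return sum(delays, timedelta()) / len(delays) if delays else None

# Hypothetical activity names:
print(avg_waiting_time(log, "register request", "decide"))
```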

3.1.5 Operational Support

Figure 6(f) shows the queue length at a particular point in time. This illustrates that process mining can be used in an online setting to provide operational support. Process mining techniques exist to predict the remaining flow time for a case or the outcome of a process. This requires the combination of a discovered process model, historic event data, and information about running cases. There are also techniques to recommend the next step in a process, to check conformance at run-time, and to provide alerts when certain Service Level Agreements (SLAs) are violated.

3.2 Challenges in Process Mining

Table 1 maps the four generic challenges identified in Section 2 onto the five key ingredients of process mining briefly introduced in Section 3.1. Note that both cases (i.e., process instances) and the resources used to execute activities may refer to individuals (customers, citizens, patients, workers, etc.). Event data are difficult to fully anonymize. In larger processes, most cases follow a unique path. In the event log used in Figure 6, 198 of the 208 cases follow a unique path (focusing only on the order of activities). Hence, knowing the order of a few selected activities may be used to de-anonymize or re-identify cases. The same holds for (precise) timestamps. For the event log in Figure 6, several cases can be uniquely identified based on the day the registration activity (the first activity in the process) was executed. If one knows the timestamps of these initial activities with the precision of an hour, then almost all cases can be uniquely identified. This shows that the ordering and timestamp data in event logs may reveal confidential information unintentionally. Therefore, it is interesting to investigate what can be done by adding noise (or other transformations) to event data such that the analysis results do not change too much. For example, we can shift all timestamps such that all cases start in “week 0”. Most process discovery techniques will still return the same process model. Moreover, the average flow/waiting/service times are not affected by this.
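A sketch of the suggested transformation, again reusing the event-log dictionary `log` from Section 3.1.1: all timestamps of a case are shifted so that every case starts at the same reference time, preserving orderings and within-case delays (the reference date is arbitrary).

```python
from datetime import datetime

def shift_to_week_zero(log, reference=datetime(2016, 1, 4)):
    """Shift each case so it starts at the reference time ("week 0").
    Relative delays within a case are unchanged; absolute dates are hidden."""
    shifted = {}
    for case, events in log.items():
        offset = events[0]["timestamp"] - reference  # events are ordered
        shifted[case] = [dict(e, timestamp=e["timestamp"] - offset)
                         for e in events]
    return shifted

anonymized = shift_to_week_zero(log)
```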

Conformance checking (Aalst et al., 2012) can be viewed as a classification problem. What kind of cases deviate at a particular point? Bottleneck analysis can also be formulated as a classification problem. Which cases get delayed more than 5 days? We may find out that conformance or performance problems are caused by characteristics of the case itself or the people that worked on it. This allows us to discover patterns such as:

• Doctor Jones often performs an operation without making a scan and this results in more incidents later in the process.

• Insurance claims from older customers often get rejected because they are incomplete.

• Citizens that submit their tax declaration too late often get rejected by teams having a higher workload.

Techniques for discrimination discovery can be used to find distinctions that are not desirable/acceptable. Subsequently, techniques for discrimination prevention can be used to avoid such situations. It is important to note that discrimination is not just related to static variables, but also relates to the way cases are handled.

It is also interesting to use techniques from decomposed process mining or streaming process mining (see Chapter 12 in (Aalst, 2016)) to make process mining “greener”.

For streaming process mining one cannot keep track of all events and all cases due to memory constraints and the need to provide answers in real-time (Burattin et al., 2014; Aalst, 2016; Zelst et al., 2015). Hence, event data need to be stored in aggregated form. Aging data structures, queues, time windows, sampling, hashing, etc. can be used to keep only the information necessary to instantly provide answers to selected questions. Such approaches can also be used to ensure confidentiality, often without a significant loss of accuracy.

For decomposed/distributed process mining event data need to be split based on a grouping of activities in the process (Aalst, 2013b; Aalst, 2016). After splitting the event log, it is still possible to discover process models and to check conformance. Interestingly, the sublogs can be analyzed separately. This may be used to break potentially harmful correlations. Rather than storing complete cases, one can also store shorter episodes of anonymized case fragments. Sometimes it may even be sufficient to store only direct successions, i.e., facts of the form “for some unknown case activity a was followed by activity b with a delay of 8 hours”. Some discovery algorithms only use data on direct successions and do not require additional, possibly sensitive, information. Of course certain questions can no longer be answered in a reliable manner (e.g., flow times of cases).
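A sketch of this idea, reusing the event-log dictionary `log` from Section 3.1.1: instead of complete cases, only anonymized direct-succession facts (activity a, activity b, delay) are retained, which suffices as input for discovery algorithms based on the directly-follows relation.

```python
from collections import Counter

def direct_successions(log):
    """Reduce an event log to anonymized facts: for some unknown case,
    activity a was directly followed by activity b with a given delay."""
    facts = []
    for events in log.values():          # case ids are dropped deliberately
        for a, b in zip(events, events[1:]):
            delay = b["timestamp"] - a["timestamp"]
            facts.append((a["activity"], b["activity"], delay))
    return facts

facts = direct_successions(log)
# Frequencies of the directly-follows relation, the input for several
# discovery algorithms:
print(Counter((a, b) for a, b, _ in facts).most_common(5))
```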

Page 12: Green Data Science Using Big Data in an Environmentally ...€¦ · Using Big Data in an Environmentally Friendly Manner ... Fairness Data Science without prejudice: How ... decision

Table 1: Relating the four challenges to process mining specific tasks.

creating andmanagingevent data

processdiscovery

conformancechecking

performanceanalysis

operationalsupport

fairness

Data Science with-

out prejudice: How to

avoid unfair conclusions

even if they are true?

The input data maybe biased, incompleteor incorrect suchthat the analysisreconfirms prejudices.By resampling orrelabeling the data,undesirable forms ofdiscrimination canbe avoided. Notethat both cases andresources (used toexecute activities)may refer to individ-uals having sensitiveattributes such asrace, gender, age, etc.

The discovered modelmay abstract frompaths followed by cer-tain under-representedgroups of cases.Discrimination-awareprocess-discoveryalgorithms can beused to avoid this. Forexample, if cases arehandled differentlybased on gender, wemay want to ensurethat both are equallyrepresented in themodel.

Conformancechecking can beused to “blame”individuals, groups,or organizations fordeviating from somenormative model.Discrimination-awareconformance check-ing (e.g., alignments)needs to separate(1) likelihood, (2)severity and (3)blame. Deviationsmay need to beinterpreted differentlyfor different groups ofcases and resources.

Straightforwardperformance measure-ments may be unfairfor certain classes ofcases and resources(e.g., not taking intoaccount the context).Discrimination-awareperformance analysisdetects unfairnessand supports processimprovements takinginto account trade-offs between internalfairness (worker’sperspective) andexternal fairness (citi-zen/patient/customer’sperspective).

Process-relatedpredictions, rec-ommendationsand decisionsmay discriminate(un)intentionally.This problem canbe tackled usingtechniques fromdiscrimination-awaredata mining.

confidentiality

Data Science that

ensures confidentiality:

How to answer questions

without revealing secrets?

Event data (e.g., XESfiles) may revealsensitive information.Anonymization andde-identification canbe used to avoiddisclosure. Notethat timestamps andpaths may be uniqueand a source forre-identification (e.g.,all paths are unique).

The discovered modelmay reveal sensitiveinformation, espe-cially with respectto infrequent pathsor small event logs.Drilling-down fromthe model may needto be blocked whennumbers get too small(cf. k-anonymity).

Conformance check-ing shows diagnosticsfor deviating casesand resources.Access-controlis important anddiagnostics need to beaggregated to avoidrevealing complianceproblems at the levelof individuals.

Performance analysisshows bottlenecks andother problems. Link-ing these problems tocases and resourcesmay disclose sensitiveinformation.

Process-relatedpredictions, rec-ommendationsand decisions maydisclose sensitiveinformation, e.g.,based on a rejectionother properties canbe derived.

Accuracy (Data Science without guesswork: How to answer questions with a guaranteed level of accuracy?)

- Creating and managing event data: Event data (e.g., XES files) may have all kinds of quality problems. Attributes may be incorrect, imprecise, or uncertain. For example, timestamps may be too coarse (just the date) or reflect the time of recording rather than the time of the event's occurrence.
- Process discovery: Process discovery depends on many parameters and on characteristics of the event log. Process models should therefore show the confidence level of their different parts (see the sketch after this table row). Moreover, additional information (domain knowledge, uncertainty in event data, etc.) needs to be exploited better.
- Conformance checking: Often multiple explanations are possible to interpret non-conformance. Providing just one alignment based on a particular cost function may be misleading. How robust are the findings?
- Performance analysis: In case of fitness problems (process model and event log disagree), performance analysis is based on assumptions and needs to deal with missing values, making results less accurate.
- Operational support: Inaccurate process models may lead to flawed predictions, recommendations, and decisions. Moreover, not communicating the (un)certainty of predictions, recommendations, and decisions may negatively impact processes.
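One way to show the confidence level of a model part, as suggested above, is to attach a binomial confidence interval to each relation observed in the log. The sketch below is a simplification rather than an actual discovery algorithm: it computes a 95% Wilson score interval for the probability that activity b directly follows activity a, given a log represented as a list of activity sequences.

import math

def wilson_interval(successes, n, z=1.96):
    # 95% Wilson score interval for a binomial proportion.
    if n == 0:
        return (0.0, 1.0)  # nothing observed: maximal uncertainty
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def follows_confidence(traces, a, b):
    # How confident are we that b directly follows a, given the log?
    n = sum(t[i] == a for t in traces for i in range(len(t) - 1))
    s = sum(t[i] == a and t[i + 1] == b for t in traces for i in range(len(t) - 1))
    return wilson_interval(s, n)

A wide interval signals that the corresponding arc in the discovered model rests on few observations and should be interpreted with caution.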

Transparency (Data Science that provides transparency: How to clarify answers such that they become indisputable?)

- Creating and managing event data: Provenance of event data is key. Ideally, process mining insights can be related to the event data they are based on. However, this may conflict with confidentiality concerns.
- Process discovery: Discovered process models depend on the event data used as input, the parameter settings, and the choice of discovery algorithm. How to ensure that the process model is interpreted correctly? End users need to understand the relation between data and model to trust the analysis.
- Conformance checking: When modeled and observed behavior disagree, there may be multiple explanations. How to ensure that conformance diagnostics are interpreted correctly?
- Performance analysis: When detecting performance problems, it should be clear how these were detected and what the possible causes are. Animating event logs on models helps to make problems more transparent.
- Operational support: Predictions, recommendations, and decisions are based on process models. If possible, these models should be transparent. Moreover, explanations should be added to predictions, recommendations, and decisions (“We predict that this case will be late, because ...”), as sketched below.
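The idea of pairing every prediction with an explanation can be enforced at the API level: a predictor returns not a bare value but a value plus the reasons that produced it. The following Python sketch uses hypothetical case attributes (queue_length, rework) purely for illustration.

from dataclasses import dataclass, field

@dataclass
class Prediction:
    value: str                                   # e.g., "case will be late"
    reasons: list = field(default_factory=list)  # the "because ..." part

def predict_delay(case):
    # Hypothetical rule-based predictor that records its own reasoning.
    reasons = []
    if case["queue_length"] > 10:
        reasons.append(f"queue length is {case['queue_length']} (> 10)")
    if case["rework"] > 0:
        reasons.append(f"an activity was redone {case['rework']} time(s)")
    value = "case will be late" if reasons else "case is on time"
    return Prediction(value, reasons)

# Example: predict_delay({"queue_length": 12, "rework": 1})
# -> Prediction(value="case will be late", reasons=[...])

Because the reasons travel with the prediction, end users can dispute or verify the evidence rather than having to trust an opaque score.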


(e.g., flow times of cases). The above examples illustrate that Table 1 identifies a range of novel research challenges in process mining. In today's society, event data are collected about anything, at any time, and at any place. Today's process mining tools are able to analyze such data and can handle event logs with billions of events. These amazing capabilities also imply a great responsibility. Fairness, confidentiality, accuracy, and transparency should be key concerns for any process miner.

4 CONCLUSION

This paper introduced the notion of “Green Data Science” (GDS) from four angles: fairness, confidentiality, accuracy, and transparency. The possible “pollution” caused by data science should not be addressed (only) by legislation. We should aim for positive, technological solutions to protect individuals, organizations, and society against the negative side-effects of data. As an example, we discussed “green challenges” in process mining. Table 1 can be viewed as a research agenda listing interesting open problems.

REFERENCES

Aalst, W. van der (2011). Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, Berlin.

Aalst, W. van der (2013a). Business Process Management: A Comprehensive Survey. ISRN Software Engineering, pages 1–37. doi:10.1155/2013/507984.

Aalst, W. van der (2013b). Decomposing Petri Nets for Process Mining: A Generic Approach. Distributed and Parallel Databases, 31(4):471–507.

Aalst, W. van der (2014). Data Scientist: The Engineer of the Future. In Mertins, K., Benaben, F., Poler, R., and Bourrieres, J., editors, Proceedings of the I-ESA Conference, volume 7 of Enterprise Interoperability, pages 13–28. Springer-Verlag, Berlin.

Aalst, W. van der (2016). Process Mining: Data Science in Action. Springer-Verlag, Berlin.

Aalst, W. van der, Adriansyah, A., and Dongen, B. van (2012). Replaying History on Process Models for Conformance Checking and Performance Analysis. WIREs Data Mining and Knowledge Discovery, 2(2):182–192.

Burattin, A., Sperduti, A., and Aalst, W. van der (2014). Control-Flow Discovery from Event Streams. In IEEE Congress on Evolutionary Computation (CEC 2014), pages 2420–2427. IEEE Computer Society.

Calders, T. and Verwer, S. (2010). Three Naive Bayes Approaches for Discrimination-Aware Classification. Data Mining and Knowledge Discovery, 21(2):277–292.

Casella, G. and Berger, R. (2002). Statistical Inference, 2nd Edition. Duxbury Press.

European Commission (1995). Directive 95/46/EC of the European Parliament and of the Council on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data. Official Journal of the European Communities, No L 281/31.

European Commission (2015). Proposal for a Regulation of the European Parliament and of the Council on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data (General Data Protection Regulation). 9565/15, 2012/0011 (COD).

IEEE Task Force on Process Mining (2013). XES Standard Definition. www.xes-standard.org.

Kamiran, F., Calders, T., and Pechenizkiy, M. (2010). Discrimination-Aware Decision-Tree Learning. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2010), pages 869–874.

Leemans, S., Fahland, D., and Aalst, W. van der (2015). Exploring Processes and Deviations. In Fournier, F. and Mendling, J., editors, Business Process Management Workshops, International Workshop on Business Process Intelligence (BPI 2014), volume 202 of Lecture Notes in Business Information Processing, pages 304–316. Springer-Verlag, Berlin.

Miller, R. (1981). Simultaneous Statistical Inference. Springer-Verlag, Berlin.

Monreale, A., Rinzivillo, S., Pratesi, F., Giannotti, F., and Pedreschi, D. (2014). Privacy-By-Design in Big Data Analytics and Social Mining. EPJ Data Science, 1(10):1–26.

Nelson, G. (2015). Practical Implications of Sharing Data: A Primer on Data Privacy, Anonymization, and De-Identification. Paper 1884-2015, ThotWave Technologies, Chapel Hill, NC.

Pedreschi, D., Ruggieri, S., and Turini, F. (2008). Discrimination-Aware Data Mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 560–568. ACM.

President's Council of Advisors on Science and Technology (2014). Big Data and Privacy: A Technological Perspective (Report to the President). Executive Office of the President, US-PCAST.

Rajaraman, A. and Ullman, J. (2011). Mining of Massive Datasets. Cambridge University Press.

Ruggieri, S., Pedreschi, D., and Turini, F. (2010). DCUBE: Discrimination Discovery in Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 1127–1130. ACM.

Zelst, S. van, Dongen, B. van, and Aalst, W. van der (2015). Know What You Stream: Generating Event Streams from CPN Models in ProM 6. In Proceedings of the BPM 2015 Demo Session, volume 1418 of CEUR Workshop Proceedings, pages 85–89. CEUR-WS.org.

Vigen, T. (2015). Spurious Correlations. Hachette Books.