
Introduction to Data Science and Knowledge Engineering

(2018-100-KEN1110)

Pietro Bonizzi & Rachel Cavill
as noted by Krzysztof Cybulski and Paulius Skaisgiris

January 23, 2019

Contents

1 Pietro Bonizzi - Applied mathematics
  1.1 Introduction to Data, Information, and Knowledge
  1.2 Exploratory Data Analysis
  1.3 Data Mining
  1.4 Mathematical modeling
  1.5 Mathematical simulations
  1.6 Game theory

2 Rachel Cavill - Computer science
  2.1 What is AI?
  2.2 Agents
  2.3 Search
  2.4 Logic
  2.5 Learning agents
  2.6 Information retrieval

Preface

This course was described as an introduction to the general ideas behind the entire Data Science & Knowledge Engineering degree at Universiteit Maastricht - a little bit of everything to give a taste, but not quite enough to get into the substance in depth. Like Pietro himself said: sit back and enjoy the ride.


Disclaimer

These notes are not official course material and they are not guaranteed to correspond exactly to what the lecturers said or even to what is correct. They are a students' grassroots attempt at making a collaborative, comprehensive source of knowledge based on what they, subjectively, considered noteworthy.

Trust at your own risk

With that said, the development of these notes is an open source process and contributions are always welcome. If you wish to help out, take a look at the repository at https://gitlab.com/k-cybulski/dke-notes. That is where you will find the LaTeX source code as well as further steps towards making changes.


1 Pietro Bonizzi - Applied mathematics

1.1 Introduction to Data, Information, and Knowledge

Knowledge engineering is the process of developing knowledge based systems, i.e. computer programs that reason and solve complex problems based on a knowledge base. Data science is about scientific methods, processes and systems to extract knowledge or insights from structured or unstructured data. It comes down to utilising mathematical and statistical methods to understand why things work and how we can change them.

In other words, data science is basically statistics on a Mac. Main terms:

• Data is the raw numbers we gather. (e.g. how many metres high up we are)

• Information is a rough estimate of what the numbers mean. (e.g. pretty damn low)

• Knowledge is information put into action. (e.g. open the bloody parachute you moron)

The data science process is the path from asking an interesting question to getting an interesting answer. In short, it looks like this:

1. Ask an interesting question. What is your scientific goal? What do you want to predict?

2. Get the data. How was the data sampled? Which data is relevant? Are there privacy issues?

3. Explore the data. How does the data look when you plot it? Are there anomalies? Are there patterns?

4. Model the data. Build a model, fit it and validate it.

5. Communicate and visualize the results. What did we learn? Do the results make sense? Can you explain them in a story?

The research topics of our department lie at the intersection of computer science, mathematics, artificial intelligence, operations research & data science.

The most important task is to always question everything.

Don’t trust your data. Don’t trust your findings.


1.2 Exploratory Data Analysis

Exploratory data analysis is an approach to analyzing data sets when we first get our hands on them. Our goal is to find their main characteristics:

1. How is it organized? Are there any clusters? Visualize it!

2. Basic statistics & quantification.

3. Outliers?

4. Anomalies? Missing values? Data collection errors?

When in doubt, visualize. Check out Anscombe's quartet for what happens if you don't.

Stay critical. Do not trust your data. Triple check it.

Time series
A time series is a sequence of data points in chronological order, typically consisting of successive measurements made over a time interval, e.g. ECG signals or temperature over time.

Detecting patterns in a time series is no simple task. At first glance we can see that repetition is a very prominent feature of such datasets, and as such it would be very useful for us to be able to decompose the series into its primary frequency components.

Thankfully, our friend mathematics comes to the rescue. The Fourier transform lets us transform a time series from the time domain to the frequency domain, and so it lets us see the underlying frequencies that make up a wave. This allows us to see short and long term behaviour of the system.

See: https://en.wikipedia.org/wiki/File:Fourier_transform_time_and_frequency_domains_(small).gif
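As a quick illustration, here is a minimal sketch (assuming Python with numpy; the signal parameters are made up) of recovering the dominant frequency of a noisy sine wave with the discrete Fourier transform:

    import numpy as np

    fs = 100.0                                 # sampling frequency in Hz
    t = np.arange(0, 10, 1 / fs)               # 10 seconds of samples
    signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)

    spectrum = np.abs(np.fft.rfft(signal))     # magnitude of each frequency bin
    freqs = np.fft.rfftfreq(t.size, 1 / fs)    # frequency axis for the bins
    print(freqs[np.argmax(spectrum)])          # ≈ 5.0, the underlying frequency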


Complex numbers
The complex unit i is defined by the property i² = −1. A complex number z has the form

    z = a + bi

where a, b ∈ ℝ.

• a = Re(z) - Real part

• b = Im(z) - Imaginary part

Complex numbers can be considered points (a, b) in the complex plane.

Exponentials & polar form
The exponential function has various cool characteristics when used with imaginary numbers:

    e^(θi) = cos(θ) + sin(θ)i

    e^(θi) · e^(φi) = e^((θ+φ)i)

Every complex number may be written in polar form as

    z = r·e^(θi),

where r = √(a² + b²) and θ is the angle of (a, b) in the complex plane, as obtained by arctan(b/a). This leads us to the trigonometric form:

    z = r(cos(θ) + i·sin(θ))

The radius r is called the modulus and denoted by r = |z|. The angle θ is called the argument of z, and denoted by θ = arg(z). Note that with such notation, multiplication of complex numbers z = r·e^(θi) and w = s·e^(φi) gives:

    zw = r·e^(θi) · s·e^(φi) = (rs)·e^((θ+φ)i)

This means |zw| = |z||w| and arg(zw) = arg(z) + arg(w).

Conjugate
The complex conjugate of z = a + bi is z̄ = a − bi. Note that

    z·z̄ = (a + bi)(a − bi) = a² + b² = |z|²
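Python's standard library handles these conversions directly; a minimal sketch of switching between Cartesian and polar form with the cmath module:

    import cmath

    z = 3 + 4j
    r, theta = cmath.polar(z)       # modulus |z| = 5.0, argument arg(z) ≈ 0.927
    print(r, theta)
    print(cmath.rect(r, theta))     # back to Cartesian form: ≈ (3+4j)
    print(abs(z), z.conjugate())    # |z| and the conjugate a − bi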

Things to learn for the exam:

• Complex numbers and their representation in the complex plane

• Cartesian and polar form - transformation from one into the other

• Operations with complex numbers

• Solve complex number exercises


1.3 Data Mining

Data mining is the computational process of finding patterns in datasets. In other words, it's an automatic or semi-automatic analysis of large quantities of data to extract previously unknown patterns. It generally involves six common tasks:

• Anomaly detection (outlier/change/deviation detection) - identification of unusual data records that might be interesting, or errors requiring further investigation

• Association rule learning (dependency modelling) - search for relationships between variables

• Clustering - search for groups and structures in the data

• Classification - generalization of known structures to apply to new data

• Regression - search for a function that models the data with the least error

• Summarization - providing a more compact representation of the dataset

Classification
A classification problem occurs when an object needs to be assigned to a predefined group based on a number of observed (quantifiable) features related to that object. So, for example, you could classify pictures of cats and dogs based on the red, green & blue values of the pixels of the image.

There are numerous different approaches to classification (that are all generally very cool - seriously, look them up) including linear/logistic regression, support vector machines, naïve Bayes, decision trees, k-nearest neighbours, neural networks, etc.

Model validation
Before we can let our model roam free in the real world and use it to quantify information in vital tasks, we must first make sure it works.

To do this we generally split up the data measurements we have gathered into three sets:

• Training set that is used to fit the models, i.e. learn the parameters of the model

• Validation set that is used to evaluate different models for model selection

• Test set that is used to examine how well our final model performs

The usual split of the data between the different sets is 50% for training, 25% for validation and 25% for testing.


Overfitting & curse of dimensionality
The more features we use in our model, the higher dimensional it is. If it has too many features it may appear to work fine on our dataset at first, but it might still fail to classify new real-world data. We call this effect overfitting. A high-dimensional feature space is very sparse, so it is easy to find a separating hyperplane that splits it into different classes. However, if we project our high-dimensional classification result to a lower-dimensional space we will see that our classifier has learnt exceptions, not the actual distribution.

In other words, overfitting means our model is overqualified to deal with a task and is confused by its own complexity. Overfitting is an example of the curse of dimensionality.

How to avoid it?
There is no fixed rule for avoiding it. It depends on the amount of available training data, the complexity of the boundaries between clusters and the type of classifier used. Rules of thumb:

• The smaller the training set, the fewer features we should use

• We should select the most discriminant features and discard noise

• Look for features that enhance the information in the data & add combinations of features

Kernel trick
If we add new features based on combinations of existing features we can use the kernel trick. For example, if the most decisive feature of a 2-dimensional distribution is the distance from the coordinate origin, instead of using x and y as features we could use x and x² + y² as features - now a linear classifier could easily see the difference between close and far.

See: https://www.youtube.com/watch?v=3liCbRZPrZA
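A minimal sketch of this feature-map idea (assuming Python with numpy; the radius-1 threshold and point counts are made up for illustration): points inside a circle are not linearly separable in (x, y), but they are in (x, x² + y²):

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.uniform(-2, 2, size=(1000, 2))
    labels = points[:, 0]**2 + points[:, 1]**2 < 1.0    # True = "close" to origin

    # Map (x, y) to the new features (x, x² + y²):
    mapped = np.column_stack([points[:, 0],
                              points[:, 0]**2 + points[:, 1]**2])
    # In the mapped space a simple threshold on the second feature (a linear
    # boundary) separates the two classes perfectly:
    print(np.all((mapped[:, 1] < 1.0) == labels))       # True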

Distance measures
Statistical classification is performed by comparing observations to previous observations by means of a similarity or distance function. This will allow us to quantify how different two objects are and, thus, whether they are close enough to be classified in the same category.

A distance measure should fulfil four conditions. Given two points P and Q:

1. d(P,Q) ≥ 0

2. d(P,Q) = 0 ⇐⇒ P = Q

3. d(P,Q) = d(Q,P)

4. d(P,Q) ≤ d(P,W) + d(W,Q), i.e. the triangle inequality

These conditions give us much flexibility in designing our distance measures. We can compute the distance based on the location or some other properties of the two points.


A few distance measures:

• Euclidean distance
  Given two points P and Q in n-dimensional space with Cartesian coordinates P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), the Euclidean distance d is given by:

      d(P,Q) = √((q_1 − p_1)² + (q_2 − p_2)² + ··· + (q_n − p_n)²) = √(Σ_{i=1}^{n} (q_i − p_i)²)

• Absolute distance

      d(P,Q) = Σ_{i=1}^{n} |q_i − p_i|

• Infinity distance

      d(P,Q) = max(|q_1 − p_1|, ..., |q_n − p_n|)

• Cosine distance
  The cosine distance is based on the angle between the lines from the origin to the two points in question.

• Edit distance
  Edit distance is a metric of how many substitutions we need to make to a string to turn it into another string. So, for example, given

P = (0, 2, 1, 0), Q = (5, 2, 1, 3)

the distance d(P,Q) is 2.

Do note, however, that there are multiple types of edit distances out there. Some of them only measure substitutions while others, like the Levenshtein distance, also measure deletions and insertions. Putting that aside, the definition we stick to in this course only permits substitutions and is constrained to strings of the same length.
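A minimal sketch of all four measures in Python, with points given as equal-length sequences of numbers (the edit distance counts substitutions only, matching the course definition):

    import math

    def euclidean(p, q):
        return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

    def absolute(p, q):
        return sum(abs(qi - pi) for pi, qi in zip(p, q))

    def infinity(p, q):
        return max(abs(qi - pi) for pi, qi in zip(p, q))

    def edit(p, q):
        # Substitutions only; sequences must have equal length.
        return sum(1 for pi, qi in zip(p, q) if pi != qi)

    print(edit((0, 2, 1, 0), (5, 2, 1, 3)))   # 2, as in the example above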

Clustering
Clustering is the task of finding groups in an unlabeled dataset, i.e. finding clusters/subsets in the set such that similar objects are close together and dissimilar objects are far apart.


Machine learning and deep learning
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data without being explicitly programmed for a specific task.

Deep learning is a branch of machine learning that brings it closer to AI. It's about learning multiple layers of representation and abstraction that help make sense of data such as images, sound and text.

Deep learning very often outperforms other standard solutions, but it requires a very large amount of data as well as computational power. At present it is something of a black art - there is hardly any theory to guide you or explanations as to why some operations work. It also lacks interpretability, as it is very hard to comprehend what's going on inside a neural network.

Things to learn for the exam:

• Conditions for a distance measure

• Definitions of the Euclidean distance, absolute distance, infinity distance and edit distance

• Know how to compute the Euclidean distance, absolute distance, infinity distance and edit distance given two time series/strings


1.4 Mathematical modeling

A model is a mathematical representation of a system - it's a simple and abstract copy of reality. A mathematical model can help you understand or simulate the system it represents.

For example, a biological cell could be considered a system, and a model could be used to approximate how many nutrients it needs and how much waste it produces.

Models are useful because:

• They help us understand the properties and behaviour of a system (e.g. predator-prey model)

• They provide us with information we can apply to the system itself (e.g. aeroplane autopilots)

• They let us make predictions for the future (e.g. weather forecasting)

• They let us perform simulations and experiments on a computer (e.g. simulate animal experiments)

Models can range from very simple to extremely complex. However, a complicated model is not necessarily much better - too many features can simply make it prone to overfitting.

”All models are wrong, but some are useful.”

(George Box, 1978)

From data to model
Constructing a model from the data we have is an identification problem - we wish to retrieve the theoretical model that could have created the measurements we collected. As such, this becomes a problem of model fitting. We need to find a mathematical structure that behaves similarly to what we measured.

Generally at this point it’s extremely valuable to visualize the data we have.

Linear regression
Linear regression models are one of the most common classes of models. They attempt to find a linear relation between the variables we have and the variables we are interested in.

Assume we want to approximate the atmospheric pressure in Maastricht given an explanatory variable like temperature, time of year or wind speed. Let's label pressure y and our chosen explanatory variable, the regressor, x. In this case linear regression will attempt to find the parameters a, b of the equation

    y = ax + b

These models are useful because they let us quickly find relationships between quantities. If we have tonnes of different variables then performing linear regression is almost sure to give us an idea of what is having an influence.


How does it work?
In linear regression we attempt to find the line of best fit. So, given a set of points y and x, we attempt to approximate y for any x with a prediction

    ŷ = ax + b

The differences between the predicted ŷ and the real y are called residuals. The line of best fit ŷ = ax + b minimizes the residuals.

Assume we collected n observations

    n    1    2    3   ...   n
    x   x_1  x_2  x_3  ...  x_n
    y   y_1  y_2  y_3  ...  y_n

We can construct a basic model ŷ = ax + b with predictions

    ŷ   ŷ_1  ŷ_2  ŷ_3  ...  ŷ_n

Now we are able to improve it by comparing our predictions ŷ with the ground truth y:

    ŷ = {ŷ_1, ŷ_2, ŷ_3, ..., ŷ_n}
    y = {y_1, y_2, y_3, ..., y_n}

We want to minimize the residuals

    e_i = y_i − ŷ_i

and, as such, we wish to minimize a distance. In this case we may choose the Euclidean distance

    Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − (a·x_i + b))²

which leads us to the optimization problem

    min_{a,b} Σ_{i=1}^{n} (y_i − (a·x_i + b))²

Note here that we can neglect the root of the square - simply operating on the squared numbers is enough, as we do not care about their exact values, only that they are as low as they can get.

This boils down to finding the minimum of a quadratic function. We can find it by using derivatives. The slope of the tangent at the minimum of a parabola is zero, so if we manage to find the derivatives with respect to a and b and set them to 0 we will be able to find the values for which the error is minimal.


A few things about derivatives:

1. Derivative of a constant c with respect to x:

       ∂c/∂x = 0

2. Derivative of x with respect to x:

       ∂x/∂x = 1

3. Derivative of x² with respect to x:

       ∂x²/∂x = 2x

Now, with these tools, we can compute the derivatives of

    V = Σ_{i=1}^{n} (y_i − (a·x_i + b))²

with respect to a and b. Take an example:

    n   1   2   3   4
    x   0   1   2   3
    y   1   2   5   4

The sum of the errors will be

    V = Σ_{i=1}^{4} (y_i − (a·x_i + b))² = Σ_{i=1}^{4} e_i²

    V = e_1² + e_2² + e_3² + e_4²

    e_1² = 1 + b² − 2b
    e_2² = 4 + a² + b² − 4a − 4b + 2ab
    e_3² = 25 + 4a² + b² − 20a − 10b + 4ab
    e_4² = 16 + 9a² + b² − 24a − 8b + 6ab

and so the derivatives with respect to a and b are

    ∂V/∂a = 0 + (2a − 4 + 2b) + (8a − 20 + 4b) + (18a − 24 + 6b)
          = 28a − 48 + 12b = 0

    ∂V/∂b = (2b − 2) + (2b − 4 + 2a) + (2b − 10 + 4a) + (2b − 8 + 6a)
          = 8b − 24 + 12a = 0

Now just solve the system of equations:

    28a − 48 + 12b = 0
     8b − 24 + 12a = 0

We end up with a = 6/5, b = 6/5, and so the line of best fit is:

    ŷ_i = (6/5)·x_i + 6/5
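We can check this numerically; a minimal sketch using numpy's least-squares solver on the example data (assuming numpy is available):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 5.0, 4.0])

    # Solve min_{a,b} ||A·[a, b] − y||² with the design matrix A = [x, 1].
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(a, b)   # ≈ 1.2 1.2, i.e. a = b = 6/5 as derived above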


Overfitting, estimation and validation
When constructing a model we must keep a few things in mind. A large enough model can reproduce a measured output arbitrarily well. However, we must verify that the model is also relevant for other data - specifically, data that was not used for estimation but was collected from the same system.

If there is a discrepancy between how our model performs on estimation data and validation data, we have got ourselves into a pit of overfitting and we need to rethink our strategy. Our model fits the exception, not the rule.

Having more parameters does not make our model better.

Things to learn for the exam:

• Given a set of correlated variables x and y you must know how to compute the linear equation of the regression line through the data, ŷ = ax + b.

• Know how to solve exercises such as this one:

In a physical experiment, four different measurements y_k are obtained.

    x_k   −1   0   2   3
    y_k    1   2   2   4

The physicist who collected these is interested in fitting the measurements to a model of the form y_k = a·x_k + b. Compute the optimal values for a and b according to the chosen model.


1.5 Mathematical simulations

Simulation is the imitation of a real world process based on a model. The model is an approximation of the real system itself, whereas the simulation represents its operation over time. In terms of data → information → knowledge:

    Data   →  Information  →  Knowledge
    System →  Model        →  Computer simulation

Monte Carlo methods
Monte Carlo methods use randomness to calculate the properties of a mathematical model where the randomness is not a feature of the target model. They rely on repeated random sampling of values for uncertain variables that are plugged into the simulation model and used to calculate outcomes of interest.

Monte Carlo simulations are especially helpful to study and analyse the behaviour of systems that involve uncertainty, or when an exhaustive numerical evaluation is prohibitively slow.

An example of this is a simple algorithm to approximate π by comparing the area of a circle inscribed in a square to the square itself.

1. Let C be a circle inscribed into a square S of side length 2R.

2. Generate a set of points P(x, y) such that x and y are uniformly distributed within S.

3. Let PC , PS be the number of points that fell into C and S respectively.

4. Then

       π ≈ 4·P_C / P_S

   because

       A_C = πR²
       A_S = 4R²
       P_C / P_S ≈ A_C / A_S = πR² / (4R²) = π/4
       π ≈ 4·P_C / P_S
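A minimal sketch of this estimator in Python, taking R = 1 so that S is the square [−1, 1]² (every generated point lands in S, so P_S is simply the sample count):

    import random

    def estimate_pi(n_samples: int) -> float:
        # Count points that fall inside the inscribed unit circle x² + y² <= 1.
        in_circle = 0
        for _ in range(n_samples):
            x, y = random.uniform(-1, 1), random.uniform(-1, 1)
            if x * x + y * y <= 1.0:
                in_circle += 1
        return 4.0 * in_circle / n_samples

    print(estimate_pi(1_000_000))   # ≈ 3.14; accuracy improves with more samples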

Uniform vs Gaussian distribution
When dealing with randomness we must keep in mind the various distributions of random variables:

• Uniformly distributed random numbers on an interval have an equal probability of being selected.

• Normally distributed random numbers on an interval have probabilities that follow the normal distribution bell curve, so numbers closer to the mean are more likely to be selected.

Knowing probability and statistics is crucial. Learn maths please.


Advantages of Monte Carlo methods
Monte Carlo methods require far fewer samples than traditional approaches. They can quickly analyse thousands of what-if scenarios when an exhaustive numerical evaluation is restrictively time consuming.

They have many applications including statistical physics, engineering, AI, finance and more.

Things to learn for the exam:

• What a uniform distribution is

• How to solve exercises such as:

You wish to perform an experiment of picking a coloured ball at random from a bag, noting the colour and putting it back. The bag has 100 balls, of which 45 are red, 25 are blue and 30 are yellow.

What is the probability distribution of this experiment? Write a rule that would allow you to simulate this experiment with a computer.
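One possible such rule, sketched in Python: draw a uniform number u from [0, 1) and map subintervals whose lengths equal the probabilities (0.45 red, 0.25 blue, 0.30 yellow) onto the colours:

    import random

    def draw_ball() -> str:
        u = random.random()       # uniform on [0, 1)
        if u < 0.45:
            return "red"          # P(red) = 45/100
        elif u < 0.70:
            return "blue"         # P(blue) = 25/100
        else:
            return "yellow"       # P(yellow) = 30/100

    # Drawing with replacement: repeated draws approximate the distribution.
    print([draw_ball() for _ in range(10)])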


1.6 Game theory

Game theory is the study of strategic decision making when multiple decision-makers are involved.

Definitions:

• Payoff (final outcome) - the benefit for a player resulting from their actions. It could be a negative number, meaning a benefit for the other players

• Strategy - set of actions taken by a player

• Payoff table (matrix) - collection of possible payoffs for all players depending on the different strategies

Zero-sum game
A zero-sum game contains only two players. Whatever one player wins the other loses, thus the sum of their winnings is zero.

Payoff tables
Example payoff table for rock, paper & scissors, assuming −1 is a loss for Player 1, 0 is a draw and 1 is a win for Player 1.

    Player 1 \ Player 2   Rock   Paper   Scissors
    Rock                    0     −1        1
    Paper                   1      0       −1
    Scissors               −1      1        0

Dominated strategies
Strategy A is dominated by strategy B if B provides a better or equal payoff than A in every situation. If we consider rational players we can generally discard all dominated strategies as they are just bad. For example:

    Player 1 \ Player 2   Strategy 1   Strategy 2
    Strategy 1                 2            1
    Strategy 2                 0            1

In this case we can see that for Player 1, Strategy 1 dominates Strategy 2 and it's never reasonable to try Strategy 2. As such, a rational Player 2 will dismiss Player 1's Strategy 2 from their considerations.

In simple cases like those above we can find the best strategies simply by discarding dominated strategies.


Minimax criterion
Player 1 aims to maximize the minimum payoff, whereas Player 2 aims to minimize the maximum payoff to Player 1.

If the maximin strategy for Player 1 is the same as the minimax strategy for Player 2, then such a combination of strategies is called a saddle point. For example:

    Player 1 \ Player 2   Strategy 1   Strategy 2   Strategy 3
    Strategy 1                −3           −2           6
    Strategy 2                 2            0           2
    Strategy 3                 5           −2          −4

In this case no strategy dominates any other, so we can't discard any. Both players are rational, so they will seek to protect themselves from large payoffs to the opponent. Players 1 and 2 ought to attempt to minimize the maximum possible loss, and as such they will both go for their respective Strategy 2.

The end product of such reasoning is that each player should play in such a way as to minimize their maximum losses, whenever the resulting strategy cannot be exploited by the opponent to improve their own position.

This so-called minimax criterion is a standard criterion proposed by game theory for selecting a strategy. Player 1 aims to maximize the minimum payoff and Player 2 aims to minimize the maximum payoff to Player 1.
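A minimal sketch of checking the example above for a saddle point in Python (payoffs are to Player 1; Player 1 picks rows, Player 2 picks columns):

    def saddle_point(payoff):
        # Player 1: maximin over rows; Player 2: minimax over columns.
        maximin = max(min(row) for row in payoff)
        minimax = min(max(col) for col in zip(*payoff))
        return (maximin, minimax) if maximin == minimax else None

    payoff = [[-3, -2,  6],
              [ 2,  0,  2],
              [ 5, -2, -4]]
    print(saddle_point(payoff))   # (0, 0): both players choose their Strategy 2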

Unstable solutions and mixed strategies
If a saddle point does not exist there is no stable solution. When this happens, no player is able to deduce which strategy the other will use, and hence they are unable to present a perfect response.

In these cases, the most rational move is to play irrationally, i.e. randomly. Game theory advises players to assign probability distributions over their sets of strategies:

    x_i = probability that Player 1 will use strategy i   (i = 1, 2, ..., m)
    y_j = probability that Player 2 will use strategy j   (j = 1, 2, ..., n)

Thus, Player 1 would specify her plan for playing the game by assigning values to x_1, x_2, ..., x_m. Because the values are probabilities they must be nonnegative and add up to 1.

These plans (x_1, x_2, ..., x_m) and (y_1, y_2, ..., y_n) are usually referred to as mixed strategies, as opposed to the original strategies, which are then called pure strategies.

These probability distributions may be generated by playing the game repeatedly with different mixed strategies.

The goal of Player 1 is to choose a probability distribution such that the expected payoff of Player 2 is the same regardless of the strategy she chooses. This way, Player 1 aims to make Player 2 indifferent to whatever strategy she may choose. Making the opponent indifferent to your strategy is equivalent to minimizing the opponent's ability to recognize and exploit systematic patterns of behaviour in your own choice.


Nash equilibrium
John Forbes Nash suggested an approach different from classical game theory. In his approach, we have a non zero-sum game where each player tries to do the best both for himself and for the others.

A classical non zero-sum game example - the Prisoner's dilemma:

    Prisoner 1 \ Prisoner 2   Silent    Betray
    Silent                    −1, −1    −3,  0
    Betray                     0, −3    −2, −2

In this case the Silent strategy is dominated by the Betray strategy for both players. As such, the Nash equilibrium is the Betray strategy for both.

Things to learn for the exam:

• How to solve games by dominated strategies

• How to solve games by minimax criterion

• What are stable and unstable solutions

• How to find Nash equilibria in a game


2 Rachel Cavill - Computer science

Course books:

• Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 2010, third edition

2.1 What is AI?

Suggested readings:

• Chapter 1 of Russell and Norvig

What is AI?
Artificial Intelligence is a branch of computer science dealing with the simulation of intelligent behavior in computers; the capability of a machine to imitate intelligent human behavior.

Turing test
The Turing test is a classic example of a test for whether a machine is intelligent. The test consists of a conversation between a judge, a human and a computer. If the judge cannot accurately predict which conversation partner is human, the machine is considered to pass the test.

Problems with the Turing test

• Can improve chance of passing through psychological strategies

• Doesn’t test ‘intelligence’, just responses – Chinese room problem

• Humans can ‘fail’ to convince the judges they are not computers

Advantages of the Turing test

• Objective notion of Intelligence

• Avoids discussion of internal processes and consciousness

• Eliminates bias in favor of living organism

Objections to the Turing test

• Bias toward purely symbolic problem-solving tasks

• Constrain machine intelligence to fit human mold

– Limited memory

– Error prone

• It is a distraction


Views of AI
Views of AI fall into four categories:

• Thinking humanly - An AI in this sense is able to think like a human. To create such an entity we must understand the rules underlying actual human thought, i.e. the activity of the neurons. Cognitive science and cognitive neuroscience are the branches of science concerned with this type of research.

• Thinking rationally - An AI of this kind is able to reason purely based on rational arguments and thought processes. This branch of AI could be said to have been started by the ancient Greeks, e.g. Aristotle. The logicist tradition attempts to create intelligent systems using logic programming.

• Acting humanly - An AI of this sort acts like a human would act in a given situation. This is the kind of agent the Turing test would detect.

• Acting rationally - Such an AI does the right thing, i.e. it maximizes goal achievement given the available information. This is the kind of AI we are generally most interested in.

History of AI

• 1950: Alan Turing's "Computing Machinery and Intelligence" - the first complete vision of AI

• 1956: The birth of AI.

– Dartmouth Workshop bringing together top minds on automata theory, neural nets and the study of intelligence.

– The term "Artificial Intelligence" adopted.

• 1952-1969: Early enthusiasm, great expectations.

– Newell and Simon introduced the General Problem Solver.

– Arthur Samuel wrote a checkers program that learnt to play the game at an amateur level.

– John McCarthy: invented Lisp.

• 1966-1973: A dose of reality.

– Progress was slower than expected, unrealistic predictions.

– Some systems lacked scalability, combinatorial explosion in search.

– Fundamental limitations on techniques and representations.

• 1969-1979: Knowledge-based systems.

– Expert systems: MYCIN to diagnose blood infections. Introduction of uncertainty in reasoning.

– Increase in knowledge representation research: logic, frames, semantic networks.

• 1980-1988: AI becomes an industry.

– The AI industry boomed from a few million dollars in 1980 to billions in 1988.


– The return of neural networks: back-propagation algorithm, parallel distributed processing, deep learning.

– AI becomes a science.

• 1995-present: The emergence of intelligent agents.

– Deep Blue defeated the reigning world chess champion Garry Kasparov in 1997.

– Virtual environment (e.g. Internet bots).

– The whole agent problem: "How does an agent act/behave embedded in real environments with continuous sensory inputs?"

– Human-level AI back on the agenda (since 2003).

– Artificial General Intelligence (AGI).

• 2001-: Very Large Data Sets Available.

– Big Data: Very large data sources (e.g., web).

– Learning methods vs hand-coded knowledge engineering.

– Solve knowledge bottleneck in AI for many applications.

State of the art AI uses today

• 2016: AlphaGo defeated Lee Sedol. Developed by Google DeepMind

• Autonomous self-driving cars

• Logistics planning

• Planning and scheduling

• E-mail spam fighting

• Vacuum cleaning

• Quiz show contestant. IBM Watson, 2011

Limits of AI

– Each program is good in its own domain, but it can’t do all tasks

– Machine Translation: translating to different languages distorts the original meaning of the sentence

– Acting as a judge: no clear definition of morality, truth and justice

– Beating humans in soccer: too many variables and external factors to track

– General Game Playing: can only adapt to certain pre-defined rules. Does not know how to adapt quickly to completely different domains of a game

– Converse successfully with another person for an hour: after some time, it is not very difficult to ask certain questions that expose the computer


Things to learn for the exam:

• Views of AI

• Turing test

• What is an agent?


2.2 Agents

Suggested readings:

• Chapter 2 of Russell and Norvig, except 2.4.7

What is an agent?

• An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. It operates autonomously.

• Agents include humans, robots, softbots, thermostats, etc.

• Abstractly, an agent is a function from perception history to an action.

Examples of agents

• Human agent: sensors - eyes, ears, and other organs; actuators - hands, legs, mouth, and other body parts.

• Robotic agent: sensors - cameras and infrared range finders; actuators - various motors.

Rational agent

• An agent should strive to "do the right thing", based on what it can perceive and the actions it can perform.

• The right action is the one that will cause the agent to be most successful.

• Performance measure - an objective criterion for success of an agent's behavior.

• Rational agent: for each possible percept sequence, a rational agent should select an action that is expected to maximise its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.

• An agent is autonomous if its behavior is determined by its own experience (with the ability to learn and adapt).

PEAS

• PEAS: Performance measure, Environment, Actuators, Sensors.

Examples of PEAS systems:

• Agent: Automated taxi driver:

– Performance measure: Safe, fast, legal, comfortable trip, maximize profits.

– Environment: Roads, other traffic, pedestrians, customers.


– Actuators: Steering wheel, accelerator, brake, signal, horn.

– Sensors: Cameras, sonar, speedometer, GPS, odometer, engine sensors, keyboard.

• Agent: Medical diagnosis system:

– Performance measure: Healthy patient, minimize costs, lawsuits.

– Environment: Patient, hospital, staff.

– Actuators: Screen display (questions, tests, diagnoses, treatments, referrals).

– Sensors: Keyboard (entry of symptoms, findings, patient’s answers).

• Agent: Part-picking robot:

– Performance measure: Percentage of parts in correct bins.

– Environment: Conveyor belt with parts, bins.

– Actuators: Jointed arm and hand.

– Sensors: Camera, joint angle sensors.

Environment types

• Fully observable (vs. partially observable): An agent's sensors give it access to the complete state of the environment at each point in time.

• Deterministic (vs. stochastic): The next state of the environment is completely determined by the current state and the action executed by the agent.

• Episodic (vs. sequential): The agent's experience is divided into atomic "episodes" (each episode consists of the agent perceiving and then performing a single action), and the choice of action in each episode depends only on the episode itself.

• Static (vs. dynamic): The environment is unchanged while an agent is deliberating.

– The environment is semidynamic if the environment itself does not change with the passage of time but the agent's performance score does.

• Discrete (vs. continuous): A limited number of distinct, clearly defined percepts and actions.

• Single agent (vs. multiagent): An agent operating by itself in an environment.

• Known (vs. unknown): The state of knowledge of the laws of the environment.

The real world is (of course) partially observable, stochastic, sequential, dynamic, continuous, multi-agent and, as of yet, unknown.


The structure of agents

• An agent is completely specified by the agent function mapping percept sequences to actions

• One agent function (or a small equivalence class) is rational

• Aim: find a way to implement the rational agent function concisely

Agent types

• Four basic types in order of increasing generality:

– Simple reflex agents

– Model-based reflex agents

– Goal-based agents

– Utility-based agents

• All can be turned into learning agents

Simple reflex agent

• Selects actions on the basis of only the current percept (e.g. the vacuum agent)

• Large reduction in possible percept/action situations

• Implemented through condition-action (if-then) rules (e.g. if dirty then suck)

• Will only work if the environment is fully observable, otherwise infinite loops may occur, although randomization may help

Model-based reflex agent

• Maintains an internal state to tackle partially observable environments

• Over time it updates the internal state using world knowledge

• The model of the world includes information like:

– How does the world change

– How do actions affect the world

Goal-based agents

• A goal-based agent follows its goal to know which situations are desirable

– More tricky when long sequences of actions are required to find the goal

• Typically investigated in search and planning research

• Major difference: future is taken into account

• Is more flexible since goals are represented explicitly and can be changed


Utility-based agent

• A utility-based agent attempts to maximize a utility function.

• This stems from the idea that certain goals can be reached in different ways. Some ways are better and thus have a higher utility.

• Utility function maps a (sequence of) state(s) onto a real number

• Improves on goals:

– Selecting between conflicting goals

– Select appropriately between several goals based on the likelihood of success

Things to learn for the exam:

• Agents and environments

• PEAS

• Environment types/properties

• Agent types


2.3 Search

Suggested readings:

• Chapter 3 of Russell and Norvig, up until 3.4.4; also 5.1 and 5.2

Search
Search is the process of finding the optimal solution to a problem among all possible solutions.

Trees
Trees are a method of simplifying problems into a limited number of distinct choices.

1. Root: node without parent.

2. Descendant of a node: child, grandchild etc.

3. Internal node: node with at least one child.

4. Leaf node: node without children.

5. Ancestor of a node: parent, grandparent etc.

6. Subtree: tree consisting of a node and its descendants.

Search methods
We can split up search methods into breadth first search and depth first search:

1. Breadth first search: search through every child of every node we visit. If there are multiple exits, this strategy will give us the shortest path by the number of decisions.

2. Depth first search: go as deep as we can with each path until we reach a leaf, then backtrack up the tree. Depth first makes sense if we are an agent going through a maze, so that we don't need to zigzag between nodes too much.

The order of the nodes is important in both strategies. We should keep the most likely nodes early in the search.
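A minimal sketch of both strategies in Python on a tree stored as an adjacency dict (the tree and the node names are made up for illustration):

    from collections import deque

    tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
            "D": [], "E": [], "F": []}

    def bfs(root):
        # Visit nodes level by level using a FIFO queue.
        queue, order = deque([root]), []
        while queue:
            node = queue.popleft()
            order.append(node)
            queue.extend(tree[node])
        return order

    def dfs(root):
        # Go as deep as possible first, using a LIFO stack, then backtrack.
        stack, order = [root], []
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(tree[node]))   # keep left-to-right order
        return order

    print(bfs("A"))   # ['A', 'B', 'C', 'D', 'E', 'F']
    print(dfs("A"))   # ['A', 'B', 'D', 'E', 'C', 'F']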

Complex systems
In complex systems, like in games, we cannot reasonably search through all nodes. We must include heuristics:

1. Search depth d: Number of state transitions from the root of the search to the current state position.

2. Branching factor b: Average number of successor moves.

Adversarial search
In systems with multiple adversarial actors, we can describe agents as such:

1. Player 1 as MAX: their goal is to maximize utility.

2. Player 2 as MIN: their goal is to minimize utility.


MiniMax search
MiniMax search is a strategy of minimizing the maximum loss, i.e. searching for the best worst-case scenario in case of perfectly optimal moves by the opponent.
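A minimal sketch of minimax in Python on a game tree given as nested lists, where leaves carry the utility for the MAX player (the toy tree is made up for illustration):

    def minimax(node, maximizing):
        # Leaves are plain numbers: the utility for the MAX player.
        if isinstance(node, (int, float)):
            return node
        values = [minimax(child, not maximizing) for child in node]
        return max(values) if maximizing else min(values)

    # MAX chooses between two MIN nodes worth min(3, 5) = 3 and min(2, 9) = 2.
    tree = [[3, 5], [2, 9]]
    print(minimax(tree, maximizing=True))   # 3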

Principal variation
The principal variation is the path from the root to a leaf node under optimal (i.e. rational) play by each player. This is the optimal action sequence. It is also known as the main line, or optimal path.

Pruning
Some nodes can be proven to be irrelevant in the search. If we order the nodes in the right order and perform depth-first search, we will ignore them. The better our node ordering, the more nodes we will prune, and the faster our search will become.

Heuristic search
Heuristic search is a method of going through a simplified decision tree. It evaluates leaves' possible values based on a heuristic function that returns roughly how well the game is going at a given point.

Things to learn for the exam:

• Trees

• Depth first search

• Breadth first search

• Minimax search


2.4 Logic

Suggested readings:

• 7.3, 7.4, 7.5 (only the intro), skipping figure 7.10, from Russell and Norvig

Prolog

• A programming language with built-in search

• We don't need to program the computer to build and search the tree - this is built into the interpreter

• We just need to define the problem in a way which it understands

Inferences are steps in reasoning, moving from premises to logical consequences.

We can apply inference to a knowledge base to derive new information and make decisions. Prolog is a programming language designed to help us out with this.

Facts and rules of Prolog

• Prolog works by declaring the facts and the rules of the problem. This is the knowledge base - a technology used to store complex structured and unstructured information used by a computer system

• Prolog then uses a built-in search algorithm to try to find whether a statement you ask is true or false, and if true, how it is true. This is the inference engine; its job is to extend the knowledge base automatically

Propositional logic
Propositional logic includes the logical operations from discrete mathematics, i.e.:

    p ∧ q      p and q
    p ∨ q      p or q
    p ⇒ q      if p then q
    p ⇐⇒ q     p if and only if q
    ¬p         not p


Entailment
Entailment means that one thing follows from another:

• KB |= α

– A knowledge base KB entails a sentence α if and only if, in all worlds where the KB is true, α is true

– e.g., the KB containing "the Giants won" and "the Reds won" entails "the Giants won or the Reds won"

– e.g., x + y = 4 entails 4 = x + y

A statement A entails a statement B if B holds in all models in which A holds, e.g.:

• A ∧ B |= A ⇐⇒ B

• A ⇐⇒ B |= ¬A ∨ B

Validity & Satisfiability

A sentence is valid if it is true in all models,e.g., True, A ∨ ¬A, A⇒ A, (A ∧ (A⇒ B))⇒ B

A sentence is satisfiable if it is true in some model,e.g., A ∨B, C

A sentence is unsatisfiable if it is true in no models,e.g., A ∧ ¬A
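For small numbers of variables these notions can be checked by brute force over all models; a minimal sketch in Python, encoding sentences as functions of the propositional variables:

    from itertools import product

    def truth_values(sentence, n_vars):
        # Evaluate the sentence in every model (truth assignment).
        return [sentence(*model)
                for model in product([False, True], repeat=n_vars)]

    def valid(sentence, n_vars):
        return all(truth_values(sentence, n_vars))

    def satisfiable(sentence, n_vars):
        return any(truth_values(sentence, n_vars))

    print(valid(lambda a: a or not a, 1))          # True: A ∨ ¬A is valid
    print(satisfiable(lambda a: a and not a, 1))   # False: A ∧ ¬A is unsatisfiable
    # (A ∧ (A ⇒ B)) ⇒ B, with p ⇒ q encoded as (not p or q):
    print(valid(lambda a, b: not (a and (not a or b)) or b, 2))   # True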

Things to learn for the exam:

• Models and entailment

• Propositional (Boolean) logic

• Equivalence, validity, satisfiability


2.5 Learning agents

Suggested readings:

• 2.4.6 & 18.3 from Russell and Norvig

Learning agents
Learning agents learn methods they can use on their own. Their behavior is not predefined - it depends on what they learn. Learning is based on giving feedback to the agent's choices. Types of learning feedback:

• Supervised learning - learning is based on pre-defined labels attached to dataset elements.

• Unsupervised learning - learning is based on automatic group detection on a dataset without labels (i.e. clustering).

• Reinforcement learning - learning is based on occasional rewards

Entropy
Entropy is a measure of the messiness of data, e.g. if most rows in a table are the same its entropy is low, whereas if all rows are different then the entropy is high.

If we wish to go for a fancy definition, entropy is the probability-weighted sum of the self-information of the events from a given distribution. This could be the probability of an outcome, or the probability of a specific object belonging to a class. Due to very cool mathematical magic, the self-information of an event can be calculated as log(1/P) where P is the probability of it happening. Entropy is thus

    Σ_{i=1}^{n} P_i · log(1/P_i)

which can be rewritten as

    Σ_{i=1}^{n} −P_i · log(P_i)

In theory the base of the logarithm can be arbitrary, and it corresponds to the unit of entropy we get in the end. For instance, base 2 corresponds to a bit. Keep in mind, however, that if the logarithm base we use is the same as the number of different classes then the resulting value will have the handy property of being ≤ 1.

Information gain
Information gain is the expected reduction in entropy caused by partitioning the instances from a learning set S according to a given attribute. Since entropy is a measure of messiness, if each of the categories is more uniform after partitioning the instances into categories, then the total entropy decreases.

You can measure how well a decision tree is doing by calculating the total entropy of the final decision classes.
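A minimal sketch of both quantities in Python for a labelled dataset, using log base 2 so the unit is bits:

    import math
    from collections import Counter

    def entropy(labels):
        # H = Σ −P_i · log2(P_i) over the class probabilities.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(labels, partitions):
        # Expected entropy reduction after splitting labels into partitions.
        total = len(labels)
        remainder = sum(len(p) / total * entropy(p) for p in partitions)
        return entropy(labels) - remainder

    labels = ["yes", "yes", "no", "no"]
    print(entropy(labels))                                           # 1.0 bit
    # A perfect split on some attribute separates the classes completely:
    print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0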


ID3 tree building algorithm
The ID3 algorithm is used to create decision trees based on selecting, at each step, the attribute that decreases entropy the most. Informally:

1. Determine the attribute with the highest possible information gain on the training set.

2. Use this attribute as a root, and create branches for each of the values this attribute can have. These could either be discrete values or specific split-points of a continuous value.

3. For each branch, repeat the process with the subset of the training set that is classified by that branch.

Things to learn for the exam:

• Learning

• Entropy & information gain

• Building decision trees


2.6 Information retrieval

Suggested readings:

• Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, http://www-nlp.stanford.edu/IR-book

• Chapter 1 of Russell and Norvig

Information retrieval
Information retrieval is finding materials (e.g. documents) of an unstructured nature (e.g. text) that satisfy an information need from within a large unstructured data set. Common examples include:

• web search

• email search

• searching through files on a laptop

• corporate knowledge bases

Unstructured and semi-structured data
The term unstructured data generally refers to free text, in opposition to the term structured data, which means tables.

In fact, however, almost no data is really unstructured. You can generally easily find patterns in semi-structured data like books, which have titles, chapters and contents. In this example you could put more weight on the titles than the footnotes, for instance.

Queries on unstructured data are generally more vague than on structured datasets. Google searches make a great example - they make use of a scoring system called PageRank to assign a score to possible results, but Google still can't be sure they're right.

Basic model of Information Retrieval
In information retrieval we have a set of documents, a collection, and our goal is to retrieve those docs that are relevant to the user's information need and that help the user complete a task.

We can measure an information retrieval model’s usefulness by its:

• precision - fraction of retrieved docs that are relevant to the information need

• recall - fraction of relevant docs in the entire collection that are retrieved


Term-document incidence matrix
There are multiple ways to approach information retrieval. In small data scenarios, we could simply make a matrix of all Keywords × Documents and mark 1 whenever a keyword is present in the corresponding document. This is called a term-document matrix. Once we set it up we could run simple boolean operations on it, e.g. find all documents with cake and eat in them.

Inverted index
When our collections grow bigger a simple term-document matrix becomes unwieldy and we need to get smart. The matrix will be very sparse, since most words don't occur in most documents. In these situations we may use inverted indices: instead of keeping track of whether each word is in each and every single document, we just keep track of which documents each word is in. Such a list of document ids is called a postings list.
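A minimal sketch of building an inverted index in Python (the toy documents are made up for illustration):

    def build_inverted_index(docs):
        # Map each term to a sorted postings list of document ids.
        index = {}
        for doc_id, text in sorted(docs.items()):
            for term in set(text.lower().split()):
                index.setdefault(term, []).append(doc_id)
        return index

    docs = {1: "Caesar was ambitious",
            2: "Brutus killed Caesar",
            3: "Brutus was honourable"}
    index = build_inverted_index(docs)
    print(index["brutus"])   # [2, 3]
    print(index["caesar"])   # [1, 2]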

Basics of text processing
Before we can create a coherent set of words we must first clean the text up, i.e.:

1. Tokenize it - cut character sequences into tokens

2. Normalize it - map text and query to the same form, e.g. simplify words - U.S.A. and USA should match

3. Stem it - you may wish words with the same root to match, e.g. authorize and authorization

4. Skip common words - you may omit very common words, like to or a

Boolean retrieval model
When we use a Boolean retrieval model we simply ask it queries that are boolean expressions, e.g. find all documents where the words A and B occur but C does not. Such a model is very precise and simple, but not very versatile.

This used to be the primary retrieval tool for 3 decades. An example of this is Westlaw, the largest commercial legal search service.


Answering queries by postings lists
Postings lists let us answer boolean queries swiftly. Given such ordered lists of keywords and the document ids where they can be found

    Brutus → 2, 4, 8, 16, 32, 64, 128
    Caesar → 1, 2, 3, 8, 13, 21, 34

we can quickly query to find all documents that have both Brutus and Caesar by using a merge algorithm, here written as a runnable Python function:

    def intersect(p1, p2):
        # Merge two sorted postings lists in a single linear pass.
        answer = []
        i, j = 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 8, 13, 21, 34]))  # [2, 8]

10: return answer

Running this algorithm on the above postings lists will give us 2 and 8 efficiently.

Query optimisation
Running many boolean queries naïvely can be slow. Thankfully, some queries, like and, allow us to speed up our operations by skipping things that we know will be false, e.g. if we have p ∧ q ∧ r, only a single conjunct needs to be false for the entire sentence to be false.

In practice, if we have a long query, the sooner we discard false statements the quicker we get our results. We can follow this train of thought and keep track of the frequencies of words, e.g. a dictionary of how often each word occurs in our corpus. If we then perform boolean operations on the pairs of words from our queries that occur the least, we will be able to return the results without needlessly running expensive operations on the other keywords.

Things to learn for the exam:

• What is information retrieval?

• Structured & unstructured data

• Term-document incidence matrices

• Inverted index

• Using postings lists to answer queries

• Query optimisation

Success!
