OVERVIEW OF THE CLASS

Includes Review of Syllabus

What is this class about?

This class will introduce data mining

The types of problems that can be addressed

The methods that can be used

Focus will be balanced between learning how to use the methods and understanding how they work

A significant class project is required

This class is a key component of the new MS in Data Analytics (MSDA)

The spring course “Machine Learning” will cover some methods not covered in this course or covered only superficially

Textbooks

For many years the “Intro to Data Mining” book by Tan, Steinbach, and Kumar was used

One of the commonly used DM textbooks for CS

Not always very clear, but other books are not really any better

“Data Science for Business” is much clearer and well written, so it is also being used

Provides much more on applications of data mining

Provides surprising technical depth, so eventually it may be the sole book, supplemented with other materials

Currently in transition. Some readings may be a bit redundant, but it is good to get two perspectives.

More on the Use of 2 Textbooks

Initially I was put off by the use of “Business” in the Data Science title since this is a CS course

But book is still relatively technical and covers some details of the algorithms

Also it is critical for a data analyst/scientist to be able to understand how and why to use certain methods and to be able to articulate this.

Intro to Data Mining does a poor job at this

The Class Website & Syllabus

The class website is at: http://storm.cis.fordham.edu/~gweiss/classes/cisc6930/

Includes the syllabus and is linked to the class schedule

The class schedule is under active development since this course is being revised from the last time it was offered.

Will keep you up to date about what is current and what may still be modified

Data Mining

Lecture 01: Introduction to Data Mining

Much of the material in this presentation is not from either textbook

Let’s Start By Seeing What You Know

Quick Quiz

Do you know what Data Mining is?

Do you know of any examples of Data Mining?

What is Data Mining?

Data Mining has many definitions

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Alternative Names

Data Mining was/is known by these other names (although many of these have lost favor over time):

Knowledge discovery in databases (KDD)

Knowledge extraction

Data/pattern analysis

Data archeology, data dredging, information harvesting, business intelligence, etc.

Recently introduced new names (maybe with different emphases):

Data Science

Big Data

What is Big Data?

Technically, big data means data so large that conventional data mining methods cannot be applied in the normal manner

Using this definition, in most cases a data set with 10,000,000 cases is not big data

But the term is overused and is not always interpreted this way

Big data technologies include Hadoop

Big data technologies are used for implementing data mining methods

We offer a course in “Big Data Programming”

Some Examples

Netflix and Amazon use data mining to recommend products (recommender systems)

Companies use data mining for marketing

Who should be mailed a catalog

Who should see what online ads (Google AdWords)

Online advertising: large impact

Financial companies use data mining for credit scoring and fraud detection

Customer Churn: who will leave

Fordham’s WISDM project uses smartphone/watch accelerometer data to classify user activities and perform biometric identification

Some search engines cluster retrieved documents into meaningful groups

Group pages about Jaguar into “car” pages and “cat” pages

Interesting specific example

Wal-Mart used data mining to find out what is needed when a hurricane is coming

Strawberry Pop-Tarts sales increase 7X ahead of a hurricane, and the top-selling pre-hurricane item is beer. (Data Science for Business, page 3)

A Significant Example

Signet Bank was convinced that modeling profitability, not just default probability, was the way to go

But they did not have the proper data

Constrained by having data only for strategies they already used

Decided to purposefully offer loans in new cases (explore new strategies)

Initially poor results, but eventually learned from the data and got it right

Became one of the most successful credit card issuers: Capital One

Why Data Mining and Why Now?

Data Mining was not very popular until about 10 – 15 years ago

Quick Quiz: What do you think changed?

Why Mine Data?

There are now tremendous amounts of data that are automatically collected and warehoused. What are some examples?

Web data, e-commerce

Store purchases

Bank/Credit Card transactions

Cell phone GPS information

Smartphone and Smartwatch Sensor Data

Why Mine Data?

What technological changes have helped make data mining so prevalent now?

Computers: cheaper and more powerful

Smaller mobile devices are exploding in popularity

Disk and other storage: greater capacity and cheaper

Increased use of on-line resources and Internet

We shouldn’t discount the advances in algorithms but most data mining algorithms are relatively mature

In business, competitive pressure is strong

Why Mine Data?

Often info “hidden” in data is not evident

Analysts may take weeks to discover useful information

Much of the data is never analyzed at all

There is just too much data to analyze without “assistance”

Scientific Need

Data collected at enormous speeds

remote sensors on satellites

telescopes scanning the skies

microarrays generating gene expression data

scientific simulations

Traditional techniques infeasible

How Big is the Data?

Examples of Large Data Sets

AT&T’s 26TB call detail database (2003)

eBay 6PB, IRS 150TB data warehouse

Yahoo has a 2PB DB to analyze behavior of ½ billion web visitors/month (24 billion events/day)

Wal-Mart has a 583 TB database (2006)

Indexed web contains about 20 Billion pages

Sites like Facebook, Flickr & Twitter contain lots of data

Google is estimated (in 2011) to have 900,000 servers to handle its data!

How Much Data is Being Created?

5 Exabytes of new data created (2002, UC Berkeley)

Humans created/copied 161/281 Exabytes in 06/07 (IDC)

1 Exabyte = 10^18 bytes

12 stacks of books stretching from Earth to Sun

3 million times the books ever written

Not all data stored at once (includes temporary data)

In 2012, 2.8 ZB (2800 EB) of data will be created/copied

Forecast for 2020: 40 ZB (57X the number of grains of sand on Earth)
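
The unit conversions on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, so 1 ZB = 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, in bytes.
EB = 10**18  # exabyte
ZB = 10**21  # zettabyte

# Figures quoted on the slide (integer arithmetic avoids rounding noise).
created_2012 = 28 * ZB // 10   # 2.8 ZB
forecast_2020 = 40 * ZB        # 40 ZB

print(created_2012 // EB)            # 2.8 ZB expressed in EB
print(forecast_2020 / created_2012)  # growth factor from 2012 to 2020
```

The second figure shows that the 2020 forecast is roughly fourteen times the 2012 volume.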

OK, we get the point already! Head hurts.

Why Data Mining? Why Now?

According to BabyCenter.com, today one in three children born in the United States already have an online presence (usually in the form of a sonogram) before they are born. That number grows to 92% by the time they are two. In 2012 the average digital birth of children occurs at approximately six months, with a third of all children’s photos and information posted online within weeks of their birth. What will it mean to live in a world where our every moment, from birth to death, is digitally chronicled and preserved in vast cloud based databases, forever?

During the first day of a baby’s life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress.

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems*

Traditional techniques may be unsuitable due to

Enormity of data

High dimensionality

Heterogeneous & distributed data

* databases currently have limited impact; data mining is rarely done in a database but rather on “flat files”

Figure: Data Mining draws on Artificial Intelligence, Machine Learning, Pattern Recognition, Statistics, and Database systems

Statistics vs. Data Mining

Students familiar with statistics are often confused if the differences aren’t highlighted

When compared to Data Mining:

Statistics is more theory-based

Data mining methods are based on heuristic algorithms

Statistics is based firmly on mathematics (e.g., probability)

Statistics is more focused on testing hypotheses vs. finding interesting relationships

Statistics makes more assumptions about the data

The Process of Data Mining

Data Mining is a process, formerly referred to as a knowledge discovery process. In this process there is a data mining step that applies data mining algorithms to extract knowledge. About 80% of our class will focus on the data mining step but in the real world 80% of the time is spent on the other steps (e.g., prepping data). The process below was articulated by Fayyad in a seminal paper on Data Mining and KDD. There should be a loop since the process is iterative.

CRISP Data Mining Process

Second Part of Introduction:

Data Mining Tasks

Top-Level Data Mining Tasks

At highest level, data mining tasks can be divided into:

Prediction Tasks (supervised learning)

Use some variables to predict unknown or future values of other variables

Description Tasks (unsupervised learning)

Find human-interpretable patterns that describe the data

Key Data Mining Tasks

Overview of the major data mining tasks studied in this course:

Prediction Tasks

Classification (and class probability estimation)

Regression

Description Tasks

Clustering

Association Rule Discovery

Also known as “co-occurrence grouping” or “association rule mining”

Data Mining Tasks Continued

“Data Science for Business” includes several more. These are generally not as basic and may be more application oriented

Additional data mining tasks

Similarity matching: More of a technique and is always used in clustering and sometimes in classification. Can be of interest in its own right.

Profiling, Link Prediction, Data Reduction, and Causal Modeling

We will certainly cover similarity matching and may cover some of the others
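
Similarity matching usually starts from a distance between records. The following is a minimal sketch, assuming each customer is described by two numeric attributes that are already on comparable scales; the customer names and values are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: (visits per month, average basket in tens of dollars).
customers = {
    "ann": (2.0, 3.0),
    "bob": (2.5, 3.5),
    "cat": (9.0, 1.0),
}

def most_similar(name):
    """Return the other customer closest to `name`."""
    target = customers[name]
    others = (n for n in customers if n != name)
    return min(others, key=lambda n: euclidean(target, customers[n]))

print(most_similar("ann"))
```

In practice the attributes would need to be scaled first, since an attribute with a large numeric range would otherwise dominate the distance.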

Classification: Definition

Given a collection of records (training set)

Each record contains a set of attributes; one of the attributes is the class, which is to be predicted.

Find a model for the class attribute as a function of the values of the other attributes.

Model maps a record to a class value

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model

Class Probability Estimation: estimate the probability that an object belongs to a class

Can you think of classification tasks?

Classification Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Figure: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set
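
To make the train-then-predict idea concrete, here is a minimal sketch that classifies the test records with a k-nearest-neighbor rule. This is one simple technique chosen for brevity, not necessarily the one intended by the slide. The distance function (one unit per mismatched categorical value plus the income gap scaled by the training range) is an assumption made for illustration. The same neighbors also give a rough class probability estimate.

```python
# Training set from the slide: (refund, marital status, income in $K, cheat).
TRAIN = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

# Income range in the training data, used to scale income differences.
INCOMES = [row[2] for row in TRAIN]
INCOME_RANGE = max(INCOMES) - min(INCOMES)

def distance(a, b):
    """Assumed distance: 1 per mismatched categorical value plus scaled income gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / INCOME_RANGE

def nearest(record, k):
    """Return the k training rows closest to `record`."""
    return sorted(TRAIN, key=lambda row: distance(record, row))[:k]

def classify(record, k=1):
    """Predict the majority Cheat label among the k nearest rows (use odd k)."""
    labels = [row[3] for row in nearest(record, k)]
    return max(set(labels), key=labels.count)

def prob_cheat(record, k=3):
    """Class probability estimate: fraction of the k neighbors labeled Yes."""
    labels = [row[3] for row in nearest(record, k)]
    return labels.count("Yes") / k

# Test set from the slide (true labels unknown).
TEST = [("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
        ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80)]
for rec in TEST:
    print(rec, classify(rec, k=3), round(prob_cheat(rec, k=3), 2))
```

Because the test labels are unknown here, accuracy cannot be measured on this test set; in practice some labeled records are held out for that purpose.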

Classification: Application 1

Direct Marketing

Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

Approach:

Use the data for a similar product introduced before.

We know which customers decided to buy and which didn’t. This {buy, don’t buy} decision forms the class attribute.

Collect various demographic, lifestyle, and company-interaction related information about all such customers.

Type of business, where they stay, how much they earn, etc.

Use this info as input attributes to learn a classifier model

Specific Example

KDD Cup is a competition associated with the top DM conference

The KDD CUP 1998 competition was about direct marketing for a charity. Lots of information is provided:

http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

Classification: Application 2

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions

Approach:

Use credit card transactions and info on account-holders as attributes

When and what the customer buys, how often the customer pays on time, etc.

Label past transactions as fraud or fair transactions. This forms the class attribute.

Learn a model for the class of the transactions.

Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3

Sky Survey Cataloging

Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).

3000 images with 23,040 x 23,040 pixels per image.

Approach:

Segment the image.

Measure image attributes (features) - 40 of them per object.

Model the class based on these features.

Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies

Class:

Stages of Formation (Early, Intermediate, Late)

Attributes:

Image features

Characteristics of light waves received, etc.

Data Size:

72 million stars, 20 million galaxies

Object Catalog: 9 GB

Image Database: 150 GB

Courtesy: http://aps.umn.edu

Regression

Predict a value of a given continuous (numerical) variable based on the values of other variables

Greatly studied in statistics

Examples:

Predicting sales amounts of new product based on advertising expenditure.

Predicting wind velocities as a function of temperature, humidity, air pressure, etc.

Time series prediction of stock market indices
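
As a small illustration of regression, the sketch below fits a straight line by ordinary least squares, one of the simplest regression methods. The advertising and sales numbers are made up for illustration and are not from the course materials.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: advertising spend (x) and sales (y), arbitrary units.
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(ad_spend, sales)
predicted = intercept + slope * 6  # predicted sales for an unseen spend of 6
print(intercept, slope, predicted)
```

The fitted line is the model; applying it to a new advertising spend yields a numeric prediction rather than a class label.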

Clustering

Given a set of data points find clusters so that

Data points in same cluster are similar

Data points in different clusters are dissimilar
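
One common way to find such clusters is k-means. The sketch below is a bare-bones version, assuming numeric points, Euclidean distance, and the first k points as initial centroids (a simplification made to keep the example deterministic). The six points are made up for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means; the first k points serve as the initial centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Points that share a label ended up in the same cluster. The result depends on the choice of k and on the distance measure.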

You try it on the Simpsons. How can we cluster these 5 “data points”?

What is a natural grouping among these objects?

Clustering is subjective

Possible groupings: School Employees, Simpson's Family, Males, Females

Clustering Application

Market Segmentation:

Goal: subdivide a market into distinct subsets of similar customers

Approach:

Collect different attributes of customers based on their geographical and lifestyle related information.

Find clusters of similar customers.

Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Association Rule Discovery

Given a set of records, each of which contains some number of items from a given collection

Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}
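
The strength of the two rules can be checked directly against the five transactions. The sketch below computes two standard measures, support and confidence, which this slide does not define; the function names are illustrative.

```python
# Transactions from the slide, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for basket in TRANSACTIONS.values() if items <= basket)
    return hits / len(TRANSACTIONS)

def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs | rhs), "confidence", confidence(lhs, rhs))
```

Milk appears in four of the five transactions and three of those also contain Coke, so {Milk} --> {Coke} holds in three out of four cases where it applies. {Diaper, Milk} --> {Beer} holds in two out of three.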

Association Rule Discovery Application

Marketing and Sales Promotion Applications

When items are purchased together, one can be used to drive sales of the other

Can help determine where to position store items

Supermarket shelf management

Some stores place bananas in the cereal aisle

Challenges of Data Mining

Scalability

Dimensionality

Complex and Heterogeneous Data

Data Quality

Streaming Data

Privacy Preservation

What is (and is not) Data Mining?

Based on the definitions of data mining, are these DM or not?

Finding a phone number in a directory

Not data mining (trivial?, DB query)

Grouping related documents returned by search engine

Is data mining (not trivial, clustering)

Identifying who has a disease based on symptoms

Is data mining (not trivial, classification)

Web search on keyword using search engine

May be data mining**

** More of an information retrieval task than data mining task. However, since Google does much more than keyword matching, there will be a data mining component. For example, Google mines the link structure of the Web to decide which pages are important (link mining is a type of data mining).

If you are Interested in Data Mining

Data sets

NYC open data (https://nycopendata.socrata.com/)

UCI Data Repository (http://archive.ics.uci.edu/ml/)

Visit kdnuggets, an online newsletter and more

http://www.kdnuggets.com

You can arrange to have the newsletter emailed to you

Also includes job openings

ACM SIGKDD is the professional organization associated with data mining

ACM Special Interest Group (SIG) on data mining

Can join SIGKDD for $22, or for $54 can also join ACM as a student member
