CSCE822 Data Mining and Warehousing

34
Lecture 1 Dr. Jianjun Hu http://mleg.cse.sc.edu/edu/csce82 2/ CSCE822 Data Mining and Warehousing University of South Carolina Department of Computer Science and Engineering

description

CSCE822 Data Mining and Warehousing. Lecture 1 Dr. Jianjun Hu http://mleg.cse.sc.edu/edu/csce822/. University of South Carolina Department of Computer Science and Engineering. CSCE822 Course Information. Meet time: TTH 2:00-3:15PM Swearingen 2A11 Textbooks with slides 5 Homework - PowerPoint PPT Presentation

Transcript of CSCE822 Data Mining and Warehousing

Page 1: CSCE822 Data Mining and Warehousing

Lecture 1Dr. Jianjun Hu

http://mleg.cse.sc.edu/edu/csce822/

CSCE822 Data Mining and Warehousing

University of South CarolinaDepartment of Computer Science and Engineering

Page 2: CSCE822 Data Mining and Warehousing

CSCE822 Course InformationMeet time: TTH 2:00-3:15PM Swearingen 2A11Textbooks with slides5 Homework

Use CSE turn-in system to submit your Homework (https://dropbox.cse.sc.edu)

Deadline policy1 Midterm Exam (conceptual understanding)1 Final Project (deliverable to your future

employer!)TeamworkImplementation project/research project

TA: No TA.

Page 3: CSCE822 Data Mining and Warehousing

About Your InstructorDr. Jianjun Hu ([email protected])

Office hours: TTH 3:30-4:30PM or Drop by any time

Office Phone#: 803-7777304Background:

Mechanical Engineering/CADMachine learning/Computational

intelligence/Genetic Algorithms/Genetic Programming (PhD)

Bioinformatics (Postdoc)Multi-disciplinary just as data mining

Page 4: CSCE822 Data Mining and Warehousing

Why You are Here?

Data?Mining

?

Page 5: CSCE822 Data Mining and Warehousing

The Social Layer in an Instrumented Interconnected World

2+ billion

people on the Web

by end 2011

30 billion RFID tags today

(1.3B in 2005)

4.6 billion camera phones

world wide

100s of millions of GPS

enabled devices

sold annually

76 million smart meters in 2009… 200M by 2014

12+ TBs of tweet data

every day

25+ TBs oflog data

every day

? T

Bs

ofda

ta e

very

day

Page 6: CSCE822 Data Mining and Warehousing

Retailers collect click-stream data from Web site interactions and loyalty card data This traditional POS information is used by retailer for shopping basket

analysis, inventory replenishment, +++But data is being provided to suppliers for customer buying analysis

Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized

Science is increasingly dominated by big science initiativesLarge-scale experiments generate over 15 PB of data a year and can’t be

stored within the data center; sent to laboratories

Financial services are seeing large and large volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading

Improved instrument and sensory technologyLarge Synoptic Survey Telescope’s GPixel camera generates 6PB+ of image

data per year or consider Oil and Gas industry

Bigger and Bigger Volumes of Data

Page 7: CSCE822 Data Mining and Warehousing

Applications for Big Data Analytics

Homeland Security

Finance Smarter Healthcare Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics Fraud and Risk

Log Analysis

Search Quality

Retail: Churn, NBO

Page 8: CSCE822 Data Mining and Warehousing

8

Most Requested Uses of Big Data Log Analytics & Storage

Smart Grid / Smarter Utilities

RFID Tracking & Analytics

Fraud / Risk Management & Modeling

360° View of the Customer

Warehouse Extension

Email / Call Center Transcript Analysis

Call Detail Record Analysis

+++

Page 9: CSCE822 Data Mining and Warehousing

The IBM Big Data Platform

InfoSphere BigInsights Hadoop-based low latency

analytics for variety and volume

IBM Netezza High Capacity Appliance

Queryable Archive Structured Data

IBM Netezza 1000BI+Ad Hoc

Analytics on Structured Data

IBM Smart Analytics System

Operational Analytics on Structured Data

IBM Informix TimeseriesTime-structured analytics

IBM InfoSphere Warehouse

Large volume structured data analytics

InfoSphere StreamsLow Latency Analytics for

streaming data

MPP Data Warehouse

Stream ComputingInformation Integration

Hadoop

InfoSphere Information ServerHigh volume data integration and

transformation

Page 10: CSCE822 Data Mining and Warehousing

Big Data Values

Page 11: CSCE822 Data Mining and Warehousing
Page 12: CSCE822 Data Mining and Warehousing

What This course can do for You?They expect you are a DM insider!How do they know if you are a proficient Data

Miner?You know what they are talking about:

Glossary: cross-validation, boosting, missing values, sensitivity etc.

You know what algorithm solutions exist for their projects

You know what software tools/packages are availableYou can quickly prototype a system using existing or

your own codeYou know how to evaluate the toolsYou know how to tune or customize the

data/tools/algorithms for better performanceYou know the DM literature/progress in your area (new

fancy data mining?

Page 13: CSCE822 Data Mining and Warehousing

What Data Mining can Do for youCommercial:

Business intelligenceCustomer targeting

Scientific ResearchExtract hidden patterns from enormous

amount of dataMaterial science, Text mining, disease gene

discovery

Page 14: CSCE822 Data Mining and Warehousing

Voice from the Real-world Job Ads of Nexttag, San Jose, CA

Solid knowledge and hands-on experience in statistics, data clustering, predictive modeling, and/or text classification. Familiar with various modeling techniques such as regression, neural networks, decision trees, SVM, etc. Strong programming skills, especially in C++, Java, and/or Perl.

Corporate Analytics & Modeling: eBay Inc.Reduce losses by analyzing and correlating fraud patterns

across all companies and suggesting new technologies, techniques and models

Explore the use of statistical techniques like machine learning/neural networks, clustering, link analysis, graph theory and network theory to gain new insights on cross-company data, which in turn result in actionable ways to reduce fraud and risk without compromising business growth

Analysis will generally be project based and will often be complex in nature, whereby large volumes of data are extracted and synthesized into complex models and actionable recommendations. Analyses may involve segmentation, profiling, data mining, clustering and predictive modeling.

Page 15: CSCE822 Data Mining and Warehousing

Voice from the real-world DMWashington Mutual Funds: Senior Data Mining

Analyst - Customer Behavior AnalyticsThe role will require ability to extract data from

various sources & to design/construct complex analysis and communicate that to client as actionable intelligence. 

He/she will routinely engages in quantitative analysis on many non-standard and unique business problems and uses computer-intensive data mining techniques (decision trees, neural networks, etc.) to deliver actionable output.

Ad hoc prototyping skills using multiple techniques to solve a myriad of business scenarios.

Page 16: CSCE822 Data Mining and Warehousing

What I Expect from YouBe an Active Learner/ExplorerParticipate in brainstorming and

discussionTake Dr. Hu as your collaborator!

Course Objectives Actions

Glossary Textbook/Research paper/News

Algorithms We will cover most!

Software/tools Try as many as we can

Prototype Hands-on assignments/projects

Evaluation Hands-on

Tune/customize Hands-on/Read papers

Literature Read papers

Page 17: CSCE822 Data Mining and Warehousing

What You Can Expect from MeGood grade if you do well -Good collaborator for active discussionReady for help: email, drop by, call

Page 18: CSCE822 Data Mining and Warehousing

Click!

Case Study 1: Beat Google!Google’s Adsense System

Advertiser select Keywords for Ads: cell phone watch

Publisher/webmaster’s webpages/blog/forum…………………..……………phone….Watch….

Google Adsense System

?K1, K2, K3, K4…. W1, W2, W3,

….W100T1, T2, K3, T4… W1, W3, ….W10 4/100

2/100

W1, W3, ….W10Show which ads? Max Profit

Page 19: CSCE822 Data Mining and Warehousing

Case Study 1: Beat Google! 18M$Startup Company Turn.com ‘s New idea

Click!

•Select Keywords is not easy!•Advertiser DO nothing!• 100% automatic!•Advertiser’s URL/website as input

Publisher/webmaster’s webpages/blog/forum…………………..……………phone….Watch….

Turn.comSystem

?K1, K2, K3, K4….W1, W2, W3, ….W100+60

variablesT1, T2, K3, T4…

W1, W3, ….W10+60 variables

4/100

2/100

W1, W3, ….W10+60 variables

Show which ads?

Max Profit

Rank,Site

category

Page 20: CSCE822 Data Mining and Warehousing

DM Case Study 2: Molecular Classification of Cancer

Acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML)

Page 21: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?

What is data mining?Use historical (large-scale) data to uncover

regularities and improve future decisionsEverybody has some data:

Science: physics, chemistry, biologyHealth care: patients, diseases, imagesBusiness: sales, marketing Internet: web

What data can you get?

NIH

Hospital

Google

Walmart

Page 22: CSCE822 Data Mining and Warehousing

Data mining: what?

What is the original of Data mining?Draws ideas from machine learning/AI,

pattern recognition, statistics, and database systems

Traditional Techniquesmay be unsuitable due to Enormity of dataHigh dimensionality

of dataHeterogeneous,

distributed nature of data

Machine Learning/Pattern

Recognition

Statistics/AI

Data Mining

Database systems

Page 23: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?WHO is doing data mining?

retail, financial, communication, and marketing organizations

Page 24: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?Why data mining?

information explosion: Data Knowledge/Decision/Understanding/Profit

Exponential growth of the EMBL DNA sequence database

Personal Information storage

Page 25: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?

1. Collect Your data

Page 26: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?2. Determining the patterns you want to

mine: data mining tasksTwo main types of tasks

Prediction MethodsUse some variables to predict unknown or future

values of other variables.Description Methods

Find human-interpretable patterns/rules that describe the data.

Page 27: CSCE822 Data Mining and Warehousing

Data Mining TasksClassification [Predictive]

Clustering [Descriptive]

Regression [Predictive]Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Deviation Detection [Predictive]

Frequent Subgraph mining [Descriptive]…

Page 28: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?

2. Choose the algorithm(s)

Page 29: CSCE822 Data Mining and Warehousing

Data mining: How?

3. Select existing data mining software/packages

Page 30: CSCE822 Data Mining and Warehousing

Data mining: What, Who, Why, How?

2. Choose the implementation platform/programming languages

Page 31: CSCE822 Data Mining and Warehousing

Data mining: Where to work?

Page 32: CSCE822 Data Mining and Warehousing

About netflix.com: a movie/DVD rental serviceWhat is the story:

Recommendation system trained on ratings by customers

Customer selects one movie, the system will suggest 5 movies that are highly likely to be rented too…

Training data:18000 movies, each with diff. no. of customer

ratings. Each customer has unique userID.100 million ratings

Output: Given userID and a movieID, predict the rate!

Basic idea of algorithm:if two people enjoy the same product, they're likely

to have other favorites in common tooPotential Project!

Page 33: CSCE822 Data Mining and Warehousing

Take-home messageBuy text bookDownload Weka system and read about itRead papers posted on course website

Page 34: CSCE822 Data Mining and Warehousing

Have Serious Fun: Xprize GenomicsXprize: Revolution through competition$10 million prize for the winner of the

Archon X PRIZE for Genomics: awarded to the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less

What it means for bioinformatics/Genomics/medical Research:Huge amount of DNA sequence data for

analysisPersonalized medicinePromising for data mining in bioinformatics