Introduction to Data Science: A Practical Approach to Big Data Analytics
![Page 1: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/1.jpg)
1
INTRODUCTION TO DATA SCIENCE: A PRACTICAL APPROACH TO BIG DATA ANALYTICS
IVAN KHVOSTISHKOV, EMC2
MARCH 3, 2016 – DEUTSCHE BANK DEVELOPMENT CENTER, MOSCOW
![Page 2: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/2.jpg)
2
FOUR “V”S OF BIG DATA
• Volume
• Velocity
• Variety
• Variability
![Page 3: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/3.jpg)
3
DATA SCIENCE VS. BUSINESS INTELLIGENCE
[Chart: Business Intelligence and Data Science. Axes: Time (Past to Future), Business value (Low to High).]
![Page 4: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/4.jpg)
4
DATA SCIENCE AND INNOVATION
[Chart: DS, DS and EDW. Axes: Business value (Low to High); time scale (Real-time, Non real-time, Very long time). Regions: Exploratory / Agile, Operational / Stable.]
![Page 5: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/5.jpg)
5
INDUSTRY VERTICALS
EXAMPLES
• Health Care
• Public Services
• Life Sciences
• IT Infrastructure
• Online Services
• …
![Page 6: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/6.jpg)
6
MACHINE LEARNING ALGORITHMS
BASIC OVERVIEW
Unsupervised (learning structure from unlabeled data)
• K-means clustering
• Association Rules
Supervised
• Linear regression
• Logistic regression
• Naïve Bayesian Classifier
• Decision Trees
• Time series analysis
• Text analytics
![Page 7: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/7.jpg)
7
K-MEANS CLUSTERING
CLUSTERING SIMILAR DOCUMENTS, EVENTS
• Choose centroids, assign each data point to a cluster
• See also: k-nearest neighbors (regression, classification)
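The two steps named on this slide (choose centroids, assign each point to a cluster) can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the function name `kmeans`, the sample points, and the choice to start from the first `k` points are assumptions made for the example; real implementations usually pick the starting centroids at random.

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means sketch on a list of equal-length numeric tuples."""
    # Assumption for a reproducible example: start from the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The loop alternates assignment and update for a fixed number of iterations; a fuller version would stop once the labels no longer change.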
![Page 8: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/8.jpg)
8
ASSOCIATION RULES
APRIORI – EARLY ALGORITHM
• {bread, eggs} -> {milk}
• Frequent itemset, Support
– How often the items occur together
– e.g. 50% of transactions
• Confidence
– Share of the transactions containing X that also contain Y, i.e. {X, Y} relative to X
– e.g. 80% = interesting
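Support and confidence for the {bread, eggs} -> {milk} rule can be computed directly from a list of transactions. The sketch below is illustrative only: the transactions are made up so that the numbers match the 50% and 80% figures on the slide, and the helper names `support` and `confidence` are assumptions, not part of any library. It does not implement Apriori's candidate generation.

```python
# Hypothetical shopping baskets, chosen to reproduce the slide's example figures.
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs", "milk"},
    {"bread", "eggs", "milk"},
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


lhs, rhs = {"bread", "eggs"}, {"milk"}
rule_support = support(lhs | rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```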
![Page 9: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/9.jpg)
9
LINEAR REGRESSION
fdq_rate = –0.9 + 0.66 × CurrentUnem + 1.06 × ChgInUnemp1yr + 0.22 × HiCostMortRate
• What-if scenario
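A what-if scenario with this model amounts to changing one input and re-evaluating the formula. The sketch below uses the coefficients from the slide; the input values in `baseline` and `scenario` are invented for illustration and carry no real-world meaning.

```python
# Coefficients copied from the slide's fitted model.
INTERCEPT = -0.9
COEFFICIENTS = {
    "CurrentUnem": 0.66,
    "ChgInUnemp1yr": 1.06,
    "HiCostMortRate": 0.22,
}


def predict_fdq_rate(features):
    """Evaluate the linear model for a dict of feature values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)


# Hypothetical inputs (assumed values, not from the talk).
baseline = {"CurrentUnem": 5.0, "ChgInUnemp1yr": 0.5, "HiCostMortRate": 10.0}
# What if the one-year change in unemployment were 2.0 instead of 0.5?
scenario = dict(baseline, ChgInUnemp1yr=2.0)

base_rate = predict_fdq_rate(baseline)
scenario_rate = predict_fdq_rate(scenario)
print(base_rate, scenario_rate, scenario_rate - base_rate)
```

Because the model is linear, the difference between the two predictions equals the coefficient times the change in the input (1.06 × 1.5 here).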
![Page 10: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/10.jpg)
10
LOGISTIC REGRESSION
Receiver Operating Characteristic (ROC)
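A ROC curve is obtained by sweeping the decision threshold over the probabilities that a logistic model outputs. The following is a rough sketch under stated assumptions: the linear scores and labels are made up, the function names are hypothetical, and the code assumes all scores are distinct (ties would need extra handling).

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def roc_points(probabilities, labels):
    """(false positive rate, true positive rate) pairs as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples from the highest probability to the lowest.
    for _, label in sorted(zip(probabilities, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical linear scores (w·x + b) and true classes.
scores = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
labels = [1, 1, 0, 1, 0, 0]
probabilities = [sigmoid(z) for z in scores]
curve = roc_points(probabilities, labels)
auc = area_under_curve(curve)
print(curve)
print(auc)
```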
![Page 11: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/11.jpg)
11
NAÏVE BAYESIAN CLASSIFIER
![Page 12: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/12.jpg)
12
DECISION TREES
• Entropy-based approach
• Conditional entropy
• See also: SVM
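In an entropy-based tree, a split is chosen by comparing the entropy of the labels with their conditional entropy given a candidate attribute; the difference is the information gain. A minimal sketch follows, with toy data and function names that are assumptions for the example:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def conditional_entropy(feature, labels):
    """Entropy of labels within each feature value, weighted by how common the value is."""
    total = len(labels)
    result = 0.0
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        result += len(subset) / total * entropy(subset)
    return result


def information_gain(feature, labels):
    return entropy(labels) - conditional_entropy(feature, labels)


# Hypothetical toy data: "outlook" determines the label, "windy" says nothing about it.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = [True, False, True, False]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)
print(gain_outlook, gain_windy)
```

A tree builder would split on the attribute with the larger gain (here `outlook`) and repeat the procedure on each branch.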
![Page 13: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/13.jpg)
13
TIME SERIES ANALYSIS
• ARMA model – Autoregressive Moving Average
• ARIMA – Autoregressive Integrated Moving Average
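To make the two acronyms concrete: an ARMA(1,1) series combines its own previous value (the autoregressive part) with the previous noise term (the moving-average part), and ARIMA applies the same model after differencing the series. The sketch below only simulates and differences a series; it does not fit a model, and the parameter values are arbitrary assumptions.

```python
import random


def simulate_arma11(n, phi, theta, sigma=1.0, seed=0):
    """Simulate x_t = phi * x_(t-1) + eps_t + theta * eps_(t-1) with Gaussian noise."""
    rng = random.Random(seed)
    series = []
    prev_x = prev_eps = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, sigma)
        x = phi * prev_x + eps + theta * prev_eps
        series.append(x)
        prev_x, prev_eps = x, eps
    return series


def difference(series):
    """First difference: the 'I' (integrated) step of ARIMA, applied once."""
    return [b - a for a, b in zip(series, series[1:])]


# Arbitrary illustrative parameters.
series = simulate_arma11(n=100, phi=0.6, theta=0.3)
diffed = difference(series)
print(len(series), len(diffed))
```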
![Page 14: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/14.jpg)
14
TEXT ANALYSIS
CONCEPTS
• Bag of words
• Inverted index
• Relevance (precision / recall) - TF
• Inverse document frequency (IDF)
• TF-IDF (improved relevance)
• PageRank, …
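The bag-of-words, TF and IDF concepts combine into a TF-IDF weight per term and document. The sketch below uses one common variant (term count divided by document length, times the natural log of total documents over documents containing the term); other weightings exist, and the three sample documents are invented for the example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus.
docs = {
    "d1": "big data analytics with hadoop",
    "d2": "data science and machine learning",
    "d3": "hadoop cluster planning",
}


def tf_idf(docs):
    """Return {doc: {term: weight}} using tf = count / length, idf = ln(N / df)."""
    # Bag of words: word order is discarded, only counts matter.
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    weights = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        weights[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


weights = tf_idf(docs)
# "big" occurs in one document, "data" in two, so "big" is weighted higher in d1.
print(weights["d1"]["big"], weights["d1"]["data"])
```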
![Page 15: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/15.jpg)
15
USE THE RIGHT TOOLS
WHEN ALL YOU HAVE IS A HAMMER, EVERYTHING LOOKS LIKE A NAIL
![Page 16: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/16.jpg)
16
BIG DATA LANDSCAPE IS BIG
![Page 17: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/17.jpg)
17
SQL, NOSQL, HADOOP
• SQL databases were not designed to scale easily
– Cost, > 10 TB?
– OLTP vs. OLAP
• NoSQL databases – Big Data approach
– Native format, tight integration
– Compute is still the bottleneck
• Hadoop – load early, transform later
– ETL vs. ELT
– Sandboxing, loose integration patterns
![Page 18: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/18.jpg)
18
HADOOP ECOSYSTEM
![Page 19: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/19.jpg)
19
HAWQ
EX-GREENPLUM
* See also: Hive, Impala
![Page 20: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/20.jpg)
20
SPARK
![Page 21: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/21.jpg)
21
IN-MEMORY DATA GRID
APACHE GEODE AKA GEMFIRE
![Page 22: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/22.jpg)
22
INDUSTRIAL PROJECT EXAMPLE
E-GOV.KZ
Saint Petersburg, Moscow, Astana, Almaty
Data Size
• Public data: 1 TB
• Articles: 5 000 000
• Comments: 100 000 000
• Private data: 70 TB
![Page 23: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/23.jpg)
23
QUALITY ANALYSIS SYSTEM
PROBLEM STATEMENT
Kazakhstan Government Services and Information Online
World Wide Web
Relevance
Sentiment
![Page 24: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/24.jpg)
24
DATA WORKFLOW
[Diagram labels: Resource 1 to Resource 5; EMC2 parsers; NIT parsers; Hive import; Solr import; Model execution; Results dump; BI Dashboard]
![Page 25: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/25.jpg)
25
NUTCH
Crawl cycle:
• Inject seed urls
• Generate new segment
• Fetch urls from the list
• Parse the content
• Update CrawlDB
[Diagram labels: Seed urls, WWW, CrawlDB, IndexDB, Fetch list, Fetched content, Parsed text and data]
![Page 26: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/26.jpg)
26
CRAWLING VS. SCRAPING
Crawling
• Returns traffic back to the site
Scraping
• Doesn’t return traffic
• Extracts value
![Page 27: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/27.jpg)
27
MACHINE LEARNING INSTRUMENTS
TreeTagger
Vowpal Wabbit
Word2vec / Paragraph2vec
![Page 28: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/28.jpg)
28
R
![Page 29: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/29.jpg)
29
CLASSIFICATION METHOD
• Logistic Regression
• Multiclass classification
• One-vs-All
• Accuracy

      Positive  Negative  Neutral
x0    0         0         1
x1    0         0         1
…     …         …         …
xn    1         0         0
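The one-vs-all scheme on this slide trains one binary logistic regression per class (Positive, Negative, Neutral) and picks the class whose model is most confident. The sketch below is a toy illustration, not the project's implementation: the 2-D feature vectors stand in for real text features, and the learning rate, epoch count and function names are assumptions.

```python
import math

CLASSES = ["Positive", "Negative", "Neutral"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, x):
    """Probability from one binary model; weights[0] is the bias."""
    row = [1.0] + list(x)
    return sigmoid(sum(w * v for w, v in zip(weights, row)))


def train_binary(X, y, lr=0.5, epochs=500):
    """Batch gradient descent for a single binary logistic regression."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, target in zip(X, y):
            error = score(weights, x) - target
            for j, v in enumerate([1.0] + list(x)):
                grad[j] += error * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def train_one_vs_all(X, labels):
    """One model per class: that class is 1, every other class is 0."""
    return {c: train_binary(X, [1 if label == c else 0 for label in labels])
            for c in CLASSES}


def predict(models, x):
    """Combine the binary models by taking the most confident one."""
    return max(CLASSES, key=lambda c: score(models[c], x))


# Hypothetical 2-D features, three well-separated groups.
X = [(0.0, 2.0), (0.3, 2.2), (-0.3, 1.8),
     (-2.0, -1.0), (-2.2, -0.8), (-1.8, -1.2),
     (2.0, -1.0), (2.2, -1.2), (1.8, -0.8)]
labels = ["Positive"] * 3 + ["Negative"] * 3 + ["Neutral"] * 3

models = train_one_vs_all(X, labels)
accuracy = sum(predict(models, x) == label for x, label in zip(X, labels)) / len(X)
print(accuracy)
```

Accuracy here is measured on the training points only; a real evaluation would use held-out data.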
![Page 30: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/30.jpg)
30
MODEL WORKFLOW
Step 1
• Cleaning
• Lemmatisation
• Preparing
Step 2
• One-vs-all models
• Combination
• Accuracy
Step 3
• Application
• Re-training if necessary
![Page 31: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/31.jpg)
31
![Page 32: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/32.jpg)
32
![Page 33: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/33.jpg)
33
![Page 34: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/34.jpg)
34
PITFALLS
• Private data access
• Data growth – 10-100x
• Hadoop cluster planning
• Nutch scraping integration is not easy
• Oozie is cumbersome
• Hive is not for BI, use HAWQ
![Page 35: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/35.jpg)
35
![Page 36: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/36.jpg)
36
DATA SCIENTIST
Data Scientist
Quantitative
Curious & Creative
Communicative & Collaborative
Skeptical
Technical
![Page 37: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/37.jpg)
37
DATA ANALYTICS LIFECYCLE
• Discovery
• Data Preparation
• Model Planning
• Model Building
• Communicate Results
• Operationalize
70-80% of time
![Page 38: Introduction to Data Science: A Practical Approach to Big Data Analytics](https://reader035.fdocuments.net/reader035/viewer/2022062821/589aaae01a28abfc1a8b6e97/html5/thumbnails/38.jpg)
38
RESOURCES
• Deep Learning
• Visualization
• Machine Learning Course: https://www.coursera.org/learn/machine-learning
• Data Science and Big Data Analytics: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111887613X.html
• Online Twitter Sentiment Analysis: http://sentiment140.com/
• Amazon MTurk
• Meet-ups!