Introduction to data science

Introduction to Data Science Dr. Bill Howe - Director of Research, Scalable Data Analytics

Transcript of Introduction to data science

Page 1: Introduction to data science

Introduction to Data Science

Dr. Bill Howe - Director of Research, Scalable Data Analytics

Page 2: Introduction to data science

What is data science?

◦ A set of theories and principles for performing several data-related tasks, such as

◦ Data collection

◦ Data cleaning

◦ Data integration

◦ Data modeling

◦ Data visualization


Page 3: Introduction to data science

Data science is different from

◦ Business intelligence

◦ Statistics

◦ Database management

◦ Visualization

◦ Machine Learning


Page 4: Introduction to data science

DBA: learn to work with unstructured data

Statistician: learn to work with data that does not fit in memory

Software engineer: learn statistical models and how to communicate results

Business analyst: learn about algorithms and tradeoffs at scale

Suggestions for students

Page 5: Introduction to data science

Three common skills of data scientists

◦ Statistics: traditional analysis

◦ Data munging: parsing, scraping, and formatting data

◦ Visualization: graphs, tools, etc.

What do data scientists do?
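
As a rough illustration of the first two skills, the sketch below parses a few messy records (munging) and then summarizes them (traditional statistics). The records, field layout, and the `parse_line` helper are all hypothetical, invented for this example; visualization is left out to keep the sketch dependency-free.

```python
import statistics

# Hypothetical raw records: inconsistent spacing, case, and one missing value.
RAW_LINES = [
    " Alice , 34 ",
    "BOB,29",
    "carol, n/a",
    "dave,41",
]


def parse_line(line):
    """Split one 'name, age' record; return None if the age is not a number."""
    name, age = (field.strip() for field in line.split(","))
    if not age.isdigit():
        return None
    return {"name": name.title(), "age": int(age)}


# Munging: parse every line and drop the records that could not be parsed.
records = [r for r in map(parse_line, RAW_LINES) if r is not None]

# Statistics: summarize the cleaned ages.
ages = [r["age"] for r in records]
mean_age = statistics.mean(ages)
stdev_age = statistics.stdev(ages)  # sample standard deviation

print(records)
print(f"mean={mean_age:.2f} stdev={stdev_age:.2f}")
```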

Page 6: Introduction to data science

Three types of tasks:

◦ Preparing to run a model

◦ Running the model

◦ Communicating the results

What do data scientists do?

Page 7: Introduction to data science

◦ Preparing to run a model

Gathering

Cleaning

Integrating

Restructuring

Transforming

Loading

Filtering
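
Several of the preparation steps above can be sketched as a small chain of functions. This is only a minimal sketch under stated assumptions: the two CSV snippets, their column names, and every function name are hypothetical, and real sources would usually be files, databases, or APIs rather than inline strings.

```python
import csv
import io

# Hypothetical inputs standing in for two gathered sources.
CUSTOMERS_CSV = "id,name,state\n1,Ann,WA\n2,Raj,\n3,Lee,CA\n"
ORDERS_CSV = "customer_id,amount\n1,20.5\n1,4.5\n3,10\n2,7\n"


def gather(text):
    """Gathering: read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(customers):
    """Cleaning: drop customers whose state is missing."""
    return [c for c in customers if c["state"]]


def integrate(customers, orders):
    """Integrating: join each order to its customer by id."""
    by_id = {c["id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], "amount": o["amount"]}
        for o in orders
        if o["customer_id"] in by_id
    ]


def transform(rows):
    """Transforming: convert the amount from text to a number."""
    return [{**r, "amount": float(r["amount"])} for r in rows]


def filter_rows(rows, minimum):
    """Filtering: keep only orders of at least `minimum`."""
    return [r for r in rows if r["amount"] >= minimum]


customers = clean(gather(CUSTOMERS_CSV))
orders = gather(ORDERS_CSV)
prepared = filter_rows(transform(integrate(customers, orders)), minimum=5.0)
print(prepared)
```

Keeping each step as its own function makes it easy to reorder or replace a step without touching the others.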

Page 8: Introduction to data science

◦ Running the model

Choosing appropriate machine learning algorithms for regression, classification, clustering, and recommendations

Validation of model

Improvement of model

◦ Communicating the results
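
To make the choose, validate, improve cycle concrete, here is a minimal sketch that fits a least-squares line and checks it on held-out points against a simple baseline. The data values are made up for illustration, the helper names are hypothetical, and holding out the last two points is just one simple validation choice, not the only one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data, roughly following y = 2x + 1 with some noise.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.2, 12.9, 15.1]

# Validation: train on the first six points, hold out the last two.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]

# Baseline model: always predict the mean of the training targets.
baseline_mean = sum(train_y) / len(train_y)
baseline_error = mse(lambda x: baseline_mean, test_x, test_y)

# Improved model: a fitted line.
slope, intercept = fit_line(train_x, train_y)
model_error = mse(lambda x: slope * x + intercept, test_x, test_y)

print(f"baseline MSE={baseline_error:.3f} line MSE={model_error:.3f}")
```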

Page 9: Introduction to data science

Breadth

◦ MapReduce / relational algebra / logistic regression / visualization

Depth

◦ Structure (relational algebra) / statistics (linear algebra)

Scale

◦ Desktop (R) / cloud (Hadoop)

Target

◦ Hackers (R, Java, Python) / analysts (little or no programming)

Data science dimensions
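
MapReduce is one of the abstractions named on this slide. The sketch below imitates its three phases (map, shuffle, reduce) for a word count on a single machine. It is an in-memory illustration only, not Hadoop's API; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


# Hypothetical input documents.
documents = ["big data is big", "data science uses data"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a cluster, the map and reduce calls would run in parallel on different machines; the structure of the computation stays the same.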

Page 10: Introduction to data science

Scale – cloud for big data

Big data can be measured by the three V's

◦ Volume – number of rows (size)

◦ Variety – number of columns or sources (text, images, audio, video)

◦ Velocity – number of rows or bytes per unit time (processing time)

Data science dimensions
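
Using the slide's own definitions, the three V's of a small dataset can be computed directly. The event records and the `three_vs` helper below are hypothetical, and variety is measured here as the number of distinct columns only, not the number of sources.

```python
# Hypothetical event log; "t" is a timestamp in seconds.
events = [
    {"t": 0.0, "user": "a", "kind": "click"},
    {"t": 0.5, "user": "b", "kind": "view"},
    {"t": 1.0, "user": "a", "kind": "view"},
    {"t": 2.0, "user": "c", "kind": "click"},
]


def three_vs(rows, time_key):
    """Return (volume, variety, velocity) for a list of dict rows."""
    volume = len(rows)                              # number of rows
    variety = len({key for r in rows for key in r})  # distinct columns
    times = [r[time_key] for r in rows]
    span = max(times) - min(times)
    # Rows per unit time; undefined (infinite) if all rows share one timestamp.
    velocity = volume / span if span > 0 else float("inf")
    return volume, variety, velocity


volume, variety, velocity = three_vs(events, "t")
print(volume, variety, velocity)
```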

Page 11: Introduction to data science

“data exhaust” from customers

new and pervasive sensors

the ability to “keep everything”

Where does big data come from?

Page 12: Introduction to data science

Prior programming experience

◦ SQL

◦ Python

Basic statistics

Basic database concepts

Prerequisites
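
For a sense of the level of SQL and Python assumed, the following sketch uses Python's built-in `sqlite3` module to create a table, insert rows, and run a grouped query. The table name, columns, and values are hypothetical.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (state TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("WA", 2), ("WA", -1), ("CA", 3)],
)

# Basic database concepts in one query: grouping, aggregation, ordering.
rows = conn.execute(
    "SELECT state, COUNT(*), AVG(score) "
    "FROM tweets GROUP BY state ORDER BY state"
).fetchall()
conn.close()

print(rows)
```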

Page 13: Introduction to data science

Twitter sentiment analysis

◦ Extract tweets from the Twitter API

◦ Calculate the sentiment score for tweets

◦ Calculate the sentiment score for terms in tweets

◦ Calculate the frequency of terms in tweets

◦ Identify the happiest state

◦ Identify the top ten hashtags

Programming Assignment 1
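
Several steps of the assignment can be sketched offline. This is not the assignment's reference solution: the tiny lexicon, the sample tweets, and all function names are hypothetical; fetching tweets from the Twitter API and scoring terms that are not in the lexicon are both omitted; and "happiest state" is assumed here to mean the highest average tweet score.

```python
from collections import Counter, defaultdict

# Hypothetical sentiment lexicon: word -> score.
LEXICON = {"love": 3, "great": 3, "happy": 3, "bad": -3, "hate": -3}

# Hypothetical tweets, already extracted and tagged with a US state.
TWEETS = [
    {"text": "I love this great city #seattle", "state": "WA"},
    {"text": "Traffic is bad today #seattle #commute", "state": "WA"},
    {"text": "So happy at the beach #sun", "state": "CA"},
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tweet_score(text):
    """Sentiment of a tweet: sum of lexicon scores of its words."""
    return sum(LEXICON.get(word, 0) for word in tokenize(text))


def term_frequencies(tweets):
    """Relative frequency of each term across all tweets."""
    counts = Counter(w for t in tweets for w in tokenize(t["text"]))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def happiest_state(tweets):
    """State with the highest average tweet score."""
    scores = defaultdict(list)
    for t in tweets:
        scores[t["state"]].append(tweet_score(t["text"]))
    return max(scores, key=lambda s: sum(scores[s]) / len(scores[s]))


def top_hashtags(tweets, n=10):
    """The n most frequent hashtags, as (hashtag, count) pairs."""
    tags = Counter(
        w for t in tweets for w in tokenize(t["text"]) if w.startswith("#")
    )
    return tags.most_common(n)


print([tweet_score(t["text"]) for t in TWEETS])
print(happiest_state(TWEETS))
print(top_hashtags(TWEETS))
```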

Page 14: Introduction to data science

Thanks !!