Musings on Data Science and Students Experiencing Data Analytics New England SENCER Center for...
-
Upload
destini-macklin -
Category
Documents
-
view
213 -
download
0
Transcript of Musings on Data Science and Students Experiencing Data Analytics New England SENCER Center for...
Musings on Data Science and Students Experiencing Data Analytics
New England SENCER Center for Innovation
Prof. Randy PaffenrothData Science Program
Department of Mathematical SciencesWorcester Polytechnic Institute
2014
My Research
"Internet Connectivity Access layer" by User:Ludovic.ferre - Internet_Connectivity_Overview2_Access.svg. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg
This is a panel, so I want to be provocative!
Provocative
Adjective
1. tending or serving to provoke; inciting,
stimulating, irritating, or vexing.
So, I will be a little sad if I don’t end up irritating anyone
The first war: Terminology
• Analyzing data has a long history!
• There have been many terms that have been used to describe such endeavors:
• Statistics
• Artificial Intelligence
• Machine learning
• Data analytics
• Since I happen to work in a “Data Science” program perhaps I may be allowed the indulgence of using that terminology…
Whatever we call it, what makes things different now?
Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics and to the development of new information-based industries.- Frontiers in Massive Data Analysis, National Research Council of the National Academies
Given a large mass of data, we can by judicious selection construct perfectly plausible unassailable theories—all of which, some of which, or none of which may be right. - Paul Arnold Srere
The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.- Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers
My personal goal: Getting students to be able tothink critically about data.
What is Big Data? The are many examples of "data", but what makes some of
it “big”? The classic definition revolves around the three Vs.
Volume, velocity, and variety.
Volume: There is a just a lot of it being generated all the time. Things get interesting and “big”, when you can’t fit it all on one computer anymore. Why? There are many ideas here such as MapReduce, Hadoop, etc. that all revolve around being able to process data that goes from Terabytes, to Petabytes, to Exabytes.
Velocity: Data is being generated very quickly. Can you even store it all? If not, then what do you get rid of and what do you keep?
Variety: The data types you mention all take different shapes. What does it mean to store them so that you can play with or compare them?
http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg
Is Big Data the same as Data Science?
Are Big Data and Data Science the same thing? I wouldn't say so... Data Science can be done on small data sets. And not everything done using Big Data would
necessarily be called Data Science.
Big Data Data Science
Is Big Data the same as Data Science?
Are Big Data and Data Science the same thing? I wouldn't say so... Data Science can be done on small data sets. And not everything done using Big Data would
necessarily be called Data Science. But there certainly is a substantial overlap!
Big DataData Science
Can you even be certain?
For real world problems, I claim that you will never be certain of any inferences from data.
I mean, what happens to your carefully thought out marketing plan for some rocking slacks when the Martians land.
What is unacceptable is when the data you actually have does not support the conclusion you report.
Public domain image
It can be easy to fool yourself!Human beings are really good at pattern detection...
http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
Perhaps a bit too good!
It can be easy to fool yourself!
http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
Skills for Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Which is most important?
http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
WPI Data Science Program:A Collaboration
Business School
Computer ScienceDepartmentMathematical
SciencesDepartment
M.S. in Data Science Program
INTEGRATIVE DATA SCIENCE (3 CREDITS) INTEGRATIVE DATA SCIENCE (3 CREDITS)
GRADUATE QUALIFYING PROJECT OR MS THESIS (3 TO 9 CREDITS)
MATHEMATICALANALYTICS(3 CREDITS)
DATA ACCESS & MANAGEMENT
(3 CREDITS)
DATA ANALYTICS &
MINING(3 CREDITS)
BUSINESSINTELLIGENCE &
CASE STUDIES(3 CREDITS)
CONCENTRATION AND ELECTIVES(9 TO 15 CREDITS)
18
Data Science Core
INTEGRATIVE DATA SCIENCE :INTEGRATIVE DATA SCIENCE :DS 501 INTRODUCTION TO DATA SCIENCE (NEW COURSE)
MATHEMATICAL ANALYTICS MATHEMATICAL ANALYTICS (SELECT ONE):MA 543/DS 502 STATISTICAL METHODS FOR DATA SCIENCE (NEW COURSE)MA 542 REGRESSION ANALYSISMA 554 APPLIED MULTIVARIATE ANALYSIS
DATA ACCESS AND MANAGEMENT DATA ACCESS AND MANAGEMENT (SELECT ONE):
CS 542 DATABASE MANAGEMENT SYSTEMSMIS 571 DATABASE APPLICATIONS DEVELOPMENTCS 561 ADVANCED TOPICS IN DATABASE SYSTEMSCS 585/DS 503 BIG DATA MANAGEMENT (NEW COURSE)
DATA ANALYTICS AND MINING DATA ANALYTICS AND MINING (SELECT ONE):
CS 548 KNOWLEDGE DISCOVERY AND DATA MININGCS 539 MACHINE LEARNINGCS 586/DS 504 BIG DATA ANALYTICS (NEW COURSE)
BUSINESS INTELLIGENCE AND CASE STUDIES BUSINESS INTELLIGENCE AND CASE STUDIES (SELECT ONE):
MIS 584 BUSINESS INTELLIGENCE MKT 568 DATA MINING BUSINESS APPLICATIONS
Data Science Certificate Program (18 credits);
•15 CREDIT DATA SCIENCE COREplus•3 CREDIT ELECTIVE
2014 Data Science Cohort
NATIONALITY
CAMBODIA
INDIA
CHINA
PAKISTAN
TAIWAN
IRAN
U.S.A.
BRAZIL
NEPAL
AFGHANISTAN
INDONESIA
EDUCATIONAL FOUNDATION QUANTITATIVE/ COMPUTATIONAL BACKGROUNDSPROGRAMMING WITH DATA STRUCTURES AND ALGORITHMS FOR COMPUTATIONAL SKILLSQUANTITATIVE SKILLS CALCULUS, LINEAR ALGEBRA AND STATISTICS
EMPLOYMENT HISTORIESSENIOR RESEARCH ANALYST SENIOR BUSINESS ANALYSTPATIENT FINANCIAL SERVICES DATA BASE ANALYST-ARCHITECT DECISION SCIENTIST MINISTRY OF FINANCE LAHEY HEALTH TECHNICAL PROGRAM MANAGEMENTU.S. DEPARTMENT OF STATE
66.70% Male66.70% Male33.3% Female33.3% Female
GENDERGENDER
10% FULBRIGHTSCHOLARS
2014 Data Science Cohort
FALL 2014 FALL 2014 Total ApplicantsTotal Applicants 126 126Total acceptancesTotal acceptances 33 33Fulbright ScholarsFulbright Scholars 3 3Brazil Science Mobility Student 1 Brazil Science Mobility Student 1 Countries Represented 9Countries Represented 9Domestic Students 5Domestic Students 5International Students 28International Students 28
Many hold more than one earned Bachelor’s DegreeMany hold more than one earned Bachelor’s DegreeUS Universities include Columbia, UNH and WPIUS Universities include Columbia, UNH and WPIDean Oates gave two Awards of $5K to outstanding Dean Oates gave two Awards of $5K to outstanding students. students. These awards help attract top students.These awards help attract top students.
Skills Acquired by Our StudentsFundamental/Technical :
SQL/ Data Modeling / Cleaning
Data Integration / Warehousing
Statistical Learning / Machine Learning
Distributed Computing
Big Data Management
Classif./Regression/DecisionTrees
Business Intelligence
Distributed Mining Algorithms
Professional Skills:
Business Use Cases / Entrepreneurship
Interdisciplinary Teams / Leadership
Tools :
Oracle /MySQL/DB2/SQLServer
R / SAS / SciKit
Weka /RapidMiner /MatLab
IBM Cognos / SPSS Modeler
Hadoop / Mahout / Cassandra
Python / Java / Cloud Computing
Storm / Sparc / InfoSphere Streams
Spotfire / Tableaux
Professional Skills:
Story Telling / Visualization
Presentations / Reports
Data Science Tools for Students: Free!
Software:
•Python
•http://www.python.org/• iPython: http://ipython.org/
• Numpy: http://www.numpy.org/
• Pandas: http://pandas.pydata.org/
• Matplotlib: http://matplotlib.org/
• Mayavi: http://mayavi.sourceforge.net/
• Scikit-learn: http://scikit-learn.org/stable/
Data:
•UCI Machine learning repository
• http://archive.ics.uci.edu/ml/
•Kaggle
• https://www.kaggle.com/
•U.S. Government
• https://www.data.gov/