Профессия Data Scientist

20
Профессия  Data Scientist Леонид Жуков Отделение Прикладной Математики и Информатики [email protected] Высшая школа экономики, Москва, 2013 www.hse.ru Конференция «Большие Данные в национальной экономике» Москва 2013

description

Презентация на конференции "Большие Данные в национальной экономике", Москва, 2013

Transcript of Профессия Data Scientist

Page 1: Профессия Data Scientist

Профессия  Data Scientist

Леонид Жуков Отделение Прикладной Математики и Информатики

[email protected]

Высшая школа экономики, Москва, 2013 www.hse.ru

Конференция «Большие Данные в национальной экономике» Москва 2013

Page 2: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

The Sexiest Job of the 21st century

2  

McKinsey оценивает нехватку в 140,000-190,000 специалистов к 2018г

Page 3: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Требуются Data Scientists!

3  

Page 4: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Спрос и предложение

4  

Page 5: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Кто такие Data Scientists?

A practitioner of data science is called a data scientist ( Wikipedia)

5  

  Предпочтительное образование: •  Computer Science •  Статистика, математика •  Точные науки: Физика, Инженерия, итд •  Магистры и кандидаты наук

Data Scientist: •  Любит данные •  Исследовательский склад ума •  Цель работы – нахождение закономерностей в данных •  Практик, не теоретик •  Умеет и любит работать руками •  Эксперт в прикладной области (*) •  Работает в команде

demand for a certain set of skills, while later demand wanes as many of those initial skills are automated by even newer tools. Consider, for instance, the way many data processing and network management jobs that used to require legions of computer operators are now handled by automated monitoring tools. Data science is still in its very early phase, with the amount of data exploding and

the right tools to process them just becoming available.

Although data science is generating new opportunities, our capacity to train new data scientists is not keeping up, and nearly two-thirds of respondents foresee a looming shortfall in the number of data scientists over the next five years. This aligns with other research, including a recent McKinsey Global Institute study that predicts a shortage of 190,000 data scientists by the year 2019iii. And when our respondents were asked where the best source for talent was, few looked to today’s business intelligence professional. Instead, nearly two-thirds looked for today’s

university students.

Who is the Data Scientist? Although the term data science has been around for decades – indeed, most scientists’ use data of some form – the term data scientist in its current context is relatively new, frequently credited to DJ Patil, who started the data science team at LinkedIn.iv But as a new term, the field is still very much in flux, and without evidence about the practitioners, we’re left to speculate about what it may mean. In our survey, we allowed users to self-identify as “data science professionals,” in order to avoid conflicts over terminology in job titles. In this section we’ll attempt to define the data scientist by comparing them with the previous big player in the analytics space, business intelligence professionals.

Twenty years ago, business intelligence was itself a new term, just emerging to take over the various database management and decisions support functions within an organization. As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition,

Students studying computer science

34%

Students studying

fields other than

computer science

24%

Professionals in disciplines other than IT or computer

science 27%

Today's BI professionals

12%

Other 3%

The best source of new Data Science talent is:

Jim Asplund, Chief Scientist at Gallup Consulting, is a data scientist focused on evaluating the role that human perception has on everything from disease conditions and GDP to worker productivity and consumer behavior. He works with massive data sets linking perception with actual behavior, and micro

and macroeconomic outcomes. His work has isolated emotional factors that are most highly related to outcomes

organizations care about.

 EMC Data Science Community Survey, 2011

 Dre

w C

onw

ay, 2

010

Page 6: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Рабочие инструменты

•  Operating systems: •  Linux + shell tools

•  Big data instruments: •  Hadoop (MapReduce) + hadoop tools •  Hive, Pig •  NoSQL (Hbase, MongoDB, Cassandra, Neo4J)

•  Database: •  SQL

•  Programming: •  Python •  Java •  Scala

•  Machine Learning: •  R •  Matlab •  Python libraries (NumPy, SciPy, Nltk,…) •  Java libraries (Mahaut)

.

6  

Page 7: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

День из жизни Data Scientist

Постановка  задачи  

Получение  данных  

Разбор  форматов,  организация  

Очистка,  фильтрация  

Исследование  данных  

Построение  моделей   Визуализация   Обсуждение  

результатов  

7  

Page 8: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Data Scientist  или Аналитик

Если Вы программируете, то скорее всего Вы - Data Scientist, если используете Excel, то - аналитик

•  Data Scientist: •  Используют Hadoop, MapReduce, Hive, R •  Создают специализированные системы и инструменты

•  Работают со структурированными и не структурированными данными

•  Рабочие данные измеряются в TB, PB •  Опыт научной работы, экспертиза в статистке, машинном обучении, программировании

•  Магистры и кандидаты наук (PhDs) •  Разрабатывают предсказательными модели

•  Создают data products

•  Analysts: •  Используют Excel, SQL •  Используют существующие инструменты и системы

•  Работают с табличными данными •  Данные измеряются MB,GB •  Профессиональное образование, нет формального научного

•  Бакалавры etc (BS, BA, MS, MBA) •  Работают тесно с BI и маркетингом •  Занимаются отчетами о показателях работы бизнеса

8  

Page 9: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Опрос: роли и навыки Data Scientist

9  

From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman , O’Reilly Strata 2012

Page 10: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Data Science команда - ”the dream team”

10  

From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013

Page 11: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Прикладные задачи

•  Маркетинг: •  Сегментация рынка •  Моделирование приобретения и оттока клиентов •  Рекомендательные системы •  Анализ социальных медиа

•  Финансовые и страховые компании: •  Предотвращение fraud •  Детектирование аномального поведения •  Анализ кредитных рисков •  Страховые моделирование •  Оптимизация портфолио

11  

•  Здравоохранение и Фармакология: •  Генетический анализ •  Анализ клинических испытаний •  Клинические системы принятия решений

Page 12: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Дорога дальняя…

•  Программирование •  Алгоритмы и структуры данных •  Базы данных •  Статистика •  Анализ данных •  Машинное обучение •  Компьютерная обработка текста •  Распределенные системы •  Инструменты Big Data •  Визуализация данных

12  

From: Swami Chandrasekaran,Executive Architect, IBM, Watson Solutions

Page 13: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Подготовительные программы в индустрии

©2013 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.

cloudera-intro-data-science-trainingsheet-102

Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304 | 1-888-789-1488 or 1-650-362-0488 | cloudera.com

TRAINING SHEET | 2

Course Outline: Cloudera Introduction to Data Science

Introduction

Data Science Overview > What Is Data Science? > The Growing Need for Data Science

> The Role of a Data Scientist

Use Cases > Finance > Retail > Advertising > Defense and Intelligence > Telecommunications and Utilities > Healthcare and Pharmaceuticals

Project Lifecycle > Steps in the Project Lifecycle

> Lab Scenario Explanation

Data Acquisition > Where to Source Data > Acquisition Techniques

Evaluating Input Data > Data Formats > Data Quantity > Data Quality

Data Transformation > Anonymization > File Format Conversion > Joining Datasets

Data Analysis and Statistical Methods > Relationship Between Statistics and

Probability > Descriptive Statistics > Inferential Statistics

Fundamentals of Machine Learning > Overview > The Three Cs of Machine Learning > Spotlight: Naïve Bayes Classifiers > Importance of Data and Algorithms

Recommender Overview > What Is a Recommender System? > Types of Collaborative Filtering > Limitations of Recommender Systems > Fundamental Concepts

Introduction to Apache Mahout > What Apache Mahout Is (and Is Not) > A Brief History of Mahout > Availability and Installation > Demonstration: Using Mahout’s Item-

Based Recommender

Implementing Recommenders with Apache Mahout

> Overview > Similarity Metrics for Binary Preferences > Similarity Metrics for Numeric Preferences > Scoring

Experimentation and Evaluation > Measuring Recommender Effectiveness > Designing Effective Experiments > Conducting an Effective Experiment > User Interfaces for Recommenders

Production Deployment and Beyond > Deploying to Production > Tips and Techniques for Working at Scale > Summarizing and Visualizing Results > Considerations for Improvement > Next Steps for Recommenders

Conclusion

Appendix A : Hadoop Overview

Appendix B: Mathematical Formulas

Appendix C : Language and Tool Reference

Cloudera Certified Professional: Data Scientist (CCP:DS)Establish yourself as an expert by completing the certification exam for data scientists. CCP:DS is the highest level of technical certification Cloudera offers and certifies your knowledge and skills as a data scientist using Apache Hadoop on large data sets. The credential requires both a multiple-choice Data Science Essentials exam and a hands-on, performance-based Data Science Challenge with a real-world problem on a live system.

TRAINING SHEET

Cloudera Introduction to Data Science: Building Recommender Systems

Take your knowledge to the next level with Cloudera’s Data Science Training and Certification

Data scientists build information platforms to ask and answer previously unimaginable questions. Learn how data science helps companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

Cloudera University’s three-day course helps participants understand what data scientists do and the problems they solve. Through in-class simulations, participants apply data science methods to real-world challenges in different industries and, ultimately, prepare for data scientist roles in the field.

Hands-On Hadoop Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:

> The role of data scientists, vertical use cases, and business applications of data science > Where and how to acquire data, methods for evaluating source data, and data

transformation and preparation > Types of statistics and analytical methods and their relationship > Machine learning fundamentals and breakthroughs, the importance of algorithms, and

data as a platform > How to implement and manage recommenders using Apache Mahout and how to set up

and evaluate data experiments > Steps for deploying new analytics projects to production and tips for working at scale

Audience & Prerequisites This course is suitable for developers, data analysts, and statisticians with basic knowledge of Apache Hadoop: HDFS, MapReduce, Hadoop Streaming, and Apache Hive. Students should have proficiency in a scripting language; Python is strongly preferred, but familiarity with Perl or Ruby is sufficient.

Data Scientist CertificationFollowing successful completion of the training class, attendees receive a Data Science Essentials practice test. Data Science Essentials plus the Data Science Challenge constitute the Cloudera Certified Professional: Data Scientist (CCP:DS). Certification is a great differen-tiator; it helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.

The professionalism and expansive technical knowledge demonstrated by our instructor were incredible. The quality of the Cloudera training was on par with a university.

GENERAL DYNAMICS

““

13  

Page 14: Профессия Data Scientist

Высшая школа экономики, Москва, 2013 14  

Подготовительные программы в индустрии

Page 15: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Образовательные программы

Университетские программы: •  University of Washington: CertiUcate in Data Science •  UC Berkeley: Master of information and data science program •  New York University: Data Science at NYU •  Columbia University: Institute for Data Sciences and Engineering •  University of Southern California (UCS) : Master of Science in Data Science

15  

Онлайн курсы обучения (MOOC): •  Coursera •  edX •  Udacity

Ускоренные образовательные программы (компании): •  ZipUan Academy (12 weeks intensive program) •  Insight Data Science Fellows program ( 6 weeks post doc training)

Page 16: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Конференции

Индустрийные конференции и выставки: •  O’Reilly Strata Conference Making Data Work •  Hadoop World •  Big Data Techcon •  Big Data Innovation summits

16  

Meetups («кружки по интересам»)

Научные и академические конференции (peer reviewed): •  IEEE & ACM Supercomputing •  IEEE Big Data •  ACM KDD Knowledge Discovery and Data Mining •  ACM SIGIR Information Retrieval •  ICML International Conference on Machine Learning •  ICDM International Conference on Data Mining •  NIPS Neural Information Processing •  WWW World Wide Web Conference •  VLDB Very Large Data Bases •  ACM CIKM Information and Knowledge Management •  SIAM SDM International Conference on Data Mining •  IEEE ICDE Data Engineering •  IEEE Visualization

Page 17: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

 Книги

17  

Page 18: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

Открытые вопросы

• Насколько важно быть экспертом в предметной области решаемой задачи (domain expertise) ?

• Что более важно в профессии  Data Scientist : образование или практический опыт?

• Перспективы профессии Data Scientist, будут ли она замещена программными решениями?

18  

Page 19: Профессия Data Scientist

Высшая школа экономики, Москва, 2013

ВШЭ Отделение Прикладной Математики и Информатики

Курсы, читаемые на отделении: •  Программирование (Python, Java, Matlab) •  Архитектура компьютеров и системное программирование •  Распределенные системы •  Теория баз данных •  Дискретная математика •  Алгоритмы и структуры данных •  Статистическое моделирование и анализ •  Численные методы •  Прикладная теория графов •  Анализ и обработка данных •  Методы машинного обучения •  Автоматическая обработка текстов •  Компьютерная лингвистика •  Анализ социальных сетей

•  Запускается Магистерская программа «Наука о Данных»

19  

Page 20: Профессия Data Scientist

101000, Россия, Москва, Мясницкая ул., д. 20 Тел.: (495) 621-7983, факс: (495) 628-7931

www.hse.ru