Профессия Data Scientist
-
Upload
leonid-zhukov -
Category
Education
-
view
1.440 -
download
6
description
Transcript of Профессия Data Scientist
Профессия Data Scientist
Леонид Жуков Отделение Прикладной Математики и Информатики
Высшая школа экономики, Москва, 2013 www.hse.ru
Конференция «Большие Данные в национальной экономике» Москва 2013
Высшая школа экономики, Москва, 2013
The Sexiest Job of the 21st century
2
McKinsey оценивает нехватку в 140,000-190,000 специалистов к 2018г
Высшая школа экономики, Москва, 2013
Требуются Data Scientists!
3
Высшая школа экономики, Москва, 2013
Спрос и предложение
4
Высшая школа экономики, Москва, 2013
Кто такие Data Scientists?
A practitioner of data science is called a data scientist ( Wikipedia)
5
Предпочтительное образование: • Computer Science • Статистика, математика • Точные науки: Физика, Инженерия, итд • Магистры и кандидаты наук
Data Scientist: • Любит данные • Исследовательский склад ума • Цель работы – нахождение закономерностей в данных • Практик, не теоретик • Умеет и любит работать руками • Эксперт в прикладной области (*) • Работает в команде
demand for a certain set of skills, while later demand wanes as many of those initial skills are automated by even newer tools. Consider, for instance, the way many data processing and network management jobs that used to require legions of computer operators are now handled by automated monitoring tools. Data science is still in its very early phase, with the amount of data exploding and
the right tools to process them just becoming available.
Although data science is generating new opportunities, our capacity to train new data scientists is not keeping up, and nearly two-thirds of respondents foresee a looming shortfall in the number of data scientists over the next five years. This aligns with other research, including a recent McKinsey Global Institute study that predicts a shortage of 190,000 data scientists by the year 2019iii. And when our respondents were asked where the best source for talent was, few looked to today’s business intelligence professional. Instead, nearly two-thirds looked for today’s
university students.
Who is the Data Scientist? Although the term data science has been around for decades – indeed, most scientists’ use data of some form – the term data scientist in its current context is relatively new, frequently credited to DJ Patil, who started the data science team at LinkedIn.iv But as a new term, the field is still very much in flux, and without evidence about the practitioners, we’re left to speculate about what it may mean. In our survey, we allowed users to self-identify as “data science professionals,” in order to avoid conflicts over terminology in job titles. In this section we’ll attempt to define the data scientist by comparing them with the previous big player in the analytics space, business intelligence professionals.
Twenty years ago, business intelligence was itself a new term, just emerging to take over the various database management and decisions support functions within an organization. As the field grew rapidly in the 90s, it also coalesced around a smaller number of tools, more consistent expectations for talent, better training, and more rigorous organizational standards. As our data demonstrates, data scientists are currently going through that transition,
Students studying computer science
34%
Students studying
fields other than
computer science
24%
Professionals in disciplines other than IT or computer
science 27%
Today's BI professionals
12%
Other 3%
The best source of new Data Science talent is:
Jim Asplund, Chief Scientist at Gallup Consulting, is a data scientist focused on evaluating the role that human perception has on everything from disease conditions and GDP to worker productivity and consumer behavior. He works with massive data sets linking perception with actual behavior, and micro
and macroeconomic outcomes. His work has isolated emotional factors that are most highly related to outcomes
organizations care about.
EMC Data Science Community Survey, 2011
Dre
w C
onw
ay, 2
010
Высшая школа экономики, Москва, 2013
Рабочие инструменты
• Operating systems: • Linux + shell tools
• Big data instruments: • Hadoop (MapReduce) + hadoop tools • Hive, Pig • NoSQL (Hbase, MongoDB, Cassandra, Neo4J)
• Database: • SQL
• Programming: • Python • Java • Scala
• Machine Learning: • R • Matlab • Python libraries (NumPy, SciPy, Nltk,…) • Java libraries (Mahaut)
.
6
Высшая школа экономики, Москва, 2013
День из жизни Data Scientist
Постановка задачи
Получение данных
Разбор форматов, организация
Очистка, фильтрация
Исследование данных
Построение моделей Визуализация Обсуждение
результатов
7
Высшая школа экономики, Москва, 2013
Data Scientist или Аналитик
Если Вы программируете, то скорее всего Вы - Data Scientist, если используете Excel, то - аналитик
• Data Scientist: • Используют Hadoop, MapReduce, Hive, R • Создают специализированные системы и инструменты
• Работают со структурированными и не структурированными данными
• Рабочие данные измеряются в TB, PB • Опыт научной работы, экспертиза в статистке, машинном обучении, программировании
• Магистры и кандидаты наук (PhDs) • Разрабатывают предсказательными модели
• Создают data products
• Analysts: • Используют Excel, SQL • Используют существующие инструменты и системы
• Работают с табличными данными • Данные измеряются MB,GB • Профессиональное образование, нет формального научного
• Бакалавры etc (BS, BA, MS, MBA) • Работают тесно с BI и маркетингом • Занимаются отчетами о показателях работы бизнеса
8
Высшая школа экономики, Москва, 2013
Опрос: роли и навыки Data Scientist
9
From: “Analyzing the Analyzers” by Harlan Harris, Sean Murphy, and Marck Vaisman , O’Reilly Strata 2012
Высшая школа экономики, Москва, 2013
Data Science команда - ”the dream team”
10
From: “Doing Data Science: Straight Talk from the Frontline”, Rachel Schutt, Cathy O'Neil, O'Reilly Media, 2013
Высшая школа экономики, Москва, 2013
Прикладные задачи
• Маркетинг: • Сегментация рынка • Моделирование приобретения и оттока клиентов • Рекомендательные системы • Анализ социальных медиа
• Финансовые и страховые компании: • Предотвращение fraud • Детектирование аномального поведения • Анализ кредитных рисков • Страховые моделирование • Оптимизация портфолио
11
• Здравоохранение и Фармакология: • Генетический анализ • Анализ клинических испытаний • Клинические системы принятия решений
Высшая школа экономики, Москва, 2013
Дорога дальняя…
• Программирование • Алгоритмы и структуры данных • Базы данных • Статистика • Анализ данных • Машинное обучение • Компьютерная обработка текста • Распределенные системы • Инструменты Big Data • Визуализация данных
12
From: Swami Chandrasekaran,Executive Architect, IBM, Watson Solutions
Высшая школа экономики, Москва, 2013
Подготовительные программы в индустрии
©2013 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.
cloudera-intro-data-science-trainingsheet-102
Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304 | 1-888-789-1488 or 1-650-362-0488 | cloudera.com
TRAINING SHEET | 2
Course Outline: Cloudera Introduction to Data Science
Introduction
Data Science Overview > What Is Data Science? > The Growing Need for Data Science
> The Role of a Data Scientist
Use Cases > Finance > Retail > Advertising > Defense and Intelligence > Telecommunications and Utilities > Healthcare and Pharmaceuticals
Project Lifecycle > Steps in the Project Lifecycle
> Lab Scenario Explanation
Data Acquisition > Where to Source Data > Acquisition Techniques
Evaluating Input Data > Data Formats > Data Quantity > Data Quality
Data Transformation > Anonymization > File Format Conversion > Joining Datasets
Data Analysis and Statistical Methods > Relationship Between Statistics and
Probability > Descriptive Statistics > Inferential Statistics
Fundamentals of Machine Learning > Overview > The Three Cs of Machine Learning > Spotlight: Naïve Bayes Classifiers > Importance of Data and Algorithms
Recommender Overview > What Is a Recommender System? > Types of Collaborative Filtering > Limitations of Recommender Systems > Fundamental Concepts
Introduction to Apache Mahout > What Apache Mahout Is (and Is Not) > A Brief History of Mahout > Availability and Installation > Demonstration: Using Mahout’s Item-
Based Recommender
Implementing Recommenders with Apache Mahout
> Overview > Similarity Metrics for Binary Preferences > Similarity Metrics for Numeric Preferences > Scoring
Experimentation and Evaluation > Measuring Recommender Effectiveness > Designing Effective Experiments > Conducting an Effective Experiment > User Interfaces for Recommenders
Production Deployment and Beyond > Deploying to Production > Tips and Techniques for Working at Scale > Summarizing and Visualizing Results > Considerations for Improvement > Next Steps for Recommenders
Conclusion
Appendix A : Hadoop Overview
Appendix B: Mathematical Formulas
Appendix C : Language and Tool Reference
Cloudera Certified Professional: Data Scientist (CCP:DS)Establish yourself as an expert by completing the certification exam for data scientists. CCP:DS is the highest level of technical certification Cloudera offers and certifies your knowledge and skills as a data scientist using Apache Hadoop on large data sets. The credential requires both a multiple-choice Data Science Essentials exam and a hands-on, performance-based Data Science Challenge with a real-world problem on a live system.
TRAINING SHEET
Cloudera Introduction to Data Science: Building Recommender Systems
Take your knowledge to the next level with Cloudera’s Data Science Training and Certification
Data scientists build information platforms to ask and answer previously unimaginable questions. Learn how data science helps companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
Cloudera University’s three-day course helps participants understand what data scientists do and the problems they solve. Through in-class simulations, participants apply data science methods to real-world challenges in different industries and, ultimately, prepare for data scientist roles in the field.
Hands-On Hadoop Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning topics such as:
> The role of data scientists, vertical use cases, and business applications of data science > Where and how to acquire data, methods for evaluating source data, and data
transformation and preparation > Types of statistics and analytical methods and their relationship > Machine learning fundamentals and breakthroughs, the importance of algorithms, and
data as a platform > How to implement and manage recommenders using Apache Mahout and how to set up
and evaluate data experiments > Steps for deploying new analytics projects to production and tips for working at scale
Audience & Prerequisites This course is suitable for developers, data analysts, and statisticians with basic knowledge of Apache Hadoop: HDFS, MapReduce, Hadoop Streaming, and Apache Hive. Students should have proficiency in a scripting language; Python is strongly preferred, but familiarity with Perl or Ruby is sufficient.
Data Scientist CertificationFollowing successful completion of the training class, attendees receive a Data Science Essentials practice test. Data Science Essentials plus the Data Science Challenge constitute the Cloudera Certified Professional: Data Scientist (CCP:DS). Certification is a great differen-tiator; it helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.
The professionalism and expansive technical knowledge demonstrated by our instructor were incredible. The quality of the Cloudera training was on par with a university.
GENERAL DYNAMICS
““
13
Высшая школа экономики, Москва, 2013 14
Подготовительные программы в индустрии
Высшая школа экономики, Москва, 2013
Образовательные программы
Университетские программы: • University of Washington: CertiUcate in Data Science • UC Berkeley: Master of information and data science program • New York University: Data Science at NYU • Columbia University: Institute for Data Sciences and Engineering • University of Southern California (UCS) : Master of Science in Data Science
15
Онлайн курсы обучения (MOOC): • Coursera • edX • Udacity
Ускоренные образовательные программы (компании): • ZipUan Academy (12 weeks intensive program) • Insight Data Science Fellows program ( 6 weeks post doc training)
Высшая школа экономики, Москва, 2013
Конференции
Индустрийные конференции и выставки: • O’Reilly Strata Conference Making Data Work • Hadoop World • Big Data Techcon • Big Data Innovation summits
16
Meetups («кружки по интересам»)
Научные и академические конференции (peer reviewed): • IEEE & ACM Supercomputing • IEEE Big Data • ACM KDD Knowledge Discovery and Data Mining • ACM SIGIR Information Retrieval • ICML International Conference on Machine Learning • ICDM International Conference on Data Mining • NIPS Neural Information Processing • WWW World Wide Web Conference • VLDB Very Large Data Bases • ACM CIKM Information and Knowledge Management • SIAM SDM International Conference on Data Mining • IEEE ICDE Data Engineering • IEEE Visualization
Высшая школа экономики, Москва, 2013
Книги
17
Высшая школа экономики, Москва, 2013
Открытые вопросы
• Насколько важно быть экспертом в предметной области решаемой задачи (domain expertise) ?
• Что более важно в профессии Data Scientist : образование или практический опыт?
• Перспективы профессии Data Scientist, будут ли она замещена программными решениями?
18
Высшая школа экономики, Москва, 2013
ВШЭ Отделение Прикладной Математики и Информатики
Курсы, читаемые на отделении: • Программирование (Python, Java, Matlab) • Архитектура компьютеров и системное программирование • Распределенные системы • Теория баз данных • Дискретная математика • Алгоритмы и структуры данных • Статистическое моделирование и анализ • Численные методы • Прикладная теория графов • Анализ и обработка данных • Методы машинного обучения • Автоматическая обработка текстов • Компьютерная лингвистика • Анализ социальных сетей
• Запускается Магистерская программа «Наука о Данных»
19
101000, Россия, Москва, Мясницкая ул., д. 20 Тел.: (495) 621-7983, факс: (495) 628-7931
www.hse.ru