20170210 sapporotechbar7

PyData & Apache Spark, 2017 / 2 / 10, Sapporo TechBar #7

▸ facebook : Ryuji Tamagawa

▸ Twitter : tamagawa_ryuji

▸ FB

techbar

▸ FB

▸ Twitter

5

Python

PyData

Apache Spark

Jupyter Notebook

2017 and the future

Pandas

PyData

1 / 5 : PyData

PyData.org

PyData

▸ Anaconda (Python distribution)

▸ Blaze ("NumPy and pandas interface to Big Data")

▸ dask

▸ Bokeh

▸ Canopy (Python distribution)

▸ IPython

▸ matplotlib

▸ nose

▸ numba (JIT compiler)

▸ NumPy

▸ SciPy

▸ Statsmodels

▸ SymPy

▸ pandas

▸ scikit-image

▸ scikit-learn

pandas

2 / 5 : pandas

pandas

▸ NumPy SciPy

▸ DataFrame

pandas Wes McKinney

DataFrame
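
As a minimal sketch of what working with a pandas DataFrame can look like: the column names and values below are invented for illustration and do not come from the talk.

```python
import pandas as pd

# Hypothetical sample data: columns and values are made up for this sketch.
df = pd.DataFrame(
    {
        "city": ["Sapporo", "Sapporo", "Tokyo", "Tokyo"],
        "year": [2016, 2017, 2016, 2017],
        "visitors": [120, 150, 300, 330],
    }
)

# Row filtering with a boolean mask.
recent = df[df["year"] == 2017]

# Split-apply-combine: total visitors per city.
totals = df.groupby("city")["visitors"].sum()

print(recent)
print(totals)
```

The same two operations, filtering and grouped aggregation, reappear almost unchanged in most DataFrame-style APIs.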

Python

▸ PyData pandas

Jupyter Notebook

3 / 5 : Jupyter Notebook

IPython Notebook

▸ Jupyter Notebook

▸ Julia, Python, R

▸ JupyterCon

pandas / matplotlib
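
A rough sketch of the pandas and matplotlib combination as it might be used in a notebook cell. The series is generated data, not anything shown in the talk. The `Agg` backend and the in-memory PNG buffer are assumptions added so the snippet also runs outside Jupyter; in a notebook the figure would normally be displayed inline instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Hypothetical daily series: generated values, for illustration only.
index = pd.date_range("2017-01-01", periods=30, freq="D")
series = pd.Series(np.arange(30, dtype=float), index=index, name="value")

# pandas hands plotting off to matplotlib and returns the Axes object.
ax = series.rolling(window=7).mean().plot(title="7-day rolling mean")
ax.set_ylabel("value")

# Render the figure to PNG bytes in memory.
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
```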

Interactive Widget

▸ Learning Jupyter

Apache Spark

4 / 5 : Apache Spark

Hadoop

▸ MapReduce Spark

▸ 2010 Hadoop = MapReduce + HDFS

▸ Hadoop

Diagram: Hadoop ecosystem components

▸ OS, HDFS

▸ Hive etc.

▸ HBase, MapReduce

▸ YARN

▸ Impala etc. (in-memory SQL engine)

▸ Spark (Spark Streaming, MLlib, GraphX, Spark SQL)

Hadoop

▸ HDFS / S3

▸ YARN / Mesos

Apache Spark PyData pandas

Apache Spark pandas

JVM Python

× dask

I/O

Scala, Java, Python, R

JVM, Python

Spark

▸ 1 PC

Hadoop / MapReduce

DataFrame

Apache Spark

▸ Parquet

Machine Learning

▸ scikit-learn

▸ Spark MLlib / ML

▸ TensorFlow

▸ Python

2017 and the future

5 / 5 : 2017 and the future

PyData

▸ Spark - pandas

▸ pandas → Spark …

Wes McKinney's blog

▸ pandas Apache Arrow

▸ Blog

▸ PyData Blog

Wes OK

▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis

http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81

High speed Apache Parquet for Python

▸ Parquet

▸ Spark

▸ Python

▸ Fastparquet

▸ pyarrow

Apache Arrow

▸ Apache Arrow

▸ PyData / OSS