20170210 sapporotechbar7
ryuji-tamagawa
Transcript of 20170210 sapporotechbar7
1 / 5 : PyData
PyData
▸ Anaconda (Python distribution)
▸ Blaze: "NumPy and pandas interface to Big Data"
▸ dask
▸ Bokeh
▸ Canopy (Python distribution)
▸ IPython
▸ matplotlib
▸ nose
▸ numba (JIT compiler for NumPy code)
▸ SciPy
▸ Statsmodels
▸ SymPy
▸ pandas
▸ NumPy
▸ scikit-image
▸ scikit-learn
4 / 5 : Apache Spark
Hadoop
▸ MapReduce Spark
▸ 2010 Hadoop = MapReduce + HDFS
▸ Hadoop
OS, HDFS
Hive etc.
HBase, MapReduce
YARN
Impala etc. (in-memory SQL engines)
Spark (Spark Streaming, MLlib, GraphX, Spark SQL)
Hadoop
HDFS, S3
YARN, Mesos
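The "Hadoop = MapReduce + HDFS" bullet above refers to the map, shuffle, reduce programming model. As a rough single-process sketch of that model (not Hadoop's actual API), a word count might look like the following; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence inside one process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input; on Hadoop these lines would come from file blocks in HDFS.
lines = ["Spark reads Parquet", "pandas reads Parquet", "Spark and pandas"]
counts = word_count(lines)
print(counts["parquet"], counts["spark"])
```

On a real cluster the map and reduce steps run in parallel across machines and the shuffle moves data over the network; this sketch only shows the shape of the computation.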
4 / 5 : Apache Spark
Apache Spark PyData pandas
Apache Spark pandas
JVM Python
× dask
I/O
Scala, Java, Python, R
JVM, Python
4 / 5 : Apache Spark
▸ SSD
▸ Spark and Parquet
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem
▸ Parquet and Python
5 / 5 : 2017 and the future
Wes McKinney's blog
▸ pandas and Apache Arrow
▸ Blog
▸ PyData Blog
With Wes's OK
▸ 2017: pandas, Arrow, Feather, Parquet, Spark, Ibis
http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81
5 / 5 : 2017 and the future
High speed Apache Parquet for Python
▸ Parquet
▸ Spark
▸ Python
▸ fastparquet
▸ pyarrow