DIY Analytics with Apache Spark

  1. DIY ANALYTICS WITH APACHE SPARK. Adam Roberts. London, 22nd June 2017: originally presented at Geecon
  2. Important disclaimers. Copyright 2017 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM.
All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer's responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer's business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law. Information within this presentation is accurate to the best of the author's knowledge as of the 4th of June 2017.
  3. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, Bluemix, Blueworks Live, CICS, Clearcase, DOORS, Enterprise Document Management System, Global Business Services, Global Technology Services, Information on Demand, ILOG, LinuxONE, Maximo, MQIntegrator, MQSeries, Netcool, OMEGAMON, OpenPower, PureAnalytics, PureApplication, pureCluster, PureCoverage, PureData, PureExperience, PureFlex, pureQuery, pureScale, PureSystems, QRadar, Rational, Rhapsody, SoDA, SPSS, StoredIQ, Tivoli, Trusteer, urban{code}, Watson, WebSphere, Worklight, X-Force and System z Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. Oracle and Java are registered trademarks of Oracle and/or its affiliates.
Other names may be trademarks of their respective owners. A current list of IBM trademarks is available on the Web under "Copyright and trademark information". Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Apache Kafka and any other Apache project mentioned here and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundation.
  4. Stick around for... Showing you how to get started from scratch: going from "I've heard about Spark" to "I can use it for...". Worked examples aplenty: lots of code. Not intended to be scientifically accurate! Sharing ideas. Useful reference material. Slides will be hosted.
  5. Motivation! Doing stuff yourself (within your timeframe and rules). Findings can be subject to bias: yours don't have to be. Trust the data instead.
  6. Cool projects involving Spark. Finding aliens with the SETI Institute. Genomics projects (GATK, Bluemix Genomics). IBM Watson services.
  7. Your DIY analytics toolkit. Powerful machine(s). Apache Spark and a JDK. Scala (recommended). Optional: visualisation library for Spark output, e.g. Python with bokeh, pandas. Optional but not covered here: a notebook bundled with Spark like Zeppelin, or use Jupyter. Toolbox from Wikimedia: Tanemori derivative work.
  8. Why listen to me? Worked on Apache Spark since 2014. Helping IBM customers use Spark for the first time. Resolving problems, educating service teams. Testing on lots of IBM platforms since Spark 1.2: x86, Power, Z systems, all Java 8 deliverables... Fixing bugs in Spark/Java: contributing code and helping others to do so. Working with performance tuning pros. Code provided here has an emphasis on readability!
  9. What I'll be covering today. What is it (why the hype)? How to answer questions with Spark. Core Spark functions (the bread and butter stuff), plotting, correlations, machine learning. Built-in utility functions to make our lives easier (labels, features, handling nulls). Examples using data from wearables: two years of activity.
  10. Ask me later if you're interested in... Spark on IBM hardware. IBM SDK for Java specifics. Notebooks. Spark using GPUs/GPUs from Java. Performance tuning. Comparison with other projects. War stories fixing Spark/Java bugs.
  11. What I assume... You know how to write Java or Scala. You've heard about Spark but never used it. You have something to process!
  12. This talk won't make you a superhero!
  13. But you will... Know more about Spark: what it can/can't do. Know more about machine learning in Spark. Know that machine learning's still hard, but in different ways.
  14. Open source project (the most active for big data) offering distributed... Machine learning. Graph processing. Core operations (map, reduce, joins). SQL syntax with DataFrames/Datasets.
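For readers coming from Java, the core operations named on this slide (map, reduce, joins) can be pictured with ordinary in-memory collections first. The sketch below is an analogy in plain Java, not Spark's API: the `Reading` and `User` classes, the method names and the sample step counts are hypothetical, invented for illustration, and everything runs in one JVM rather than across a cluster.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogy only: no Spark classes are used here.
final class CoreOpsSketch {

    // Hypothetical types invented for this illustration.
    static final class Reading {
        final int userId;
        final int steps;
        Reading(int userId, int steps) { this.userId = userId; this.steps = steps; }
    }

    static final class User {
        final int userId;
        final String name;
        User(int userId, String name) { this.userId = userId; this.name = name; }
    }

    // map + reduce: turn each reading into its step count, then sum the counts.
    static int totalSteps(List<Reading> readings) {
        return readings.stream()
                .map(r -> r.steps)
                .reduce(0, Integer::sum);
    }

    // join: match readings to users on userId (an inner join done with a
    // lookup table), then total the steps per user name.
    static Map<String, Integer> stepsByName(List<User> users, List<Reading> readings) {
        Map<Integer, String> nameById = new HashMap<>();
        for (User u : users) {
            nameById.put(u.userId, u.name);
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Reading r : readings) {
            String name = nameById.get(r.userId);
            if (name != null) { // readings without a matching user are dropped
                totals.merge(name, r.steps, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(new User(1, "ann"), new User(2, "bob"));
        List<Reading> readings = Arrays.asList(
                new Reading(1, 100), new Reading(2, 200), new Reading(1, 300));
        System.out.println("total steps: " + totalSteps(readings));
        System.out.println("steps by name: " + stepsByName(users, readings));
    }
}
```

Spark offers operations of the same shape on distributed collections, which is what lets this style of code scale past a single machine.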
  15. Build it yourself from source (requiring Git, Maven, a JDK), or download a community-built binary, or download our free Spark development package (includes IBM's SDK for Java).
  16. Things you can process... File formats you could use with Hadoop. Anything there's a Spark package for. json, csv, parquet... Things you can use with it... Kafka for streaming. Hive tables. Cassandra as a database. Hadoop (using HDFS with Spark). DB2!
  17. What's so good about it then?
  18. Offers scalability and resiliency. Auto-compression, fast serialisation, caching. Python, R, Scala and Java APIs: eligible for Java-based optimisations. Distributed machine learning!
  19. Why isn't everyone using it?
  20. Not every problem is a Spark one! Can you get away with using spreadsheet software? Have you really got a large amount of data? Data preparation is very important! How will you properly handle negative, null, or otherwise strange values in your data? Will you benefit from massive concurrency? Is the data in a format you can work with? Does it need transforming first (and is it worth it)?
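One way to make the question about null and negative values concrete is to pick a policy and encode it as a small filter before any analysis runs. The following is a minimal plain-Java sketch, assuming the policy is simply to drop missing and negative readings; the class name, method name and step-count values are hypothetical, and no Spark API is used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Plain-Java sketch of one possible data-preparation policy (assumption:
// missing and negative readings are discarded rather than repaired).
final class CleanReadingsSketch {

    // Keeps only readings that are present (non-null) and non-negative,
    // preserving their original order.
    static List<Integer> clean(List<Integer> rawSteps) {
        return rawSteps.stream()
                .filter(Objects::nonNull)
                .filter(steps -> steps >= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical daily step counts with a missing and a negative entry.
        List<Integer> raw = Arrays.asList(1200, null, -5, 0, 8000);
        System.out.println("cleaned: " + clean(raw));
    }
}
```

Dropping rows is only one choice; replacing bad values with a default or an average is another, and which is right depends on the question being asked of the data.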
  21. Implementation details... Not really real-time streaming (micro-batching). Debugging in a largely distributed system with many moving parts can be tough. Security: not really locked down out of the box (extra steps required by knowledgeable users: whole disk encryption or using other projects, SSL config to do...).
  22. Getting something up and running quickly
  23. Running something simple. Run any Spark example in local mode first (from spark), e.g. bin/run-example org.apache.spark.examples.SparkPi 100, then run it on a cluster you can set up yourself: add hostnames in conf/slaves, sbin/, bin/run-example master ... Check for running Java processes: looking for workers/executors coming and going. Spark UI (default port 8080 on the master). See: lib is only with the IBM package.
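The SparkPi example estimates pi with a Monte Carlo simulation: it picks random points in a square and counts how many land inside the inscribed circle. As a rough illustration of that calculation only, here is a single-threaded plain-Java sketch with no Spark dependency. The class name, method name, sample count and seed are assumptions made for this example, whereas the real example spreads the sampling across partitions.

```java
import java.util.Random;

// Plain-Java sketch of the Monte Carlo idea behind SparkPi (no Spark used).
final class PiSketch {

    // Samples points uniformly in the square [-1, 1] x [-1, 1] and returns
    // 4 * (fraction of points inside the unit circle) as an estimate of pi.
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed); // fixed seed keeps runs repeatable
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimatePi(1_000_000, 42L));
    }
}
```

In the Spark version the per-point work is a map over a distributed range and the count is a reduce, so adding cores or machines mainly means more samples can be drawn in the same time.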
  24. And you can use Spark's Java/Scala APIs with: bin/spark-shell (a REPL!); bin/spark-submit; java/scala -cp "$SPARK_HOME/jars/*". PySpark is not covered in this presentation, but it is fun to experiment with and there are lots of good docs online.
  25. Increasing the number of threads available for Spark processing in local mode (5.2 GB text file) actually works? --master local[1]: real 3m45.328s. --master local[4]: real 1m31.889s. time { echo "--master local