Interactive Data Science From Scratch with Apache Zeppelin and Apache Spark
INTERACTIVE DATA SCIENCE FROM SCRATCH WITH APACHE ZEPPELIN AND APACHE SPARK
FELIX CHEUNG
APACHECON BIG DATA 2016 - MAY
TODAY
• SET UP A VIRTUAL MACHINE ON YOUR LAPTOP – THIS COULD TAKE 25 MIN TO 1 HR
• VM NEEDS 8GB RAM
• IN THE VIRTUAL MACHINE, YOU WILL BE RUNNING SPARK, ZEPPELIN AND OTHERS
• AND YOU WILL RUN SOME DATA PROCESSING AND MACHINE LEARNING USE CASES
VAGRANT
• CREATE AND CONFIGURE LIGHTWEIGHT, REPRODUCIBLE AND PORTABLE DEVELOPMENT ENVIRONMENTS
• $ vagrant up
• MACHINES ARE PROVISIONED ON TOP OF VIRTUALBOX, VMWARE, AWS, OR ANY OTHER PROVIDER. INDUSTRY-STANDARD PROVISIONING TOOLS SUCH AS SHELL SCRIPTS, CHEF, OR PUPPET CAN BE USED TO AUTOMATICALLY INSTALL AND CONFIGURE SOFTWARE ON THE MACHINE
• https://www.vagrantup.com/downloads.html
VIRTUALBOX
• X86 AND AMD64/INTEL64 VIRTUALIZATION
• OPEN SOURCE SOFTWARE, ORACLE
• RUNS ON WINDOWS, LINUX, MACINTOSH, AND SOLARIS HOSTS AND SUPPORTS A LARGE NUMBER OF GUEST OPERATING SYSTEMS INCLUDING BUT NOT LIMITED TO WINDOWS (NT 4.0, 2000, XP, SERVER 2003, VISTA, WINDOWS 7, WINDOWS 8, WINDOWS 10), DOS/WINDOWS 3.X, LINUX (2.4, 2.6, 3.X AND 4.X), SOLARIS AND OPENSOLARIS, OS/2, AND OPENBSD
• https://www.virtualbox.org/wiki/Downloads
LET’S START!
• VAGRANT -> VIRTUALBOX
• https://github.com/felixcheung/vagrant-projects
• SPARK-CASSANDRA-ZEPPELIN
• DOWNLOAD AND PUT THEM IN A DIRECTORY
• $ vagrant up
• SPARK
• SPARK SQL + DATAFRAME + DATA SOURCE
• SPARK STREAMING
• MLLIB
• GRAPHX
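// Word count on an RDD of lines (textFile is assumed to be an RDD[String])
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// Row count per age on a DataFrame (df is assumed to have an "age" column)
val countsByAge = df.groupBy("age").count()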
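To make the two Spark snippets above concrete without a cluster, here is a minimal local sketch in plain Python. It only imitates what flatMap, map, reduceByKey and groupBy(...).count() compute; it is not Spark code, and the names word_count, count_by, sample_lines and sample_rows are made up for illustration.

```python
from collections import defaultdict


def word_count(lines):
    """Imitate flatMap -> map -> reduceByKey on a local list of strings."""
    # flatMap: split every line into words and flatten the result
    words = [word for line in lines for word in line.split(" ")]
    # map: pair every word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey(_ + _): add up the counts that share the same key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def count_by(rows, key):
    """Imitate df.groupBy(key).count() on a local list of dicts."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


# Hypothetical sample data, not taken from the talk's notebooks
sample_lines = ["to be or not", "to be"]
sample_rows = [
    {"name": "a", "age": 30},
    {"name": "b", "age": 30},
    {"name": "c", "age": 41},
]

print(word_count(sample_lines))
print(count_by(sample_rows, "age"))
```

In Spark the same steps run in parallel across partitions, and reduceByKey combines partial sums per key before shuffling.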
ZEPPELIN
APACHE ZEPPELIN (INCUBATING)
• INTERACTIVE DATA ANALYTICS ENVIRONMENT FOR DISTRIBUTED DATA PROCESSING SYSTEM. IT PROVIDES BEAUTIFUL INTERACTIVE WEB-BASED INTERFACE, DATA VISUALIZATION, COLLABORATIVE WORK ENVIRONMENT AND MANY OTHER NICE FEATURES TO MAKE YOUR DATA ANALYTICS MORE FUN AND ENJOYABLE.
• ZEPPELIN HAS BEEN INCUBATING SINCE DEC 2014. https://zeppelin.incubator.apache.org/
• REALTIME COLLABORATION - ENABLED BY WEBSOCKET COMMUNICATIONS
• FRONTEND: ANGULARJS
• BACKEND SERVER: JAVA
• INTERPRETERS: JAVA
• VISUALIZATION: NVD3
INTERPRETERS
• ALLUXIO (WAS TACHYON)
• CASSANDRA
• ELASTICSEARCH
• FLINK
• GEODE
• HBASE
• HDFS
• HIVE
• IGNITE
• JDBC/PHOENIX/POSTGRESQL/HAWQ
• LENS
• MARKDOWN
• R
• SCALDING
• SHELL
• SPARK
• TAJO
LET’S CHECK THIS OUT
NOTEBOOK - START
• HTTPS://GITHUB.COM/FELIXCHEUNG/SPARK-NOTEBOOK-EXAMPLES/TREE/MASTER/ZEPPELIN_NOTEBOOK/APACHECON2016
SPARK INTERPRETER
• CLUSTER MODE – MASTER
• SPARK.EXECUTOR.MEMORY
• http://spark.apache.org/docs/latest/configuration.html
• “SHARED” INTERPRETER
• CREATE ADDITIONAL INTERPRETER INSTANCES
NOTEBOOK - WIKIPEDIA
• HTTPS://GITHUB.COM/FELIXCHEUNG/SPARK-NOTEBOOK-EXAMPLES/TREE/MASTER/ZEPPELIN_NOTEBOOK/APACHECON2016
MACHINE LEARNING WITH SPARK
K-MEANS
• K-MEANS CLUSTERING AIMS TO PARTITION N OBSERVATIONS INTO K CLUSTERS IN WHICH EACH OBSERVATION BELONGS TO THE CLUSTER WITH THE NEAREST MEAN, SERVING AS A PROTOTYPE OF THE CLUSTER.
• SPARK – K-MEANS||: PARALLELIZED VARIANT OF THE K-MEANS++ METHOD
• http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
• SPARK – STREAMING K-MEANS
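The definition above (each observation belongs to the cluster with the nearest mean) can be sketched in a few lines of plain Python. This is the basic Lloyd iteration with hand-picked starting centers, assumed here for simplicity. It is not Spark's k-means|| initialization and not the MLlib API, and the sample points are invented.

```python
import math


def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, centers, iterations=10):
    """Lloyd iterations: assign points to the nearest mean, then recompute means."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: group points by their nearest center
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: each center becomes the mean of its members
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(len(old)))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # assignments are stable, so stop early
        centers = new_centers
    return centers


# Hypothetical 2-D observations forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = kmeans(points, centers=[points[0], points[3]])
labels = [closest(p, centers) for p in points]
print(centers)
print(labels)
```

The result depends on the starting centers, which is the problem k-means++ and its parallel variant k-means|| address.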
GRAPH
• GRAPH-PARALLEL COMPUTATION
GRAPHFRAMES
• POWER OF GRAPHX + DATAFRAME
• MOTIF FINDING (A)-[E]->(B); (B)-[E2]->(A)
• BREADTH-FIRST SEARCH (BFS)
• CONNECTED COMPONENTS
• PAGERANK
• SHORTEST PATHS
• TRIANGLE COUNT
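Two of the operations listed above, the motif (a)-[e]->(b); (b)-[e2]->(a) and breadth-first search, are easy to picture on a small edge list. The sketch below is plain Python over a list of (src, dst) tuples, not the GraphFrames API; the function names and the sample edges are made up for illustration.

```python
from collections import deque


def find_reciprocal(edges):
    """Pairs (a, b) where both a->b and b->a exist, i.e. the two-edge motif."""
    edge_set = set(edges)
    return sorted((a, b) for (a, b) in edge_set if (b, a) in edge_set)


def bfs_path(edges, start, goal):
    """Path with the fewest hops from start to goal, or None if unreachable."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical directed graph
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("a", "d")]

print(find_reciprocal(edges))
print(bfs_path(edges, "a", "c"))
```

GraphFrames expresses the same ideas as DataFrame queries, so the matching is distributed rather than held in one in-memory set.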
NOTEBOOK – K-MEANS – EXPLORATORY ANALYSIS
NOTEBOOK – GRAPHFRAMES
“BIG DATA”
HBASE
• OPEN-SOURCE, DISTRIBUTED, VERSIONED, NON-RELATIONAL DATABASE
• MODELED AFTER GOOGLE'S BIGTABLE: A DISTRIBUTED STORAGE SYSTEM FOR STRUCTURED DATA
• AUTOMATIC FAILOVER SUPPORT BETWEEN REGIONSERVERS
• BLOCK CACHE AND BLOOM FILTERS FOR REAL-TIME QUERIES
• QUERY PREDICATE PUSH DOWN VIA SERVER SIDE FILTERS
• THRIFT GATEWAY AND A REST-FUL WEB SERVICE THAT SUPPORTS XML, PROTOBUF, AND BINARY DATA
• EXTENSIBLE JRUBY-BASED (JIRB) SHELL
• https://hbase.apache.org/book.html#quickstart
HBASE SCRIPTING
• create 'test', 'cf'
• list 'test'
• put 'test', 'row1', 'cf:a', 'value1'
• put 'test', 'row2', 'cf:b', 'value2'
• scan 'test'
• get 'test', 'row1'
• disable 'test'
• enable 'test'
CASSANDRA
CASSANDRA
• BORN AT FACEBOOK AND BUILT ON AMAZON’S DYNAMO AND GOOGLE’S BIGTABLE
• DISTRIBUTED DATABASE
• STRUCTURED DATA
• HIGHLY AVAILABLE SERVICE AND NO SINGLE POINT OF FAILURE
• MASTERLESS “RING” DESIGN
• USE CASES: PETABYTES OF DATA IN CLUSTERS OF OVER 75,000 NODES
• http://wiki.apache.org/cassandra/GettingStarted
CASSANDRA QUERY LANGUAGE (CQL)
• CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
• CREATE TABLE users ( user_id int PRIMARY KEY, fname text, lname text );
• INSERT INTO users (user_id, fname, lname) VALUES (1745, 'john', 'smith');
• SELECT max(lname), lname, count(*) FROM users;
NOTEBOOK – HBASE
NOTEBOOK – CASSANDRA, SPARK-CASSANDRA
TIPS
• AFTER VAGRANT HALT YOU NEED TO RESTART ZEPPELIN
• $ vagrant up
• $ vagrant ssh
• $ sudo service zeppelin start
• TO RESTART HBASE OR CASSANDRA, SEE THE COMMAND HERE
• TO START CLEANLY, CONSIDER REBUILDING THE VM FROM SCRATCH: $ vagrant destroy
• /opt/apache-cassandra-3.5/bin/cqlsh
• “Connected to Test Cluster at 127.0.0.1:9042.”
• https://www.digitalocean.com/community/tutorials/how-to-install-cassandra-and-run-a-single-node-cluster-on-a-ubuntu-vps
• CASSANDRA 3.0 AND LATER REQUIRE JAVA 8U40 OR LATER.
SCALING UP – CLOUD-BASED ARCHITECTURE
DC/OS
• OPEN SOURCE DATACENTER-SCALE OPERATING SYSTEM FOR BUILDING AND RUNNING MODERN APPS WITH EASE.
• EASILY DEPLOY AND RUN STATEFUL OR STATELESS DISTRIBUTED WORKLOADS INCLUDING DOCKER CONTAINERS, BIG DATA, AND TRADITIONAL APPS.
• RUNNING ON APACHE MESOS
• GUI/CLI
• “UNIVERSE” APP STORE
• SERVICE DISCOVERY AND MONITORING
• ENTERPRISE DC/OS
AMAZON WEB SERVICES (AWS)
• “CLOUD”
• COMPUTE
• STORAGE, CONTENT
• DATABASE
• NETWORKING
• HADOOP CLUSTER
SMACKZ = SMACK + Z
• SPARK – SCALABLE DATA PROCESSING AND ANALYTICS
• MESOS – CLUSTER MANAGER, RESOURCE SHARING, SCHEDULING
• AKKA – ACTOR-BASED CONCURRENT, DISTRIBUTED APPLICATION FRAMEWORK
• CASSANDRA – DISTRIBUTED DATABASE
• KAFKA – HIGH-THROUGHPUT, LOW-LATENCY, PUB-SUB MESSAGING
• ZEPPELIN – DATA MANIPULATION AND VISUALIZATION INTERFACE
• SINGLE INTERFACE, HIGHLY SCALABLE CLUSTER
SMACKZ – SMACK STACK + ZEPPELIN
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
SMACK
• https://docs.mesosphere.com/administration/installing/cloud/aws/
• https://dcos.io/docs/1.7/administration/installing/cloud/aws/
• $ dcos package install spark
$ dcos package install cassandra
$ dcos package install kafka
ZEPPELIN
• $ cat options.json
{ "zeppelin": { "role": "slave_public" } }
• $ dcos package install --options=options.json zeppelin
ZEPPELIN
• RUNS ON PUBLIC NODE
• RUNS FROM MARATHON
$ dcos marathon task list
$ dcos marathon task show zeppelin….
• https://dcos.io/docs/1.7/usage/tutorials/spark/
CONTACT ME
• GITHUB: https://github.com/felixcheung