QCon Rio 2015 - Stock Predictions with Spark-Geode-Zeppelin.key

© 2015 Pivotal Software, Inc. All rights reserved.

Building a Stock Prediction System with Machine Learning using Geode, Spring XD and Spark MLlib

William Markito (@william_markito)

Fred Melo (@fredmelo_br)

Transcript of QCon Rio 2015 - Stock Predictions with Spark-Geode-Zeppelin.key

It's all about DATA

Data Sources

Look for patterns

Prediction

Inputs: price(x), moving average(x), relative strength(x)

Machine Learning Model (e.g. Linear Regression)

Output: moving average(x+1)
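The model on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speakers' Spark MLlib code: the average is taken to be a simple moving average, the relative strength uses a simple-average RSI variant, and the helper names (`moving_average`, `relative_strength`, `training_set`, `fit_linear`, `predict`) are made up for the example.

```python
def moving_average(prices, window):
    """Simple moving average of the last `window` prices."""
    recent = prices[-window:]
    return sum(recent) / len(recent)


def relative_strength(prices, window):
    """RSI-style indicator (simple-average variant) on a 0-100 scale."""
    recent = prices[-(window + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        return 100.0  # no losing step in the window (assumed convention)
    return 100.0 - 100.0 / (1.0 + gains / losses)


def training_set(prices, window):
    """Rows of (price, moving avg, relative strength) at x; target is moving avg at x+1."""
    rows, targets = [], []
    for i in range(window, len(prices) - 1):
        seen = prices[:i + 1]
        rows.append((prices[i], moving_average(seen, window),
                     relative_strength(seen, window)))
        targets.append(moving_average(prices[:i + 2], window))
    return rows, targets


def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w1, w2, ...]. Assumes the features are not collinear.
    """
    xs = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n = len(xs[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(x[i] * x[j] for x in xs) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(xs, targets))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, row):
    """Apply the fitted weights to one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))
```

With a long enough price history, `fit_linear(*training_set(prices, window))` gives the weights and `predict` applies them to the latest indicators. On very short or very regular series the three features can be collinear, in which case the normal equations have no unique solution.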

© Copyright 2014 Pivotal. All rights reserved.

Spring XD

Extensible, open-source, fault-tolerant, horizontally scalable, cloud-native

[Pipeline diagram. Processing steps: Transform, Enrich, Filter, Split, Sink. Stages numbered 1 to 3; labels: Indicators, Machine Learning, Predict, Dashboard. Data sources: Real data, Simulator. Regions: /Stocks, /TechIndicators, /Predictions]

Apache Geode (incubating)

Introduction

Introduction

A distributed, memory-based data management platform for data-oriented apps that need:

• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture

Concepts

• Cache
  • In-memory storage and management for your data
  • Configurable through XML, Spring, Java API or CLI
  • Collection of Regions

[Diagram: a JVM hosting a Cache that contains three Regions]

Concepts

• Region
  • Distributed java.util.Map on steroids (key/value)
  • Consistent API regardless of where or how data is stored
  • Observable (reactive)
  • Highly available, with redundant copies on cache members

[Diagram: a Region inside a Cache inside a JVM, shown as a java.util.Map]

Key  Value
K01  May
K02  Tim
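The "observable map" idea can be illustrated with a few lines of plain Python. This is an analogy only, not Geode's API: `ObservableRegion` and its methods are hypothetical names, and the callback signature is an assumption for the example. A real Geode region notifies registered listeners in a comparable way when entries are created, updated or destroyed.

```python
class ObservableRegion:
    """A dict-like key/value store that notifies listeners on every change."""

    def __init__(self, name):
        self.name = name
        self._data = {}
        self._listeners = []

    def add_listener(self, callback):
        """Register callback(event, key, old_value, new_value)."""
        self._listeners.append(callback)

    def put(self, key, value):
        event = "update" if key in self._data else "create"
        old = self._data.get(key)
        self._data[key] = value
        self._notify(event, key, old, value)

    def get(self, key):
        return self._data.get(key)

    def destroy(self, key):
        old = self._data.pop(key)
        self._notify("destroy", key, old, None)

    def _notify(self, event, key, old, new):
        for callback in self._listeners:
            callback(event, key, old, new)


events = []
people = ObservableRegion("people")
people.add_listener(lambda event, key, old, new: events.append((event, key, old, new)))
people.put("K01", "May")
people.put("K02", "Tim")
people.put("K01", "Mae")
people.destroy("K02")
```

After these calls, `events` holds one tuple per change, in order, so a dashboard or downstream job could react to each one without polling the map.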

Concepts

• Region
  • Local, Replicated or Partitioned
  • In-memory or persistent
  • Redundant
  • LRU
  • Overflow

[Diagram: two JVMs, each with a Cache holding a Region (java.util.Map) with the entries K01 = May and K02 = Tim]

Region shortcuts:

LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW

PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW

REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
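To make "Partitioned" and "Redundant" concrete, here is a small plain-Python sketch. It is not how Geode actually places data (Geode assigns keys to buckets and manages them itself); the `owners` function, the `PartitionedRegion` class, the CRC32 hash and the member names are all assumptions chosen to show the idea: each key lives on one primary member plus a configurable number of backup copies.

```python
import zlib


def owners(key, members, redundant_copies=1):
    """Primary member plus `redundant_copies` backups for a key."""
    start = zlib.crc32(str(key).encode("utf-8")) % len(members)
    count = min(redundant_copies + 1, len(members))
    return [members[(start + i) % len(members)] for i in range(count)]


class PartitionedRegion:
    """Toy partitioned key/value store with redundant copies."""

    def __init__(self, members, redundant_copies=1):
        self.members = list(members)
        self.redundant_copies = redundant_copies
        self.storage = {member: {} for member in self.members}

    def put(self, key, value):
        # Write the entry to the primary and to every backup.
        for member in owners(key, self.members, self.redundant_copies):
            self.storage[member][key] = value

    def get(self, key, failed=()):
        # Read from the first owner that is still alive.
        for member in owners(key, self.members, self.redundant_copies):
            if member not in failed and key in self.storage[member]:
                return self.storage[member][key]
        return None


region = PartitionedRegion(["server1", "server2", "server3"], redundant_copies=1)
region.put("K01", "May")
primary, backup = owners("K01", region.members, 1)
```

With one redundant copy, `region.get("K01", failed=[primary])` still returns the value from the backup; only losing both owners makes the entry unavailable.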

Concepts

• Member
  • A process that has a connection to the system
  • A process that has created a cache
  • Embeddable within your application

[Diagram: Client, Locator, Server]

Concepts

• Client cache
  • A process connected to the Geode server(s)
  • Can have a local copy of the data
  • Can be notified about events on the servers

[Diagram: an Application with a Client Cache connected to a GemFire Server that hosts several Regions]

Concepts

• Listeners
  • CacheWriter / CacheListener
  • AsyncEventListener (queue / batch)
  • Parallel or Serial
  • Conflation
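Queueing, batching and conflation can be sketched as follows. This is a plain-Python illustration, not Geode's AsyncEventListener interface: `ConflatingQueue`, `offer` and `flush` are hypothetical names. Events are queued per key, a newer value for a key already in the queue replaces the older one (conflation), and the listener receives events in batches.

```python
from collections import OrderedDict


class ConflatingQueue:
    """Queue that keeps only the latest value per key and delivers batches."""

    def __init__(self, batch_size, listener):
        self.batch_size = batch_size
        self.listener = listener          # called with a list of (key, value)
        self._pending = OrderedDict()

    def offer(self, key, value):
        # Conflation: overwrite the queued value, keep the key's queue position.
        self._pending[key] = value
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Deliver everything queued so far as one batch."""
        if self._pending:
            batch = list(self._pending.items())
            self._pending.clear()
            self.listener(batch)


batches = []
queue = ConflatingQueue(batch_size=2, listener=batches.append)
queue.offer("AAPL", 120.1)
queue.offer("AAPL", 120.4)   # replaces the queued 120.1
queue.offer("GOOG", 640.0)   # second distinct key: batch is delivered
queue.offer("AAPL", 120.9)
queue.flush()                # deliver the remainder
```

The listener sees two batches instead of four separate events, and the intermediate AAPL price never reaches it. That trade-off suits slow downstream systems that only need the latest state.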

Apache Geode (incubating)

• Currently under incubation at the Apache Software Foundation
• Contributions and contributors are welcome:
  • Code and patches
  • Bugs, feature requests
  • Documentation and content
  • Any form of feedback

Apache Geode (incubating)

• Code
  • New features
  • Bug fixes (patches)
  • Writing tests
• Documentation
  • Wiki
  • Web site
  • User guides
• Community
  • Join our mailing lists (ask or answer)
  • Become a speaker
  • Find and report bugs
  • Test a release candidate or beta

Apache Geode (incubating)

• JIRA - https://issues.apache.org/jira/browse/GEODE
• GitHub - https://github.com/apache/incubator-geode
• Mailing lists:
  • Development - [email protected]
  • Users - [email protected]
• Wiki - cwiki.apache.org/confluence/display/GEODE
• StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire

Spring XD

Introduction

Concepts

Runs as a distributed application or as a single node

Concepts

• A stream is composed from modules. Each module is deployed to a container and its channels are bound to the transport.
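The "stream composed from modules" idea can be mimicked in one process with Python generators. This is only an analogy, not Spring XD's DSL or runtime: in Spring XD the modules run in containers and exchange messages over a transport, while here generators stand in for the channels. The module names (`parse`, `only_positive`, `sink`), the `build_stream` helper and the sample quotes are assumptions for the example.

```python
def parse(messages):
    """Transform module: turn raw 'SYMBOL,PRICE' lines into dicts."""
    for line in messages:
        symbol, price = line.split(",")
        yield {"symbol": symbol.strip(), "price": float(price)}


def only_positive(messages):
    """Filter module: drop quotes with a non-positive price."""
    return (m for m in messages if m["price"] > 0)


def build_stream(source, *modules):
    """Chain modules so each one consumes the previous module's output."""
    stream = iter(source)
    for module in modules:
        stream = module(stream)
    return stream


def sink(stream, region):
    """Sink module: store the latest quote per symbol in a dict."""
    for message in stream:
        region[message["symbol"]] = message
    return region


raw_quotes = ["AAPL, 120.5", "XYZ, -1", "GOOG, 640.0"]
stocks = sink(build_stream(raw_quotes, parse, only_positive), {})
```

Because each module only sees an iterator of messages, modules can be reordered or replaced without touching the others, which is the property the stream abstraction is meant to provide.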

Apache Zeppelin (incubating)

Introduction

Concepts

• Web-based REPL
• Iterative & Exploratory
• Support for Data Ingestion

Concepts

• Multiple interpreters
  • Markdown
  • Shell
  • Spark
  • Geode
  • Python…

Concepts

• Sharing through URLs without Reports

Apache Spark

Introduction

Concepts

• RDD
• DataFrame
• Driver
• Worker

"An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."

"A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD."
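The RDD definition quoted earlier (an immutable collection split into partitions that can be computed independently) can be mimicked in a few lines of plain Python. This is a toy, single-process sketch, not Spark: `MiniRDD` is a hypothetical class, and only the method names `parallelize`, `map`, `reduce` and `collect` are borrowed from Spark's vocabulary.

```python
import functools
from operator import add


class MiniRDD:
    """Immutable collection stored as a tuple of partitions."""

    def __init__(self, partitions):
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions):
        """Spread items over `num_partitions` partitions, round-robin."""
        items = list(items)
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Each partition is transformed independently; the original is untouched.
        return MiniRDD(tuple(fn(x) for x in part) for part in self._partitions)

    def reduce(self, fn):
        # Reduce inside each partition first, then combine the partial results.
        partials = [functools.reduce(fn, part) for part in self._partitions if part]
        return functools.reduce(fn, partials)

    def collect(self):
        return [x for part in self._partitions for x in part]

    def num_partitions(self):
        return len(self._partitions)


prices = MiniRDD.parallelize([1, 2, 3, 4, 5], num_partitions=2)
doubled = prices.map(lambda x: x * 2)
total = doubled.reduce(add)
```

`map` returns a new collection rather than changing `prices`, and `reduce` works partition by partition before combining, which is why the reducing function needs to be associative.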

Summary

[Spring XD]

• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable

[Apache Zeppelin]

• Web-based REPL
• Multiple interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory

[Apache Spark]

• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming

[Apache Geode]

• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients

Source Code

http://pivotal-open-source-hub.github.io/StockInference-Spark/