Building a Stock Prediction system with Machine Learning using Geode, Spring XD and Spark MLlib

Posted on 26-Jan-2017

© 2015 Pivotal Software, Inc. All rights reserved.

Building a Stock Prediction system with Machine Learning using Geode, Spring XD and Spark MLlib

William Markito (@william_markito)

Fred Melo (@fredmelo_br)

It's all about DATA

Data Sources

Look for patterns

Prediction

[Diagram: price(x), medium avg (x) and relative strength (x) feed a Machine Learning Model (e.g. Linear Regression), which predicts medium avg (x+1)]
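The idea in this diagram, using price(x), medium avg (x) and relative strength (x) to estimate medium avg (x+1) with a linear regression, can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the Spark MLlib code from the talk: the function names, the least-squares solver and the way the features are passed in are all made up for this example.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; the features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows: Sequence[Sequence[float]],
                          targets: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations.

    Each row is (price, medium_avg, relative_strength) at time x and each
    target is the medium avg at time x+1 (hypothetical layout).
    Returns [intercept, w_price, w_avg, w_strength].
    """
    design = [[1.0] + list(row) for row in rows]  # prepend intercept term
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Estimate the next medium avg from one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))
```

Calling `fit_linear_regression(rows, targets)` on historical rows and then `predict(weights, latest_row)` gives the estimate for the next step. A real deployment would train on far more data and would normally rely on a library implementation rather than hand-written normal equations.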

© Copyright 2014 Pivotal. All rights reserved.

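[Architecture diagram]
Spring XD: Extensible, Open-Source, Fault-Tolerant, Horizontally Scalable, Cloud-Native
Stream modules: Transform, Sink, Enrich, Filter, Split
Inputs: Real data, Simulator
Pipeline labels: Indicators, Machine Learning, Predict, Dashboard (flow steps numbered 1, 2, 3)
Regions: /Stocks, /TechIndicators, /Predictions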

Apache Geode (incubating)

Introduction

Introduction

A distributed, memory-based data management platform for data-oriented apps that need:
• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture

Concepts

Cache
• In-memory storage and management for your data
• Configurable through XML, Spring, Java API or CLI
• Collection of Regions

Concepts

Region
• Distributed java.util.Map on steroids (Key/Value)
• Consistent API regardless of where or how data is stored
• Observable (reactive)
• Highly available, redundant on cache member(s)
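The "observable key/value map" idea can be illustrated with a small, single-process Python sketch. It is not the Geode API: the class `ObservableRegion`, its methods and the event names are hypothetical, and distribution and redundancy are left out entirely.

```python
from typing import Any, Callable, Dict, List

# A listener receives (event_type, key, old_value, new_value).
Listener = Callable[[str, Any, Any, Any], None]


class ObservableRegion:
    """Toy key/value store that notifies listeners after every change."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: Dict[Any, Any] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def put(self, key: Any, value: Any) -> None:
        # Decide the event type before the store is modified.
        event = "update" if key in self._data else "create"
        old = self._data.get(key)
        self._data[key] = value
        self._notify(event, key, old, value)

    def get(self, key: Any, default: Any = None) -> Any:
        return self._data.get(key, default)

    def remove(self, key: Any) -> None:
        old = self._data.pop(key)  # raises KeyError if the key is absent
        self._notify("destroy", key, old, None)

    def _notify(self, event: str, key: Any, old: Any, new: Any) -> None:
        for listener in self._listeners:
            listener(event, key, old, new)
```

A caller would register a function with `add_listener` and then use `put`, `get` and `remove` like a dictionary; every change reaches the listener as an event, which is the "reactive" part of the slide.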

Concepts

Region
• Local, Replicated or Partitioned
• In-memory or persistent
• Redundant
• LRU Overflow

LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW
PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT, PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY

Concepts

Member
• A process that has a connection to the system
• A process that has created a cache
• Embeddable within your application

Client

Locator

Server

Concepts

Client cache
• A process connected to the Geode server(s)
• Can have a local copy of the data
• Can be notified about events on the servers

Concepts

Listeners
• CacheWriter / CacheListener
• AsyncEventListener (queue / batch)
• Parallel or Serial
• Conflation
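The "queue / batch" and "Conflation" items can be illustrated with a short Python sketch of an event queue that hands events to a listener in batches and, when conflation is on, keeps only the latest value per key. This is a conceptual sketch, not Geode's implementation: `ConflatingBatchQueue` and its methods are hypothetical, and keeping a conflated key at its original queue position is a simplification chosen for the example.

```python
from collections import OrderedDict
from typing import Any, List, Tuple

Event = Tuple[Any, Any]  # (key, value)


class ConflatingBatchQueue:
    """Buffers (key, value) events and releases them in fixed-size batches."""

    def __init__(self, batch_size: int, conflate: bool = True) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._batch_size = batch_size
        self._conflate = conflate
        self._pending: "OrderedDict[Any, Any]" = OrderedDict()  # conflation on
        self._events: List[Event] = []                          # conflation off

    def enqueue(self, key: Any, value: Any) -> None:
        if self._conflate:
            # Overwrites any queued value for this key; an existing key
            # keeps its position in the OrderedDict.
            self._pending[key] = value
        else:
            self._events.append((key, value))

    def drain_batch(self) -> List[Event]:
        """Remove and return up to batch_size events in queue order."""
        if self._conflate:
            batch: List[Event] = []
            while self._pending and len(batch) < self._batch_size:
                batch.append(self._pending.popitem(last=False))
            return batch
        batch = self._events[:self._batch_size]
        self._events = self._events[self._batch_size:]
        return batch
```

With conflation on, a burst of price updates for one symbol reaches the listener as a single event carrying the latest price; with it off, every update is delivered in order.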

Apache Geode (incubating)

• Currently under incubation at the Apache Software Foundation
• Contributions and contributors are welcome:
  • Code and patches
  • Bugs, feature requests
  • Documentation and content
  • Any form of feedback

Code
• New features
• Bug fixes (patches)
• Writing tests

Documentation
• Wiki
• Web site
• User guides

Community
• Join our mailing lists (ask or answer)
• Become a speaker
• Find and report bugs
• Test a release candidate or beta

• JIRA - https://issues.apache.org/jira/browse/GEODE
• GitHub - https://github.com/apache/incubator-geode
• Mailing lists:
  • Development - dev@geode.incubator.apache.org
  • Users - user@geode.incubator.apache.org
• Wiki - cwiki.apache.org/confluence/display/GEODE
• StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire

Spring XD

Introduction

Concepts

Concepts

A stream is composed from modules. Each module is deployed to a container and its channels are bound to the transport.
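The "stream composed from modules" idea can be sketched in Python with generators standing in for channels. This is a single-process illustration under stated assumptions, not Spring XD itself: there are no containers or transport here, and every name (`transform`, `filter_by`, `split`, `compose_stream`, `run`) is hypothetical.

```python
from typing import Any, Callable, Iterable, Iterator

# A module consumes a stream of messages and produces a new stream.
Module = Callable[[Iterable[Any]], Iterator[Any]]


def transform(fn: Callable[[Any], Any]) -> Module:
    """Module that maps every message through fn."""
    def module(messages: Iterable[Any]) -> Iterator[Any]:
        for message in messages:
            yield fn(message)
    return module


def filter_by(predicate: Callable[[Any], bool]) -> Module:
    """Module that drops messages for which predicate is false."""
    def module(messages: Iterable[Any]) -> Iterator[Any]:
        for message in messages:
            if predicate(message):
                yield message
    return module


def split(fn: Callable[[Any], Iterable[Any]]) -> Module:
    """Module that turns one message into zero or more messages."""
    def module(messages: Iterable[Any]) -> Iterator[Any]:
        for message in messages:
            yield from fn(message)
    return module


def compose_stream(*modules: Module) -> Module:
    """Chain modules so the output of each feeds the next."""
    def stream(source: Iterable[Any]) -> Iterator[Any]:
        messages: Iterator[Any] = iter(source)
        for module in modules:
            messages = module(messages)
        return messages
    return stream


def run(source: Iterable[Any], stream: Module, sink: Callable[[Any], None]) -> None:
    """Pull every message from the source through the stream into the sink."""
    for message in stream(source):
        sink(message)
```

For instance, `compose_stream(split(...), transform(float), filter_by(...))` builds a stream that can then be run from a list of raw lines into any callable sink. In Spring XD each of these stages would be a separately deployed module and the hand-off between them would go through the transport rather than a Python iterator.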

Apache Zeppelin (incubating)

Introduction

Concepts

• Web based REPL
• Iterative & Exploratory
• Support for Data Ingestion

Concepts

Multi interpreters
• Markdown
• Shell
• Spark
• Geode
• Python
• …

Concepts

Sharing through URLs without Reports

Apache Spark

Introduction

Concepts

• RDD
• Dataframe
• Driver
• Worker

"An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes."

Concepts

• RDD
• Dataframe
• Driver
• Worker

"A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (as in pandas), previously known as SchemaRDD."
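The two quoted definitions (an immutable collection split into partitions, and rows organized into named columns) can be made concrete with a toy Python class. This is a conceptual sketch only and not the Spark API: `MiniDataFrame` and its methods are hypothetical, and everything runs in one process instead of across worker nodes.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

Row = Tuple[Any, ...]


class MiniDataFrame:
    """Immutable rows with named columns, stored as a tuple of partitions."""

    def __init__(self, columns: Sequence[str],
                 partitions: Iterable[Iterable[Sequence[Any]]]) -> None:
        self.columns: Tuple[str, ...] = tuple(columns)
        # Tuples all the way down, so existing data is never modified.
        self._partitions: Tuple[Tuple[Row, ...], ...] = tuple(
            tuple(tuple(row) for row in part) for part in partitions
        )

    @classmethod
    def from_rows(cls, columns: Sequence[str], rows: Iterable[Sequence[Any]],
                  num_partitions: int) -> "MiniDataFrame":
        """Split rows into at most num_partitions contiguous chunks."""
        if num_partitions < 1:
            raise ValueError("num_partitions must be at least 1")
        rows = list(rows)
        size = max(1, -(-len(rows) // num_partitions))  # ceiling division
        parts = [rows[i:i + size] for i in range(0, len(rows), size)]
        return cls(columns, parts)

    @property
    def num_partitions(self) -> int:
        return len(self._partitions)

    def select(self, *names: str) -> "MiniDataFrame":
        """Return a new frame with only the named columns."""
        idx = [self.columns.index(name) for name in names]
        parts = [[tuple(row[i] for i in idx) for row in part]
                 for part in self._partitions]
        return MiniDataFrame(names, parts)

    def where(self, predicate: Callable[[Dict[str, Any]], bool]) -> "MiniDataFrame":
        """Return a new frame with the rows for which predicate is true."""
        cols = self.columns
        parts = [[row for row in part if predicate(dict(zip(cols, row)))]
                 for part in self._partitions]
        return MiniDataFrame(cols, parts)

    def collect(self) -> List[Dict[str, Any]]:
        """Gather all partitions into one list of column-name dictionaries."""
        return [dict(zip(self.columns, row))
                for part in self._partitions for row in part]
```

Each `select` or `where` call works partition by partition and returns a new object, which mirrors how transformations leave the original collection untouched. In Spark the partitions would live on different nodes and the work would be scheduled by the driver.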

Concepts

• RDD
• Dataframe
• Driver
• Worker

Summary

Summary

Spring XD
• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable

Apache Zeppelin
• Web based REPL
• Multiple Interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory

Summary

Apache Spark
• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming

Apache Geode
• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients

Source Code

http://pivotal-open-source-hub.github.io/StockInference-Spark/