AvalancheProject2012
Avalanche
By: Matthew Levandowski, Travis Fisher, Erik Vavro, Eric Nelson, Jonathan Hoatlin
Ideas & Definitions
Workbench / Interface
- A sandbox environment for developing workflows that can later be used in implementations (e.g., our beer restaurant).
The workbench acts as a secure entry point to the remote framework (RESTful cloud service).
Block
- A single event of data manipulation. Blocks are commonly chained together, and each block usually depends on the
output of its preceding blocks. Blocks can accept data from MongoDB, the UI, or other blocks. Blocks inherit
the behavior of Celery tasks.
Connection
- An identifying route between a source block and target block.
Group Block
- A block that encapsulates sub-blocks, used to provide a basic sense of hierarchy. Group blocks do not perform any data
manipulation themselves and simply forward incoming data to their sub-blocks.
Workflow
- A user-owned collection of blocks and/or group blocks and connections that is described by a JSON schema. Workflows
are built in the UI (workbench), displayed as a Graphiti graph, and passed as JSON to the remote framework for
serialization into an executable sequence of blocks (Celery tasks).
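As an illustration of the definitions above, here is a minimal sketch of what a workflow document might look like and how it could be loaded. The field names (`owner`, `blocks`, `connections`, `source`, `target`, `children`) and the `load_workflow` helper are assumptions made for this example; the slides do not show the actual schema.

```python
import json

# Hypothetical workflow document; every field name here is an illustrative assumption.
WORKFLOW_JSON = """
{
  "owner": "demo_user",
  "blocks": [
    {"id": "b0", "type": "mean", "params": {"field": "price"}},
    {"id": "b1", "type": "group", "children": ["b2"]},
    {"id": "b2", "type": "sort", "params": {"field": "price"}}
  ],
  "connections": [
    {"source": "b0", "target": "b1"}
  ]
}
"""


def load_workflow(text):
    """Parse a workflow and check that every connection refers to a known block."""
    workflow = json.loads(text)
    known = {block["id"] for block in workflow["blocks"]}
    for conn in workflow["connections"]:
        if conn["source"] not in known or conn["target"] not in known:
            raise ValueError(f"connection refers to unknown block: {conn}")
    return workflow


workflow = load_workflow(WORKFLOW_JSON)
```

Here `b1` plays the role of a group block: it carries no parameters of its own and only lists its sub-blocks.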
Ideas & Definitions (cont.)
Framework
- A RESTful cloud-based framework that mines data, serializes workflows and performs various statistical/analytical tasks
(powered by Celery).
Celery
- An asynchronous task queue/job queue library based on distributed message passing.
Celery Worker
- An external process connected to the MongoDB database that executes tasks on the task queue and returns results to other
tasks or to the main workflow task.
Task
- A unit of execution in Celery. Blocks inherit from Task so that they can be run in Celery.
Workbench/Features
• Administrator page allows the user to create workflows
• Each block has metadata so that the front end knows which connections and parameters the block needs
• After the user creates a block, a dynamic form is generated to collect its parameters from the user
• Restaurant allows the user to create data by ordering beers and wines
• History of Results
• Upload Datasets
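The metadata-driven form mentioned in the feature list can be sketched roughly as follows. This is only an illustration: the metadata layout, the `KMEANS_METADATA` example and the `validate_params` helper are hypothetical and are not taken from the project.

```python
# Hypothetical metadata for one block type; the keys are assumptions for this sketch.
KMEANS_METADATA = {
    "name": "kmeans",
    "inputs": 1,
    "outputs": 1,
    "params": [
        {"name": "k", "type": "int", "required": True},
        {"name": "max_iterations", "type": "int", "required": False, "default": 100},
    ],
}

# Converters from raw form strings to typed values.
CASTS = {"int": int, "float": float, "str": str}


def validate_params(metadata, raw):
    """Convert raw form strings into typed values according to the block metadata."""
    cleaned = {}
    for spec in metadata["params"]:
        name = spec["name"]
        if name in raw and raw[name] != "":
            cleaned[name] = CASTS[spec["type"]](raw[name])
        elif spec.get("required"):
            raise ValueError(f"missing required parameter: {name}")
        elif "default" in spec:
            cleaned[name] = spec["default"]
    return cleaned


params = validate_params(KMEANS_METADATA, {"k": "3"})
```

Because the front end only reads the metadata, a new block type would need a new metadata entry rather than a hand-written form.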
General Use Case
1. User logs in and uploads a new dataset (the server parses it as JSON)
2. Dataset file is uploaded to the server, which generates a unique filename
3. User creates a new block: the workbench requests the block's parameters and builds a form
4. Form and data are validated and the new block is created
5. Before saving, the block model generates a unique block id and the block is added to the Graphiti canvas
6. Block model JSON is saved to the workflow field
7. User clicks the ‘Run’ button; blocks and workflow are serialized and sent to the backend
Framework/Features
• Uses Celery, a distributed task queue whose workers execute tasks concurrently, which increases performance
• MongoDB is a flexible, schema-free, BSON-based database (NoSQL)
• Parses workflows into blocks and creates Celery tasks for them
Concepts and Paradigms
• Distributed, message-based computing
• Metadata-based
• Choose between duck and static typing
• Data confidence
• Scalability
• Modularity
• Cloud-based RESTful service
General Use Case
1. Workflow JSON is sent to the backend to be executed
2. Backend parses the workflow data and creates an executable sequence of blocks
3. Celery automagically handles and optimizes block queueing and saves results into MongoDB
4. Backend returns the ids of the results to the frontend
5. Frontend accesses the MongoDB API to get the result data and parse it into a visually pleasing format
6. Django displays result views using the Highcharts JavaScript library
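Step 2 of the use case above could be approached by first turning the workflow's connections into a dependency map. The sketch below assumes a workflow dictionary with `blocks` and `connections` lists and `source`/`target` keys; those names and the `build_dependencies` helper are hypothetical.

```python
def build_dependencies(workflow):
    """Map each block id to the sorted list of block ids it depends on."""
    deps = {block["id"]: set() for block in workflow["blocks"]}
    for conn in workflow["connections"]:
        # The target of a connection waits for the output of its source.
        deps[conn["target"]].add(conn["source"])
    return {block: sorted(sources) for block, sources in deps.items()}


# Hypothetical three-block workflow in which b2 takes input from b0 and b1.
example = {
    "blocks": [{"id": "b0"}, {"id": "b1"}, {"id": "b2"}],
    "connections": [
        {"source": "b0", "target": "b2"},
        {"source": "b1", "target": "b2"},
    ],
}

dependencies = build_dependencies(example)
```

A map of this shape is enough to decide which blocks can start immediately (those with no dependencies) and which must wait.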
Example Workflow
Celery Constructs
• Chain
• Chord
What we need
• Common dependencies
• Multiple inputs
Solution: Parallel Topological Sort
1. Blocks without dependencies are started
2. b0 finishes, b3 is started
3. b1 finishes, b2 and b4 are started
4. b2, b3, b4 finish, b5 is started
5. b5 finishes
• Result ids are returned when all blocks finish
• The data stays in MongoDB
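The walkthrough above can be approximated with a small scheduler that starts each block as soon as all of its dependencies have finished. This sketch swaps in a standard-library thread pool for Celery, so it is not the project's implementation. The dependency graph is inferred from the steps on the slides, and `run_workflow` and `run_block` are hypothetical names.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Graph inferred from the walkthrough: block id -> block ids it waits for.
DEPENDENCIES = {
    "b0": [], "b1": [],
    "b2": ["b1"], "b3": ["b0"], "b4": ["b1"],
    "b5": ["b2", "b3", "b4"],
}


def run_workflow(dependencies, run_block, max_workers=4):
    """Run each block as soon as all of its dependencies have finished."""
    remaining = {block: set(deps) for block, deps in dependencies.items()}
    results = {}
    running = {}  # future -> block id
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Start every block whose dependencies are all satisfied.
            ready = [b for b, deps in remaining.items() if not deps]
            for block in ready:
                del remaining[block]
                inputs = {d: results[d] for d in dependencies[block]}
                running[pool.submit(run_block, block, inputs)] = block
            if not running:
                raise ValueError("workflow has a cycle or an unknown dependency")
            # Wait for at least one block to finish, then release its dependents.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                block = running.pop(future)
                results[block] = future.result()
                for deps in remaining.values():
                    deps.discard(block)
    return results


def run_block(block, inputs):
    # Placeholder work: report which upstream results this block received.
    return sorted(inputs)


results = run_workflow(DEPENDENCIES, run_block)
```

Unlike a level-by-level topological sort, this version does not wait for a whole layer to finish: b3 can start when b0 is done even if b1 is still running, which matches the order shown in the walkthrough.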
Framework/Algorithms
• Basic Statistics
o Mean, Median, Mode
o Standard Deviation
o Variance
o Maximum, Minimum
• Set Theory
o Union
o Intersection
o Difference
o Sorting
• Apriori Algorithm
• K-Means Clustering
• Outlier Detection (Density-Based Clustering)
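For a sense of what the simpler blocks in this list compute, here is a minimal sketch using only the Python standard library. The sample data is invented for illustration. The slides do not say whether the project used population or sample formulas, so the population variants used here are an assumption.

```python
import statistics

# Invented sample data: prices from a handful of orders.
prices = [4.5, 5.0, 5.0, 6.5, 9.0]

summary = {
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "mode": statistics.mode(prices),
    "stdev": statistics.pstdev(prices),        # population standard deviation (assumed)
    "variance": statistics.pvariance(prices),  # population variance (assumed)
    "max": max(prices),
    "min": min(prices),
}

# Set-theory blocks, shown on two invented orders.
order_a = {"ale", "stout", "lager"}
order_b = {"stout", "pilsner"}

union = order_a | order_b
intersection = order_a & order_b
difference = order_a - order_b
sorted_union = sorted(union)
```

In the framework each of these operations would presumably live in its own block, so that a workflow can chain them, for example an intersection followed by a mean.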
Demo
Workbench Technology
• Django – Python-based web framework
• jQuery – cross-browser JavaScript library designed to simplify client-side scripting of HTML, with Ajax support
• Twitter Bootstrap – HTML- and CSS-based design templates for typography, forms, buttons, charts, navigation
and other interface components, as well as optional JavaScript extensions
• Gargoyle – toggleable feature switches for the administrator interface
• HTML5 Canvas – dynamic, scriptable rendering of 2D shapes and bitmap images
Problems Encountered?
• HTML5 Canvas GUI frontend does not work correctly in all browsers
• Getting jQuery UI drag and drop to work with Django
• Django has a steep learning curve
Framework Technology
• Celery
• MongoDB
• NumPy
• SciPy
• scikit-learn
• Flask
Problems Encountered?
• Celery has a steep initial learning curve
• Spent a lot of time revising the structure of workflows and blocks
• Machine learning algorithms are difficult
• Coordinating data formats between the front end and back end was difficult