Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently...

41
Copyright © 2015 KNIME.com AG Harnessing Big Data with KNIME Tobias Kötter KNIME.com

Transcript of Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently...

Page 1: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Harnessing Big Data with KNIME

Tobias Kötter

KNIME.com

Page 2: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Agenda

• The three V’s of Big Data

• Big Data Extension and Databases Nodes

• Demo

2

Page 3: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Variety, Volume, Velocity

Variety:• integrating heterogeneous data (and tools)

Volume:• from small files...

• ...to distributed data repositories (Hadoop)

• bring the tools to the data

Velocity:• from distributing computationally heavy

computations...

• ...to real time scoring of millions of records/sec.

3

Page 4: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG 4

Variety

Page 5: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Variety

• Data Integration– Small (Ascii)

– Proprietary (XLS, SAS...)

– Medium (Databases)

– Large (Hive, Impala, ParStream, HP Vertica...)

– Diverse (Numbers, Texts, Images, Networks, Sequences...)

• Tool Integration– Native

– Legacy, Inhouse

– R, Python, Matlab, ...

Page 6: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG 6

Volume

Page 7: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Every Minute…

7

Page 8: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

IoT

8

Page 9: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Big Data Support

• KNIME Database Nodes

– in database processing

– preconfigured connectors

• KNIME Big Data Extension

– package required drivers/libraries for specific HDFS, Hive, Impala access

• Spark MLlib integration (coming soon)

Page 10: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG 10

Velocity

Page 11: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Velocity

• Computationally Heavy Analytics:

– Distributed Execution of one workflow branch

– Parallel Execution of workflow branches

• Hosted Analytics/Prediction

– Web service Deployment of Workflows

• High Demand Scoring/Prediction:

– Continuous Execution of Workflow parts

– High Performance Scoring using generic Workflows

– High Performance Scoring of Predictive Models

Page 12: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

KNIME Cluster Execution: Distributed Data

Page 13: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

KNIME Cluster Execution: Distributed Analytics

Page 14: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Deployed Workflows

Application Access• Custom API• WSDL/SOAP based

Page 15: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Continuous Scoring using Workflows

• Exposes workflow fragment as RESTful web service

• Deployed on KNIME Server (v4.0 – 1H2015)

Page 16: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

High Performance Scoring via Workflows

• Streaming Executor

• Deployed via KNIME Server (v4.1 – 2H2015/2016)

• Record (or small batch) based processing

• Exposed as RESTful web service

Page 17: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

High Performance Scoring using Models

• Deployed on KNIME Server (v4.0 – 1H2015)

• KNIME PMML Scoring via compiled PMML

• Exposed as RESTful web service

• Partnership with Zementis

– ADAPA Real Time Scoring

– UPPI Big Data Scoring Engine

Page 18: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Big Data, IoT, and the three V

Variety:– KNIME inherently well-suited: open platform

– broad data source/type support

– extensive tool integration

Volume:– Big Data Extensions cover Hadoop based data integration and

aggregation

– Big Data Executors will address model building and streaming execution

Velocity:– Distributed Execution of heavy workflows to...

– High Performance Scoring of predictive models.

Page 19: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG 19

Big Data Extension and Database Nodes

Page 20: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database Port Types

20

Database JDBC Connection Port (light red)• Connection information

Database Connection Port (dark red)• Connection information• SQL statement

Database Connection Ports can be connected to

Database JDBC Connection Ports but not vice versa

Page 21: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database JDBC Connection Port View

21

Page 22: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database Connection Port View

22

Copy SQL statement

Page 23: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database Connectors

• Nodes to connect to specific Databases – Bundling necessary JDBC drivers

– Easy to use

– DB specific behavior/capability

• Hive and Impala connector part of the commercial Big Data Extension

• General Database Connector– Can connect to any JDBC source

– Register new JDBC driver viapreferences page

23

Page 24: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Register JDBC Driver

24

Open KNIME and go toFile -> Preferences

Increase connection timeout forlong running database operations

Page 25: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Reader/Writer

• Table selection

• Load data into KNIME

• Create table as select

• Insert/append data

• Delete rows from table

• Update values in table

25

Page 26: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Hive/Impala Loader

26

• Upload a KNIME data table to Hive/Impala

• Part of the commercial Big Data Extension

Page 27: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Hive/Impala Loader

27

Partitioning influencesperformance

Page 28: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Manipulation

• Filter rows and columns

• Join tables/queries

• Sort your data

• Write your own query

• Aggregate your data

28

Page 29: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database GroupBy – DB Specific Aggregation Methods

29

SQLite 7 aggregation functions

PostgreSQL 25 aggregation functions

Page 30: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database GroupBy – Aggregation Method Description

30

Page 31: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database GroupBy – Manual Aggregation

31

Returns number of rows per group

Page 32: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database GroupBy – Pattern Based Aggregation

32

Tick this option if the search pattern is a

regular expression otherwise it is treated

as string with wildcards ('*' and '?')

Page 33: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database GroupBy – Type Based Aggregation

33

Matches all cells

Matches all numericcells

Page 34: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Database GroupBy – Custom Aggregation Function

34

Page 35: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Utility

• Drop table

– missing table handling

– cascade option

• Execute any SQL statement e.g. DDL

• Manipulate existing queries

35

Executes severalqueries separatedby ; and new line

Page 36: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

In-Database Processing

36

Loads your pre-processeddata into KNIME

Page 37: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

HDFS File Handling

• KNIME & Extensions -> KNIME File Handling Nodes

• HDFS Connection and HDFS File Permission nodes part of the commercial Big Data Extension

37

Page 38: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

HDFS File Handling

38

Page 39: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Virtual Machines

• Hortonworks:

http://hortonworks.com/products/hortonworks-sandbox/

• Cloudera:http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms.html

• Virtual Box

https://www.virtualbox.org/

• VMWare Player

https://www.virtualbox.org/

39

Page 40: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG 40

Demo

Page 41: Harnessing Big Data with KNIME · Big Data, IoT, and the three V Variety: –KNIME inherently well-suited: open platform –broad data source/type support –extensive tool integration

Copyright © 2015 KNIME.com AG

Resources

• KNIME (www.knime.org)• BLOG for news, tips and tricks(www.knime.org/blog)

• FORUM for questions and answers (tech.knime.org/forum)

• EXAMPLE SERVER for example workflows

• LEARNING HUB (www.knime.org/learning-hub)

• KNIME TV channel on

• KNIME on @KNIME

• KNIME on

https://www.facebook.com/KNIMEanalytics

41