
Post on 23-May-2020


Copyright © 2015 KNIME.com AG

Harnessing Big Data with KNIME

Tobias Kötter

KNIME.com


Agenda

• The three V’s of Big Data

• Big Data Extension and Database Nodes

• Demo


Variety, Volume, Velocity

Variety:

• integrating heterogeneous data (and tools)

Volume:

• from small files...

• ...to distributed data repositories (Hadoop)

• bring the tools to the data

Velocity:

• from distributing computationally heavy computations...

• ...to real time scoring of millions of records/sec.


Variety


• Data Integration

– Small (ASCII)

– Proprietary (XLS, SAS...)

– Medium (Databases)

– Large (Hive, Impala, ParStream, HP Vertica...)

– Diverse (Numbers, Texts, Images, Networks, Sequences...)

• Tool Integration

– Native

– Legacy, Inhouse

– R, Python, Matlab, ...


Volume


Every Minute…


IoT


Big Data Support

• KNIME Database Nodes

– in-database processing

– preconfigured connectors

• KNIME Big Data Extension

– packages the required drivers/libraries for HDFS, Hive, and Impala access

• Spark MLlib integration (coming soon)


Velocity


• Computationally Heavy Analytics:

– Distributed Execution of one workflow branch

– Parallel Execution of workflow branches

• Hosted Analytics/Prediction

– Web service Deployment of Workflows

• High Demand Scoring/Prediction:

– Continuous Execution of Workflow parts

– High Performance Scoring using generic Workflows

– High Performance Scoring of Predictive Models


KNIME Cluster Execution: Distributed Data


KNIME Cluster Execution: Distributed Analytics


Deployed Workflows

Application Access

• Custom API

• WSDL/SOAP based


Continuous Scoring using Workflows

• Exposes workflow fragment as RESTful web service

• Deployed on KNIME Server (v4.0 – 1H2015)


High Performance Scoring via Workflows

• Streaming Executor

• Deployed via KNIME Server (v4.1 – 2H2015/2016)

• Record (or small batch) based processing

• Exposed as RESTful web service
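Record-based or small-batch processing can be illustrated outside KNIME with a short Python sketch. The `score` function, its 0.5 threshold, and the batch size are hypothetical stand-ins and say nothing about how the Streaming Executor is implemented; the sketch only shows rows flowing through in small chunks instead of being materialized as one full table.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(records: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def score(batch: List[float]) -> List[int]:
    """Hypothetical model: label a record 1 if it exceeds 0.5, else 0."""
    return [1 if x > 0.5 else 0 for x in batch]


def stream_scores(records: Iterable[float], size: int = 2) -> Iterator[int]:
    """Score records batch by batch without holding the full input in memory."""
    for batch in batches(records, size):
        yield from score(batch)


if __name__ == "__main__":
    print(list(stream_scores([0.1, 0.9, 0.7, 0.2, 0.6])))
```

Because `stream_scores` is a generator, the first predictions are available before the last input record has been read.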


High Performance Scoring using Models

• Deployed on KNIME Server (v4.0 – 1H2015)

• KNIME PMML Scoring via compiled PMML

• Exposed as RESTful web service

• Partnership with Zementis

– ADAPA Real Time Scoring

– UPPI Big Data Scoring Engine


Big Data, IoT, and the three V’s

Variety:

– KNIME inherently well-suited: open platform

– broad data source/type support

– extensive tool integration

Volume:

– Big Data Extensions cover Hadoop based data integration and aggregation

– Big Data Executors will address model building and streaming execution

Velocity:

– Distributed Execution of heavy workflows to...

– High Performance Scoring of predictive models.


Big Data Extension and Database Nodes


Database Port Types


Database JDBC Connection Port (light red)

• Connection information

Database Connection Port (dark red)

• Connection information

• SQL statement

Database Connection Ports can be connected to Database JDBC Connection Ports but not vice versa
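The one-way compatibility rule can be modelled as a subtype relationship. The following Python sketch is only an analogy: the class names, field names, and helper functions are hypothetical and do not reflect KNIME's actual implementation. A port that carries connection information plus a SQL statement can stand in wherever connection information alone is needed, but not the other way round.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JdbcConnectionPort:
    """Carries connection information only (the light red port)."""
    jdbc_url: str


@dataclass(frozen=True)
class DatabaseConnectionPort(JdbcConnectionPort):
    """Carries connection information plus a SQL statement (the dark red port)."""
    sql: str


def accepts_jdbc(port: object) -> bool:
    """Model of a node input that only needs connection information."""
    return isinstance(port, JdbcConnectionPort)


def accepts_database(port: object) -> bool:
    """Model of a node input that also needs the SQL statement."""
    return isinstance(port, DatabaseConnectionPort)


if __name__ == "__main__":
    jdbc = JdbcConnectionPort("jdbc:sqlite::memory:")
    db = DatabaseConnectionPort("jdbc:sqlite::memory:", "SELECT 1")
    print(accepts_jdbc(db), accepts_database(jdbc))
```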


Database JDBC Connection Port View


Database Connection Port View


Copy SQL statement


Database Connectors

• Nodes to connect to specific Databases

– Bundling necessary JDBC drivers

– Easy to use

– DB specific behavior/capability

• Hive and Impala connector part of the commercial Big Data Extension

• General Database Connector

– Can connect to any JDBC source

– Register new JDBC driver via preferences page


Register JDBC Driver


Open KNIME and go to File -> Preferences

Increase connection timeout for long-running database operations


Reader/Writer

• Table selection

• Load data into KNIME

• Create table as select

• Insert/append data

• Delete rows from table

• Update values in table
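The operations listed above map onto ordinary SQL statements. As a rough illustration outside KNIME, the sketch below runs them against an in-memory SQLite database using Python's standard `sqlite3` module; the `sales` and `north_sales` tables and their columns are made up for the example.

```python
import sqlite3
from typing import List, Tuple


def run_demo() -> Tuple[List[Tuple[str, float]], int]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

    # Insert/append data
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("south", 5.0), ("north", 2.5)],
    )

    # Create table as select
    conn.execute(
        "CREATE TABLE north_sales AS SELECT * FROM sales WHERE region = 'north'"
    )

    # Update values in table
    conn.execute("UPDATE sales SET amount = amount * 2 WHERE region = 'south'")

    # Delete rows from table
    conn.execute("DELETE FROM sales WHERE amount < 5")

    # Load data into the client (the analogue of reading a table into KNIME)
    rows = conn.execute(
        "SELECT region, amount FROM sales ORDER BY region, amount"
    ).fetchall()
    north_count = conn.execute("SELECT COUNT(*) FROM north_sales").fetchone()[0]
    conn.close()
    return rows, north_count


if __name__ == "__main__":
    print(run_demo())
```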


Hive/Impala Loader


• Upload a KNIME data table to Hive/Impala

• Part of the commercial Big Data Extension


Partitioning influences performance
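One reason partitioning matters: a partitioned Hive table stores the rows of each partition value separately, so a query that filters on the partition column can skip every other partition. The pure-Python sketch below imitates that effect with a dictionary; the `country` partition column and the sample rows are hypothetical, and no Hive API is involved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (country, value); "country" is the hypothetical partition column


def load_partitioned(rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by partition key, mimicking one storage location per partition."""
    partitions: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return dict(partitions)


def total_for(partitions: Dict[str, List[Row]], country: str) -> Tuple[int, int]:
    """Return (sum of values, rows scanned) while reading only one partition."""
    scanned = partitions.get(country, [])
    return sum(value for _, value in scanned), len(scanned)


if __name__ == "__main__":
    data = [("DE", 1), ("US", 2), ("DE", 3), ("CH", 4)]
    print(total_for(load_partitioned(data), "DE"))
```

Only two of the four rows are touched for the `"DE"` query; a filter on a non-partition column would have to scan all of them.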


Manipulation

• Filter rows and columns

• Join tables/queries

• Sort your data

• Write your own query

• Aggregate your data
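Chaining such manipulation steps amounts to building up one SQL statement that the database executes at the end. The sketch below assumes each step wraps the previous query as a subquery; that wrapping strategy, the helper names, and the `people` table are illustrative assumptions rather than a description of the SQL KNIME generates. It uses Python's standard `sqlite3` module, and the string interpolation is for illustration only and is not safe for untrusted input.

```python
import sqlite3
from typing import List, Sequence, Tuple


def filter_rows(query: str, condition: str) -> str:
    """Wrap `query` as a subquery and keep rows matching `condition`."""
    return f"SELECT * FROM ({query}) AS t WHERE {condition}"


def select_columns(query: str, columns: Sequence[str]) -> str:
    """Wrap `query` as a subquery and keep only the named columns."""
    return f"SELECT {', '.join(columns)} FROM ({query}) AS t"


def sort_by(query: str, column: str) -> str:
    """Wrap `query` as a subquery and order the result by `column`."""
    return f"SELECT * FROM ({query}) AS t ORDER BY {column}"


def run_pipeline() -> List[Tuple[str]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [("Cid", 45), ("Bob", 12), ("Ann", 30)],
    )
    # Each step only rewrites the SQL text; nothing runs in the database yet.
    query = "SELECT * FROM people"
    query = filter_rows(query, "age >= 18")
    query = select_columns(query, ["name"])
    query = sort_by(query, "name")
    # The database does all the work in a single statement.
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_pipeline())
```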


Database GroupBy – DB Specific Aggregation Methods


SQLite: 7 aggregation functions

PostgreSQL: 25 aggregation functions


Database GroupBy – Aggregation Method Description


Database GroupBy – Manual Aggregation


Returns number of rows per group
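In SQL terms, a row count per group is a `COUNT(*)` combined with `GROUP BY`. The sketch below shows the equivalent statement on an in-memory SQLite database via Python's standard `sqlite3` module; the `orders` table and its columns are invented for the example, and whether the node emits exactly this SQL is an assumption.

```python
import sqlite3
from typing import List, Tuple


def rows_per_group() -> List[Tuple[str, int]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, item TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("a", "x"), ("b", "y"), ("a", "z")],
    )
    # One output row per customer, with the number of input rows in that group.
    result = conn.execute(
        "SELECT customer, COUNT(*) AS row_count "
        "FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(rows_per_group())
```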


Database GroupBy – Pattern Based Aggregation


Tick this option if the search pattern is a regular expression; otherwise it is treated as a string with wildcards ('*' and '?')
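The difference between the two pattern modes can be sketched in Python. The helper names and column names below are hypothetical, and the sketch assumes that '*' matches any run of characters, '?' matches exactly one character, and a pattern must match the whole column name; the node's exact matching rules may differ.

```python
import re
from typing import List, Sequence


def wildcard_to_regex(pattern: str) -> str:
    """Translate '*' and '?' to regex syntax and escape everything else."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def matching_columns(columns: Sequence[str], pattern: str, is_regex: bool) -> List[str]:
    """Return the columns whose full name matches the pattern."""
    regex = pattern if is_regex else wildcard_to_regex(pattern)
    compiled = re.compile(regex)
    return [name for name in columns if compiled.fullmatch(name)]


if __name__ == "__main__":
    cols = ["price_eur", "price_usd", "qty", "q1"]
    print(matching_columns(cols, "price_*", is_regex=False))
    print(matching_columns(cols, r"price_(eur|usd)", is_regex=True))
```

Escaping the non-wildcard characters matters: in wildcard mode a literal `.` in the pattern should match only a dot, not any character.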


Database GroupBy – Type Based Aggregation


Matches all cells

Matches all numeric cells


Database GroupBy – Custom Aggregation Function


Utility

• Drop table

– missing table handling

– cascade option

• Execute any SQL statement e.g. DDL

• Manipulate existing queries


Executes several queries separated by ; and new line
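Splitting on a semicolon followed by a line break, rather than on every semicolon, keeps semicolons inside string literals intact. The Python sketch below illustrates that rule with the standard `sqlite3` module; the function names and the `notes` table are hypothetical. `DROP TABLE IF EXISTS` is standard SQL for tolerating a missing table; whether it corresponds to the node's missing table handling option is an assumption.

```python
import sqlite3
from typing import List, Tuple


def split_statements(script: str) -> List[str]:
    """Split on ';' followed by a line break and drop empty fragments."""
    statements = []
    for part in script.replace("\r\n", "\n").split(";\n"):
        stmt = part.strip()
        if stmt.endswith(";"):  # trailing terminator of the last statement
            stmt = stmt[:-1].strip()
        if stmt:
            statements.append(stmt)
    return statements


SCRIPT = (
    "DROP TABLE IF EXISTS notes;\n"
    "CREATE TABLE notes (body TEXT);\n"
    "INSERT INTO notes VALUES ('a;b');"
)


def run_script(script: str) -> List[Tuple[str]]:
    conn = sqlite3.connect(":memory:")
    for stmt in split_statements(script):
        conn.execute(stmt)  # one statement per call
    rows = conn.execute("SELECT body FROM notes").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(split_statements(SCRIPT))
    print(run_script(SCRIPT))
```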


In-Database Processing


Loads your pre-processed data into KNIME


HDFS File Handling

• KNIME & Extensions -> KNIME File Handling Nodes

• HDFS Connection and HDFS File Permission nodes part of the commercial Big Data Extension


Virtual Machines

• Hortonworks:

http://hortonworks.com/products/hortonworks-sandbox/

• Cloudera:

http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms.html

• Virtual Box

https://www.virtualbox.org/

• VMware Player

https://www.vmware.com/


Demo


Resources

• KNIME (www.knime.org)

• BLOG for news, tips and tricks (www.knime.org/blog)

• FORUM for questions and answers (tech.knime.org/forum)

• EXAMPLE SERVER for example workflows

• LEARNING HUB (www.knime.org/learning-hub)

• KNIME TV channel on YouTube

• KNIME on Twitter: @KNIME

• KNIME on Facebook: https://www.facebook.com/KNIMEanalytics
