UIUC Fireside Chats 2015


Transcript of UIUC Fireside Chats 2015

Page 1: UIUC Fireside Chats 2015

© Cloudera, Inc. All rights reserved.

UIUC Fireside Chats: Building NoSQL Applications, Hadoop In The Cloud, and other Assorted Topics

Page 2: UIUC Fireside Chats 2015

Introductions

Aleks Shulman - Software Engineer in Test

Page 3: UIUC Fireside Chats 2015

Agenda

• Introductions
• The Technicals
  • NoSQL Applications with HBase
  • Considerations for Hadoop in the Cloud
• Parting Thoughts & Suggestions
  • Small company vs. big company
  • Thoughts on becoming a successful engineer
  • Thoughts on the Bay Area
• Q & A
• Happy Hour

Page 4: UIUC Fireside Chats 2015

A Little About Myself

[email protected]

Aleks Shulman

Current: Cloudera, Cloud Team, Test Engineer

Past: Salesforce.com, Platform API Team, Quality Engineering

School: UIUC - ’10, Computer Science (bachelor's) & Aerospace Engineering (bachelor's)

Page 5: UIUC Fireside Chats 2015

A Little About Cloudera

Engineering - @ClouderaEng
Corporate - @Cloudera
Life at Cloudera - @ClouderaJobs

• Started 2009 by former Facebook, Yahoo!, Google, and Oracle engineers and execs
• 1000+ employees as of 10/2015
• Distributes Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera's 100% open source distribution of Hadoop
• Distributes Cloudera Manager (CM), a proprietary monitoring and management layer atop CDH
• Contributes heavily to the open source community
• Employs 50+ Apache committers across the community | 84 committerships | 12+ top-level Apache projects originated at Cloudera

Page 6: UIUC Fireside Chats 2015

Building NoSQL Applications: Overview and HBase Case Study

Page 7: UIUC Fireside Chats 2015

Why Are NoSQL Applications Interesting?

Relational databases are SO 2005

• Build higher-scale systems
• Think differently about what it means to store & retrieve data
• Solve different types of problems
  • Data variability
  • Data variety
  • Data velocity
• Expertise is highly coveted in industry

Page 8: UIUC Fireside Chats 2015

Key Database Terms and Concepts

• Referential Integrity - Valid references across tables
• Transactions - Atomicity, consistency, isolation, and durability (ACID) while serving concurrent sets of multiple read and write requests
• Joins - Constructing a view of data from two or more tables that share a common criterion
• Locking - Not permitting access to data because someone or something else is using it

ClassId  ClassName    ProfessorId
CS125    Intro To CS  003
CS241    Systems      002
CS473    Algorithms   005

ProfessorId  ProfessorFName  ProfessorLName
001          Chandra         Chekuri
002          Robin           Kravets
003          Lawrence        Angrave
004          Jeff            Erickson
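
To make the join and referential-integrity terms concrete, here is a minimal Java sketch that holds sample rows adapted from the two tables above in plain in-memory maps. The class and method names (JoinSketch, joinClassesWithProfessors, danglingReferences) are made up for illustration and are not part of any database API. A real RDBMS would do this work for you through SQL and foreign-key constraints.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the two sample tables held as in-memory maps.
class JoinSketch {
    // ClassId -> {ClassName, ProfessorId}
    static final Map<String, String[]> CLASSES = new LinkedHashMap<>();
    // ProfessorId -> {first name, last name}
    static final Map<String, String[]> PROFESSORS = new LinkedHashMap<>();

    static {
        CLASSES.put("CS125", new String[] {"Intro To CS", "003"});
        CLASSES.put("CS241", new String[] {"Systems", "002"});
        CLASSES.put("CS473", new String[] {"Algorithms", "005"});
        PROFESSORS.put("001", new String[] {"Chandra", "Chekuri"});
        PROFESSORS.put("002", new String[] {"Robin", "Kravets"});
        PROFESSORS.put("003", new String[] {"Lawrence", "Angrave"});
        PROFESSORS.put("004", new String[] {"Jeff", "Erickson"});
    }

    // Inner join on ProfessorId: one output row per class whose professor exists.
    static List<String> joinClassesWithProfessors() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String[]> c : CLASSES.entrySet()) {
            String[] prof = PROFESSORS.get(c.getValue()[1]);
            if (prof != null) {
                rows.add(c.getKey() + " " + c.getValue()[0] + " - " + prof[0] + " " + prof[1]);
            }
        }
        return rows;
    }

    // Referential-integrity check: classes whose ProfessorId has no matching professor row.
    static List<String> danglingReferences() {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String[]> c : CLASSES.entrySet()) {
            if (!PROFESSORS.containsKey(c.getValue()[1])) {
                bad.add(c.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        joinClassesWithProfessors().forEach(System.out::println);
        System.out.println("Dangling references: " + danglingReferences());
    }
}
```

In the sample rows, CS473 points at ProfessorId 005, which has no row in the professor table, so it drops out of the join and shows up in the dangling-reference check.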

Page 9: UIUC Fireside Chats 2015

What is this NoSQL thing?

• Premise - Full relational access to all data may not be necessary+
  • Performance penalties
  • Implementation complexity
• If we relax those constraints we can…
  • Process more data
  • Have a more flexible schema
  • Scale out instead of scale up
• Successful NoSQL databases
  • Document stores: MongoDB, CouchDB
  • Key-Value stores: HBase, Cassandra, Riak KV
  • Graph stores: Giraph, Neo4j

+ http://hbase.apache.org/acid-semantics.html

Page 10: UIUC Fireside Chats 2015

Case Study: NoSQL Applications with Hadoop & HBase

• What is Hadoop?
  • Open-source framework for crunching LOTS of data!
  • Originated at Google and Yahoo! in the mid-2000s
• Why is it such a big deal?
  • Democratizing access to extremely powerful tools
  • Solve problems that have never been possible to solve
  • Help enterprises & institutions learn from and use all their data!

Page 11: UIUC Fireside Chats 2015

Hadoop Architecture

• Philosophy
  • Distributed computing
  • Commodity hardware
  • Accept, embrace, and handle failure
  • One set of data - multiple processing engines
  • Linear scalability
• Core Hadoop
  • Storage - HDFS
  • Processing - MapReduce
• ...and we build from there

[Diagram: layered stack with boxes labeled MapReduce, HDFS, Other Components, OS, JVM, ZooKeeper, and Physical Hardware, grouped under the side labels "Hadoop" and "Infrastructure"]

Page 12: UIUC Fireside Chats 2015

Hadoop Architecture - The Rest of the Stack

NoSQL: HBase, Accumulo, Kudu
Processing: MR, YARN, Spark
Query: Impala, Phoenix, Hive
Infrastructure: Linux
Coordination: ZooKeeper
Storage: HDFS (or Isilon, S3, etc.)
YOUR APPLICATION HERE

[Diagram side labels: Hadoop Ecosystem, Core Hadoop]

Page 13: UIUC Fireside Chats 2015

Hadoop - Cluster Topology

[Diagram: a single machine, then a cluster of many machines]

Page 14: UIUC Fireside Chats 2015

Hadoop - Cluster Command And Control

[Diagram: a single machine, then a cluster of many machines, three of which are marked "M"]

Page 15: UIUC Fireside Chats 2015

What Is HBase?

• Distributed, ColumnFamily-Oriented Key-Value Data-Store
• Modeled after Google’s BigTable paper
• Scalable, low-latency, consistent, random-access
• Non-relational
• Built atop HDFS
• Apache-Licensed Open Source

Page 16: UIUC Fireside Chats 2015

What Is HBase?

                              RDBMS                          HBase
Data Layout                   Structured & row-oriented      Semi-structured, column-family-oriented
Schema                        Defined at create              Defined at create & runtime
Transactions                  Multi-row ACID                 Single row only
Query Language                SQL                            get/put/scan/increment/etc.
Security                      Authentication, Authorization  Authentication (Kerberos), Authorization (ACLs)
Indexes                       On arbitrary columns           Row-key only
Max Data Size                 TBs                            ~1 PB
Read/write throughput limits  1000s queries/second           Millions of “queries”/second

Page 17: UIUC Fireside Chats 2015

HBase Is A Set Of Tables Defined by KVs

• The row key is the implicit PRIMARY KEY, in RDBMS terms
• Column format is family:qualifier
• Data is all byte[] in HBase
• Different rows may have different sets of columns (table is sparse)
• A single cell might have different values at different timestamps

Key: cutting/info:height/<timestamp> Value: ‘9ft’
Key: tlipcon/roles:hbase/<timestamp> Value: ‘Committer’
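
One way to picture this layout is as nested sorted maps: row key, then family:qualifier, then timestamp, then value. The sketch below is a toy in-memory model in plain Java, not the HBase client API. The class name SparseTableSketch, its methods, and the second "10ft" version are invented for illustration, and it uses String where HBase would store byte[].

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of the layout described above; not the HBase client API.
class SparseTableSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Stores one version of one cell; rows and columns are created on demand (sparse table).
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest value of a cell, or null if the row does not have that column.
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return null;
        }
        return columns.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("cutting", "info:height", 1L, "9ft");
        table.put("cutting", "info:height", 2L, "10ft"); // hypothetical newer version
        table.put("tlipcon", "roles:hbase", 1L, "Committer");

        System.out.println(table.getLatest("cutting", "info:height"));
        System.out.println(table.getLatest("tlipcon", "info:height")); // column absent in this row
    }
}
```

Sorting timestamps newest-first means the latest version of a cell is always the first entry, which mirrors how a read without an explicit timestamp behaves.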

Page 18: UIUC Fireside Chats 2015

Your First HBase Java App

MyClient.java

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.*;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import java.io.IOException;

    public class MyClient {
        public static void main(String[] args) throws IOException {

            final String TABLE_NAME = "myTable_" + System.currentTimeMillis();
            final String CF_NAME = "myColumnFamily";

            // Load hbase-default.xml and hbase-site.xml from the classpath
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin myAdmin = new HBaseAdmin(conf);
            try {
                // Create the table with a single column family
                HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(TABLE_NAME));
                htd.addFamily(new HColumnDescriptor(CF_NAME));
                myAdmin.createTable(htd);

                // List the tables (e.g. select name from tables)
                for (TableName t : myAdmin.listTableNames()) {
                    System.out.println("Table: " + t.getNameAsString());
                }
            } finally {
                // Release the connection held by the admin client
                myAdmin.close();
            }
        }
    }

pom.xml

    <groupId>myGroup</groupId>
    <artifactId>myClient</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
      <hadoop.version>2.3.0</hadoop.version>
      <hbase.version>0.98.2-hadoop2</hbase.version>
    </properties>
    <dependencies>
      <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
      </dependency>
    </dependencies>

Page 19: UIUC Fireside Chats 2015

Running Your First HBase Java App

1. Get HBase (0.98, for example). List of mirrors: http://www.apache.org/dyn/closer.cgi/hbase/
   wget -O hbase.tar.gz <mirror link>
   tar -xf ./hbase.tar.gz
2. Start HBase Locally
   ./hbase/bin/start-hbase.sh
3. Build & Run Your App
   mvn clean compile package -DskipTests
   mvn exec:java*

* A little config required in Maven. To avoid this, add the pom and client into IntelliJ and run as an application

Page 20: UIUC Fireside Chats 2015

So you want to use a NoSQL Data Store?

• What data is being stored?
  • Entity data
  • Event data
• Why is the data being stored?
  • Operational use cases
  • Analytical use cases
• How does the data get in and out?
  • Real time vs. Batch
  • Random vs. Sequential

Page 21: UIUC Fireside Chats 2015

Why are you storing the data?

• So what kind of questions are you asking the data?
• Entity-centric questions
  • Give me everything about entity e
  • Give me the most recent event v about entity e
  • Give me the n most recent events V about entity e
  • Give me all events V about e between time [t1,t2]
• Event and Time-centric questions
  • Give me an aggregate for each entity between time [t1,t2]
  • Give me an aggregate for each time interval for entity e
  • Find events V that match some other given criteria
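
Because a store like HBase indexes only the row key, the entity-centric questions above usually come down to key design. The following is a minimal sketch of one common pattern, assuming a key of the form entity + "#" + reversed timestamp, so that newer events sort first. A TreeMap stands in for the sorted table. RowKeySketch and its methods are hypothetical names, entity ids are assumed not to contain "#", and timestamps are assumed to be non-negative. Real HBase keys would be byte[] rather than padded strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy stand-in for a table sorted by row key; not the HBase client API.
class RowKeySketch {
    private final TreeMap<String, String> table = new TreeMap<>();

    // Key = entity + '#' + zero-padded (Long.MAX_VALUE - timestamp), so newer events sort first.
    static String rowKey(String entity, long timestamp) {
        return entity + "#" + String.format("%019d", Long.MAX_VALUE - timestamp);
    }

    void putEvent(String entity, long timestamp, String event) {
        table.put(rowKey(entity, timestamp), event);
    }

    // "Give me the n most recent events V about entity e": scan the entity's key prefix, stop after n.
    List<String> mostRecent(String entity, int n) {
        List<String> out = new ArrayList<>();
        // '$' is the character right after '#', so this half-open range covers the whole prefix.
        for (String event : table.subMap(entity + "#", entity + "$").values()) {
            if (out.size() == n) {
                break;
            }
            out.add(event);
        }
        return out;
    }

    // "Give me all events V about e between time [t1,t2]": a bounded range scan, newest first.
    List<String> between(String entity, long t1, long t2) {
        return new ArrayList<>(
                table.subMap(rowKey(entity, t2), true, rowKey(entity, t1), true).values());
    }

    public static void main(String[] args) {
        RowKeySketch store = new RowKeySketch();
        store.putEvent("sensor-1", 100L, "reading-a");
        store.putEvent("sensor-1", 200L, "reading-b");
        store.putEvent("sensor-1", 300L, "reading-c");
        System.out.println(store.mostRecent("sensor-1", 2));
        System.out.println(store.between("sensor-1", 100L, 200L));
    }
}
```

The event and time-centric questions do not fit this key well, since they cut across entities. Those tend to need a different key layout or a batch or query engine on top.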

Page 22: UIUC Fireside Chats 2015

Entity Centric Data

• Entity data is information about current state
• Generally real time reads and writes
• Examples:
  • Accounts
  • Users
  • Geolocation points
  • Click counts and metrics
  • Current sensor readings
• Scales up with # of humans and # of machines/sensors
  • Billions of distinct entities

Page 23: UIUC Fireside Chats 2015

Event Centric Data

• Event-centric data are time-series data: successive data points recorded at intervals over time
• Generally real time writes, with some combination of real time reads and batch reads
• Examples:
  • Sensor data over time
  • Historical stock ticker data
  • Historical metrics
  • Click time-series
• Scales up due to finer-grained intervals, retention policies, and the passage of time

Page 24: UIUC Fireside Chats 2015

If You Need SQL, Cross-Table Transactions, and ACID

• No SQL -> Not Only SQL
• Hadoop Query Engines
  • SpliceMachine
  • Apache Phoenix
  • HP’s Trafodion

Page 25: UIUC Fireside Chats 2015

The Cloud: Possibilities & Considerations

Page 26: UIUC Fireside Chats 2015

Why Might You Find Hadoop in the Cloud Interesting?

• Flexibility
  • API-defined computing
  • Rapid prototyping and POC
  • Burst capacity
• Exciting new data persistence options
  • Network-attached storage
  • File stores
  • Block stores
• Cluster topologies & lifecycles
  • Short-lived vs. long-lived

Page 27: UIUC Fireside Chats 2015

Hadoop + Cloud ?

• Pros
  • A natural fit
  • Elasticity and workloads
  • Easy to prove out new technology
• Cons
  • Security policy
  • Cost
  • Lack of transparency/control
  • Relatively untested/unproven

Page 28: UIUC Fireside Chats 2015

Cloud Providers

• Short-lived Clusters
  • Microsoft HDInsight (Azure)
  • Amazon EMR (AWS)
• Longer-term Clusters
  • Cloudera Director (Azure, AWS, GCP)

Page 29: UIUC Fireside Chats 2015

What Can Go Wrong

• Scale
  • Provisioning
  • Timeouts
  • Retry counts
• Network
  • Connectivity issues
  • VPC/NAT throttling
  • Latency
• AWS
  • Oversubscribed hardware
  • Network connectivity/throughput issues
  • Opaque topology
• Network-Attached Storage
  • High latency
  • Low throughput
• Running on a file store
  • Semantic mismatches
  • File operation incompatibility
• OS / Machine Image
  • Suboptimal memory/disk tuning
  • Hypervisor issues
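
Timeouts and retry counts like those listed above are often handled with a bounded retry loop and exponential backoff. The sketch below shows one possible shape for that in plain Java. RetrySketch and callWithRetry are hypothetical names, not part of any Hadoop or cloud SDK, and the attempt count and delays are placeholder values that would need tuning for a real provisioning or storage call.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper; not part of any Hadoop or cloud provider SDK.
class RetrySketch {
    // Runs the action up to maxAttempts times, doubling the wait after each failure.
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last; // every attempt failed; surface the final error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky operation: fails twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure");
            }
            return "provisioned";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Capping the number of attempts matters as much as the backoff: at cluster scale, unbounded retries against a throttled API can make the original problem worse.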

Page 30: UIUC Fireside Chats 2015

What Can Go Wrong - Case Study: Network-Attached Storage

• Network-attached storage
  • Backing store is some kind of block store
  • Blocks organized into logical disks
  • Disks are mounted
  • Should just work, right?!
• What can go wrong
  • High latency
  • Low throughput
  • Write/read timeouts
• How to fix:
  • OS (memory) tuning
  • Disk tuning
  • Cluster (filesystem) tuning
  • Application tuning

Page 31: UIUC Fireside Chats 2015

Other Cloud Considerations

• SSD vs. rotating disk
• PV vs. HVM machine images
• Other virtualization considerations
  • # of disks
  • Memory considerations
  • Instance types vs. workloads

Page 32: UIUC Fireside Chats 2015

Less Technical Matters: Thoughts on engineering, personal growth, living in the Bay Area, and other topics

Page 33: UIUC Fireside Chats 2015

Small vs. Big Company

• Pros
  • Greater impact/greater leverage
  • Less process/bureaucracy
  • Dedicated staff/founders
  • Ownership of key, visible products/features
  • Closer personal ties with co-workers
• Cons
  • The buck stops with you
  • Nights & weekends
  • Requirements and business direction change
  • Common problems don’t yet have solutions

Page 34: UIUC Fireside Chats 2015

Frenemies

• Sometimes companies have shared interests, but compete

• Customers at one level of the stack, collaborators at another level, competitors at a third

• Keep competition professional & ethical - things change often
  • People move companies
  • Companies get acquired
  • Companies decide to partner or merge

Page 35: UIUC Fireside Chats 2015

Good Engineers...

Apply Good Engineering Patterns & Processes

• Build small, dependable, trusted kernels, and then scale up
• Look to reuse as much as possible
• Think twice, implement once
• Pick their technical battles very carefully
• Know which constraints can be relaxed, and when (don’t boil the ocean!)

Are Always Learning and Thinking In Patterns

• Look for natural interfaces to things
• Aren’t afraid to go beyond abstractions
• Are able to quickly understand unfamiliar systems in terms of more familiar systems
• Invest in tools & tools knowledge

Communicate

• Are conscious of their communication style and those of others
• Seek out feedback regularly and make sure to use it
• Give feedback compassionately and delicately
• Look for mentorship & mentor others where appropriate

Have a Sense of Self

• Are self-aware & play to their own strengths
• Understand and mitigate their weaknesses
• Self-regulate to avoid burn-out

Page 36: UIUC Fireside Chats 2015

SF & The Bay Area

Work

• The Technology
  • Hub for innovation
• The People
  • Very intelligent, driven, and unique
  • Extremely diverse

Play

• Weather
  • No snow!
  • Drive to your choice of weather
• Food
  • Virtually unlimited options and availability, usually a few minutes away

Page 37: UIUC Fireside Chats 2015

Staying In Touch

[email protected]@cloudera.com

@a_shulman @cloudera @clouderaEng @clouderaJobs

We’re Hiring!!!

Internships: Summer 2016
Full Time Engineering