Intro To Cascading

Posted on 26-Jan-2015

10.439 views 1 download

description

A practical introduction to Chris Wensel's Cascading MapReduce/Hadoop framework.

Transcript of Intro To Cascading

Introduction to

Cascading

What is Cascading?

What is Cascading?

• abstraction over MapReduce

What is Cascading?

• abstraction over MapReduce

• API for data-processing workflows

Why use Cascading?

Why use Cascading?

• MapReduce can be:

Why use Cascading?

• MapReduce can be:

• the wrong level of granularity

Why use Cascading?

• MapReduce can be:

• the wrong level of granularity

• cumbersome to chain together

Why use Cascading?

Why use Cascading?

• Cascading helps create

Why use Cascading?

• Cascading helps create

• higher-level data processing abstractions

Why use Cascading?

• Cascading helps create

• higher-level data processing abstractions

• sophisticated data pipelines

Why use Cascading?

• Cascading helps create

• higher-level data processing abstractions

• sophisticated data pipelines

• reusable components

Credits

• Cascading written by Chris Wensel

• based on his users guide

http://bit.ly/cascading

package cascadingtutorial.wordcount;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

Data Model

Pipes Filters

Data Model

file

Data Model

file

Data Model

file

Data Model

file

Data Model

file

Data Model

file

Data Model

file

file

file

Data Model

file

file

file

Filters

Tuple

Pipe Assembly

Pipe Assembly

Tuple Stream

Taps: sources & sinks

Taps: sources & sinks

Taps

Taps: sources & sinks

Taps

Taps: sources & sinks

Taps

Taps: sources & sinks

Taps

Source

Taps: sources & sinks

Taps

Source Sink

Pipe + Taps = Flow

Flow

Flow

Flow

Flow Flow

Flow

n Flows

Flow

Flow

Flow Flow

Flow

n Flows = Cascade

Flow

Flow

Flow Flow

Flow

n Flows = Cascade

Flow

Flow

Flow Flow

Flow

complex

Example

package cascadingtutorial.wordcount;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

TextLine()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

TextLine() SequenceFile()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

hdfs://master0:54310/user/nmurray/data.txt

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

hdfs://master0:54310/user/nmurray/data.txt new Hfs()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt

new Hfs()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt

new Hfs()
new Lfs()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt

new Hfs()
new Lfs()

S3fs()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // sort by count
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

hdfs://master0:54310/user/nmurray/data.txt
data/sources/obama-inaugural-address.txt

new Hfs()
new Lfs()

S3fs()
GlobHfs()

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading.
 *
 * <p>Reads text lines from the input path, splits each line into words,
 * groups equal words, and writes (count, word) tuples to the output path.
 * Paths carrying a URI scheme (e.g. "hdfs://...") are accessed through an
 * {@code Hfs} tap, bare paths through an {@code Lfs} tap.
 */
public class Main {

    public static void main(String[] args) {
        // Source tuples carry each line's byte offset plus its text;
        // the sink writes plain text lines.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        String inputPath = args[0];
        String outputPath = args[1];

        // A "scheme://" prefix selects the Hadoop filesystem tap,
        // anything else the local filesystem tap.
        Tap sourceTap = inputPath.matches("^[^:]+://.*")
                ? new Hfs(inputScheme, inputPath)
                : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches("^[^:]+://.*")
                ? new Hfs(outputScheme, outputPath)
                : new Lfs(outputScheme, outputPath);

        // Split the "line" field on whitespace, emitting one "word" per tuple.
        Pipe wcPipe = new Each("wordcount",
                new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"),
                new Fields("word"));

        // Group identical words, then count each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // Wire source, sink, and pipe assembly into a Flow and run it to completion.
        Flow parsedLogFlow = new FlowConnector(properties)
                .connect(sourceTap, sinkTap, wcPipe);
        parsedLogFlow.start();
        parsedLogFlow.complete();
    }
}

Pipe Assembly

Pipe Assemblies

Pipe Assemblies

Pipe Assemblies

• Define work against a Tuple Stream

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

• Splits

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

• Splits

• Merges

Pipe Assemblies

• Define work against a Tuple Stream

• May have multiple sources and sinks

• Splits

• Merges

• Joins

Pipe Assemblies

• Pipe

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Applies a Function or Filter

Operation to each Tuple

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Group(& merge)

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Joins(inner, outer, left, right)

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Applies an Aggregator (count, sum) to every group of Tuples.

• Pipe

• Each

• GroupBy

• CoGroup

• Every

• SubAssembly

Pipe Assemblies

Nesting

Each vs. Every

Each vs. Every

• Each() is for individual Tuples

Each vs. Every

• Each() is for individual Tuples

• Every() is for groups of Tuples

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        // To sort the results by count instead, re-group on all fields ordered
        // by "count" (descending):
        // wcPipe = new GroupBy(wcPipe, Fields.ALL, new Fields("count"), true);

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Fields()

Each

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

offset line

0 the lazy brown fox

Each

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

offset line

0 the lazy brown fox

Each

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

offset line

0 the lazy brown fox

Each

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

offset line

0 the lazy brown fox

word

lazy

word

brown

word

fox

word

the

X

Operations: Functions

Operations: Functions

• Insert()

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

• XPathGenerator()

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

• XPathGenerator()

• Identity()

Operations: Functions

• Insert()

• RegexParser() / RegexGenerator() / RegexReplace()

• DateParser() / DateFormatter()

• XPathGenerator()

• Identity()

• etc...

Operations: Filters

Operations: Filters

• RegexFilter()

Operations: Filters

• RegexFilter()

• FilterNull()

Operations: Filters

• RegexFilter()

• FilterNull()

• And() / Or() / Not() / Xor()

Operations: Filters

• RegexFilter()

• FilterNull()

• And() / Or() / Not() / Xor()

• ExpressionFilter()

Operations: Filters

• RegexFilter()

• FilterNull()

• And() / Or() / Not() / Xor()

• ExpressionFilter()

• etc...

Each

new Each(previousPipe, argumentSelector, operation, outputSelector)

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

Each

Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"), new Fields("word"));

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

new Every(wcPipe, new Count(), new Fields("count", "word"));

new Every(wcPipe, new Count(), new Fields("count", "word"));

Operations: Aggregator

Operations: Aggregator

• Average()

Operations: Aggregator

• Average()

• Count()

Operations: Aggregator

• Average()

• Count()

• First() / Last()

Operations: Aggregator

• Average()

• Count()

• First() / Last()

• Min() / Max()

Operations: Aggregator

• Average()

• Count()

• First() / Last()

• Min() / Max()

• Sum()

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

new Every(wcPipe, new Count(), new Fields("count", "word"));

Every

new Every(previousPipe, argumentSelector, operation, outputSelector)

new Every(wcPipe, new Count(), new Fields("count", "word"));

new Every(wcPipe, Fields.ALL, new Count(), new Fields("count", "word"));

Field Selection

Predefined Field Sets

Predefined Field Sets

Fields.ALL all available fields

Fields.GROUP fields used for last grouping

Fields.VALUES fields not used for last grouping

Fields.ARGS fields of argument Tuple(for Operations)

Fields.RESULTS replaces input with Operation result (for Pipes)

Predefined Field Sets

Fields.ALL all available fields

Fields.GROUP fields used for last grouping

Fields.VALUES fields not used for last grouping

Fields.ARGS fields of argument Tuple(for Operations)

Fields.RESULTS replaces input with Operation result (for Pipes)

Predefined Field Sets

Fields.ALL all available fields

Fields.GROUP fields used for last grouping

Fields.VALUES fields not used for last grouping

Fields.ARGS fields of argument Tuple(for Operations)

Fields.RESULTS replaces input with Operation result (for Pipes)

Predefined Field Sets

Fields.ALL all available fields

Fields.GROUP fields used for last grouping

Fields.VALUES fields not used for last grouping

Fields.ARGS fields of argument Tuple(for Operations)

Fields.RESULTS replaces input with Operation result (for Pipes)

Predefined Field Sets

Fields.ALL all available fields

Fields.GROUP fields used for last grouping

Fields.VALUES fields not used for last grouping

Fields.ARGS fields of argument Tuple(for Operations)

Fields.RESULTS replaces input with Operation result (for Pipes)

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("timestamp"));

new Each(previousPipe, argumentSelector, operation, outputSelector)

Field Selection

new Each(previousPipe, argumentSelector, operation, outputSelector)

Operation Input Tuple =

Original Tuple Argument Selector

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

id visitor_ id

search_term

timestamp

page_number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

id visitor_ id

search_term

timestamp

page_number

Original Tuple

timestamp

Argument Selector

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

id visitor_ id

search_term

timestamp

page_number

Original Tuple

timestamp

Argument Selector

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

id visitor_ id

search_term

timestamp

page_number

Original Tuple

timestamp

Argument Selector

timestamp

Input to DateParser will be:

Field Selection

new Each(previousPipe, argumentSelector, operation, outputSelector)

Output Tuple =

Original Tuple ⊕ Operation Tuple Output Selector

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

id visitor_id search_term timestamp page_number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

id visitor_id search_term timestamp page_number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

tsidvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Output of Each will be:

ts search_term

visitor_ id

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Output of Each will be:

ts search_term

visitor_ id

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Output of Each will be:

ts search_term

visitor_ id

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Output of Each will be:

ts search_term

visitor_ id

Field Selection

pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));

DateParser Output

ts ts search_term

visitor_ id

Output Selector

idvisitor_

idsearch_

termtime

stamppage_

number

Original Tuple

Output of Each will be:

ts search_term

visitor_ id

⊕X X X

new Every(wcPipe, new Count(), new Fields("count", "word"));

word

hello

word

world

word

hello

word

hello

word

world

new Every(wcPipe, new Count(), new Fields("count", "word"));

word

hello

word

world

word

hello

word

hello

word

world

count word

2 hello

count word

1 world

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

new Every(wcPipe, new Count(), new Fields("count", "word"));

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

package cascadingtutorial.wordcount;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexParser;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.operation.regex.RegexSplitter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

import java.util.Properties;

/**
 * Word-count example in Cascading: reads text lines, splits them into words,
 * and writes "count word" pairs to the sink.
 */
public class Main {

    /** Paths carrying a URI scheme (e.g. "hdfs://...") are routed to an Hfs tap. */
    private static final String URI_SCHEME_REGEX = "^[^:]+://.*";

    /**
     * Builds and runs the word-count flow.
     *
     * @param args args[0] is the input path, args[1] is the output path;
     *             either may be a local path or a scheme-qualified URI
     */
    public static void main(String[] args) {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <inputPath> <outputPath>");
            System.exit(1);
        }

        String inputPath = args[0];
        String outputPath = args[1];

        // TextLine reads each input line as a (byte offset, line text) tuple.
        Scheme inputScheme = new TextLine(new Fields("offset", "line"));
        Scheme outputScheme = new TextLine();

        Tap sourceTap = inputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(inputScheme, inputPath)
            : new Lfs(inputScheme, inputPath);
        Tap sinkTap = outputPath.matches(URI_SCHEME_REGEX)
            ? new Hfs(outputScheme, outputPath)
            : new Lfs(outputScheme, outputPath);

        // Split each "line" on whitespace, emitting one "word" tuple per token.
        Pipe wcPipe = new Each("wordcount",
                              new Fields("line"),
                              new RegexSplitGenerator(new Fields("word"), "\\s+"),
                              new Fields("word"));

        // Group identical words, then count the tuples in each group.
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, Main.class);

        // complete() starts the flow if necessary and blocks until it finishes,
        // so a separate start() call is redundant.
        Flow wordCountFlow = new FlowConnector(properties)
            .connect(sourceTap, sinkTap, wcPipe);
        wordCountFlow.complete();
    }
}

Run

input:
mary had a little lamb
little lamb
little lamb
mary had a little lamb
whose fleece was white as snow

input:
mary had a little lamb
little lamb
little lamb
mary had a little lamb
whose fleece was white as snow

hadoop jar ./target/cascading-tutorial-1.0.0.jar \ data/sources/misc/mary.txt data/output

command:

output:
2	a
1	as
1	fleece
2	had
4	lamb
4	little
2	mary
1	snow
1	was
1	white
1	whose

What’s next?

apply operations to tuple streams

apply operations to tuple streams

repeat.

Cookbook

Cookbook

// remove all tuples where search_term is '-' pipe = new Each(pipe, new Fields("search_term"), new RegexFilter("^(\\-)$", true));

Cookbook

// convert "timestamp" String field to "ts" long field pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), Fields.ALL);

Cookbook

// td - timestamp of day-level granularity pipe = new Each(pipe, new Fields("ts"), new ExpressionFunction( new Fields("td"), "ts - (ts % (24 * 60 * 60 * 1000))", long.class), Fields.ALL);

Cookbook // SELECT DISTINCT visitor_id

// GROUP BY visitor_id ORDER BY ts pipe = new GroupBy(pipe, new Fields("visitor_id"), new Fields("ts"));

// take the First() tuple of every grouping pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);

Custom Operations

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

ngram

a little

Desired Operation

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

ngram

a little

ngram

little lamb

Custom Operations

Custom Operations

• extend BaseOperation

Custom Operations

• extend BaseOperation

• implement Function (or Filter)

Custom Operations

• extend BaseOperation

• implement Function (or Filter)

• implement operate

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

Tuple 123 vis-abc tacos 2009-10-0803:21:23 1

Tuple 123 vis-abc tacos 2009-10-0803:21:23 1

zero-indexed, ordered values

Tuple 123 vis-abc tacos 2009-10-0803:21:23 1

id visitor_ id

search_term

timestamp

page_number

Fields

list of strings representing field names

Tuple 123 vis-abc tacos 2009-10-0803:21:23 1

id visitor_ id

search_term

timestamp

page_number

Fields

123 vis-abc tacos 2009-10-0803:21:23 1

id visitor_ id

search_term

timestamp

page_number

TupleEntry

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

/**
 * Cascading Function that emits every n-gram of an incoming line as a
 * separate single-field "ngram" tuple.
 */
public class NGramTokenizer extends BaseOperation implements Function {

    private int size;  // number of words per emitted n-gram

    public NGramTokenizer(int size) {
        super(new Fields("ngram"));  // declare the single output field
        this.size = size;
    }

    /** Tokenizes the first argument string and emits one tuple per n-gram. */
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        com.attinteractive.util.NGramTokenizer tokenizer =
            new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

        String token = tokenizer.next();
        while (token != null) {
            functionCall.getOutputCollector()
                        .add(new TupleEntry(new Fields("ngram"), new Tuple(token)));
            token = tokenizer.next();
        }
    }
}

Success

Pipe pipe = new Each("ngram pipe", new Fields("line"), new NGramTokenizer(2), new Fields("ngram"));

line

mary had a little lamb

ngram

mary had

ngram

had a

ngram

a little

ngram

little lamb

code examples

dot

dot

Flow importFlow = flowConnector.connect(sourceTap, tmpIntermediateTap, importPipe);
importFlow.writeDOT("import-flow.dot");

What next?

What next?

What next?

• http://www.cascading.org/

• http://bit.ly/cascading

Questions? <nmurray@att.com>