Map, flatMap and reduce are your new best friends (JavaOne, SVCC)

Post on 01-Dec-2014

description

Higher-order functions such as map(), flatMap(), filter() and reduce() have their origins in mathematics and ancient functional programming languages such as Lisp. But today they have entered the mainstream and are available in languages such as JavaScript, Scala and Java 8. They are well on their way to becoming an essential part of every developer’s toolbox. In this talk you will learn how these and other higher-order functions enable you to write simple, expressive and concise code that solves problems in a diverse set of domains. We will describe how you use them to process collections in Java and Scala. You will learn how functional Futures and Rx (Reactive Extensions) Observables simplify concurrent code. We will even talk about how to write big data applications in a functional style using libraries such as Scalding.

Transcript of Map, flatmap and reduce are your new best friends (javaone, svcc)

@crichardson

Map(), flatMap() and reduce() are your new best friends:

Simpler collections, concurrency, and big data

Chris Richardson

Author of POJOs in Action

Founder of the original CloudFoundry.com

@crichardson

chris@chrisrichardson.net

http://plainoldobjects.com

Presentation goal

How functional programming simplifies your code

Show that map(), flatMap() and reduce() are remarkably versatile functions

About Chris

Founder of a buzzword-compliant (stealthy, social, mobile, big data, machine learning, ...) startup

Consultant helping organizations improve how they architect and deploy applications using cloud, microservices, polyglot applications, NoSQL, ...

Agenda

Why functional programming?

Simplifying collection processing

Eliminating NullPointerExceptions

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Functional programming is a programming paradigm

Functions are the building blocks of the application

Best done in a functional programming language

Functions as first class citizens

Assign functions to variables

Store functions in fields

Use and write higher-order functions:

Take functions as parameters

Return functions as values
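The points above can be sketched in Java 8. This is only an illustration: the class name FirstClassFunctions and the helpers applyTwice() and adder() are made up for this example and are not from the talk.

```java
import java.util.function.Function;

class FirstClassFunctions {

    // Store a function in a field
    static final Function<Integer, Integer> SQUARE = x -> x * x;

    // Higher-order function: takes a function as a parameter
    static int applyTwice(Function<Integer, Integer> f, int value) {
        return f.apply(f.apply(value));
    }

    // Higher-order function: returns a function as its value
    static Function<Integer, Integer> adder(int n) {
        return x -> x + n;
    }

    public static void main(String[] args) {
        // Assign a function to a variable
        Function<Integer, Integer> addTen = adder(10);

        System.out.println(applyTwice(SQUARE, 3));           // 81
        System.out.println(addTen.apply(5));                 // 15
        System.out.println(SQUARE.andThen(addTen).apply(4)); // 26: square first, then add 10
    }
}
```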

Avoids mutable state

Use:

Immutable data structures

Single assignment variables

Some functional languages such as Haskell don’t allow side-effects

Why functional programming?

"the highest goal of programming-language design to enable good ideas to be elegantly expressed"

http://en.wikipedia.org/wiki/Tony_Hoare

Why functional programming?

More expressive

More concise

More intuitive - solution matches problem definition

Functional code is usually much more composable

Immutable state:

Less error-prone

Easy parallelization and concurrency

But be pragmatic

An ancient idea that has recently become popular

Mathematical foundation:

λ-calculus

Introduced by Alonzo Church in the 1930s

Lisp = an early functional language invented in 1958

http://en.wikipedia.org/wiki/Lisp_(programming_language)

[Timeline figure, 1940 to 2010: Lisp pioneered garbage collection, dynamic typing, the self-hosting compiler and tree data structures]

(defun factorial (n) (if (<= n 1) 1 (* n (factorial (- n 1)))))

My final year project in 1985: Implementing SASL in LISP

sieve (p:xs) = p : sieve [x | x <- xs, rem x p > 0];

primes = sieve [2..]

A list of integers starting with 2

Filter out multiples of p

Mostly an Ivory Tower technology

Lisp was used for AI

FP languages: Miranda, ML, Haskell, ...

“Side-effects kill kittens and puppies”

http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html

But today FP is mainstream

Clojure - a dialect of Lisp

Scala - a hybrid OO/functional language

F# - a hybrid OO/FP language for .NET

Java 8 has lambda expressions

Java 8 lambda expressions are functions

x -> x * x

x -> {
  // Use <= so that perfect squares such as 4 and 9 are tested against their root
  for (int i = 2; i <= Math.sqrt(x); i = i + 1) {
    if (x % i == 0) return false;
  }
  return true;
};

(x, y) -> x * x + y * y

Agenda

Why functional programming?

Simplifying collection processing

Eliminating NullPointerExceptions

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Lots of application code = collection processing:

Mapping, filtering, and reducing

Social network example

public class Person {

  enum Gender { MALE, FEMALE }

  private Name name;
  private LocalDate birthday;
  private Gender gender;
  private Hometown hometown;

  private Set<Friend> friends = new HashSet<Friend>();
  ....

public class Friend {

  private Person friend;
  private LocalDate becameFriends;
  ...
}

public class SocialNetwork {
  private Set<Person> people;
  ...

Mapping, filtering, and reducing

public class Person {

  public Set<Hometown> hometownsOfFriends() {
    Set<Hometown> result = new HashSet<>();           // Declare result variable
    for (Friend friend : friends) {                   // Iterate
      result.add(friend.getPerson().getHometown());   // Modify result
    }
    return result;                                    // Return result
  }

Mapping, filtering, and reducing

public class SocialNetwork {

  private Set<Person> people;

  ...

  public Set<Person> lonelyPeople() {
    Set<Person> result = new HashSet<Person>();   // Declare result variable
    for (Person p : people) {                     // Iterate
      if (p.getFriends().isEmpty())
        result.add(p);                            // Modify result
    }
    return result;                                // Return result
  }

Mapping, filtering, and reducing

public class SocialNetwork {

  private Set<Person> people;

  ...

  public int averageNumberOfFriends() {
    int sum = 0;                          // Declare scalar result variable
    for (Person p : people) {             // Iterate
      sum += p.getFriends().size();       // Modify result
    }
    return sum / people.size();           // Return result
  }

Problems with this style of programming

Lots of verbose boilerplate - basic operations require 5+ LOC

Imperative (how to do it) NOT declarative (what to do)

Mutable variables are potentially error-prone

Difficult to parallelize

Java 8 streams to the rescue

A sequence of elements

“Wrapper” around a collection

Streams are lazy, i.e. can be infinite

Provides a functional/lambda-based API for transforming, filtering and aggregating elements

Much simpler, cleaner and declarative code

Using Java 8 streams - mapping

class Person ..

  private Set<Friend> friends = ...;

  public Set<Hometown> hometownsOfFriends() {
    return friends.stream()
        .map(f -> f.getPerson().getHometown())   // transforming lambda expression
        .collect(Collectors.toSet());
  }

The map() function

s1 a b c d e ...

s2 f(a) f(b) f(c) f(d) f(e) ...

s2 = s1.map(f)
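A minimal, self-contained sketch of s2 = s1.map(f) with Java 8 streams. The class MapExample and the method lengths() are hypothetical names chosen for this illustration; here f is String::length.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapExample {

    // s2 = s1.map(f): apply f to every element and keep the encounter order
    static List<Integer> lengths(List<String> words) {
        return words.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("map", "flatMap", "reduce"))); // [3, 7, 6]
    }
}
```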

Using Java 8 streams - filtering

public class SocialNetwork {

  private Set<Person> people;

  ...

  public Set<Person> lonelyPeople() {
    return people.stream()
        .filter(p -> p.getFriends().isEmpty())   // predicate lambda expression
        .collect(Collectors.toSet());
  }

Using Java 8 streams - friend of friends V1

class Person ..

  public Set<Person> friendOfFriends() {
    Set<Set<Friend>> fof = friends.stream()
        .map(friend -> friend.getPerson().friends)
        .collect(Collectors.toSet());
    ...
  }

Using map() => Set of Sets :-(

Somehow we need to flatten

Using Java 8 streams - mapping

class Person ..

  public Set<Person> friendOfFriends() {
    return friends.stream()
        .flatMap(friend -> friend.getPerson().friends.stream())   // maps and flattens
        .map(Friend::getPerson)
        .filter(person -> person != this)
        .collect(Collectors.toSet());
  }

Chaining with flatMap()

s1 a b ...

s2 f(a)0 f(a)1 f(b)0 f(b)1 f(b)2 ...

s2 = s1.flatMap(f)
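A small sketch of s2 = s1.flatMap(f), where f turns each element into a stream of zero or more elements and the results are concatenated. The class FlatMapExample and the method words() are illustrative names, not from the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlatMapExample {

    // f maps one line to a stream of its words; flatMap() concatenates those streams
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("four score", "and seven years");
        System.out.println(words(lines)); // [four, score, and, seven, years]
    }
}
```

Using map() with the same lambda would instead produce a stream of streams, which is the "Set of Sets" problem from the friend-of-friends V1 slide.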

Using Java 8 streams - reducing

public class SocialNetwork {

  private Set<Person> people;

  ...

  public long averageNumberOfFriends() {
    return people.stream()
        .map ( p -> p.getFriends().size() )
        .reduce(0, (x, y) -> x + y)
      / people.size();
  }

The reduce() call is equivalent to:

  int x = 0;
  for (int y : inputStream)
    x = x + y;
  return x;

The reduce() function

s1 a b c d e ...

x = s1.reduce(initial, f)

f(f(f(f(f(f(initial, a), b), c), d), e), ...)
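The fold f(f(f(initial, a), b), c) can be tried out with a few self-contained Java 8 examples. The class ReduceExample and its method names are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

class ReduceExample {

    // initial = 0, f = addition
    static int sum(List<Integer> numbers) {
        return numbers.stream().reduce(0, (x, y) -> x + y);
    }

    // initial = Integer.MIN_VALUE, f = max
    static int max(List<Integer> numbers) {
        return numbers.stream().reduce(Integer.MIN_VALUE, Integer::max);
    }

    // initial = "", f = string concatenation; shows the left-to-right order on a sequential stream
    static String concat(List<String> parts) {
        return parts.stream().reduce("", (acc, s) -> acc + s);
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3, 4)));       // 10
        System.out.println(max(Arrays.asList(1, 2, 3, 4)));       // 4
        System.out.println(concat(Arrays.asList("a", "b", "c"))); // abc
    }
}
```

Note that Stream.reduce() expects the function to be associative and the initial value to be an identity for it, which is what allows the same code to run on a parallel stream.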

Newton's method for calculating sqrt(x)

It’s an iterative algorithm

initial value = guess

betterValue = value - (value * value - x) / (2 * value)

Iterate until |value - betterValue| < precision

Functional square root in Scala

package net.chrisrichardson.fp.scala.squareroot

object SquareRootCalculator {

  def squareRoot(x: Double, precision: Double) : Double =
    // Creates an infinite stream: seed, f(seed), f(f(seed)), .....
    Stream.iterate(x / 2)( value => value - (value * value - x) / (2 * value) ).
      // a, b, c, ... => (a, b), (b, c), (c, ...), ...
      sliding(2).map( s => (s.head, s.last)).
      // Find the first convergent approximation
      find { case (value , newValue) => Math.abs(value - newValue) < precision}.
      get._2
}
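For comparison, a rough Java 8 analogue of the same idea. Java 8 streams have no sliding(), so this sketch carries each (value, betterValue) pair through Stream.iterate() as a two-element array. The class SquareRoot and the helper step() are hypothetical names, and the code assumes x > 0.

```java
import java.util.stream.Stream;

class SquareRoot {

    // Assumes x > 0; for x == 0 the first guess is 0 and the iteration never converges
    static double squareRoot(double x, double precision) {
        return Stream.iterate(step(x, x / 2), pair -> step(x, pair[1])) // infinite, lazy stream of pairs
                .filter(pair -> Math.abs(pair[0] - pair[1]) < precision) // keep convergent pairs
                .findFirst()                                             // short-circuits the infinite stream
                .get()[1];
    }

    // One Newton step: returns {value, betterValue}
    static double[] step(double x, double value) {
        return new double[] { value, value - (value * value - x) / (2 * value) };
    }

    public static void main(String[] args) {
        System.out.println(squareRoot(2.0, 1e-9));
        System.out.println(squareRoot(9.0, 1e-9));
    }
}
```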

Adopting FP with Java 8 is straightforward

Switch your application to Java 8

Start using streams and lambdas

Eclipse can refactor anonymous inner classes to lambdas

Or write modules in Scala: more expressive and runs on older JVMs

Agenda

Why functional programming?

Simplifying collection processing

Eliminating NullPointerExceptions

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Tony’s $1B mistake

“I call it my billion-dollar mistake. It was the invention of the null reference in 1965....But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement...”

http://qconlondon.com/london-2009/presentation/Null+References:+The+Billion+Dollar+Mistake

Coding with null pointers

class Person

  public Friend longestFriendship() {
    Friend result = null;
    for (Friend friend : friends) {
      if (result == null ||
          friend.getBecameFriends().isBefore(result.getBecameFriends()))
        result = friend;
    }
    return result;   // Return null if no friends
  }

Friend oldestFriend = person.longestFriendship();
if (oldestFriend != null) {   // Null check is essential yet easily forgotten
  ...
} else {
  ...
}

Java 8 Optional<T>

A wrapper for nullable references

It has two states:

empty ⇒ throws an exception if you try to get the reference

non-empty ⇒ contains a non-null reference

Provides methods for: testing whether it has a value, getting the value, ...

Use an Optional<T> parameter if caller can pass in null

Return reference wrapped in an instance of this type instead of null

Uses the type system to explicitly represent nullability
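The two states can be seen in a short, self-contained sketch. The class OptionalStates and the lookup findNickname() are invented for this illustration.

```java
import java.util.NoSuchElementException;
import java.util.Optional;

class OptionalStates {

    // Hypothetical lookup: returns an Optional instead of a nullable String
    static Optional<String> findNickname(String name) {
        return "Christopher".equals(name) ? Optional.of("Chris") : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> present = findNickname("Christopher");
        Optional<String> absent = findNickname("Tony");

        System.out.println(present.isPresent()); // true
        System.out.println(present.get());       // Chris
        System.out.println(absent.isPresent());  // false

        try {
            absent.get(); // empty, so get() throws
        } catch (NoSuchElementException e) {
            System.out.println("get() on an empty Optional throws NoSuchElementException");
        }

        // ofNullable() wraps a possibly-null reference; null becomes empty
        String maybeNull = null;
        System.out.println(Optional.ofNullable(maybeNull).isPresent()); // false
    }
}
```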

Coding with optionals

class Person

  public Optional<Friend> longestFriendship() {
    Friend result = null;
    for (Friend friend : friends) {
      if (result == null ||
          friend.getBecameFriends().isBefore(result.getBecameFriends()))
        result = friend;
    }
    return Optional.ofNullable(result);
  }

Optional<Friend> oldestFriend = person.longestFriendship();
// Might throw java.util.NoSuchElementException: No value present
// Person dangerous = popularPerson.get();
if (oldestFriend.isPresent()) {
  ...oldestFriend.get()
} else {
  ...
}

Using Optionals - better

Optional<Friend> oldestFriendship = ...;

Friend whoToCall1 = oldestFriendship.orElse(mother);

Friend whoToCall2 = oldestFriendship.orElseGet(() -> lazilyFindSomeoneElse());

Friend whoToCall3 = oldestFriendship.orElseThrow(() -> new LonelyPersonException());

Avoid calling isPresent() and get()

Transforming with map()

public class Person {

  public Optional<Friend> longestFriendship() { return ...; }

  public Optional<Long> ageDifferenceWithOldestFriend() {
    Optional<Friend> oldestFriend = longestFriendship();
    return oldestFriend.map(of -> Math.abs(of.getPerson().getAge() - getAge()));
  }

Eliminates messy conditional logic

Chaining with flatMap()

class Person

  public Optional<Friend> longestFriendship() {...}

  public Optional<Friend> longestFriendshipOfLongestFriend() {
    return longestFriendship()
        .flatMap(friend -> friend.getPerson().longestFriendship());
  }

not always a symmetric relationship. :-)
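A runnable sketch of the same map() and flatMap() chaining on Optional, using a plain lookup table instead of the Person and Friend classes. The class OptionalChaining, the BEST_FRIEND table and the names in it are all made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class OptionalChaining {

    // Hypothetical data: alice -> bob -> carol, and carol has no best friend
    static final Map<String, String> BEST_FRIEND = new HashMap<>();
    static {
        BEST_FRIEND.put("alice", "bob");
        BEST_FRIEND.put("bob", "carol");
    }

    static Optional<String> bestFriend(String person) {
        return Optional.ofNullable(BEST_FRIEND.get(person));
    }

    // flatMap(): the lambda itself returns an Optional, so no Optional<Optional<String>>
    static Optional<String> bestFriendOfBestFriend(String person) {
        return bestFriend(person).flatMap(OptionalChaining::bestFriend);
    }

    // map(): the lambda returns a plain value, which is wrapped for us
    static Optional<Integer> bestFriendNameLength(String person) {
        return bestFriend(person).map(String::length);
    }

    public static void main(String[] args) {
        System.out.println(bestFriendOfBestFriend("alice")); // Optional[carol]
        System.out.println(bestFriendOfBestFriend("bob"));   // Optional.empty
        System.out.println(bestFriendNameLength("alice"));   // Optional[3]
    }
}
```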

Agenda

Why functional programming?

Simplifying collection processing

Eliminating NullPointerExceptions

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Let’s imagine you are performing a CPU intensive operation

class Person ..

  public Set<Hometown> hometownsOfFriends() {
    return friends.stream()
        .map(f -> cpuIntensiveOperation(f))
        .collect(Collectors.toSet());
  }

class Person ..

  public Set<Hometown> hometownsOfFriends() {
    return friends.parallelStream()
        .map(f -> cpuIntensiveOperation(f))
        .collect(Collectors.toSet());
  }

Parallel streams = simple concurrency

Potentially uses N cores ⇒ Nx speed up

Perhaps this will be faster. Perhaps not
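A tiny sketch showing that switching a pipeline to parallel changes how it runs but not what it computes, provided the lambdas are free of side effects. The class ParallelSum and the method sumOfSquares() are illustrative names; whether the parallel version is actually faster depends on the workload and the machine.

```java
import java.util.stream.LongStream;

class ParallelSum {

    // Sums the squares of 1..n, optionally on a parallel stream
    static long sumOfSquares(long n, boolean parallel) {
        LongStream numbers = LongStream.rangeClosed(1, n);
        if (parallel) {
            numbers = numbers.parallel();
        }
        return numbers.map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, false)); // 333833500
        System.out.println(sumOfSquares(1000, true));  // 333833500
    }
}
```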

Let’s imagine that you are writing code to display the products in a user’s wish list

The need for concurrency

Step #1

Web service request to get the user profile including wish list (list of product Ids)

Step #2

For each productId: web service request to get product info

Sequentially ⇒ terrible response time

Need to fetch productInfo concurrently

Composing sequential + scatter/gather-style operations is very common

Futures are a great concurrency abstraction

http://en.wikipedia.org/wiki/Futures_and_promises

Composition with futures

[Diagram: the client on the main thread initiates asynchronous operations 1 and 2, which run on a worker thread or as event-driven code. Each operation sets the outcome of its future (Future 1, Future 2), and the client gets the outcome from the future.]

Benefits

Simple way for multiple concurrent activities to communicate safely

Abstraction:

Client does not know how the asynchronous operation is implemented, e.g. thread pool, event-driven, ....

Easy to implement scatter/gather:

Scatter: Client can invoke multiple asynchronous operations and gets a Future for each one.

Gather: Get values from the futures

But composition with basic futures is difficult

Java 7 future.get([timeout]):

Blocking API ⇒ client blocks thread ⇒ poor scalability

Difficult to compose multiple concurrent operations

Futures with callbacks:

e.g. Guava ListenableFutures, Spring 4 ListenableFuture

Attach callbacks to all futures and asynchronously consume outcomes

But callback-based code = messy code

See http://techblog.netflix.com/2013/02/rxjava-netflix-api.html

We need functional futures!

Functional futures - Scala, Java 8 CompletableFuture

def asyncPlus(x : Int, y :Int): Future[Int] = ... x + y ...

val future2 = asyncPlus(4, 5).map{ _ * 3 }

assertEquals(27, Await.result(future2, 1 second))

Asynchronously transforms future

def asyncSquare(x : Int) : Future[Int] = ... x * x ...

val f2 = asyncPlus(5, 8).flatMap { x => asyncSquare(x) }

assertEquals(169, Await.result(f2, 1 second))

Calls asyncSquare() with the eventual outcome of asyncPlus(), i.e. chaining
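The Scala snippets above have a close Java 8 counterpart: CompletableFuture.thenApply() plays the role of map() and thenCompose() the role of flatMap(). In this sketch asyncPlus() and asyncSquare() are stand-ins that simply run on the common pool via supplyAsync(), and the class name FunctionalFutures is made up.

```java
import java.util.concurrent.CompletableFuture;

class FunctionalFutures {

    static CompletableFuture<Integer> asyncPlus(int x, int y) {
        return CompletableFuture.supplyAsync(() -> x + y);
    }

    static CompletableFuture<Integer> asyncSquare(int x) {
        return CompletableFuture.supplyAsync(() -> x * x);
    }

    public static void main(String[] args) {
        // map(): asynchronously transforms the eventual outcome
        CompletableFuture<Integer> future2 = asyncPlus(4, 5).thenApply(x -> x * 3);
        System.out.println(future2.join()); // 27

        // flatMap(): chains a second asynchronous operation onto the first
        CompletableFuture<Integer> f2 = asyncPlus(5, 8).thenCompose(FunctionalFutures::asyncSquare);
        System.out.println(f2.join()); // 169
    }
}
```

join() blocks and is used here only to print the results; in real code you would keep composing and attach a final callback.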

map() etc are asynchronous

f2 = f1 map (someFn)

Implemented using callbacks: when f1 completes with outcome1, f2 completes with outcome2 = someFn(outcome1)

Scala wish list service

class WishListService(...) {
  def getWishList(userId : Long) : Future[WishList] = {

    userService.getUserProfile(userId).                        // Future[UserProfile]

      map { userProfile => userProfile.wishListProductIds}.    // Future[List[Long]]

      flatMap { productIds =>
        val listOfProductFutures =                             // List[Future[ProductInfo]]
          productIds map productInfoService.getProductInfo

        Future.sequence(listOfProductFutures)                  // Future[List[ProductInfo]]
      }.

      map { products => WishList(products) }                   // Future[WishList]
  }
}

Using Java 8 CompletableFutures

public CompletableFuture<Wishlist> getWishlistDetails(long userId) {
  return userService.getUserProfile(userId).thenComposeAsync(userProfile -> {   // flatMap()!

    Stream<CompletableFuture<ProductInfo>> s1 =
        userProfile.getWishListProductIds()
            .stream()
            .map(productInfoService::getProductInfo);

    Stream<CompletableFuture<List<ProductInfo>>> s2 =
        s1.map(fOfPi -> fOfPi.thenApplyAsync(pi -> Arrays.asList(pi)));

    // Java 8 is missing Future.sequence()
    CompletableFuture<List<ProductInfo>> productInfos = s2
        .reduce((f1, f2) -> f1.thenCombine(f2, ListUtils::union))
        .orElse(CompletableFuture.completedFuture(Collections.emptyList()));

    return productInfos.thenApply(list -> new Wishlist());                       // map()!
  });
}
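One common way to fill the Future.sequence() gap is a small helper built on CompletableFuture.allOf(). This is a sketch of that alternative technique, not the approach used on the slide; the class FutureSequence and the method sequence() are names chosen here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class FutureSequence {

    // Turns List<CompletableFuture<T>> into CompletableFuture<List<T>>, preserving order
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture[0]))
                // allOf() has completed, so every join() below returns immediately
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        List<CompletableFuture<Integer>> futures = Arrays.asList(
                CompletableFuture.supplyAsync(() -> 1),
                CompletableFuture.supplyAsync(() -> 2),
                CompletableFuture.supplyAsync(() -> 3));

        System.out.println(sequence(futures).join()); // [1, 2, 3]
    }
}
```

If any input future fails, the combined future completes exceptionally as well.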

Your mouse is your database

Erik Meijer

http://queue.acm.org/detail.cfm?id=2169076

Introducing Reactive Extensions (Rx)

The Reactive Extensions (Rx) is a library for composing asynchronous and event-based programs ....

Using Rx, developers represent asynchronous data streams with Observables, query asynchronous data streams using LINQ operators, and .....

https://rx.codeplex.com/

About RxJava

Reactive Extensions (Rx) for the JVM

Developed by Netflix

Original motivation was to provide rich, functional Futures

Implemented in Java

Adaptors for Scala, Groovy and Clojure

Embraced by Akka and Spring Reactor: http://www.reactive-streams.org/

https://github.com/Netflix/RxJava

RxJava core concepts

trait Observable[T] {                                      // An asynchronous stream of items
  def subscribe(observer : Observer[T]) : Subscription    // The Subscription is used to unsubscribe
  ...
}

trait Observer[T] {                                        // Notified by the Observable
  def onNext(value : T)
  def onCompleted()
  def onError(e : Throwable)
}

Comparing Observable to...

Observer pattern - similar but adds

Observer.onCompleted()

Observer.onError()

Iterator pattern - mirror image

Push rather than pull

Futures - similar

Can be used as Futures

But Observables = a stream of multiple values

Collections and Streams - similar

Functional API supporting map(), flatMap(), ...

But Observables are asynchronous

Fun with observables

val every10Seconds = Observable.interval(10 seconds)

val oneItem = Observable.items(-1L)

val ticker = oneItem ++ every10Seconds

  -1     0      1      ...
  t=0    t=10   t=20   ...

val subscription = ticker.subscribe { (value: Long) => println("value=" + value) }
...
subscription.unsubscribe()

Observables as the result of an asynchronous operation

def getTableStatus(tableName: String) : Observable[DynamoDbStatus] =
  Observable { subscriber: Subscriber[DynamoDbStatus] =>

    amazonDynamoDBAsyncClient.describeTableAsync(
      new DescribeTableRequest(tableName),
      new AsyncHandler[DescribeTableRequest, DescribeTableResult] {

        override def onSuccess(request: DescribeTableRequest, result: DescribeTableResult) = {
          subscriber.onNext(DynamoDbStatus(result.getTable.getTableStatus))
          subscriber.onCompleted()
        }

        override def onError(exception: Exception) = exception match {
          case t: ResourceNotFoundException =>
            subscriber.onNext(DynamoDbStatus("NOT_FOUND"))
            subscriber.onCompleted()
          case _ =>
            subscriber.onError(exception)
        }
      })
  }

Transforming/chaining observables with flatMap()

val tableStatus = ticker.flatMap { i =>
  logger.info("{}th describe table", i + 1)
  getTableStatus(name)
}

  Status1   Status2   Status3   ...
  t=0       t=10      t=20      ...

+ Usual collection methods: map(), filter(), take(), drop(), ...

Calculating rolling average

class AverageTradePriceCalculator {

  def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = {
    ...
  }

case class Trade(
  symbol : String,
  price : Double,
  quantity : Int
  ...)

case class AveragePrice(symbol : String, price : Double, ...)

Calculating average prices

def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = {

  trades.groupBy(_.symbol).map { symbolAndTrades =>
    val (symbol, tradesForSymbol) = symbolAndTrades
    val openingEverySecond =
      Observable.items(-1L) ++ Observable.interval(1 seconds)
    def closingAfterSixSeconds(opening: Any) =
      Observable.interval(6 seconds).take(1)

    tradesForSymbol.window(...).map { windowOfTradesForSymbol =>
      windowOfTradesForSymbol.fold((0.0, 0, List[Double]())) { (soFar, trade) =>
        val (sum, count, prices) = soFar
        (sum + trade.price, count + trade.quantity, trade.price +: prices)
      } map { x =>
        val (sum, length, prices) = x
        AveragePrice(symbol, sum / length, prices)
      }
    }.flatten
  }.flatten
}

Agenda

Why functional programming?

Simplifying collection processing

Eliminating NullPointerExceptions

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Let’s imagine that you want to count word frequencies

Scala Word Count

val frequency : Map[String, Int] =
  Source.fromFile("gettysburgaddress.txt").getLines()
    .flatMap { _.split(" ") }.toList   // Map
    .groupBy(identity)
    .mapValues(_.length)               // Reduce

frequency("THE") should be(11)
frequency("LIBERTY") should be(1)
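The same word count can be sketched with Java 8 streams: flatMap() is the map phase and the grouping collector is the reduce phase. The class WordCount and the method frequency() are illustrative names, and the input is passed in as a list of lines rather than read from a file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCount {

    static Map<String, Long> frequency(List<String> lines) {
        return lines.stream()
                // Map: split each line into words and flatten into one stream
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                // Reduce: group identical words and count each group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = frequency(Arrays.asList("the cat and the hat", "the end"));
        System.out.println(counts.get("the")); // 3
        System.out.println(counts.get("cat")); // 1
    }
}
```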

But how to scale to a cluster of machines?

Apache Hadoop

Open-source ecosystem for reliable, scalable, distributed computing

Hadoop Distributed File System (HDFS)

Efficiently stores very large amounts of data

Files are partitioned and replicated across multiple machines

Hadoop MapReduce

Batch processing system

Provides plumbing for writing distributed jobs

Handles failures

And, much, much more...

Overview of MapReduce

[Diagram: the input data is split into (K,V) pairs for three Mappers. Each Mapper emits (K,V)* pairs. The shuffle groups the pairs by key into (K1,V, ....)*, (K2,V, ....)* and (K3,V, ....)*, one group per Reducer. Each Reducer writes (K,V) pairs to the output data.]

MapReduce Word count - mapper

class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);   // emit (word, 1)
    }
  }
}

Four score and seven years ⇒ (“Four”, 1), (“score”, 1), (“and”, 1), (“seven”, 1), ...

http://wiki.apache.org/hadoop/WordCount

Hadoop then shuffles the key-value pairs...

MapReduce Word count - reducer

class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));   // emit (word, total)
  }
}

(“the”, (1, 1, 1, 1, 1, 1, ...)) ⇒ (“the”, 11)

http://wiki.apache.org/hadoop/WordCount

About MapReduce

Very simple programming abstraction yet incredibly powerful

By chaining together multiple map/reduce jobs you can process very large amounts of data in interesting ways

e.g. Apache Mahout for machine learning

But

Mappers and Reducers = verbose code

Development is challenging, e.g. unit testing is difficult

It’s disk-based, batch processing ⇒ slow

Scalding: Scala DSL for MapReduce

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}

https://github.com/twitter/scalding

Expressive and unit testable

Each row is a map of named fields

Apache Spark

Created at UC Berkeley and now part of the Hadoop ecosystem

Key abstraction = Resilient Distributed Datasets (RDD)

Collection that is partitioned across cluster members

Operations are parallelized

Created from either a collection or a Hadoop supported datasource - HDFS, S3 etc

Can be cached in-memory for super-fast performance

Can be replicated for fault-tolerance

Scala, Java, and Python APIs

http://spark.apache.org

Spark Word Count

val sc = new SparkContext(...)

sc.textFile("s3n://mybucket/...")
  .flatMap { _.split(" ")}
  .groupBy(identity)
  .mapValues(_.length)
  .toArray.toMap

Expressive, unit testable and very fast

Very similar to Scala collection code!!

Summary

Functional programming enables the elegant expression of good ideas in a wide variety of domains

map(), flatMap() and reduce() are remarkably versatile higher-order functions

Use FP and OOP together

Java 8 has taken a good first step towards supporting FP

Go write some functional code!

Questions?

@crichardson chris@chrisrichardson.net

http://plainoldobjects.com