Pair RDD - Spark
Spark Key/Value Pairs
아키텍트를 꿈꾸는 사람들 (Aspiring Architects), Cecil
Key/Value Pairs?
An RDD made up of key/value pairs (a Pair RDD). Mainly used to process each key in parallel
or to group data.
ex) word count …
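Word count is the canonical Pair RDD example. As a rough illustration, its key/value flow (map each word to a pair, then reduce by key) can be mimicked in plain Python, no Spark required:

```python
# Plain-Python sketch (no Spark) of the word-count pattern:
# "map" each word to a (word, 1) pair, then "reduceByKey" by summing.
from collections import defaultdict

def word_count(lines):
    # map step: emit (key, value) pairs, one per word
    pairs = [(word, 1) for line in lines for word in line.split()]
    # reduceByKey step: sum the values for each key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```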
Creating a Pair RDD
• Load the data directly as a Pair RDD
  -> not covered here.
• Convert a regular RDD into a Pair RDD
  -> use a map function that returns key/value pairs
// Create pairs keyed by the first word of each line
val pairs = lines.map(x => (x.split(" ")(0), x))
Creating a Pair RDD per Language
• Python/Scala
  -> return a tuple from the map function
• Java: no built-in tuple type
  -> create tuples with the scala.Tuple2 class
PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2(x.split(" ")[0], x);
    }
  };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
Pair RDD Transformations
• All transformations available on regular RDDs can be used
• Key-related transformations
  -> on a single Pair RDD (aggregation): reduceByKey, groupByKey, mapValues…
  -> on two Pair RDDs (grouping): subtractByKey, join, cogroup…
// ex) keep only pairs whose value is shorter than 20 characters
pairs.filter { case (key, value) => value.length < 20 }
Example: Per-Key Average (reduceByKey)
// rdd: a Pair RDD of (key, value)
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
After mapValues:
value -> (value, 1)
After reduceByKey:
value -> (total, count)
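The same mapValues-then-reduceByKey flow can be sketched in plain Python (no Spark), accumulating a (total, count) pair per key and dividing at the end:

```python
# Plain-Python sketch (no Spark) of the per-key average pattern above.
def average_by_key(pairs):
    acc = {}  # key -> (total, count)
    for key, value in pairs:
        # mapValues: value -> (value, 1); reduceByKey: field-wise sum
        total, count = acc.get(key, (0, 0))
        acc[key] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in acc.items()}

print(average_by_key([("a", 3), ("b", 4), ("a", 1)]))  # {'a': 2.0, 'b': 4.0}
```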
자주�사용되는�combineByKey
• aggregate()와�마찬가지로�입력�타입과�출력�타입이�다를�수�있음�• combineByKey(createCombiner,�mergeValue,�mergeConviner)�• create�Combiner�
• 해당�key에�대한�accumulator의�초기값을�제공하는�함수�• 해당�파티션에서�key가�처음�나올때�호출�
• merge�Value�• 파티션�내에서�이전에�나온�key가�있을때�호출되는�함수�
• merge�Conbiner�• 둘�이상의�파티션이�동일할�key�를�가지고�있으면�호출되는�함수
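To make the three callbacks concrete, here is a plain-Python model (no Spark) that applies them per "partition" and then merges across partitions, exactly in the order described above:

```python
# Plain-Python model (no Spark) of combineByKey's three callbacks.
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key not in combiners:
                # first time this key is seen in this partition
                combiners[key] = create_combiner(value)
            else:
                # key already seen in the same partition
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            if key in merged:
                # the same key appears in two or more partitions
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged

# Per-key average, using the same three functions as the Scala example
result = combine_by_key(
    [[("a", 3), ("b", 4)], [("a", 1)]],        # two "partitions"
    lambda v: (v, 1),                          # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners
)
print({k: total / count for k, (total, count) in result.items()})  # {'a': 2.0, 'b': 4.0}
```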
Example: Per-Key Average (combineByKey)
// input: a Pair RDD of (key, value)
val result = input.combineByKey(
  (v) => (v, 1),  // createCombiner
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),  // mergeValue
  // mergeCombiners
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
Example: Per-Key Average (combineByKey), cont'd
Tuning the Level of Parallelism
• Every RDD has a fixed number of partitions
• The number of partitions determines the degree of concurrency when operations run on the RDD
• Default partition count: determined by the size of the cluster
• The partition count can be specified when creating grouped or aggregated RDDs
• To change the partition count without an aggregation or grouping operation
  -> repartition(): moves data across the network (shuffling)
val data = Seq(("a", 3), ("b", 4), ("a", 1))
sc.parallelize(data).reduceByKey((x, y) => x + y)      // Default parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10)  // Custom parallelism
Grouping Data
• Functions that group values sharing the same key
  • groupByKey: [K, Iterable[V]]
  • cogroup: [K, (Iterable[V], Iterable[W])]
• Grouping matching keys across multiple RDDs
  • join: only keys present on both sides
  • left/right outer join: also includes keys present on only one side
  • cross join: Cartesian product
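The difference between an inner join and a left outer join on pair data can be sketched in plain Python (no Spark):

```python
# Plain-Python sketch (no Spark) of inner vs. left outer join on (key, value) lists.
def inner_join(left, right):
    right_by_key = {}
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)
    # only keys present on both sides survive
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]

def left_outer_join(left, right):
    right_by_key = {}
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)
    out = []
    for k, v in left:
        matches = right_by_key.get(k)
        if matches:
            out.extend((k, (v, w)) for w in matches)
        else:
            out.append((k, (v, None)))  # a key found only on the left is kept
    return out

print(inner_join([("a", 1), ("b", 2)], [("a", "x")]))       # [('a', (1, 'x'))]
print(left_outer_join([("a", 1), ("b", 2)], [("a", "x")]))  # [('a', (1, 'x')), ('b', (2, None))]
```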
Sorting Data
• A Pair RDD can be sorted whenever an ordering is defined on the keys
• Once sorted, subsequent calls such as collect or save return the data in that order
• Use sortByKey(): (true: ascending, the default)
• A user-defined ordering can be supplied
val input: RDD[(Int, Venue)] = ...
implicit val sortIntegersByString = new Ordering[Int] {
  override def compare(a: Int, b: Int) = a.toString.compare(b.toString)
}
input.sortByKey()
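The custom-ordering idea (comparing integer keys as strings) has a direct plain-Python analogue using `sorted` with a key function:

```python
# Plain-Python analogue (no Spark) of sortByKey with a custom ordering:
# integer keys compared as strings, like the Scala Ordering above.
pairs = [(10, "a"), (2, "b"), (1, "c")]
sorted_pairs = sorted(pairs, key=lambda kv: str(kv[0]))
print(sorted_pairs)  # [(1, 'c'), (10, 'a'), (2, 'b')] since "1" < "10" < "2"
```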
Basic Pair RDD Actions
• countByKey()
  • Counts the number of values for each key
• collectAsMap()
  • Collects the result as a map for easy lookup
• lookup(key)
  • Returns all values associated with the given key
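A plain-Python sketch (no Spark) of what each of the three actions computes over a list of pairs:

```python
# Plain-Python sketches (no Spark) of the three actions on a (key, value) list.
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

count_by_key = Counter(k for k, _ in pairs)   # countByKey: value count per key
as_map = dict(pairs)                          # collectAsMap: later pairs win on duplicate keys
lookup_a = [v for k, v in pairs if k == "a"]  # lookup("a"): every value for that key

print(dict(count_by_key))  # {'a': 2, 'b': 1}
print(as_map)              # {'a': 3, 'b': 2}
print(lookup_a)            # [1, 3]
```

Note the collectAsMap caveat shown above: a plain map keeps only one value per key, while lookup returns all of them.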
Data Partitioning
• Communication is very expensive in distributed programs, so reducing network traffic is important
• Partitioning is an effective way to reduce network traffic
  • Guarantees that a given set of keys ends up together on some node
  • Only worthwhile when key-oriented operations reuse the data multiple times
  • ex) join
• Partitioning in Spark
  • Hash-based, range-based, or user-defined
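The hash-based scheme can be sketched in plain Python (no Spark): each key goes to bucket `hash(key) % numPartitions`, which is what guarantees identical keys co-locate:

```python
# Plain-Python sketch of hash partitioning: identical keys always land
# in the same partition because the bucket depends only on the key.
def hash_partition(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], 4)
# Both ("a", ...) pairs end up in the same single partition
buckets_with_a = [i for i, p in enumerate(parts) if any(k == "a" for k, _ in p)]
print(len(buckets_with_a))  # 1
```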
Example: Counting Visits to Non-Subscribed Topics
val sc = new SparkContext(…)
// userData: each user's list of subscribed topics
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()

// Called periodically to process a 5-minute event log file,
// which holds pairs of user info and link info.
def processNewLogs(logFileName: String) {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)  // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) =>  // Expand the tuple into its components
      !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}
Example: Counting Visits to Non-Subscribed Topics (cont'd)
Every 5 minutes the join causes a large amount of network transfer.
Example: Counting Visits to Non-Subscribed Topics (cont'd)
val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100))  // create 100 partitions
                 .persist()  // <= persisting the partitioned result is essential
The partition count determines how much of the subsequent work runs in parallel.
Guideline: partition count > number of cores in the cluster
userData is no longer shipped over the network on every call.
Partitioning etc.
• Operations that benefit from partitioning
  • reduceByKey, combineByKey, lookup…
    • The computation for each key runs on a single machine
  • cogroup, join…
    • Ensures that at least one of the RDDs need not be shipped over the network
• Operations that affect partitioning
  • RDDs produced by operations that partition the data have their partitioner set automatically
    • ex) join: hash-partitioned
  • Operations that cannot guarantee a particular partitioning do not set one
    • ex) map: the keys may change
Example: PageRank
// Assume the neighbor lists are stored as Spark objects
val links = sc.objectFile[(String, Seq[String])]("links")
              .partitionBy(new HashPartitioner(100))
              .persist()

// Initialize each page's rank to 1.0
var ranks = links.mapValues(v => 1.0)

// Run 10 iterations of the algorithm
for (i <- 0 until 10) {
  val contributions = links.join(ranks).flatMap {
    case (pageId, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
}

// Save the result
ranks.saveAsTextFile("ranks")
Algorithm
• Initialize each page's rank to 1.0
• On each iteration, send the contribution rank(p) / numNeighbors(p) to p's neighbors
• Update each page's rank to 0.15 + 0.85 * (contributions received)
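The iteration above can be sketched in plain Python (no Spark) over a dict of neighbor lists; this assumes, as in the example, that every linked page also appears as a key:

```python
# Plain-Python version (no Spark) of the PageRank loop:
# contributions from neighbors, then the 0.15 + 0.85 * sum update.
def pagerank(links, iterations=10):
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contributions = {page: 0.0 for page in links}
        for page, neighbors in links.items():
            for dest in neighbors:
                # each page contributes rank / numNeighbors to every neighbor
                contributions[dest] += ranks[page] / len(neighbors)
        ranks = {page: 0.15 + 0.85 * c for page, c in contributions.items()}
    return ranks

# "a" is linked by both "b" and "c", so it ends up with the highest rank
print(pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]}))
```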
Example: PageRank (Custom Partitioner)
class DomainNamePartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost()
    val code = (domain.hashCode % numPartitions)
    if (code < 0) {
      code + numPartitions  // make negative codes non-negative
    } else {
      code
    }
  }
  // Java's equals method, used by Spark to compare custom partitioner objects
  override def equals(other: Any): Boolean = other match {
    case dnp: DomainNamePartitioner =>
      dnp.numPartitions == numPartitions
    case _ =>
      false
  }
}
Custom Partitioner
• Assigns links within the same domain to the same partition
• org.apache.spark.Partitioner
  • numPartitions: the number of partitions
  • getPartition: returns the partition ID for a key
  • equals: checks whether two RDDs are partitioned the same way
Q&A
References
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark (Korean translation by 박종용). Paju, Gyeonggi-do: J-Pub, 2015
• Spark home page: http://spark.apache.org