Pair RDD - Spark


  • Spark Key/Value Pairs

    Cecil

  • What are Key/Value Pairs?

    Key/Value RDDs (Pair RDDs) hold (key, value) tuples and expose operations that act on each key in parallel or regroup data across the network by key.

    ex) word count (see the sketch below)
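
    Since the slide cites word count as the canonical Pair RDD example, here is a minimal, self-contained sketch; the local master, app name, and input path are illustrative assumptions, not part of the original slide.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
        val counts = sc.textFile("input.txt")      // assumed input path
          .flatMap(_.split(" "))                   // split each line into words
          .map(word => (word, 1))                  // build a Pair RDD: (word, 1)
          .reduceByKey(_ + _)                      // sum the counts per key
        counts.collect().foreach(println)
        sc.stop()
      }
    }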

  • Creating Pair RDDs

    Pair RDDs can be obtained in two ways:

    -> many data-loading formats return Pair RDDs directly.

    -> a regular RDD can be turned into a Pair RDD with map().

    // Use the first word of each line as the key
    val pairs = lines.map(x => (x.split(" ")(0), x))

  • Creating Pair RDDs in Java (vs. Python/Scala)

    -> Python/Scala build pairs with map() and built-in tuples; Java has no built-in tuple type (X).

    -> Java uses scala.Tuple2 together with mapToPair().

    PairFunction<String, String, String> keyData =
      new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
          return new Tuple2(x.split(" ")[0], x);
        }
      };
    JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

  • Pair RDD Transformations

    All transformations available on standard RDDs also work on Pair RDDs.

    Key-oriented transformations are added on top (see the sketch below):

    -> on a single Pair RDD: reduceByKey, groupByKey, mapValues, ...

    -> on two Pair RDDs: subtractByKey, join, cogroup, ...

    // ex) keep only the pairs whose value is shorter than 20 characters
    pairs.filter { case (key, value) => value.length < 20 }
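
    A minimal sketch exercising a few of these transformations; the sample data and local SparkContext are assumptions for illustration.

    import org.apache.spark.{SparkConf, SparkContext}

    object PairTransformations {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pair-demo").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))   // sample Pair RDD
        val other = sc.parallelize(Seq((3, 9)))

        println(pairs.reduceByKey(_ + _).collect().toList)               // (1,2), (3,10)
        println(pairs.mapValues(_ + 1).collect().toList)                 // (1,3), (3,5), (3,7)
        println(pairs.subtractByKey(other).collect().toList)             // (1,2)
        println(pairs.filter { case (_, v) => v > 3 }.collect().toList)  // (3,4), (3,6)
        sc.stop()
      }
    }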

  • Example: per-key average (reduceByKey)

    // rdd is a Pair RDD of (key, value)
    rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

    mapValues: value -> (value, 1)

    reduceByKey: the (value, 1) pairs are summed into (total, count) per key
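
    The same computation end to end, as a runnable sketch; the sample records and the final division step are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object AverageByKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("avg-by-key").setMaster("local[*]"))
        val rdd = sc.parallelize(Seq(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

        val sumCount = rdd.mapValues(x => (x, 1))                            // value -> (value, 1)
                          .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // per key: (total, count)
        val averages = sumCount.mapValues { case (total, count) => total.toDouble / count }

        averages.collect().foreach(println)   // (panda,0.5), (pirate,3.0), (pink,3.5)
        sc.stop()
      }
    }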

  • combineByKey

    The most general per-key aggregation function; like aggregate(), it lets the result type differ from the input value type.

    combineByKey(createCombiner, mergeValue, mergeCombiners)

    createCombiner: called the first time a key is seen in a partition, to create that key's initial accumulator

    mergeValue: called when the key already has an accumulator in that partition, to fold a new value into it

    mergeCombiners: merges the accumulators built for the same key on different partitions

  • Example: per-key average (combineByKey)

    // input is a Pair RDD of (key, value)

    val result = input.combineByKey(
      (v) => (v, 1),                                     // createCombiner
      (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),  // mergeValue
      (acc1: (Int, Int), acc2: (Int, Int)) =>            // mergeCombiners
        (acc1._1 + acc2._1, acc1._2 + acc2._2)
    ).map { case (key, value) => (key, value._1 / value._2.toFloat) }

    result.collectAsMap().map(println(_))

  • Example: per-key average (combineByKey) - cont'd

  • Tuning the level of parallelism

    Every RDD has a fixed number of partitions, which determines the degree of parallelism.

    default: chosen from the parent RDD / cluster size.

    Most key-oriented transformations accept the number of partitions as an extra argument.

    -> repartition(): changes the partitioning outside of aggregations (it triggers a full shuffle); see the sketch below.

    val data = Seq(("a", 3), ("b", 4), ("a", 1))
    sc.parallelize(data).reduceByKey((x, y) => x + y)       // Default parallelism
    sc.parallelize(data).reduceByKey((x, y) => x + y, 10)   // Custom parallelism

  • Grouping and joining by key

    groupByKey: groups the values of one RDD into [K, Iterable[V]]

    cogroup: groups the values of two RDDs sharing a key type into [K, (Iterable[V], Iterable[W])]

    Two Pair RDDs can also be joined on their keys (see the sketch below):

    join: inner join; only keys present in both RDDs appear in the result

    left/right outer join: every key of the left/right RDD is kept, even without a match on the other side

    cross join: every combination of elements from the two RDDs
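
    A minimal sketch of grouping and joining; the sample pairs are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object JoinDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("join-demo").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
        val other = sc.parallelize(Seq((3, 9)))

        println(pairs.groupByKey().mapValues(_.toList).collect().toList)  // (1,List(2)), (3,List(4, 6))
        println(pairs.join(other).collect().toList)                       // (3,(4,9)), (3,(6,9))
        println(pairs.leftOuterJoin(other).collect().toList)              // (1,(2,None)), (3,(4,Some(9))), (3,(6,Some(9)))
        println(pairs.cogroup(other).collect().toList)                    // key 1 -> ([2], []), key 3 -> ([4,6], [9])
        sc.stop()
      }
    }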

  • Sorting Pair RDDs

    Sorted data can then be collected or saved in key order.

    sortByKey(): takes an ascending flag (true: ascending, the default) and an implicit Ordering for the key type.

    val input: RDD[(Int, Venue)] = ...
    // Sort the integer keys as if they were strings
    implicit val sortIntegersByString = new Ordering[Int] {
      override def compare(a: Int, b: Int) = a.toString.compare(b.toString)
    }
    rdd.sortByKey()

  • Pair RDD Actions

    All actions available on base RDDs also work; key-oriented actions are added (see the sketch below):

    countByKey(): counts the elements for each key

    collectAsMap(): collects the result as a map for easy lookup

    lookup(key): returns all values associated with the given key
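
    A minimal sketch of the three actions; the sample pairs are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object PairActions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pair-actions").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

        println(pairs.countByKey())    // Map(1 -> 1, 3 -> 2)
        println(pairs.collectAsMap())  // Map(1 -> 2, 3 -> 6): only one value kept per duplicate key
        println(pairs.lookup(3))       // all values for key 3: 4, 6
        sc.stop()
      }
    }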

  • Data partitioning

    Controlling how keys are laid out across nodes reduces network traffic for key-oriented operations that are run repeatedly, ex) join.

    Built-in partitioners: Hash, Range; custom partitioners can also be written (see the sketch below).
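
    A small sketch of attaching a partitioner to a Pair RDD; the sample keys and partition counts are illustrative assumptions.

    import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}

    object PartitionerDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partitioner-demo").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

        println(pairs.partitioner)   // None: no partitioner yet

        val hashed = pairs.partitionBy(new HashPartitioner(4)).persist()   // persist so the layout is reused
        println(hashed.partitioner)  // Some(HashPartitioner)

        val ranged = pairs.partitionBy(new RangePartitioner(4, pairs)).persist()
        println(ranged.partitioner)  // Some(RangePartitioner)
        sc.stop()
      }
    }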

  • Example: joining user data with event logs

    val sc = new SparkContext(...)
    // userData: a large table of user info, loaded once and reused
    val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()

    // Called every 5 minutes to process a new log file of events.
    def processNewLogs(logFileName: String) {
      val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
      val joined = userData.join(events)  // RDD of (UserID, (UserInfo, LinkInfo)) pairs
      val offTopicVisits = joined.filter {
        case (userId, (userInfo, linkInfo)) =>  // Expand the tuple into its components
          !userInfo.topics.contains(linkInfo.topic)
      }.count()
      println("Number of visits to non-subscribed topics: " + offTopicVisits)
    }

  • Example: joining user data with event logs (cont'd)

    This is inefficient: every call to processNewLogs() (every 5 minutes) hashes and shuffles both userData and events across the network, even though userData never changes.

  • Example: joining user data with event logs (cont'd)

    val sc = new SparkContext(...)
    val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                     .partitionBy(new HashPartitioner(100))  // Create 100 partitions
                     .persist()

    With userData hash-partitioned and persisted, only the events RDD needs to be shuffled to the userData nodes when the join runs.

  • Operations affected by partitioning, etc.

    Operations that benefit from a known partitioner: reduceByKey, combineByKey, lookup, ...

    cogroup and join also benefit: an RDD that is already partitioned is not shuffled again.

    Operations that set a partitioner on the resulting RDD:

    ex) join: the result is hash-partitioned by the join key.

    Operations that do not preserve the partitioner (X):

    ex) map: the keys could change, so the partitioner is dropped (mapValues keeps it); see the sketch below.
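
    A short sketch of partitioner preservation, contrasting mapValues and map; the sample pairs are illustrative assumptions.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionerPreservation {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("preserve-partitioner").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
                      .partitionBy(new HashPartitioner(4))

        println(pairs.mapValues(_ + 1).partitioner)                   // Some(HashPartitioner): keys unchanged, partitioner kept
        println(pairs.map { case (k, v) => (k, v + 1) }.partitioner)  // None: map might change keys, so it is dropped
        sc.stop()
      }
    }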

  • Example: PageRank

    // Load the link graph, hash-partition it, and keep it in memory
    val links = sc.objectFile[(String, Seq[String])]("links")
                  .partitionBy(new HashPartitioner(100))
                  .persist()

    // Initialize each page's rank to 1.0
    var ranks = links.mapValues(v => 1.0)

    // Run 10 iterations of PageRank
    for (i <- 0 until 10) {
      val contributions = links.join(ranks).flatMap {
        case (pageId, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }
      ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
    }

    // Write out the final ranks
    ranks.saveAsTextFile("ranks")

    Each page's rank starts at 1.0.

    On every iteration, each page p sends a contribution of rank(p) / numNeighbors(p) to each page it links to.

    Each page's rank is then updated to 0.15 + 0.85 * (sum of the contributions it received).

  • Example: custom partitioner (hash by domain name)

    class DomainNamePartitioner(numParts: Int) extends Partitioner {
      override def numPartitions: Int = numParts
      override def getPartition(key: Any): Int = {
        val domain = new java.net.URL(key.toString).getHost()
        val code = (domain.hashCode % numPartitions)
        if (code < 0) {
          code + numPartitions  // Make it non-negative
        } else {
          code
        }
      }
      // Java equals method, so Spark can compare our Partitioner objects
      override def equals(other: Any): Boolean = other match {
        case dnp: DomainNamePartitioner => dnp.numPartitions == numPartitions
        case _ => false
      }
    }

    To write a custom partitioner, extend org.apache.spark.Partitioner and implement:

    numPartitions: the number of partitions to create

    getPartition: the partition ID (0 to numPartitions - 1) for a given key

    equals: lets Spark test whether two RDDs were partitioned by the same partitioner (a usage sketch follows)
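
    A hypothetical usage sketch of the partitioner defined above; the sample URLs, hit counts, and partition count are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object DomainPartitionerDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("domain-partitioner").setMaster("local[*]"))
        // Hypothetical (url, hitCount) pairs
        val hits = sc.parallelize(Seq(
          ("http://spark.apache.org/docs", 10),
          ("http://spark.apache.org/examples", 3),
          ("http://example.com", 7)))

        // DomainNamePartitioner is the class from the previous slide:
        // all keys with the same domain land in the same partition.
        val partitioned = hits.partitionBy(new DomainNamePartitioner(20)).persist()
        println(partitioned.partitioner)                       // Some(DomainNamePartitioner)
        println(partitioned.reduceByKey(_ + _).collect().toList)  // runs without an extra shuffle
        sc.stop()
      }
    }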

  • Q&A

  • References

    Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.

    Spark homepage: http://spark.apache.org