Tachyon: Introduction and Application Summary - files.meetup.com/16395762/Tachyon_0.3.4.pdf · 2015-03-20


Software and Services

Tachyon: Introduction and Application Summary

Shi Mingfei (史鸣飞)

[email protected]


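Who are we?

Intel Big Data team

Among the earliest in China to take part in developing and promoting Spark

More than 10 active contributors across Spark and related projects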


Outline

Introduction to Tachyon

Tachyon application examples

Current state of Tachyon development

Intel's contributions to Tachyon


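Background: why Tachyon emerged

In the pursuit of speed, memory is king

Memory speed is improving far faster than disk speed

Memory price per unit of capacity keeps falling

Existing in-memory computing frameworks face several challenges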


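Background: why Tachyon emerged

Taking Spark as an example

Data sharing: the compute engine and the memory-management module live in the same process, so data can be shared between jobs (for example, Spark and Hadoop MR) only through shared storage such as HDFS.

[Figure: Spark tasks with an in-process Spark memory block manager (blocks 1 and 3) and Hadoop MR, both on YARN, exchanging blocks 1 to 4 through HDFS / Amazon S3.]

Cached data loss: cached data lives in the execution engine's JVM heap, so when the execution engine crashes, the cached data is lost.

[Figure: Spark tasks and their Spark memory block manager (blocks 1 and 3) crash together.]

GC overhead: data is cached in the JVM heap, so GC overhead grows rapidly with the application's running time and with the amount of data cached on the heap.

[Figure: two sets of Spark tasks, each with a Spark memory block manager (blocks 1 and 3, blocks 2 and 4), both undergoing garbage collection.]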


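The solution

Root cause:

No independent memory-management module that operates outside JVM management

Solution:

Tachyon, a memory-based distributed storage system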


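Tachyon

Design ideas:

1. Memory-based, off-heap distributed storage

2. Fault tolerance by keeping the data's lineage in the storage layer

Characteristics:

1. Only one copy of the data is kept in memory

2. When data is lost, it is recovered from its lineage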
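The lineage idea above can be illustrated with a small, self-contained Python sketch. It is not Tachyon's API: the `LineageStore` class and its method names are hypothetical, and source data is simply assumed to be durable somewhere else. The sketch keeps a single in-memory copy of each dataset and, when that copy is lost, replays the recorded recipe instead of reading a replica.

```python
from typing import Callable, Dict, List, Tuple


class LineageStore:
    """Hypothetical store: one in-memory copy per dataset, plus its lineage."""

    def __init__(self) -> None:
        self.memory: Dict[str, list] = {}
        # Dataset name -> (function that produces it, names of its inputs).
        self.lineage: Dict[str, Tuple[Callable[..., list], List[str]]] = {}
        self.recomputations = 0

    def put_source(self, name: str, data: list) -> None:
        # Assumption: source data is durable elsewhere, so it is modeled
        # as a zero-input recipe that returns a copy of the original.
        snapshot = list(data)
        self.lineage[name] = (lambda: list(snapshot), [])
        self.memory[name] = list(snapshot)

    def derive(self, name: str, fn: Callable[..., list], inputs: List[str]) -> None:
        # Record how the dataset is produced, then materialize it once.
        self.lineage[name] = (fn, list(inputs))
        self.memory[name] = fn(*[self.get(i) for i in inputs])

    def lose(self, name: str) -> None:
        # Simulate losing the only in-memory copy.
        self.memory.pop(name, None)

    def get(self, name: str) -> list:
        if name not in self.memory:
            # Recover by recursively recomputing from lineage.
            fn, inputs = self.lineage[name]
            self.recomputations += 1
            self.memory[name] = fn(*[self.get(i) for i in inputs])
        return self.memory[name]
```

Losing both a derived dataset and its parent costs two recomputations here, which is the trade-off of this design: no replicas in memory, but recovery time depends on how long the lineage chain is.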


Software stack

[Figure: software stack. Computing frameworks: Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, ... Storage systems: HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, ...]


Tachyon's basic architecture


Data fault tolerance

Metadata fault tolerance: journaling

Image

Editlog

File data fault tolerance: lineage
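As an illustration of the image-plus-edit-log pattern named above, here is a minimal Python sketch. It is not Tachyon's journal format: the `MetadataJournal` class, the JSON encoding, and the three operation names are assumptions made for the example. The image is a snapshot of the metadata, the edit log records operations since that snapshot, and recovery replays the log on top of the image.

```python
import json
from typing import Dict, List


class MetadataJournal:
    """Hypothetical metadata journal: a snapshot (image) plus an edit log."""

    def __init__(self) -> None:
        self.image: Dict[str, dict] = {}  # last checkpointed metadata
        self.editlog: List[str] = []      # serialized edits since the image

    def log(self, op: str, path: str, **attrs) -> None:
        # Append one edit; a real journal would also flush it to durable storage.
        self.editlog.append(json.dumps({"op": op, "path": path, "attrs": attrs}))

    def recover(self) -> Dict[str, dict]:
        # Start from a copy of the image and replay every logged edit in order.
        state = {path: dict(meta) for path, meta in self.image.items()}
        for line in self.editlog:
            entry = json.loads(line)
            if entry["op"] == "create":
                state[entry["path"]] = entry["attrs"]
            elif entry["op"] == "delete":
                state.pop(entry["path"], None)
            elif entry["op"] == "rename":
                state[entry["attrs"]["dst"]] = state.pop(entry["path"])
        return state

    def checkpoint(self) -> None:
        # Fold the edit log into a new image so later recovery replays less.
        self.image = self.recover()
        self.editlog = []
```

Checkpointing bounds recovery time: after `checkpoint()`, only edits made since the snapshot have to be replayed.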


Solving the existing problems

Data sharing: solved!

Cached data loss: solved!

GC overhead: solved!

[Figure: the three Spark diagrams from the background slide (Spark and Hadoop MR on YARN over HDFS / Amazon S3, a crashing Spark executor, and two Spark executors), with blocks 1 to 4 now held in Tachyon (in-memory) rather than inside the Spark memory block managers.]


Integrating Tachyon

MapReduce

Add the Tachyon jar to Hadoop, using one of the following:

    Option 1: export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}/pathToTachyon/target/tachyon-x.x.x-jar-with-dependencies.jar"
    Option 2: place the Tachyon jar in the $HADOOP_HOME/lib directory
    Option 3: ship the Tachyon jar as part of the application

Configure the Tachyon file system in core-site.xml:

    <property>
      <name>fs.tachyon.impl</name>
      <value>tachyon.hadoop.TFS</value>
    </property>

Load and write data through Tachyon from MapReduce:

    hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output

Spark

Configure the Tachyon URI in SparkConf:

    spark.tachyonStore.url=tachyon://TachyonMasterURI:port

Store an RDD's data in Tachyon:

    rdd.persist(StorageLevel.OFF_HEAP)

Configure the Tachyon file system through SparkContext:

    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

Load and write data through Tachyon from Spark:

    val rdd = sc.textFile("tachyon://TachyonMasterURI:port/Input")
    rdd.saveAsTextFile("tachyon://TachyonMasterURI:port/Output")




Tachyon application example: data sharing

[Figure: Thunderain pipeline. Event logs pass through Kafka (messaging/queue) to Spark Streaming (stream processing); in-memory tables are kept in Tachyon and HDFS tables in persistent storage; Shark/Spark SQL serves online analysis/dashboards and interactive query/BI.]

https://github.com/thunderain-project/thunderain


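Tachyon application example: off-heap storage

The N-degree cascade problem: in a directed graph, adjacent vertices are correlated, and the strength of the correlation is given by the edge weight, a floating-point value in (0, 1). For each vertex, find the M vertices with the highest correlation at step N, together with their correlation weights.

The N-degree cascade from X to Y is computed as:

$$\sum_{k=1}^{M} \mathrm{Weight}_k(X, Y)$$

Here M is the number of paths from X to Y that contain exactly n edges.

$\mathrm{Weight}_k(X, Y)$ is the weight of the k-th path from X to Y, computed as the product of the weights of all edges on that path:

$$\mathrm{Weight}_k(X, Y) = \prod_{e=1}^{n} w_e, \quad e \in \mathrm{edges}$$

Implemented on two parallel graph computing frameworks:

Bagel (Pregel on Spark)

GraphX

The same quantity as a recurrence, where the subscript is now the path length rather than the path index:

$$\mathrm{Weight}_1(u, v) = \mathrm{edge}(u, v) \in (0, 1)$$

$$\mathrm{Weight}_n(u, v) = \sum_{x \to v} \mathrm{Weight}_{n-1}(u, x) \cdot \mathrm{Weight}_1(x, v)$$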
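The recurrence for the N-degree cascade can be checked on a small graph with a single-machine Python sketch. This is not the Bagel or GraphX implementation from the talk: the function name `n_degree_cascade`, the edge-dictionary input format, and the `top_m` parameter are assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple


def n_degree_cascade(
    edges: Dict[Tuple[str, str], float], n: int, top_m: int
) -> Dict[str, List[Tuple[str, float]]]:
    """For each vertex u, return the top_m vertices v by Weight_n(u, v)."""
    # Weight_1(u, v) = edge(u, v)
    out_edges: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (u, v), w in edges.items():
        out_edges[u][v] = w
    current = {u: dict(vs) for u, vs in out_edges.items()}

    # Weight_n(u, v) = sum over x -> v of Weight_{n-1}(u, x) * Weight_1(x, v)
    for _ in range(n - 1):
        nxt: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
        for u, reach in current.items():
            for x, w_ux in reach.items():
                for v, w_xv in out_edges.get(x, {}).items():
                    nxt[u][v] += w_ux * w_xv
        current = nxt

    # Keep only the top_m highest-weight targets per source vertex.
    return {
        u: heapq.nlargest(top_m, reach.items(), key=lambda kv: kv[1])
        for u, reach in current.items()
        if reach
    }
```

For example, with edges a→b (0.5), a→c (0.4), b→d (0.5), c→d (0.5), the two 2-edge paths from a to d contribute 0.25 and 0.2, so the 2-degree cascade from a to d is about 0.45. The intermediate weight tables grow with each step, which is the kind of large, long-lived cached data the slide moves off-heap.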


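Tachyon application example: off-heap storage

N-degree correlation computation (Bagel & GraphX on Tachyon)

[Figure: two iteration loops, each with an off-heap cache in Tachyon. Bagel: compute and generate new messages, then aggregate and generate the next round of messages. GraphX: gather, apply (compute), scatter.]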


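Tachyon application example: remote data caching

The compute cluster reads data from a remote storage cluster and caches it in Tachyon, where it is then read and computed on many times

[Figure: a storage service (for example, S3) feeding a Spark/MR compute cluster (for example, EC2), with Tachyon caching the data.]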
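The remote-caching pattern amounts to a read-through cache placed next to the compute cluster. The Python sketch below shows only that pattern; `ReadThroughCache` and the `fetch_remote` callback are hypothetical names, not part of Tachyon or any S3 client.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Hypothetical local cache in front of a slow remote store."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote   # assumed remote read function
        self._blocks: Dict[str, bytes] = {}
        self.remote_reads = 0

    def read(self, key: str) -> bytes:
        if key not in self._blocks:
            # First access: pay the remote round trip once and keep the data.
            self.remote_reads += 1
            self._blocks[key] = self._fetch_remote(key)
        # Later accesses are served from local memory.
        return self._blocks[key]
```

The benefit scales with how many times each block is re-read; data read only once gains nothing from the extra copy.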


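Summary of Tachyon applications

Suitable scenarios

Intermediate results need to be shared between different applications or computing frameworks

Latency-sensitive applications that need fast responses

Large in-memory data sets with long-running or iterative computation

Large volumes of remote data that are accessed repeatedly

Limitations

Higher CPU load, because serialization and deserialization add overhead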




Tachyon's development status

The Tachyon project started in the summer of 2012 at UC Berkeley AMPLab

Reliable, Memory Speed Storage for Cluster Computing Frameworks (UC Berkeley EECS Tech Report)

Haoyuan Li, Ali Ghodsi, Matei Zaharia, Ion Stoica, Scott Shenker


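Tachyon's development status

Apache License 2.0; the latest official release is 0.6.0 (released March 2015), and the master branch is at 0.7.0-SNAPSHOT

Used at more than 50 companies, with over 60 contributors from more than twenty organizations

Spark and MapReduce programs can run on Tachyon without modification

Spark uses Tachyon as its default off-heap storage system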


Industry participation




Intel's contributions to Tachyon

3 contributors

More than 200 commits

Major feature modules

Multi-tier local storage

Improved availability and ease of use

Many bug fixes


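Multi-tier local storage

Uses SSD/HDD to extend Tachyon's storage space

A layered structure in which each tier has its own priority

Hot data is kept in the top tier; colder data is kept in lower tiers

A single tier can have multiple directories

When data is no longer hot, it can be moved from the top tier to a lower tier

Different eviction policies

When data becomes hot again, it can be moved from a lower tier back to an upper tier

[Figure: storage tiers MEM, SSD, HDD, ... Blocks are evicted from layer 0 to 1, from layer 1 to 2, from layer 2 to 3, and finally evicted out; hot blocks are promoted to the top layer.]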
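The evict-down and promote-up behavior of tiered storage can be sketched in a few lines of Python. This is an illustration only, not Intel's implementation: the `TieredStore` class is hypothetical, capacities are counted in blocks rather than bytes, and LRU is assumed as the eviction policy even though the slide only says that different policies exist.

```python
from collections import OrderedDict
from typing import Any, List, Optional


class TieredStore:
    """Hypothetical tiered block store, e.g. tiers = [MEM, SSD, HDD]."""

    def __init__(self, capacities: List[int]) -> None:
        self.capacities = list(capacities)  # max blocks per tier, top first
        # Each tier keeps blocks in LRU order: least recently used first.
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level: int, block_id: str, data: Any) -> None:
        if level >= len(self.tiers):
            return  # fell off the last tier: evicted out of the store
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)
        while len(tier) > self.capacities[level]:
            # Evict the least recently used block to the next tier down.
            victim_id, victim = tier.popitem(last=False)
            self._insert(level + 1, victim_id, victim)

    def put(self, block_id: str, data: Any) -> None:
        for tier in self.tiers:
            tier.pop(block_id, None)  # drop any stale copy first
        self._insert(0, block_id, data)

    def get(self, block_id: str) -> Optional[Any]:
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # hot again: promote to top
                return data
        return None

    def tier_of(self, block_id: str) -> Optional[int]:
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                return level
        return None
```

With three one-block tiers, writing blocks a, b, c in turn leaves c in tier 0, b in tier 1, and a in tier 2; reading b afterwards promotes it back to tier 0 and pushes the others down.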


Welcome Collaboration!

https://github.com/amplab/tachyon

Q&A
