Transcript of "Introduction to Voldemort", a slide deck by 唐福林 <iMobile> (http://blog.fulin.org/)

Page 1: Introduction to Voldemort

唐福林 <iMobile>

Page 3: Requirements

* Massive data (TB scale)
* Offline computation (Hadoop)
* Results consumed by the live site (read-only online)
* Daily updates (large daily data cycles)

Project Voldemort: the system built to deploy data to the live site; a key-value storage system.

Page 4: Requirements (cont.)

The biggest limitation used to be insufficient offline computing capacity. Hadoop solved that. The problem now is delivering data to the site: "Hadoop has been quite helpful in removing scalability problems in the offline portion of the system; but in doing so it creates a huge bottleneck in our ability to actually deliver data to the site."

Page 5: The existing approach

rsync, ftp, JDBC batch. Problems:

* Centralized and unscalable
* Indexes must be built on the live machines, which hurts serving

Page 6: Existing alternatives

Memcache:

* "mem": limited by memory
* "cache": volatile
* No bulk-load support

Page 7: Existing alternatives (cont.)

MySQL. InnoDB: too high space overhead. MyISAM:

* The site is read-only, so table locks are not a problem
* LOAD DATA LOCAL INFILE gives bulk loading
* Problem 1: building the index takes a long time and cannot be done on the live machines
* Problem 2: MySQL's concurrency is insufficient

Page 8: Wishful thinking

MySQL MyISAM: insert offline and build the index there, copy the database files to the live machines, and have them pick up the new files immediately (no restart required).

Problems:

* Needs extra machines just for index building
* The data gets copied multiple times
* Current MySQL does not support immediate pickup
* No compression support
* ...

Page 9: Goals

* Protect the live servers
* Horizontal scalability at each step
* Ability to roll back
* Failure tolerance
* Support large ratios of data to RAM

Page 10: Project Voldemort - overview

Page 11: Project Voldemort - store

The original plan was to design a custom storage engine with its own lookup and caching structures. Benchmarks showed that the main bottleneck is whether a read hits the page cache when fetching data; lookup is not the bottleneck, so it is not worth optimizing. The store therefore just mmaps the data file.

Per Amdahl's law (http://en.wikipedia.org/wiki/Amdahl%27s_law), only the parts that account for a large share of the total time are worth optimizing.
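Below is a minimal sketch of that "simply mmap the data file" idea, assuming a single read-only data file; the class and method names are illustrative, not Voldemort's actual code.

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Minimal sketch: no custom caching layer; the OS page cache decides
    // what stays in memory.
    public class MmapDataFile {
        private final MappedByteBuffer data;

        public MmapDataFile(String path) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel ch = raf.getChannel()) {
                // Map the whole file read-only; the mapping stays valid
                // after the channel is closed.
                this.data = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            }
        }

        // Read `size` bytes starting at `location`, as recorded in the index.
        public byte[] read(int location, int size) {
            byte[] value = new byte[size];
            ByteBuffer view = data.duplicate(); // independent read position
            view.position(location);
            view.get(value);
            return value;
        }
    }

Whether read(location, size) touches the disk at all then depends only on whether those pages are already in the page cache, which is exactly the bottleneck the benchmark identified.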

Page 12: Project Voldemort - store (2)

my_store/
    version-0/
        0.index
        0.data
        ...
        n.index
        n.data
    version-1/
        0.index
        0.data
        ...

Page 13: Project Voldemort - store (3)

* version-0: the current version
* .index: index files
* .data: raw data files
* 0..n: the data is split into chunks
* A .data file is at most 2 GB (Java's mmap offsets are 32-bit signed ints)

Update (sketched below):

* Upload a new copy of the data into a tmp directory
* Rename: v(n-1) -> v(n), ..., v(0) -> v(1), tmp -> v(0)
* Several versions are allowed to coexist on disk
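A sketch of that rotation in Java, assuming tmp already sits on the same partition as the store and that versions past the retention limit were deleted beforehand; names are illustrative.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Rename-based swap: v(n-1) -> v(n), ..., v(0) -> v(1), tmp -> v(0).
    // Each rename is atomic because everything is on the same partition.
    public class VersionSwap {
        public static void swapIn(Path storeDir, Path tmp, int versions) throws IOException {
            // Shift existing versions up, oldest first.
            for (int v = versions - 1; v >= 0; v--) {
                Path from = storeDir.resolve("version-" + v);
                if (Files.exists(from)) {
                    Files.move(from, storeDir.resolve("version-" + (v + 1)),
                               StandardCopyOption.ATOMIC_MOVE);
                }
            }
            // The freshly uploaded data becomes the current version.
            Files.move(tmp, storeDir.resolve("version-0"),
                       StandardCopyOption.ATOMIC_MOVE);
        }
    }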

Page 14: Project Voldemort - store (4)

Page 15: Project Voldemort - store (5)

Each key/value pair has a fixed overhead of exactly 24 bytes in addition to the length of the value itself: a 16-byte md5 of the key + a 4-byte location + a 4-byte size. The i-th index entry lives at offset 20 * i, with no internal pointers.

The only open question: how to organize the keys inside the .index file. The index files are generated on Hadoop, and the map/reduce jobs should use as little memory as possible while building them. An index file is small (the .data file is only 2 GB), so it should fit in the page cache, and its internal organization hardly matters.

Conclusion: simply sort the entries and use binary search at read time (sketched below).
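A sketch of that lookup under the stated layout (sorted 20-byte entries, the mmapped .index file passed in as a ByteBuffer); it mirrors the description above rather than Voldemort's exact code.

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;

    public class IndexSearch {
        private static final int ENTRY = 20; // 16-byte md5 + 4-byte location
        private static final int KEY = 16;

        // Returns the location stored for `key`, or -1 if the key is absent.
        public static int lookup(ByteBuffer index, byte[] key) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(key);
            int lo = 0, hi = index.limit() / ENTRY - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int cmp = compare(index, mid * ENTRY, md5);
                if (cmp == 0) return index.getInt(mid * ENTRY + KEY);
                if (cmp < 0) lo = mid + 1;   // stored key is smaller: go right
                else hi = mid - 1;           // stored key is larger: go left
            }
            return -1;
        }

        // Compare the 16 stored bytes at `off` with `md5` as unsigned values.
        private static int compare(ByteBuffer index, int off, byte[] md5) {
            for (int i = 0; i < KEY; i++) {
                int a = index.get(off + i) & 0xFF, b = md5[i] & 0xFF;
                if (a != b) return a - b;
            }
            return 0;
        }
    }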

Page 16: Project Voldemort - store (6)

A further consideration: with small values and very many keys, say 100 million entries, binary search needs 27 comparisons, which is expensive relative to a single read and therefore worth optimizing.

When the index is entirely uncached (after an update or rollback), paging the 100 million entry index for a chunk into memory will require 500k page faults no matter what the structure is. However, it would be desirable to minimize the maximum number of page faults incurred on a given request, to minimize the variance of the request time.

* One attempt: a page-organized tree
* Exploit the properties of md5 to modify the binary search (not yet implemented; one possible shape is sketched below)
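The talk says this second idea was not implemented yet; purely as a hypothetical illustration, the uniform distribution of md5 values would let the first probe jump to an estimated position instead of the middle of a cold index.

    // Hypothetical: not implemented at the time of this talk.
    public class InterpolatedProbe {
        // Estimate the entry index of `md5` among `n` sorted entries by
        // treating its first 4 bytes as an unsigned fraction of the key space.
        static int firstProbe(byte[] md5, int n) {
            long prefix = ((long) (md5[0] & 0xFF) << 24)
                        | ((md5[1] & 0xFF) << 16)
                        | ((md5[2] & 0xFF) << 8)
                        |  (md5[3] & 0xFF);
            return (int) (prefix * n / (1L << 32));
        }
        // Binary search then proceeds in a small window around this estimate,
        // bounding the worst-case page faults per request more tightly than
        // probing from the middle.
    }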

Page 17: Project Voldemort - build

* A single-process command-line Java program, for testing
* A distributed Hadoop-based store builder

A user-extensible Mapper extracts keys from the source data. A custom Hadoop Partitioner then applies the Voldemort consistent hashing function to the keys and assigns all keys mapped to a given node and chunk to a single reduce task. The shuffle phase of the map/reduce copies all values with the same destination node and chunk to the same reduce task; values are sorted by Hadoop and grouped by key. Each reduce task then creates one .index and one .data file for a given chunk on a particular node. A partitioner in this spirit is sketched below.
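This sketch uses Hadoop's org.apache.hadoop.mapreduce API; nodeFor/chunkFor are hypothetical stand-ins for Voldemort's consistent hashing, and the cluster sizes are made up.

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // All keys routed to the same (node, chunk) must land in the same
    // reduce task, so one reducer writes one .index/.data pair.
    public class NodeChunkPartitioner extends Partitioner<BytesWritable, BytesWritable> {
        private static final int NODES = 10;           // illustrative
        private static final int CHUNKS_PER_NODE = 90; // illustrative

        @Override
        public int getPartition(BytesWritable key, BytesWritable value, int numReduceTasks) {
            int node = nodeFor(key);   // consistent hash -> node
            int chunk = chunkFor(key); // hash -> chunk on that node
            return (node * CHUNKS_PER_NODE + chunk) % numReduceTasks;
        }

        private int nodeFor(BytesWritable key) {
            return (key.hashCode() & 0x7FFFFFFF) % NODES;
        }

        private int chunkFor(BytesWritable key) {
            return ((key.hashCode() & 0x7FFFFFFF) / NODES) % CHUNKS_PER_NODE;
        }
    }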

Page 18: Project Voldemort - deployment

Update:

* Upload a new copy of the data into a tmp directory
* Rename: v(n-1) -> v(n), ..., v(0) -> v(1), tmp -> v(0)
* The renames are atomic (provided tmp is on the same disk partition)

Upload:

* rsync: the diff computation burns CPU on the live machines, and HDFS is not supported, so data must first be copied to somewhere like ext3
* Push vs. pull: pull needs extra triggers
* Transfers are rate-limited

Page 19: Project Voldemort - benchmarks

Build time for a store in Hadoop:

* 100GB: 28 mins (400 mappers, 90 reducers)
* 512GB: 2 hrs 16 mins (2313 mappers, 350 reducers)
* 1TB: 5 hrs 39 mins (4608 mappers, 700 reducers)

Request rate a node can sustain once live:

                              MySQL      Voldemort
    Reqs per sec.             727        1291
    Median req. time          0.23 ms    0.05 ms
    Avg. req. time            13.7 ms    7.7 ms
    99th percentile req. time 127.2 ms   100.7 ms

Page 20: Project Voldemort - benchmarks (2)

Factors that affect the results:

* The ratio of data to memory
* The performance of the disk subsystem
* The entropy of the request stream (random or organized; it determines the cache miss rate)

Page 21: Project Voldemort - Future

Incremental data updates: ship a diff file to save network transfer.

* .index files are sorted, so their diffs are large, but the files themselves are small
* .data files are unsorted, so new content can simply be appended (what about the 2 GB limit?)

Two options:

* version-0 = version-1 + diff patch: costs disk I/O to apply
* version-0 = the diff patch itself: reads fall through to version-1 ... version-n, which complicates the read path; keep a Bloom filter tracking which keys are in each day's patch (a sketch follows). What about rollback?
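A minimal hand-rolled sketch of that per-patch Bloom filter; the sizes and md5-based hashing are illustrative choices, not a stated part of the design.

    import java.security.MessageDigest;
    import java.util.BitSet;

    // One filter per daily patch; a negative answer means the patch
    // definitely does not contain the key and can be skipped.
    public class PatchFilter {
        private static final int K = 4; // bit positions per key
        private final BitSet bits;
        private final int m;            // total bits

        public PatchFilter(int numBits) {
            this.m = numBits;
            this.bits = new BitSet(numBits);
        }

        public void add(byte[] key) throws Exception {
            for (int p : positions(key)) bits.set(p);
        }

        // True means "might be in the patch": the patch must be searched.
        public boolean mightContain(byte[] key) throws Exception {
            for (int p : positions(key)) if (!bits.get(p)) return false;
            return true;
        }

        // Derive K positions from md5(key), 4 bytes each (16 bytes total).
        private int[] positions(byte[] key) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(key);
            int[] pos = new int[K];
            for (int i = 0; i < K; i++) {
                int h = ((md5[4 * i] & 0xFF) << 24) | ((md5[4 * i + 1] & 0xFF) << 16)
                      | ((md5[4 * i + 2] & 0xFF) << 8) | (md5[4 * i + 3] & 0xFF);
                pos[i] = (h & 0x7FFFFFFF) % m;
            }
            return pos;
        }
    }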

Page 22: Project Voldemort - Future (2)

Improved key hashing:

* Replicating at the file level: redundancy factor 2
* Replicating at the chunk level: redundancy factor < 2

Page 23: Project Voldemort - Future (3)

Compression. Requirement: fast decompression speed. Candidate: LZO compression.

Page 24: Project Voldemort - Future (4)

Better indexing:

* Probabilistic binary search
* A 204-way page-aligned tree
* Cache-oblivious algorithms, e.g. a van Emde Boas tree layout
* An on-disk hash-based lookup structure

Page 25: Closing

* Not directly useful to iMobile right now, because we have no such need
* Some of the practices, such as the deployment considerations, could be borrowed for Search 2.0
* Hadoop is a good thing and worth watching
* Finding the system's bottleneck matters, even though it is hard

Page 26: About

* iMobile: http://www.imobile.com.cn
* Team: http://team.imobile.com.cn
* Me: http://blog.fulin.org
* Twitter: http://twitter.com/tangfl