BigTABLE - boun.edu.tr

Page 1: BigTABLE - boun.edu.tr

BigTABLE: A Distributed Storage System for Structured Data

Rumeysa Çamdereli, Neslihan Kaplan, Dila Taşer

Page 2: BigTABLE - boun.edu.tr

OUTLINE

§ What is BigTable?

§ Data Model

§ Client API

§ Google infrastructure

§ Fundamentals of BigTable implementation

§ Performance issues (refinements, advantages & disadvantages)

§ Examples

§ Conclusion

Page 3: BigTABLE - boun.edu.tr

WHAT IS BIGTABLE?

Distributed storage system for structured data

Designed for very large data sets

- Petabytes of data

- Thousands of commodity servers

Example usage: Google Earth, Google Finance, etc.

Achieves wide applicability, scalability, high performance, and high availability

Page 4: BigTABLE - boun.edu.tr

DATA MODEL

BigTable: Sparse, distributed, persistent, multidimensional sorted map

Indexed by row key (string), column key (string), and timestamp (int64)

Web table example:
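A minimal Python sketch of the data model, assuming a toy in-memory dictionary. The class name TinyTable and its methods are hypothetical and are not BigTable's client API; the row and column keys follow the Webtable example (reversed host name as row key, "family:qualifier" column keys).

```python
from typing import Dict, List, Optional, Tuple

# (row key, column key, timestamp) -> uninterpreted bytes
CellKey = Tuple[str, str, int]


class TinyTable:
    """Toy in-memory model of the sorted map; hypothetical, not the real API."""

    def __init__(self) -> None:
        # Sparse: only cells that were actually written take up space.
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the newest version at or before `timestamp` (newest if None)."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column and (timestamp is None or ts <= timestamp)
        ]
        return max(versions)[1] if versions else None

    def row_keys(self) -> List[str]:
        """Row keys in lexicographic order, as a sorted map would return them."""
        return sorted({r for (r, _, _) in self._cells})


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
table.put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
latest = table.get("com.cnn.www", "contents:")
```

Reading without a timestamp returns the newest version of the cell; passing a timestamp returns the newest version that is not newer than it.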

Page 5: BigTABLE - boun.edu.tr

DATA MODEL (2)

In Webtable, row keys are URLs with reversed host name components

Pages in the same domain are therefore stored close together
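The effect of reversing host name components can be sketched in a few lines of Python. The helper name row_key is hypothetical; it only illustrates why pages of one domain end up adjacent once the keys are sorted.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a Webtable-style row key: reversed host name followed by the path."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


urls = [
    "http://www.google.com/",
    "http://www.cnn.com/",
    "http://maps.google.com/index.html",
]
keys = sorted(row_key(u) for u in urls)
# Both google.com pages are now neighbours in the sorted order.
```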

Column families

- Web table example: anchor and language

- Number of distinct column families must be small! (in hundreds at most)

- Access control and disk and memory accounting are performed at the column-family level

Timestamps

- Identify different versions of the same data

- Assigned by BigTable (real time) or explicitly by the client (the client must then avoid collisions)

- Garbage collection: only the last n versions of a cell are kept
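The "keep the last n versions" rule can be sketched as follows. This is an illustration under the assumption that the versions of one cell are held in a dictionary keyed by timestamp; the function name keep_last_n is hypothetical.

```python
from typing import Dict


def keep_last_n(versions: Dict[int, bytes], n: int) -> Dict[int, bytes]:
    """Keep only the n newest versions of one cell (largest timestamps win)."""
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}


cell_versions = {1: b"v1", 2: b"v2", 3: b"v3"}
kept = keep_last_n(cell_versions, 2)
```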

Page 6: BigTABLE - boun.edu.tr

CLIENT API

Provides functions for creating and deleting tables and column families

Also supports changing cluster, table, and column family metadata

Scans can be limited, e.g. to only the anchors whose timestamps fall within the last ten days

Other features:

- Single-row transactions

- Cells can be used as integer counters

- Execution of client-supplied scripts in the address spaces of the servers
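The limited scan mentioned above can be illustrated with a plain Python filter. This is a sketch, not the client API: the function scan_recent_anchors, the tuple layout, and the use of microsecond timestamps are assumptions made for the example.

```python
from typing import List, Tuple

# (row key, column key, timestamp in microseconds, value); layout is assumed
Cell = Tuple[str, str, int, bytes]

TEN_DAYS_US = 10 * 24 * 60 * 60 * 1_000_000


def scan_recent_anchors(cells: List[Cell], row: str, now_us: int) -> List[Cell]:
    """Return anchor cells of `row` whose timestamps fall within ten days of now."""
    oldest = now_us - TEN_DAYS_US
    return [
        cell for cell in cells
        if cell[0] == row
        and cell[1].startswith("anchor:")
        and oldest <= cell[2] <= now_us
    ]


now = 100 * TEN_DAYS_US
cells = [
    ("com.cnn.www", "anchor:cnnsi.com", now - 1, b"CNN"),
    ("com.cnn.www", "anchor:my.look.ca", now - 2 * TEN_DAYS_US, b"CNN.com"),
    ("com.cnn.www", "contents:", now - 1, b"<html>...</html>"),
]
recent = scan_recent_anchors(cells, "com.cnn.www", now)
```

Only the first cell survives: the second anchor is too old, and the third cell belongs to a different column family.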

Page 7: BigTABLE - boun.edu.tr

INFRASTRUCTURE

Built on some other pieces of Google infrastructure

Uses Google File System (GFS) to store log and data files

SSTable (Sorted String Table) file format is used to store data

An SSTable is a persistent, ordered map from keys to values, both being arbitrary byte strings

Each SSTable contains a sequence of blocks

A block index is used to locate blocks
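How a block index narrows a lookup to one block can be sketched with a binary search. The class TinySSTable is hypothetical, and its blocks hold a fixed number of entries for simplicity, whereas real blocks are sized in bytes.

```python
import bisect
from typing import List, Optional, Tuple


class TinySSTable:
    """Immutable sorted key/value pairs split into blocks, plus a block index."""

    def __init__(self, items: List[Tuple[bytes, bytes]], entries_per_block: int = 2):
        ordered = sorted(items)
        self.blocks = [
            ordered[i:i + entries_per_block]
            for i in range(0, len(ordered), entries_per_block)
        ]
        # The index stores the last key of every block.
        self.index = [block[-1][0] for block in self.blocks]

    def get(self, key: bytes) -> Optional[bytes]:
        # Binary search: first block whose last key is >= the wanted key.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None  # key is larger than every key in the table
        for k, v in self.blocks[i]:  # only this one block is read
            if k == key:
                return v
        return None


sstable = TinySSTable([(b"a", b"1"), (b"b", b"2"), (b"c", b"3"),
                       (b"d", b"4"), (b"e", b"5")])
value = sstable.get(b"c")
```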

Page 8: BigTABLE - boun.edu.tr

INFRASTRUCTURE (2)

Uses Chubby, a distributed lock service

What is a lock service?

- provides distributed software applications with a means to synchronize their accesses to shared resources

A Chubby service consists of 5 replicas; one is elected master and serves requests

BigTable uses Chubby for:

§ Ensuring there is at most one active master

§ Storing the location of BigTable data

§ Discovering tablet servers and finalizing tablet server deaths

§ Storing BigTable schema information (column family information)

§ Storing access control lists (a list of permissions attached to an object)

Page 9: BigTABLE - boun.edu.tr

IMPLEMENTATION

Has 3 major components:

§ A library linked to every client

§ One master server

§ Tablet servers

Page 10: BigTABLE - boun.edu.tr

IMPLEMENTATION (2)

Master server:

§ Assigning tablets to tablet servers

§ Detecting the addition and expiration of tablet servers

§ Balancing tablet-server load

§ Garbage collection of files in GFS

Tablet server:

§ Manages a set of tablets

§ Handles read/write requests

§ Splits tablets that have grown too large

Client library caches tablet locations

A BigTable cluster stores a number of tables; each table consists of tablets, and each tablet contains the data associated with a row range

Page 11: BigTABLE - boun.edu.tr

Tablet Location

3-level hierarchy to store tablet location information

Level 1: a file stored in Chubby holds the location of the root tablet

Level 2: the root tablet holds the location of all tablets of the METADATA table

- The root tablet is never split

Level 3: each METADATA tablet holds the locations of a set of user tablets
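The three-level lookup can be sketched as two chained range searches that start from the Chubby file. Everything here is a toy: the class LocationTablet, the "table/row" key encoding, the END sentinel, and the server names are assumptions made for illustration.

```python
import bisect
from typing import Any, List, Tuple

END = "\uffff"  # sentinel that sorts after ordinary row keys in this toy example


class LocationTablet:
    """Maps the END row key of a range to a pointer (a tablet or a server name)."""

    def __init__(self, entries: List[Tuple[str, Any]]):
        self.entries = sorted(entries, key=lambda e: e[0])
        self.end_keys = [end_key for end_key, _ in self.entries]

    def lookup(self, key: str) -> Any:
        # First range whose end key is >= the wanted key.
        return self.entries[bisect.bisect_left(self.end_keys, key)][1]


# Level 3: METADATA tablets point at the servers holding user tablets.
meta1 = LocationTablet([("webtable/com.g", "tabletserver-1"),
                        ("webtable/com.z", "tabletserver-2")])
meta2 = LocationTablet([("webtable/org.m", "tabletserver-3"),
                        (END, "tabletserver-4")])
# Level 2: the root tablet points at METADATA tablets.
root = LocationTablet([("webtable/com.z", meta1), (END, meta2)])
# Level 1: a Chubby file points at the root tablet.
chubby_file = {"root_tablet": root}


def locate(table: str, row: str) -> str:
    key = f"{table}/{row}"
    metadata_tablet = chubby_file["root_tablet"].lookup(key)
    return metadata_tablet.lookup(key)


server = locate("webtable", "com.cnn.www")
```

A client library that caches the results of locate would skip all three levels on later requests for the same row range.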

Page 12: BigTABLE - boun.edu.tr

Tablet Assignment

Uses Chubby to keep track of tablet servers

Each tablet is assigned to one tablet server at a time

Master keeps track of assignments

When a tablet is unassigned and a tablet server is available, the master assigns the tablet to that server

Page 13: BigTABLE - boun.edu.tr

Tablet Serving

Persistent state of tablet stored in GFS

Updates are committed to a commit log that stores redo records

§ Redo points: pointers into the commit logs that may contain data for the tablet; used to recover a tablet

Recently committed updates are kept in memory, in the memtable

Older updates are stored in SSTables
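The write and recovery paths can be sketched as follows. This is a simplified illustration: the class TabletSketch is hypothetical, a Python list stands in for the commit log file in GFS, and recovery replays the whole log instead of starting from redo points.

```python
from typing import Dict, List, Optional, Tuple

LogRecord = Tuple[str, bytes]  # (key, value) redo record


class TabletSketch:
    def __init__(self, commit_log: Optional[List[LogRecord]] = None) -> None:
        # The list stands in for a commit log file stored in GFS.
        self.commit_log: List[LogRecord] = commit_log if commit_log is not None else []
        self.memtable: Dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        self.commit_log.append((key, value))  # 1. make the update durable
        self.memtable[key] = value            # 2. apply it in memory

    @classmethod
    def recover(cls, commit_log: List[LogRecord]) -> "TabletSketch":
        """Rebuild the memtable by replaying redo records in log order."""
        tablet = cls(commit_log)
        for key, value in commit_log:
            tablet.memtable[key] = value
        return tablet


tablet = TabletSketch()
tablet.write("com.cnn.www", b"v1")
tablet.write("com.cnn.www", b"v2")
recovered = TabletSketch.recover(list(tablet.commit_log))
```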

Page 14: BigTABLE - boun.edu.tr

Compactions

Write operations increase the size of the memtable

When the memtable grows too big, a new memtable is created; the old one is converted to an SSTable and written to GFS

This shrinks the memory usage of the tablet server
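The compaction step can be sketched as follows. The class CompactingTablet and its entry-count threshold are assumptions for illustration; a tuple of sorted pairs stands in for an SSTable written to GFS.

```python
from typing import Dict, List, Optional, Tuple

SSTable = Tuple[Tuple[str, bytes], ...]  # immutable, sorted by key


class CompactingTablet:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, bytes] = {}
        self.sstables: List[SSTable] = []  # oldest first, newest last
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: bytes) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.compact()

    def compact(self) -> None:
        """Freeze the memtable into a new SSTable and start an empty memtable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str) -> Optional[bytes]:
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


tablet = CompactingTablet(memtable_limit=2)
tablet.write("a", b"1")
tablet.write("b", b"2")   # memtable is full: converted to an SSTable
tablet.write("a", b"3")   # newer value lives in the fresh memtable
```

Reads have to consult both the memtable and the SSTables, which is why the newest copy of a key must be checked first.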

Page 15: BigTABLE - boun.edu.tr

REFINEMENTS

• Group multiple column families together into a locality group.

• Clients choose whether the SSTables for a locality group are compressed (compression can give a 10-to-1 reduction in space).

• Single commit log per tablet server, mixing changes for different tablets in the same physical log file.

Page 16: BigTABLE - boun.edu.tr

PERFORMANCE

• Random reads involve the transfer of a 64KB SSTable block, but only a 1KB value is used.

• Sequential reads perform better than random reads, since each fetched 64KB SSTable block is stored in the block cache and serves the following reads.

• Random and sequential writes are both appended to a single commit log.

• Scans are even faster, since the tablet server can return a large number of values in response to a single request.
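The gap between random and sequential reads follows from simple arithmetic on the block and value sizes quoted above. The sketch below ignores key and index overhead, which is an assumption made to keep the numbers round.

```python
BLOCK_SIZE_KB = 64   # SSTable block size from the slide
VALUE_SIZE_KB = 1    # value size from the slide

# Random read: a whole block is transferred to use a single value.
useful_fraction = VALUE_SIZE_KB / BLOCK_SIZE_KB

# Sequential read: one cached block can answer this many consecutive reads.
reads_per_block = BLOCK_SIZE_KB // VALUE_SIZE_KB

print(f"useful share of a random read: {useful_fraction:.1%}")
print(f"sequential reads served per block fetch: {reads_per_block}")
```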

Page 17: BigTABLE - boun.edu.tr

REAL APPLICATIONS

Google Analytics

§ Service for analyzing traffic patterns of web sites for webmasters.

§ 2 tables are used by Google Analytics.

§ The raw click table maintains a row for each end-user session.

§ The summary table contains predefined summaries for each web site.

Page 18: BigTABLE - boun.edu.tr

REAL APPLICATIONS (2)

Google Earth

§ Provides users access to high-resolution satellite imagery of the world's surface.

§ One table to preprocess data.

§ Different set of tables for serving client data.

§ Preprocessing pipeline uses one table to store raw imagery, approximately 70 TB of data, so it is served from disk.

§ Serving system uses one table to index data, which is relatively small (about 500 GB).

Page 19: BigTABLE - boun.edu.tr

ADVANTAGES OF BIG TABLE

• A special query language is not needed, so there is no query language to tune for query optimization.

• Operations are performed only at the level of a single row, so join operations are not required.

• Tablets are kept accessible to all the servers in the BigTable system.

• Each update is also kept in a commit log that is accessible to all servers.

• If one server fails, another server can take over its role.

Page 20: BigTABLE - boun.edu.tr

ADVANTAGES OF BIG TABLE (2)

• There is no limit for row length.

• Unlimited number of connections can be kept for each record.

• With this approach, disk access is reduced.

• Cost is low compared with an RDBMS.

• Offers high availability.

• High performance in reading, writing, and updating data.

Page 21: BigTABLE - boun.edu.tr

DISADVANTAGES OF BIG TABLE

• Data loss can occur.

• Lack of advanced features for data security.

• Possibility of multiple copies of same data.

• Secondary indexes are not supported.

Page 22: BigTABLE - boun.edu.tr

CONCLUSION

Richer than simple key-value pairs; supports semi-structured data

- simple enough to lend itself to a flat-file representation

- transparent enough to allow users to tune important behaviours of the system

Problems detected in the year 2006

- does not consider multiple copies of the same data

- users must tell the system which data belongs in memory and which on disk; this is not determined dynamically

- no complex queries to execute or optimize