The Art of Big Data
Transcript of The Art of Big Data
Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Post Graduate School
Nov 29,2011
The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude
What is Big Data ?
Big Data to smart data
Big Data Pipeline
Analytic Algorithms
Storage - NOSQL
Processing - Hadoop …
Analytics/Modeling
R
Visualization
o Agenda
o To cover the broad picture
o Understand the waypoints &
o Drill down into one area (NOSQL)
o Can do others later …
o Of the Big Data domain …
Thanks to … The giants whose shoulders I am
standing on
Special Thanks to:
Peter Ateshian, NPS
Prof Murali Tummala, NPS
Shirley Bailes, O’Reilly
Ed Dumbill, O’Reilly
Jeff Barr, AWS
Jenny Kohr Chynoweth, AWS
When I think of my own native land, In a moment I seem to be there;
But, alas! recollection at hand Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk
What is Big Data ? “Big data” is data that becomes large enough that it cannot be processed using conventional methods. @twitter
Ref: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
“Big data” is less about size, more
about flow & velocity - persisting
petabytes per year is easier than
processing terabytes per hour. @twitter
What is Big Data ?
Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/ http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/
Vinod Khosla’s Cool Dozen !
① Consumers : “Widespread innovation in technologies that reduce data overload for users” ~ Data Reduction
② Businesses : “Simple solutions to handle the deluge of data generated from various sources …” ~ Big Data Analytics
TV 2.0, Education, Social NEXT, Tools for sharing interest, Publishing, …
① Volume o Scale
② Velocity o Data change rate vs. decision window
③ Variety o Different sources & formats o Structured vs. Unstructured
④ Variability o Breadth of interpretation & o Depth of analytics
⑤ Contextual o Dynamic variability o Recommendation
⑥ Connectedness
EBC322
http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/ http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
I. Two Main Types – based on collection
i. Big Data Streams o Data in “motion” o Twitter fire hose, Facebook, G+
ii. Big Data Logs o Data “at rest” o Logs, DW, external market data, POS, …
II. Typically, Big Data has a non-deterministic angle as well … o Creative Discovery o Iterative, Model based Analytics o Explore questions to ask
III. Smart Data = Big Data + context + embedded/interactive (inference, reasoning) models o Model Driven o Declaratively Interactive
http://www.slideshare.net/leonsp/hadoop-slides-11-what-is-big-data http://www.slideshare.net/Dataversity/wed-1550-bacvanskivladimircolor
Twitter § 200 million tweets/day § Peak 10,000/second § How would you handle the fire
hose for social network analytics ?
http://goo.gl/dcBsQ
Storage § 4 U box = 40 TB, § 1 PB = 25 boxes !
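The Twitter and storage figures above invite a quick back-of-the-envelope check. The sketch below only does arithmetic on the numbers quoted on the slides; the variable names are illustrative, and the decimal petabyte (1 PB = 1000 TB) is an assumption that makes the "25 boxes" figure work out.

```python
# Back-of-the-envelope arithmetic on the slide figures; nothing here is measured.
TWEETS_PER_DAY = 200_000_000
PEAK_TWEETS_PER_SEC = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average fire-hose rate and how far the quoted peak sits above it.
avg_tweets_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_tweets_per_sec

# Storage: 40 TB per 4U box, assuming a decimal petabyte.
TB_PER_BOX = 40
TB_PER_PB = 1000
boxes_per_pb = TB_PER_PB / TB_PER_BOX

print(f"average rate: {avg_tweets_per_sec:,.0f} tweets/s")
print(f"peak / average: {peak_to_avg:.1f}x")
print(f"boxes per PB: {boxes_per_pb:.0f}")
```

The gap between average and peak is the point: a pipeline sized for the daily average would fall behind at the peak, which is the "velocity" concern rather than the "volume" one.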
Zynga § “Analytics company, not a
gaming company!” § Harvests data : 15 TB/day
§ Test new features § Target advertising
§ 230 million players/month
AWS – 600 Billion objects!
• 6 Billion Messages per day
• 2 PB (w/compression) online
• 6 PB w/ replication
• 250 TB/Month growth
• HBase Infrastructure
Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
Path Analysis A/B Testing
50 TB/Day 240 nodes, 84 PB Teradata Installation
Very systematic Diagram speaks volumes!
• “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM 760 servers, in 10 racks
• 1 TB of dataset
• 200 Million pages processed by Hadoop
• This is a good example of Connected data
– Contextual w/ variability
– Breadth of interpretation
– Analytics depth
http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/ http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
Storage
Parallelism
Inference
NOSQL
HPC
Map/Reduce
Object Store
Block Store
Analytics
Web Analytics
Log Analytics
Social Media
Social Graph
Knowledge Graph
Distributed Applications
Warehouse-style Applications
Recommendation/Inference Engines Machine Learning
Classification, Clustering
Search, Indexing
Mahout
Cloud Architecture
Big Data
Big Data to Smart Data
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
Big data to smart data • summary
1 Don’t throw away any data !
2 Be ready for different ways of organizing the data
http://goo.gl/fGw7r
Big Data Pipeline
If a problem has no solution, it is not a problem, but a fact, not to be solved but to be coped with, over time …
- Peres’s Law
Big Data Pipeline • Stages
o Collect o Store o Transform & Analyze o Model & Reason o Predict, Recommend & Visualize
• Different systems have different characteristics o Infrastructure optimization based on application/hardware
attribute correlation (short term) • Hadoop, Splunk, internal Dashboard
o Application performance trends (medium term) • Analytics, Modeling,…
o Product Metrics • Feature set vs. usage, what is important to users, stratification • Modeling using R, Visualization layers like Tableau
Volume
Velocity
Variety
Variability
Connectedness
Context
Model
Infer-ability
Big Data Pipeline
Decomplexify! Contextualize! Network! Reason! Infer!
Logs, Scribe, Flume, Hadoop…
SQL, NOSQL, HDFS, XML, files, …
SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, Various other tools
Hand coded Programs, R, Mahout, …
Internal dashboards, Tableau
Ref: http://goo.gl/Mm83k
The NOSQL !
I AM monarch of all I survey; My right there is none to dispute;
From the centre all round to the sea I am lord of the fowl and the brute
- Cowper, The Solitude Of Alexander Selkirk
Build to Fail - “It is working” is not binary
Agenda • Opening Gambit
– NOSQL : Toil, Tears & Sweat ! • The Pragmas
– ABCs of NOSQL [ACID, BASE & CAP] • The Mechanics
– Algorithmics & Mechanisms (For reference)
Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/
What is NOSQL Anyway ?
• NOSQL != NoSQL or NOSQL != (!SQL)
• NOSQL = Not Only SQL
• Can be traced back to Eric Evans[2]!
– You can ask him during the afternoon session!
• Unfortunate Name, but is stuck now
• Non Relational could have been better
• Usually Operational, Definitely Distributed
• NOSQL has certain semantics – need not stay that way
Key Value Column Document Graph
Ref: [22,51,52]
NOSQL
Neo4j
FlockDB
InfiniteGraph
CouchDB
MongoDB
Lotus Domino
Riak
Google BigTable
HBase
Cassandra
HyperTable
In-memory
Disk Based
SimpleDB
Memcached
Redis
Tokyo Cabinet
Dynamo
Voldemort Azure TS
WHAT WORKS NOSQL Tales from the field
When I think of my own native land, In a moment I seem to be there;
But, alas! recollection at hand Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk
• Designer Augmenting RDBMS with a Distributed key Value Store[40 : A good talk by Geir]
• Invitation only designer brand sales • Limited inventory sales – start at 12:00, members have
10 min to grab them. 500K mails every day • Keeps brand value, hidden from search • Interesting load properties • Each item a row in DB-BUY NOW reserves it
– Can't order more • Started out as a Rails app
– shared nothing
• Narrow peaks – half of revenue
Christian Louboutin Effect
• ½ amz for Louboutin • Use Voldemort • Inventory, Shopping Cart,
Checkout • Partition by prod ID • Shared infrastructure – “fog”
not “cloud’ - Joyent! • In-memory inventory • Not afraid of sale anymore!
And SQL DBs are still relevant !
Typical NOSQL Example Bit.ly • Bit.ly URL shortening service, uses MongoDB • User, title, URL, hash, labels[I-5], sort by time • Scale – ~50M users, ~10K concurrent, ~1.25B shortens
per month • Criteria:
– Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low cost of ownership
• Sharded by userid
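"Sharded by userid" means every record for a given user lands on the same shard. A minimal sketch of that routing follows; the shard count, the `shard_for` helper and the use of MD5 with a modulo are assumptions for illustration, not how Bit.ly or MongoDB actually place data.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index with a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process for strings, which would break routing.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# All of one user's shortened URLs route to the same shard.
print(shard_for("alice"), shard_for("alice"), shard_for("bob"))
```

Plain modulo routing reshuffles most keys when the shard count changes; the consistent hashing slides later in the deck address that.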
• New kind of “dictionary” a word repository, GPS for English – context, pronunciations, twitter … developer API
• Characteristics[I-6,Tony Tam’s presentation] – RO-centric, 10,000 reads for every write – Hit a wall with MySQL (4B rows) – MongoDB read was so good that memcached layer was not
required – MongoDB used 4 times MySQL storage
• Another example : – Voldemort – Unified Communications, IP-Phone data stored
keyed off of phone number. Data relatively stable
Large Hadron Collider@CERN • DAS is part of giant data management
enterprise (CMS) – Polyglot Persistence (SQL + NOSQL, Mongo, Couch,
memcache, HDFS, Lustre, Oracle, MySQL, …) • Data Aggregation System [I-1,I-2,I-3,I-4]
– Uses MongoDB – Distributed Model, 2-6 PB data – Combine info. from different metadata sources, query
without knowing their existence, user has domain knowledge – but shouldn’t deal with various formats, interfaces and query semantics
– DAS aggregates, caches and presents data as JSON documents – preserving security & integrity
And SQL DBs are still relevant !
Scaling Twitter •
• Digg – RDBMS places more burden on reads than on writes[I-8] – Looked at NOSQL, selected Cassandra
• Column oriented, so more structure than key-value
• Heard from noSQL Boston[http://twitter.com/#search?q=%23nosqllive] – Baidu: 120 node HyperTable cluster managing
600TB of data – StumbleUpon uses HBase for Analytics – Twitter’s Current Cassandra cluster: 45 nodes
• Adobe is an HBase shop[I-10,I-11,2]
• Adobe SaaS Infrastructure – tagging, content aggregation, search, storage and so forth
• Dynamic schema & huge number of records[I-5]
• 40 million records in 2008 to 1 billion with 50 ms response
• NOSQL not mature in 2008, now good enough
• Prod Analytics:40 nodes, largest has 100 nodes
• BBC is a CouchDB shop[I-13]
• Sweet spot: • Multi-master, multi
datacenter replication
• Interactive Mediums • Old data to CouchDB • Thus free up DB to do
work!
• Cloudkick is a Cassandra shop[I-12] • Cloudkick offers cloud management services • Store metrics data • Linear scalability for write load • Massive write performance
• Memory table & serial commit log • Low operational costs • Data Structure
– Metrics, Rolled-up data, Statuses at time slice : all indexed by timestamp
• Guardian/UK – Runs on Redis[I-14] ! – “Long-term The Guardian is looking
towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.”
– NOSQL can increase performance of relational data by offloading specific data and tasks
And SQL DBs are still relevant ! "The evil that SQL DBs do lives after them; the good is oft interred with their bones...",
NOSQL at Netflix • Netflix is fully in the cloud • Uses NOSQL across the globe • Customer Profiles, watchlog, usage logging (see next
slide) – No multi-record locking
• No DBA ! • Easier Schema Changes • Less complex, Highly Available data store • Joins happen in the applications
http://www.hpts.ws/sessions/nosql-ecosystem.pdf http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf
21 NOSQL Themes
• Web Scale
• Scale Incrementally/continuous growth
• Oddly shaped & exponentially connected
• Structure data as it will be used – i.e. read, query
• Know your queries/updates in advance[96], but you can change them later
• Compute attributes at run time
• Create a few large entities with optional parts
– Normalization creates many small entities
• Define Schemas in models (not in databases)
• Avoid impedance mismatch
• Narrow down & solve your core problem
• Solve the right problem with the right tool
Ref: [I-8]
21 NOSQL Themes
• Existing solutions are clunky[1] (in certain situations)
• Scale automatically, “becoming prohibitively costly (in terms of manpower) to operate” Twitter[I-9]
• Distribution & partitioning are built-in NOSQL
• RDBMS distribution & sharding not fun and is expensive
– Lose most functionality along the way
• Data at the center, Flexible schema, Less joins
• The value of NOSQL is in flexibility as much as it is in “Big Data”
21 NOSQL Themes
• Requirements[3]
– Data will not fit in one node
• And so need data partition/distribution by the system
– Nodes will fail, but data needs to be safe – replication!
– Low latency for real-time use
• Data Locality
– Row based structures will need to read whole row, even for a column
– Column based structures need to scan for each row
• Solution : Column storage with Locality
– Keep data that is read together, don’t read what you don’t care
• For example friends – other data
Ref: 3
ABCs of NOSQL -
ACID, BASE &
CAP
The woods are lovely, dark, and deep, But I have promises to keep,
And miles to go before I sleep, And miles to go before I sleep.
-Frost
CAP Principle
Consistency
Availability Partition
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
C-A No P → Single DB server, no network partition
C-P No A → Block transaction in case of partition failure
A-P No C → Expiration based caching, voting majority
Interesting (& controversial) from NOSQL perspective
Which feature to discard depends on the nature of your system[41]
ABCs of NOSQL
• ACID
o Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS
• BASE[35,39]
o Basically Available, Soft state (Scalable), Eventually Consistent
• CAP[36,39]
o Consistency, Availability & Partitioning
o This C is ~A+C
• i.e. Atomic Consistency[36]
ACID
• Atomicity
o All or nothing
• Consistent
o From one consistent state to another
• e.g. Referential Integrity
o But it is also application dependent
• e.g. min account balance
• Predicates, invariants, …
• Isolation
• Durability
CAP Pragmas
• Preconditions
o The domain is scalable web apps
o Low latency for real time use
o A small sub-set of SQL functionality
o Horizontal scaling
• Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups
• Idempotency to consider
o Updates inc/dec are rarely idempotent
o Order preserving trx are not idempotent either
o MVCC is an answer for this (CouchDB)
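Why inc/dec updates are "rarely idempotent" shows up as soon as a message is delivered twice. The sketch below replays the same request against a toy account. The function names are hypothetical, and the request-id de-duplication at the end is one common workaround, not the MVCC approach the slide mentions.

```python
def set_balance(account: dict, value: int) -> None:
    """Absolute write: applying it twice leaves the same state (idempotent)."""
    account["balance"] = value


def add_to_balance(account: dict, delta: int) -> None:
    """Relative write: every replay changes the state again (not idempotent)."""
    account["balance"] += delta


def add_once(account: dict, applied_ids: set, request_id: str, delta: int) -> bool:
    """Relative write made safe to replay by remembering request ids."""
    if request_id in applied_ids:
        return False  # duplicate delivery, ignore
    account["balance"] += delta
    applied_ids.add(request_id)
    return True


# Replay each kind of update twice, as a retrying client might.
account = {"balance": 100}
set_balance(account, 150)
set_balance(account, 150)
balance_after_set_replay = account["balance"]

account = {"balance": 100}
add_to_balance(account, 50)
add_to_balance(account, 50)
balance_after_add_replay = account["balance"]

account, applied = {"balance": 100}, set()
add_once(account, applied, "req-1", 50)
add_once(account, applied, "req-1", 50)
balance_after_dedup_replay = account["balance"]

print(balance_after_set_replay, balance_after_add_replay, balance_after_dedup_replay)
```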
Consistency
• Strict Consistency
o Any read on Data X will return the most recent write on X[42]
• Sequential Consistency
o Maintains sequential order from multiple processes (No mention of time)
• Linearizability
o Add timestamp from loosely synchronized processes
Consistency
• Write availability, not read availability[44]
• Even load distribution is easier in eventually consistent systems
• Multi-data center support is easier in eventually consistent systems
• Some problems are not solvable with eventually consistent systems
• Code is sometimes simpler to write in strongly consistent systems
CAP Essentials – 1 of 3
• “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
o C-A No P → Single DB server, no network partition
o C-P No A → Block transaction in case of partition failure
o A-P No C → Expiration based caching, voting majority
• Which feature to discard depends on the nature of your system[41]
CAP Essentials – 2 of 3
• Yield vs. Harvest[37]
o Yield → Probability of completing a request
o Harvest → Fraction of data reflected in the response
• Some systems tolerate < 100% harvest (e.g. search, i.e. approximate answers OK); others need 100% harvest (e.g. Trx, i.e. correct behavior = single well defined response)
• For sub-systems that tolerate harvest degradation, CAP makes sense
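Yield and harvest are both simple ratios, and a toy sharded search shows how one can be traded for the other. In this sketch the shard layout, the `search` helper and the convention that an unreachable shard is `None` are all assumptions made for the example.

```python
def yield_of(completed_requests: int, offered_requests: int) -> float:
    """Yield: fraction of requests that completed at all."""
    return completed_requests / offered_requests


def harvest_of(data_reflected: int, data_total: int) -> float:
    """Harvest: fraction of the data that the response reflects."""
    return data_reflected / data_total


def search(shards: list, term: str):
    """Answer from whichever shards are reachable instead of failing the request."""
    reachable = [shard for shard in shards if shard is not None]
    hits = sorted(doc for shard in reachable for doc in shard.get(term, []))
    return hits, harvest_of(len(reachable), len(shards))


# Four index shards, one of them unreachable (None) during a partition.
shards = [{"nosql": ["doc1"]}, None, {"nosql": ["doc7"]}, {"nosql": []}]
hits, harvest = search(shards, "nosql")
search_yield = yield_of(completed_requests=1, offered_requests=1)

print(hits, harvest, search_yield)
```

The request still completes, so yield stays at 100%, while harvest drops because one shard's data is missing from the answer. A transaction could not accept that trade; a search result usually can.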
CAP Essentials – 3 of 3
• Trading Harvest for yield – AP
• Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
o Hence NotOnly SQL not No SQL
o Intelligent homing to tolerate partition failures[44]
o Multi zones in a region (150 miles - 5 ms)
o Twitter tweets in Cassandra & MySQL
o BBC using CouchDB for offloading DBMS
o Polyglot persistence at LHC@CERN
Most important point in the whole
presentation
Eventual Consistency & AMZ
• Distribution Transparency[38]
• Larger distributed systems, network partitions are given
• Consistency Models
o Strong
o Weak
• Has an inconsistency window between update and guaranteed view
o Eventual
• If no new updates, all will see the value, eventually
Eventual Consistency & AMZ
• Guarantee variations[38]
o Read-Your-writes
o Session consistency
o Monotonic Read consistency
• Access will not return previous value
o Monotonic Write consistency
• Serialize write by the same process
• Guarantee order (vector clocks, mvcc)
o Example : Amz Cart merger (let cart add even with partial failure)
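The cart merger example relies on detecting that two versions were written concurrently, which is what vector clocks are for. Below is a minimal sketch under stated assumptions: clocks are plain dicts of per-node counters, the helper names are hypothetical, and the "merge" is a set union of cart items, which is one plausible reading of the slide rather than Amazon's actual code.

```python
def tick(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if version `a` has seen everything version `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def concurrent(a: dict, b: dict) -> bool:
    """Neither version descends from the other: a genuine conflict."""
    return not descends(a, b) and not descends(b, a)


def merge_clocks(a: dict, b: dict) -> dict:
    """Element-wise maximum of two clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


# Two replicas accept writes to the same cart during a partition.
base = tick({}, "replica-a")
cart_a = ({"book"}, tick(base, "replica-a"))
cart_b = ({"towel"}, tick(base, "replica-b"))

if concurrent(cart_a[1], cart_b[1]):
    # Semantic reconciliation: keep every item either side added.
    merged_items = cart_a[0] | cart_b[0]
    merged_clock = tick(merge_clocks(cart_a[1], cart_b[1]), "replica-a")
    print(sorted(merged_items), merged_clock)
```

A union never loses an "add to cart", which matches the slide's goal of letting the add succeed even under partial failure; the price is that deleted items can reappear.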
Eventual Consistency & AMZ - SimpleDB
• SimpleDB strong consistency semantics [49,50]
o Until Feb 2010, SimpleDB only supported eventual consistency i.e. GetAttributes after PutAttributes might not be the same for some time (1 second)
o On Feb 24, AWS added ConsistentRead=True attribute for read
o Read will reflect all writes that got 200OK till that time!
Eventual Consistency & AMZ - SimpleDB
• SimpleDB strong consistency semantics [49,50]
o Also added conditional put/delete
o Put attribute has a specified value (Expected.1.Value=) or (Expected.1.Exists = true/false)
o Same conditional check capability for delete also
o Only on one attribute !
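A conditional put is essentially a compare-and-set on a single attribute. The sketch below imitates those semantics on an in-memory dict. `TinyItemStore`, `ConditionFailed` and the keyword names are hypothetical and are not the SimpleDB API; only the idea of checking one expected attribute before writing comes from the slide.

```python
_UNSET = object()  # sentinel: "no expected value supplied"


class ConditionFailed(Exception):
    """Raised when the expected-attribute check does not hold."""


class TinyItemStore:
    """In-memory stand-in for an item store with single-attribute conditional puts."""

    def __init__(self):
        self._items = {}

    def get(self, item_name: str) -> dict:
        return dict(self._items.get(item_name, {}))

    def put(self, item_name, attributes, expected_name=None,
            expected_value=_UNSET, expected_exists=None):
        current = self._items.get(item_name, {})
        if expected_name is not None:
            if expected_exists is not None:
                # Condition on presence or absence of the attribute.
                if (expected_name in current) != expected_exists:
                    raise ConditionFailed(expected_name)
            elif expected_value is not _UNSET:
                # Condition on the attribute holding a specific value.
                if current.get(expected_name, _UNSET) != expected_value:
                    raise ConditionFailed(expected_name)
        self._items.setdefault(item_name, {}).update(attributes)


store = TinyItemStore()
# Create only if no version attribute exists yet.
store.put("cart-1", {"version": "1", "items": "book"},
          expected_name="version", expected_exists=False)
# Optimistic update: succeeds because the version is still "1".
store.put("cart-1", {"version": "2", "items": "book,towel"},
          expected_name="version", expected_value="1")
# A stale writer that still believes the version is "1" is rejected.
try:
    store.put("cart-1", {"version": "2", "items": "book,pen"},
              expected_name="version", expected_value="1")
    stale_write_rejected = False
except ConditionFailed:
    stale_write_rejected = True
```

Checking a version attribute this way gives optimistic concurrency control without locks, which is the usual reason to want a conditional put.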
Eventual Consistency & AMZ – S3 • S3 is an eventual consistency system
o Versioning
o “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS”
o Repair Lost redundancy, repair bit-rot
o Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%)
• Approx 1/3rd less
o CloudFront for caching
!SQL ? • “We conclude that the current RDBMS code lines, while
attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43]
• “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap”
• “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”
Further deliberation • Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberation
NOSQL Internals & Algorithmics
Caveats
• A representative subset of the mechanics and mechanisms used in the NOSQL world
• Being refined & newer ones are being tried
• At a system level – to show how the techniques play a part to deliver a capability
• The NOSQL Papers and other references for further deliberation
• Even if we don’t cover fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …
NOSQL Mechanics
• Horizontal Scalability
– Gossip (Cluster membership)
– Failure Detection
– Consistent Hashing
– Replication Techniques
• Hinted Handoff
• Merkle Trees
– Sharding MongoDB
– Regions in HBase
• Performance
– SSTables/memtables
– LSM w/Bloom Filter
• Integrity/Version reconciliation
– Timestamps
– Vector Clocks
– MVCC
– Semantic vs. syntactic reconciliation
Consistent Hashing
• Origin: web caching, “to decrease ‘hot spots’”
• Three goals[87]
– Smooth evolution
• When a new machine joins, minimum rebalance work and impact
– Spread
• Objects assigned to a min number of nodes
– Load
• # of distinct objects assigned to a node is small
Consistent Hashing
• Hash Keyspace/Token is divided into partitions/ranges
• Cassandra – choice
– OrderPreserving partitioner – key = token (for range queries)
– Also saw a CollatingOrderPreservingPartitioner
• Partitions assigned to nodes that are logically arranged in a circle topology
• Amz (dynamo) – assign sets of (random) multiple points to different machines depending on load
• Cassandra – monitor load & distribute
• Specific join & leave protocols
• Replication – next 3 consecutive
• Cassandra – Rack-aware, Datacenter-aware
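The ring described above can be sketched in a few lines. This is an illustrative sketch, not Dynamo's or Cassandra's implementation: the `Ring` class, MD5 as the token function, eight points per node and the `preference_list` name are all assumptions. It does follow the slide in placing multiple points per machine on a circle and replicating to the next three distinct nodes.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Position of a key (or a node's point) on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, points_per_node=8, replicas=3):
        self.replicas = min(replicas, len(set(nodes)))
        # Each node owns several points so load spreads more evenly.
        self._points = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._tokens = [t for t, _ in self._points]

    def preference_list(self, key: str) -> list:
        """Walk clockwise from the key's token and collect distinct nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        owners = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == self.replicas:
                    break
        return owners


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("user:42"))
```

The "smooth evolution" goal shows up when a node joins: only keys that fall just before one of the new node's points change their first owner, and they all move to the new node; everything else stays put.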
Consistent Hashing - Hinted-handoff
• What happens when a node is not available ?
– May be under load
– May be network partition
• Sloppy Quorum & Hinted-handoff
• R/W performed on the 1st n healthy nodes
• Replica sent to a host node with hint in metadata & then transferred when the actual node is up
• Burdens neighboring nodes
• Cassandra 0.6.2 default is disabled (I think)
Consistent Hashing - Replication
• What happens when a new node joins ?
– It gets one or more partitions
– Dynamo : Copy the whole partition
– Cassandra : Replicate keyset
– Cassandra : working on a bit torrent type protocol to copy from replicas
Anti-entropy
• Merge and reconciliation operations
– Operate on two states and return a new state[86]
• Merkle Trees
– Dynamo use of Merkle trees to detect inconsistencies between replicas
– AntiEntropy in Cassandra exchanges Merkle trees and if they disagree, range repair via compaction[91,92]
– Cassandra uses the Scuttlebutt Reconciliation[86]
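The appeal of Merkle trees for anti-entropy is that two replicas can compare one root hash and only descend where subtrees disagree. A minimal sketch follows, assuming both replicas split the keyspace into the same power-of-two number of ranges and serialize each range to bytes. `build_tree`, `differing_ranges` and the choice of SHA-256 are illustrative, not taken from Dynamo or Cassandra.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(ranges: list) -> list:
    """Return tree levels: levels[0] holds leaf hashes, levels[-1] holds the root."""
    n = len(ranges)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranges")
    level = [_h(blob) for blob in ranges]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_ranges(a: list, b: list) -> list:
    """Indices of leaf ranges whose contents differ between two same-shape trees."""
    if a[-1] == b[-1]:
        return []  # roots match, replicas agree on everything
    suspects = [0]
    for depth in range(len(a) - 2, -1, -1):
        next_suspects = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if a[depth][child] != b[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


replica_1 = [f"range-{i}:v1".encode() for i in range(8)]
replica_2 = list(replica_1)
replica_2[5] = b"range-5:v2"  # one range has drifted

print(differing_ranges(build_tree(replica_1), build_tree(replica_2)))
```

Only the ranges reported as different need to be streamed between replicas, which is the "range repair" step the slide refers to.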
Gossip
• Membership & Failure detection
• Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies ![90]
• Also used for state propagation
– Used in Dynamo/Cassandra
Gossip
• Cassandra exchanges heartbeat state, application state and so forth
• Every second, random live node, random unreachable node and exchanges key-value structures
• Some nodes play the part of seeds
• Seed/initial contact points in static conf file storage.conf file
• Could also come from a configuration service like zookeeper
• To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added
Membership & Failure detection
• Consensus & Atomic Broadcast - impossible to solve in an asynchronous distributed system[88,89]
– Cannot differentiate between a slow system and a crashed system
• Completeness
– Every system that crashed will be eventually detected
• Correctness
– A correct process is never suspected
• In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead !
Φ Accrual Failure Detector
• Not a Boolean value but a probabilistic number that “accrues” over an exponential scale
• Captures the degree of confidence that a corresponding monitored process has crashed[94]
– Suspicion Level
– Φ = 1 -> prob(error) 10%
– Φ = 2 -> prob(error) 1%
– Φ = 3 -> prob(error) 0.1%
• If process is dead,
– Φ is monotonically increasing & Φ → ∞ as t → ∞
• If process is alive and kicking, Φ = 0
• Account for lost messages, network latency and actual crash of system/process
• Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
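The suspicion level can be computed as Φ = -log10(P_later(t)), where P_later(t) is the probability that a heartbeat arrives more than t after the previous one. The sketch below assumes exponentially distributed inter-arrival times, a simplification chosen for brevity, which gives Φ = t / (mean · ln 10). The class and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Suspicion level from heartbeat inter-arrival times (exponential model)."""

    def __init__(self, window: int = 100):
        self._intervals = deque(maxlen=window)  # recent inter-arrival times
        self._last = None

    def heartbeat(self, now: float) -> None:
        if self._last is not None:
            self._intervals.append(now - self._last)
        self._last = now

    def phi(self, now: float) -> float:
        if not self._intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self._intervals) / len(self._intervals)
        elapsed = now - self._last
        # P_later(t) = exp(-t / mean)  =>  phi = -log10(P_later) = t / (mean * ln 10)
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for second in range(11):          # heartbeats at t = 0, 1, ..., 10
    detector.heartbeat(float(second))

for silence in (0.0, 2.0, 5.0, 10.0):
    print(f"silent for {silence:>4}s -> phi = {detector.phi(10.0 + silence):.2f}")
```

An application then picks its own threshold: a low Φ reacts quickly but suspects healthy nodes more often, a high Φ is slower but makes fewer mistakes, in line with the 10% / 1% / 0.1% scale on the slide.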
Write/Read Mechanisms • Read & Write to a random node (StorageProxy)
• Proxy coordinates the read and write strategy (R/W = any, quorum et al)
• Memtables/SSTables from big table • Bloom Filter/Index • LSM Trees
BF
Index
BF
Index
BF
Index
Commit Logs
MemTable
SSTable • Immutable • Compaction • Maintain Index & Bloom Filter
Node
Node
Flushing
Read
Write
Memory
Disk
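The write and read paths in the diagram above can be sketched with plain Python containers. This is a toy model under stated assumptions: `TinyLSM` is a hypothetical name, lists and dicts stand in for files on disk, and flushing after a fixed number of keys is an arbitrary trigger. Real systems add per-SSTable indexes and Bloom filters to the read path.

```python
class TinyLSM:
    """Toy commit log + memtable + SSTables, newest data wins on read."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # stand-in for an append-only file on disk
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # oldest first; treated as immutable once written
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted table and start a fresh one."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            if key in table:
                return table[key]
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so later tables overwrite
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


db = TinyLSM(memtable_limit=2)
db.write("a", 1)
db.write("b", 2)   # second key triggers a flush
db.write("a", 3)   # newer value for "a" lives in the memtable
print(db.read("a"), db.read("b"), len(db.sstables))
```

Writes never touch existing SSTables, which is why write throughput is high; the cost is paid at read time and during compaction.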
HBase – WAL, Memstore, HDFS File system
How… does HBase work again?
http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/
Bloom Filter
• The BloomFilter answers the question
• “Might there be data for this key in this SSTable?” [Ref: Cassandra/HBase mailer]
– “Maybe” or
– “Definitely not”
– When the BloomFilter says “maybe” we have to go to disk to check out the content of the SSTable
• Depends on implementation
– Redone in Cassandra
– HBase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
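The "maybe" versus "definitely not" behaviour follows directly from how a Bloom filter sets and tests bits. The sketch below is illustrative only: the bit-array size, the number of hash functions and the use of salted SHA-256 are assumptions, not what Cassandra or HBase ship.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means maybe, go check the SSTable."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("row-1", "row-2", "row-3"):
    sstable_filter.add(row_key)

for probe in ("row-2", "row-999"):
    verdict = "maybe, read from disk" if sstable_filter.might_contain(probe) else "definitely not"
    print(probe, "->", verdict)
```

A key that was added always answers "maybe" because its bits are set and never cleared. A key that was never added usually answers "definitely not", but can collide with set bits and cost an unnecessary disk read; it can never cause a missed row.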
Was it a vision, or a waking dream? Fled is that music:—do I wake or sleep?
-Keats, Ode to a Nightingale
• http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php
• http://www.crn.com/news/data-center/232200061/efficiency-or-bust-data-centers-drive-for-low-power-solutions-prompts-channel-growth.htm
• http://www.quantumforest.com/2011/11/do-we-need-to-deal-with-big-data-in-r/
• http://www.forbes.com/special-report/2011/migration.html
• http://www.mercurynews.com/bay-area-news/ci_19368103
• http://www.businessinsider.com/apple-new-data-center-north-carolina-created-50-jobs-2011-11