Building infrastructure for Big Data
Building the Infrastructure
for Big Data
@ The Fifth Elephant
July 27th, 2012
Prashant Kumar, Founder, PromptCloud
© PromptCloud 2012, All rights reserved
Agenda
About
Context
Machines, Installation & Cloud Automation
Building blocks of a system
Sample application sketch
Components left out for lack of time
About PromptCloud
How??
• Large-scale data crawl and extraction
• Hosted indexing
• Custom data analytics
• Working round the clock
We provide data feeds, and feed ourselves on data, since 2009
About Me
• PromptCloud’s Founder
• Yahoo!, 2007-2008
• IIT-Kanpur CS, 2007
Generic Big Data Systems
• Multiple nodes (incoherent set of coherent ones)
• Compute layer: interdependent processes
• Data storage layer & multiple middleware
• Tools for installation, monitoring & scheduling
* Meta: source control, code reviews, continuous integration
Installation
Create an image and install
• Easy to install
• No maintenance cost
• 1 image for 1 purpose
• Modifications? Difficult to save them back into the image
• apt, yum and etckeeper-like systems help, but are difficult to scale
Solutions??
Virtual Machines
AWS, Xen, KVM,…
VirtualBox Installation
Vagrant
Init
Shared directory
Port Forwarding
Up, ssh
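The Vagrant keywords on this slide (init, shared directory, port forwarding, up, ssh) map onto a small Vagrantfile. The following is a minimal sketch, assuming a newer Vagrant release that accepts the version "2" configuration syntax (releases current in 2012 used a different block form). The box name, ports, and directories are placeholders, not values from the talk.

```ruby
# Vagrantfile (normally generated by `vagrant init`); all values are placeholders.
Vagrant.configure("2") do |config|
  # Base box to boot in VirtualBox (hypothetical choice).
  config.vm.box = "hashicorp/precise64"

  # Port forwarding: reach the guest's port 80 through the host's port 8080.
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Shared directory: mount ./data from the host at /srv/data in the guest.
  config.vm.synced_folder "./data", "/srv/data"
end
```

With this file in place, `vagrant up` boots the machine and `vagrant ssh` opens a shell on it.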
Code the Installation using Chef
Give the recipe: code what’s to be done
I’m Solo (Chef Solo)
Roles, Recipes
Templates, Run List
Knife
Chef Server
Data Files
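"Coding the installation" means describing the desired state of a machine as resources in a recipe. The following is a minimal sketch of such a recipe, assuming a hypothetical cookbook named `bigdata` that ships a `monitrc.erb` template. The package, paths, and names are illustrative and not taken from the talk.

```ruby
# cookbooks/bigdata/recipes/default.rb (hypothetical cookbook)

# Install the package through the platform's package manager.
package "monit"

# Render a config file from an ERB template shipped with the cookbook,
# and restart the service whenever the rendered file changes.
template "/etc/monit/monitrc" do
  source "monitrc.erb"
  owner "root"
  group "root"
  mode "0600"
  notifies :restart, "service[monit]"
end

# Make sure the service is enabled at boot and running now.
service "monit" do
  action [:enable, :start]
end
```

With Chef Solo, a run list such as `{"run_list": ["recipe[bigdata]"]}` in a JSON file passed via `chef-solo -j` selects which recipes are applied to the node.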
To keep processes running,
Option 1: Install God to monitor processes and keep them running
Courtesy: BIT Mesra
Option 2 (for atheists): Install Monit
God’s Snippet
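God.watch do |w|
  # watcher_name, start_command and stop_command are assumed to be defined earlier in the config
  w.name = watcher_name
  w.start = start_command
  # w.restart = restart_command
  w.stop = stop_command
  w.behavior(:clean_pid_file)  # clean up a stale PID file before starting
  # w.group = "some group"
  w.log = "/tmp/god_monitoring_#{watcher_name}.log"
  w.keepalive                  # start the process again whenever it is not running
  w.stop_timeout = 10.seconds
end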
Job Scheduling
Resque, Beanstalk, Gearman, Celery, + cron and queues
Things to remember while making choices:
• Persistence
• Priorities
• Tags
• Option for retry
• Ability to inspect the queue
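To make the selection criteria concrete, here is a toy in-memory queue that shows priorities, tags, retries, and queue inspection. It is only a sketch: `TinyQueue` and its methods are hypothetical names, it is not how Resque or any of the listed tools work internally, and it deliberately omits persistence, which a real system would need.

```ruby
# Toy job queue; a real system would persist jobs to disk or a database.
Job = Struct.new(:name, :priority, :tags, :attempts)

class TinyQueue
  def initialize(max_attempts: 3)
    @jobs = []
    @max_attempts = max_attempts
  end

  # Lower number means higher priority.
  def enqueue(name, priority: 5, tags: [])
    @jobs << Job.new(name, priority, tags, 0)
  end

  # Inspect pending jobs without removing them, optionally filtered by tag.
  def pending(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.sort_by(&:priority).map(&:name)
  end

  # Run the highest-priority job; put it back on failure until attempts run out.
  def work_one
    idx = @jobs.each_index.min_by { |i| @jobs[i].priority }
    return nil if idx.nil?
    job = @jobs.delete_at(idx)
    job.attempts += 1
    begin
      yield job
      :done
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retrying
      else
        :failed
      end
    end
  end
end
```

A worker loop would call `work_one` with a block that performs the job, and an operator could call `pending(tag: "web")` to see what is still waiting.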
Data Storage Layer
• For large systems, maintenance cost is a primary overhead
• Replication & Availability
• Consistency guarantees
• Full-text search
SQL/NoSQL, key/value, document-based, graph databases
Voldemort
• Distributed key/value store
• Great performance
• Easy to add/remove nodes
• Alternatives: MongoDB, Riak, HBase, Cassandra
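One reason Dynamo-style key/value stores can add or remove nodes cheaply is consistent hashing: only the keys owned by the affected node move. The following is a generic sketch of that technique, not Voldemort's actual implementation. `HashRing`, the node names, and the number of points per node are all assumptions made for illustration.

```ruby
require "digest"

# Toy consistent-hash ring: each node owns several points on a ring,
# and a key belongs to the first node point at or after the key's hash.
class HashRing
  def initialize(nodes, points_per_node: 64)
    @points_per_node = points_per_node
    @ring = {}     # ring position => node name
    @sorted = []
    nodes.each { |n| add(n) }
  end

  def add(node)
    @points_per_node.times { |i| @ring[position("#{node}-#{i}")] = node }
    @sorted = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_, owner| owner == node }
    @sorted = @ring.keys.sort
  end

  def node_for(key)
    pos = position(key)
    # First ring point >= pos, wrapping around to the start of the ring.
    point = @sorted.bsearch { |p| p >= pos } || @sorted.first
    @ring[point]
  end

  private

  # Map a label to a 32-bit position on the ring.
  def position(label)
    Digest::MD5.hexdigest(label)[0, 8].to_i(16)
  end
end
```

Removing a node from the ring reassigns only the keys that node used to own; every other key keeps its owner.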
Courtesy- harrypotter.wikia.com
Not me!!!!!!!!
Messaging Layer
• RabbitMQ: widely used in high-load production systems
• Implements AMQP
• Robust exchange server
• Multiple kinds of exchanges: direct, topic, fanout
• Options for HA with Pacemaker/DRBD
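The three exchange types differ only in how a message's routing key is compared with each queue's binding key. The sketch below models that decision in plain Ruby without a broker. `route` and `topic_match?` are hypothetical helper names, and the queue names in the usage note are made up.

```ruby
# Does an AMQP-style topic pattern match a routing key?
# "*" matches exactly one dot-separated word, "#" matches zero or more.
def topic_match?(pattern, key)
  match_words(pattern.split("."), key.split("."))
end

def match_words(pat, words)
  return words.empty? if pat.empty?
  head, *rest = pat
  case head
  when "#"
    # Let "#" absorb 0..n words and try to match the remainder each time.
    (0..words.length).any? { |n| match_words(rest, words.drop(n)) }
  when "*"
    !words.empty? && match_words(rest, words.drop(1))
  else
    words.first == head && match_words(rest, words.drop(1))
  end
end

# Which bound queues receive a message, per exchange type?
# bindings is an array of [queue_name, binding_key] pairs.
def route(type, bindings, routing_key)
  matched = bindings.select do |_queue, binding_key|
    case type
    when :fanout then true
    when :direct then binding_key == routing_key
    when :topic  then topic_match?(binding_key, routing_key)
    else raise ArgumentError, "unknown exchange type: #{type}"
    end
  end
  matched.map(&:first).uniq
end
```

For example, with bindings `[["all_logs", "#"], ["errors", "*.error"]]`, a topic exchange delivers a message keyed `crawl.error` to both queues, while a direct exchange delivers it to neither.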
Demo Sketch
1. We’ll generate random sentences based on a Markov chain
2. Store these in Voldemort
3. Enqueue corresponding jobs in RabbitMQ
4. Another set of workers will process these sentences
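The four steps of the demo can be sketched end to end with in-memory stand-ins. This is not the code shown at the talk: a Hash plays the role of Voldemort, an Array plays the role of the RabbitMQ queue, and the corpus, key format, and "processing" step (a word count) are invented for illustration.

```ruby
# Build a first-order Markov chain: word => list of words seen after it.
def build_chain(text)
  chain = Hash.new { |h, k| h[k] = [] }
  text.split.each_cons(2) { |a, b| chain[a] << b }
  chain
end

# Walk the chain from a start word until a dead end or the length limit.
def generate(chain, start, max_words: 8, rng: Random.new)
  words = [start]
  while words.length < max_words
    followers = chain.fetch(words.last, [])
    break if followers.empty?
    words << followers.sample(random: rng)
  end
  words.join(" ")
end

corpus = "big data needs big machines and big data needs careful monitoring"
chain  = build_chain(corpus)
store  = {}   # stand-in for the key/value store
jobs   = []   # stand-in for the message queue

# Producer: generate sentences, store them, enqueue their keys.
3.times do |i|
  key = "sentence:#{i}"
  store[key] = generate(chain, "big")
  jobs.push(key)
end

# Worker: pull keys off the queue and process the stored sentences.
results = {}
until jobs.empty?
  key = jobs.shift
  results[key] = store[key].split.length
end
```

In the real setup the producer and the workers would be separate processes, which is why the job carries only the key and the sentence itself lives in the store.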
Sensu & Graphite
• Monitoring router
• “check scripts” on nodes
• “handler scripts” on servers
• Output can be sent to PagerDuty, Graphite, Twitter or IRC
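A check script is just a program that prints one line and exits with a status following the Nagios plugin convention (0 for OK, 1 for warning, 2 for critical). The sketch below shows the core of such a check, plus a formatter for Graphite's plaintext protocol. The function names, thresholds, and metric path are hypothetical.

```ruby
# Exit statuses expected from a check (Nagios plugin convention).
STATUS = { ok: 0, warning: 1, critical: 2 }.freeze

# Classify a measured value against thresholds (threshold values are hypothetical).
# Returns the message to print and the exit status to use.
def check_disk(used_percent, warn: 80, crit: 90)
  level =
    if used_percent >= crit then :critical
    elsif used_percent >= warn then :warning
    else :ok
    end
  ["CheckDisk #{level.to_s.upcase}: #{used_percent}% used", STATUS[level]]
end

# Format one data point in Graphite's plaintext protocol: "path value timestamp".
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}"
end
```

A real check script would print the message and call `exit` with the returned status; a handler on the server side could then forward the result, for example as a Graphite line.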
Distributed Log Collection
Flume
• Allows multiple topologies
• Agent
• Collector
• Sink
Scribe, Flume, Splunk
Feel free to reach out
Big Data made Small
Appreciate your time
Thanks to Arpan Jha for her help with the slides