#iranic
Me
• I work for InterSystems, which:
– Drives the http://globalsdb.org NoSQL project
– Has 20+ years of NoSQL production deployments
– Has 20+ years of Big Data production deployments
– Built a ~250 million Euro business on the above
• Email: [email protected]
• Twitter: #iranic
Big Data is …
• Important data, in varying formats and volumes, generated across every area that affects your business and generally not centrally correlated or managed.
• Examples include:
– Word files, PowerPoint, PDFs
– Emails, instant messaging, texts
– Blogs and social media
– Automated data from machine activities
– Stream data from financial stock markets
Some Big Data Numbers
• Source: McKinsey Global Institute
• 5 billion mobile phones in use in 2010
• 30 billion pieces of info shared on Facebook each month
• 40% projected growth in global data generated
• 235 terabytes collected by the US Library of Congress as of 04/11
– 15 out of 17 sectors in the US have more data stored per company than this.
Some Big Data Numbers …
• Source: McKinsey Global Institute
• $300 billion in potential value in the US healthcare system
• €250 billion in Europe’s public sector administration
• $600 billion in annual consumer surplus from using location data
• 60% potential increase in retail operating margins
• 140,000 – 190,000 analytical talent positions in the US
• 1.5 million data-savvy managers needed in the US
Case Study: Credit Suisse
• Key Challenges:
– Revamp order routing architecture
– Revamp order management architecture
– Serve current demand and scale to new levels
– Address downtime challenges
Case Study: Credit Suisse …
• Big Data in the form of volumes of transactions
• Leveraged Caché’s:
– In-memory architecture for performance
– On-disk resiliency for availability
– Distributed architecture for data coherency
• Can easily process 1,000,000,000 transactions
– During business hours
Case Study: European Space Agency (ESA)
• Key Challenges
– Make the largest, most precise 3-D map of our Galaxy
– Monitor 1,000,000,000 stars over 5 years, precisely charting position, movement, and brightness
– Along the way, discover hundreds of thousands of new celestial objects
Case Study: ESA Continued …
• Challenge Calculation:
• Capture data for 1 Billion Celestial Objects
1,000,000,000 objects
× 100 observations per object
× 600 bytes per observation
= 60,000,000,000,000 bytes (60 TB)
Solution: Caché/XEP, delivering 100,000+ sustained inserts per second per server, stored as real objects with SQL access
• http://www.intersystems.com/cache/whitepapers/pdf/Charting_the_Galaxy.pdf
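The challenge calculation on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes), one insert per observation, and a single server running at exactly the 100,000 inserts per second floor quoted above. The variable names are illustrative and do not come from the white paper.

```python
# Figures taken from the slide.
OBJECTS = 1_000_000_000          # celestial objects to capture
OBS_PER_OBJECT = 100             # observations per object
BYTES_PER_OBS = 600              # bytes per observation
INSERTS_PER_SEC = 100_000        # quoted sustained rate per server (lower bound)

total_observations = OBJECTS * OBS_PER_OBJECT          # 100 billion observations
total_bytes = total_observations * BYTES_PER_OBS       # 60,000,000,000,000 bytes
total_tb = total_bytes / 10**12                        # decimal terabytes

# Assumption: each observation is one insert, on one server, at the quoted floor.
seconds_one_server = total_observations / INSERTS_PER_SEC
days_one_server = seconds_one_server / 86_400

print(f"{total_bytes:,} bytes = {total_tb:.0f} TB")
print(f"{seconds_one_server:,.0f} s on one server, about {days_one_server:.1f} days")
```

Under those assumptions the raw insert time is small next to the 5-year observation window, which suggests the sustained insert rate is sized for bursts of incoming data rather than for the total volume.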
Enabling Technology
• Focus on Caché
• A quick look at the architecture
[Diagram: Paradigm (Relational, Object, Key-Value, Graph, Array, ?); Global; Language (COS, Java, C++, ?)]
Enabling Technology …
• Java + C database kernel run in same process
[Diagram labels: Java SE; Java I/O, NIO; Java Native Interface; Collections Framework; Concurrency]
Enabling Technology …
• ECP, Distributed Computing
[Diagram: one Data Server with App Servers 1, 2, 3, 4, … ?]
Enabling Technology …
• Multiple, simultaneous data-to-disk writers
[Diagram labels: Caché Buffer; Journalers; Hard Disk; three parallel groups of Global, Journal, Disk I/O]
Who is this Guy?
• Edgar Frank “Ted” Codd
• Known for his 12 rules (numbered 0 to 12) for relational data systems
NoSQL … Breaking the Rules
• Rule 1: The information rule
– All information is represented in one and only one way, namely by values in column positions within rows of tables.
• Rule 12: The nonsubversion rule
– If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, i.e. to bypass relational security or integrity constraints.
Why NoSQL?
• No to ACID transactions
• No to the impedance mismatch with SQL
• Dealing with Big Data and Web Scale
• High prices from RDBMS vendors
• Use commodity hardware
• Flexible data models
• It’s a cool movement ….
Is NoSQL a new Concept?
• No
• Remember MUMPS?
– SET ^Car("Door","Color")="BLUE"
• Remember Multi-value/PICK
– MATWRITE array.variable ON file.variable,id. ….
• Ever heard of the NoSQL RDB?
– Carlo Strozzi
– http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
CAP Theorem
• Consistency
– Every node sees the same data: an update takes effect fully or not at all.
• Availability
– Every request to a working node receives a response.
• Partition Tolerance
– Data is managed on multiple nodes, and the system keeps processing data when the network between those nodes fails.
• Significant because a distributed system can guarantee only 2 of these 3 at once …
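The "2 of 3" trade-off can be made concrete with a toy model. The sketch below is an illustration only, not how any real database implements replication: `Replica`, `Cluster`, and the `"CP"` / `"AP"` mode names are hypothetical. It models two replicas and a flag for a broken link. In CP mode the cluster refuses writes during a partition (consistency kept, availability lost). In AP mode it accepts them locally (availability kept, replicas diverge).

```python
class Replica:
    """One node holding a copy of the data (hypothetical toy class)."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas; `partitioned` models a broken link between them."""
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        """Return True if the write was accepted, False if it was refused."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False               # refuse: stay consistent, lose availability
            replica.data[key] = value      # AP: accept locally, replicas now differ
            return True
        replica.data[key] = value          # healthy link: update both copies
        other.data[key] = value
        return True

    def consistent(self):
        return self.a.data == self.b.data


cp = Cluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
print(cp.write(cp.a, "x", 2), cp.consistent())   # write refused, replicas still agree

ap = Cluster("AP")
ap.partitioned = True
print(ap.write(ap.a, "x", 2), ap.consistent())   # write accepted, replicas disagree
```

Real systems sit between these extremes (quorums, later reconciliation), which is what the linked articles debate.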
CAP Theorem …
• Arguments and links
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
– http://ksat.me/a-plain-english-introduction-to-cap-theorem/
– http://voltdb.com/company/blog/clarifications-cap-theorem-and-data-related-errors
Distributed computing
• Fallacies (Peter Deutsch)
– The network is reliable
– Latency is zero
– Bandwidth is infinite
– The network is secure
– Topology doesn’t change
– There is one administrator
– Transport cost is zero
– The network is homogeneous
• Remember JINI? (See Apache River project)
NoSQL: Which project?
• http://nosql-database.org/ lists 122 today.
• Depends on your model selection.
• Most likely, choose a well-known project.
• Don’t forget about shared risk!
NoSQL: Querying
• Some solutions have no querying
• When available, query languages differ
• Lack of general ad hoc querying – “no” SQL
• Have you heard of UnQL?
– http://www.unqlspec.org/display/UnQL/Home
• NOTE: Toad for Cloud
NoSQL: How to Succeed?
• Know your application
• Don’t forget the past lessons
• Consider a hybrid approach
• Fight the desire to Roll-Your-Own-DB
• Start small but significant
NoSQL: Hybrid Approach 1
• Two Systems
• NoSQL System
• SQL/RDBMS
[Diagram: NoSQL and SQL/RDBMS joined by a Data Mapper / Translator]
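One way to picture the data mapper / translator between the two systems is the minimal Python sketch below. Everything in it is an assumption for illustration: a plain `dict` stands in for the NoSQL store, SQLite (via the standard `sqlite3` module) stands in for the SQL/RDBMS, and the `car` table, its columns, and the `to_row` / `sync` helpers are hypothetical names.

```python
import sqlite3


def to_row(key, doc):
    """Translate one key-value document into a fixed-shape relational row.

    Fields the table does not know about are simply not copied.
    """
    return (key, doc.get("make"), doc.get("color"))


def sync(kv_store, conn):
    """Copy every document from the key-value store into the SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS car "
        "(id TEXT PRIMARY KEY, make TEXT, color TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO car VALUES (?, ?, ?)",
        [to_row(key, doc) for key, doc in kv_store.items()],
    )
    conn.commit()


# Stand-in NoSQL side: schemaless documents addressed by key.
kv_store = {
    "car:1": {"make": "Fiat", "color": "BLUE", "extras": ["sunroof"]},
    "car:2": {"make": "Saab"},   # no color recorded; maps to NULL
}

# Stand-in SQL side: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
sync(kv_store, conn)
print(conn.execute("SELECT id, make, color FROM car ORDER BY id").fetchall())
```

The sketch also shows the cost of this approach: the mapper has to decide what happens to fields such as `extras` that have no relational column, and the two copies must be kept in step.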
NoSQL: Hybrid Approach 2
• One system does both NoSQL and SQL
[Diagram: Data at the center, exposed as Relational, Key-Value, Document, Column, Graph, ?]
GlobalsDB.org Project
• Name comes from the underlying data structure
– Multi-dimensional array
– Basis for commercial Caché data system
• Free for development and production deployment
• NoSQL DB with Java and Node.js APIs
• Code base is same as commercial product
• APIs are open sourced or being open sourced
• Database kernel is not open source
A “Global” Definition
• A Global is a persistent, sparse, multi-dimensional array that consists of one or more storage elements or "nodes". Each node is identified by a node reference (which is, essentially, its logical address).
– simple == "some data"
– complex["subscript-1", "subscript-2"] == "some data"
• Example
– product[item, type, os, processor] == quantity
– product["computer", "laptop", "Mac", "i7"] == 3
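The definition above can be mimicked in a few lines of Python to show what "sparse" and "node reference" mean in practice. This is a hypothetical in-memory stand-in, not the GlobalsDB Java or Node.js API, and it is not persistent: the `SparseGlobal` class and its `set`, `get`, and `children` methods are names invented for this sketch.

```python
class SparseGlobal:
    """Toy in-memory model of a global: a sparse, multi-dimensional array.

    Each node is addressed by a tuple of subscripts (its node reference).
    Only nodes that were explicitly set take up any storage.
    """

    def __init__(self, name):
        self.name = name
        self._nodes = {}   # subscript tuple -> stored value

    def set(self, value, *subscripts):
        self._nodes[subscripts] = value

    def get(self, *subscripts, default=None):
        return self._nodes.get(subscripts, default)

    def children(self, *prefix):
        """Sorted distinct subscripts found one level below `prefix`."""
        depth = len(prefix)
        return sorted(
            {ref[depth] for ref in self._nodes
             if len(ref) > depth and ref[:depth] == prefix}
        )


product = SparseGlobal("product")
product.set(3, "computer", "laptop", "Mac", "i7")   # the slide's example node
product.set(5, "computer", "laptop", "PC", "i5")    # extra node, invented here

print(product.get("computer", "laptop", "Mac", "i7"))
print(product.children("computer", "laptop"))
print(product.get("computer", "desktop"))           # never set, so nothing stored
```

Walking the subscripts level by level, as `children` does, is what lets one structure behave like a key-value store, a document tree, or an array depending on how the subscripts are chosen.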
GlobalsDB Architecture
• Current Architecture
[Diagram: Paradigm (Array, Document, Key-Value, ?); Global; Language (JavaScript, Java, ?)]
GlobalsDB, NoSQL, Big Data
• http://nosql.mypopescu.com/
• http://highscalability.com/
• http://nosqltapes.com/
• http://globalsdb.wordpress.com