The Next-Generation Enterprise-Class Architecture
Transcript of The Next-Generation Enterprise-Class Architecture
Massimo Brignoli, Principal Solution Architect
[email protected] · @massimobrignoli
Agenda
• The birth of Data Lakes
• MongoDB overview
• A proposed EDM architecture
• Case studies & scenarios
• Data Lake lessons learned
How Much Data?
• One thing companies do not lack: data
– Sensor streams
– Social-media sentiment
– Server logs
– Mobile apps
• Analysts estimate data volumes are growing 40% per year, 90% of it unstructured.
• Traditional technologies (some designed 40 years ago) are no longer sufficient.
The Promise of "Big Data"
• Discovering insight by collecting and analyzing data carries the promise of:
– A competitive advantage
– Cost savings
• A widespread use of Big Data technology is the "Single View": aggregating everything known about a customer to improve engagement and revenue.
• The traditional EDW creaks under the load, overwhelmed by the volume and variety of data (and by its high cost).
The Birth of Data Lakes
• Many companies started looking at an architecture called the Data Lake:
– A platform for managing data flexibly
– Aggregating cross-silo data in a single place
– Enabling exploration of all the data
• The most popular platform at the moment is Hadoop:
– Scales horizontally on commodity hardware
– Supports varied, read-optimized data schemas
– Includes data-processing layers in SQL and common languages
– Big-name references (Yahoo and Google above all)
Why Hadoop?
• The Hadoop Distributed File System is designed to scale for large batch operations
• It provides a write-once, read-many, append-only model
• Optimized for long scans over TBs or PBs of data
• This ability to handle multi-structured data is used for:
– Customer segmentation for marketing campaigns and recommendations
– Predictive analytics
– Risk models
Is It Right for Everything?
• Data Lakes are designed to feed Hadoop's output to online applications. These applications have requirements such as:
– Millisecond response latency
– Random access to an indexed subset of data
– Support for expressive queries and data aggregations
– Real-time updates of frequently changing values
Is Hadoop the Answer to Everything?
• In our now data-driven world, milliseconds matter.
– IBM researchers claim that 60% of data loses value within milliseconds of being generated
– For example, identifying a fraudulent stock transaction is useless a few minutes later
• Gartner predicts that 70% of Hadoop deployments will fail to meet their cost and revenue-growth objectives.
Enterprise Data Management Pipeline
[Diagram: siloed source databases, external feeds (batch), and streams flow via pub-sub, ETL, file imports, and stream processing into stages that store raw data, transform, aggregate, and analyze, serving users and other systems]
Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png
In Detail
• Unnecessary joins cause terrible performance
• Scaling vertically is expensive
• A rigid schema makes it hard to consolidate variable or unstructured data
• Differences between records must be reconciled during the aggregation phase
• Batch processes often run for hours overnight
• By then the data is too old for intraday decisions
A Quick Overview of MongoDB
Documents Enable Dynamic Schema & Optimal Performance
MongoDB:
{ customer_id: 1,
  first_name: "Mark",
  last_name: "Smith",
  city: "San Francisco",
  phones: [
    { number: "1-212-777-1212", dnc: true, type: "home" },
    { number: "1-212-777-1213", type: "cell" }
  ]
}
Relational:
Customer ID | First Name | Last Name | City
0 | John | Doe | New York
1 | Mark | Smith | San Francisco
2 | Jay | Black | Newark
3 | Meagan | White | London
4 | Edward | Daniels | Boston

Phone Number | Type | DNC | Customer ID
1-212-555-1212 | home | T | 0
1-212-555-1213 | home | T | 0
1-212-555-1214 | cell | F | 0
1-212-777-1212 | home | T | 1
1-212-777-1213 | cell | (null) | 1
1-212-888-1212 | home | F | 2
Document Model Benefits
• Agility and flexibility
– Data model supports business change
– Rapidly iterate to meet new requirements
• Intuitive, natural data representation
– Eliminates the ORM layer
– Developers are more productive
• Reduces the need for joins and disk seeks
– Programming is simpler
– Performance delivered at scale
{ customer_id: 1,
  first_name: "Mark",
  last_name: "Smith",
  city: "San Francisco",
  phones: [
    { number: "1-212-777-1212", dnc: true, type: "home" },
    { number: "1-212-777-1213", type: "cell" }
  ]
}
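The benefit of embedding can be sketched in plain Python (this is an illustration of the data-modeling idea, not MongoDB itself): with phones embedded in the customer document, one read returns everything, whereas the relational layout needs a second access (a join) against the phones table.

```python
# Embedded document: customer and phones travel together in one record.
customer_doc = {
    "customer_id": 1,
    "first_name": "Mark",
    "last_name": "Smith",
    "city": "San Francisco",
    "phones": [
        {"number": "1-212-777-1212", "dnc": True, "type": "home"},
        {"number": "1-212-777-1213", "type": "cell"},
    ],
}

# Relational layout: two tables that must be joined on customer_id.
customers = [{"customer_id": 1, "first_name": "Mark", "last_name": "Smith"}]
phones = [
    {"customer_id": 1, "number": "1-212-777-1212", "type": "home"},
    {"customer_id": 1, "number": "1-212-777-1213", "type": "cell"},
]

def phones_relational(cid):
    # Join: scan the phones table for rows matching the customer id.
    return [p["number"] for p in phones if p["customer_id"] == cid]

def phones_document(doc):
    # Embedded: the numbers are already in the document, no second access.
    return [p["number"] for p in doc["phones"]]

assert phones_document(customer_doc) == phones_relational(1)
```

Both functions return the same numbers; the difference is that the document version never touches a second table.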
MongoDB Technical Capabilities
[Diagram: an application connects through a driver and mongos routers to shards 1…N, each shard a replica set of a primary and two secondaries]
db.customer.insert({…})
db.customer.find({ name: "John Smith" })
1. Dynamic document schema:
{ name: "John Smith",
  date: "2013-08-01",
  address: "10 3rd St.",
  phone: { home: 1234567890, mobile: 1234568138 } }
2. Native language drivers
3. High availability
4. Workload isolation
5. High performance: data locality, indexes, RAM
6. Horizontal scalability: sharding
Drivers & Ecosystem: Morphia, MEAN Stack, Java, Python, Perl, Ruby
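The sharding idea above can be sketched as a routing function (an illustrative model, not MongoDB internals: the shard names and the hash scheme are assumptions standing in for hashed sharding):

```python
# A minimal sketch of shard routing: documents are partitioned by a shard
# key, and the router forwards each request to the shard owning that key.
import hashlib

SHARDS = ["shard1", "shard2", "shard3"]  # illustrative shard names

def route(shard_key_value):
    # A stable hash of the shard key deterministically picks the shard,
    # so every request for the same key lands on the same shard.
    h = int(hashlib.sha256(str(shard_key_value).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

assert route("John Smith") == route("John Smith")  # routing is stable
assert route(42) in SHARDS
```

The point of the sketch: routing is a pure function of the shard key, which is why adding shards scales writes and reads horizontally.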
3.2 Features Relevant for EDM
• WiredTiger as the default storage engine
• In-memory storage engine
• Encryption at rest
• Document validation rules
• Compass (data viewer & query builder)
• Connector for BI (visualization)
• Connector for Hadoop
• Connector for Spark
• $lookup (left outer join)
Data Governance with Document Validation
Implement data governance without sacrificing the agility that comes from a dynamic schema
• Enforce data quality across multiple teams and applications
• Use familiar MongoDB expressions to control document structure
• Validation is optional, and can range from a single field all the way to every field, covering existence, data types, and regular expressions
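What a validator enforces can be expressed as a plain predicate over documents, a sketch of the three kinds of rule the slide lists (existence, data types, regular expressions). The field names and the phone pattern are illustrative assumptions, not a real schema:

```python
# A sketch of document validation as a predicate: required fields must
# exist, types must match, and a regex constrains string values.
import re

PHONE_RE = re.compile(r"^1-\d{3}-\d{3}-\d{4}$")  # illustrative pattern

def valid_customer(doc):
    return (
        isinstance(doc.get("customer_id"), int)      # existence + type
        and isinstance(doc.get("last_name"), str)    # type check
        and all(PHONE_RE.match(p["number"])          # regex on each phone
                for p in doc.get("phones", []))
    )

assert valid_customer({"customer_id": 1, "last_name": "Smith",
                       "phones": [{"number": "1-212-777-1212"}]})
assert not valid_customer({"last_name": "Smith"})    # missing customer_id
```

In MongoDB itself the equivalent rules are attached to a collection as a validator expression, so writes that fail the predicate are rejected by the server rather than by application code.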
MongoDB Compass
For fast schema discovery and visual construction of ad-hoc queries
• Visualize schema
– Frequency of fields
– Frequency of types
– Determine validator rules
• View documents
• Graphically build queries
• Authenticated access
MongoDB Connector for BI
Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following:
• Provides the BI tool with the schema of the MongoDB collection to be visualized
• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing
• Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
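The translate-then-tabulate steps above can be made concrete with a toy sketch. This is not the real connector (which handles full SQL): it only covers a `WHERE column = value` predicate, and the helper names are invented for illustration.

```python
# Toy model of the BI connector's job: translate a SQL equality predicate
# into a MongoDB-style filter document, then flatten matching documents
# into the tabular rows a BI tool expects.
def sql_where_to_filter(column, value):
    # SELECT ... WHERE column = value  ->  find({column: value})
    return {column: value}

def to_rows(docs, columns):
    # Flatten documents into tuples; missing fields become None (NULL).
    return [tuple(d.get(c) for c in columns) for d in docs]

docs = [{"city": "London", "first_name": "Meagan"},
        {"city": "Boston", "first_name": "Edward"}]
f = sql_where_to_filter("city", "London")
matches = [d for d in docs if all(d.get(k) == v for k, v in f.items())]
assert to_rows(matches, ["first_name", "city"]) == [("Meagan", "London")]
```

The two halves mirror the bullet points: query translation on the way in, tabular conversion on the way out.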
Dynamic Lookup
Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling
• Blend data from multiple sources for analysis
• Higher performance analytics with less application-side code and less effort from your developers
• Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
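The left-outer-join semantics of `$lookup` can be simulated in a few lines of plain Python (a sketch of the behavior, not the server implementation; the collection contents are invented): each local document gets the matching foreign documents attached as an array, and a document with no match keeps an empty array rather than being dropped.

```python
# Pure-Python model of $lookup(localField, foreignField, as).
def lookup(local, foreign, local_field, foreign_field, as_field):
    out = []
    for doc in local:
        matches = [f for f in foreign
                   if f.get(foreign_field) == doc.get(local_field)]
        # Left outer join: unmatched documents survive with an empty array.
        out.append({**doc, as_field: matches})
    return out

orders = [{"_id": 1, "cust": "A"}, {"_id": 2, "cust": "Z"}]
customers = [{"cust": "A", "city": "Rome"}]
joined = lookup(orders, customers, "cust", "cust", "customer_info")
assert joined[0]["customer_info"] == [{"cust": "A", "city": "Rome"}]
assert joined[1]["customer_info"] == []   # no match, still kept
```

This "kept with an empty array" behavior is exactly what distinguishes a left outer join from an inner join.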
Aggregation Framework – Pipelined Analysis
Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value)
• $match filters out documents that don’t contain a red diamond
• $project adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes
• $lookup performs a left outer join with another collection, with the star being the comparison key
• Finally, the $group stage groups the data by the color of the square and produces statistics for each group
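The pipeline shape described above (filter, reshape, group) can be sketched as composed functions. This is a model of the stage semantics, not the driver API, and the sales data is an invented stand-in for the shapes-and-colors example:

```python
# Pure-Python model of a $match -> $project -> $group pipeline.
from collections import defaultdict

def match(docs, pred):                 # $match: filter documents
    return [d for d in docs if pred(d)]

def project(docs, fn):                 # $project: reshape each document
    return [fn(d) for d in docs]

def group(docs, key, acc):             # $group: accumulate per key
    buckets = defaultdict(list)
    for d in docs:
        buckets[d[key]].append(d)
    return {k: acc(v) for k, v in buckets.items()}

sales = [{"region": "EU", "amount": 10}, {"region": "EU", "amount": 5},
         {"region": "US", "amount": 7}]
pipeline = group(
    project(match(sales, lambda d: d["amount"] > 4),
            lambda d: {"region": d["region"], "amount": d["amount"]}),
    "region",
    lambda docs: sum(d["amount"] for d in docs))
assert pipeline == {"EU": 15, "US": 7}
```

Each stage consumes the previous stage's output, which is what lets MongoDB execute the whole analysis inside the database instead of in application code.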
Partner Ecosystem (500+)
MongoDB Architecture Patterns
1. Operational Data Store (ODS)
2. Enterprise Data Service
3. Datamart/Cache
4. Master Data Distribution
5. Single Operational View
6. Operationalizing Hadoop
System of Record
System of Engagement
Enterprise Data Management Pipeline
How do we choose the data-management layer for each of the stages?
Processing Layer?
When you want:
1. Secondary indexes
2. Sub-second latency
3. Aggregations in the DB
4. Updates of data
For:
1. Scanning files
2. When indexes are not needed
Wide column store (e.g. HBase), for:
1. Primary-key queries
2. If multiple indexes & slices are not needed
3. Optimized for writing, not reading
MongoDB + Hadoop/Spark Connector: distributed processing/analytics
MongoDB:
• Sub-second latency
• Expressive querying
• Flexible indexing
• Aggregations in the database
• Great for any subset of data
Hadoop/Spark:
• Longer jobs
• Batch analytics
• Append-only files
• Great for scanning all data or large subsets in files
Connectors:
- MongoDB Hadoop Connector
- Spark-mongodb
Both provide:
• Schema-on-read
• Low TCO
• Horizontal scale
Data Store for Raw Dataset
Store raw data
Transform
- Typically just writing record-by-record from source data
- Usually just need high write volumes
- All 3 options handle that
Transform read requirements:
- Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
- Might want to look up across tables with indexes (and join functionality in MongoDB 3.2)
- Want high read performance while writes are happening
Interactive querying on the raw data could use indexes with MongoDB
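Why an index makes interactive querying on the raw dataset feasible can be sketched in a few lines (an illustration of the idea behind secondary indexes, with invented data; real index structures are B-trees, not dicts):

```python
# A secondary index maps a field value to record positions, so a slice of
# the raw data is fetched directly instead of scanning every record.
from collections import defaultdict

records = [{"source": "sensor", "v": i} for i in range(1000)] + \
          [{"source": "social", "v": 1}]

index = defaultdict(list)                  # secondary index on "source"
for pos, rec in enumerate(records):
    index[rec["source"]].append(pos)

def find_by_source(s):
    # Index lookup: touches only matching records, never the full dataset.
    return [records[p] for p in index[s]]

assert len(find_by_source("social")) == 1
assert len(find_by_source("sensor")) == 1000
```

The contrast with a file scan is the whole argument of this slide: the index answers a slice query in time proportional to the slice, not to the dataset.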
Data Store for Transformed Dataset
Aggregate
Transform
Often benefits to updating data while merging multiple datasets
Dashboards & reports can have sub-second latency with indexes
Aggregate read requirements:
- Benefits to using indexes for grouping
- Aggregations natively in the DB would help
- With indexes, can do aggregations on slices of data
- Might want to look up across tables with indexes to aggregate
Data Store for Aggregated Dataset
Analyze / Aggregate
Dashboards & reports can have sub-second latency with indexes
Analytics read requirements:
- For scanning all of the data, any data store could work
- Often want to analyze a slice of data (using indexes)
- Querying on slices is best in MongoDB
Data Store for Last Dataset
Analyze
Users
Dashboards & reports can have sub-second latency with indexes
- At the last step, there are many consuming systems and users
- Need expressive querying with secondary indexes
- MongoDB is best option for the publication or distribution of analytical results and operationalization of data
Other Systems
- Often digital applications: high scale, expressive querying, JSON preferred
- Often RESTful services, APIs
The Complete EDM Architecture
[Diagram: siloed source databases, external feeds (batch), and streams feed a data processing pipeline via pub-sub, ETL, file imports, and stream processing; downstream systems consume through drivers & stacks: a single CSR application, unified digital apps, operational reporting, and analytic reporting; distributed processing over the Data Lake runs customer clustering, churn analysis, and predictive analytics]
- Governance to choose where to load and process data
- Optimal location for providing operational response times & slices
- Can run processing on all data or slices
Example scenarios
1. Single Customer View
   a. Operational
   b. Analytics on customer segments
   c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high-value customers
Single View of Customer
Spanish bank replaces Teradata and Microstrategy to increase business and avoid significant cost
Problem:
- Took days to implement new functionality and business policies, inhibiting revenue growth
- Branches needed an app providing a single view of the customer and real-time recommendations for new products and services
- Multi-minute latency for accessing customer data stored in Teradata and Microstrategy
Solution:
- Built single view of customer on MongoDB: a flexible and scalable app, easy to adapt to new business needs
- Super-fast ad hoc query capabilities (milliseconds) and real-time analytics thanks to MongoDB's Aggregation Framework
- Can now leverage distributed infrastructure and commodity hardware for lower total cost of ownership and greater availability
Results:
- Cost avoidance of $10M+
- Application developed and deployed in less than 6 months; new business policies easily deployed and executed, bringing new revenue to the company
- Current capacity allows branches to load all customer info instantly, in milliseconds, providing a great customer experience
Large Spanish Bank
Case Study
Insurance leader generates coveted single view of customers in 90 days: "The Wall"
Problem:
- No single view of customer, leading to poor customer experience and churn
- 145 years of policy data, 70+ systems, 15+ apps that are not integrated
- Spent 2 years and $25M trying to build a single view with Oracle; failed
Solution:
- Built "The Wall," pulling in disparate data and serving a single view to customer service reps in real time
- Flexible data model to aggregate disparate data into a single data store
- Churn analysis done with Hadoop, with relevant results output to MongoDB
Results:
- Prototyped in 2 weeks
- Deployed to production in 90 days
- Decreased churn and improved ability to upsell/cross-sell
Top 15 Global Bank
Kicking Out Oracle
Global bank with 48M customers in 50 countries terminates Oracle ULA & makes MongoDB its database of choice
Problem:
- Slow development cycles due to the RDBMS's rigid data model, hindering the ability to meet business demands
- High TCO for hardware, licenses, development, and support (>$50M Oracle ULA)
- Poor overall performance of customer-facing and internal applications
Solution:
- Building dozens of apps on MongoDB, both net new and migrations from Oracle: e.g., a significant portion of retail banking, including customer-facing and back-office apps, fraud detection, card activation, and equity research content management
- Flexible data model to develop apps quickly and accommodate diverse data
- Ability to scale infrastructure and costs elastically
Results:
- Able to cancel the Oracle ULA; evaluating which apps can be migrated to MongoDB; for new apps, MongoDB is the default choice
- Apps built in weeks instead of months or years, e.g., ebanking app prototyped in 2 weeks and in production in 4 weeks
- 70% TCO reduction