Data Platforms: Why Nothing Has Changed Except Everything | Max Clark
Everything has changed except us
Copyright Third Nature, Inc.
Everything has changed except us
February, 2015
Mark Madsen, www.ThirdNature.net, @markmadsen
The DW group as the crazy uncle of the organization
Madness: doing more of what you already did and expecting different results.
We’ve been struggling with shrinking load windows, performance problems, and most important, inability to quickly meet data needs, for a decade, yet we keep doing the same things to try to fix them.
I never said the “E” in EDW meant “everything”…
What do you mean, “Just tables?”
It’s going to get a lot worse
Not E
E
Conclusion: any methodology built on the premise that you must know and model all the data first is untenable
The good news is: we solved the bigness problem
Source: Noumenal, Inc.
Now, analytics embiggens the data volume problem
Many of the processing problems are O(n²) or worse, so small data can be a problem for DB‐based platforms
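A rough illustration of why quadratic cost bites even at modest sizes: the sketch below counts the comparisons needed by an all-pairs operation. The all-pairs case (for example, a naive similarity or duplicate-detection pass) is an assumed example, not one named in the slides, and the function name is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Record counts that would be considered small for a data warehouse.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {pairwise_comparisons(n):>13,} comparisons")
```

A tenfold increase in records produces roughly a hundredfold increase in work, which is why row counts alone say little about how hard an analytic workload is.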
What makes data “big”?
Aside from very large amounts:
Hierarchical structures
Nested structures
Linked structures
Encoded values
Non‐standard (for a database) types
Deep structure
Human authored text
“big” is better off being defined as “complex” or “hard to manage”
Datasets today: Interconnection and Dependency
Dynamic models are missing from most data systems today. These drive new workloads, generate different data, need new techniques.
Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data, Danny Holten
It’s not the number of genes that determine complexity, it’s the interactions between them.
Source: M. Pertea and S. Salzberg/Genome Biology 2010
Categorizing the measurement data we collect
The convenient data is the transactional data.
▪ Goes in the DW and is used, even if it isn’t the right measurement.
The inconvenient data is observational data.
▪ It’s not neat, clean, or designed into most systems of operation.
The difficult and misleading data is declarative data.
▪ What people say and what they do require ground truth.
We need an architecture that supports all three categories.
Observations
Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
Declarations
Unstructured is Not Really Unstructured
Unstructured data isn’t really unstructured: objects have structure, language has structure. Text can contain traditional structured data elements. The problem is that the content is unmodeled.
Our real problem is making implicit structure explicit.
Conclusion: the data warehouse must cope with more complex data structures, storage and processing.
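As a minimal sketch of making implicit structure explicit, the example below pulls traditional structured elements out of a sentence of human-authored text. The note, the field names, and the regular expressions are all hypothetical; real text would need far more robust extraction than this.

```python
import re

# Hypothetical free-text note; the patterns and field names are illustrative.
NOTE = "Customer 4411 called on 2015-02-03 about order A-1093, refund of $25.00 issued."

PATTERNS = {
    "customer_id": re.compile(r"Customer (\d+)"),
    "date": re.compile(r"(\d{4}-\d{2}-\d{2})"),
    "order_id": re.compile(r"order ([A-Z]-\d+)"),
    "amount": re.compile(r"\$(\d+\.\d{2})"),
}


def extract_fields(text: str) -> dict:
    """Return the structured elements found in text; absent fields are omitted."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[name] = match.group(1)
    return record


print(extract_fields(NOTE))
```

The content was never unstructured; the structure was simply unmodeled until the patterns named it.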
The creation, flow and use of data is different for transactions and machine‐generated events
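[Figure: two data flows, one for transactions and one for machine‐generated events. Labels: Data entry, Store, Extract, Cleanse, Load, Use, Transactions, MDM; Program, Generate, Capture, Store, Cleanse, Use]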
This runs at human speed
This runs at machine speed, with slower feedback cycle
We’re moving BI from information to actuation
This means monitoring data as it flows, detecting rather than querying, and feeding results back to the sources.
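One way to picture the difference between detecting and querying is a monitor that inspects each event as it passes, rather than storing everything and asking questions later. The sketch below is an assumption-laden illustration: the `monitor` function, the sensor readings, the threshold, and the feedback callback are all hypothetical.

```python
from typing import Callable, Iterable, Iterator


def monitor(
    events: Iterable[dict],
    is_exception: Callable[[dict], bool],
    notify_source: Callable[[dict], None],
) -> Iterator[dict]:
    """Yield exceptional events as they arrive and send feedback to their source."""
    for event in events:
        if is_exception(event):
            notify_source(event)  # feedback loop back to the originating system
            yield event


# Hypothetical sensor readings and threshold.
readings = [
    {"sensor": "pump-1", "temp_c": 61.0},
    {"sensor": "pump-2", "temp_c": 93.5},
    {"sensor": "pump-1", "temp_c": 64.2},
]

feedback = []  # stands in for a message sent back to the source
alerts = list(
    monitor(
        readings,
        is_exception=lambda e: e["temp_c"] > 90.0,
        notify_source=lambda e: feedback.append(e["sensor"]),
    )
)
print(alerts, feedback)
```

Nothing here waits for a load window: the decision is made while the data is in motion.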
The architecture we’ve been using.
The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published.
“An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol.27, No. 1, (1988)
Origins: in 1988 there was only big hair.
▪ No real commercial email, public internet barely started
▪ Storage state of the art: 100MB, cost $10,000/GB
▪ Oracle Applications v1 GL released; SAP goes public, enters US market
▪ Unix is mostly run by long‐haired freaks
▪ Mobile was this
This is the context: scarcity of data, of system resources, of automated systems outside core financials, of money to pay for storage.
We think of BI as publishing, an old metaphor.
Publishing has value, but may not be actionable.
Data strategy means understanding the context of data use so we can build the right infrastructure
Collect new data
Monitor Analyze Exceptions
Analyze Causes Decide Act
Act on the process
Act within the process
We need to focus on what people do with information as the primary task, not on the data or the technology.
The usage models for conventional BI
Collect new data
Monitor Analyze Exceptions
Analyze Causes Decide Act
No problem No idea Do nothing
Act on the process: usually days/longer timeframe
Act within the process: usually real-time to daily
This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP
The usage models for analytics and “big data”
Collect new data
Monitor Analyze Exceptions
Analyze Causes Decide Act
No problem No idea Do nothing
Act on the process: usually days/longer timeframe
Act within the process: usually real-time to daily
Analytics and big data are focused on new use cases: deeper analysis, causes, prediction, optimizing decisions
This isn’t ad-hoc, reporting, or OLAP.
As practices evolve based on new capabilities…
A new level of complexity develops on top of the older, now better understood processes, leading to new data and analysis needs.
Growing complexity has changed our context
Internal 3rd party & custom applications, logs, event streams, hosted & external apps, 3rd party datasets…
Enterprise architecture changes
External = no data layer access
SOA and REST = no data layer access
Streams and messages are becoming the norm
Observations and Transactions
Reality: continuous change in the DW
You can’t keep up with source changes
You can’t keep up with new data requests
You are already scale‐, performance‐, and latency‐limited
But:
Many parts of the organization need current operational data
The emerging big data market has an answer…
Centralize: that solves all problems!
Creates bottlenecks
Causes scale problems
Enforces a single model
Data quality and definitions in a single schema are based on the strictest requirement, reducing flexibility
The data warehouse vs business agility
All the data
Common, typed, tabular data
The bottleneck is you
We have a design for stability. We need one for adaptability
Which is best, 3NF or dimensional?
The core assumption that there can be just one big schema model on one big platform is flawed.
Answer: neither.
We think we can model all the data before use, but that’s a bottleneck. Current techniques for modeling and managing data are too rigid and incapable of describing all the possible relationships.
A core problem with one big schema is change
Big data answer?
Schema‐on‐read!
There’s a price to pay with using “schema‐on‐read” for everything.
You won’t see the problems with this until you add a second application, and a third.
“One writer‐many readers” kills schema‐on‐read benefits.
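One reading of this point is that with schema‐on‐read every reader has to carry its own copy of the schema, and the copies drift apart. The sketch below assumes a single writer that changed the format of a field partway through, and two hypothetical readers that interpret the same raw lines differently. The data and both reader functions are invented for illustration.

```python
import json

# Raw records as a single writer stored them (hypothetical data). The second
# record reflects a writer-side change: "amount" became a string with a currency.
RAW_LINES = [
    '{"id": 1, "amount": 19.99}',
    '{"id": 2, "amount": "24.50 USD"}',
]


def reader_a_total(lines):
    """Reader A assumes amount is always numeric and skips anything else."""
    total = 0.0
    for line in lines:
        amount = json.loads(line)["amount"]
        if isinstance(amount, (int, float)):
            total += amount
    return total


def reader_b_total(lines):
    """Reader B also accepts strings such as '24.50 USD'."""
    total = 0.0
    for line in lines:
        amount = json.loads(line)["amount"]
        if isinstance(amount, str):
            amount = float(amount.split()[0])
        total += amount
    return total


# Same data, two applications, two different answers.
print(reader_a_total(RAW_LINES), reader_b_total(RAW_LINES))
```

With one application the flexibility is free. With a second and a third, the interpretation logic is duplicated and nobody owns the definition of `amount`.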
Why is the choice no schema or hard schema?
Simple key‐value files give you flexibility in some areas. Tables give you flexibility in other areas.
Which area do you need flexibility in and why?
Programs writing data?
Files Tables
Programs processing data?
Programs reading data?
Why not flexible schemas instead of either-or?
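A flexible schema could take many forms. One possible sketch, under the assumption that a small typed core is shared while everything else passes through unmodeled, looks like this. The `Event` type, its fields, and `parse_event` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Event:
    """Hypothetical record type: a fixed, typed core plus free-form extras."""
    event_id: int
    source: str
    extras: Dict[str, Any] = field(default_factory=dict)


CORE_FIELDS = {"event_id": int, "source": str}


def parse_event(raw: Dict[str, Any]) -> Event:
    """Validate only the core fields; carry everything else through untouched."""
    for name, expected in CORE_FIELDS.items():
        if name not in raw:
            raise ValueError(f"missing core field: {name}")
        if not isinstance(raw[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    extras = {k: v for k, v in raw.items() if k not in CORE_FIELDS}
    return Event(raw["event_id"], raw["source"], extras)


event = parse_event({"event_id": 7, "source": "web", "browser": "firefox"})
print(event)
```

Programs writing data can add fields without breaking anyone, while programs reading data can still rely on the core being present and typed.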
“We can't solve problems by using the same kind of thinking we used when we created them.”
Albert Einstein
With too much data the approach has to be inverted
The process we still use:
1. Model
2. Collect
3. Analyze
The new process is:
1. Collect
2. Analyze
3. Model
4. Promote
This is a shift from planned design to evolutionary design for the data warehouse
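The inverted process can be sketched in a few lines. Everything below is illustrative: the records are invented, and the modeling rule (keep only the fields present in every record) is an assumption chosen to keep the example short, not a recommendation from the slides.

```python
from collections import Counter

# 1. Collect: store records as they arrive, with no model imposed (hypothetical data).
collected = [
    {"user": "ann", "page": "/home", "ms": 120},
    {"user": "bob", "page": "/cart", "ms": 340, "coupon": "X1"},
    {"user": "ann", "page": "/cart", "ms": 95},
]

# 2. Analyze: find out which fields actually occur and how often.
field_counts = Counter(key for record in collected for key in record)

# 3. Model: derive a schema from observed use. The rule "keep fields present in
#    every record" is an illustrative assumption.
schema = sorted(k for k, n in field_counts.items() if n == len(collected))

# 4. Promote: publish the modeled subset as rows for shared use.
promoted = [tuple(record[k] for k in schema) for record in collected]

print(schema)
print(promoted)
```

The model is an output of analysis rather than a precondition for collection, and the unmodeled `coupon` field is still available in the collected data if a later use needs it.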
The solution to our problems isn’t necessarily technology, it’s architecture.
Workloads
                OLTP                BI             Analytics
Access          Read‐Write          Read‐only      Read‐mostly
Predictability  Predictable         Unpredictable  Fixed path
Selectivity     High                Low            Low
Retrieval       Low                 Low            High
Latency         Milliseconds        < seconds      msecs to days
Concurrency     Huge                Moderate       1 to huge
Model           3NF, nested object  Dim, denorm    BWT
Task size       Small               Large          Small to huge
DATA ARCHITECTURE
We’re so focused on the light switch that we’re not talking about the light
Decoupled Data Architecture
The core of the data warehouse isn’t the database, it’s the data architecture that the database and tools implement.
We need a data architecture that is not limiting:
▪ Deals with data and schema change easily
▪ Does not always require up front modeling
▪ Does not limit the format or structure of data
▪ Assumes a full range of data latencies, from streaming to one‐time bulk loads, both in and out
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
Integrate
Manage
Decouple data architecture layers
Use
This implies a new warehouse architecture and data modeling approaches
Collect
Transactions Observations Declarations
Break down the monolithic architecture
The technology architecture must change, based on work done with the data:
▪ Collection separate from
▪ Data management separate from
▪ Data delivery and use
Data may live in more than one place because it may have more than one model, for more than one use, using more than one engine
Reinforcing relationships keep architectures from changing, despite radical technology shifts
Note how only one third is tech
Architectural Regime
Methodology
Technology
Organization
Organization defines where the work is done and the roles.
Technology defines what work can be done in a given area.
Methodology defines how work is done and what that work is.
Agile architectures without agile methods fail
How can you move to a more agile architecture?
Start by deploying faster.
Things will break.
You will fix them.
You will get better.
So will your architecture.
The geography we have been using is out of date
The box we created:
• not any data, rigidly typed data
• not any form, tabular rows and columns of typed data
• not any latency, persist what the DB can keep up with
• not any process, only queries
The digital world was diminished to only what’s inside the box until we forgot the box was there.
Data infrastructure is a platform
▪ Any data – structures, forms
▪ Any latency – in motion, at rest
▪ Any process – query, algorithm, transform
▪ Any access – SQL, API, queue, file movement
Don’t follow the market
Some people can’t resist getting the next new thing because it’s new and new is always better.
Many IT organizations are like this, promoting a solution and hunting for the problem that matches it.
Better to ask “What is the problem for which this technology is the answer?”
Think like an architect, not like a consumer
No more “enterprise standard” ‐ now “what works”
The technology providers are selling you what they have, not what you need.
Follow the goals of the business.
Translate the goals into capabilities and match those to the architecture required.
“The future, according to some scientists, will be exactly like the past, only far more expensive.” ~ John Sladek
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
round hole square peg ‐ https://www.flickr.com/photos/epublicist/3546059144
firemen not noticing fire.jpg ‐ http://flickr.com/photos/oldonliner/1485881035/
pyramid_camel_rider.jpg ‐ http://www.flickr.com/photos/khalid‐almasoud/1528054134/
House on fire ‐ http://flickr.com/photos/oldonliner/1485881035/
glass_buildings.jpg ‐ http://www.flickr.com/photos/erikvanhannen/547701721
Circos, Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data, Danny Holten
text composition ‐ http://flickr.com/photos/candiedwomanire/60224567/
Building demolition ‐ https://www.flickr.com/photos/gregpc/4429888820
peek_fence_dog.jpg ‐ http://www.flickr.com/photos/webwalker/114998078/
donuts_4_views.jpg ‐ http://www.flickr.com/photos/le_hibou/76718773/
shady_puppy_sales.jpg ‐ http://www.flickr.com/photos/brizzlebornandbred/5001120150
subway dc metro ‐ http://flickr.com/photos/musaeum/509899161/
About the Presenter
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award‐winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
About Third Nature
Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.
Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.