How to Build a Data-Driven Company: From Infrastructure to Insights
Uploaded by: janessa-lantz
Category: Technology
Transcript of How to Build a Data-Driven Company: From Infrastructure to Insights
#datastack
Shaun
What you’re going to learn
1. How top engineering organizations are building their data infrastructure
2. The 7 core challenges of data integration
3. Why companies like Asana, Buffer, and SeatGeek choose Redshift for their analytics warehouse
...and much more!
Shaun
Data Infrastructure: Then and Now
Dillon
The traditional approach: ETL
Dillon
[Diagram: event data and transactional data; teams shown: ETL team, EDW team, BI team, end user; stage labels: heavy transformation, OLAP / silos, summary, restricted Q&A]
How companies are doing it today: ELT
Dillon
[Diagram labels: Extract, Load, Database, 3rd Party Data, Modeling Layer, Transform at Query, Analytics, Viz & Exploration. Caption: Transform (and Explore!)]
Example model definition shown on the slide:
    - name: first_purchasers
      type: single_value
      base_view: orders
      measures: [orders.customer.all]
Benefits of this approach
1. Redshift is performant enough to handle most transformations
2. Users prefer performing transformations in a language they already use (SQL) or with a UI
3. Transformations are much simpler, more transparent
4. Performing transformations alongside raw data is great for auditability
Dillon
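The ELT pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the presenters' implementation: Python's built-in sqlite3 stands in for Redshift, and the table raw_orders, the view first_purchasers, and the sample rows are all hypothetical.

```python
import sqlite3

# SQLite stands in for the warehouse here (assumption); the deck uses Redshift.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land in the warehouse untransformed.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, "
    "amount_cents INTEGER, created_at TEXT)"
)
raw_rows = [  # hypothetical sample data
    (1, 10, 2500, "2015-01-03"),
    (2, 11, 4000, "2015-01-04"),
    (3, 10, 1500, "2015-02-01"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: a SQL view defined next to the raw data, evaluated at query time.
conn.execute(
    """
    CREATE VIEW first_purchasers AS
    SELECT customer_id,
           MIN(created_at) AS first_order_date,
           COUNT(*) AS order_count,
           SUM(amount_cents) / 100.0 AS lifetime_value
    FROM raw_orders
    GROUP BY customer_id
    """
)

first_purchasers = conn.execute(
    "SELECT customer_id, first_order_date, order_count, lifetime_value "
    "FROM first_purchasers ORDER BY customer_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(first_purchasers)
```

Because the view only reads raw_orders, the raw rows stay intact and any transformed number can be traced back to them, which is the auditability point in the list above.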
Data infrastructure has geek cred
Shaun
What the stack looks like
Shaun
Data Integration
Data Warehouse
BI/Analytics
Data Integration
Shaun
Why consolidation matters
Common data sources for internal analytics
Shaun
Quick poll
Shaun
Which five data sources are your top priority to integrate or keep integrated?
● production databases
● events
● error logs
● billing
● email marketing
● CRM
● advertising
● ERP
● A/B testing
● support
“A year ago, we were facing a lot of stability problems with our data processing. When there was a major shift in a graph, people immediately questioned the data integrity. It was hard to distinguish interesting insights from bugs. Data science is already an art so you need the infrastructure to give you trustworthy answers to the questions you ask. 99% correctness is not good enough. And on the data infrastructure team, we were spending a lot of time churning on fighting urgent fires, and that prevented us from making much long-term progress. It was painful.”
- Marco Gallotta, Asana, How to Build Stable, Accessible Data Infrastructure at a Startup
“Our story would end here if real-time processing were perfect. But it’s not: some events can come in days late, some time ranges need to be re-processed after initial ingestion due to code changes or data revisions, various components of the real-time pipeline can fail, and so on.”
- Gian Merlino, MetaMarkets, Building a Data Pipeline That Handles Billions of Events in Real-Time
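One common way to cope with late and re-sent events is to make loads idempotent, so that replaying a batch is always safe. The sketch below is a generic illustration and not the pipeline described in the quote: it assumes every event carries a unique event_id, uses Python's built-in sqlite3 in place of a real warehouse, and all table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY, occurred_at TEXT, value INTEGER)"
)


def load_batch(conn, batch):
    """Load a batch idempotently: re-sent or revised events overwrite by event_id."""
    conn.executemany(
        "INSERT OR REPLACE INTO events (event_id, occurred_at, value) VALUES (?, ?, ?)",
        batch,
    )
    conn.commit()


def daily_totals(conn):
    """Aggregate by when the event occurred, not by when it arrived."""
    return conn.execute(
        "SELECT occurred_at, SUM(value) FROM events "
        "GROUP BY occurred_at ORDER BY occurred_at"
    ).fetchall()


load_batch(conn, [("e1", "2015-03-01", 5), ("e2", "2015-03-02", 7)])
# A late event for March 1 arrives afterwards, and e2 is re-sent with a revised value.
load_batch(conn, [("e3", "2015-03-01", 2), ("e2", "2015-03-02", 9)])
# Replaying the same batch (for example after a pipeline failure) changes nothing.
load_batch(conn, [("e3", "2015-03-01", 2), ("e2", "2015-03-02", 9)])

totals = daily_totals(conn)
print(totals)
```

Keying on event_id rather than arrival order means a time range can be re-processed without double counting.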
7 core challenges of data integration
Connections: Every API is a unique and special snowflake
Accuracy: Ordering data on a distributed system
Latency: Large object data stores (Amazon S3, Redshift) are optimized for batches, not streams
Scale: Data will grow exponentially as your company grows
Flexibility: You’re interacting with systems you don’t control
Monitoring: Notifications for expired credentials, errors, and disruptions
Maintenance: Justifying investment in ongoing maintenance/improvement
Shaun
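The latency challenge above is often handled by micro-batching: records are buffered as they arrive and written to the batch-oriented store in groups. The following is a minimal sketch assuming a size-based flush only (real systems usually also flush on a timer); the MicroBatcher class and its parameters are hypothetical names used for illustration.

```python
from typing import Callable, List


class MicroBatcher:
    """Buffers incoming records and hands them to a loader in batches."""

    def __init__(self, flush: Callable[[List[dict]], None], max_size: int = 3):
        self._flush = flush          # called with each full batch
        self._max_size = max_size    # flush once this many records are buffered
        self._buffer: List[dict] = []

    def add(self, record: dict) -> None:
        self._buffer.append(record)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []  # start a fresh list; the flushed batch is untouched


# A list stands in for the warehouse loader in this sketch.
loaded_batches: List[List[dict]] = []
batcher = MicroBatcher(flush=loaded_batches.append, max_size=3)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()  # flush the final partial batch

batch_sizes = [len(b) for b in loaded_batches]
print(batch_sizes)
```

A larger max_size means fewer, bigger loads at the cost of data arriving later, which is the trade-off behind the latency item in the list.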
Or...try Pipeline
Shaun
[Diagram: source categories: Ad Platforms, Customer Support, Web Data, Marketing Automation, CRM, Payments, Ecommerce]
Warehousing Infrastructure
Shaun
Analytics warehouse
Shaun
Redshift is the most common analytics warehouse.
Chosen by: Asana, Braintree, Looker, SeatGeek, VigLink, Buffer
Why Redshift is awesome
Shaun
Airbnb experiment
Shaun
                                          Hive              Redshift
Test 1: 3 billion rows of data            28 minutes        <6 minutes
Test 2: two joins with millions of rows   182 seconds       8 seconds
Cost                                      $1.29/hour/node   $0.85/hour/node
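The speedups implied by these figures can be worked out directly. The short calculation below uses only the numbers on the slide and treats "<6 minutes" as an upper bound of 6, so the Test 1 ratio is a lower bound. The slide gives no node counts, so total cost per query cannot be derived; only the per-node-hour prices are compared.

```python
# Figures from the slide; "<6 minutes" is treated as an upper bound of 6 (assumption).
hive_test1_min, redshift_test1_min = 28, 6
hive_test2_sec, redshift_test2_sec = 182, 8
hive_cost, redshift_cost = 1.29, 0.85  # dollars per node-hour

test1_speedup = hive_test1_min / redshift_test1_min  # lower bound on the speedup
test2_speedup = hive_test2_sec / redshift_test2_sec
cost_ratio = redshift_cost / hive_cost               # per-node-hour price ratio

print(f"Test 1 speedup: more than {test1_speedup:.1f}x")
print(f"Test 2 speedup: {test2_speedup:.2f}x")
print(f"Redshift node-hour price: {cost_ratio:.0%} of the Hive figure")
```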
Periscope research
Shaun
DiamondStream’s dashboard query performance
Shaun
Business Intelligence & Analytics
Dillon
A broken model
Dillon
● Feedback loop is broken
● Disparate reporting
● Non-unified decision making
● Versioning
● Reusability is lost
[Diagram: separate teams shown: Marketing, Finance, AM]
Constraints of SQL
Dillon
SQL is versatile, but it shares the flavor of write-only languages such as Perl:
● Can write but not read
● Promotes one-off, piecemeal analysis
● Disparate interpretation
The critical multiplier: modeling
Dillon
Any SQL Data Warehouse
Modeling Layer
● What’s our most successful marketing campaign?
● How does our Q4 pipeline look?
● Who are our healthiest / happiest customers?
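To make the idea of a modeling layer concrete, here is a toy sketch in which dimensions and measures are defined once and every question is compiled to SQL from those shared definitions. It is not LookML and not how Looker works internally; the Model class, the orders table, and the column names (utm_campaign, amount, customer_id) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Model:
    """A shared definition of one table: named dimensions and measures."""

    table: str
    dimensions: Dict[str, str]  # name -> SQL expression
    measures: Dict[str, str]    # name -> SQL aggregate expression

    def query(self, dimension: str, measure: str) -> str:
        """Compile a (dimension, measure) request into a SQL string."""
        dim_sql = self.dimensions[dimension]
        measure_sql = self.measures[measure]
        return (
            f"SELECT {dim_sql} AS {dimension}, {measure_sql} AS {measure} "
            f"FROM {self.table} GROUP BY {dim_sql} ORDER BY {measure} DESC"
        )


# Definitions live in one place, so every team gets the same "revenue".
orders = Model(
    table="orders",
    dimensions={"campaign": "utm_campaign"},
    measures={"revenue": "SUM(amount)", "customers": "COUNT(DISTINCT customer_id)"},
)

# "What's our most successful marketing campaign?" becomes a lookup, not a one-off query.
sql = orders.query("campaign", "revenue")
print(sql)
```

Since each measure has a single definition, two people asking the same question get the same SQL, which addresses the "disparate interpretation" problem raised earlier in the deck.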
Interactive, collaborative analytics
Dillon
● Data access
● Uniform definitions
● A shared view
● Collaboration
● Analytical speed
What You Can Do
Dillon
Integrated data + analytics tools
Dillon
[Diagram: timeline labeled Week 1 and Week 2-3, with RJMetrics Pipeline and Blocks]
Looker blocks: sales & marketing
Looker blocks: event analytics
Thank you!