The evolution of advertising technology and the importance of personalization
EXPECTATIONS
AGENDA
§ Introduction
§ Data Collection
§ Data Consumption
§ Data Analysis
§ Exposing Data
§ Q & A
DharmicData, Data Center of Excellence
Data Strategy
Data Management Platform
Data-Driven Solutions
KPIs and experimentation
Transforming the whole Value Chain
http://www.dharmicdata.com
@dharmicdata @fsroque @moshtan
Data is everywhere
(BIG) DATA
Contains information
Extracting information allows us to act proactively
Overcome problems
Optimize returns
The more data we can collect and use, the more information we'll generate, and the better we'll operate.
(BIG) DATA
Varying perceptions of “BIG DATA”
BIG DATA
“It’s about being smarter with your data?”
“It means making faster decisions?”
“It simply means more data?”
“It’s about cheaper storage technology?”
“It’s all about social media?”
What is (Big) Data?
Volume, Velocity, Variety, Value
Big Data:
• Data in motion: enabling real-time decisions
• Data in many forms: structured, unstructured, text, multimedia
• Data in numbers: extracting business insights and revenue from data
• Data at scale: terabytes to petabytes of data (or when your work processes dictate)
Trends driving the importance of Big Data
Businesses' Big Data objectives:
• Customer-centric outcomes: 49%
• Operational optimization: 18%
• Risk/financial management: 15%
• New business model: 14%
• Employee collaboration: 4%
“Analytics: The real-world use of Big Data”. IBM Institute for Business Value study, October 2012
• Everything is digitized
• Advanced analytics technologies
• Customer-centricity: ‘smarter’ solutions
Data is everywhere, and should be accessible
PIPELINES
SENSORS, SOCIAL INTERACTIONS
BEHAVIOR
CONSUMPTION
SOLUTION’S QUALITY
CAPTURE DATA → $$$
Capturing most Data
PIPELINES
Turn it into valuable information
A pipeline ties several data processing steps together.
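The idea above can be sketched in a few lines: each step is a plain function, and the pipeline ties them together by feeding one step's output into the next. A minimal sketch; the step names and event format are illustrative, not from the deck.

```python
def parse(raw):
    """Turn raw event strings into (user, event) records."""
    return [line.split(",") for line in raw]

def clean(records):
    """Drop malformed records (here: wrong field count)."""
    return [r for r in records if len(r) == 2]

def aggregate(records):
    """Count events per user."""
    counts = {}
    for user, _event in records:
        counts[user] = counts.get(user, 0) + 1
    return counts

def run_pipeline(raw, steps):
    data = raw
    for step in steps:
        data = step(data)  # each step consumes the previous step's output
    return data

result = run_pipeline(
    ["alice,click", "bob,view", "alice,buy", "broken-line"],
    [parse, clean, aggregate],
)
print(result)  # {'alice': 2, 'bob': 1}
```

Because the steps share only a data-in/data-out contract, they can be reordered, swapped, or moved onto separate systems, which is exactly what the issues below are about.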
ISSUES DEALING WITH DATA
From batches to pipelines
#1 TRANSPORTING DATA BETWEEN SYSTEMS
DATA INTEGRATION (~ETL)
RELATIONAL DATABASES
HADOOP
SEARCH AND INDEXING
MONITORING
KEY-VALUE STORES
http://www.confluent.io/blog/stream-data-platform-1/
THE SPAGHETTI MONSTER
#2 NEED FOR RICH ANALYTICAL DATA PROCESSING
VERY LOW LATENCY DATA PROCESSING
STREAM PROCESSING
REAL-TIME ANALYTICS
DATA CLOSE TO PROCESSING
MORE ISSUES
• Lossy and high-latency connections
• Segmented (siloed) data sources
• Batched database migrations, data insertions, etc.
• Unscalable, tightly connected systems
• ‘Duct taped connections’
• Unreliable data – leading to a lot of QA
• No room for data processing outside of batch, data archival or ad hoc processing
http://www.confluent.io/blog/stream-data-platform-1/
STREAM DATA PLATFORM TO THE RESCUE
UNIVERSAL DATA PIPELINE
CONTINUOUS FEEDS OF WELL-FORMED DATA
http://www.confluent.io/blog/stream-data-platform-1/
THE STREAM DATA PLATFORM
REAL-TIME STREAM PROCESSING
STREAM
User profile
Enrich user profile
Store in db
Predict user behavior
Target user
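The chain on this slide can be written as a composition of small processing stages, one per box. A toy sketch: every function body is a stand-in for a real service, and the thresholds and field names are invented for illustration.

```python
db = {}  # stands in for the profile store

def enrich(profile):
    """Enrich user profile: derive a segment from past behavior."""
    segment = "high_value" if profile["purchases"] > 5 else "casual"
    return dict(profile, segment=segment)

def store(profile):
    """Store in db."""
    db[profile["user_id"]] = profile
    return profile

def predict(profile):
    """Predict user behavior: toy score scaling with past purchases."""
    return dict(profile, buy_score=min(1.0, profile["purchases"] / 10))

def target(profile):
    """Target user: act on the prediction."""
    return "send_offer" if profile["buy_score"] > 0.5 else "no_action"

def process(event):
    # the stream applies the stages in the slide's order
    return target(predict(store(enrich(event))))

print(process({"user_id": "u1", "purchases": 7}))  # send_offer
```

In a real stream processor each stage would be a separate operator reading from and writing to the platform, but the data flow is the same.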
WHAT DOES A STREAM DATA PLATFORM NEED TO DO?
FAST? HIGH THROUGHPUT?
SCALE WELL?
KEY REQUIREMENTS FOR A STREAM DATA PLATFORM
• Reliable, no data loss
• High Throughput to handle large event data
• Persist data for longer periods, enabling batch-based workflows
• Low latency data for real-time applications
• Central system
• Close integration with stream processing systems
STREAM DATA PLATFORM RELATED TO EXISTING THINGS
STREAM DATA PLATFORM vs. ENTERPRISE MESSAGING SYSTEM
• One-off deployment vs. central data hub
• Limited storage capacity vs. large log history
STREAM DATA PLATFORM vs. DATA INTEGRATION TOOLS
• Disparate tools and deployments vs. a true platform
• Many routine data-cleanup steps vs. stream abstraction and data locality, making it easier to tap into and build applications around a stream
STREAM DATA PLATFORM vs. ENTERPRISE SERVICE BUSES
• Transformation logic embedded in the bus vs. data transformation decoupled from the stream
• Processing tasks need agreement from multiple stakeholders vs. individual teams can use and reuse streams with no bottleneck
STREAM DATA PLATFORM vs. DATA WAREHOUSES AND HADOOP
• Quickly flow data and publish results; warehouses and Hadoop provide long-term storage
A BIG IDEA
The democratization of “data” => making data available through more of the organization
The democratization of the “cluster” => making data + resources available through more of the organization
In the Hypervisor world, low utilization has been widely observed
A McKinsey study in 2008 pegged data-center utilization at roughly 6 percent.
A Gartner report from 2012 put industry wide utilization rate at 12 percent.
An Accenture paper, from 2011, sampling Amazon EC2 machines found 7 percent utilization over the course of a week.
The business case for Warehouse-Scale Computing
Arguments for WSC
Rather than running several specialized clusters, each at relatively low utilization rates, run many mixed workloads; obvious benefits are realized in terms of:
• Scalability, elasticity, fault tolerance, performance, utilization
• Reduced equipment capex, ops overhead, etc.
• Reduced licensing, eliminating the need for VMs and potential vendor lock-in
• Reduced time for engineers to ramp up new services at scale
• Reduced latency between batch and services, enabling new high-ROI use cases
• Enables dev/test apps to run safely on a production cluster
• Eases deployment
Prior Practice
• Low utilization rates
• Longer time to ramp up new services
• Even more machines to manage
• Substantial performance decrease
• VM licensing costs and specific data center vendor economics
• Failures make static partitioning more complex to manage
Current Practice: WSC
“We wanted people to be able to program for the datacenter just like they program for their laptop.”- Ben Hindman, Co-creator of Mesos
WAREHOUSE SCALE COMPUTING
Server management and granular resource allocation
External scalability, horizontal scalability, health checks, monitoring, and scheduling
MESOS, MARATHON, CHRONOS
BREAK
WHERE TO FIND DATA?
Step 0. Find Data
Step 1. Ingest Data
GETTING DATA
REQUEST/RESPONSE
STREAMING
REQUEST/RESPONSE
Classic: the client just asks the third party
Issue a request, return a response
- What if the service returns a lot of data?
- What if the service generates data very fast?
- What if the data points will only be sent once?
STREAMING
A permanent connection is made between the service and the consumer
The data flows continuously through the pipe. The consumer subscribes to the service
- What if the incoming data rate is too high?
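The contrast between the two access patterns can be shown in plain Python: the request/response side returns one bounded payload per call, while the streaming side is modeled as a generator the consumer subscribes to and pulls from continuously. A sketch only; the function names and payload shapes are invented.

```python
import itertools

def request_response(query):
    """One request, one bounded response."""
    return {"query": query, "rows": [1, 2, 3]}

def stream(source):
    """An unbounded feed; the consumer decides how much to take."""
    for i in itertools.count():
        yield {"seq": i, "value": source(i)}

resp = request_response("sales today")

# subscribing = holding the generator; consuming = pulling from it
feed = stream(lambda i: i * i)
first_three = list(itertools.islice(feed, 3))

print(resp["rows"], [e["value"] for e in first_three])  # [1, 2, 3] [0, 1, 4]
```

The generator also illustrates the "rate too high" question: the consumer pulls at its own pace, which is one simple form of backpressure; push-based streams need an explicit buffering or dropping strategy instead.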
MICROSERVICES ARCHITECTURE
Thin collection layer:
• Pass the data to the next layer (the queue)
• Scalable vertically (increase req. rate)
• Scalable horizontally (support fast data)
• Extendable (capture new sources and types of data)
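A "thin" collection layer as described above does almost nothing itself: it validates the incoming event just enough to accept or reject it, then hands it straight to the next tier. A minimal sketch, with `queue.Queue` standing in for a real broker and the event shape invented for illustration.

```python
import queue

events = queue.Queue()  # stands in for the queueing tier

def collect(event):
    """Accept an event and pass it on; no real processing happens here."""
    if "type" not in event:
        return False  # reject malformed input cheaply at the edge
    events.put(event)  # hand off to the queue immediately
    return True

collect({"type": "click", "user": "alice"})
collect({"user": "no-type"})  # rejected, never reaches the queue
print(events.qsize())  # 1
```

Because each call does constant, stateless work, many identical copies of `collect` can run behind a load balancer, which is what makes the layer horizontally scalable and easy to extend with new event types.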
DMPs and Data Collection
What is a data management platform?
In simple terms, a data management platform is a data warehouse. It’s a piece of software that sucks up, sorts and houses information, and spits it out in a way that’s useful for marketers, publishers and other businesses.
Data Collection at DD
EVENTS: orders, sales, clicks
Sensor Data
Databases?
SIGNALS – CAPTURING USER BEHAVIOR
“Big Data, the future of logistics”. Luxembourg-Poland Business Club, KPMG
QUEUES
[Diagram: several publishers (P) and subscribers (S) exchanging messages through a shared queue]
PUB/SUB
“… put them (messages) on a software bus where all processes can see them”
- Gartner
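The quote can be made concrete with a minimal in-process pub/sub bus: publishers put messages on the bus and every subscribed process sees them. A toy sketch; real buses add topics, persistence, and delivery guarantees.

```python
class Bus:
    """A software bus: every subscriber sees every published message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        for handler in self.subscribers:
            handler(message)  # fan out to all subscribers

bus = Bus()
seen_a, seen_b = [], []
bus.subscribe(seen_a.append)
bus.subscribe(seen_b.append)
bus.publish("order-created")
print(seen_a, seen_b)  # ['order-created'] ['order-created']
```

The key property, which the commit log below strengthens, is that the publisher never knows who is listening, so consumers can be added without touching producers.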
A COMMIT LOG
[Diagram: an append-only sequence of records 0–9; the 1st record at the head, the next record written at the tail, ordered by time]
A log is perhaps the simplest possible storage abstraction. It is an append-only, totally ordered sequence of records, ordered by time.
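That abstraction fits in a few lines of code: records get consecutive offsets, are never modified, and readers consume from any offset onward. A minimal sketch of the idea, not any particular system's API.

```python
class CommitLog:
    """An append-only, totally ordered sequence of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)
        self._records.append(record)  # records are only ever added at the end
        return offset  # position of the new record

    def read(self, from_offset=0):
        """Readers pick an offset and consume everything after it."""
        return self._records[from_offset:]

log = CommitLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)
print(log.read(1))  # ['paid', 'shipped']
```

Because each reader only tracks its own offset, many independent consumers can read the same log at different speeds without coordinating.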
A DISTRIBUTED COMMIT LOG
[Diagram: partitions 0, 1, and 2, each an ordered sequence of records (0, 1, 2, …, 12); writes are appended at the new end of each partition, with old records on the left and new records on the right]
The log is partitioned and replicated across multiple nodes.
Scalable; retention time (duration) is configurable.
http://www.confluent.io/blog/stream-data-platform-1/
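One common way such a log decides which partition a record lands in (assumed here for illustration, not taken from the slide) is to hash the record's key modulo the partition count, so all records for one key stay ordered within a single partition.

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # crc32 is a stable hash, unlike Python's per-process randomized hash()
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    p = partition_for(key)
    partitions[p].append((key, value))  # append at the "new" end
    return p

for i in range(6):
    produce("user-%d" % (i % 2), i)

# total ordering holds per partition, not across partitions
print([len(p) for p in partitions])
```

This is why partitioned logs scale writes: partitions can live on different nodes, yet per-key ordering, which most consumers care about, is preserved.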
SPARK STREAMING
1. Receive streaming data from data sources
2. Process the data
3. Output the results downstream
Architecture of Spark Streaming: Discretized Streams
Receiver → batches (RDDs)
Records are processed in batches; each batch is an RDD, a partitioned dataset.
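The discretized-stream idea can be imitated without Spark: buffer incoming records, then cut the buffer into batches and process each batch as one unit. A toy sketch; real Spark Streaming cuts batches by a time interval, which is approximated here with a fixed chunk size.

```python
def micro_batches(records, batch_size):
    """Stand-in for interval-based batching: fixed-size chunks."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def process(batch):
    # each batch is processed as a whole, like one RDD
    return sum(batch)  # e.g. revenue per interval

stream = [5, 1, 3, 2, 4, 6, 9]
results = [process(b) for b in micro_batches(stream, 3)]
print(results)  # [9, 12, 9]
```

The trade-off the slide implies: batching gives throughput and reuses the batch engine, at the cost of latency no lower than one batch interval.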
Event-driven Applications
Stream processing
Real-time event data
Streaming data platforms
Event data can be thought of as event streams.
Evolution of Traditional ETL
Production DB → Standby DB: periodic full backups
Production DB → Standby DB: frequent diffs
Production DB → Standby DB: even more frequent diffing
What we are left with is a continuous sequence of single-row changes.
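The endpoint of that evolution, often called change data capture, can be sketched in a few lines: every production write also emits a change record, and replaying the change stream keeps the standby in sync. All names and the change format are illustrative.

```python
production, standby, changelog = {}, {}, []

def write(key, value):
    """Every production write also emits a single-row change record."""
    production[key] = value
    changelog.append(("upsert", key, value))

def replay(changes, target):
    """Applying the change stream in order reproduces the source state."""
    for op, key, value in changes:
        if op == "upsert":
            target[key] = value

write("user:1", {"name": "Ada"})
write("user:2", {"name": "Bob"})
write("user:1", {"name": "Ada L."})  # an update is just another change
replay(changelog, standby)
print(standby == production)  # True
```

Unlike a diff, the changelog needs no comparison step: it is produced as a side effect of normal writes, so the standby can follow with near-zero lag.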
100s of users → 1,000s of users → 10,000s of users → 100,000s of users
Scalable micro-services
[Diagram: the number of micro-service instances grows with the user base]
WORKSHOP
• Split into groups of 3-4
• Discuss the question:
• “How can Big Data, pipelining, streaming, and real-time computing apply to my current or future
assignments?”
• One lucky group will be randomly selected to present :)
BREAK
THE EXPLORATORY PHASE
EXPLORATORY PHASE
Unavoidable: understand the data you are working with.
Computationally expensive: lots of retries; the model chosen down the line will involve trial and error.
PIPELINING (UNIFIED DATA SILOS)
Productizing Data Science
CODING: finding data, parsing structures, cleaning, reducing, learning, predicting
MODELING: connect to prod data, tuning training parameters, create prediction service, generate deployable model
DEPLOYING: connect to prod infrastructure, integration with existing environment, allocate/schedule resources, ensure availability
Extended Pipeline
COLLECT → CODE → MODEL → CODE → DEPLOY → CREATE SERVICES → INTEGRATE APPLICATION
COLLECTION TIER → QUEUEING TIER → IN-MEMORY TIER → COMPUTING TIER → RESOURCE MANAGER TIER → SERVICE TIER
CREATING SERVICES
• Abstracts access to prepared views
• Exposes prediction capabilities
• Highly horizontally scalable
• Scaling a micro-services cluster → cheaper than scaling a computing cluster
“Extra” Coding phase
EXPLORATORY PHASE – NEED FOR SPEED
We can't afford losing time due to an inefficient toolset.
Interactivity and reactivity to find the optimal result and move forward
NOTEBOOK
REPL evolution
DASHBOARD
SPARK NOTEBOOK
http://spark-notebook.io/
Spark + Scala
Exploration of Big Data
NOTEBOOK DEMO
DASHBOARD
http://redash.io/
Connect to any DB
(Custom) HDFS integration via Drill
Interactive
SQL-like querying
DASHBOARD DEMO
Exposing Views on data
The data science pipeline now has to include a way for results to be consumed by third parties (service-oriented architecture).
What are the results?
Intermediate results and the model need to be exposed
Having services for views (APIs) allows us to abstract the way they are created.
APIs Expose a stream
Expose intermediate results
Expose models
APIs
STREAM
STREAM: events → increment counters (current events/sec)
API exposes:
• Total number of events
• Average events/sec
• # occurrences of a specific event
• Event details
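The diagram splits cleanly into two sides that can be sketched in code: the stream side increments counters per event, and the API side reads aggregates out of them. Function and field names here are illustrative only.

```python
from collections import Counter

counts = Counter()
total_seconds = 0

def ingest(events_in_second):
    """Stream side: called once per second with that second's events."""
    global total_seconds
    total_seconds += 1
    counts.update(e["type"] for e in events_in_second)

def api_total_events():
    return sum(counts.values())

def api_avg_events_per_sec():
    return api_total_events() / total_seconds

def api_occurrences(event_type):
    return counts[event_type]

ingest([{"type": "click"}, {"type": "buy"}])
ingest([{"type": "click"}])
print(api_total_events(), api_avg_events_per_sec(), api_occurrences("click"))
# 3 1.5 2
```

The point of the split is that the API never touches raw events: it serves cheap reads over pre-aggregated counters, so it scales independently of the stream's volume.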
EVENT-DRIVEN APPLICATIONS
Monitor and respond in real time.