Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
Takumi Sakamoto 2016.01.27
Takumi Sakamoto @takus 😍 = ⚽ ✈ 📷
Mentioned by @jeffbarr
https://twitter.com/jeffbarr/status/649575575787454464
http://www.slideshare.net/smartnews/smart-newss-journey-into-microservices
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
Data Platform at SmartNews
What is SmartNews?
• News Discovery App
• Launched in 2012
• 15M+ downloads worldwide
https://www.smartnews.com/en/
Our Mission
Deliver the world's quality information to the people who need it
How?
Machine Learning
URLs found on the Internet (100,000+/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → Deliver Trending Stories (1,000+/day), with user feedback looped back into the pipeline
Data Platform Use Cases
• Product development
• track KPIs such as DAU and MAU
• A/B tests for new features, on-boarding, etc.
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking of news articles
• CTR prediction for the ads system
• dashboard service for media partners
Data & Its Numbers
• User activities
• ~100 GB per day (compressed)
• 60+ record types
• User demographics, configurations, etc.
• 15M+ records
• Articles metadata
• 100K+ records per day
Sustainable Data Platform?
Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running cost
• Be open to an uncertain future
Lambda Architecture
http://lambda-architecture.net/
Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid wasting too much time
• Empower brilliant engineers in SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness
System Design
λ Architecture at SmartNews
Input → Batch Layer → Serving Layer → Output, with a Speed Layer processing the same input in parallel
Design Principles
• Decouple the "computation" and "storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data loss with minimum effort
• Use the right tool for the job
• leverage AWS managed services where possible
• fill in the missing pieces with Presto & PipelineDB
An Example

Batch Layer: run multiple EMR clusters, one per use case, all reading the same data from Amazon S3:
• General users: "We're satisfied with the current version" (EMR AMI 3.x)
• Application engineer: "I wanna upgrade Hive" (EMR AMI 4.x)
• Ad engineer: "I wanna combine news data with ad data" (EMR Hive)
• Data scientist: "I wanna test my algorithm with the latest Spark" (EMR Spark)

Speed Layer: consume the same Kinesis Stream for each use case:
• Data scientist: "I wanna consume streaming data by Spark" (Spark on EMR)
• Application engineer: "I wanna add a streaming monitor by Lambda" (AWS Lambda)

• AWS managed services
• Replicated data into multiple AZs
• High availability
Input Data
Collect Events by Fluentd
• Forwarder (running on each instance)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• input events into Kinesis & PipelineDB
• other reporting tasks (not mentioned today)
Forward to S3

td-agent.conf:

```
<source>
  @type tail
  format json
  path /data/log/user_activity.log
  pos_file /data/log/pos/user_activity.pos
  tag smartnews.user_activity
  time_key timestamp
</source>

<match smartnews.user_activity>
  @type copy
  <store>
    @type relabel
    @label @s3
  </store>
  <store>
    @type forward
    @label @forward
  </store>
</match>

@include conf.d/s3.conf
@include conf.d/forward.conf
```

conf.d/s3.conf (ERB template):

```
<label @s3>
<% node[:td_agent][:s3].each do |c| -%>
  <match <%= c[:tag] %>>
    @id s3.<%= c[:tag] %>
    @type s3
    ...
    path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
    time_slice_format dt=%Y-%m-%d/hh=%H
    time_key timestamp
    include_time_key
    time_as_epoch
    reduced_redundancy true
    format json
    utc
    buffer_chunk_limit 2048m
  </match>
<% end -%>
</label>
```
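The `time_slice_format dt=%Y-%m-%d/hh=%H` setting lays each chunk out under hourly Hive-style partitions. A minimal sketch of the resulting key layout (the function and example values are illustrative, not from the deck):

```python
from datetime import datetime, timezone

def s3_prefix(env, role, tag, event_time):
    """Build the hourly-partitioned S3 prefix produced by the s3 output above."""
    time_slice = event_time.strftime("dt=%Y-%m-%d/hh=%H")
    return f"fluentd/{env}/{role}/{tag}/{time_slice}"

prefix = s3_prefix("prd", "frontend", "smartnews.user_activity",
                   datetime(2016, 1, 27, 13, 45, tzinfo=timezone.utc))
print(prefix)  # fluentd/prd/frontend/smartnews.user_activity/dt=2016-01-27/hh=13
```

Because the layout matches Hive's `dt=.../hh=...` partition convention, the batch layer can register each hour with a single `ADD PARTITION` statement.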
Capture DynamoDB Streams
```
<source>
  type dynamodb_streams
  stream_arn YOUR_DDB_STREAMS_ARN
  pos_file /path/to/table.pos
  fetch_interval 1
  fetch_size 100
</source>
```
https://github.com/takus/fluent-plugin-dynamodb-streams
Flow: DynamoDB → DynamoDB Streams → Fluentd / AWS Lambda → Amazon S3
Recommended Practices
• Make configuration as simple as possible
• fluentd can cover everything, but shouldn't
• keep stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration
Monitor Fluentd Status
• Monitor traffic volume & retry count by Datadog
• Datadog's fluentd integration
• fluent-plugin-flowcounter
• fluent-plugin-dogstatsd
Archive to Amazon S3
• I have 2 recommended settings
• versioning
• enables recovery from human error
• lifecycle policy
• minimizes storage cost
Archive to IA or Glacier xx days after the creation date
Keep previous versions for xx days: it will save you in the future!!
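The two settings above can be expressed as the payloads the S3 API expects. A hedged sketch (the prefix and day counts are placeholders, not the deck's actual values; these dicts would be passed to boto3's `put_bucket_lifecycle_configuration` and `put_bucket_versioning`):

```python
def archive_lifecycle(ia_days=30, glacier_days=90, keep_old_versions_days=30):
    """Lifecycle rule: tier data down over time, expire old object versions."""
    return {
        "Rules": [{
            "ID": "archive-and-clean",
            "Status": "Enabled",
            "Filter": {"Prefix": "fluentd/"},
            "Transitions": [
                {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_days, "StorageClass": "GLACIER"},
            ],
            # versioning keeps previous versions; expire them after a while
            "NoncurrentVersionExpiration": {"NoncurrentDays": keep_old_versions_days},
        }]
    }

versioning = {"Status": "Enabled"}  # recover from accidental overwrite or delete
```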
Batch Layer
Various ETL Tasks
• Extract
• dump MySQL records by Embulk
• make files on S3 readable to Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora
Hive
• The most popular project in the Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• converts SQL-like queries into MR jobs
• Haven't adopted the Tez engine yet
• Amazon EMR doesn't support it yet
• limited improvement for our queries
How to process JSON?
A. Transform into a columnar table periodically
• requires a conversion job
• better performance
B. Use JSON-SerDe for temporary analysis
• an easy way to query raw JSON text files
• requires a "drop table" to change the schema
• performance is not good
Transform Tables

```sql
-- Make S3 files readable by Hive
ALTER TABLE raw_activities
ADD IF NOT EXISTS PARTITION (dt='${DATE}', hh='${HOUR}');

-- Transform text files into columnar files (flatten JSON)
INSERT OVERWRITE TABLE activities PARTITION (dt='${DATE}', action)
SELECT user_id, timestamp, os, country, data, action
FROM raw_activities
LATERAL VIEW json_tuple(
  raw_activities.json,
  'userId','timestamp','platform','country','action','data'
) a AS user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
```
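The `LATERAL VIEW json_tuple` transform pulls named fields out of each raw JSON record. The same flattening, sketched in plain Python for a single record (the field list mirrors the query; the record values are made up):

```python
import json

FIELDS = ["userId", "timestamp", "platform", "country", "action", "data"]

def flatten(raw_line):
    """Extract the columnar fields from one raw JSON activity record."""
    record = json.loads(raw_line)
    return tuple(record.get(field) for field in FIELDS)

row = flatten('{"userId": 42, "timestamp": 1453852800, "platform": "ios", '
              '"country": "jp", "action": "viewArticle", "data": null}')
print(row)  # (42, 1453852800, 'ios', 'jp', 'viewArticle', None)
```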
JSON-SerDe
```sql
-- Define a table with the SerDe
CREATE TABLE json_table (
  country string,
  languages array<string>,
  religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

-- Result: 10
SELECT religions['catholic'][0] FROM json_table;
```
cf. hive-ruby-scripting
```sql
-- Define your Ruby (JRuby) script
SET rb.script=
  require 'json'

  def parse (json)
    j = JSON.load(json)
    j['profile']['attribute1']
  end
;

-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
```
https://github.com/gree/hive-ruby-scripting
Spark
http://www.slideshare.net/smartnews/aws-meetupapache-spark-on-emr
Self-Serve via AWS CLI
```sh
# Create an EMR cluster that runs Hive, Spark & Ganglia
aws emr create-cluster \
  --name "My Cluster" \
  --release-label emr-4.2.0 \
  --applications Name=Hive Name=Spark Name=GANGLIA \
  --ec2-attributes KeyName=myKey \
  --instance-type c3.4xlarge \
  --instance-count 4 \
  --use-default-roles
```
Minimize expenses
• Use Spot Instances where possible
• typically a 50-90% discount
• select instance types with stable prices
• the C3 family spikes often :(
• Dynamic cluster resizing
• x2 capacity during daily batch job
• 1/2 capacity during midnight
Handle Data Dependencies
Typical Anti-Pattern
```
 5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
 0 * * * * app hive -f query_4.hql
 1 * * * * app hive -f query_5.hql
```
Workflow Management
• Define dependencies
• task E is executed after finishing task C and task D
• Scheduling
• task A is kicked after 09:00 AM
• throttle concurrent running of the same task
• Monitoring
• notification on failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
Airflow
• A workflow management system
• define workflows in Python
• built-in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/
Define Tasks
```python
dag = DAG('tutorial', default_args=default_args)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

t3 = BashOperator(
    task_id='templated',
    bash_command="""
    {% for i in range(5) %}
      echo "{{ ds }}"
      echo "{{ macros.ds_add(ds, 7) }}"
      echo "{{ params.my_param }}"
    {% endfor %}
    """,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)
```
Workflow as Code
Deploy code automatically after merging into master
Visualize Dependencies
What is done or not?
Alerting to Slack
• SLA violation
• task A should be done by 00:00 PM
• another team's task K depends on task A
• Output validation failure
• stop the following tasks if the output is doubtful
Retry from Web UI
Once the task histories are cleared, the Airflow scheduler backfills them
Retry from CLI
```sh
# Clear some histories from 2016-01-01
airflow clear etl_smartnews \
  --task_regex user_ \
  --downstream \
  --start_date 2016-01-01

# Backfill uncompleted tasks
airflow backfill etl_smartnews \
  --start_date 2016-01-01
```
Check Rendered Query
How Long Does Each Task Take?
Pluggable Architecture
• Built-in plugins
• operators: bash, hive, presto, mysql
• transfers: hive_to_mysql
• sensors: wait_hive_partition, wait_s3_file
• Wrote our own plugin
• mysql_partition
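Under the hood, a sensor such as wait_s3_file is a poke-until-true loop with a timeout. A generic sketch of that pattern (this is not Airflow's actual plugin API; a real sensor subclasses `BaseSensorOperator`):

```python
import time

def wait_for(poke, timeout=60.0, interval=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `poke()` until it returns True, or raise once `timeout` elapses."""
    deadline = clock() + timeout
    while clock() < deadline:
        if poke():
            return True
        sleep(interval)
    raise TimeoutError("condition not met before timeout")
```

`clock` and `sleep` are injectable so the loop can be unit-tested without real waiting.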
Example

```python
# Wait for an S3 file creation
user_sensor = S3KeySensor(
    task_id='wait_user',
    bucket_name='smartnews',
    bucket_key='user/dt={{ ds }}/dump.csv',
)

# After the file is created, run the ETL query
etl = HiveOperator(
    task_id="task1",
    hql="INSERT OVERWRITE INTO...."
)
etl.set_upstream(user_sensor)

# After that, import into MySQL
# (renamed from `import`, which is a reserved word in Python;
#  `name` and `table` are defined elsewhere in the DAG file)
importer = HiveToMySqlTransfer(
    task_id=name,
    mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table,
    sql="SELECT country, count(*) FROM %s" % table,
    mysql_table=table
)
importer.set_upstream(etl)
```
Serving Layer
Provides batch views in low-latency and ad-hoc way
Presto
• A distributed SQL query engine
• joins multiple data sources (Hive + MySQL)
• supports standard ANSI SQL
• designed to handle TB- to PB-scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
Presto Architecture

1. The client sends a query in standard SQL to the Presto coordinator
2. The coordinator generates an execution plan
3. Tasks are dispatched to multiple Presto workers
4. Workers scan data concurrently from Amazon S3, Kinesis Stream (via presto-kinesis ※), Amazon RDS, and Amazon Aurora
5. Data is aggregated without disk I/O
6. The result is returned to the client

Amazon EMR (Hive Metastore) provides Hive table metadata (S3 access only).
※ https://github.com/qubole/presto-kinesis
Why Presto?
• Joins multiple data sources
• skip large parts of the ETL process
• can merge Hive/MySQL/Kinesis/PipelineDB
• Low latency
• ~30s to scan billions of records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling
Use case: A/B Test

```sql
-- Suppose that this table exists
DESC hive.default.user_activities;
-- user_id  bigint
-- action   varchar
-- abtest   array<map<varchar, bigint>>
-- url      varchar

-- Summarize page views per A/B test identifier
-- to compare two algorithms, v1 & v2
SELECT dt, t['behaviorId'], count(*) as pv
FROM hive.default.user_activities
CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%'
  AND action = 'viewArticle'
  AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId']
ORDER BY dt
;

-- 2015-12-01 | algorithm_v1 | 40000
-- 2015-12-01 | algorithm_v2 | 62000
```
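The `CROSS JOIN UNNEST` query explodes the abtest array and counts page views per behavior. The same aggregation, sketched over in-memory rows in Python (the row shapes and values are illustrative):

```python
from collections import Counter

def ab_page_views(rows, definition_id):
    """Count viewArticle events per (dt, behaviorId) for one A/B test."""
    pv = Counter()
    for row in rows:
        if row["action"] != "viewArticle":
            continue
        for t in row["abtest"]:              # CROSS JOIN UNNEST(abtest)
            if t.get("definitionId") == definition_id:
                pv[(row["dt"], t["behaviorId"])] += 1
    return pv

rows = [
    {"dt": "2015-12-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": "algorithm_v1"}]},
    {"dt": "2015-12-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": "algorithm_v2"}]},
    {"dt": "2015-12-01", "action": "viewArticle",
     "abtest": [{"definitionId": 163, "behaviorId": "algorithm_v1"}]},
    {"dt": "2015-12-01", "action": "click",   # filtered out
     "abtest": [{"definitionId": 163, "behaviorId": "algorithm_v1"}]},
]
pv = ab_page_views(rows, 163)
print(pv[("2015-12-01", "algorithm_v1")])  # 2
```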
Use case: Troubleshooting

```sql
-- Store access logs in S3, then query them:
-- summarize access count & 95th-percentile response time
SELECT
  from_unixtime(timestamp),
  count(*) as access,
  approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04'
  AND hh = '13'
  AND role = 'xxx'
GROUP BY timestamp
ORDER BY timestamp
;

-- 2015-11-04 22:00:00.000 | 6377 | 0.522
-- 2015-11-04 22:00:01.000 | 3580 | 0.422
```
Scheduled Auto Scaling
```sh
$ aws autoscaling describe-scheduled-actions
{
    "ScheduledUpdateGroupActions": [
        {
            "DesiredCapacity": 2,
            "AutoScalingGroupName": "presto-worker-prd",
            "Recurrence": "59 14 * * *",
            "ScheduledActionName": "scalein-2359-jst"
        },
        {
            "DesiredCapacity": 20,
            "AutoScalingGroupName": "presto-worker-prd",
            "Recurrence": "45 0 * * 1-5",
            "ScheduledActionName": "scaleout-0945-jst"
        }
    ]
}
```
Presto Covers Everything? No!
• Fixed system on Amazon Aurora (or another RDB)
• provides KPIs for products & business
• requires high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all datasets on the data platform
• requires high scalability
• has flexibility (joins various data sources)
Why Fixed vs Ad-hoc?
• Difficulties with an ad-hoc-only solution
• difficult to prevent heavy queries
• large distinct counts exhaust computing resources
• decreases Presto maintainability
Output Data
Chartio
• Dashboard as a Service
• helps businesses analyze and track their critical data
• one of the AWS partners (※)
• Combine multiple data sources in one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• can join BigQuery + MySQL internally
• Easy to use for everyone
• everyone can make their own dashboard
• write SQL directly / generate queries by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
Creating a dashboard
1. Build a query (drag & drop / SQL)
2. Add steps (filter, sort, modify)
3. Select a visualization (table, graph)
Examples
Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintained an in-house dashboard written in Rails
• everyone got tired of maintaining it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• a very important factor for a dashboard tool
Missing Pieces of Chartio
• No programmable API is provided
• dashboards / charts must be edited manually
• No rollback feature
• all changes are recorded, but you cannot roll back to the previous state
• workaround: clone => edit => rename
Speed Layer
Why Does Speed Matter?
Today’s News is Wrapping
Tomorrow’s Fish and Chips
↑ Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
How News Behaves?
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Use cases
• Re-rank news articles by user feedback
• track users' positive/negative signals
• consider gender, age, location, interests
• Realtime article monitoring
• detect high bounce rate (may be broken?)
• make realtime reporting dashboard for A/B test
Realtime Re-Ranking

Realtime feedback flows through API Gateway into a Kinesis Stream and DynamoDB; Amazon EMR combines it with article metadata, user interests, and user behaviors produced by the offline process (Hive / Spark over Amazon S3 and Amazon EMR), and re-ranked articles are served through Amazon CloudSearch (Search API).

ref. Integrating stream processing (Spark Streaming + Kinesis) with offline processing (Hive): www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
Realtime Monitoring

Records flow through API Gateway into a PipelineDB stream; continuous views are incrementally updated in realtime, and raw records are discarded soon after being consumed by the continuous views. Chartio and AWS Lambda (posting alerts to Slack) access the continuous views with a PostgreSQL client. (※1)

※1: uses cron as of 26 Feb. 2016; will migrate to Lambda soon after it supports VPC
PipelineDB
• An OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connects to Chartio 😍
• joins streams to normal PostgreSQL tables
• Supports probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Continuous View
```sql
-- Calculate unique users seen per media each day,
-- using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
  day(arrival_timestamp),
  substring(url from '.*://([^/]*)') as hostname,
  COUNT(DISTINCT user_id::integer)
FROM activity_stream
GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps
WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT percentile_cont(array[90, 95, 99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
```
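HyperLogLog keeps a distinct count in constant space by tracking, per hashed-prefix bucket, the longest run of leading zero bits seen. A compact Python sketch of the core idea (the register count, hash choice, and correction threshold follow the common formulation; this is illustrative, not PipelineDB's implementation):

```python
import hashlib
import math

P = 10                              # 2**10 = 1024 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard bias-correction constant

def _hash64(value):
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def hll_add(registers, value):
    h = _hash64(value)
    idx = h >> (64 - P)                      # top P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 54 bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def hll_count(registers):
    est = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if est <= 2.5 * M and zeros:             # small-range correction
        est = M * math.log(M / zeros)
    return est
```

With 1024 registers the standard error is about 1.04/sqrt(1024) ≈ 3%, which is why a continuous view can count tens of millions of distinct users in a few kilobytes.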
Summary
Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running cost
• be open to an uncertain future
My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL to DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region
Thank you!!