Migrating from Oracle to Espresso
David Max, Senior Software Engineer
About LinkedIn New York Engineering
• Located in Empire State Building
• Approximately 100 engineers and 1000 employees total
• Multiple teams, front end, back end, and data science
About Me
• Software Engineer at LinkedIn NYC since 2015
• Content Ingestion team
• Office Hours – Thursday 11:30-12:00
LinkedIn: www.linkedin.com/in/davidpmax/
What is Content Ingestion?
Content Ingestion
Babylonia
url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqpoaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB\\u0026rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg
What is Content Ingestion?
• Extracts metadata from web pages
• Source of Truth for 3rd party content
• Also contains metadata for some public 1st party content
• Used by LinkedIn services for sharing, decorating, and embedding content
• Data also feeds into content understanding and relevance models
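As a rough illustration of the first bullet, the sketch below pulls url, title, and image values out of a page's Open Graph meta tags using only Python's standard library. The deck does not describe how Babylonia actually extracts metadata, so the choice of Open Graph tags, the `MetaExtractor` class, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <meta property="og:..."> values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.metadata.setdefault(prop[len("og:"):], content)


def extract_metadata(html):
    """Return a dict of Open Graph properties found in the HTML text."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.metadata


# Made-up page used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<meta property="og:url" content="https://example.com/talk">
<meta property="og:title" content="An example talk">
<meta property="og:image" content="https://example.com/talk.jpg">
</head><body></body></html>
"""

metadata = extract_metadata(SAMPLE_HTML)
```

The resulting dict has the same three keys (url, title, image) as the example record shown earlier in the deck.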
Babylonia Datasets
[Diagram: Content Ingestion (Babylonia); Database; ETL to HDFS; Data Change Events]
Downstream and Upstream Datasets
[Diagram: Content Ingestion (Babylonia) and its Database; Data Change Events feed near-line consumers; ETL to HDFS feeds offline consumers]
Babylonia use of Oracle (before migration)
• Schema – Metadata extracted from each URL is stored in individual rows
• Client – Babylonia is the main (but not the only) client that executes queries directly on the Oracle DB
• Rest.li – Most online interaction with the dataset in Oracle goes through Babylonia’s Rest.li API
• RDBMS – Relational Database Management System
• Databus – Platform for streaming data change events to near line consumers
• Offline – ETL to HDFS for offline consumers
What is Espresso?
Espresso is LinkedIn’s strategic distributed, fault-tolerant NoSQL database that powers many of LinkedIn’s services
• ~100 clusters in use*
• ~420TB of SoT data*
• ~2 million qps at peak load*
* as of August 1, 2017
What is Espresso?
• Document – A table is a container for documents of the same schema (defined in Avro)
• Keys – Documents are indexed by key fields, which are defined in the table schema
• NoSQL – Non relational
• Distributed – A single database can be distributed over a cluster of machines
• Scalable – Able to scale clusters horizontally by adding more nodes
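To make the document and key bullets concrete, here is a minimal in-memory sketch of a table that holds documents of one schema, indexed by key fields, plus a hash function that assigns a key to a partition. This is not Espresso's API: `DocumentTable`, `partition_for`, the field names, and the CRC32-based partitioning are assumptions chosen only to illustrate the ideas.

```python
import zlib


class DocumentTable:
    """In-memory stand-in for a table of same-schema documents (not Espresso's API)."""

    def __init__(self, key_fields, value_fields):
        self.key_fields = tuple(key_fields)
        self.allowed = set(key_fields) | set(value_fields)
        self._docs = {}

    def _key(self, doc):
        # The key is the tuple of key-field values, in schema order.
        return tuple(doc[f] for f in self.key_fields)

    def put(self, doc):
        unknown = set(doc) - self.allowed
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        self._docs[self._key(doc)] = dict(doc)

    def get(self, *key):
        return self._docs.get(tuple(key))


def partition_for(key, num_partitions):
    """Map a key tuple to a partition number (assumed hashing scheme)."""
    joined = "\x1f".join(str(part) for part in key)
    return zlib.crc32(joined.encode("utf-8")) % num_partitions


table = DocumentTable(key_fields=["url"], value_fields=["title", "image"])
table.put({"url": "https://example.com/a", "title": "A"})
```

Because every document resolves to a partition through its key alone, partitions can be spread over more nodes without changing how callers look documents up.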
Why Migrate?
• Integration – Espresso support is integrated with other tools and systems at LinkedIn
• Rest.li – Espresso’s API is based on Rest.li, which makes it easier to treat Espresso endpoints like other LinkedIn Rest.li endpoints
• Schema Evolution – Supported with zero downtime and no coordination with DBA teams
• Maintenance – Babylonia’s Oracle tables required periodic jobs to be run that involved downtime for each server
• Cost – Oracle is more expensive to run
• Strategy – Espresso is the preferred platform at LinkedIn for data of this type
• Support – The Espresso team is part of LinkedIn
Data Formats (Oracle)
[Diagram: Oracle Database; ETL to HDFS (Offline); Oracle Databus Events (Near Line); Content Ingestion (Babylonia) Rest.li Endpoints serve Pegasus Data; Oracle Rows are converted to Pegasus Objects inside Babylonia, while Databus events and the ETL carry Oracle Rows]
• Complex transformation between Oracle format and Pegasus format
Pegasus and Avro
[Diagram: Pegasus Schema → Java Objects; Avro Schema → Java Objects]
• Both can be used to generate Java objects with very similar interfaces
• Pegasus schema can be used to auto-generate the Avro schema
• Pegasus and Avro schema definitions are very similar
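The similarity between the two schema formats can be sketched with a toy translation. The record below is a hypothetical, simplified Pegasus-style definition, and `to_avro` is an illustrative function, not the tooling LinkedIn uses. It handles only primitive fields and assumes an `optional` flag on the Pegasus side, which it maps to Avro's usual nullable-union idiom.

```python
import json

# Hypothetical, simplified Pegasus-style record definition.
PEGASUS_SCHEMA = {
    "type": "record",
    "name": "PageMetadata",
    "namespace": "com.example.ingestion",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "title", "type": "string", "optional": True},
    ],
}


def to_avro(pegasus_schema):
    """Translate the simplified record above into an Avro record schema."""
    avro_fields = []
    for field in pegasus_schema["fields"]:
        if field.get("optional"):
            # Avro has no "optional" flag; a union with null and a null
            # default is the common way to express an optional field.
            avro_fields.append(
                {"name": field["name"], "type": ["null", field["type"]], "default": None}
            )
        else:
            avro_fields.append({"name": field["name"], "type": field["type"]})
    return {
        "type": "record",
        "name": pegasus_schema["name"],
        "namespace": pegasus_schema["namespace"],
        "fields": avro_fields,
    }


avro_schema = to_avro(PEGASUS_SCHEMA)
avro_json = json.dumps(avro_schema, indent=2)
```

Only the optional field changes shape; the record name, namespace, and required fields carry over unchanged, which is the sense in which the two definitions are "very similar".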
Data Formats (Espresso)
[Diagram: Espresso Database; ETL to HDFS (Offline); Espresso Brooklin Events (Near Line); Content Ingestion (Babylonia) Rest.li Endpoints serve Pegasus Data; Espresso Avro is converted to Pegasus Objects inside Babylonia, while Brooklin events and the ETL carry Espresso Avro]
• Simple transformation between Avro format and Pegasus format
Why Migrate? Schema Evolution
Oracle
• ALTER TABLE
• Not tied to code deployment – need to coordinate with DBAs
• Schema change involves server downtime
• In practice, developers go to great lengths to avoid the hassle
• Schema accumulates tech debt
Espresso
• Document schema auto-registration
• Schema changes are registered automatically as part of the Babylonia deployment process
• Backwards compatibility is enforced – existing data does not need to be transformed
• Avro schema is a more natural fit with Rest.li Pegasus schema
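One reason existing data does not need to be transformed is that a new Avro field with a default value can be filled in when older documents are read. The sketch below checks only that single rule. The deck does not list the checks Espresso actually enforces, so the function names and sample schemas are assumptions, and a real compatibility check would also cover type changes and removed fields.

```python
def added_fields_without_default(old_schema, new_schema):
    """Names of fields that are new in new_schema and have no default."""
    old_names = {f["name"] for f in old_schema["fields"]}
    return [
        f["name"]
        for f in new_schema["fields"]
        if f["name"] not in old_names and "default" not in f
    ]


def can_read_old_data(old_schema, new_schema):
    """True if every added field has a default (simplified rule only)."""
    return not added_fields_without_default(old_schema, new_schema)


# Hypothetical schema versions for illustration.
V1 = {"type": "record", "name": "PageMetadata",
      "fields": [{"name": "url", "type": "string"}]}

V2_OK = {"type": "record", "name": "PageMetadata",
         "fields": [{"name": "url", "type": "string"},
                    {"name": "title", "type": ["null", "string"], "default": None}]}

V2_BAD = {"type": "record", "name": "PageMetadata",
          "fields": [{"name": "url", "type": "string"},
                     {"name": "title", "type": "string"}]}

ok = can_read_old_data(V1, V2_OK)
bad_fields = added_fields_without_default(V1, V2_BAD)
```

A check of this kind could run at registration time, so an incompatible schema is rejected during deployment instead of failing on reads later.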
Goals for Migration Process
• Zero downtime
• Transparent to Rest.li clients
• Give offline and nearline consumers time to migrate
• Validate each step
• Mirroring in real time
Pre-Migration State of Babylonia
[Diagram: Content Ingestion (Babylonia) with the Oracle Database; Oracle Databus Events feed near-line consumers; ETL to HDFS feeds offline consumers]
[Diagram: Other Services make Rest.li calls to Babylonia's Rest.li Endpoints, which sit in front of the Oracle Database and Oracle Databus Events]
Pre-Migration Cleanup
[Diagram: Other Services make Rest.li calls to the Rest.li Endpoints; Oracle Database; Oracle Databus Events]
• Identify code that is tightly coupled to the database
• Decide which code should be reimplemented for Espresso, and which code should be decoupled or eliminated
• Reduce number of code paths to migrate
The easiest lines of code to migrate are the lines of code that don’t exist
Bootstrap Espresso Database
[Diagram: Oracle Database → ETL → HDFS → Offline Convert Job → Avro Data File → Espresso Bulk Loader → Espresso Database]
Bootstrap Espresso Database
[Diagram: Oracle Database; ETL to HDFS; Espresso Database; Shadow Read Validation]
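The core of an offline convert job is a row-to-record mapping. The sketch below shows that step only, under stated assumptions: the Oracle column names, the `COLUMN_TO_FIELD` mapping, and the rule that a row without a URL is rejected are all hypothetical. Writing the records to an Avro data file would need an Avro library and is not shown.

```python
# Hypothetical mapping from Oracle column names to Avro field names.
COLUMN_TO_FIELD = {"URL": "url", "TITLE": "title", "IMAGE_URL": "image"}


def convert_row(row):
    """Convert one Oracle row (a dict) into an Avro-shaped record (a dict)."""
    # Missing or NULL columns become None, i.e. Avro null for optional fields.
    record = {field: row.get(column) for column, field in COLUMN_TO_FIELD.items()}
    if not record["url"]:
        raise ValueError("row has no URL, so it cannot be keyed")
    return record


def convert_rows(rows):
    """Convert all rows, keeping rejected rows aside for inspection."""
    records, rejected = [], []
    for row in rows:
        try:
            records.append(convert_row(row))
        except ValueError:
            rejected.append(row)
    return records, rejected


# Made-up rows used only to exercise the sketch.
rows = [
    {"URL": "https://example.com/a", "TITLE": "A", "IMAGE_URL": None},
    {"URL": None, "TITLE": "orphan"},
]
records, rejected = convert_rows(rows)
```

Keeping rejected rows instead of dropping them silently fits the "validate each step" goal, since the counts can be compared against the source table.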
Databus Listener, Shadow Read Validation
[Diagram: Oracle Database → Oracle Databus Events → Databus Listener → Espresso Database]
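Shadow read validation can be sketched as a wrapper that serves every read from the current source of truth while also reading the new store and recording whether the two agree. The deck does not spell out the mechanism, so `ShadowReader`, its counters, and the decision to always return the Oracle result are assumptions for illustration.

```python
import logging

logger = logging.getLogger("shadow_read")


class ShadowReader:
    """Serves reads from the primary store and compares against a shadow store."""

    def __init__(self, primary_get, shadow_get):
        self.primary_get = primary_get  # e.g. a lookup against Oracle
        self.shadow_get = shadow_get    # e.g. a lookup against Espresso
        self.matches = 0
        self.mismatches = 0
        self.errors = 0

    def get(self, key):
        primary = self.primary_get(key)
        try:
            shadow = self.shadow_get(key)
        except Exception:
            # A failing shadow read must never affect the caller.
            self.errors += 1
            logger.exception("shadow read failed for key %r", key)
            return primary
        if shadow == primary:
            self.matches += 1
        else:
            self.mismatches += 1
            logger.warning("shadow mismatch for key %r", key)
        return primary


# Dicts stand in for the two databases in this sketch.
oracle = {"a": {"title": "A"}, "b": {"title": "B"}}
espresso = {"a": {"title": "A"}, "b": {"title": "stale"}}
reader = ShadowReader(oracle.get, espresso.get)
result_a = reader.get("a")
result_b = reader.get("b")
```

Callers see only the primary result, so the comparison stays transparent to Rest.li clients while the mismatch count shows how closely the mirrored data tracks Oracle.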
Direct Writes to Espresso
[Diagram: Oracle Database → Oracle Databus Events → Databus Listener → Espresso Database; Babylonia also performs a Direct Write to the Espresso Database; Shadow Read Validation]
Resolving Write Conflicts
[Diagram: Oracle Databus Events → Databus Listener → Espresso Database; Direct Write → Espresso Database]
• Dual Write Conflict – Databus Listener and Babylonia updating same record
• Migration Control – optional field added to the schema indicating which process wrote the record: Bulk Loader, Databus Listener, or Babylonia
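One way such a control field could be used to resolve the conflict is a simple precedence rule: a writer may replace a record only if the record was last written by a writer of equal or lower precedence. The deck names the three writers but does not state the rule, so the precedence order, the `migrationControl` field name, and `apply_write` below are assumptions, not the team's documented approach.

```python
from enum import IntEnum


class Writer(IntEnum):
    """Assumed precedence: a higher value wins a conflict."""
    BULK_LOADER = 1
    DATABUS_LISTENER = 2
    BABYLONIA = 3


def apply_write(store, key, value, writer):
    """Write unless the existing record came from a higher-precedence writer."""
    existing = store.get(key)
    if existing is not None and existing["migrationControl"] > writer:
        return False  # skip: would overwrite a more authoritative write
    store[key] = {"value": value, "migrationControl": writer}
    return True


# A dict stands in for the Espresso table in this sketch.
store = {}
loaded = apply_write(store, "url-1", "from bulk load", Writer.BULK_LOADER)
direct = apply_write(store, "url-1", "from Babylonia", Writer.BABYLONIA)
late_event = apply_write(store, "url-1", "from Databus", Writer.DATABUS_LISTENER)
```

Under this rule a delayed Databus event cannot undo a direct write from Babylonia, while the listener can still refresh records that only the Bulk Loader has touched.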
Espresso New SoT
[Diagram: Direct Read/Write against the Espresso Database, which emits Espresso Brooklin Events; Dual Writes to the Oracle Database; Oracle Database and Oracle Databus Events marked Deprecated]
Oracle Turnoff
[Diagram: Direct Read/Write against the Espresso Database; Espresso Brooklin Events]
Thank you