Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD...

15
Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21 st , 2019

Transcript of Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD...

Page 1: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Data Modeling @UberJoshua Shinavier, PhD

Presentation for the W3C Automotive Working Group

May 21st, 2019

Page 2: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

● 200k managed data sets● 10+ billion trips

○ Low thousands of new entities per second● Even more sensor data

○ Use cases for graph stream processing● On-demand, streaming, RPC

○ Each dataset, stream, and service has its own schema

Data and schemas @Uber

Page 3: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

• Data sources are not composable• Strong identifiers, weak semantics

• Duplicate types, homonyms, synonyms• Per-language data islands• Diversity of data modeling conventions

Schema integration: challenges

Page 4: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

● Controlled vocabularies for all of Uber○ Basic type aliases (URL, UUID, timestamps, etc.)○ Structured types (sensor events, currency values, etc.)○ Entities and relationships (User, Vehicle, Trip, etc.)○ Metadata vocabularies

● Shared logical data model● Basic domain vocabularies

○ E.g. time, geometry and geolocation, addresses and contact info, sensors, money, etc.

● Tooling carries schemas between data representation languages○ Protobuf, Thrift, Avro, RDF, PG, etc.○ Schema and data transformations are composable

Data Standardization

Page 5: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

● Need metadata for each dataset at Uber● Data protections and user trust

○ GDPR and other regulations, Uber’s own data policies○ What kind of user data? Where is it?○ Heroic numbers of manual annotations

■ Limited expressivity, limited guarantees■ Inference is required

● In annotating datasets, standardize and compose schemas

Metadata graph

Page 6: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

● Common data model for RPC, storage, and KR at Uber● In progress: alignment with the Property Graph Schema Working Group● In progress: “Universal structure” of TinkerPop4

Algebraic Property Graphs

Page 7: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Transformations

Page 8: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Logical

Page 9: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

YAML

Page 10: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Protocol Buffers

Page 11: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Apache Thrift

Page 12: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Apache Avro

Page 13: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Turtle (OWL)

Page 14: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

Docs

Page 15: Data Modeling @Uber - World Wide Web Consortium · Data Modeling @Uber Joshua Shinavier, PhD Presentation for the W3C Automotive Working Group May 21st, 2019 200k managed data sets

[email protected]

Thanks