Gobblin @ NerdWallet (Nov 2015)
Transcript of Gobblin @ NerdWallet (Nov 2015)
![Page 2: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/2.jpg)
Agenda
● Introduction to NerdWallet
● Gobblin @ NerdWallet Today
● Initial Pain Points & Learnings
● Contributions (Present and Future)
● Future Use Cases & Requests
![Page 3: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/3.jpg)
What Is NerdWallet?
● Started in 2009. 275+ employees
● Highly profitable. Series A funding Feb 2015.
● We want to bring clarity to life’s financial decisions.
![Page 4: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/4.jpg)
NerdWallet Tech Stack
● Front-End
● Services Tier
● Data Analytics
● Data Systems & Platforms
![Page 5: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/5.jpg)
Data Types @ NerdWallet
● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
  ○ Synced to Redshift periodically
● Consumer Identity Data (Postgres: medium reads, medium writes)
● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes)
● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
![Page 6: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/6.jpg)
Gobblin @ NW Today
● Running in standalone mode
● Ingests user tracking and operational log data
● Tracking Data:
  ○ ~10 Kafka topics, 1 per event & schema type
  ○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
  ○ Events are already serialized as protobuf in each Kafka topic
  ○ Around 100 events/second
● Log Ingestion (Operational Data):
  ○ Extracts data from AWS logs sitting in S3
  ○ Parses log lines and serializes them to protobuf
  ○ Writes the serialized protobuf files back to S3 and eventually into Redshift
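
As a rough illustration of the "date-partitioned directory" idea for the hourly tracking jobs, here is a minimal Python sketch. The bucket name, the topic name, and the `root/topic/YYYY/MM/DD/HH` layout are all assumptions made for the example; the slides do not specify the actual directory layout or partition granularity.

```python
from datetime import datetime, timezone


def partitioned_prefix(root: str, topic: str, pulled_at: datetime) -> str:
    """Return the directory prefix for one hourly pull of one topic.

    The layout root/topic/YYYY/MM/DD/HH is a hypothetical convention.
    """
    # Normalize to UTC so every job run agrees on partition boundaries.
    ts = pulled_at.astimezone(timezone.utc)
    return f"{root}/{topic}/{ts:%Y/%m/%d/%H}"


# Hypothetical bucket and topic names, for illustration only.
prefix = partitioned_prefix(
    "s3://example-bucket/tracking",
    "page_view",
    datetime(2015, 11, 5, 14, 30, tzinfo=timezone.utc),
)
print(prefix)
```

Keeping the topic above the date components means each event type can be listed or loaded on its own; a date-first layout would be an equally plausible choice.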
![Page 7: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/7.jpg)
Tracking Pipeline
![Page 8: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/8.jpg)
Learnings: Deploying Gobblin w/Internal Code
● Have a repo of internal Gobblin modules (this is where we compile everything)
● Modified the build script to link the Gobblin project to our gobblin-modules project
● Use Jenkins to compile Gobblin on the remote machine
● Maintain a separate repository with .pull files that we can sync with our stage and production environments
![Page 9: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/9.jpg)
Current Contributions
● Simple Data Writer
  ○ class gobblin.writer.SimpleDataWriter
  ○ Writes each binary record as bytes with no regard to encoding
  ○ Optionally prepends each record with its size, or appends a char delimiter to each record (e.g. \n for string data)
● Kafka Simple Extractor
  ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
  ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
  ○ Extracts binary data from Kafka as an array of bytes without any serde
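
The two framing options described for the Simple Data Writer (a size prefix or a trailing delimiter) can be sketched in a few lines of Python. This is not Gobblin's code: the function names are hypothetical, and the 8-byte big-endian size prefix is an assumption chosen for the example rather than something stated in the slides.

```python
import struct
from typing import Iterable, Iterator, Optional

# Assumption: the size prefix is an 8-byte big-endian signed integer.
SIZE_PREFIX = struct.Struct(">q")


def frame_records(records: Iterable[bytes],
                  prepend_size: bool = False,
                  delimiter: Optional[bytes] = None) -> bytes:
    """Concatenate opaque byte records, optionally size-prefixed and/or delimited."""
    out = bytearray()
    for record in records:
        if prepend_size:
            out += SIZE_PREFIX.pack(len(record))
        out += record  # the record bytes are written untouched
        if delimiter is not None:
            out += delimiter
    return bytes(out)


def read_size_prefixed(blob: bytes) -> Iterator[bytes]:
    """Split a blob written with prepend_size=True (and no delimiter) back into records."""
    offset = 0
    while offset < len(blob):
        (size,) = SIZE_PREFIX.unpack_from(blob, offset)
        offset += SIZE_PREFIX.size
        yield blob[offset:offset + size]
        offset += size


framed = frame_records([b"\x00\x01", b"xyz"], prepend_size=True)
print(list(read_size_prefixed(framed)))
```

A size prefix suits binary payloads such as serialized protobuf, where any delimiter byte could also appear inside a record; a newline delimiter is only safe for data that cannot contain it.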
![Page 10: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/10.jpg)
Future Contributions
● Gobblin Dashboards
● S3 Source & Extractor
  ○ Given an S3 bucket, extract all files matching a regex
    ■ Leverages FileBasedExtractor
    ■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
● S3 Publisher
  ○ Publishes files to S3
  ○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone this is not an issue for us
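
The core of the proposed S3 source, selecting the objects in a bucket whose keys match a regex, can be sketched independently of any S3 client. In this Python sketch the key names and the pattern are made up, and the listing is passed in as a plain iterable; a real source would obtain it from the S3 API.

```python
import re
from typing import Iterable, List


def select_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Return, in sorted order, the keys that match the pattern in full."""
    regex = re.compile(pattern)
    # fullmatch avoids accidentally accepting keys that merely contain the pattern.
    return sorted(key for key in keys if regex.fullmatch(key))


# Hypothetical bucket listing, for illustration only.
listing = [
    "elb/2015/11/05/app.log",
    "elb/2015/11/04/app.log",
    "elb/2015/11/05/app.log.tmp",
    "nginx/2015/11/05/access.log",
]
print(select_keys(listing, r"elb/2015/11/\d{2}/.*\.log"))
```

Sorting the result gives a deterministic processing order, which makes reruns easier to compare.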
![Page 11: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/11.jpg)
Future: Dashboards
![Page 12: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/12.jpg)
Gobblin @ NW Tomorrow
● More data types
  ○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
  ○ Offer data from our site: MySQL => S3 (batch and incremental)
  ○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
  ○ Salesforce data
● Integration with Airflow DAGs
● Integration with data cleansing & entity matching frameworks
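
One common way to implement the "batch and incremental" pulls mentioned above is a watermark: each run fetches only rows changed since the highest timestamp seen by the previous run. The Python sketch below uses an in-memory SQLite table as a stand-in for MySQL or Postgres; the `offers` table, its columns, and the `updated_at` watermark column are assumptions for illustration, not NerdWallet's schema.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str, str]


def pull_incremental(conn: sqlite3.Connection, last_watermark: str) -> Tuple[List[Row], str]:
    """Fetch rows updated after last_watermark and return them with the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM offers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO offers VALUES (?, ?, ?)",
    [
        (1, "card-a", "2015-11-01T00:00:00"),
        (2, "card-b", "2015-11-03T12:00:00"),
        (3, "card-c", "2015-11-05T08:30:00"),
    ],
)

# An empty watermark sorts before every timestamp, so the first run is a full batch pull.
batch_rows, watermark = pull_incremental(conn, "")
later_rows, watermark = pull_incremental(conn, watermark)
print(len(batch_rows), len(later_rows), watermark)
```

The strict `>` comparison can miss rows that share the watermark's exact timestamp but are committed later, so a production job would need a tie-breaker or a small overlap window.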
![Page 13: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/13.jpg)
Early Adoption Pain Points & Solutions
● Best practices for ingestion w/ transformation steps
● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
● Best practices around scheduler integration: Quartz (built-in) vs ETL schedulers
● Backwards-incompatible changes forced us to run migrations in order to upgrade versions
● No changelogs or tagged releases
![Page 14: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/14.jpg)
Things We Would Like to See/Add in the Future
● Abstract out Avro-specific code
● Best practices for scheduler integration (can contribute for Airflow)
● Clustering without requiring Hadoop & YARN
● Metadata support (job X produced files Y,Z)
● Release notes & tags :)
● The build & unit test process is very bloated
  ○ Hard to differentiate warnings/stack traces vs legitimate build issues
  ○ Opens ports, creates temporary dbs, etc., which makes it difficult to test on arbitrary machines (port collisions)
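
A general technique for avoiding port collisions in tests is to bind to port 0 and let the operating system pick a free port, instead of hard-coding one. The Python sketch below illustrates the idea only; it says nothing about how Gobblin's own test suite is structured, and the helper name is hypothetical.

```python
import socket
from typing import Tuple


def bind_ephemeral(host: str = "127.0.0.1") -> Tuple[socket.socket, int]:
    """Bind a listening TCP socket to an OS-assigned free port and return both."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to choose an unused port
    sock.listen(1)
    return sock, sock.getsockname()[1]


first, first_port = bind_ephemeral()
second, second_port = bind_ephemeral()
try:
    # Both sockets are open at the same time, so the OS must hand out distinct ports.
    print(first_port, second_port)
finally:
    first.close()
    second.close()
```

The test fixture then passes the chosen port to the component under test, so two builds on the same machine never compete for a fixed port number.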
![Page 15: Gobblin @ NerdWallet (Nov 2015)](https://reader033.fdocuments.net/reader033/viewer/2022051404/5876016b1a28ab4a508b5a7f/html5/thumbnails/15.jpg)
Thanks! Questions??