
Hadoop at Aadhaar
(Data Store, OLTP & OLAP)

github.com/regunathb RegunathB

Bangalore Hadoop Meetup

Enrolment Data

600 to 800 million UIDs in 4 years

1 million enrolments a day with transaction and durability guarantees

350+ trillion matches every day

~5 MB per resident

Maps to about 10-15 PB of raw data (2048-bit PKI encrypted)

About 30 TB of I/O every day

Replication and backup of 5+ TB of incremental data across DCs every day

Lifecycle updates and new enrolments will continue forever

Enrolment data moves from very hot to cold, needing a multi-layered storage architecture

Additional process data

Several million events on average moving through async channels (some persistent, some transient)

Needing insert and update guarantees across data stores

Authentication Data

100+ million authentications per day (10 hrs)

Potentially high variance between peak and average load

Sub-second response

Guaranteed audits

Multi-DC architecture

All changes need to be propagated from enrolment data stores to all authentication sites

Each authentication request is about 4 KB

100 million authentications a day

1 billion audit records in 10 days (30+ billion a year)

4 TB encrypted audit logs in 10 days

Audit write must be guaranteed
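These audit figures are mutually consistent; a quick back-of-the-envelope check, assuming ~4 KB per audit record (in line with the ~4 KB request size above):

```latex
10^{8}\ \tfrac{\text{auths}}{\text{day}} \times 10\ \text{days} = 10^{9}\ \text{audit records};
\qquad
10^{9}\ \text{records} \times \sim 4\ \text{KB} \approx 4\ \text{TB}
```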

Aadhaar Data Stores

Mongo cluster (all enrolment records/documents: demographics + photo); sharded across 5 shards

Low latency indexed read (documents per sec), high latency random search (seconds per read)

MySQL (all UID-generated records: demographics only, track & trace, enrolment status); UID master (sharded) and Enrolment DB

Low latency indexed read (milliseconds per read), high latency random search (seconds per read)

Solr cluster (all enrolment records/documents: selected demographics only); sharded (shards 0, 2, 6, 9, a, d, f)

Low latency indexed read (documents per sec), low latency random search (documents per sec)

HDFS (all raw packets); 20 data nodes

High read throughput (MB per sec), high latency read (seconds per read)

HBase (all enrolment biometric templates); 20 region servers

High read throughput (MB per sec), low-to-medium latency read (milliseconds per read)

NFS (all archived raw packets); 4 LUNs

Moderate read throughput, high latency read (seconds per read)
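To make the HBase access pattern concrete, below is a minimal sketch of the kind of low-to-medium latency indexed read described above, using the classic HBase client API. The table name, column family, and row-key scheme are assumptions for illustration, not the actual Aadhaar schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TemplateReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table and column names are illustrative, not the actual Aadhaar schema.
        HTable table = new HTable(conf, "biometric_templates");
        try {
            // Row key: the UID (or a hash of it) identifying the resident.
            Get get = new Get(Bytes.toBytes(args[0]));
            get.addColumn(Bytes.toBytes("t"), Bytes.toBytes("template"));
            Result result = table.get(get);
            byte[] template = result.getValue(Bytes.toBytes("t"), Bytes.toBytes("template"));
            System.out.println("template bytes: " + (template == null ? 0 : template.length));
        } finally {
            table.close();
        }
    }
}
```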

Systems Architecture

Work distribution using SEDA & Messaging (see the staged-pipeline sketch after this list)

Ability to scale within a JVM and across JVMs

Recovery through check-pointing

Synchronous HTTP-based Auth gateway

Protocol Buffers & XML payloads

Sharded clusters

Near real-time data delivery to warehouse

Nightly data-sets used to build dashboards, data marts and reports

Real-time monitoring using Events
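A minimal sketch of SEDA-style work distribution as named in the first item above: each stage owns a bounded event queue drained by its own thread pool, so stages can be sized and back-pressured independently within a JVM. This is a generic illustration under those assumptions, not the actual Aadhaar framework (which additionally check-points for recovery).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A minimal SEDA-style stage: a bounded inbox plus a dedicated thread pool.
public class Stage<I, O> {
    public interface Handler<I, O> { O handle(I event) throws Exception; }

    private final BlockingQueue<I> inbox = new ArrayBlockingQueue<>(1000);
    private final ExecutorService workers;

    public Stage(int threads, Handler<I, O> handler, Stage<O, ?> next) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        O out = handler.handle(inbox.take());
                        if (next != null) next.submit(out); // hand off to the next stage
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (Exception e) {
                        // a production pipeline would check-point and retry here
                    }
                }
            });
        }
    }

    // Blocks when the inbox is full: back-pressure between stages.
    public void submit(I event) throws InterruptedException { inbox.put(event); }
}
```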

Enrolment Biometric Middleware

Distributes and reconciles biometric data extraction and de-duplication requests across multiple vendors (ABISs)

Biometric data de-referencing/read service (HTTP) over sharded HDFS and NFS

Serves the bulk of HDFS read requests (25 TB per day)

Locates data from multiple HDFS clusters

Sharded by read/write patterns: New, Archive, Purge

Calculates and maintains volume allocation and SLA breach thresholds of ABISs

Thresholds stored in ZK and pushed to middleware nodes (see the sketch below)
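A minimal sketch of ZK-backed threshold distribution as described above, using the standard ZooKeeper client API; the znode path and payload format are assumptions for illustration, not the actual layout.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Reads an ABIS volume/SLA threshold from ZooKeeper and re-reads it whenever
// the znode changes, so every middleware node sees updates without polling.
public class ThresholdWatcher implements Watcher {
    private static final String PATH = "/abis/vendor-1/thresholds"; // hypothetical path
    private final ZooKeeper zk;
    private volatile String thresholds;

    public ThresholdWatcher(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 30000, this);
        refresh();
    }

    private void refresh() throws Exception {
        // Passing 'this' as the watcher arms a one-shot watch on the znode.
        byte[] data = zk.getData(PATH, this, null);
        thresholds = new String(data, "UTF-8");
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh(); // re-read the data and re-arm the watch
            } catch (Exception e) {
                // a production client would retry and handle session expiry
            }
        }
    }

    public String current() { return thresholds; }
}
```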

Event Streams & Sinks

Event framework supporting different interaction/data durability patterns

P2P, Pub-Sub

Intra-JVM and Queue destinations - Durable / Non-Durable

Fire & Forget, Ack. after processing
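As an illustration of the durable, ack-after-processing pattern just listed, here is a minimal sketch using the RabbitMQ Java client (the event transport shown in the BI architecture later); the queue name and payload are hypothetical.

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import com.rabbitmq.client.MessageProperties;

// Durable queue + explicit ack-after-processing: the broker redelivers the
// event if the consumer dies before acking.
public class DurableEventChannel {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();
        ch.queueDeclare("enrolment.events", true /* durable */, false, false, null);

        // Producer side: a persistent message survives a broker restart.
        ch.basicPublish("", "enrolment.events",
                MessageProperties.PERSISTENT_TEXT_PLAIN, "event-payload".getBytes("UTF-8"));

        // Consumer side: autoAck=false, ack only after processing succeeds.
        ch.basicConsume("enrolment.events", false, new DefaultConsumer(ch) {
            @Override
            public void handleDelivery(String tag, Envelope env,
                                       AMQP.BasicProperties props, byte[] body)
                    throws java.io.IOException {
                process(body);                                       // handle the event
                getChannel().basicAck(env.getDeliveryTag(), false);  // then ack
            }
        });
    }

    static void process(byte[] body) { /* counters, metrics, persistence... */ }
}
```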

Event Sinks

Ephemeral data consumed by counters, metrics (dashboard)

Rolling file appenders that push data to HDFS

Primary mechanism for delivering raw fact data from transactional systems to the warehouse staging area
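A minimal sketch of the flush step of such a rolling appender, using the Hadoop FileSystem API: events are batched into one sizeable file per roll, which also sidesteps the small-files issue noted under Learnings. The staging path and CSV layout are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes one closed roll of buffered events as a single HDFS file in the
// warehouse staging area, where downstream ETL picks it up.
public class HdfsRollWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path roll = new Path("/staging/events/events-" + System.currentTimeMillis() + ".csv");
        FSDataOutputStream out = fs.create(roll);
        try {
            for (String line : new String[] {"uid,stage,ts", "x,extract,1"}) {
                out.write((line + "\n").getBytes("UTF-8"));
            }
        } finally {
            out.close(); // the file becomes visible to downstream Pig ETL once closed
        }
        fs.close();
    }
}
```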

Data Analysis

Statistical analysis from millions of events

View into quality of enrolments, e.g. by Enrolment Agencies and Operators

Feature introduction, e.g. based on avg. time taken for biometric capture and demographic data input

Enrolment volumes, e.g. by Registrar, Agency, Operator etc.

Useful in fraud detection

Goal to share anonymized data sets for use by industry and academia (information transparency)

Various reports: self-serve, canned, operational and/or aggregates

UID BI Platform: Data Analysis architecture

[Data flow, reconstructed from the architecture diagram]

Sources: UIDAI systems emit events (RabbitMQ) and transactional data from server DBs (MySQL); event CSVs and raw data land in Hadoop HDFS

ETL: Pig and Pentaho Kettle load fact data and dimension data into the data warehouse (HDFS/Hive); dimension data is also held in MySQL

Datasets: Hive generates on-demand datasets; Pentaho Kettle populates datamarts (MySQL); a data access framework fronts the stores

Delivery: Pentaho BI and FusionCharts drive canned reports, dashboards and self-service analytics, delivered via e-mail/portal/others

Hadoop stack summary

CDH2 (Enrolment, Analysis), CDH3 (Authentication)

Data Store

HDFS: Enrolment, Events, Audit Logs, Warehouse

HBase: Biometric templates used in Authentication

Coordination/Config

ZK: Biometric middleware thresholds

Analysis

Pig: ETL for loading analysis data from staging to atomic warehouse

Hive: Dataset generation framework

Learnings

Watch out for too many small files; HDFS is better suited to fewer but larger files

Data loss from HDFS in spite of having 3 replica copies (maybe fixed in releases post-CDH2?)

Give careful consideration to HBase table design, primarily the row key, to avoid region-server hot-spotting (see the salted-key sketch after this list)

Hive data (HDFS files) does not handle duplicate records; this can be an issue if data ingestion is replayed for data sets. Hive over HBase is a viable alternative
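One common remedy for the row-key hot-spotting mentioned above is to prefix monotonically increasing keys (timestamps, sequence numbers) with a small hash-derived salt, spreading consecutive writes across regions. A generic sketch of the technique, not the actual Aadhaar key design:

```java
import java.security.MessageDigest;

// Salted row key: a 1-byte hash-derived prefix distributes sequential writes
// over up to 16 regions instead of hammering a single region server.
public class SaltedRowKey {
    public static byte[] rowKey(String naturalKey) throws Exception {
        byte[] key = naturalKey.getBytes("UTF-8");
        byte[] digest = MessageDigest.getInstance("MD5").digest(key);
        byte salt = (byte) (digest[0] & 0x0f); // 16 salt buckets
        byte[] salted = new byte[key.length + 1];
        salted[0] = salt;                       // prefix, then the natural key
        System.arraycopy(key, 0, salted, 1, key.length);
        return salted;
    }
}
```

The trade-off: range scans over the natural key must now fan out across all 16 salt buckets and merge the results.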

References

Aadhaar Portal : https://portal.uidai.gov.in/uidwebportal/dashboard.do

Data Portal : https://data.uidai.gov.in/uiddatacatalog/dataCatalogHome.do

Analytics whitepaper : http://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf


(c) UIDAI, 2011