Real-Time Fraud Detection with Storm and Kafka
Transcript of Real-Time Fraud Detection with Storm and Kafka
WELCOME
Alexey Kharlamov, VP Technology
1
• Ad Fraud: We eliminate fraudulent impressions by ad bots and make sure ads don't show up on fraudulent websites
• Brand Safety: We make sure ads don't show up in places that brands don't want them
• Viewability: We measure whether a person actually viewed an ad
WE MEASURE AND ENSURE QUALITY
2
• Ad impressions processed: 5+ billion/day
• HTTP requests: 50+ billion/day
• Data centers: 10+ (AWS and on-premises)
• Data stored in clusters: 6+ petabytes
• New data collected daily: 20+ terabytes
• Hadoop cluster processing cores: ~20,000
INTEGRAL ENGINEERING BY THE NUMBERS
H2O World, 11/10/2015
3
AD FRAUD
NEARLY ALL AD FRAUD IS CAUSED BY BOT ACTIVITY
Ad Stacking: Placing multiple ads on top of one another in a single ad placement, with only the top ad in view
Illegal Bots: Compromised computers with breached security defenses conceded to a third party
Pixel Stuffing: Stuffing an entire ad-supported site into a 1x1 pixel ad
4
FRAUD DETECTION
facebook cnn ebay
nothingtoseehere.com
thisisnotabotnet.com
5
REQUIREMENTS
• Quickly identify freshly activated bots
• High accuracy of detection algorithms
• Avoid transfer of personal information across borders
• Withstand single data center failure
BLOCKING MONITORING
5+ billion events per day
8
EVENT SESSIONIZATION
[Figure: timeline (Time axis) of Transaction 1 and Transaction 2 with a Join Window. Rows: Impression 1, Impression 2, Impression 3. Event labels: Init, DT, Unload, Timeout. Outcomes: Emit, Emit, Drop.]
9
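The sessionization slide can be illustrated with a small sketch. It assumes, as one reading of the diagram, that events are grouped per impression, that a session closes on an unload event or after a timeout, and that a closed session is emitted only if it contains an init event (otherwise it is dropped). The event names, the 30-second timeout, and the class and method names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field

SESSION_TIMEOUT = 30  # seconds; hypothetical value, not from the slides


@dataclass
class Session:
    impression_id: str
    last_seen: int
    events: list = field(default_factory=list)


class SessionBuilder:
    """Groups events by impression; closes a session on unload or timeout."""

    def __init__(self, timeout=SESSION_TIMEOUT):
        self.timeout = timeout
        self.open = {}      # impression_id -> Session still collecting events
        self.emitted = []   # (impression_id, events) for completed sessions
        self.dropped = []   # impression ids discarded for lacking "init"

    def on_event(self, impression_id, kind, ts):
        # Close sessions that went quiet before handling the new event.
        self.expire(ts)
        session = self.open.setdefault(impression_id, Session(impression_id, ts))
        session.events.append(kind)
        session.last_seen = ts
        if kind == "unload":
            self._close(impression_id)

    def expire(self, now):
        stale = [i for i, s in self.open.items() if now - s.last_seen >= self.timeout]
        for impression_id in stale:
            self._close(impression_id)

    def _close(self, impression_id):
        session = self.open.pop(impression_id)
        # Assumption: a session without "init" cannot be joined and is dropped.
        if "init" in session.events:
            self.emitted.append((impression_id, session.events))
        else:
            self.dropped.append(impression_id)


builder = SessionBuilder(timeout=30)
builder.on_event("imp1", "init", 0)
builder.on_event("imp1", "dt", 5)
builder.on_event("imp2", "init", 6)
builder.on_event("imp2", "unload", 10)  # closed by unload, emitted
builder.on_event("imp3", "dt", 12)      # never saw init
builder.expire(100)                     # imp1 emitted on timeout, imp3 dropped
```

In a Storm topology this logic would live inside a bolt and the timeout would be driven by event time rather than an explicit expire call; the sketch keeps everything in one class for readability.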
DATA FLOW
[Figure labels: Input Topic, Session Builder, QLog Topic, Fraud Detection, Hadoop, Model Training, Assets, Firewall]
10
• Local log aggregation and processing
• Transfer over long links causes all sorts of synchronization problems
• Intra-DC links are reliable, the Internet is NOT. We can keep data locality and log time coherence
• A single firewall server failure is not a "stop-the-world" event. Data is present on the Kafka cluster.
• A completely autonomous system
• Higher availability due to DC redundancy
INTRA-DC DATA PROCESSING
11
DATA CENTER ARCHITECTURE
[Figure labels: Server 1 through Server N; Front-End Server; STORM]
LOG SOURCING: TAILER AGENT
● Non-invasive event sourcing
● Decoupled data publication and event processing
● Data fan-out
● Hard latency requirements
  ● <10 ms response
● Periodic checkpoints to recover after failure
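A tailer agent of this kind might look roughly like the sketch below: it only reads the front-end server's log file, publishes each complete line, and periodically saves its byte offset so it can resume after a failure. The file layout, the `publish` callback (standing in for a Kafka producer), and the `checkpoint_every` parameter are assumptions for illustration, not details from the talk.

```python
import os


def read_checkpoint(path):
    """Return the last saved byte offset, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_checkpoint(path, offset):
    """Persist the offset atomically: write a temp file, then rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
    os.replace(tmp, path)


def tail_once(log_path, checkpoint_path, publish, checkpoint_every=100):
    """Publish complete lines appended since the last checkpoint.

    Returns the number of lines published in this pass.
    """
    offset = read_checkpoint(checkpoint_path)
    count = 0
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            line = log.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line; retry next pass
            publish(line.rstrip(b"\n").decode("utf-8"))
            offset += len(line)
            count += 1
            if count % checkpoint_every == 0:
                write_checkpoint(checkpoint_path, offset)
    write_checkpoint(checkpoint_path, offset)
    return count
```

Because the checkpoint is written after publishing, a crash between the two steps republishes up to `checkpoint_every` lines on restart, so downstream consumers of this sketch would need to tolerate duplicates.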
RECOVERY STRATEGY
• Read logs in micro-batches and maintain state in memory
• Reliable processing
  - On successful operation: write checkpoint
  - On failure: return to previous checkpoint
  - On catastrophic failure: rewind data feed to a point before the problem started
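The three recovery cases above can be sketched as follows. This is a simplified illustration, assuming the feed is an indexable list and the state is a plain dictionary; the class name, the `handler` callback, and the choice to rebuild state from empty after a rewind are assumptions, not the speaker's implementation.

```python
import copy


class CheckpointedProcessor:
    """Processes micro-batches; keeps state in memory with a rollback point."""

    def __init__(self):
        self.state = {}             # in-memory state, e.g. counters per key
        self.offset = 0             # feed position after the last good batch
        self._checkpoint = ({}, 0)  # last known-good (state, offset)

    def process_batch(self, feed, batch_size, handler):
        """Apply handler to one micro-batch; commit on success, roll back on error."""
        batch = feed[self.offset:self.offset + batch_size]
        try:
            for record in batch:
                handler(self.state, record)
            self.offset += len(batch)
            # Success: write a new checkpoint.
            self._checkpoint = (copy.deepcopy(self.state), self.offset)
            return True
        except Exception:
            # Failure: return to the previous checkpoint.
            state, offset = self._checkpoint
            self.state, self.offset = copy.deepcopy(state), offset
            return False

    def rewind(self, offset):
        """Catastrophic failure: discard state and replay from an earlier offset."""
        self.state, self.offset = {}, offset
        self._checkpoint = ({}, offset)


def count_records(state, record):
    """Hypothetical handler: counts occurrences, rejects malformed records."""
    if record is None:
        raise ValueError("malformed record")
    state[record] = state.get(record, 0) + 1
```

After a failed batch the processor sits at the same offset with the same state as before the batch started, so the caller can retry it or call `rewind` with an earlier offset.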
LOGICAL TIME
● Wall-clock does not work
  ● Load spikes
  ● Recovery rewinds data feed to previous time
● Logical clock
  ● Maximum timestamp seen by Bolt
  ● New messages with smaller timestamp are late
● No clock synchronization
  ● All bolts are in "weak synchrony"
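The logical clock described above fits in a few lines. This is a minimal sketch, assuming each bolt keeps its own clock and that timestamps are comparable numbers; the class and method names are hypothetical.

```python
class LogicalClock:
    """Tracks time as the maximum event timestamp observed so far."""

    def __init__(self):
        self.now = None  # no events seen yet

    def observe(self, timestamp):
        """Advance the clock if needed; return True when the message is late."""
        if self.now is None or timestamp >= self.now:
            self.now = timestamp
            return False
        return True


clock = LogicalClock()
late_flags = [clock.observe(ts) for ts in [100, 105, 103, 110]]
# late_flags == [False, False, True, False]; clock.now == 110
```

Since the clock only moves when data arrives, a load spike or a replay after recovery does not trigger timeouts the way a wall clock would.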
DEBUGGING AND MONITORING
• Metrics recording and visualization is an essential component of the development cycle
  - Ease correlation of failure symptoms
  - Accelerate build/deploy/debug cycle
  - Provide trace for production issues
• Monitor business metrics
  - This is the only thing you care about
  - Technical issues may or may not have consequences
• Do it a lot
  - 150K metrics/sec
15
GLOBAL CONFIGURATION
16
[Figure labels: EAST COAST, EUROPE, DC-X (three Stream Mirror boxes); Kafka Backbone; CENTRAL: Spark, Hadoop]
LESSONS LEARNED
• Use staged roll-out
  - Start from minimal infrastructure for log delivery
• Do not try to build a fortress
  - It is much easier to build a system that accepts limited data loss
• Minimize persistent state
  - Slows system down
  - Expensive to maintain
• Hardware matters
17