Datasalt - BBVA case study - extracting value from credit card transactions
Value extraction from BBVA credit card transactions
Case Study
104,000 employees 47 million customers
The idea
Extract value from anonymized credit card transactions data & share it
Always:
• Impersonal
• Aggregated
• Dissociated
• Irreversible
Helping
Consumers
Sellers
Informed decisions:
• Shop recommendations (by location and by category)
• Best time to buy
• Activity & fidelity of a shop's customers

Learning client patterns:
• Activity & fidelity of a shop's customers
• Sex & age & location
• Buying patterns

Shop stats, for different periods:
• All, year, quarter, month, week, day
… and much more
The applications
Customers
Internal use
Sellers
The challenges
Company silos
The amount of data
The costs
Security
Development flexibility/agility
Human failures
The platform
• S3: data storage
• Elastic MapReduce: data processing
• EC2: data serving
The architecture
Hadoop
Distributed filesystem:
• Files as big as you want
• Horizontal scalability
• Failover

Distributed computing:
• MapReduce
• Batch oriented: input files are processed and converted into output files
• Horizontal scalability

Easier Hadoop Java API, but keeping similar efficiency

Common design patterns covered:
• Compound records
• Secondary sorting
• Joins

Other improvements:
• Instance-based configuration
• First-class multiple input/output
Tuple MapReduce implementation for Hadoop
Tuple MapReduce
Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce". In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012
Our evolution of Google's MapReduce
Tuple MapReduce example: sales difference between the top-selling offices for each location
Tuple MapReduce
Main constraint:
• The group-by clause must be a subset of the sort-by clause
Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
Efficiency
http://pangool.net/benchmark.html
Similar efficiency to Hadoop
Voldemort
Distributed key/value store
Voldemort & Hadoop
Benefits:
• Scalability & failover
• Updating the database does not affect serving queries
• All data is replaced at each execution
  • Provides agility/flexibility: big development changes are not a pain
  • Easier survival of human errors: fix the code and run again
  • Easy to set up new clusters with different topologies
Basic statistics
Count, average, min, max, stdev
Easy to implement with Pangool/Hadoop:
• One job, grouping by the dimension over which you want to calculate the statistics
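The per-group logic behind this single job can be sketched in plain Java: all five statistics fit in one accumulator that is fed once per value of the group, so a single reducer pass suffices. This is an illustrative stand-in, not the actual Pangool code; the class and method names are hypothetical.

```java
// Accumulator for count/average/min/max/stdev over one group (e.g. one shop).
public class BasicStats {
    long count = 0;
    double sum = 0, sumSq = 0;
    double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;

    // Called once per value of the group, e.g. per transaction amount.
    void add(double x) {
        count++;
        sum += x;
        sumSq += x * x;
        if (x < min) min = x;
        if (x > max) max = x;
    }

    double average() { return sum / count; }

    // Population standard deviation recovered from the running sums.
    double stdev() {
        double mean = average();
        return Math.sqrt(sumSq / count - mean * mean);
    }
}
```

Because the accumulator only keeps running sums, it can also serve as a Hadoop combiner: partial accumulators merge by adding their counts and sums.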
Computing several time periods in the same job
• Use the mapper to replicate each datum for each period
• Add a period identifier field to the tuple and include it in the group-by clause
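The replication trick can be sketched as follows: the mapper emits one tuple per period, carrying the period identifier and the bucket the date falls in, so one group-by computes every granularity at once. The tuple encoding and method names are illustrative assumptions, not the real mapper.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class PeriodReplicator {
    // Emits one "tuple" per period for a single datum; downstream the job
    // groups by (shop, period, periodValue), so all granularities are
    // computed in the same execution.
    public static List<String> replicate(String shop, LocalDate date) {
        List<String> tuples = new ArrayList<>();
        tuples.add(shop + "|ALL|all");
        tuples.add(shop + "|YEAR|" + date.getYear());
        tuples.add(shop + "|MONTH|" + date.getYear() + "-" + date.getMonthValue());
        tuples.add(shop + "|DAY|" + date);
        return tuples;
    }
}
```

The cost is a constant-factor blowup of the mapper output (one record per period), traded for running a single job instead of one per granularity.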
Distinct count
Possible to compute in a single job:
• Using secondary sorting on the field you want to distinct-count on
• Detecting changes of that field
Example (group by shop, sort by shop and card):

Shop    Card    Change
Shop 1  1234    +1
Shop 1  1234
Shop 1  1234
Shop 1  5678    +1
Shop 1  5678

2 distinct buyers for shop 1
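The change-detection step of the example above can be sketched in plain Java: given the secondary-sorted stream of card ids for one shop group, the reducer adds one each time the card field changes. This is a stand-in for the reducer logic, not the actual job.

```java
import java.util.List;

public class DistinctCount {
    // 'cards' is the card-id stream for one shop group, already sorted by
    // card thanks to secondary sorting; counting changes = counting distincts.
    public static int countDistinct(List<String> cards) {
        int distinct = 0;
        String previous = null;
        for (String card : cards) {
            if (!card.equals(previous)) { // change detected -> new distinct buyer
                distinct++;
                previous = card;
            }
        }
        return distinct;
    }
}
```

The key property is that no per-group set of seen cards is kept in memory; the sort order does that work, which is what makes a single job feasible at scale.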
Histograms
Typically a two-pass algorithm:
• First pass to detect the minimum and the maximum and determine the bin ranges
• Second pass to count the number of occurrences in each bin
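A minimal sketch of the two-pass scheme just described, on an in-memory array for clarity (in MapReduce each pass would be a job):

```java
public class TwoPassHistogram {
    public static int[] histogram(double[] data, int bins) {
        // Pass 1: find min and max to fix the bin ranges.
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double x : data) {
            if (x < min) min = x;
            if (x > max) max = x;
        }
        double width = (max - min) / bins;
        // Pass 2: count occurrences per bin; the max value lands in the last bin.
        int[] counts = new int[bins];
        for (double x : data) {
            int bin = width == 0 ? 0 : (int) ((x - min) / width);
            if (bin == bins) bin = bins - 1;
            counts[bin]++;
        }
        return counts;
    }
}
```

Needing two passes over the full dataset is exactly the cost the one-pass adaptive variant below avoids.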
Adaptive histogram
• One pass
• Fixed number of bins
• Bins adapt
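The slides leave the adaptation rule open; one common one-pass scheme with a fixed number of bins (in the spirit of Ben-Haim and Yom-Tov's streaming histogram) inserts each value as a tiny bin and, when over the limit, merges the two closest bins. The sketch below assumes that rule and is not necessarily what ran in production.

```java
import java.util.ArrayList;
import java.util.List;

public class AdaptiveHistogram {
    final int maxBins;
    // Parallel lists of bin centroids and counts, kept sorted by centroid.
    final List<Double> centers = new ArrayList<>();
    final List<Long> counts = new ArrayList<>();

    AdaptiveHistogram(int maxBins) { this.maxBins = maxBins; }

    void add(double x) {
        int i = 0;
        while (i < centers.size() && centers.get(i) < x) i++;
        centers.add(i, x);
        counts.add(i, 1L);
        if (centers.size() > maxBins) mergeClosest();
    }

    // Merge the adjacent pair of bins whose centroids are closest,
    // replacing them by their count-weighted average.
    void mergeClosest() {
        int best = 0;
        for (int i = 1; i < centers.size() - 1; i++)
            if (centers.get(i + 1) - centers.get(i) < centers.get(best + 1) - centers.get(best))
                best = i;
        long c = counts.get(best) + counts.get(best + 1);
        double merged = (centers.get(best) * counts.get(best)
                + centers.get(best + 1) * counts.get(best + 1)) / c;
        centers.set(best, merged);
        counts.set(best, c);
        centers.remove(best + 1);
        counts.remove(best + 1);
    }
}
```

Bin boundaries thus end up dense where the data is dense, in a single pass and with bounded memory.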
Optimal histogram
Calculate the histogram that best represents the original one, using a limited number of flexible-width bins:
• Reduces storage needs
• More representative than fixed-width bins -> better visualization
Optimal histogram
Exact algorithm: Petri Kontkanen, Petri Myllymäki, "MDL Histogram Density Estimation", http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing:
1. Iterate N times, keeping the best solution:
   1. Generate a random solution
   2. Iterate until no improvement:
      1. Move to the next better possible movement
• A solution is just a way of grouping the existing bins
• From a solution, you can move to some close solutions
• Some moves are better: they reduce the representation error
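The hill climbing above can be sketched concretely. Here a "solution" groups the original fixed-width bins into k consecutive flexible-width bins, encoded as sorted cut positions; the representation error is taken as the squared difference between each original count and its group's average, and a "movement" shifts one inner cut by one position. Both the error measure and the move set are illustrative choices, not necessarily the production ones.

```java
import java.util.Arrays;
import java.util.Random;

public class OptimalHistogram {
    // Squared error of representing each group of bins by its average count.
    static double error(int[] counts, int[] cuts) {
        double err = 0;
        for (int g = 0; g + 1 < cuts.length; g++) {
            double avg = 0;
            for (int i = cuts[g]; i < cuts[g + 1]; i++) avg += counts[i];
            avg /= (cuts[g + 1] - cuts[g]);
            for (int i = cuts[g]; i < cuts[g + 1]; i++)
                err += (counts[i] - avg) * (counts[i] - avg);
        }
        return err;
    }

    public static int[] search(int[] counts, int k, int restarts, long seed) {
        Random rnd = new Random(seed);
        int[] best = null;
        double bestErr = Double.POSITIVE_INFINITY;
        for (int r = 0; r < restarts; r++) {
            // 1. Generate a random solution: k groups = k+1 sorted cut points.
            int[] cuts = randomCuts(counts.length, k, rnd);
            double err = error(counts, cuts);
            // 2. Iterate until no improvement: try moving each inner cut by +-1.
            boolean improved = true;
            while (improved) {
                improved = false;
                for (int c = 1; c < cuts.length - 1; c++) {
                    for (int d : new int[]{-1, 1}) {
                        int old = cuts[c], moved = old + d;
                        if (moved <= cuts[c - 1] || moved >= cuts[c + 1]) continue;
                        cuts[c] = moved;
                        double e = error(counts, cuts);
                        if (e < err) { err = e; improved = true; }
                        else cuts[c] = old;
                    }
                }
            }
            if (err < bestErr) { bestErr = err; best = cuts.clone(); }
        }
        return best;
    }

    // Random distinct cut points with fixed endpoints 0 and n.
    static int[] randomCuts(int n, int k, Random rnd) {
        int[] cuts;
        do {
            cuts = new int[k + 1];
            cuts[0] = 0;
            cuts[k] = n;
            for (int i = 1; i < k; i++) cuts[i] = 1 + rnd.nextInt(n - 1);
            Arrays.sort(cuts);
        } while (hasDuplicates(cuts));
        return cuts;
    }

    static boolean hasDuplicates(int[] a) {
        for (int i = 1; i < a.length; i++) if (a[i] == a[i - 1]) return true;
        return false;
    }
}
```

The restarts guard against local minima: each restart climbs greedily from a fresh random grouping and the best grouping seen overall is kept.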
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing:
• One order of magnitude faster
• 99% accuracy
Everything in one job
Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job
We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations
Based on co-occurrences:
• If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
• Only one co-occurrence is counted even if a buyer bought several times in A and B
• The top co-occurrences for each shop are the recommendations
Improvements:
• The most popular shops are filtered out, because almost everybody buys in them
• Recommendations by category, by location, and by both
• Different calculation periods
Shop recommendations
Implemented in Pangool:
• Using its counting and joining capabilities
• Several jobs
Challenges:
• If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
• Alleviated by limiting the total number of distinct shops to consider:
  • Only the top M shops where the client bought the most are used
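The co-occurrence step with the top-M safeguard can be sketched as follows: for each buyer, only the M most-bought shops are paired, so at most M * (M - 1) pairs are emitted per person instead of N * (N - 1). Plain-Java stand-in for the Pangool jobs; names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CoOccurrences {
    // buyerShops: shop -> number of purchases by this buyer (distinct shops only).
    public static List<String> emit(Map<String, Integer> buyerShops, int m) {
        // Keep only the top M shops where this buyer bought the most.
        List<String> top = new ArrayList<>(buyerShops.keySet());
        top.sort((a, b) -> buyerShops.get(b) - buyerShops.get(a));
        if (top.size() > m) top = top.subList(0, m);
        // Emit one ordered pair per combination; downstream jobs count pairs
        // and keep the top co-occurring shops per shop as recommendations.
        List<String> pairs = new ArrayList<>();
        for (String a : top)
            for (String b : top)
                if (!a.equals(b)) pairs.add(a + "->" + b);
        return pairs;
    }
}
```

The cap turns the worst case from quadratic in a buyer's distinct shops into a constant per buyer, at the cost of ignoring co-occurrences involving rarely visited shops.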
Future:
• Time-aware co-occurrences: the client bought in A and B within a close period of time
Some numbers
Estimated resources needed with 1 year of data:
• 270 GB of stats to serve
• 24 large instances, ~11 hours of execution
• $3,500/month
  • Optimizations still possible
  • Cost without the use of reserved instances
  • Probably cheaper with an in-house Hadoop cluster
Conclusion
It was possible to develop a Big Data solution for a bank:
• With low use of resources
• Quickly
• Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases
The solution is:
• Scalable
• Flexible/agile: improvements are easy to implement
• Prepared to withstand human failures
• At a reasonable cost
Main advantage: always recomputing everything from scratch
Future: Splout
Key/value datastores have limitations:
• They only accept querying by the key
• Aggregations are not possible
• In other words, we are forced to pre-compute everything
  • Not always possible -> the data explodes
  • For this particular case, time ranges are fixed
Splout: like Voldemort, but SQL!
• The idea: replace Voldemort with Splout SQL
• Much richer queries: real-time aggregations, flexible time ranges
• It would allow building a kind of Google Analytics for the statistics discussed in this presentation
• Open sourced!
https://github.com/datasalt/splout-db