Value extraction from BBVA credit card transactions. Iván de Prado at Big Data Spain 2012
Value extraction from BBVA credit card transactions
Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt
www.bigdataspain.org | November 16th, 2012 | ETSI Telecomunicación, Madrid, Spain | #BDSpain
BIG “MAC” DATA
104,000 employees 47 million customers
The idea
Extract value from
anonymized credit card transaction data & share it
Always: ✓ Impersonal ✓ Aggregated ✓ Dissociated ✓ Irreversible
Helping
Consumers
Sellers
Informed decisions ✓ Shop recommendations (by location and by category) ✓ Best time to buy ✓ Activity & fidelity of a shop's customers
Learning client patterns ✓ Activity & fidelity of a shop's customers ✓ Sex & age & location ✓ Buying patterns
Shop stats for different periods ✓ All, year, quarter, month, week, day
… and much more
The applications
Customers
Internal use
Sellers
The challenges
Company silos
The amount of data
The costs
Security
Development flexibility/agility
Human failures
The platform
S3: data storage
Elastic MapReduce: data processing
EC2: data serving
The architecture
Hadoop
Distributed filesystem ✓ Files as big as you want ✓ Horizontal scalability ✓ Failover
Distributed computing ✓ MapReduce ✓ Batch oriented: input files are processed and converted into output files ✓ Horizontal scalability
Easier Hadoop Java API ✓ But keeping similar efficiency
Common design patterns covered ✓ Compound records ✓ Secondary sorting ✓ Joins
Other improvements ✓ Instance-based configuration ✓ First-class multiple input/output
Tuple MapReduce implementation for Hadoop
Tuple MapReduce
Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond Classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012
An evolution of Google's MapReduce
Tuple MapReduce: sales difference between the top-selling offices for each location
Tuple MapReduce
Main constraint
✓ The group-by clause must be a subset of the sort-by clause (illustrated in the sketch below)
Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
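As an illustration of the group-by/sort-by constraint, here is a minimal plain-Java sketch (not the actual Pangool API; the record and data are made up) of the slide's example: with tuples sorted by (location, sales desc) but grouped by location alone, the two top-selling offices of each location arrive first in their group, so their difference falls out in a single pass.

```java
import java.util.*;

// Plain-Java sketch of the Tuple MapReduce example: sort by (location, sales desc),
// group by location only. The group-by key is a prefix of the sort-by key.
public class TopOfficeDiff {
    record Sale(String location, String office, double sales) {}

    public static void main(String[] args) {
        List<Sale> tuples = new ArrayList<>(List.of(
            new Sale("Madrid", "A", 100), new Sale("Madrid", "B", 80),
            new Sale("Madrid", "C", 60), new Sale("Bilbao", "D", 50),
            new Sale("Bilbao", "E", 30)));

        // Sort by location, then by sales descending
        tuples.sort(Comparator.comparing(Sale::location)
            .thenComparing(Comparator.comparingDouble(Sale::sales).reversed()));

        String current = null;
        Sale top = null;
        int rank = 0;
        for (Sale s : tuples) {
            if (!s.location().equals(current)) {   // new group starts
                current = s.location(); top = s; rank = 1;
            } else if (++rank == 2) {              // second-best office in the group
                System.out.println(current + ": " + (top.sales() - s.sales()));
            }
        }
    }
}
```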
Efficiency
http://pangool.net/benchmark.html
Similar efficiency to Hadoop
Voldemort
Distributed key/value store
Voldemort & Hadoop
Benefits ✓ Scalability & failover ✓ Updating the database does not affect serving queries ✓ All data is replaced at each execution
• Providing agility/flexibility: big development changes are not a pain
• Easier to survive human errors: fix the code and run again
• Easy to set up new clusters with different topologies
Basic statistics
Count, average, min, max, stdev
Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics
Computing several time periods in the same job
✓ Use the mapper to replicate each datum for each period ✓ Add a period identifier field to the tuple and include it in the group-by clause (see the sketch below)
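A minimal sketch of the period-replication idea; the Txn record and period tags are hypothetical, not the production schema. Each transaction is emitted once per period granularity, with the period identifier folded into the group-by key, so one job computes statistics for all periods at once.

```java
import java.time.LocalDate;
import java.util.*;

// Map step: replicate each transaction once per time period, tagging the
// key with a period identifier so a single group-by computes all periods.
public class PeriodReplication {
    record Txn(String shop, LocalDate date, double amount) {}

    static List<Map.Entry<String, Double>> map(Txn t) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        out.add(Map.entry(t.shop() + "|ALL", t.amount()));
        out.add(Map.entry(t.shop() + "|Y" + t.date().getYear(), t.amount()));
        out.add(Map.entry(t.shop() + "|M" + t.date().getYear() + "-"
                          + t.date().getMonthValue(), t.amount()));
        out.add(Map.entry(t.shop() + "|D" + t.date(), t.amount()));
        return out;
    }

    public static void main(String[] args) {
        map(new Txn("shop1", LocalDate.of(2012, 11, 16), 25.0))
            .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```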
Distinct count
Possible to compute in a single job
✓ Using secondary sorting on the field you want to distinct-count ✓ Detecting changes in that field
Example ✓ Group by shop; sort by shop and card

Shop   | Card | Change
Shop 1 | 1234 | +1
Shop 1 | 1234 |
Shop 1 | 1234 |
Shop 1 | 5678 | +1
Shop 1 | 5678 |

2 distinct buyers for shop 1
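A sketch of the reducer-side logic for this example, assuming the framework delivers the group's values already sorted by card:

```java
import java.util.List;

// With records grouped by shop and secondarily sorted by card, the
// distinct-buyer count is the number of positions where the card changes.
public class DistinctCount {
    public static void main(String[] args) {
        // Values for the group "Shop 1", already sorted by card
        List<String> cards = List.of("1234", "1234", "1234", "5678", "5678");
        int distinct = 0;
        String previous = null;
        for (String card : cards) {
            if (!card.equals(previous)) { distinct++; previous = card; }
        }
        System.out.println(distinct + " distinct buyers for shop 1"); // prints 2
    }
}
```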
Histograms
Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges ✓ Second pass to count the number of occurrences in each bin
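A minimal sketch of the two-pass scheme, over an in-memory array for brevity; in MapReduce each pass would be a separate scan of the data, which is why the one-pass variant below is attractive.

```java
import java.util.Arrays;

// Two-pass fixed-width histogram: pass 1 fixes the bin ranges, pass 2 counts.
public class TwoPassHistogram {
    public static void main(String[] args) {
        double[] data = {3, 7, 1, 9, 4, 4, 8};
        int bins = 4;

        // Pass 1: find min and max to determine the bin width
        double min = Arrays.stream(data).min().orElseThrow();
        double max = Arrays.stream(data).max().orElseThrow();
        double width = (max - min) / bins;

        // Pass 2: count occurrences per bin (max value goes in the last bin)
        int[] counts = new int[bins];
        for (double v : data) {
            counts[Math.min((int) ((v - min) / width), bins - 1)]++;
        }
        System.out.println(Arrays.toString(counts)); // [1, 3, 0, 3]
    }
}
```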
Adaptive histogram
✓ One pass ✓ Fixed number of bins ✓ The bins adapt
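The slides don't detail the adaptive algorithm; a common one-pass approach in this spirit (assumed here, in the style of streaming histograms, not necessarily the one used in the project) keeps at most K (centroid, count) bins and merges the two closest bins whenever the limit is exceeded:

```java
import java.util.*;

// One-pass histogram with a fixed number of adapting bins: when a new value
// pushes the bin count over the limit, the two closest bins are merged into
// their weighted centroid.
public class AdaptiveHistogram {
    private final int maxBins;
    private final TreeMap<Double, Long> bins = new TreeMap<>(); // centroid -> count

    AdaptiveHistogram(int maxBins) { this.maxBins = maxBins; }

    void add(double value) {
        bins.merge(value, 1L, Long::sum);
        if (bins.size() > maxBins) mergeClosest();
    }

    private void mergeClosest() {
        Double prev = null, a = null, b = null;
        double best = Double.MAX_VALUE;
        for (double c : bins.keySet()) {          // find the closest pair
            if (prev != null && c - prev < best) { best = c - prev; a = prev; b = c; }
            prev = c;
        }
        long ca = bins.remove(a), cb = bins.remove(b);
        bins.put((a * ca + b * cb) / (ca + cb), ca + cb); // weighted centroid
    }

    public static void main(String[] args) {
        AdaptiveHistogram h = new AdaptiveHistogram(3);
        for (double v : new double[]{1, 2, 10, 11, 2.5, 30}) h.add(v);
        System.out.println(h.bins);
    }
}
```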
Optimal histogram
Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs ✓ More representative than fixed-width bins -> better visualization
Optimal histogram
Exact algorithm: Petri Kontkanen, Petri Myllymäki: MDL Histogram Density Estimation. http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
Optimal histogram
Alternative: approximated algorithm
Random-restart hill climbing
1. Iterate N times, keeping the best solution:
   1. Generate a random solution
   2. Iterate until there is no improvement:
      1. Move to the next better possible move
✓ A solution is just a way of grouping the existing bins ✓ From a solution you can move to some close solutions ✓ Some are better: they reduce the representation error
Optimal histogram
Alternative: approximated algorithm
Random-restart hill climbing ✓ One order of magnitude faster ✓ 99% accuracy
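A hedged sketch of the approximated algorithm as described in the slides: a solution assigns the existing fine-grained bins to K contiguous groups, the error measures how badly each group's average represents its member counts, and hill climbing shifts one group boundary at a time while the error improves. The exact error function and move set are assumptions for illustration.

```java
import java.util.*;

// Random-restart hill climbing over bin groupings: a solution is a sorted
// array of boundary indices splitting the fine bins into contiguous groups.
public class HillClimbHistogram {
    static final Random RND = new Random(42);

    // Squared deviation of each fine-bin count from its group's average
    static double error(long[] counts, int[] bounds) {
        double err = 0;
        int start = 0;
        for (int g = 0; g <= bounds.length; g++) {
            int end = (g < bounds.length) ? bounds[g] : counts.length;
            double avg = 0;
            for (int i = start; i < end; i++) avg += counts[i];
            avg /= (end - start);
            for (int i = start; i < end; i++) err += Math.pow(counts[i] - avg, 2);
            start = end;
        }
        return err;
    }

    static int[] randomSolution(int fineBins, int groups) {
        TreeSet<Integer> s = new TreeSet<>();
        while (s.size() < groups - 1) s.add(1 + RND.nextInt(fineBins - 1));
        return s.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        long[] counts = {9, 8, 9, 1, 1, 2, 1, 20, 22, 21};
        int restarts = 10, groups = 3;
        int[] best = null;
        for (int r = 0; r < restarts; r++) {
            int[] sol = randomSolution(counts.length, groups);
            boolean improved = true;
            while (improved) {                        // climb to a local optimum
                improved = false;
                for (int i = 0; i < sol.length; i++) {
                    for (int d : new int[]{-1, 1}) {  // shift one boundary by one
                        int[] cand = sol.clone();
                        cand[i] += d;
                        if (cand[i] <= 0 || cand[i] >= counts.length) continue;
                        if (i > 0 && cand[i] <= cand[i - 1]) continue;
                        if (i < sol.length - 1 && cand[i] >= cand[i + 1]) continue;
                        if (error(counts, cand) < error(counts, sol)) {
                            sol = cand; improved = true;
                        }
                    }
                }
            }
            if (best == null || error(counts, sol) < error(counts, best)) best = sol;
        }
        System.out.println("boundaries: " + Arrays.toString(best));
    }
}
```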
Everything in one job
Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job
We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations
Based on co-occurrences ✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists ✓ Only one co-occurrence is counted even if the buyer bought several times in A and B ✓ The top co-occurrences of each shop are its recommendations
Improvements ✓ The most popular shops are filtered out because almost everybody buys in them ✓ Recommendations by category, by location, and by both ✓ Different calculation periods
Shop recommendations
Implemented in Pangool ✓ Using its counting and joining capabilities ✓ Several jobs
Challenges ✓ If somebody bought in many shops, the list of co-occurrences can explode: co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought ✓ Alleviated by limiting the total number of distinct shops considered: only the top M shops where the client bought the most are used (see the sketch below)
Future ✓ Time-aware co-occurrences: the client bought in A and B within a short period of time
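A sketch of the pair-generation step with the top-M cap described above; the shop names, counts, and M are made up for illustration:

```java
import java.util.*;

// Per card: keep the top M shops by purchase count, then emit one
// co-occurrence per ordered shop pair, bounding the output at M * (M - 1).
public class CoOccurrences {
    public static void main(String[] args) {
        // shop -> number of purchases for one card
        Map<String, Integer> purchases = Map.of(
            "shopA", 5, "shopB", 3, "shopC", 1, "shopD", 8);
        int M = 3;

        // Only the top M shops where this client bought the most
        List<String> shops = purchases.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(M)
            .map(Map.Entry::getKey)
            .toList();

        // A later job counts each pair once per card and keeps the top
        // co-occurring shops per shop as its recommendations
        for (String a : shops)
            for (String b : shops)
                if (!a.equals(b)) System.out.println(a + " -> " + b);
    }
}
```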
Some numbers
Estimated resources needed with 1 year of data
270 GB of stats to serve
24 large instances, ~11 hours of execution
$3,500/month ✓ Optimizations still possible ✓ Cost without the use of reserved instances ✓ Probably cheaper with an in-house Hadoop cluster
Conclusion
It was possible to develop a Big Data solution for a bank
✓ With a low use of resources ✓ Quickly ✓ Thanks to technologies like Hadoop, Amazon Web Services, and NoSQL databases
The solution is ✓ Scalable ✓ Flexible/agile: improvements are easy to implement ✓ Prepared to withstand human failures ✓ At a reasonable cost
Main advantage: always recomputing everything
Future: Splout
Key/value datastores have limitations
✓ They only accept querying by the key ✓ Aggregations are not possible ✓ In other words, we are forced to pre-compute everything ✓ That is not always possible -> the data explodes ✓ For this particular case, time ranges are fixed
Splout: like Voldemort but SQL!
✓ The idea: replace Voldemort with Splout SQL ✓ Much richer queries: real-time aggregations, flexible time ranges ✓ It would allow building a kind of Google Analytics for the statistics discussed in this presentation ✓ Open sourced!!!
https://github.com/datasalt/splout-db
Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt
Questions?