Big Data Workshop: Analytics and Models, by Esteban Moro and Alejandro Llorente

INNOVA CHALLENGE Workshop 30 Oct Esteban Moro Alejandro Llorente www.iic.uam.es Analytics & Models



Transcript of Big Data Workshop: Analytics and Models, by Esteban Moro and Alejandro Llorente

  • 1. Analytics & Models. Esteban Moro, Alejandro Llorente. www.iic.uam.es. INNOVA CHALLENGE Workshop, 30 Oct.

2. Analytics and Models. Challenge participant roadmap. Diagram labels: Data (Maps, Infrastructures/Places, Activity), Mining, Analysis, Models, Visualization, Development, App, Content.
3. Summary: Introduction to geo-tagged data. Access to (open) geo-tagged data. Example: development of a geolocalized recommender app.
4. Introduction to geo-tagged data
5. Introduction to geo-tagged data. Information: person, event, infrastructure. Geography: GPS coordinates, zone, city.
6. Geospatial Big Data: activity (transport), maps, satellite images, social media, sensors.
7. Geo-tagged Big Data applications. With geo-tagged data we can: measure zone/area occupation and activity; identify flows of persons/money between different areas; identify movements/flows between zones. With those data we can build applications in: geo-social analysis, geomarketing, optimal allocation of resources, fraud detection, event detection.
8. Geo-social analysis: Use of pervasive sensors (mobile phones, social media) to model the movement and communication of people in urban areas.
9. 
Geo-social analysis. Geolocation study in Madrid. Location: Puerta del Sol. Characterization of urban neighborhoods according to their social/commercial use. Total number of check-ins: 2651 (30.5 per day). Number of unique users in the zone: 1231. [Figure: check-in counts (n_checkins) by place, by hour and by venue type (factor(tipo)); the plotted values are not recoverable.]
10. Fraud detection: Use merchant location and/or IP address in online transactions to detect fraud.
11. Geomarketing. [Figure labels: Bars, Shops.]
12. Optimal resource allocation: Optimize cash holding in bank branches, minimizing the costs associated with it. Identify the best placement for a new shop/branch. [Figure labels: Bars, Shops.]
13. Event detection: Detect unexpected behavior using social/mobile/urban sensors.
14. Access to (open) geographical data
15. Geographical data: map, infrastructure/places, activity.
16. 
Types of data. Maps. Economic/demographic data. Other types of data: Google POIs, weather forecast. Activity: Twitter, BBVA API.
17. Maps :: Google Maps. Google Maps offers a number of different services/APIs, with different restrictions and protocols. They let you define maps, routes, markers, etc. Example: get a static map (without authentication). Base URL: http://maps.google.com/maps/api/staticmap Parameters: center: 40.4153,-3.6875; size: 640x640; maptype: mobile; format: png32; sensor: true.
18. Maps :: OpenStreetMap. Open, collaborative project to create and distribute free maps. Different APIs provide information about routes, points, maps, etc. A number of mapping projects (applications) are built on top of OSM with very different purposes. Example: get the route between two locations with MapQuest. Base URL: http://open.mapquestapi.com/guidance/v1/ Parameters: Key: authentication key; From: latitude and longitude of the origin, in JSON; To: latitude and longitude of the destination, in JSON.
19. Maps :: shapefiles. Geospatial vector data format for geographical information. Points, paths and regions are represented as points, lines and polygons. Each of them usually has attributes that describe it: region codes, names, population, etc. pyshp: http://code.google.com/p/pyshp/ maptools: http://cran.r-project.org/web/packages/maptools Data: http://www.naturalearthdata.com/downloads/
20. Maps :: shapefiles. Editing and visualization of shapefiles: http://www.qgis.org
21. Maps :: Spain cartography. CartoCiudad (Ministerio de Fomento): shapefiles for each province at municipality and postal code levels. They also include data about the urban background. http://www.cartociudad.es/portal/
22. Maps :: Madrid cartography. Nomecalles (CAM): shapefiles, POIs (museums, theaters, health services), subway (stations), etc. 
http://www.madrid.org/nomecalles/DescargaBDTCorte.icm Resolution level: municipalities, districts, postal codes, etc.
23. Maps :: Barcelona province cartography. Plan territorial metropolitano de Barcelona, Generalitat de Catalunya. Link.
24. Maps :: Barcelona city cartography. Open data gencat, Catalonia cartography. Link.
25. Maps :: Barcelona city cartography. Plan territorial metropolitano de Barcelona, Generalitat de Catalunya. Link. This website also has data about mobility, economic development, population, etc. at the district level. There is nothing at this level of detail for Madrid. Solution: use other data sources to estimate them (see below).
26. Demographic/Economic data :: Spain. Demographic data: Instituto Nacional de Estadística (INE). Census by municipality. Link. Economic data: Servicio Público de Empleo Estatal (SEPE). Unemployment by municipality. Link.
27. Demographic/Economic data :: Madrid. Madrid city: Madrid City Council database: http://www-2.munimadrid.es/CSE6/jsps/menuBancoDatos.jsp Population by districts, neighborhoods, etc. Madrid region: Comunidad de Madrid database: http://www.madrid.org/desvan/Inicio.icm?enlace=almudena Population by municipality. Economic data by municipality.
28. Demographic/Economic data :: Barcelona. Barcelona city: Departament d'Estadística http://www.bcn.cat/estadistica/castella/ Population by district. Unemployment by district. Catalonia region: Idescat (Institut d'Estadística de Catalunya) http://www.idescat.cat/es/ Population by municipality. Economic data by municipality.
29. Other data sources :: Google Points of Interest. Google API Console.
30. Other data sources :: Google Points of Interest. Google API Console.
31. 
Other data sources :: Google Points of Interest. Google API Console.
32. Other data sources :: Google Points of Interest. Points of interest around Puerta del Sol (Madrid). Service 1: Places Search. Parameters: location: 40.417,-3.703; radius: 1000. Service 2: Places Details. Parameters: reference: the place's reference code.
33. Other data sources :: Weather forecast. GFS: Global Forecast System. OPeNDAP protocol. Python implementation: pydap. Query format:
    SERVER = http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/
    DATE = YYYYMMDD
    HOUR = HH
    VAR = weather metric (tmp2m, ugrd10m, pressfc, ...)
    LAT = latitude index interval 259:263 (0.5-degree steps from the South Pole)
    LON = longitude index interval 710:714 (0.5-degree steps from Greenwich)
    QUERY = <SERVER>gfs_hd<DATE>/gfs_hd_<HOUR>z.dods?<VAR>[0:0][<LAT>][<LON>]
    dataset = open_dods(QUERY)
34. Activity :: data from Twitter API. Developers webpage: http://dev.twitter.com
35. Activity :: data from Twitter API. Developers webpage: http://dev.twitter.com
36. Activity :: data from Twitter API. Developers webpage: http://dev.twitter.com
37. Activity :: data from Twitter API. Developers webpage: http://dev.twitter.com Credentials: Consumer Key, Consumer Secret, Access token, Access token secret.
38. Activity :: data from Twitter API. OAuth authentication: Consumer Key, Consumer Secret, Access token, Access token secret. REST API: several queries with parameters; the number of requests is limited. Stream API: only one query (with parameters); requests are not time-limited.
39. Activity :: data from Twitter API. Stream API. Example: geolocated tweets in the Madrid region. API service: POST statuses/filter. Parameters: locations: -4.59, 39.90, -3.04, 41.17.
40. 
Activity :: data from Twitter API. Stream API. As noted before, Madrid has no data for administrative zones below the municipality level, but we can estimate some of them with Twitter. Example: population by postal code. (1) Round geographical coordinates to the 3rd decimal place (square cells roughly 100 meters on a side). (2) Find the postal code each user visits most and define it as his/her residence; count the number of residents per postal code. (3) Visualize.
41. Activity :: data from Twitter API. Stream API.
42. Activity :: data from Twitter API. Stream API.
43. Activity :: data from BBVA API. https://www.centrodeinnovacionbbva.com/signup
44. Activity :: data from BBVA API. https://developer.bbva.com/panel
45. Activity :: data from BBVA API. https://developer.bbva.com/panel
46. Activity :: data from BBVA API. https://developer.bbva.com/panel
47. Activity :: data from BBVA API. Getting the authentication data: (1) With the APP_ID and APP_KEY, generate the authorization code by concatenating both strings with ":" and encoding the result in Base64. (2) Add this authorization code to the HTTP request header. Example:
    APP_ID = "iic_formacion_innovachallenge"
    APP_KEY = "0f1d750a5baea6c7022452d0d2ece01fc5901ad7"
    str_to_encode = "iic_formacion_innovachallenge:0f1d750a5baea6c7022452d0d2ece01fc5901ad7"
    auth = strToBase64(str_to_encode)
    Request = HttpRequest(SERVICE, PARAMETERS, header = {Authorization: auth})
48. Activity :: data from BBVA API. Economic flows from Puerta del Sol. API service: customer_zipcodes. Parameters: date_min: 201304; date_max: 201304; zipcode: 28013; by: cards; group_by: month.
49. Example: development of a geolocalized recommender app.
50. 
Recommender systems :: Introduction. Objective: recommend to users which areas to visit according to their profile, residence, preferences, etc., using information about what similar users do. Data used: (1) Twitter data. (2) Innova Challenge API: CARDS_CUBE. (3) Innova Challenge API: CUSTOMER_ZIPCODES.
51. Recommender systems :: user language. Use Twitter data to: (1) find out what people are talking about in city areas; (2) analyze the user's language on Twitter; (3) compare the user's language with each area's language and recommend the most similar areas to the user.
52. Recommender systems :: user language. Postal code 28013: Madrid city center.
53. Recommender systems :: user language. Postal code 28009: Retiro.
54. Recommender systems :: user demographic profile. Use the CARDS_CUBE service from the BBVA API.
55. Recommender systems :: user demographic profile. Use CARDS_CUBE service data. For each merchant category Z (bars, fashion, health, etc.), build a matrix in which each entry is the number of different credit cards with a given profile X (gender, age) that went shopping in postal code Y at a merchant of category Z. Where do people like me go shopping? Which restaurants are visited by people similar to me?
56. Recommender systems :: user demographic profile. Example: male, age 36-45. [Figure panels: Fashion; Bars and restaurants.]
57. Recommender systems :: user geographic profile. Use the CUSTOMER_ZIPCODES service in the BBVA API.
58. Recommender systems :: user geographic profile. Use data from the CUSTOMER_ZIPCODES service. For each merchant category Z (bars, fashion, health, etc.), we build a matrix in which each entry is the number of different credit cards from postal code X that go shopping in postal code Y at merchants of category Z. Where do people in my district go shopping? 
What restaurants are visited by people living in my district?
59. Recommender systems :: user geographic profile. Example: postal code 28045. [Figure panels: Fashion; Bars and restaurants.]
60. Recommender systems :: combination. Geographical and demographic recommendation system.
61. Recommender systems :: combination. Example: male, age 36-45, living in postal code 28045. [Figure panels: Fashion; Bars and restaurants.]
62. From the data to the app
63. From data to the app. (1) The idea. (2) What data do I need to carry out this idea? Which services of the Challenge API do I need? Can I improve it with other information sources? (3) Analysis: distilling the idea and assessing its viability. Extracting the hidden value of analytics and models. (4) How can the user take advantage of this idea? (5) Iterate steps 2, 3 and 4 until the idea and the benefit to the user become clear. (6) Convert the value of the analysis into an application.
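Sketch for slide 17 (static map request). The snippet below only assembles the request URL from the base URL and parameters listed on that slide; it does not contact the service. The function name static_map_url is made up for illustration, and the endpoint and parameters are copied from the slide, so they may not match what the service accepts today.

```python
from urllib.parse import urlencode

# Base URL as given on slide 17.
BASE_URL = "http://maps.google.com/maps/api/staticmap"


def static_map_url(lat, lon, size="640x640", maptype="mobile", fmt="png32"):
    """Assemble the static-map URL (hypothetical helper, not part of any SDK)."""
    params = {
        "center": f"{lat},{lon}",
        "size": size,
        "maptype": maptype,
        "format": fmt,
        "sensor": "true",
    }
    # safe="," keeps the comma in "lat,lon" readable instead of encoding it as %2C.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = static_map_url(40.4153, -3.6875)
print(url)
```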
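Sketch for slide 33 (GFS query). Assuming the query format on that slide is a plain string concatenation, the code below builds the query and derives the grid index intervals from coordinates in degrees. The helper names and the date "20131030" are illustrative assumptions; the slide then passes the resulting string to pydap's open_dods, which is left out here so the example runs without network access.

```python
# Server prefix as given on slide 33.
SERVER = "http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/"


def index_range(lo_deg, hi_deg, origin_deg, step=0.5):
    """Convert a coordinate interval in degrees to a 'start:stop' grid index range."""
    start = round((lo_deg - origin_deg) / step)
    stop = round((hi_deg - origin_deg) / step)
    return f"{start}:{stop}"


def gfs_query(date, hour, var, lat_range, lon_range):
    """Assemble the query string following the slide's template (hypothetical helper)."""
    return (
        f"{SERVER}gfs_hd{date}/gfs_hd_{hour}z.dods"
        f"?{var}[0:0][{lat_range}][{lon_range}]"
    )


# Latitude is indexed from the South Pole (-90), longitude eastward from Greenwich (0..360).
lat = index_range(39.5, 41.5, origin_deg=-90.0)
lon = index_range(355.0, 357.0, origin_deg=0.0)

# "20131030" and "00" are example values, not taken from the slides.
query = gfs_query("20131030", "00", "tmp2m", lat, lon)
print(query)
```

The computed intervals reproduce the 259:263 and 710:714 ranges shown on the slide, which cover the Madrid area.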
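Sketch for slide 47 (BBVA API authorization code). The two steps on that slide, joining APP_ID and APP_KEY with ":" and encoding the result in Base64, can be written with Python's standard library as follows. The credentials below are placeholders, and the header layout simply mirrors the slide's pseudocode; whether the service expects any prefix before the encoded value is not stated there.

```python
import base64


def authorization_code(app_id, app_key):
    """Join id and key with ':' and Base64-encode the result, as described on slide 47."""
    raw = f"{app_id}:{app_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials (hypothetical); use the ones issued for your own application.
APP_ID = "my_app_id"
APP_KEY = "my_app_key"

# Header dictionary to attach to the HTTP request, mirroring the slide's pseudocode.
headers = {"Authorization": authorization_code(APP_ID, APP_KEY)}
print(headers)
```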
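Sketch for slide 40 (population by postal code from geolocated tweets). The three steps on that slide can be illustrated with a toy data set. The tweet tuples and the cell-to-postal-code lookup below are invented for the example; in practice the lookup would come from a postal code shapefile, which is not shown here.

```python
from collections import Counter, defaultdict


def cell(lat, lon):
    """Step 1: round coordinates to the 3rd decimal place to get a grid cell."""
    return (round(lat, 3), round(lon, 3))


def residents_by_postal_code(tweets, cell_to_postal_code):
    """Step 2: assign each user to the postal code they tweet from most often,
    then count users per postal code."""
    visits = defaultdict(Counter)
    for user, lat, lon in tweets:
        code = cell_to_postal_code.get(cell(lat, lon))
        if code is not None:  # ignore tweets outside the known cells
            visits[user][code] += 1
    homes = {user: counts.most_common(1)[0][0] for user, counts in visits.items()}
    return Counter(homes.values())


# Toy data (hypothetical): (user, latitude, longitude).
tweets = [
    ("ana", 40.41691, -3.70349),
    ("ana", 40.41702, -3.70342),
    ("ana", 40.41000, -3.68400),
    ("luis", 40.41004, -3.68401),
]
# Hypothetical lookup from grid cell to postal code.
lookup = {(40.417, -3.703): "28013", (40.41, -3.684): "28009"}

residents = residents_by_postal_code(tweets, lookup)
print(dict(residents))
```

The resulting counts are what step 3 would plot on a map.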
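Sketch for slides 55, 58 and 61 (combined demographic and geographic recommendation). The slides do not say how the two matrices are combined, so the code below makes an explicit assumption: each row of counts is normalized to shares, and the two shares are multiplied per destination postal code. All counts are invented, and the function names are illustrative only.

```python
def shares(counts):
    """Normalize card counts per destination postal code to fractions of the total."""
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def recommend(demographic_counts, geographic_counts, top=2):
    """Rank destination postal codes by the product of both shares (assumed rule)."""
    demo = shares(demographic_counts)
    geo = shares(geographic_counts)
    scores = {code: demo[code] * geo[code] for code in demo.keys() & geo.keys()}
    # Sort by descending score; ties are broken by postal code for determinism.
    return sorted(scores, key=lambda code: (-scores[code], code))[:top]


# Toy rows for one merchant category (hypothetical numbers).
# Cards of profile "male, 36-45" per destination postal code:
by_profile = {"28013": 50, "28001": 30, "28045": 20}
# Cards of residents of postal code 28045 per destination postal code:
by_residence = {"28045": 60, "28013": 30, "28001": 10}

print(recommend(by_profile, by_residence))
```

Multiplying shares favors destinations that are popular with both groups; an average or a weighted sum would be an equally plausible choice under the same data.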