INEGI ESS big data workshop
-
Upload
abel-alejandro-coronado-iruegas -
Category
Data & Analytics
-
view
183 -
download
0
Transcript of INEGI ESS big data workshop
![Page 1: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/1.jpg)
Twitter: A source of Big Data in a NSO.
Experimental Indicators in Sentiment and MobilityThe ESS Big Data Workshop 2016
![Page 2: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/2.jpg)
Objective
To share the experience of INEGI in the use of Twitter as a big data source.
![Page 3: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/3.jpg)
Types of Big Data Sources• Meters and smart sensors. Traffic cameras, GPS devices, power
consume meters, IoT, smartwatches, smartphones, etc.• Social interactions. Conversations and publications on social
networks like Twitter, Facebook, FourSquare, etc.• Business transactions. Credit cards movements, scanned data,
cell phone records, etc.• Electronic files. Documents which are available in electronic
formats such as PDF files, websites, videos, audio, images, photos, etc.
• Broadcast media. Digital video and audio streamed on real-time
![Page 4: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/4.jpg)
Types of Big Data Sources• Meters and smart sensors. Traffic cameras, GPS devices,
power consume meters, IoT, smartwatches, smartphones, etc.• Social interactions. Conversations and publications on social
networks like Twitter, Facebook, FourSquare, etc.• Business transactions. Credit cards movements, electronic
cash registers, cell phone records, etc.• Electronic files. Documents which are available in electronic
formats such as PDF files, websites, videos, audio, digital media broadcasting
• Broadcast media. Digital video and audio streamed on real-time
![Page 5: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/5.jpg)
Process Followed by INEGI (until now)
![Page 6: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/6.jpg)
Production process
![Page 7: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/7.jpg)
Study Case
• Initial objective of INEGI’s Big Data Project: To generate experimental indicators using Big Data techniques with social media data, to complement statistical information obtained from traditional methods and sources.
• Initial Goal: To obtain indicators of subjective wellbeing from social media data sources.
![Page 8: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/8.jpg)
Why did we choose Twitter as a data source?
• It’s a widely adopted social network where you can find content written by common people
• Tweets are public, so we can use them without concerns about privacy
• There is a free API which allows to get up to 1% of the tweets that are being produced on real time
• ( https://dev.twitter.com/streaming )
![Page 9: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/9.jpg)
In the beginning…
February 2014 – October 201524/7
![Page 10: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/10.jpg)
Collection / Analysis infrastructureSince October 2015 - …
(EMC / VMWare)
LOGSTASHLocation Query
Free Access
https://twittercommunity.com/t/stream-filter-sample-size/30865
Apache SparkClean & Sentiment
Analysis
TweetsDaily Processing(4 a.m.)300 K Geo-Tweets
MinimalRepresentation
~260 Millionsof Geo-Tweets ~130 Millions inside Mexico~ 2 Years and 8 Months ~ 24/7
![Page 11: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/11.jpg)
Multivariate Stratification of blocks in Apache Spark
%Acceso a Internet, %Pc, %Telefono Celular, %Automovil
![Page 12: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/12.jpg)
Software Stack (2016)
![Page 13: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/13.jpg)
Tweet Structure
![Page 14: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/14.jpg)
Tweets Map
![Page 15: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/15.jpg)
Tweets Map
![Page 16: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/16.jpg)
We found that
• The JSON structure is easy to process (APACHE SPARK)
• The content is text that we can examine to make the sentiment analysis
• Geographical coordinates can be used to filter the tweets and obtain only those of interest (warning: not all the tweets are geo referenced)
• We can make mobility analysis, based in Tweets’ location and time (a serendipity)
![Page 17: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/17.jpg)
Exploring Big Data Base
![Page 18: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/18.jpg)
Supervised Training
Manual Tagging
5000 people (TecMilenio students), 100 tweets tagged by each one,each tweet was tagged nine times,about 40,000 different tagged tweets,interpreted accordingly to regional idioms
http://cienciadedatos.inegi.org.mx/animotuitero/
They’re not enchiladas!
![Page 19: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/19.jpg)
Training and modeling Collaborative Research
PythonImplementation
![Page 20: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/20.jpg)
Sentiment Visualization (2015)(Monthly Indicator)
http://www.inegi.org.mx/inegi/contenidos/investigacion/experimentales/animotuitero/default.aspx
{JSON:File}
![Page 21: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/21.jpg)
Sentiment Visualization (Late 2016)(Daily Indicator)
C#{RESTful:API}
{NoSQL}
![Page 22: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/22.jpg)
Mobility (Late 2016)
![Page 23: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/23.jpg)
Integration of other sourcesBig Spatial Join
(5 M Economic Units with +60 M Tweets)
https://github.com/syoummer/SpatialSpark
![Page 24: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/24.jpg)
Some applications• Tourism• Migration• Use of roads• Regional influence of big cities• Mobility patterns• Business activity patterns• Subjective wellbeing• Inequities impact
• Impact analysis of relevant news
• Mental health • Misogynist/discriminatory
language use• SDG indicators?
![Page 25: INEGI ESS big data workshop](https://reader036.fdocuments.net/reader036/viewer/2022062523/58a284031a28ab891a8b6839/html5/thumbnails/25.jpg)
Collaboration• International
– UNECE• ICHEC
– UNSD– LAMBDoop– University of Pensylvania
• National– KioNetworks– Dattlas– TecMilenio– INFOTEC– Centro Geo– CIDE– CIMAT– Sectur
• Internal– INEGI General Directorates