UNIVERSIDAD POLITÉCNICA DE MADRID
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS AGRÓNOMOS
ANÁLISIS Y MODELIZACIÓN DE LA DINÁMICA EMERGENTE DURANTE EL PROCESO DE
DIFUSIÓN DE INFORMACIÓN EN LAS REDES SOCIALES DE INTERNET
ALFREDO JOSÉ MORALES GUZMÁN
Ingeniero en Telecomunicación
Máster en Física de Sistemas Complejos
TESIS DOCTORAL
2014
GRUPO DE SISTEMAS COMPLEJOS
ESCUELA TECNICA SUPERIOR DE INGENIEROS AGRONOMOS
ANALYZING AND MODELING THE EMERGENT DYNAMICS DURING THE INFORMATION
DIFFUSION PROCESS ON INTERNET SOCIAL NETWORKS
ALFREDO JOSE MORALES GUZMAN
Telecommunications Engineer
MSc in Physics of Complex Systems
Advisor:
ROSA MARIA BENITO ZAFRILLA
PhD in Chemical Sciences
2014
To my mother Kalena, for being my example
ACKNOWLEDGMENTS
First of all, I want to thank Dr. Rosa María Benito Zafrilla for her tireless work as advisor
of this thesis. Over these years, with great patience and tenacity, she has firmly taught me
the work of scientific research and the standards of excellence. I will be especially and
infinitely grateful to her for giving me that first opportunity which, without my being aware
of it, changed the course of my life forever.
I also want to thank my professors, collaborators and colleagues at the Grupo de Sistemas
Complejos of the Universidad Politécnica de Madrid. Without their teachings, contributions,
advice and support, the work carried out during these years would not have been the same.
With special affection I would like to remember the professors Juan Carlos Losada, Werner
Creixell (visiting), Javier Galeano, Ramón Alonso, Miguel A. Porras and Ana Tarquis, as well
as my laboratory colleagues Javier Borondo, Fabio Revuelta, Izaskun Oregui, Pedro Benítez,
Henar Hernández, Johan Martínez and Maxi Fernández. In addition, I must thank the
Universidad Politécnica de Madrid for awarding me the UPM-BSCH scholarship, without which
the completion of this work would have been impossible.
I would likewise like to thank the members of the New England Complex Systems Institute,
where I had the pleasure of completing a research stay. In particular, I would like to thank
Prof. Yaneer Bar-Yam for giving me the opportunity to collaborate with the institute, as well
as Prof. Hiroki Sayama for his contributions to the research work. I would also like to
remember my coworkers: Debra Gorfine, Francisco Prieto, Joe Norman, Maya Bialik, Vaibhav
Vavilala, Molly Wexler-Romig, Vincent Wong, Lili and Katriel Friedman.
I also want to thank my collaborators at United Nations Global Pulse, Telefónica Digital and
the Centro de Innovación en Tecnología para el Desarrollo Humano of the Universidad
Politécnica de Madrid, for giving me the opportunity to work with and learn from them on a
joint project. In particular, I want to thank and remember Miguel A. Luengo-Oroz, David
Pastor, Yolanda Torres, Vanessa Frías-Martínez and Enrique Frías-Martínez.
In addition, I want to thank all the people, friends and relatives who accompanied me on this
long journey. First, I want to remember my father, parents-in-law, siblings, siblings-in-law,
grandmother, aunts, nieces and nephews, and cousins, whose unconditional affection gave me
the strength needed to set out on this path. I also want to thank my lifelong friends Zhandra,
Edu, Patricia, Sergio, Laura, Andrei, Iuri, César and Carolina, whose support and company
made the journey more pleasant.
Finally, I want to give my absolute thanks to my wife, Vanessa Pechiaia, honorary coauthor of
this thesis. Her inexhaustible support and love were the foundation for carrying out this
work. To her, my deepest gratitude for having made this another of the best stages of my
life. Lastly, I must say with great honor that this work is dedicated to my mother, the
fundamental pillar of my life. She was the first person to encourage me to take this path and
to give me her absolute confidence that I would walk it successfully. Having no words capable
of expressing my profound admiration, I will be eternally grateful to her for being my role
model and constant source of inspiration.
From the bottom of my heart, thank you all.
SUMMARY
In the course of daily activity, today's society constantly interacts through electronic
devices and telecommunication services, such as the telephone, e-mail, banking transactions
or Internet social networks. Without knowing it, we leave massive traces of our activity in
the databases of service providers. These new data sources have the dimensions needed to
observe patterns of human behavior at large scales. As a result, there has been a recent and
unprecedented explosion of studies of social systems driven by data analysis and
computational processes.
In this thesis we develop computational and mathematical methods to analyze social systems
through the combined study of data derived from human activity and the theory of complex
networks. Our goal is to characterize and understand the systems that emerge from social
interactions in new technological spaces, such as the social network Twitter and mobile
telephony. We analyze these systems by constructing complex networks and time series,
studying their structure, functioning and evolution over time. We also investigate the nature
of the observed patterns through the mechanisms that govern the interactions between
individuals, and we measure the impact of critical events on the behavior of the system. To
this end, we have proposed models that explain the global structures and the emergent
dynamics with which information flows through the system.
For the studies of the social network Twitter, we have based our analyses on specific
conversations, such as political protests, major events or electoral processes. From the
messages of the conversations, we identify the participating users and build networks of
interactions among them. Specifically, we build one network to represent who receives whose
messages and another to represent who propagates whose messages. In general, we have found
that these structures have complex properties, such as explosive growth and scale-free degree
distributions. Based on the topology of these networks, we have identified three types of
users that determine the flow of information according to their activity and influence.
To measure the influence of users in the conversations, we have introduced a new measure
called user efficiency. Efficiency is defined as the number of retransmissions obtained per
message sent, and it measures the effect of individual efforts on the collective reaction. We
have observed that the distribution of this property is ubiquitous across several Twitter
conversations, regardless of their size or context. We therefore suggest that there is
universality in the relationship between individual efforts and collective reactions on
Twitter. To explain the factors that determine the emergence of the efficiency distribution,
we have developed a computational model that simulates the propagation of messages on the
Twitter social network, based on the mechanism of independent cascades. This model allows us
to measure the effect on the efficiency distribution of both the topology of the underlying
social network and the way users send messages. The results indicate that the emergence of a
select group of highly efficient users depends on the heterogeneity of the underlying network
and not on individual behavior.
In addition, we have developed techniques to infer the degree of political polarization in
social networks. We propose a methodology to estimate opinions in social networks and to
measure the degree of polarization in the opinions obtained. We have designed a model in which
we study the effect of the opinion of a small group of influential users, called the elite, on
the opinions of the majority of users. The model yields a distribution of opinions on which
we measure the degree of polarization. We apply our methodology to measure polarization in
message diffusion networks during a Twitter conversation in a politically polarized society.
The results obtained show a high correspondence with offline data. With this study, we have
shown that the proposed methodology can determine different degrees of polarization depending
on the structure of the network.
Finally, we have studied human behavior using mobile phone data. On the one hand, we have
characterized the impact of natural disasters, such as floods, on collective behavior. We find
that communication patterns are abruptly altered in the areas affected by the catastrophe. We
thereby demonstrate that the impact on the region could be measured almost in real time and
without deploying efforts on the ground. On the other hand, we have studied patterns of human
activity and mobility to characterize the interactions between regions of a developing
country. We find that the networks of calls and human trajectories have community structures
associated with regions and urban centers.
In summary, we have shown that it is possible to understand complex social processes through
the analysis of human activity data and the theory of complex networks. Throughout the thesis,
we have verified that social phenomena such as influence, political polarization or the
reaction to critical events are reflected in the structural and dynamical patterns of the
networks built from data on conversations in Internet social networks or mobile telephony.
ABSTRACT
During our daily routines, we constantly interact with electronic devices and
telecommunication services. Without realizing it, we leave massive traces of our activity in
the service providers' databases. These new data sources have the dimensions required to
enable the observation of human behavioral patterns at large scales. As a result, there has
been an unprecedented explosion of data-driven social research.
In this thesis, we develop computational and mathematical methods to analyze social systems
by means of the combined study of human activity data and the theory of complex networks. Our
goal is to characterize and understand the systems that emerge from human interactions in new
technological spaces, such as the online social network Twitter and mobile phones. We analyze
these systems by constructing complex networks and time series, and by studying their
structure, functioning and temporal evolution. We also investigate the nature of the observed
patterns through the mechanisms that rule the interactions among individuals, as well as the
impact of critical events on the system's behavior. For this purpose, we have proposed models
that explain the global structures and the emergent dynamics of information flow in the
system.
In the studies of the online social network Twitter, we have based our analysis on specific
conversations, such as political protests, important announcements and electoral processes.
From the messages related to the conversations, we identify the participating users and build
networks of interactions among them. Specifically, we build one network to represent
who-receives-whose-messages and another to represent who-propagates-whose-messages. In
general, we have found that these structures have complex properties, such as explosive growth
and scale-free degree distributions. Based on the topological properties of these networks,
we have identified three types of user behavior that shape the information flow dynamics
through their influence.
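As a rough illustration of the two networks described above, the following Python sketch builds them from a list of messages. The input format (pairs of author and retweeted author, plus a `followers` dictionary), the function name `build_networks` and the edge direction (from the source of a message to the user who receives or propagates it) are assumptions made for this example and are not taken from the thesis.

```python
from collections import Counter


def build_networks(messages, followers):
    """Build the two interaction networks of a conversation.

    messages:  iterable of (author, retweeted_author_or_None) pairs
               (hypothetical input format).
    followers: dict mapping a user to the set of users who follow them
               (hypothetical input format).
    """
    messages = list(messages)

    # Participants are users who posted or whose messages were retweeted.
    participants = {author for author, _ in messages}
    participants |= {src for _, src in messages if src is not None}

    # Who-receives-whose-messages: edge (source, receiver) when the
    # receiver follows the source and both take part in the conversation.
    receives = {
        (src, follower)
        for src in participants
        for follower in followers.get(src, ())
        if follower in participants
    }

    # Who-propagates-whose-messages: weighted edge (source, retweeter),
    # where the weight is the number of retweets along that edge.
    propagates = Counter(
        (src, author) for author, src in messages if src is not None
    )
    return participants, receives, propagates
```

Keeping the retweet network weighted preserves how often one user propagates another, which a strength distribution would need; the followers network is left unweighted because following is a binary relation.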
In order to measure the users' influence on the conversations, we have introduced a new
measure called user efficiency. It is defined as the number of retransmissions obtained per
message posted, and it measures the effect of individual activity on the collective reaction.
We have observed that the probability distribution of this property is ubiquitous across
several Twitter conversations, regardless of their size or social context. Therefore, we
suggest that there is a universal behavior in the relationship between individual efforts and
collective reactions on Twitter. In order to explain the different factors that determine the
user efficiency distribution, we have developed a computational model to simulate the
diffusion of messages on Twitter, based on the mechanism of independent cascades. This model
allows us to measure the impact on the emergent efficiency distribution of the underlying
network topology, as well as of the way that users post messages. The results indicate that
the emergence of an exclusive group of highly efficient users depends upon the heterogeneity
of the underlying network rather than on individual behavior.
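The definition of user efficiency and the independent-cascade mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model developed in the thesis: the single retransmission probability `p`, the `followers` dictionary format, the function names, and the choice to credit every retweet in a cascade to the original author are all hypothetical.

```python
import random
from collections import defaultdict


def user_efficiency(messages_posted, retweets_obtained):
    """Retweets obtained per message posted, for users who posted at least once."""
    return {
        user: retweets_obtained.get(user, 0) / n_posts
        for user, n_posts in messages_posted.items()
        if n_posts > 0
    }


def independent_cascade(followers, seed_user, p, rng):
    """Spread one message from seed_user over the followers network.

    Each newly active user gets a single chance to activate each inactive
    follower, succeeding with probability p (an assumed, uniform parameter).
    Returns the set of users who retweeted, excluding the seed.
    """
    active = {seed_user}
    frontier = [seed_user]
    while frontier:
        next_frontier = []
        for user in frontier:
            for follower in followers.get(user, ()):
                if follower not in active and rng.random() < p:
                    active.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return active - {seed_user}


def simulate_efficiency(followers, posts_per_user, p, seed=0):
    """Run one cascade per posted message and return each user's efficiency.

    Assumption: every retweet in a cascade counts for the original author.
    """
    rng = random.Random(seed)
    retweets = defaultdict(int)
    for user, n_posts in posts_per_user.items():
        for _ in range(n_posts):
            retweets[user] += len(independent_cascade(followers, user, p, rng))
    return user_efficiency(posts_per_user, retweets)
```

Running `simulate_efficiency` on followers networks with different degree heterogeneity, while holding `posts_per_user` fixed, is one way to separate the effect of topology from that of individual posting behavior.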
Moreover, we have developed techniques to infer the degree of polarization in social
networks. We propose a methodology to estimate opinions in social networks and to measure the
degree of polarization in the obtained opinions. We have designed a model to study the effect
of the opinions of a small group of influential users, called the elite, on the opinions of
the majority of users. The model yields an opinion distribution on which we measure the
degree of polarization. We apply our methodology to measure the polarization on graphs built
from the message diffusion process during a Twitter conversation in a polarized society. The
results are in very good agreement with offline and contextual data. With this study, we have
shown that our methodology is capable of detecting several degrees of polarization depending
on the structure of the networks.
Finally, we have also inferred human behavior from mobile phone data. On the one hand, we
have characterized the impact of natural disasters, such as flooding, on collective behavior.
We found that communication patterns are abruptly altered in the areas affected by the
catastrophe. Therefore, we demonstrate that we could measure the impact of the disaster on
the region almost in real time and without needing to deploy further efforts. On the other
hand, we have studied human activity and mobility patterns in order to characterize regional
interactions in a developing country. We found that the call and trajectory networks present
community structure associated with regional and urban areas.
In summary, we have shown that it is possible to understand complex social processes by
analyzing human activity data with the theory of complex networks. Throughout the thesis, we
have demonstrated that social phenomena such as influence, polarization and reaction to
critical events are reflected in the structural and dynamical patterns of the networks
constructed from data on conversations in online social networks and mobile phones.
Contents
1 INTRODUCTION
1.1 Goals
1.2 Organization
2 COMPLEX NETWORKS
2.1 Definitions
2.2 Topological Properties
2.2.1 Degree Distribution
2.2.2 Geodesic Distance
2.2.3 Clustering
2.3 Types of Networks
2.3.1 Regular Networks
2.3.2 Random Networks
2.3.3 Small World Networks
2.3.4 Scale-free Networks
2.4 Community Structure
2.4.1 Detection Algorithms
2.5 Assortativity
2.6 Networks Models
2.6.1 Erdős-Rényi Model
2.6.2 Watts and Strogatz Model
2.6.3 Barabási-Albert Models
2.7 Dynamics on Networks
2.7.1 Disease Contagion
2.7.2 Social Contagion
2.7.3 Cascades on Networks
2.8 Social Networks
2.9 Time Varying Networks
3 COMPUTATIONAL SOCIAL SCIENCE
3.1 Human Activity
3.2 Socio-Technological Networks
3.3 Information Spreading
3.3.1 Models
3.4 Influence and Popularity
3.5 Polarization
4 DIGITAL TRACES AND COMPUTATIONAL METHODS
4.1 From Data to Knowledge
4.1.1 Data Management
4.1.2 Finding Patterns
4.1.3 Statistical Significance
4.2 Twitter Datasets
4.2.1 Data Gathering
4.2.2 Datasets
4.2.3 Representativity
4.3 Mobile Phones Datasets
4.3.1 Datasets
4.4 Additional Sources of Information
5 HUMAN BEHAVIOR DURING POLITICAL MOBILIZATION
5.1 Temporal Behavior
5.2 Individual Behavior
5.3 Followers Network
5.4 Retweets Network
5.5 Degree Assortativity
5.6 Retweet Cascades
5.7 Analysis of User Behavior
5.8 Mesoscale Communities
5.9 Summary
6 EFFICIENCY OF HUMAN ACTIVITY AS A MEASURE OF INFLUENCE
6.1 User Efficiency
6.2 Universality
6.3 Model
6.4 Results
6.5 Analytical Solution
6.6 Summary
7 MEASURING POLITICAL POLARIZATION
7.1 A Model to Estimate Opinions in a Social Network
7.2 A Measure of Polarization
7.3 Study of Polarization on Retweet Networks
7.3.1 Retweets Networks
7.3.2 Elite nodes
7.3.3 Estimating Opinions
7.3.4 Contagion by Influence
7.3.5 Offline Polarization
7.3.6 Discussion
7.4 Summary
8 URBAN COLLECTIVE PATTERNS
8.1 World Activity
8.2 Urban Dynamics
8.3 Dynamical Classes of Behavior
8.4 Summary
9 INFERRING HUMAN BEHAVIOR FROM MOBILE PHONE DATA
9.1 Characterizing Communication and Mobility Patterns in a Developing Country
9.1.1 Context
9.1.2 Characterizing Populated Areas
9.1.3 Ethnic Interactions
9.1.4 Effects of Selectiveness in the Calling Behavior
9.1.5 Summary
9.2 Flooding through the Lens of Mobile Phone Activity
9.2.1 Context
9.2.2 Assessing the Representativeness of CDR data
9.2.3 Population Response to Floods
9.2.4 Summary
10 Conclusions
A User Behavior
B Videos
List of Figures
2.1 Homogeneous vs. power-law distributions. (a) A homogeneous function and
a power-law function with γ = 2.1. Both distributions have 〈k〉 = 10. The
curves are shown on a linear plot in (a) and on a log-log plot in (b). (c) A
random network with 〈k〉 = 3 and N = 50. (d) A scale-free network with
〈k〉 = 3. Figure adapted from [Bar12].
2.2 Complementary cumulative degree distributions for six different networks.
(a) Collaboration network of mathematicians [GI95]; (b) Citations between
1981 and 1997 to papers cataloged by the Institute for Scientific Information
[Red98]; (c) A 300 million vertex subset of the World Wide Web, circa 1999
[BKM+00]; (d) The Internet at the level of autonomous systems, April 1999
[CCG+02]; (e) The power grid of the western United States [WS98]; (f) The
interaction network of proteins in the metabolism of the yeast S. cerevisiae
[JMBO01]. (c), (d) and (f) appear to have power-law degree distributions,
and (b) has a power-law tail but deviates from it at small degree. (e)
has an exponential degree distribution and (a) appears to have two
separate power-law regimes with different exponents. Figure adapted from
[New03b].
2.3 A simple graph with three communities, enclosed by the dashed circles. Figure
taken from [For10].
2.4 (a) Schematic of the Watts-Strogatz model. (b) Normalized average shortest
path length L and clustering coefficient C as a function of the random rewiring
parameter p for the Watts-Strogatz model with N = 1000 and k = 10. Figure
taken from [WS98].
2.5 (A) Degree distribution of networks generated by the Barabási-Albert model
in linearly-binned (red symbols) and log-binned (green symbols) versions. The
number of edges per new node is m = 3. Size of (A) N = 100,000, (B) N = 100,
(C) N = 10,000 and (D) N = 1,000,000. The straight line has slope γ = 3,
corresponding to the degree distribution of the resulting networks. Figure
adapted from [Bar12].
2.6 Comparison of disease spreading on a homogeneous random graph and a scale-free
network. The fraction of infected nodes displays a distinct phase transition
(or epidemic threshold) in the case of a homogeneous random graph, but not
for the scale-free network. Figure taken from [Wat04].
2.7 Schematic representation of a cascade on a network. The red and yellow nodes
belong to the cascade. The white nodes belong to the network but are not
part of the cascade. The cascade layers have been marked in gray.
2.8 Schematic representation of the activity-driven network creation model. Red
nodes show the active nodes at each time T. The bottom plot represents the
final aggregated structure of the network. This figure has been adapted from
[PGPSV12].
3.1 Bursts of individual activity on an e-commerce site. In the left panel we
represent the temporal behavior of four individuals, showing that bursts of
activity (color stripes) coexist with long periods of inactivity (white periods).
The x-axis represents time and the colored lines represent individual actions. In
the right panel we show the distribution of inter-action waiting times for each
of the four users. Figure adapted from [ZCH+12].
3.2 Collective response to a critical event. In the top panel we show the emergent
networks between affected users at three different times during an event. In the
bottom panel we show the call pattern between the same users a week before
the event, indicating that the cascades observed during the event are
extraordinary. Figure adapted from [BWB11].
3.3 Emergent networks from the propagation of four videos on Twitter. In panels
(A) and (B) the local influential leaders played a remarkable role in the
diffusion process, whereas in panels (C) and (D) the influence of hubs was
much stronger. Figure adapted from [DO14].
4.1 Temporal evolution of Twitter activity (messages/hour) corresponding to datasets:
(A) 20N, (B) Egypt, (C) Obama and (D) Chavez, described in Table 4.1. All
panels display the impact of events on Twitter activity. All four present a
burst of activity when the event takes place, which gradually decreases to
previous levels. Panels (A), (B) and (C) have similar patterns despite spanning
three orders of magnitude on the y-axis. The envelope curve in panel (D)
presents the same pattern across a different time scale: the gradual decrease
of activity spans several days. The inset curve corresponds to the activity
during the green shaded area, on a linear scale.
5.1 Top: Time evolution of the message rate (messages/minute) of the Venezuelan
protest #SOSInternetVE. Arrows indicate some of the times when the
protest convoker participated. Bottom: Time evolution of the accumulated
percentage of messages (dashed line) and participant users (solid line).
5.2 Complementary cumulative distribution of the user activity during the Venezuelan
protest #SOSInternetVE. Solid line is the fit to an exponentially truncated
power law, $P(x > x^*) \propto x^{-\beta} e^{-x/c}$, where $\beta = 0.880 \pm 0.001$
and $c = 65.0 \pm 0.6$ at the last day.
5.3 In (top) and out (bottom) degree complementary cumulative distributions of
the followers network from the Venezuelan protest #SOSInternetVE.
5.4 Scatter plot of in and out degree of the followers network from the Venezuelan
protest #SOSInternetVE. Dots represent users.
5.5 In (top) and out (bottom) strength complementary cumulative distributions
of the retransmission network of the Venezuelan protest #SOSInternetVE.
Solid line is the fit to an exponentially truncated power law,
$P(S_{out} > S^*_{out}) \propto S_{out}^{-\beta} e^{-S_{out}/c}$, where
$\beta = 0.890 \pm 0.002$ and $c = 61.0 \pm 1.2$.
5.6 Edge weight complementary cumulative distribution of the retransmission
network from the Venezuelan protest #SOSInternetVE.
5.7 Visualization of the retweet network that emerges from message propagation
on the followers network. (A) Subgraph of the retweet network (green)
superimposed on the corresponding followers network (black), from the
#SOSInternetVE dataset. A random subset of 1000 nodes (yellow and red)
is shown. The node size is proportional to the respective in degree on
the followers network. (B, C and D) Example of the formation of the retweet
network from independent retweet cascades on an artificial followers network.
(B) shows two users (red nodes) posting independent messages, which are
received by their followers (gray). (C) shows some users retweeting the
message (yellow), which then reaches their followers (gray). (D) shows
the final shape of the cascades on the network, composed only of the activated
nodes (red and yellow) connected by the green links. The white nodes
and gray links represent the rest of the substratum (followers network) that
did not activate. (E) shows the schema of a single cascade. The black circles
mark the cascade layers.
5.8 Statistical properties of retweet cascades. (A) Complementary cumulative
density function of the number of users per cascade, (B) cascade depth
distribution $P(d)$ and (C) retransmission rate by layer $\lambda_l$ in terms of
retweets over followers. The data correspond to the #SOSInternetVE dataset.
5.9 Analysis of the user behavior. (A) Scatter plot of retransmissions obtained
per user versus the user's activity, colored by number of followers. (B) Scatter
plot of retransmissions obtained per user versus number of followers,
colored by activity. (C) Scatter plot of retransmissions obtained per user
versus the ratio between the number of followers and followees, colored
by activity. (D) Scatter plot of retransmissions made per user versus
number of followers, colored by activity. Dots represent users. Data
correspond to the #SOSInternetVE dataset.
5.10 Community structure for the follower graph. Circles represent communities of
users and their size is proportional to the number of users that belong to the
community. Edges represent the inter-community links, either followers (Left)
or retransmissions (Right), and their width is proportional to the number of
edges, normalized by the size of the outgoing community. The data correspond
to the #SOSInternetVE dataset.
5.11 Community structure for the retransmission graph. Nodes represent communities
and edges represent the inter-community links. The node sizes are
proportional to the number of people in each community and
the edge widths are proportional to the number of inter-community links
normalized by the size of the community. The data correspond to the
#SOSInternetVE dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.1 Scatter plot of the user in degree vs out degree in the followers network, colored
by the respective user efficiency. Dots represent users. Data correspond to
the #SOSInternetVE dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 User efficiency probability density function (A) and complementary cumula-
tive density function (B). The red dots correspond to the empirical results,
the black solid line represents the lognormal fit and the black dashed line
represents a power law fit. Quantile-Quantile plot (C) of the user efficiency
distribution, filtered by the in degree in the followers network K^F_in. The
distributions correspond to the #SOSInternetVE dataset. . . . . . . . . . . . 81
6.3 Complementary cumulative density function of the user activity, from sev-
eral Twitter conversations, increasingly ordered according to the number of
messages (A-F): (A) Andreafabra, (B) Gringich, (C) Leones, (D) 20N, (E)
Obama, and (F) Egypt. The black dashed line represents a power law fit and
the red dots correspond to the measured distributions. . . . . . . . . . . . . 84
6.4 Complementary cumulative density function of the retweets obtained by user,
from several Twitter conversations, increasingly ordered according to the num-
ber of messages (A-F): (A) Andreafabra, (B) Gringich, (C) Leones, (D) 20N,
(E) Obama, and (F) Egypt. The black dashed line represents a power law fit
and the red dots correspond to the measured distributions. . . . . . . . . . . 85
6.5 Probability density function of the user efficiency on several Twitter conver-
sations, ordered increasingly according to the number of messages (A-F): (A)
Andreafabra, (B) Gringich, (C) Leones, (D) 20N, (E) Obama, and (F) Egypt.
The properties of these conversations may be found in Table 6.1. The black
solid line represents the lognormal fit, the black dashed line represents a power
law fit and the red dots correspond to the measured distributions. . . . . . . 86
6.6 Model results for the user efficiency distribution (left column) and retweets
gained by user distribution (right column), compared with the empirical results. The
model has been applied to the followers network from the #SOSInternetVE
dataset (top panel) and the #20N dataset (bottom panel). . . . . . . . . . . 87
6.7 Effects of the underlying network topology on the model results in terms of the
user efficiency distribution (left column) and retweets gained by user distribu-
tion (right column). The model has been applied to the followers network (blue
crosses) and their randomized versions (red x symbols). Two datasets have
been considered: #SOSInternetVE (top panel) and #20N (bottom panel).
In all cases, a heterogeneous initial activity distribution P(A0) ∝ A0^(−1.4) has
been considered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.8 Effects of the individual user behavior on the model results in terms of the user
efficiency distribution (left column) and retweets gained by user distribution
(right column). The model has been applied to the followers network (blue
crosses) and their randomized versions (red x symbols). Two datasets have
been considered: #SOSInternetVE (top panel) and #20N (bottom panel). In
all cases, a homogeneous activity distribution P(A0) = 1/6, where A0 ∈ [1, 6],
has been considered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.9 Results from the analytical model of user efficiency, considering cascades up
to three layers of depth in the followers network from the #SOSInternetVE
dataset. Resulting η average (A) and standard deviation (B) from evaluating
the model with 0.2 < P (d > 0) < 1.0 (x-axis) and 0.05 < r0 < 0.3 (color). The
dashed lines indicate the empirical values. (C) Resulting η distribution from
applying the analytical model to the followers network with the empirical
activity distribution P (A0) by setting P (d > 0) = 0.775 and r0 = 0.15.
The white dots represent the empirical distribution of user efficiency and the
triangles represent the distribution obtained from the analytical model. . . . 93
7.1 Schema explaining the proposed polarization index µ. (A) Density distribu-
tion of opinions. gc stands for the gravity center of each pole, A stands for
the population associated to each ideology, and d stands for the pole distance.
(B) Visualization of the polarization index, µ, for three different situations. . 99
7.2 Schema of the influence spreading process in the opinion estimation model.
(A) Displays the seed nodes in the network, colored according to their re-
spective ideology. (B) Displays the network at t = 0, before seeds start to
propagate their influence. (C) Shows the state of the network at t = 1. (D)
shows the state of the network at t = n/2. (E) Displays the final state of the
network at t = n. (F) and (G) Visualizations of two example results
of the opinion estimation model applied to the Venezuelan dataset for nonpolarized
(F) and polarized (G) days. See video B.1, described in Appendix B. . . . . 101
7.3 Visualization of the retweet network at day D − 29. The Giant Component
has been colored in blue and red, while the rest of components have been
colored in gray. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.4 (Left) Component size distributions of the retweet networks from the
Twitter conversation about the Venezuelan President Hugo Chavez for three
days: D − 29, D and D + 20, where D represents the day of the main occurrence.
(Right) Time evolution of the Giant Component (GC) of the retweet
networks: (A) Ratio between the number of nodes that make up the GC and
the number of nodes in the respective networks. (B) Time evolution of the
whole network and GC size in terms of nodes. (C) Relative number of messages
inside Venezuela from the geolocated users in the GC. The orange
stripe represents the day D and the state funeral period. . . . . . . . . . . . 104
7.5 Visualization of geolocated messages from the Chavez conversation on three
days from different periods: before the announcement (top), during the
announcement (middle), after the announcement (bottom). The dots represent
geolocated messages. The label indicates the day of observation, D being
the day of the announcement. . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.6 Evolution of the topological properties of the retweet networks emerging on
each day of the observation period, in terms of: (A) Out strength comple-
mentary cumulative distribution, (B) In strength complementary cumulative
distribution, (C) Gini index evolution of the strength distributions. (D) Di-
rected degree assortativity evolution. The orange stripe represents the day of
the main occurrence. In A and B, the blue curves correspond to the first days
and the red curves correspond to the last days. . . . . . . . . . . . . . . . . . 107
7.7 Conditioned probability density function of the accumulated in-strength (Sin)
given the participation rate (ρ), from the Twitter conversation about the
Venezuelan President Hugo Chavez. The color corresponds to the density of
users. The red line indicates the average accumulated in-strength value Sin
for a given participation rate ρ. . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.8 Adjacency matrices (top) and corresponding visualization (bottom) of the
considered elite networks. (A) Corresponds to the seed with Sin ≥ 10000
and ρ ≥ 0. (B) Corresponds to the seed with Sin ≥ 1000 and ρ ≥ 0.89.
(C) Corresponds to the seed with Sin ≥ 10 and ρ ≥ 0.82. Nodes have been
sorted in ascending order of their opinions Xs. The color indicates the
average value of the nodes’ opinions Xij at both sides of the edge i−j. . . . 112
7.9 Visualization of two cases of possible retweet networks and expected outcomes.
The top row represents a polarized case and the bottom row represents a
nonpolarized case. Panels A and E show the position of the elite nodes,
colored in each network. Panels B and F show the respective networks,
coloring the nodes with their estimated opinion. Panels C and G show the
opinion adjacency matrices AXij. The colored dots in the matrices represent
interactions: blue and red dots indicate interactions within the same group;
pale blue and yellow dots indicate interactions across groups. Nodes have
been sorted in ascending order of their estimated opinion Xi. Panels D
and H represent the resulting opinion distributions. . . . . . . . . . . . . . . 113
7.10 Time evolution of estimated opinions (Xi) probability density functions (p(X))
for the Venezuelan conversation. These distributions respectively result from
applying the model to the retweet networks using the elites No. 1 (top panel),
No. 2 (middle panel) and No. 3 (bottom panel) described in section 7.3.2. La-
bels indicate the day of observation, D standing for the day of the President’s
death. Colors indicate the number of participants. . . . . . . . . . . . . . . . 115
7.11 Time evolution of the polarization index µ (C), and the variables associated
with it: pole distance d (B) and the difference in population sizes (A) for
the Venezuelan conversation in the undirected version of the networks. The
magenta line represents the average of the results from applying the model
with the three elite users from section 7.3.2. The gray shadow shows the
standard deviation. The orange stripe indicates the day of the main event. . . 116
7.12 Time evolution of the statistical properties of the Xi distribution in terms of
(A) Average, (B) Standard deviation and (C) Kurtosis. The orange stripe
represents the day of the main occurrence (D) and the state funeral period.
The magenta line represents the average of the results from applying the model
with the three elite users from section 7.3.2. The gray shadow represents the
standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.13 Time evolution of the opinion adjacency matrices AXij from the Twitter
conversation about the Venezuelan President Hugo Chavez. Nodes have been
plotted in ascending order according to their estimated opinion Xi. The label
indicates the day of observation (from D− 29 to D+ 26). The color indicates
the average value of the node’s opinions at both sides of the edge i− j. . . . 119
7.14 Effects of rewiring edges on the results of the opinion estimation model. Time
evolution of the estimated opinion (Xi) cumulative distribution functions
(CDF) resulting from applying the opinion estimation model to the undirected
networks (solid) and the corresponding rewired versions (dashed). The labels
indicate the day of observation, from D − 29 to D + 26, D being the day of the
President’s death announcement. Columns are ordered from Monday to Sunday.
The distributions for the rewired networks represent the average over 200
realizations. These curves correspond to the results from applying the model
with the elite No. 3 described in 7.3.2. . . . . . . . . . . . . . . . . . . . . . 121
7.15 Time evolution of the estimated opinions (Xi) probability density functions
(p(X)) for the Venezuelan conversation. Labels indicate the day of observa-
tion, D standing for the day of the President’s death. Colors indicate the
number of participants. These curves are the average of the results from
applying the model with the three elite users from section 7.3.2. . . . . . . . 122
7.16 Time evolution of the polarization index µ (C), and the variables associated
with it: the pole distance d (B) and the difference in population sizes (A) for
the Venezuelan conversation. The magenta line represents the average of the
results from applying the model with the three elite users from section 7.3.2.
The gray shadow shows the standard deviation. . . . . . . . . . . . . . . . . 123
7.17 Effects of the edges’ direction on the results of the opinion estimation model.
Time evolution of the estimated opinion (Xi) cumulative distribution functions
(CDF) resulting from applying the opinion estimation model to the directed
network (solid) and the undirected network (dashed). The labels indicate the
day of observation, from D − 29 to D + 26, D being the day of the President’s
death announcement. Columns are ordered from Monday to Sunday. The color
indicates the kurtosis values of the distributions. These curves are
the average of the results from applying the model with the three elite users
from section 7.3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.18 Electoral polarization in Venezuela. Distribution of voting stations by
winning party and station location in the 2013
Venezuelan Presidential elections. . . . . . . . . . . . . . . . . . . . . . . . . 126
7.19 Mass of tweets in the city of Caracas. Contour levels (from inside to outside
0.25, 0.20, 0.15, 0.10) represent the mass of tweets identified as in favor of
the government (red) and against it (blue). Areas bordered in green correspond
to the five municipalities that make up the city. White regions display
unpopulated areas, yellow regions represent populated areas and pink regions
correspond to the informal and poorer neighborhoods (slums). The label color
indicates the ruling party at each municipality, according to the 2013 Venezue-
lan local elections: red represents the officialism party at Libertador and blue
indicates opposition parties at Chacao, Sucre, Baruta and El Hatillo. . . . . 129
8.1 World Twitter Activity. Geographical density of Twitter activity (number of
tweets) during one average day in logarithmic scale. Red and orange indicate
a high concentration of activity, while blue and green indicate a lower concentration
of tweets, and black indicates the absence of activity. Insets: Average
week of Twitter activity in several cities (ac,d(t)). . . . . . . . . . . . . . . . 132
8.2 Temporal behavior of 52 cities across all continents. Series represent the
representative week of Twitter activity for each city (ac,i(t)). Color indicates
the result of the clustering classifier. . . . . . . . . . . . . . . . . . . . . . . . 136
8.3 Clustering of cities according to their temporal behavior. Colors indicate
the results of k-means clustering algorithm. Axes correspond to collapsed
dimensions using multidimensional-scaling algorithms. On the top panel we
show the average behavior of each class (from A to C). We have respectively
marked the morning and afternoon peaks of activity with a red x symbol and
a circle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
9.1 Ethno-linguistic map of Ivory Coast. Figure adapted from [Lew09] . . . . . 141
9.2 Mapping the community structure of the trajectories network of Ivory Coast.
Nodes represent antennas and are plotted in different colors and shapes, according
to the community to which they belong, as obtained from the community detection
algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
9.3 Mapping the structure of the trajectories network on the Ivory Coast geo-
graphical map. The blue lines represent the edges of the network and their
width is proportional to the edge weight. The main roads of
Ivory Coast are superimposed as red lines. The locations of the country’s
main cities are marked with black circles. . . . . . . . . . . . . . . . . . . . . 143
9.4 Mapping the closeness-centrality property of the trajectories network in Ivory
Coast. The edges have been colored according to the closeness centrality mean
value of the two connected nodes. The red regions indicate higher closeness-
centrality, the yellow and pale blue regions indicate medium centrality, and
the dark blue regions indicate lower closeness-centrality. . . . . . . . . . . . . 144
9.5 Mapping the linguistic identity of the trajectories network of Ivory Coast.
The edges have been colored according to the linguistic group to which the
most connected antenna in each community belongs. There are four major
linguistic families represented in yellow (northwest), purple (northeast), green
(southwest) and blue (southeast). Black circles indicate the location of the
major cities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
9.6 Normalized adjacency matrices of the calls network corresponding to the com-
munity structure from the trajectories network (A), ethnic group aggregation
(B) and linguistic family aggregation (C). Assortativity coefficient of selec-
tiveness to call on local scale (community), subregional scale (ethnic group)
and regional scale (linguistic family) (D). . . . . . . . . . . . . . . . . . . . . 147
9.7 Scatter plot of intra linguistic family flux (calls directed to an antenna in the
same linguistic family as the emitter antenna) versus inter linguistic family flux
(calls directed to an antenna in a different linguistic family than the emitter
antenna). Symbols represent communities from the trajectories network and
the color indicates the linguistic family to which the community belongs. The
dashed line has slope 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.8 Mapping the community structure of the calls network of Ivory Coast. Nodes
represent antennas and are plotted in different colors and shapes, according
to the community to which they belong, as obtained from the community
detection algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.9 Mapping the classification results of antennas according to the way the calls
network communities are related. A k-means clustering classifier has been
applied to the community structure of the calls network. . . . . . . . . . . . 150
9.10 Left: Visualization of the precipitation data obtained from the NASA TRMM
on November 2, 2009. The red square encloses the observed region. Right:
Accumulated rainfall during the first two weeks of November 2009 (jet colormap)
over the Tabasco area. The flood segmentation is shown by the white
shade. The area corresponds to the red square in the left panel. . . . . . . . 152
9.11 Left: Map of the 2010 census (green bars) vs. the CDR-based population
estimation (purple bars) in several cities of Tabasco (red = affected cities,
blue = other cities) and surroundings. Right: The plot shows the linear correlation
between the CDR census and the real census (R² = 0.97). . . . . . . . . . . . 153
9.12 Time evolution of the number of unique users per cell tower x(t). The gray
stripes indicate the Flood and Christmas periods where stronger variations are
observed. The labels at the top-right of each chart indicate the municipality
where the tower is located. Towers have been ordered and colored according
to the maximum degree of variation during floods in decreasing order. . . . . 155
9.13 Scheme of the Antenna Variation metric for cell towers. The black curve
represents the raw signal x(t). The gray stripe indicates the Flood period.
The red line indicates the average value (µBL) of users served during the
Baseline period. The pink stripe indicates the standard deviation (σBL) from
the average value during the Baseline period. The blue line indicates the
deviation from the average value at a given day. Our measure of antenna
variation results from the ratio of the blue line divided by the green line. . . 156
9.14 Time evolution of the Antenna Variation metric (xnorm) for the considered
towers. The gray stripes indicate the Flood and Christmas periods. Color is
proportional to the degree of variation during the flooding period. It can be
noticed that antennas have a spike of activity during the floods (left shaded
region), as well as during Christmas and New Year’s Eve. . . . . . . . . . . . 157
9.15 Impact Map of Tabasco for the 2009 floods. Circles represent antennas and
their size is proportional to the variation metric during the floods. The dark
blue segmentation represents the flooded region. The color of municipalities
is proportional to the number of affected people. The map shows the most
critical day featuring the highest values of the antenna variation metric. . . . 158
9.16 Distribution of the maximum of the antenna variation metric for the BL period
(gray) and floods (red). The curves show the percentage of antennas (y-axis)
whose maximum variation metric value (xnorm) is higher than a given value
(x-axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.17 Top: Antenna variation metric (red) vs the precipitation level (blue) for the
six hottest antennas (A to F). The dashed line shows the emergency warning
date as notified in the news. Bottom: Map featuring the position and date
(e.g. 6N is 6th November) where the maximum of the antenna variation metric
was observed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
A.1 Analysis of the user behavior. (A) Scatter plot of retransmissions obtained
by user versus its activity and colored by its number of followers. (B) Scatter
plot of retransmissions obtained by user versus its number of followers and
colored by its activity. (C) Scatter plot of retransmissions obtained by user
versus the ratio between the number of followers and followees, and colored
by its activity. (D) Scatter plot of retransmissions made by user versus its
number of followers and colored by its activity. Dots represent users. Data
correspond to the 20N dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 172
A.2 Analysis of the user behavior. (A) Scatter plot of retransmissions obtained
by user versus its activity and colored by its number of followers. (B) Scatter
plot of retransmissions obtained by user versus its number of followers and
colored by its activity. (C) Scatter plot of retransmissions obtained by user
versus the ratio between the number of followers and followees, and colored
by its activity. (D) Scatter plot of retransmissions made by user versus its
number of followers and colored by its activity. Dots represent users. Data
correspond to the ETA dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.1 Evolution of the opinion estimation model. Nodes are colored according to
their opinion Xi. Initially, all nodes’ opinions are zero; thus, they are
colored in white. However, nodes with an opinion below zero are red and
above zero are blue. The elite is hidden in the network and will spread their
opinions iteratively. We see how the network is increasingly colored at each
time step. Because the network is polarized around the elite, the red and blue
colors are not mixed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
B.2 Worldwide Twitter reaction to the announcement of Hugo Chavez’s death.
Each yellow circle represents a geolocated tweet. The video spans a 24-hour period.
We show a counter indicating the remaining time before the announcement
and the time after it. It can be noticed that at the moment of the announce-
ment the whole world reacted massively to the news by posting related mes-
sages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.3 Worldwide Twitter activity. In this video we present the worldwide Twitter
activity during an arbitrary week. We plot all geolocated tweets as white dots
in the map. It can be noticed that there is a wave of activity from the east
to the west side of the globe as days evolve. Also, it is noticeable that the
activity decreases to its minimum levels during early mornings. . . . . . . . 179
B.4 Human trajectories network evolution in Ivory Coast. In this video, we present
the dynamical growth of the human trajectories network during an arbitrary
day. Dots represent users moving across the country from antenna to antenna.
The edge color is related to the network community to which the target node
belongs. It can be noticed that the network grows in a sparse way, mostly
connecting nodes that are geographically close to each other. Other regions,
like the capital city (bottom right), concentrate most of the long distance edges. 180
B.5 Calls network evolution in Ivory Coast. In this video, we present the dynamical
growth of the calls network during a period of 12 hours on an arbitrary
day. Dots represent calls, traveling from one antenna to the other at each
hour. The edge color is related to the network community to which the target
node belongs. It can be noticed that there is an explosion of calls after
6am, showing the dense structure of the network. . . . . . . . . . . . . . . . 180
B.6 Time-lapse of the Tabasco impact map. The video displays the absolute value
of the antenna variation metric from Oct, 2009 to Jan, 2010 as in the temporal
series. Each antenna is represented by a circle with color and size proportional
to the daily metric value. The segmented flooded area has been colored in light
blue. It can be noticed that the antennas near the flooding area dramatically
increased their variation during the floods. This effect is also noticeable during
Christmas and New Year’s Eve, when all antennas present extremely large
variation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
List of Tables
4.1 Description of the studied datasets. . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Followers and retweet network properties from the Venezuelan protest #SOS-
InternetVE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Pearson correlation (r) by user of the number of followers (F), retweets (R)
and activity (A). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Main collectives around which each follower community is formed from the
Venezuelan protest #SOSInternetVE. . . . . . . . . . . . . . . . . . . . . . . 74
5.4 Most retransmitted account at each retransmission community from the Venezue-
lan protest #SOSInternetVE. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1 Properties of the studied datasets and their resulting user efficiency distribu-
tion properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1 Elite networks topological properties. Sin and ρ columns represent minimum
values. Off. C-node indicates the number of network communities related
to the officialism, and Opp. C-nodes indicates the number of communities
related to the opposition. The numbers in the parentheses indicate the number
of nodes in each pole. Q stands for modularity. r stands for the Pearson
coefficient of mixing patterns by ideology. . . . . . . . . . . . . . . . . . . . . 111
9.1 Properties of the Calls and Human Trajectories Networks. . . . . . . . . . . 142
Chapter 1
INTRODUCTION
Nowadays, we interact daily with electronic devices and services such as
mobile phones, e-mail or online social networks. The increasing integration of technological
solutions into people’s lives is certainly affecting the way people relate to each other and,
consequently, the properties of the social system. Historically, the exchange of information
among social groups has influenced and determined the course of events across societies
[Dia97]. In fact, the development of societies is associated with the number and diversity of
their exchange connections. The recent explosion of information technologies has enabled the
emergence of an unprecedented global society, one in which distances no longer
exist and where previously isolated events may trigger worldwide reactions within instants.
Researchers argue that the world is heading towards a networked society, where
Internet-based solutions are emerging as alternatives to traditional centralized institutions [Cas96].
For instance, social media allows people to broadcast information very cheaply on a
global scale, to the detriment of mass media companies, which no longer hold a monopoly on
information. Also, the large number of collaborative online working sites and the increasing
activity of international freelancing are signs that corporations are no longer needed to
conduct large businesses. In fact, new virtual currencies are already working as an alternative to
traditional financial systems, bypassing intermediaries and international monetary institutions.
As a consequence, current business and political models must choose between adapting
to the new times and becoming extinct.
The current challenge is to characterize the social systems that emerge from these new
technological spaces and to understand their rules of behavior [LPA+09]. For this purpose, we
must enhance our ability to measure these systems in their actual dimensions. Fortunately,
the aforementioned explosion of information technologies is providing the data required for these
analyses. When consuming these services, we unknowingly leave traces of our activity
as a by-product in the providers’ databases. Individually, these records contain detailed
information about the user activity and may serve for billing processes. It is natural to
think that each user will have a unique profile determined by his or her own habits and customs.
However, these databases are also large enough to enable
the observation of large-scale human behavioral patterns. In fact, they are unveiling the
characteristics of societies as a physical system as a whole, rather than as a collection of isolated
individuals [Pen14]. Besides, these datasets have the advantage of being real measurements
of people’s actual behavior, instead of the result of sparse observations and
honesty-based questionnaires.
Human society is a complex system. Many social phenomena strongly depend upon
the way people behave and how collective actions are combined in society
[Bet13]. As in other complex systems, there are global properties that emerge from the
relationships between the individuals, rather than from the properties of the individuals themselves
[BY97]. The elements of a complex system do not behave independently of each other,
nor do they behave fully coherently. Instead, individuals create interdependencies in their
actions, which take the form of collective behaviors. During this process, the individuals lose
independence in their behavior, while the system gains properties and capabilities
at larger scales. As a result, the emergence of a collective behavior increases the system’s
complexity.
In today’s societies, we can find traces of the collective behavior that enables
larger scale patterns in the data derived from human activity. That information is embedded
and unstructured in the raw data. Therefore, in order to retrieve this knowledge, we must
treat the data properly [BYB13]. On the one hand, we cannot explain the system through
the individual states, since that would require so many descriptors that the system would become too
complicated to understand. On the other hand, we cannot reduce the overall behavior
to mere statistics either. By doing so, we would lose the heterogeneity and diversity typical
of social systems. Therefore, we need frameworks to observe the system in all its complexity.
The theory of complex networks is an adequate tool to treat and analyze this kind of
systems [New03b, Wat04]. Networks are mathematical structures composed of a set of nodes,
linked to each other by a set of edges that represent relationships or interactions between the
systems’ elements. By analyzing systems in the form of networks, we can understand their
structure, their dynamical evolution and the mechanisms responsible for pattern formation.
In general, networks are a common ground for analyzing complex systems in a variety of
scientific disciplines, in part because networks reveal the systems’ characteristics across several
scales. They can describe the systems’ global properties and their functioning as a whole.
At the same time, they can also describe local interactions, the role of individuals in their
environment and the connection patterns, which include structures at intermediate scales.
Recently, there has been an explosion of research into ways to retrieve societal knowledge
from data. Most of these studies take advantage of the size, diversity and real-time nature
of the data in order to revisit old sociological questions and to ask new ones. Such a way
of studying social systems is unprecedented and it is revealing the true nature of societal
phenomena. For instance, patterns in the diversity of connections can explain the economic
development of cities [EMC10], as well as the emotional state of individuals [RMM+10]. Also,
patterns of popularity can explain the economic value of stocks [BMZ11] or help locate earthquake
epicenters [SOM10]. Finally, patterns of mobility can predict the propagation of infectious
diseases [WET+12] or evaluate urban land use [TUGB12].
Beyond remarkably increasing our societal knowledge, these and many other scientific
advances suggest that the analysis of data can be incorporated as valuable information
for decision making and policy evaluation processes, in both private and public sectors. First,
the analysis of data has the potential to show an unprecedented view of the impact
of policies on the population, so that they can be revised and modified if needed. Second,
it can also provide the knowledge to rethink the way our social and engineered
systems function together, in order to design new rules of complex interactions for
building better societies in the future.
1.1 Goals
In this thesis we develop computational and mathematical methods to analyze social systems
by combining the study of data derived from human activity with the theory of complex
networks. Our main goal is to characterize and model human behavior during people’s
interactions in new technological spaces, such as online social networks and mobile
phones. We intend to understand the social systems that emerge from such interactions, by
means of their structure, functioning and temporal evolution. To this end, we will analyze the
systems as complex networks and propose metrics based on physical quantities to measure
their characteristics. Furthermore, we will model the dynamics of information flow in the
system and measure the impact of external and critical events on the system’s behavior.
In order to achieve these goals, we have defined the following targets:
1. To develop methods to characterize and understand the social system’s structure, func-
tioning and time evolution. For this purpose the following particular goals must be
achieved:
(a) To represent the systems as complex networks, temporal variables and geograph-
ical information systems. Then, to characterize the system by analyzing the
properties of these abstractions.
(b) To characterize and classify users, as system’s elements, according to their rela-
tionship with the environment and their role in the collective functioning.
(c) To understand how users influence each other and the way information flows
among people. To this end, we will characterize users by their influence in
spreading information.
(d) To detect and measure the degree of polarization in social networks. For this
purpose, we will develop a methodology to infer opinions in social networks and
to measure the polarization in the resulting opinion distributions.
(e) To characterize the impact of critical events on the collective behavior. This
means analyzing the way critical events influence the communication patterns.
For this purpose, we will study the development of events like political protests,
news events and natural disasters.
(f) To characterize the geographical distribution of human activity, developing meth-
ods to measure interactions among geographically located social systems, like
urban areas or regions.
2. To develop dynamical models to explain global properties in the system. This means
explaining the nature of the observed patterns through the dynamical mechanisms that
rule the interactions among individuals. To this end, we propose to achieve the
following specific goals:
(a) To model the propagation of information across social networks as independent
cascades. Then, to explain the effects of the underlying networks’ topology and
user behavior on the information flow dynamics.
(b) To model the flow of opinions in a social network. Then, to explore the effects of
an influential minority’s opinion on the majority of users.
3. In order to achieve the previous goals, we must first develop a computer platform for
the collection, storage, querying and treatment of the data. To this end, we propose
to reach the following goals:
(a) To develop applications to collect data from online social networks’ servers. These
applications must be able to authenticate with servers and to manage queries
automatically.
(b) To develop methodologies to store and query the data. We will implement tra-
ditional solutions like MySQL and develop applications based on Map-Reduce
algorithms.
(c) To develop software applications for the mathematical and statistical treatment
of the data, computational modeling and simulations, as well as visualization
techniques.
1.2 Organization
The thesis is organized in 10 chapters. After this introduction, in chapter 2 we review the
most important concepts of complex network theory used in this thesis. We present the
main properties of complex networks, as well as the models proposed for network generation
and dynamics on networks. In chapter 3 we describe relevant previous work in computational
social science related to this thesis. We analyze the dynamics of human activity,
the properties of socio-technological networks and the information spreading processes. In
chapter 4 we introduce the concept of digital traces and discuss the computational methods
used to treat the data in order to retrieve patterns and knowledge. We also go through the
datasets we have analyzed, their properties and the techniques followed to build them.
In chapter 5, we present the analysis of user behavior during political mobilization.
We analyze a Venezuelan protest that took place exclusively on Twitter in December 2010.
We characterize the user behavior by describing the networks of collective attention and
information flow. We study the social networks’ topological properties, finding community
structure and highly connected hubs. We classify users according to their role in the
information flow dynamics and identify three different kinds of user behavior. We show
that traditional media still hold considerable influence on social media.
In chapter 6 we define a new measure of influence in the network called user efficiency.
It characterizes the relationship between the activity employed by users and the emergent
collective response to such activity. On this basis, we propose a model to understand the
emergence of the user efficiency distribution, based on independent cascades taking place on
networks. We show that this measure and the underlying mechanism are universal across
Twitter conversations from different and diverse social contexts.
In chapter 7 we introduce a methodology to quantify the degree of polarization in social
media conversations. We propose an opinion estimation model in which a minority of in-
fluential individuals propagate their opinions through social networks. The model results in
an opinion probability density function that represents the state of the population. Next,
we propose an index to quantify the extent to which this resulting distribution is polarized. We
apply this model to study the time evolution of user behavior on Twitter during an
important political event in Venezuela: the announcement of the death of the President
in office in 2013.
In chapter 8 we characterize the dynamics of human activity in urban environments. We
analyze the kinetics of Twitter activity across several cities worldwide. We characterize the
cyclic behavior of daily routines by constructing time series. We show that
cities can be classified into three kinds of dynamical behavior, based on morning and
afternoon activity.
In chapter 9 we study human behavior from mobile phone data. First, we analyze
the way social factors affect regional communication patterns. We propose a methodology
to infer regional relationships by means of human mobility and calling activity patterns. We
show that the regional communication patterns in Ivory Coast are correlated with ethnic
and economic factors. Then, we study the reaction of a population to a natural disaster.
We investigate the viability of using mobile phone data, combined with other sources of
information, in order to characterize the floods that occurred in Tabasco, Mexico in 2009. We
propose methods to evaluate the population behavior during the tragedy. We show that the
analysis of data could help in the evaluation of policies and resource allocation strategies.
Finally, in chapter 10 we briefly summarize our results and present our conclusions. Also,
we present two appendices with supplementary information and additional visualizations.
First, in appendix A we generalize some results from chapter 5. Second, in appendix
B we present the videos that we have made to illustrate some of our results.
Chapter 2
COMPLEX NETWORKS
Many natural systems can be modeled in the form of complex networks [New03b, Wat04,
BLM+06]. In this abstraction, the system’s elements are represented as nodes, which are
linked to each other due to the existence of relationships. In general, networks are a common
ground to visualize and explain systems across different scales. For instance, it is possible
to measure hierarchies in the systems’ structure, correlations in the connections or the
relationship between the local elements’ behavior and the global system’s properties. Recent
research has shown that many of these patterns are universal across previously unrelated
disciplines, such as biology [WRB06], sociology [NP03], economics [HKBH07, FGH12] or ecology
[GPG12]. These systems typically present complex properties like small-world structures
and scale-free degree distributions. Such complexity has remarkable effects on the system’s
functioning as a whole, such as robustness to random failures and vulnerability to
targeted attacks [CFHB+05, CPRVP09]. Therefore, the complexity of these systems must be
fully understood in order to properly characterize them and predict their behavior.
Over the last decade, there has been an explosion in modeling systems as complex networks.
In some cases, the network structure is more evident given the existence of physical or
explicit connections, like the Internet [AJB99], flights connecting airports [BBPSV04],
neurons [BS09] or social interactions [MLB12, MBLB14, BMLB12, BMBL14]. However, other
kinds of phenomena can also be modeled in the form of networks, for example by linking elements
according to correlated behaviors [KKK02] or common functions in the system [BS09]. In
all cases, complex networks are a powerful tool to understand the structure of these systems
as well as the evolution of their dynamical processes.
In this chapter we will review the main concepts of network science. We will give some
basic definitions and discuss the main topological properties of complex networks. We will
study network generation models, which explain the emergence of the properties of real networks,
and dynamical processes on networks, like information spreading or opinion formation.
2.1 Definitions
In this section a set of basic concepts of network science will be defined. These concepts are
referred to throughout the thesis and must be clarified in order to understand the
following sections and chapters.
Nodes: Nodes or vertices are the simplest representation of the elements in a system.
Edges: Links or edges represent relationships between nodes.
Directed Edges: Edges may be directed when the direction of the relationship is relevant
in the system representation, or undirected otherwise. For instance, in a scientific
collaboration network, the direction of the edges has no effect, since the interaction takes
place in the same way for both nodes. However, interactions like phone calls must be
analyzed taking into account the direction of the edge, since making a phone call is not the
same as receiving one.
Weighted Edges: An edge can also be weighted with any value according to a given
property in the system representation [BBPSV04]. For instance, in commercial networks,
one could weight the edge between the seller and buyer according to the sum of
all the transactions made.
Network: The network or graph, G, is a mathematical structure that consists of a set
of nodes, V , and a set of edges, E, each relating a pair of nodes i and j. A directed graph
takes into account the direction of the edges. In weighted graphs, edges carry
a value that may differ from one.
Adjacency Matrix: The Adjacency Matrix, $A_{ij}$, is the simplest representation of a
graph. Its elements take a value of 1 if there exists a connection between i and j,
or 0 if there is no connection. In the case of undirected networks $A_{ij} = A_{ji}$ and the
adjacency matrix is symmetric. In turn, in the case of directed networks $A_{ij}$ might not
be equal to $A_{ji}$, and the adjacency matrix might be asymmetric.
Multiplex: The multiplex is the representation of a system where the same set of
nodes may be linked by different types of relationships, which are modeled as networks
in different layers. For instance, people can have many types of relationships, like
networks of family, friends or coworkers.
Degree: A node’s degree is the sum of all the edges that connect it to the rest of the
network. In terms of the adjacency matrix $A_{ij}$, the degree $k_i$ of the node i is defined
as $k_i = \sum_j A_{ij}$. Its value indicates the connectivity of the node in the network.
Directed Degree: Given the fact that $A_{ij}$ is asymmetric in the case of directed
networks, two types of connection degree are considered, the out degree
$k_{out,i} = \sum_j A_{ij}$ and the in degree $k_{in,i} = \sum_j A_{ji}$. In these networks,
the total degree would be $k_i = k_{out,i} + k_{in,i}$.
Strength: The node’s strength $S_i$ is similar to the node’s degree but also takes into
account the weights of the edges. Its value indicates how strongly the node
is connected to the rest of the network, beyond the absolute number of edges. In the
case of directed and weighted networks, it is common to consider the strength in both
directions of the edges: $S_{in,i}$ and $S_{out,i}$.
Path: A path or trajectory is a sequence of connected nodes that links a pair of nodes
in the network.
Shortest Path: The path with the minimum number of edges between a pair of nodes in the
network.
Distance: Length of the shortest path between a pair of nodes ($l_{ij}$).
Diameter: The diameter of the network is the length of the longest shortest path in
the network.
Components: In a component all nodes are reachable from one another through some path. The
Giant Component has a size of the same order as the whole network.
Connected and Disconnected Graph: A graph is said to be connected when all
nodes belong to the same component, and disconnected when there exists more than
one component in the network. The distance between nodes from different components
is infinite.
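As an illustration of the definitions of degree and strength given above, the following minimal sketch computes them for a small directed, weighted network. The weight matrix W and all variable names are hypothetical and chosen only for this example; here W[i][j] is assumed to hold the weight of the edge from i to j, with 0 meaning no edge.

```python
# Hypothetical 4-node directed, weighted network (illustration only).
# W[i][j] is the weight of the edge i -> j; 0 means there is no edge.
W = [
    [0, 2, 0, 0],
    [0, 0, 1, 3],
    [4, 0, 0, 0],
    [0, 0, 0, 0],
]
n = len(W)

# Adjacency matrix: A[i][j] = 1 if the edge i -> j exists, 0 otherwise.
A = [[1 if W[i][j] != 0 else 0 for j in range(n)] for i in range(n)]

# Out-degree sums row i of A, in-degree sums column i of A.
k_out = [sum(A[i][j] for j in range(n)) for i in range(n)]
k_in = [sum(A[j][i] for j in range(n)) for i in range(n)]
k_total = [k_out[i] + k_in[i] for i in range(n)]

# Strengths are the same sums taken over the weights instead of over A.
s_out = [sum(W[i][j] for j in range(n)) for i in range(n)]
s_in = [sum(W[j][i] for j in range(n)) for i in range(n)]

print(k_out, k_in, k_total)
print(s_out, s_in)
```

For an undirected network the matrix is symmetric, so the row sum alone gives the degree.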
2.2 Topological Properties
In this section we will review the main topological properties of complex networks. These
properties describe the structure of the network, which strongly determines the functioning
of the system.
2.2.1 Degree Distribution
The probability P (k) that a randomly chosen node in a network has degree k is the first approach
to understand the structure of a network. The statistical properties of this distribution
determine several of the system’s emergent properties, such as the system’s robustness
or vulnerability to failures and attacks. In the case of directed networks two distributions
are considered according to the direction of the edges (outgoing or incoming). When the network is
weighted it is common to work with the strength distribution, or the in- and out-strength
distributions when the network is also directed.
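A minimal sketch of how an empirical degree distribution P (k) can be estimated from a list of node degrees is given below. The function name and the example degree sequence are hypothetical and serve only as an illustration.

```python
from collections import Counter


def degree_distribution(degrees):
    """Return a dict {k: P(k)} with the fraction of nodes having degree k."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical degree sequence of a six-node network.
degrees = [1, 1, 1, 2, 2, 5]
P = degree_distribution(degrees)
print(P)
```

For directed or weighted networks the same function can be applied to the in-degree, out-degree or strength sequences.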
2.2.2 Geodesic Distance
The shortest path length between two nodes is the minimum number of edges that separate them
from each other in the network. The geodesic distance indicates the average value of the
shortest path lengths between any pair of nodes. It is a network measure that indicates how
distant nodes are from one another on average:
\[ L = \frac{1}{N(N-1)} \sum_{i,j \in V,\; i \neq j} l_{ij} \tag{2.1} \]
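Eq. (2.1) can be evaluated with a breadth-first search from every node. The following is a minimal sketch, assuming an unweighted, undirected and connected graph stored as a dictionary of neighbor lists; the function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Average of l_ij over all ordered pairs i != j, as in Eq. (2.1)."""
    n = len(adj)
    total = 0
    for i in adj:
        dist = bfs_distances(adj, i)
        if len(dist) < n:
            raise ValueError("graph is disconnected: some distances are infinite")
        total += sum(dist.values())
    return total / (n * (n - 1))


# Hypothetical example: a path of four nodes, 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
L = average_path_length(adj)
diameter = max(max(bfs_distances(adj, i).values()) for i in adj)
print(L, diameter)
```

The same distances also give the diameter, taken here as the largest shortest path length found.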
2.2.3 Clustering
The concept of clustering is related to the number of a node’s neighbors that are connected
with each other forming triangles. It is a local measure quantified as the fraction of existing
triangles in relation to all the possible ones, defined as [WS98]:
\[ c_i = \frac{\sum_{j,m} A_{ij} A_{jm} A_{mi}}{k_i (k_i - 1)} \tag{2.2} \]
It can also be a global measure, quantified as the average clustering coefficient of all
nodes in the network:
\[ C = \frac{1}{N} \sum_{i \in V} c_i \tag{2.3} \]
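Eqs. (2.2) and (2.3) can be evaluated directly from the adjacency matrix. The sketch below assumes an undirected, unweighted network without self-loops; assigning zero clustering to nodes with fewer than two neighbors is a convention adopted here for the example, and the function names and the example matrix are hypothetical.

```python
def local_clustering(A, i):
    """Clustering coefficient of node i following Eq. (2.2)."""
    n = len(A)
    k = sum(A[i])
    if k < 2:
        return 0.0  # convention assumed here: no triangle is possible
    closed = sum(A[i][j] * A[j][m] * A[m][i]
                 for j in range(n) for m in range(n))
    return closed / (k * (k - 1))


def average_clustering(A):
    """Average of the local coefficients over all nodes, Eq. (2.3)."""
    n = len(A)
    return sum(local_clustering(A, i) for i in range(n)) / n


# Hypothetical example: a triangle 0-1-2 plus node 3 attached only to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
c = [local_clustering(A, i) for i in range(len(A))]
C = average_clustering(A)
print(c, C)
```

The double sum counts each triangle twice (once per orientation), which matches the k(k - 1) ordered pairs of neighbors in the denominator.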
2.3 Types of Networks
In this section we will review the most common types of networks. We will explain the
structural properties of these networks and relate their applications to real case studies. We
will place particular emphasis on scale-free networks.
2.3.1 Regular Networks
Lattices or grids are networks whose connections form a regular tiling. They are neither random
nor disordered. Instead all nodes repeat the same regular and coherent pattern, which could
form triangles, squares, hexagons, etc. Apart from the nodes at the borders, all nodes in these
networks present the same degree. The clustering coefficient is high, because neighbors are
regularly connected with other neighbors. The average shortest path length is also high, since
far regions are only reachable after hopping across several nodes. This kind of network is
common in materials science, when modeling bonds between atoms in crystalline materials.
They also serve as the substrate for dynamical models like cellular automata [AS11].
2.3.2 Random Networks
In random networks the connections between the nodes are random and independent from
each other. The degree distributions follow approximately normal (bell-shaped) curves. This means that
most nodes present a number of connections within a few standard deviations of the mean. Therefore the
degree of the majority of nodes fluctuates close to an average value. In these networks, the average
shortest path length is usually low. The clustering coefficient is also low and even decreases
with the network’s size. That happens because connections are independent from each other.
Therefore, the probability of two neighbors being connected is the same as the probability of two
independently chosen nodes in the network being connected. As a consequence, the larger
the graph, the lower the probability of finding triangles.
2.3.3 Small World Networks
A network is said to be small world when the average shortest path, L, scales as the logarithm
of the network size, N , in the form:
\[ L \propto \log N \tag{2.4} \]
This means that L grows much more slowly than the number of nodes in the network.
This property is related to the famous six-degrees-of-separation experiment performed by Milgram
in the 1960s [Mil63]. The experiment showed for the first time that any two randomly
chosen strangers are separated by only 6 individuals on average. More recently, new technologies
have revealed that the average shortest path length between social media users is
even below that value [MLB12]. The small-world property is also frequently found in many real systems
[ASBS00], such as biological [WF01], ecological [MS02], neuronal [SJN+07] or economic
networks [DYB03]. Moreover, small-world networks are also characterized by a high
clustering coefficient.
Figure 2.1: Homogeneous vs. power-law distributions. (a) A homogeneous function and a power-law
function with γ = 2.1. Both distributions have 〈k〉 = 10. The curves in (a) are shown on a
linear plot and in (b) on a log-log plot. (c) A random network with 〈k〉 = 3 and N = 50. (d) A
scale-free network with 〈k〉 = 3. Figure adapted from [Bar12]
2.3.4 Scale-free Networks
Scale-free networks are characterized by the absence of a characteristic scale for the nodes’
degree. They are often called heterogeneous networks because the nodes’ degree does not
fluctuate homogeneously close to an average value. Instead, most nodes have a less-than-average
degree while very few of them, called hubs, have a far-above-average degree. This
means that the majority of nodes are poorly connected while only a few of them are
extremely connected and link most of the network. An example of these distributions is the
power law:
\[ P(k) \sim k^{-\gamma} \tag{2.5} \]
where typically 2 < γ < 3. In this range of γ, the second moment (and hence the standard deviation)
is not defined in the limit of infinite networks. That means that the fluctuations around the average value diverge with the
size of the network. This might seem like a contradiction to traditional statistical techniques,
where the more independent samples there are, the more accurate the estimation of the moments, like
the average or the standard deviation. However, in scale-free networks there is no independence
among the nodes’ behavior. Instead, there are interdependencies and correlations in
their connections that lead to the emergence of extremely connected cases.
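The divergence mentioned above can be made explicit with a short worked step. This is a sketch under assumptions that are not stated in the text: the degree is treated as a continuous variable, the power law of Eq. (2.5) holds between a minimum degree k_min and a maximum degree k_max, and C denotes the normalization constant.

```latex
\[
\langle k^m \rangle
  = \int_{k_{\min}}^{k_{\max}} k^m P(k)\, dk
  = C \int_{k_{\min}}^{k_{\max}} k^{m-\gamma}\, dk
  = C\, \frac{k_{\max}^{\,m-\gamma+1} - k_{\min}^{\,m-\gamma+1}}{m-\gamma+1}
\]
% First moment (m = 1): the exponent 2 - \gamma is negative for \gamma > 2,
% so the average degree stays finite as k_max grows.
\[
\langle k \rangle \;\to\; C\, \frac{k_{\min}^{\,2-\gamma}}{\gamma - 2}
  \quad \text{as } k_{\max} \to \infty
\]
% Second moment (m = 2): the exponent 3 - \gamma is positive for \gamma < 3,
% so the second moment grows without bound with the largest degree.
\[
\langle k^2 \rangle \;\sim\; k_{\max}^{\,3-\gamma} \;\to\; \infty
  \quad \text{as } k_{\max} \to \infty
\]
```

Since the largest degree grows with the size of the network, so do the fluctuations around the average degree.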
On a double logarithmic scale, scale-free distributions look like a straight line across
several orders of magnitude. This means that they do not have the characteristic cut-off of
scale-defined distributions. In Fig. 2.1 we present an example of two distributions plotted
on linear and logarithmic scales. It can be noticed that the scale-defined distribution (green
curve) rapidly converges to zero in a sharp cut-off, not much farther than the average value.
Meanwhile, the scale-free distribution (red curve) has a fat tail that spans several
orders of magnitude, indicating that there is no defined scale to represent this population.
Several examples of real scale-free networks can be found in the literature. In Fig.
2.2, we present the degree distribution of six networks of different nature, such as scientific
collaborations, the web pages of the World Wide Web, the physical connections of the Internet
and protein interactions. The fact that such different kinds of phenomena present a universal structure
is remarkable. Such universality has radically changed the way scientists look at natural
systems and the emergence of their properties.
The implications of scale-free distributions for the systems’ functioning are also remarkable.
First, the extremely connected hubs act as shortcuts between distant regions of the network, giving
rise to very short average path lengths. In fact, depending on the value of γ the geodesic
distance may behave like L ∝ log log N [CH03]. This effect is called ultra small world. Second,
because the probability of finding a poorly connected node is very high, the network’s
functioning is robust to random failures. However, as a counterpart, these networks are very
vulnerable to targeted attacks, since the failure of hubs compromises the structure of the
network as a whole.
2.4 Community Structure
Often referred to as the meso-scale, the community structure indicates the existence of groups
of nodes within the network that share a larger number of edges among themselves than with
the rest of the nodes in the network [For10]. This means that networks with communities have a
hierarchical structure where nodes can be classified in groups, which are densely connected
internally and sparsely connected with other groups. In Fig. 2.3 we present a schematic
representation of a network with community structure.
Figure 2.2: Complementary Cumulative degree distributions for six different networks. (a) Collab-
oration network of mathematicians [GI95]; (b) Citations between 1981 and 1997 to papers cataloged
by the Institute for Scientific Information [Red98]; (c) A 300 million vertex subset of the World
Wide Web, circa 1999 [BKM+00]; (d) The Internet at the level of autonomous systems, April 1999
[CCG+02]; (e) The power grid of the western United States [WS98]; (f) The interaction network
of proteins in the metabolism of the yeast S. Cerevisiae [JMBO01]. (c), (d) and (f), appear to
have power-law degree distributions and (b) has a power-law tail but deviates its behavior for small
degree. (e) has an exponential degree distribution and (a) appears to possibly have two separate
power-law regimes with different exponents. Figure adapted from [New03b]
Figure 2.3: A simple graph with three communities, enclosed by the dashed circles. Figure taken
from [For10]
The analysis of the meso-scale is important for several reasons. First, it allows us to
understand the structure of the network at intermediate scales between the most local and the
most global space. In such a hierarchy, large scale structures are often assembled from smaller
subparts previously assembled, such as cells and organisms [Sim62]. Second, the communities
enhance the nodes’ characterization according to their role in the sub-structure. Some
of the nodes play a central role, keeping the module together, while others may play a bridge
role, connecting different modules.
It is intuitive to think about community structure in social networks since people have
a tendency to form groups with their coworkers, friends or families. However, non-evident
community structure has also been detected in networks of a different nature. For instance,
in ecological networks it has been reported that dolphins interact with each other within
communities due to racial issues [Lus03]. Also in functional networks, protein-to-protein
interactions are grouped in communities according to their function in the cell [JCZB06], and
genes are organized in community structures according to common ends [WH04].
2.4.1 Detection Algorithms
The detection of communities in graphs has been a hot topic of research in recent years.
The broad definition of communities has resulted in a wide range of interpretations and
therefore many different ways to detect them. A very popular algorithm to find community
structure is the modularity optimization method [BGLL08]. First, Newman introduced
the concept of modularity to quantify the number of edges within and between groups, in
comparison to what would be expected for a random case [New06]. Then, Blondel et al. introduced
an algorithm based on modularity optimization in order to find the best partition of the
network [BGLL08]. At the beginning of the algorithm, every node is considered as an
independent community. Then, the algorithm iteratively proposes network partitions until
no further increase in modularity is found. It is a powerful algorithm capable of finding
the community structure of very large networks in a small amount of time.
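To make the quantity being optimized concrete, the sketch below evaluates the modularity of a given partition. It uses the standard expression Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j), which is not written out in the text above and is assumed here for an undirected, unweighted network; it only scores a partition and does not implement the optimization algorithm of [BGLL08]. The function name and the example graph are hypothetical.

```python
def modularity(A, community):
    """Modularity Q of a partition; community[i] is the label of node i."""
    n = len(A)
    k = [sum(row) for row in A]
    two_m = sum(k)  # twice the number of edges in an undirected network
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:
                # observed edge minus the expectation for a random case
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

q_two_groups = modularity(A, [0, 0, 0, 1, 1, 1])
q_one_group = modularity(A, [0, 0, 0, 0, 0, 0])
print(q_two_groups, q_one_group)
```

Splitting the example into its two triangles scores higher than placing all nodes in a single group, which is the kind of comparison a modularity optimization method relies on.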
Other methods are based on different ideas. For example, random-walk algorithms are
based on the principle that a random surfer would be trapped within a community of nodes
given the density of shared edges [RB10]. In this algorithm, we first let a random walker
surf the network for a period of time. Then, we identify the communities as the clusters of
nodes which were more frequently visited from one another. Moreover, the concept of edge
betweenness has been proposed to determine the edges that could be connecting communities
[GN02]. According to this method, if the edges with high betweenness are systematically
removed from the network, the communities will eventually disconnect from one another.
Finally, data mining techniques, such as clustering algorithms [Mac67], have been proposed to
determine communities. These algorithms treat nodes as vectors in a multidimensional
space where there are as many dimensions as nodes in the network. Then, we calculate
distances between nodes such that the more similar their connections, the closer the nodes are. At last, we
identify clusters of nodes that are closer to each other than to the rest of the network.
2.5 Assortativity
The assortative mixing patterns quantify the tendency of nodes to be connected to other
nodes that are similar or dissimilar to them [New02a]. It is measured by the correlation
coefficient, r, which indicates if there is a tendency in the way that nodes are connected to
each other, or whether nodes are independently mixed among each other. It is a network
metric that measures who is connected to whom and to which extent.
The degree assortativity r indicates whether pairs of connected nodes have correlated
degrees. It is defined as:
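\[ r = \frac{\sum_{i \in E} (j_i - \langle j_i \rangle)(k_i - \langle k_i \rangle)}{\sqrt{\sum_{i \in E} (j_i - \langle j_i \rangle)^2}\,\sqrt{\sum_{i \in E} (k_i - \langle k_i \rangle)^2}} \tag{2.6} \]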
where $j_i$ and $k_i$ are the degrees of the nodes at the two ends of the edge i. In the case of
directed edges, $j_i$ and $k_i$ may respectively represent the in- or out-degree [FFGP10]. Therefore,
there are four kinds of directed assortativity according to the possible combinations:
$r_{in-in}$, $r_{out-in}$, $r_{in-out}$ and $r_{out-out}$.
The graph is positively assortative if r > 0, meaning that the hubs and most connected
nodes tend to be connected to each other with a greater probability than to the less
connected nodes. Instead, a negative value of assortativity, or disassortativity (r < 0), means
that hubs are preferentially connected to the poorly connected nodes rather than to each
other. If there is no correlation in the nodes’ connections then r ∼ 0 and we could say that
the connections between hubs and non-hubs occur independently.
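The degree assortativity of Eq. (2.6) can be sketched for an undirected network as the Pearson correlation between the degrees found at the two ends of every edge. Counting each undirected edge in both orientations, so that the measure is symmetric, is an assumption made for this example; the function name and the two test graphs are hypothetical.

```python
from math import sqrt


def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Each undirected edge contributes both orientations (u, v) and (v, u).
    js, ks = [], []
    for u, v in edges:
        js += [degree[u], degree[v]]
        ks += [degree[v], degree[u]]
    mean_j = sum(js) / len(js)
    mean_k = sum(ks) / len(ks)
    cov = sum((j - mean_j) * (k - mean_k) for j, k in zip(js, ks))
    var_j = sum((j - mean_j) ** 2 for j in js)
    var_k = sum((k - mean_k) ** 2 for k in ks)
    if var_j == 0 or var_k == 0:
        raise ValueError("all nodes have the same degree: r is undefined")
    return cov / (sqrt(var_j) * sqrt(var_k))


# Hypothetical examples: a star (one hub linked to four leaves) and a path.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
path = [(0, 1), (1, 2), (2, 3)]
r_star = degree_assortativity(star)
r_path = degree_assortativity(path)
print(r_star, r_path)
```

The star is the extreme disassortative case, since its hub is connected only to nodes of degree one.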
Analogously to the degree correlation, we can define other types of mixing patterns.
For example, networks may also present patterns of connectivity according to any discrete
characteristic of the nodes, like language, sex or race. It can be quantified by a matrix $e_{ij}$ that measures
the fraction of edges that connect nodes of type i to type j. Then the correlation coefficient,
r, is defined as:
\[ r = \frac{\mathrm{Tr}\, e_{ij} - \|e_{ij}^2\|}{1 - \|e_{ij}^2\|} \tag{2.7} \]
where ||x|| means the sum of all elements in the matrix x. This formula gives r = 0 when
there is no assortative mixing and r = 1 when there is perfect assortative mixing. In the
case of r = 0, the connections are mixed independently at random.
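Eq. (2.7) can be checked numerically on small mixing matrices. In the sketch below, the matrix e is assumed to be normalized so that its elements sum to one, the squared matrix is the matrix product of e with itself, and the two example matrices are hypothetical.

```python
def attribute_assortativity(e):
    """Correlation coefficient r of Eq. (2.7) for a mixing matrix e."""
    n = len(e)
    trace = sum(e[i][i] for i in range(n))
    # Matrix product e * e, followed by the sum of all its elements.
    e2 = [[sum(e[i][k] * e[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    norm = sum(sum(row) for row in e2)
    return (trace - norm) / (1 - norm)


# Hypothetical two-type examples.
e_perfect = [[0.5, 0.0], [0.0, 0.5]]    # edges only join nodes of the same type
e_random = [[0.25, 0.25], [0.25, 0.25]]  # types are mixed independently

r_perfect = attribute_assortativity(e_perfect)
r_random = attribute_assortativity(e_random)
print(r_perfect, r_random)
```

The two matrices reproduce the limiting values r = 1 and r = 0 discussed above.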
2.6 Network Models
In this section we review the most important network models. These models explain the
emergence of real networks’ properties, by means of defining a set of underlying rules of
behavior.
2.6.1 Erdős-Rényi Model
The Erdős-Rényi model (ER) is one of the first ones to study the properties and generation
of random graphs [ER60]. Originally proposed by Paul Erdős and Alfréd Rényi in 1959, the
model consists of independently connecting pairs of nodes from a previously defined set
with a probability of connection q. Depending on the value of q, the network transitions
from a sparse and disconnected network (q → 0) to a fully connected one (q → 1). In
between these extreme cases, there is a critical probability qc after which a giant component
emerges. During the process, nodes are independently connected with a number of edges
that fluctuates homogeneously around an average value, bounded by the standard deviation.
The result is homogeneous networks with degree distributions that follow approximately normal curves,
small average shortest path lengths and low clustering coefficients.
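The generation process can be sketched as follows. This is a minimal illustration of the variant in which every pair of nodes is linked independently with probability q; the function name, the parameter values and the random seed are hypothetical choices made for the example.

```python
import random


def erdos_renyi(n, q, seed=None):
    """Link every pair of the n nodes independently with probability q."""
    rng = random.Random(seed)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < q:
                edges.append((i, j))
    return edges


# Hypothetical parameters: 200 nodes and connection probability 0.05.
n, q = 200, 0.05
edges = erdos_renyi(n, q, seed=1)

# Every edge adds one to the degree of two nodes.
mean_degree = 2 * len(edges) / n
expected_mean_degree = q * (n - 1)
print(mean_degree, expected_mean_degree)
```

The measured average degree should fluctuate close to q(N − 1), in line with the homogeneous behavior described above.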
2.6.2 Watts and Strogatz Model
The first model to explain small world networks was proposed by Duncan Watts and Steven
Strogatz [WS98] in 1998. It is a random graph generation model that explains the collective
dynamics behind some topological patterns found in real networks, such as small average
shortest path lengths together with high clustering coefficients. The model does not consider
a network growth process. Instead it consists of rewiring a fraction of the existing edges in
order to drive the network to a complex border between order and disorder.
The process is illustrated in Fig. 2.4 A. It begins with a highly ordered network, such
as a grid or lattice, where the rewiring probability p is null (p = 0). This kind of network
typically has a very high clustering coefficient and a very long average shortest path length.
Then some edges are randomly rewired with a given probability p, introducing a certain
amount of disorder into the network. In the extreme case of p = 1 all edges have been rewired
and the network behaves like an ER graph, where both clustering and average shortest
path length are very low. Somewhere in between, as shown in Fig. 2.4 B, there is a range of
values of p where the average shortest path length dramatically decreases (black squares)
without losing the high clustering property (white squares). This happens because a few new
shortcuts are enough to connect distant nodes, whereas many more edges must be rewired
before the clustering coefficient, which is the last property to diminish in the process,
breaks down.
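The rewiring procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the original code of [WS98]: the starting network is a ring lattice, self-loops and duplicate edges are forbidden, and the function names are hypothetical.

```python
import random
from collections import deque


def watts_strogatz(n, k, p, seed=None):
    """Ring lattice of n nodes, each linked to its k nearest neighbours
    (k even), where every edge is rewired with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            adj[v].add((v + step) % n)
            adj[(v + step) % n].add(v)
    for step in range(1, k // 2 + 1):
        for v in range(n):
            old = (v + step) % n
            if old in adj[v] and rng.random() < p:
                # candidate targets: no self-loops, no duplicate edges
                choices = [w for w in range(n) if w != v and w not in adj[v]]
                if choices:
                    new = rng.choice(choices)
                    adj[v].discard(old)
                    adj[old].discard(v)
                    adj[v].add(new)
                    adj[new].add(v)
    return adj


def average_clustering(adj):
    """Mean over nodes of the fraction of linked neighbour pairs."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


for p in (0.0, 0.05, 1.0):
    g = watts_strogatz(200, 6, p, seed=1)
    print(p, average_clustering(g), average_path_length(g))
```

Comparing the three values of p gives a rough picture of the small-world regime: the path length drops quickly with a few shortcuts, while the clustering coefficient decays only when most edges have been rewired.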
The limitations of this model to explain the behavior of real systems stem from the
unrealistic degree distributions it produces. These distributions do not reproduce those
of real networks, which are scale-free. However, the model explains how some properties of
real systems, like the small world effect and high clustering, result from a behavior of the
nodes and their relationships that is neither fully ordered nor fully random.
2.6.3 Barabasi-Albert Models
The Barabasi-Albert model (BA) is a network growth model that explains the emergence
of fat-tailed degree distributions like the ones found in real networks [BA99]. It is based on
two mechanisms: population growth and preferential attachment. The first mechanism is
based on the observation that networks grow in time as new nodes are added to the system.
The second mechanism states that these new nodes will tend to connect with previously
well connected nodes rather than less connected ones. The combination of both mechanisms
gives rise to the heterogeneity of the resulting degree distribution.
Figure 2.4: (a) Schematic of the Watts-Strogatz model. (b) Normalized average shortest path
length L and clustering coefficient C as a function of the random rewiring parameter p for the
Watts-Strogatz model with N = 1000 and k = 10. Figure taken from [WS98].
Figure 2.5: (A) Degree distribution of networks generated by the Barabasi-Albert model in
linearly-binned (red symbols) and log-binned version (green symbols). The number of edges per
new node m = 3. Size of (A) N = 100,000, (B) N = 100, (C) N = 10,000 and (D) N = 1,000,000.
The straight line has slope γ = 3, corresponding to the resulting networks’ degree distribution.
Figure adapted from [Bar12].
The model specifies that the probability that a new node j connects an edge to a node i
already in the network is proportional to the degree of i, ki, in the following way:
\[ P_{j \to i} = \frac{k_i}{\sum_{l} k_l} \tag{2.8} \]
This means that the more connections a node has, the higher its probability of gaining new
ones. This mechanism is usually called rich-get-richer or preferential attachment. Barabasi
and Albert [BA99] showed that the emergent degree distributions converge to a power law with
exponent 3. Therefore the networks that emerge from this model are scale-free, which
implies that while most nodes have a small number of connections, a very few
nodes concentrate the largest number of them. In Fig. 2.5 we show the resulting degree
distribution of a numerical simulation of the BA model.
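A growth process following Eq. 2.8 can be sketched in a few lines. This is an illustrative implementation rather than the simulation behind Fig. 2.5. It assumes that the network starts from a small complete graph of m + 1 nodes and that each new node attaches to m distinct targets, and the function names are hypothetical.

```python
import random
from collections import Counter


def barabasi_albert(n, m, seed=None):
    """Grow a network to n nodes; each new node attaches m edges to
    existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Assumption: the seed network is a complete graph of m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears in `stubs` once per edge end, so a uniform draw
    # picks node i with probability k_i / sum_l k_l (Eq. 2.8).
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # redraw until m distinct targets
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


def degrees(edges):
    """Degree of every node in an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


deg = degrees(barabasi_albert(2000, 3, seed=42))
print("minimum degree:", min(deg.values()))
print("maximum degree:", max(deg.values()))
```

Redrawing duplicate targets slightly perturbs the strict proportionality of Eq. 2.8, which is a common simplification. The gap between the minimum and maximum degree hints at the heterogeneity of the resulting distribution.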
The power law emerges due to the correlation in the nodes’ behavior (collective behavior).
These nodes do not distribute their edges independently among the rest of the nodes in the
network. Instead they prefer to connect with the well connected ones. Therefore, the new
nodes’ decisions about whom to connect with depend upon the decisions previously taken
by the nodes that are already connected in the network. This creates a time-dependent
phenomenon, where the aggregation of individual contributions leads to the emergence of
extremely large cases.
The BA networks present the small world effect. The extremely connected hubs link
large parts of the network. Moreover, the clustering coefficient is typically low in these net-
works. That is a limitation for modeling real systems, where high clustering coefficients are
typically found. However, BA networks are widely used to evaluate dynamical phenomena
in heterogeneous networks.
Over the last years, the preferential attachment model has been generalized in order
to explain other properties of complex networks. For instance, in some models nodes may
present a set of attributes and properties of their own that identify them and influence the
rules of connections. That observation is based on real social systems. People not only
decide to connect with those who are popular. We also look to connect with those that are
similar to us in certain ways. The heterogeneous preferential attachment model [San07, SB08]
proposes a formalism to bias the probability of connection with an affinity value based on
nodes’ attributes. This means that the rich-get-richer mechanism is biased with an affinity
function that increases or decreases the probability of connection between two nodes based
on their attributes.
The affinity function is the rule that nodes apply when deciding whom to connect with. It
can be either a global or a local rule. In both cases, every node has its own attribute.
However, depending on the rule, nodes compare themselves with others in either a common or
an individual way. In the global case, the affinity between nodes is defined by a single global
rule that all nodes apply equally. In the local case, by contrast, each node may have an
individual function to determine its own affinities. This implementation increases the
heterogeneity of users, in terms of their characteristics and behavior. The heterogeneous
preferential attachment formalism has been used to model different kinds of real networks,
such as networks of politicians’ interactions on Twitter [BMLB12].
2.7 Dynamics on Networks
Dynamical processes among nodes are studied by applying models on networks
with a previously defined topology. In this section we will explore the most important
dynamical processes taking place on networks, such as contagion processes and cascading
effects.
2.7.1 Disease Contagion
The most popular model for disease contagion processes is the SIR model [KM27]. It
was first introduced by Kermack and McKendrick in 1927. The SIR model is named after
the three possible states that nodes may adopt: Susceptible, Infected and Recovered. Its
dynamics consist in the temporal change of the nodes’ states. The susceptible may be infected
with a rate βi. The infected recover with a rate βr. The recovered may remain immune
or become susceptible again with a rate βs. The quantities S(t), I(t) and R(t) determine
whether there is an epidemic outbreak or whether it can be controlled.
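The state changes above can be sketched as a simple simulation on a network. Several choices here are assumptions made only for illustration: time is discrete, all nodes are updated synchronously, and the rates are treated as per-step probabilities (`beta_i` per infected neighbor, `beta_r` and `beta_s` per node). The function names are hypothetical.

```python
import random


def sir_step(adj, state, beta_i, beta_r, beta_s, rng):
    """One synchronous update of the S -> I -> R (-> S) dynamics."""
    new_state = dict(state)
    for node, s in state.items():
        if s == "S":
            infected_nbrs = sum(1 for nb in adj[node] if state[nb] == "I")
            # probability of at least one successful transmission
            if rng.random() < 1.0 - (1.0 - beta_i) ** infected_nbrs:
                new_state[node] = "I"
        elif s == "I":
            if rng.random() < beta_r:
                new_state[node] = "R"
        elif rng.random() < beta_s:      # recovered node loses immunity
            new_state[node] = "S"
    return new_state


def run_sir(adj, seeds, beta_i, beta_r, beta_s=0.0, steps=100, seed=None):
    """Return the list of (S, I, R) counts until the infection dies out."""
    rng = random.Random(seed)
    state = {v: "S" for v in adj}
    for v in seeds:
        state[v] = "I"
    history = []
    for _ in range(steps):
        counts = {"S": 0, "I": 0, "R": 0}
        for s in state.values():
            counts[s] += 1
        history.append(counts)
        if counts["I"] == 0:
            break
        state = sir_step(adj, state, beta_i, beta_r, beta_s, rng)
    return history


# Hypothetical example: a ring of 10 nodes with one initial infection.
ring = {v: {(v - 1) % 10, (v + 1) % 10} for v in range(10)}
for counts in run_sir(ring, [0], beta_i=0.5, beta_r=0.3, seed=7):
    print(counts)
```

With `beta_s = 0` recovered nodes stay immune, so the final value of R measures the size of the outbreak.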
The critical value of the infection rate depends on the network topology [PSV01, PSV02].
Standard random networks need a critical infection rate higher than zero in order to cause
an outbreak (Fig. 2.6). Below that critical limit the disease would not spread widely enough
and may even become extinct. However, Pastor-Satorras et al. [PSV01] showed that the
critical limit of infection rate decreases down to zero in scale-free networks. This means
that an initial infection of only a few nodes can compromise the network as a whole. That
phenomenon happens because the diffusion process occurs much more rapidly in scale-free
networks due to the effects of hubs and small average shortest path lengths.
Recently, this model has been applied to many kinds of real networks, like sexual
contact networks [New02b, RLH11] and air transportation networks [CBBV06]. These studies
give much more realistic results about the true dynamics of actual epidemics and provide a
better basis for designing response strategies.
2.7.2 Social Contagion
Social contagion is the process by which people make decisions influenced by the decisions
taken by other people. For instance, the decisions of adopting a trend, acquiring a product
or forming an opinion are part of social contagion processes. Unlike the spread of diseases
in epidemic models, the spread of ideas is not typically harmful. Therefore, the strategies of
social contagion are commonly intended to reach as many people as possible, instead of
retrieving the information needed to prevent the outbreak.
Figure 2.6: Comparison of disease spreading on homogeneous random graphs and scale-free
networks. The fraction of infected nodes displays a distinct phase transition (or epidemic
threshold) in the case of a homogeneous random graph, but not for the scale-free network.
Figure taken from [Wat04].
The Threshold Model is one of the first models to understand the diffusion of ideas
among people [Gra78]. Proposed by Mark Granovetter in 1978, it is a model that defines a
collective interacting behavior among agents, based on the tipping point ideas from
Schelling’s segregation model [Sch71]. In the model, agents require a critical mass of
neighbors who have already adopted the new state before deciding to do so themselves. This
means that nodes change their state once the absolute or relative number of neighbors that
have already adopted the new state exceeds a threshold value. Individual thresholds may be
the same for all nodes or vary according to a probability distribution. Watts applied the
threshold model to complex networks in 2002 [Wat02]. He found that hubs influence the
spread of adoption in two different ways. On the one hand, hubs influence a very large number
of users when they adopt a new state, due to their high connectivity. On the other hand, as
the threshold is a fraction of connections, it is more difficult for hubs to reach the number
of neighbors needed in order to change their own state.
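The adoption rule can be sketched as below. It is a minimal illustration that assumes relative (fractional) thresholds and that a node adopts when the active fraction of its neighbors reaches its threshold. The function name and the star-shaped example network are hypothetical.

```python
def threshold_cascade(adj, thresholds, seeds):
    """Return the set of adopters once no further node can change state.

    A node adopts when the fraction of its neighbours that have already
    adopted is at least its own threshold. Adoption is irreversible.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, nbrs in adj.items():
            if node in active or not nbrs:
                continue
            frac = sum(1 for nb in nbrs if nb in active) / len(nbrs)
            if frac >= thresholds[node]:
                active.add(node)
                changed = True
    return active


# Hypothetical star network: node 0 is a hub linked to five leaves.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
half = {v: 0.5 for v in star}

print(sorted(threshold_cascade(star, half, seeds=[0])))  # hub adopts first
print(sorted(threshold_cascade(star, half, seeds=[1])))  # a leaf adopts first
```

The star example illustrates the two roles of hubs mentioned above: when the hub adopts, every leaf follows, whereas a single adopting leaf covers only one fifth of the hub's neighbors and cannot change its state.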
Opinion formation processes are also part of social contagion. A popular model of opinion
formation is the voter model [HL75]. In this model, nodes can only adopt binary states based
on the states of their neighbors and a set of interaction rules. These interaction rules are
based on connections, whether in grids or networks, and probabilities of interaction. The
mechanism is the following:
1. First, we select a node from the network at random with a given probability distribution.
2. Then, the chosen node selects one of its own neighbors, with another probability
distribution.
3. Finally, the first node adopts the state of the selected neighbor, and another node is
chosen to interact.
These interactions are iteratively repeated until the nodes reach consensus. Consensus is
said to be reached when no more changes occur in the nodes’ states. Recently, the voter
model has been applied to complex networks [SAR08, SEM05]. The results indicate that
the heterogeneity of the network and the existence of hubs facilitate reaching consensus.
Other scholars have been interested in the effects of a set of nodes called “zealots”, whose
opinions remain constant over time [Mob03, MMR07]. They have found that the presence
of zealots in the network prevents the population from reaching consensus.
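The three-step mechanism of the voter model, together with the effect of zealots, can be sketched as follows. The sketch assumes that both the node and its neighbor are selected uniformly at random, which is only one possible choice of the probability distributions mentioned above. The function name and the `zealots` argument are hypothetical.

```python
import random


def voter_model(adj, state, zealots=(), max_steps=100000, seed=None):
    """Run the voter dynamics and return (final_state, steps_used)."""
    rng = random.Random(seed)
    state = dict(state)
    nodes = sorted(adj)
    fixed = set(zealots)
    for step in range(max_steps):
        if len(set(state.values())) == 1:
            return state, step               # consensus reached
        node = rng.choice(nodes)             # step 1: pick a node
        if node in fixed or not adj[node]:
            continue                         # zealots never change opinion
        neighbour = rng.choice(sorted(adj[node]))   # step 2: pick a neighbour
        state[node] = state[neighbour]       # step 3: copy its state
    return state, max_steps


# Hypothetical example: a fully connected group of six nodes.
group = {v: {u for u in range(6) if u != v} for v in range(6)}
initial = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

final, steps = voter_model(group, initial, seed=3)
print(final, steps)

# Two zealots with opposite opinions keep both states alive.
final, steps = voter_model(group, initial, zealots=[0, 1],
                           max_steps=2000, seed=3)
print(final, steps)
```

In the second run the two zealots hold opposite opinions, so both states survive until the step limit, in line with the results cited above.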
Other models of opinion formation do not consider binary states, but a continuous spectrum
of possible opinions. For instance, the DeGroot model [DeG74] describes how a group
of individuals might reach a shared opinion by iteratively updating their opinion as the
average of their current opinion and the opinions of their neighbors. In this model there is
no external data and nodes are only able to form opinions by observing their neighbors’
opinions. The result is an explicitly determined distribution of the opinions reached. The
shape of the distribution indicates the way consensus has been achieved, whether opinions
merge into a single view or multiple points of view remain among people. Recently, the
DeGroot model has been used to study the conditions under which consensus is achieved
[AO11, GJ10a, Jac10]. However, as consensus is rarely reached in the real world [Kra09, IJBZ08],
variants of this model can lead to a diversity of opinions [BKO11, ACFO13, Kra00, FJ90].
For example, by weighting edges and biasing the way nodes interpret their incoming
information, the divergence and polarization of people’s opinions may be reproduced.
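The averaging rule of the DeGroot model can be sketched as below. For illustration it assumes equal weights for a node's own opinion and each of its neighbors' opinions, and it stops when opinions change by less than a tolerance. The function names are hypothetical.

```python
def degroot_step(adj, opinions):
    """Each node adopts the unweighted average of its own opinion and
    the opinions of its neighbours."""
    new = {}
    for node, nbrs in adj.items():
        values = [opinions[node]] + [opinions[nb] for nb in nbrs]
        new[node] = sum(values) / len(values)
    return new


def degroot(adj, opinions, tol=1e-9, max_iter=10000):
    """Iterate until no opinion changes by more than tol."""
    for iteration in range(max_iter):
        updated = degroot_step(adj, opinions)
        if max(abs(updated[v] - opinions[v]) for v in adj) < tol:
            return updated, iteration + 1
        opinions = updated
    return opinions, max_iter


# Hypothetical connected example: a chain of three individuals.
chain = {0: {1}, 1: {0, 2}, 2: {1}}
print(degroot(chain, {0: 0.0, 1: 0.5, 2: 1.0}))

# Hypothetical disconnected example: two isolated pairs.
pairs = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
print(degroot(pairs, {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0}))
```

In the connected chain all opinions merge into a single value, while the two disconnected pairs keep separate opinions, a very simple case of several points of view coexisting.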
2.7.3 Cascades on Networks
Occasionally, during contagion processes, a node’s adoption of a new state triggers a
sequence of reactions among its neighbors, and neighbors of neighbors, in the shape of cascades
(see Fig. 2.7). During cascades, individuals show a herd-like behavior, making decisions
solely based on the actions of others. Many real systems constantly show cascading behavior,
such as crashes in the stock market [Shi95], failures in the electrical system [SCL00],
biological processes [GIT09] or viral marketing campaigns [GLM01]. The dynamics of cascades
are related to the avalanches of Bak’s self-organized criticality [BTW87]. In both systems,
the propagation of actions occurs due to long range correlations among the different nodes
or agents.
Figure 2.7: Schematic representation of a cascade on a network. The red and yellow nodes
belong to the cascade. The white nodes belong to the network but are not part of the cascade.
The cascade layers have been marked in gray.
One kind of cascade on networks results from the threshold model. The adoption of a new
state by a given node can provoke a cascade if the threshold of any of its neighbors is
exceeded by such adoption. Watts showed that the size distribution of this kind of cascade
follows a power law when the network connectivity is limited [Wat02]. He also showed that the
heterogeneity of behavior plays an ambiguous role in the propagation of cascades. On the one
hand, the heterogeneity in the threshold distribution makes the network more vulnerable to
the occurrence of large scale cascades. On the other hand, the heterogeneity in the
degree distribution makes the network more robust to their propagation. This occurs
because, although hubs trigger many more cascades than average nodes, they are less
likely to propagate already existing ones.
The cascades that result from the threshold model depend on the overall state of the
social contagion process. However, there is another kind of cascade which occurs
independently of the nodes’ history. These cascades are analyzed through the independent
cascade model [GLM01]. In this model, every node’s adoption may trigger an independent
cascade, regardless of whatever happened before the adoption. When a node becomes active, it
has a single chance to activate each of its neighbors with a given transmission probability.
Saito et al. [SNK08] proposed a model to predict the optimal diffusion probabilities by
maximizing the likelihood of possible episodes.
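The single-chance activation rule of the independent cascade model can be sketched as follows. The sketch assumes one transmission probability shared by all edges, and the function name is hypothetical. The cascade is returned layer by layer, in the spirit of the layers shown in Fig. 2.7.

```python
import random


def independent_cascade(adj, seeds, prob, seed=None):
    """Return the cascade as a list of layers of newly activated nodes."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    layers = [list(seeds)]
    while frontier:
        next_frontier = []
        for node in frontier:
            for nb in adj[node]:
                # a newly active node gets one attempt per inactive neighbour
                if nb not in active and rng.random() < prob:
                    active.add(nb)
                    next_frontier.append(nb)
        if next_frontier:
            layers.append(next_frontier)
        frontier = next_frontier
    return layers


# Hypothetical example: a chain of four nodes seeded at one end.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(independent_cascade(chain, [0], prob=1.0))  # [[0], [1], [2], [3]]
print(independent_cascade(chain, [0], prob=0.0))  # [[0]]
```

Because nodes only act during the step in which they become active, the outcome does not depend on how long a neighbor has been exposed, unlike in the threshold model.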
Kempe et al. [KKT03] proposed a general framework where the threshold model and
the independent cascade model are included as particular cases. In their general cascade
model, the independent cascades behave as the threshold model, with the difference that the
nodes’ threshold is reduced to one and there is a probability to propagate the cascades once
the threshold is exceeded. They also proposed an algorithm to find the initial group of
individuals that will produce the largest cascades in a social network, based on a
combinatorial optimization process.
2.8 Social Networks
Social groups can be analyzed in the form of networks. In this abstraction, the nodes
represent persons and the edges represent social relationships of any kind, such as friendship,
family or work. The analysis of social groups in the form of networks is a well known concept
in social sciences [Mor51]. However, until recently, large scale social structures were not
addressed in this way. Many examples can be found in the literature about social networks,
such as actors performing in common movies [NWS02, ASBS00], scientists collaborating in
papers [BBPSV04], criminals acting in gangs [XC05] or people interacting through Internet
[MLB12, MBLB14, BMLB12, BMBL14]. Most of these studies analyze the social structure
at different scales, the dynamical processes taking place on the networks and the effects of
people’s behavior on the social structure in which they are embedded.
Social networks typically present complex properties, such as scale-free degree
distributions and the small-world effect. The degree distributions behave like power laws, the
average shortest path length is small and the clustering coefficient is usually much greater
than what would be expected for a random process. Moreover, social networks tend to be
positively assortative [New02a]. Popular people tend to relate with other popular people,
while unpopular people tend to be friends with other unpopular people. Positive assortativity
has been found in cases like scientific collaborations, movie actors or boards of directors.
In contrast, networks like the Internet, airport networks or semantic networks tend to be
disassortative.
A very important property of social networks is the community structure. Social networks
are usually subdivided in groups of nodes more closely related to each other than with the
rest of the network. Communities can be related to working disciplines [RB10], racial issues
[GHKV07], language spoken [BGLL08] or mobility patterns [EEBL11]. Newman showed that
the community structure is responsible for other properties, such as the positive assortativity
together with the high clustering coefficient [NP03]. During the growth process of a network
divided in communities, the new edges tend to stay within the same group, thus the clustering
coefficient does not decrease as the network grows.
2.9 Time Varying Networks
In real systems, not all networks can be modeled as constantly growing processes where edges
remain invariant after their creation. Some systems change their structure dynamically, as
elements appear and disappear, or relationships are built and destroyed. There is a class of
networks that takes into consideration this dynamical nature of real processes. Networks of
this kind are called time varying, and their edges are created, removed and rewired according
to the nodes’ behavior. They have recently been used to explain the evolution of online
interactions among people [PGPSV12], genetic processes [KSA+10], wireless routing strategies
[NMR05] and contagion processes [LBP13]. These studies have found remarkable differences
with their static counterparts, showing the importance of considering the dynamical nature
of real processes.
Many real systems are determined by the nodes’ dynamical behavior. Perra et al.
[PGPSV12] proposed an activity driven model where the network formation depends on
the nodes’ activity. The model takes into account the different dynamics of activation in
social networks, such as in social media or scientific collaboration. In Fig. 2.8 we illustrate
three different time steps of the evolution of the model. At each time step, active nodes (red)
create a new set of edges (white). The bottom visualization shows the aggregated network
at the end of the process. This model can explain the emergence of hubs in networks based
on the heterogeneity of the nodes’ activity distribution, rather than on the preferential
attachment mechanism. It is a clear example of how the complex structural properties of the
system at larger scales result from the complexity of the individual behavior, and it also
demonstrates that structure and dynamics in complex systems are intimately related to each
other.
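The mechanism can be sketched with a short simulation. The details below are assumptions made for illustration and are not taken from the description above: each node becomes active with its own probability per time step, an active node creates m edges to uniformly chosen nodes, and all edges are discarded before the next step. The function name is hypothetical.

```python
import random


def activity_driven(activities, m, steps, seed=None):
    """Return (snapshots, aggregated): the edge set created at every time
    step and the union of all of them."""
    rng = random.Random(seed)
    n = len(activities)
    snapshots = []
    aggregated = set()
    for _ in range(steps):
        edges = set()                    # edges live for one step only
        for node in range(n):
            if rng.random() < activities[node]:     # node becomes active
                others = [v for v in range(n) if v != node]
                for target in rng.sample(others, m):
                    edges.add((min(node, target), max(node, target)))
        snapshots.append(edges)
        aggregated |= edges
    return snapshots, aggregated


# Hypothetical activities: one very active node among nineteen quiet ones.
activities = [0.9] + [0.02] * 19
snapshots, aggregated = activity_driven(activities, m=2, steps=50, seed=4)

degree = [0] * len(activities)
for u, v in aggregated:
    degree[u] += 1
    degree[v] += 1
print(degree)
```

In the aggregated network the highly active node accumulates many more connections than the rest, so a hub appears without any preferential attachment rule.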
Figure 2.8: Schematic representation of the activity driven network creation model. Red nodes
show the active nodes at each time T . The bottom plot represents the final aggregated structure
of the network. This figure has been adapted from [PGPSV12].
Chapter 3
COMPUTATIONAL SOCIAL SCIENCE
Over the last decades, fields like biology or physics have been revolutionized by increasing
amounts of data obtained from electronic measurements. Social sciences, however, were
particularly slow to join this trend. The technical challenges of surveying at larger scales
hindered data-driven studies and experiments in social disciplines. Fortunately, that
reality has recently changed, as humans now constantly interact with electronic devices.
These devices act like social cathodes, recording our activity and enabling previously
unexplored research and knowledge. Computational social science is the emerging field of
data-driven analysis to understand societal phenomena [LPA+09]. The theory of complex systems
[BY97] is a powerful framework to understand this kind of phenomena. In order to understand
the social system we cannot ignore the complexity of our behavior and the relationships we
build. Complex systems’ tools, like complex networks [New03b, Wat04, BLM+06], explain
patterns of societies, like the structure of the system, the way it evolves and the underlying
mechanisms responsible for the pattern formation.
Until now, social sciences were studied with surveys that do not meet the requirements
for a data-driven analysis. Traditional surveys usually represent snapshots of a few
hundred randomly sampled individuals that show neither the structural nor the dynamical
patterns of the social system as a whole. The complexity of the social system is simplified and
the diversity of the population is not captured. Computational social science provides a
new perspective on social processes. It enables sociologists and experts to revise old concepts
and answer new questions. The new data sources have brought many benefits. First, the
fact that these datasets are automatically collected dramatically reduces the cost and effort
of data gathering. That is, for instance, very helpful for less developed countries,
which can gain information from already collected databases in an affordable way [BLT+11].
Second, this human activity represents a new dimension that enhances the characterization of
societies. For example, phone call behavior explains phenomena ranging from individual
properties, like emotions [RMM+10], to large scale patterns, such as economic development
[EMC10]. Third, the temporal evolution makes it possible to detect patterns and find trends.
For instance, stock prices have been predicted from people’s mood on the Internet [BMZ11].
Finally, modeling techniques unveil the nature of the system. Computational and mathematical
models help to control and predict the behavior at different times and under different
conditions.
There are two ways to conduct experiments in computational social science. One is to
analyze large datasets from service providers, like emails, social media, mobile phones and
e-commerce, and then retrospectively look for patterns in the collective behavior of the
population, in order to discover the micro-macro connections of the social process. The other
is to build living labs with people and conduct large scale experiments [VH86]. Living
labs are techniques for research on user-centric behavior by sensing, designing and
validating complex solutions in real scenarios.
In the present chapter we will discuss some of the most important advances in the field
of computational social science. We will review the main characteristics of online human
activity and their applications to understand society in section 3.1. We will introduce the
concept of socio-technological networks in section 3.2. Finally, from sections 3.3 to 3.5, we
will review the dynamical processes that take place in these new spaces, like social contagion,
influence propagation and social polarization.
3.1 Human Activity
In the context of computational social science, human activity refers to the use
of any of these communication services. As people use these technologies, a
picture of their social interactions emerges. The emergent patterns are complex and
heterogeneous. For instance, there is no characteristic rate of activity. Instead, there are
long periods of inactivity, followed by short moments of highly intense activity called
bursts [Bar05, ZCH+12]. This behavior is captured by the distribution of the time between two
consecutive actions, which follows a power law (see Fig. 3.1). Barabasi [Bar05] modeled this
activity effect as a queue process where tasks or actions are executed consecutively. In the
model, most tasks are rapidly executed one after another, but a few experience very
long waiting times before being executed.
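A queue process of this kind can be sketched as follows. The specific rules are assumptions not spelled out in the text above: the list holds a fixed number of tasks with uniformly random priorities, the highest-priority task is executed with probability p, and a random one is executed otherwise. The function name and parameter values are hypothetical.

```python
import random


def priority_queue_waits(length=2, p=0.99, steps=10000, seed=None):
    """Return the waiting time (in steps) of every executed task."""
    rng = random.Random(seed)
    # each task is (priority, time step at which it joined the list)
    tasks = [(rng.random(), 0) for _ in range(length)]
    waits = []
    for t in range(1, steps + 1):
        if rng.random() < p:
            # usual case: execute the task with the highest priority
            idx = max(range(length), key=lambda i: tasks[i][0])
        else:
            # occasionally execute a task regardless of its priority
            idx = rng.randrange(length)
        waits.append(t - tasks[idx][1])
        tasks[idx] = (rng.random(), t)   # a new task takes the free slot
    return waits


waits = priority_queue_waits(seed=11)
print("share of tasks executed after one step:", waits.count(1) / len(waits))
print("longest waiting time:", max(waits))
```

Most tasks are executed in the step right after they arrive, while low-priority tasks can remain on the list for a long time, which gives a heavy-tailed distribution of waiting times.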
On a larger scale, we are part of societies. Studies using electronic media data report that
individual activities like commuting, working, and sleeping on a daily basis combine into
area-wide pulsing patterns. Measurements like the number of calls [CGW+08], electricity
consumption [PSR12] or emails sent [WWT+11] display regular cycles of high activity during
work hours and low activity during rest hours. These urban rhythms were reported as
responsible for the heterogeneity in the queuing times modeled by Barabasi [MSMA08].
However, the burstiness of human activity turns out to be robust to seasonal or daily
variations [JKKK12]. On the other hand, collective activity in large areas also explains the
economic development of the region. Eagle et al. [EMC10] found that regions with diverse
communication patterns tend to be wealthier than regions with insular communications.
Figure 3.1: Bursts of individual activity on an e-commerce site. In the left panel we represent
the temporal behavior of four individuals, showing that bursts of activity (color stripes) coexist
with long periods of inactivity (white periods). The x-axis represents time and the colored lines
represent individual actions. In the right panel we show the distribution of inter-action waiting
times for each of the four users. Figure adapted from [ZCH+12].
Figure 3.2: Collective response to a critical event. In the top panel we show the emergent
networks between affected users during an event at three times. In the bottom panel we show the
call pattern between the same users a week before the event, indicating that the cascades observed
during the event are extraordinary. Figure adapted from [BWB11].
The regular patterns of collective behavior are disrupted during critical events like natural
disasters or episodes of social unrest [BWB11, MLB12]. During these events, the populations
encounter unfamiliar conditions and their reactions determine the outcome of the crisis.
Electronic media make it possible to measure and understand the impact of the events on the
social system. Recent studies have characterized the disruption of the collective patterns by
comparing the behavior during the event to the usual one [BWB11]. They found abrupt variations
in the activity, closely related to the emergence of extraordinary information cascades (see
Fig. 3.2). As the emergency unfolds, people tend to communicate it to others, triggering chain
reactions. The geographical distribution of the activity also describes the event. For
instance, during earthquakes Twitter activity makes it possible to locate the epicenter with
extraordinary accuracy, by geographically measuring the volume of related messages [SOM10].
Episodes of social unrest are another kind of collective disruption [LeB96]. Recently, many
social movements have been analyzed by means of human activity data. The propagation of action
and influence across networks during the episodes of social unrest, including how leaders
attract and influence followers, has been described [MLB12, MR13]. Efforts have also been made
to understand the evolution of some of these movements and to investigate possible reasons for
their eventual decay [CFMF13].
Another way to analyze human activity is by means of mobility patterns. By looking
at geolocalized data it is possible to analyze the laws that govern people’s movement. Eagle
and Pentland [EP06] predicted the movements of students and employees in a university,
based on their individual characteristics such as studies or employment level. On a larger
scale, Gonzalez et al. [GHB08] studied thousands of anonymous mobile phone users and
revealed that humans follow simple and predictable mobility patterns. They found
that human trajectories present a strong temporal and spatial regularity, characterized by
a significant probability of returning to a few highly visited locations. Moreover,
electronic media like Twitter have revealed global migrations and the actual exchange of
people between countries [HSB+13].
3.2 Socio-Technological Networks
Telecommunication solutions, like phones or the Internet, require underlying technological
networks in order to transfer the data and be able to provide services. When consuming these
services, people interact and communicate with other people, creating social networks that
emerge from the exploitation of the technological resources. In these socio-technological
networks, information is shared and ideas flow among people. The characterization of these
networks makes it possible to understand large scale patterns of the society. Two things can
be addressed with socio-technological networks: the topological properties of the social
structure and the dynamical properties of the interaction processes. On the one hand, the
social structure is defined as the network of social contacts, established either by making
calls or by sending messages. On the other hand, dynamical processes are due to user
activations in the network. These activations are usually not independent from each other and
present large scale patterns like the emergence of information cascades.
These networks present complex properties. Fat-tailed degree distributions have been
frequently found, indicating that the number of contacts per person is scale free. This has
been shown in email networks [NFB02], mobile phone networks [APR99, OSH+07] and Twitter
networks [MLB12, MBLB14, BMLB12, BMBL14, KLPM10]. The small world property is
also common in these networks. The average shortest path length has been calculated
in email networks and Twitter networks, being below six degrees of separation in systems
with thousands of millions of participants. Moreover, these networks have a negative degree
correlation [HW09]. That happens because the new interaction mechanisms allow regular
people to interact with famous people.
Huberman et al. [HRW09] showed that people behave quite selectively when truly
interacting with their contacts. Each person has a subset of friends with whom they interact
much more frequently than with the rest of their acquaintances. In fact, Kleinberg et al.
[LNK07] showed that people even start new relationships with their friends’ friends. This
behavior constitutes a network that matters contained within the overall social structure.
Such an effect is due to limitations in people’s attention, since there is a maximum number of
relationships that we can manage simultaneously. As originally measured by Dunbar [Dun92] and
later confirmed with Twitter data [GPV11], it is difficult for people to manage more than a
couple of hundred relationships at the same time.
Another property of socio-technological networks is the community structure. These
33
Figure 3.3: Emergent networks from the propagation of four videos on Twitter. In panels (A)
and (B) the local influential leaders performed a remarkable role in the diffusion process. Whereas
in panels (C) and (D) the influence of hubs was much more stronger. Figure adapted from [DO14].
graphs typically present communities of users whose interactions occur more frequently
among themselves than with the rest of the network [JSFT09]. Due to the dynamical na-
ture of human interactions, these communities are not static but rather emerge, evolve and
disappear in time [NDXT11]. In general, these communities are formed due to similari-
ties in people’s characteristics and dramatically impact the way information spreads across
the network [GRM+12]. For instance, linguistic communities are usually found in
socio-technological networks, as people usually interact with those who speak the same language
[BMBL14]. Also, people form communities according to their interests. Recent studies
showed that news sharing networks on Twitter have communities of users who are
interested in either global, national or local scale issues [HZGMBY13, AMV+14]. Besides,
other characteristics, like political affinity, determine the emergence of communities during
political discussions and electoral campaigns [BMLB12, BMBL14].
3.3 Information Spreading
The exchange of information during social interactions leads to the contagion of ideas, opin-
ions and adoptions between people. The underlying laws that govern the contagion and flow
of ideas are similar to those that rule information spreading processes on social networks. In
the context of social media, recent studies have revealed that most of the information posted
in these networks is hardly propagated by the participants. Around 71% of the messages do
not travel any farther than the author’s timeline [CE09]. However, some authors manage to
spread their content in a wide variety of proportions, due to a combination of several factors,
such as their popularity, posting frequency, and novelty or resonance of the posted content
[RGAH11]. In Fig. 3.3 we present the respective networks from propagating four videos on
Twitter. In these networks, nodes are users and links represent the diffusion of messages. It
can be noticed that the shape of networks A and B is radically different from that of C and D. In A
and B, local leaders played an influential role, making the diffusion really viral. Meanwhile,
in C and D, the diffusion happened mainly due to the popularity of hubs.
In 1973, Granovetter argued that while most information is shared within sets of
strongly tied individuals (or communities), weak ties link communities and spread the infor-
mation across the whole network [Gra73]. More recently, Onnela et al. [OSH+07] showed
that millions of mobile phone users do behave like social bridges, allowing information to flow
across communities in the social network. They showed that social networks are robust to
removing the strong intracommunity ties, but are destroyed if intercommunity weak ties are
removed. Also, the heterogeneity of human temporal behavior slows the diffusion process.
The temporal bursts of activity are typically trapped within the communities, producing fast
local cascades but reducing the diffusion at larger scales [MML10]. Moreover, other factors
also play a fundamental role in the process of social contagion. For example, the diversity of
contacts who have already adopted a trend strongly determines whether users imitate them. Ugander
et al. [UBMK12] showed that people’s decision to join a social media service depends not only
upon the absolute number of friends who already joined, but also on the diverse kinds of
social groups that these friends represent, such as family, coworkers, etc.
3.3.1 Models
Many researchers have modeled the information spreading processes as branching processes.
These were first introduced by Galton and Watson in the 19th century to model the
survival of family names [WG75]. A branching process is a stochastic model where a set of
individuals Z_n, at generation n, produces a random number of new individuals Z_{n+1}, at
generation n + 1. Individuals reproduce independently of each other and of
previous generations. Other branching models, like the Bellman-Harris process [BH48],
propose that individuals live for a random period of time independently of each other. Then,
once an individual’s lifetime expires, a random number of new individuals is produced.
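The Galton-Watson process described above can be illustrated with a short simulation. This is a minimal sketch, not taken from the cited works: the function names and the two-children-or-none offspring law are assumptions chosen only for illustration.

```python
import random


def galton_watson(offspring_sampler, generations, z0=1, rng=None):
    """Simulate the generation sizes Z_0, Z_1, ..., Z_n of a Galton-Watson process.

    offspring_sampler(rng) returns the random number of children of one
    individual; every individual is sampled independently of the others.
    """
    rng = rng or random.Random()
    sizes = [z0]
    for _ in range(generations):
        # Each individual of the current generation reproduces independently.
        next_size = sum(offspring_sampler(rng) for _ in range(sizes[-1]))
        sizes.append(next_size)
        if next_size == 0:  # extinction: nobody is left to reproduce
            break
    return sizes


def binary_offspring(rng, p=0.5):
    """Hypothetical offspring law: two children with probability p, else none."""
    return 2 if rng.random() < p else 0


if __name__ == "__main__":
    print(galton_watson(binary_offspring, generations=10, rng=random.Random(42)))
```

The returned list stops early when a generation is empty, which corresponds to the extinction of the family name in the original formulation.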
More recently, branching processes have been used to model several information spreading
processes on social media, inspiring network growth models characterized by the information
retrieved from data [IE11b, WWT+11, GJ10b]. Barabasi et al. [WWT+11] studied the flow
of emails in a corporation by means of a biased Galton-Watson model to reproduce the bushy
but shallow cascades. Golub [GJ10b] instead modeled deep and narrow email chains with a
similar process. Doerr et al. [DBM13] found that the adoption times of individuals in the
inter-arrival time of Twitter messages, or the propagation time of stories on a social media
site, can be explained through a convolution of log-normally distributed observation and
reaction times of the individual participants. Moro et al. [IE11b] studied the dynamics of
information flow from the individual activity patterns of the nodes’ branching dynamics. The
model includes a temporal variable that determines the time lapse between generations, based
on the observations of human activity patterns. They conclude that the heterogeneity in the
user behavior gives rise to extreme events. Therefore the size of the emergent networks
from the diffusion process depends upon the activity of the spreaders.
Information cascades in social media have also been modeled as BA networks, biasing
the seed’s probability to gain more connections and linking all new nodes with a single
edge [GKK11]. This model reproduces very well the cascades’ shapes across different online
social networks, which are characterized by the model parameters. Wang et al. [AHSW11]
have proposed that the number of tweets in trending topics grows in a multiplicative way.
Kawamoto [Kaw13] proposed a dynamical model with stochastic branching growth to predict
cascades’ sizes taking into account the network of followers. The stochastic parameter in
the model follows a log-normal curve with a fatter tail. Zhang et al. [XLZ+12] proposed to
modify roles in the SIR model in order to include a message forward mechanism. In the new
state, contacted, nodes know about the message but have not forwarded it yet. As in other
SIR models, the network heterogeneity diminishes the spreading probabilities.
3.4 Influence and Popularity
Social influence is the process by which individuals change their behavior as a result of social
interactions with other people. During these interactions, individuals exchange information
and adapt their opinions and beliefs based on the information received. This process may
happen consciously or not, through processes of persuasion and leadership [Kel58]. For
example, social interactions are known to be responsible for the transmission of social be-
haviors like eating [CF07], drug abuse [RMFC10, CF08] or emotions [FC08]. In fact, these
studies show that we are influenced not only by our acquaintances, but also by friends of
friends, and by many other people in the network. In the context of social media, a user’s
influence can be measured by the propagation of their content across the network. However,
from a marketing perspective, social influence is seen as the capacity of some users to
encourage the adoption of a trend or product among their contacts [AW12].
Bakshy et al. [BHMW11] defined influence in social media according to the size of
the information cascades that users produce. They found that word-of-mouth in social
media functions by means of many small cascades seeded by regular users, rather than the
cascades produced by popular users. On the other hand, Aral et al. [AW12] found that
social properties like age, sex and marital status play a fundamental role in the adoption of
products. They found that influential people are less likely to be influenced, and that these
users are usually clustered in communities. Therefore, they suggest that influential people with
influential friends are a good target to start campaigns. Most of these studies are structural,
explaining influence either by the nodes’ properties or the connection patterns. However,
influence is also an interplay between dynamics and structure. In the context of the SIR
model, Klemm et al. [KSESM12] defined a measure of influence called spreading efficiency,
as the expected infected fraction of a network, when a node is initially infected and the rest
of the network is susceptible.
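The definition above, an expected infected fraction given a single initially infected node, can be estimated by Monte Carlo simulation. The sketch below is only an illustration and not the method of [KSESM12]: it assumes a discrete-time SIR variant in which each infected node tries once to infect each susceptible neighbor with probability `beta` and then recovers. All function and parameter names are hypothetical.

```python
import random


def sir_outbreak_fraction(adjacency, seed_node, beta, rng):
    """Run one SIR outbreak and return the fraction of nodes ever infected.

    Assumption: an infected node gets one chance to infect each susceptible
    neighbour with probability beta, and recovers in the next time step.
    """
    infected = {seed_node}
    recovered = set()
    while infected:
        newly_infected = set()
        for node in infected:
            for neighbour in adjacency[node]:
                susceptible = neighbour not in infected and neighbour not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(neighbour)
        recovered |= infected
        infected = newly_infected
    return len(recovered) / len(adjacency)


def spreading_efficiency(adjacency, seed_node, beta, runs=1000, seed=0):
    """Monte Carlo estimate of the expected infected fraction for one seed node."""
    rng = random.Random(seed)
    total = sum(sir_outbreak_fraction(adjacency, seed_node, beta, rng)
                for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    # Toy star network: node 0 is the hub, nodes 1 to 3 are leaves.
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(spreading_efficiency(star, seed_node=0, beta=0.5))
    print(spreading_efficiency(star, seed_node=1, beta=0.5))
```

Comparing the estimate for the hub and for a leaf of the toy star network shows how the same dynamics yields different values depending on where the outbreak starts.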
Popularity is related to having a large number of connections. Cha et al. [CHBG10]
found that popular users on Twitter may not be the most influential. They say that influence
also includes being retransmitted and mentioned in the conversations. Both influence and
popularity are the result of the way people pay attention to each other. Ratkiewicz et
al. [RFF+10] showed in 2010 that the evolution of the collective attention is complex and
characterized by a bursty behavior, since the popularity of contents has abrupt growths
due to external factors. They modeled this process as preferential increase mechanisms
combined with random popularity shifts. This means that the popular contents tend to be
more popular in the future, boosted by the occurrence of external events. During this process,
the resonance of the content is extremely important in order to attract people’s attention
[RGAH11]. However, all content has an effective period to catch the collective attention
before losing that capacity. The novelty of information decays quite rapidly, stretching the
effective time to attract the collective attention [AHSW11].
3.5 Polarization
Polarization is a social phenomenon that frequently appears in many collectives and societies,
when individual beliefs diverge during dynamical opinion formation processes [BG08]. A
reason for such divergence is the tendency of people to discuss in a biased way, looking for and
accepting opinions that reinforce their own beliefs, as well as rejecting those positions that
contradict them [DGL13]. However, the heterogeneity of individuals’ backgrounds, beliefs and
needs also plays a fundamental role in the emergence of polarization. People’s diverse criteria
strongly condition the evaluation of issues like policy outcomes or the appreciation of the
state of a nation or any society [DW07].
Political polarization is understood as the alignment of individual ideologies with extreme
positions. A usual case called elite polarization [ZFT+08] occurs among leaders between
parties, or even within the same party, when mutually exclusive positions are assumed and
radically different solutions are proposed to common problems. Although it evidently seems
harmful for institutional cohesion and stability [Dia90], some scholars have pointed it out as
beneficial for democracies, arguing that the electorate is able to recognize their leaders’
ideological positions more accurately [BB07]. In any case, elite polarization has been reported
to be much stronger than popular polarization, as regular people tend to assume more mod-
erate positions than politicians or opinion leaders [GMSS12, SS12]. Popular polarization
emerges when the ideological divergence occurs in societies as a whole. A reflection of it
occurs in electoral processes when the political options are reduced into two or a few parties
with extreme positions. Spatial segregation is another kind of polarization. This is related
to the Schelling model [Sch71], where agents move among slots in a grid according to their
tolerance to coexist with others. The results are social segregation patterns in the
spatial configuration. During the process, an initially homogeneously distributed population
ends up grouped in clusters of others with similar characteristics.
More recently, political processes have been analyzed by means of data exhaust from
human communications through electronic devices [TL13]. In the context of the U.S.
elections, Adamic et al. [AG05] found a divided blogosphere where the two main political parties
(liberal and conservative) mainly cited blogs from their own community, with very few cross
interactions. Such division was reflected in their own articles, as each side showed
interest in different subjects. On Twitter conversations, Conover et al. [CRF+11] found
the retransmissions mechanism to be the most polarized one. They proposed a method to
determine the political valence of certain keywords in the content of messages, based on how
parties use them in a mutually exclusive way. Moreover, Livne et al. [LSAA11] used Twitter
data to determine which party exploited the tool more effectively. They found that partisans
of each party show interest in specific aspects, with few coincidences. In general, these studies
show that there is a remarkable lack of debate in social media, and that people are usually
exposed to similar opinions.
Other researchers have proposed ways to measure the degree of polarization in the
population. From a dynamical point of view, Dandekar et al. [DGL13] measure whether
polarization is emerging in a population at each time step by looking at the evolution of dis-
agreement. Also, Weibull et al. [DW07] measure whether the distributions of opinions are
turning bimodal. Baldassarri et al. [BB07] measure the bi-modality of opinion distributions
by means of the kurtosis and variance. From a different perspective, Latane et al. [LNL94]
measure polarization based on the change of group sizes. They claim that polarization
occurs when a minority group of individuals grows in comparison to the majority group, thus
measuring polarization based on the growth rate. From a structural point of view, Conover
et al. [CRF+11] compare the polarization in Twitter networks by means of the modularity
measure. Modularity has also been used to measure polarization in congressional networks
[ZFT+08] and networks of modeled interactions [BB07]. From a different perspective, King
et al. [KOS11] defined Twitter users as a collection of binary decisions of either following or
not an influential set of politicians. They measured the distance between users, so that the
more coincidences in the follow decisions, the closer users are. They found two large clusters
of users and measured the polarization based on how close users are within a cluster, and
how far clusters are from each other.
Chapter 4
DIGITAL TRACES AND
COMPUTATIONAL METHODS
Every time we consume telecommunication services we leave behind digital records of our
activity in the service providers’ databases. That data exhaust is a growing by-product
of today’s lifestyle. The datasets are huge in volume. Their nature is diverse and
includes several dimensions of human societies like friendships, debates, dating, payments,
etc. The analysis of data derived from human activity makes it possible to measure
social systems in their actual dimensions [Pen08]. However, most of the information
is unstructured, embedded in text messages, images or videos. Many of the traditional
querying methods are limited in their ability to retrieve information from this kind of data.
Therefore, a new set of methodologies and technologies has recently been proposed to query
the data [Whi09, Cho13], based on new principles of data treatment.
In this chapter we will discuss methods to treat data, obtain patterns and retrieve un-
structured information. First, we will show how to store and query data depending on its
structure. Second, we will review some of the techniques for pattern recognition. Finally,
we will describe and characterize the datasets we have built and analyzed in this thesis, as
well as review Twitter’s main functionalities and limitations, and how to gather data from its
servers.
4.1 From Data to Knowledge
The process of retrieving information from datasets requires several interdisciplinary skills.
First, we need computational knowledge, such as being able to store, query and manage the
digital stream of data. Also, we must be able to abstract the data and treat it mathematically,
in order to quantify and measure it. Finally, we must extract the information contained in
the form of patterns, by means of recognition techniques. These include statistical, computa-
tional, modeling and visualization methods. The ultimate goal is to understand the meaning
of patterns and the behavior of the system, in order to be able to propose dynamical models
to explain, reproduce or predict the observed behavior.
The information is unstructured within the data. The methodology applied in this
thesis to retrieve it consists in abstracting the data and constructing intermediate structures,
which are easier to manage and where information is condensed. Such structures can be
density functions, multi-dimensional samples or complex networks. Most of them result from
the aggregation of the collective behavior, either by looking for overall kinetics or internal
structures due to relationships. Finally, by means of pattern recognition techniques, we
can reduce the complexity of the system and find the best descriptions for its structure and
dynamical change.
4.1.1 Data Management
The first step for data analytics is to build the infrastructure necessary to store and query the
data. In this section we will present the methodologies followed to manage these processes,
according to the structure of the data model.
All databases ultimately rely on large files on the server’s hard drives,
containing enormous amounts of bytes. The data contained in these files are usually organized
by following a data model, where entries are defined as a set of values that correspond to
attributes or fields. The nature of the information and the way fields are organized define
the data structure and consequently the methodology followed to treat it.
There are two kinds of data: structured and unstructured. The structured data is given
in tables of rows and columns; while unstructured data do not follow a predefined model.
Take online messages, for example: one could build a table with a field called text and
store all the posts in a single table. But traditional ways to query data will not answer
questions like which is the most influential topic or which is the least popular. That kind of
information is unstructured across the records.
Relational Tables
Relational tables are the way structured data are usually stored [DD87]. Tables follow
a schema composed of fields of any data type, such as numbers, text strings,
dates or booleans. These fields are the table columns, and the rows are entries that fill the
columns with a set of values. The number of columns is fixed, but the number of rows is not. Each
row is identified by a unique key. Keys serve as indexes for the database query engine,
such as an SQL engine, to optimize searches. Also, they can be used as columns in other tables, creating
relationships between entries of different tables. The ultimate goal of these methods is to
retrieve fields across tables joined by their related identifiers.
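As an illustration of keys relating entries across tables, the following minimal sketch uses Python's built-in `sqlite3` module with an in-memory database. The `users` and `messages` tables, their columns and their contents are hypothetical examples and not the schema used in this thesis.

```python
import sqlite3

# In-memory database holding two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,   -- unique key of each row
        name    TEXT NOT NULL
    );
    CREATE TABLE messages (
        message_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        text       TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO messages VALUES
        (10, 1, 'hello world'),
        (11, 1, 'second post'),
        (12, 2, 'hi there');
""")

# Retrieve fields from both tables, joined by the shared identifier.
rows = conn.execute("""
    SELECT users.name, messages.text
    FROM messages
    JOIN users ON users.user_id = messages.user_id
    ORDER BY messages.message_id
""").fetchall()

for name, text in rows:
    print(name, "wrote:", text)
```

The `user_id` column plays both roles mentioned above: it is the unique key of `users` and a regular column of `messages`, which is what makes the join possible.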
Non-Relational Methods
Big Data is not about querying tables with very large amounts of entries. It is about
finding patterns in the entries at different, usually larger, scales [BYB13], and retrieving the
unstructured information contained in the raw data. For this purpose, data must be processed
with other kinds of methods. New architectures, like Hadoop [Whi09] or MongoDB [Cho13],
provide mechanisms to build flexible models for the diversity of data found nowadays. These
methodologies usually allow parallel processing, which enables a better scaling than SQL
tables. Queries can be performed on multi-server environments, where several stations query
in parallel across millions of entries. Also, the models are file-based. The data entries are
typically stored in text files in the form of key/value dictionaries, such as the JSON format
[Cro06], where keys identify the columns from the data model and values represent the
record. This way of organizing data makes it possible to nest records and dynamically modify
the data model.
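A small example may help to see what nesting and a dynamic data model mean in practice. The record below is a made-up message, not an actual entry from the datasets of this thesis, and its field names are assumptions chosen for illustration.

```python
import json

# One hypothetical message stored as a key/value record. The nested "user"
# and "hashtags" fields would need extra tables in a relational schema.
raw = """
{
  "id": 1,
  "text": "Map and Reduce in practice #bigdata",
  "user": {"name": "alice", "followers": 120},
  "hashtags": ["bigdata"]
}
"""

record = json.loads(raw)
print(record["user"]["name"])  # access a nested field

# The data model can change per record: a new field needs no schema update.
record["geo"] = {"lat": 40.44, "lon": -3.73}
print(json.dumps(record, sort_keys=True))
```

Each record carries its own keys, so two entries in the same file do not need to share exactly the same set of fields.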
Map and Reduce
Map and Reduce is an alternative methodology to query data [DG08]. It is based on
two processes: to map and to reduce. The goal is to divide the problem into several smaller
sub-problems which can run in parallel and later be summarized.
The map procedure turns the original data into intermediate structures. These are usually
relations between keys and values that contain a partially processed portion of the data. The
reduce procedure gathers all the results of the mappers and combines them into a single result.
For this purpose, the reduce procedure can either sum, accumulate or process the partial
results with any mathematical function. For instance, if we want to count all the visible stars
in the sky, we could have one single station counting region by region; or instead divide the
sky into a thousand pieces, and assign the same task to a thousand processes. These processes
are the mappers. Each of them will count the stars in its own sky portion, which it should be
able to do in a reasonable amount of time. Then another process will gather
and combine all the partial results into a single one. Furthermore, let us suppose we want
to classify the observed stars according to their type. In this case, each mapper can count
the number of each type of star in their own sky portion. Accordingly, the reduce algorithm
can sum up all the partial counts into a single final one.
This methodology enables parallel computing, which is very effective in terms of time
and computational cost. The parallel computing can take place on a single station, by
means of multi-threading the code, or across several workstations, as in computing clouds. A
typical implementation of this algorithm is composed of the following three methods:
1. Manager: This is the main method where we open the data files, create the mappers,
execute them, wait for their partial results and call the reduce algorithm.
2. MapFunction: This is the mappers method, where we build the partial results and
deliver them to the manager process. The communication between processes can be
done by means of a queue object.
3. ReduceFunction: This method aggregates all the partial results into a single one.
In Python¹, new processes are created with the Process class from the multiprocessing
package². Its constructor creates a process object which receives as arguments the
target function to execute and the corresponding arguments. Once started, the new process
executes the target function and terminates at its conclusion.
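The three methods listed above can be sketched with the `Process` and `Queue` objects of the `multiprocessing` package, applied to the star-counting example. This is a minimal illustration under stated assumptions rather than the code used in this thesis: the function names, the way the data is split and the star types are all hypothetical.

```python
from collections import Counter
from multiprocessing import Process, Queue


def map_function(sky_portion, queue):
    """Mapper: count the stars of each type in one portion of the sky."""
    queue.put(Counter(sky_portion))


def reduce_function(partial_counts):
    """Reducer: sum all the partial counts into a single result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def manager(sky, n_mappers=4):
    """Split the data, run one mapper per portion, then reduce."""
    portions = [sky[i::n_mappers] for i in range(n_mappers)]
    queue = Queue()
    mappers = [Process(target=map_function, args=(portion, queue))
               for portion in portions]
    for mapper in mappers:
        mapper.start()
    # Read the results before joining so that no mapper blocks on the queue.
    partial_counts = [queue.get() for _ in mappers]
    for mapper in mappers:
        mapper.join()
    return reduce_function(partial_counts)


if __name__ == "__main__":
    observed = ["dwarf", "giant", "dwarf", "supergiant", "dwarf", "giant"]
    print(manager(observed))
```

The `if __name__ == "__main__":` guard matters here, because on platforms that start child processes by re-importing the main module, unguarded code would be executed again in every child.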
4.1.2 Finding Patterns
In order to find the patterns hidden in data a set of variables must be defined first. These
variables are going to characterize the information. The raw data will be aggregated into
these variables. Then, the information is retrieved by means of statistical analysis
and pattern recognition. In this section, we will review the most important techniques for
pattern recognition that we have used in this thesis.
Probability Distributions
The first approach to analyze a variable is to characterize its distribution. Such a
characterization makes it possible to determine the expected outcomes of different experiments, determine
ranges of the sample space and find anomalies in the data. A probability distribution is a
function that indicates the likelihood of a random variable to take on given values, defined in
a probabilistic space. These values could be categorical or numerical, contained in discrete or
continuous random variables. The probability of a variable to take on a given range of values
¹ http://docs.python.or
² http://docs.python.org/2.7/library/multiprocessing.html
is given by the integral of the density function over that range. The curve of a probability
distribution is nonnegative and its integral across the whole space is equal to one, which is
the probability that any one of the possible events happens.
After defining variables in data, the probability density functions are built by counting
the frequency of each outcome. An easy way is to first build the cumulative distribution
function. To do so, we sort the data entries in ascending order and plot
them against a vector from 0 to 1 with the same number of equally spaced points as the
original sample. Then we can take the derivative of this function to find the probability
density function of the variable.
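The procedure just described can be written in a few lines. The following is a rough sketch under simple assumptions: the derivative is approximated by finite differences, and `step` is an illustrative smoothing parameter that does not come from the text.

```python
def empirical_cdf(sample):
    """Return (xs, ps): the sorted values and equally spaced levels from 0 to 1."""
    xs = sorted(sample)
    n = len(xs)
    ps = [i / (n - 1) for i in range(n)]  # same number of points as the sample
    return xs, ps


def density_from_cdf(xs, ps, step=1):
    """Finite-difference derivative of the CDF, evaluated at interval midpoints.

    `step` sets how many sample points each difference spans; larger values
    give a smoother but coarser estimate.
    """
    midpoints, densities = [], []
    for i in range(0, len(xs) - step, step):
        width = xs[i + step] - xs[i]
        if width > 0:  # skip ties, where the derivative is undefined
            midpoints.append((xs[i] + xs[i + step]) / 2)
            densities.append((ps[i + step] - ps[i]) / width)
    return midpoints, densities


if __name__ == "__main__":
    xs, ps = empirical_cdf([3, 0, 4, 1, 2])
    print(density_from_cdf(xs, ps))
```

With real, noisy data the raw derivative fluctuates strongly, which is why a larger `step` (or some other form of binning) is usually needed before the density can be interpreted.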
Depending on the distribution, statistical moments like the average and the standard
deviation are good enough to characterize the population. Homogeneous distributions such as
Gaussian or uniform distributions meet this criterion. However, these statistical
moments may diverge in heterogeneous distributions like power laws. Therefore, when
characterizing the distribution of variables in complex systems, it is important to understand the
structure of the system to avoid making wrong assumptions.
Correlations
The correlation of two or more variables indicates whether or not there is a relationship
of dependence between them. There are several ways to measure the strength and direction
of such dependence. The strength measures how strongly or weakly variables are correlated.
The direction indicates if the variables are correlated in the same direction or opposite
directions.
There are several correlation coefficients to quantify the degree of the variables’
relationship. The Pearson coefficient is one of the most popular and is the one used in this thesis. This
coefficient is often denoted as r and measures the linear relationship between two variables.
It is obtained as the ratio of the covariance of the two variables, X and Y, to the product
of their standard deviations, σx and σy, in the following way:
$$r = \frac{E[(X - \mu_x)(Y - \mu_y)]}{\sigma_x \sigma_y} \tag{4.1}$$
where E[·] is the expected value operator, and µx and µy are the average values of X and Y.
The coefficient r ∼ 1 when there is an increasing relationship between the variables, which
means that the larger the first, the larger the second. If, in turn, the relationship between the
two variables is inverse, so that the larger the first the smaller the second, then r ∼ −1. Finally,
if there is no correlation between the variables and they seem to behave independently, then r ∼ 0.
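The three regimes of r can be checked numerically with a direct transcription of Eq. 4.1. This is a minimal sketch with a hypothetical function name, in which the expected values are replaced by sample averages.

```python
import math


def pearson_r(xs, ys):
    """Pearson coefficient: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    print(pearson_r(x, [2 * v + 1 for v in x]))  # increasing relationship
    print(pearson_r(x, [-v for v in x]))         # inverse relationship
    print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relationship
```

The last pair of series is clearly related (it is symmetric around the center), yet the coefficient vanishes, which is a reminder that r only captures linear dependence.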
It is also possible to compare series of random variables. In this case a different kind
of correlation is applied, such as the cross-correlation. Statistically, it is defined as the
correlation of all points of both time series at different times, and it is computed as the
convolution of both signals:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{4.2}$$
A shifted correlation would indicate the evolution of the joint behavior of two series,
highlighting the instants where the series have independent or coherent behavior.
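For sampled series, the idea of a shifted correlation can be illustrated by computing the Pearson coefficient between one series and a displaced copy of the other. Note that this sketch is a normalized, discrete variant chosen for illustration and is not a literal implementation of Eq. 4.2. The function names and the sign convention for the lag are assumptions.

```python
import math


def lagged_correlation(f, g, lag):
    """Pearson correlation between f[t] and g[t + lag] over the overlapping samples."""
    if lag >= 0:
        a, b = f[:len(f) - lag], g[lag:]
    else:
        a, b = f[-lag:], g[:len(g) + lag]
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_lag(f, g, max_lag):
    """Lag in [-max_lag, max_lag] at which the two series are most correlated."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: lagged_correlation(f, g, lag))


if __name__ == "__main__":
    f = [0, 1, 4, 2, 7, 3, 9, 5, 1, 6]
    g = [0] + f[:-1]  # g repeats f one step later
    print(best_lag(f, g, max_lag=3))
```

In this toy example the second series is a copy of the first delayed by one step, so the correlation peaks at a lag of one and is lower at every other shift.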
Data Clustering
The clustering process aims to find similarity between samples of a set of data. It is the
process of grouping a set of observations in a way that those who belong to the same group
are more similar to each other than to those that belong to other groups. The broad definition
of a cluster has given rise to several interpretations and therefore several kinds of algorithms
to solve the problem. Most of them are based on a notion of distance between observations and
on criteria to determine which element is closer to which.
An example used in this thesis is the k-means clustering algorithm [Mac67]. In this
algorithm clusters are represented by central vectors. These are not necessarily part of the
dataset and their quantity is fixed to a previously defined number k. The algorithm therefore
seeks the k cluster centers that minimize the sum of squared distances from each observation
to its cluster center. Its main limitations are that the number k must be given in advance and
that it is not very accurate at determining cluster borders, since it is designed to find centers and
not borders. However, methods like the silhouette [Rou87] make it possible to measure the quality of
partitions, based on how close elements are to their own center versus how far they are
from other centers.
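A compact way to see how the centers are found is Lloyd's iteration, the usual heuristic for k-means, which alternates between assigning points to the nearest center and moving each center to the mean of its members. The sketch below is an illustrative pure-Python version with hypothetical names. It converges to a local optimum that depends on the random initial centers, and it is not the implementation used in this thesis.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's iteration: alternate assignment and center update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct observations
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # no point changed cluster, so the centers are stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, k=2))
```

The resulting centers are averages of observations, so they generally do not coincide with any element of the dataset, as noted above.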
Multidimensional scaling (MDS) is a methodology to visualize the similarity of entries in
the dataset [BG05]. It treats the objects as points described by a set of variables in a
multi-dimensional space. The model consists in rearranging the elements in order to reduce the
dimensionality while preserving the original distances between the elements as closely as
possible. Therefore, the multiple dimensions are reduced to a few, and new coordinates are
assigned to elements in the new space. Using two dimensions makes it possible to plot the
entries as scattered dots and find clusters of closer elements.
Networks
Another technique to find patterns in data is the theory of complex networks [BLM+06]. By
means of the construction and analysis of networks, the structure and dynamics of many
natural, social and technological systems have been revealed. This methodology unveils
patterns at different scales by aggregating the relationships occurring at the local scale. The
idea is to understand the complex system, abstracted in the form of a network, by means of the
graph’s topological properties. This means characterizing the way the system is structured,
for instance by unveiling mixing patterns, as well as modeling the dynamical processes that take
place on the network, such as information spreading.
4.1.3 Statistical Significance
A challenge when measuring patterns in related random variables is to discriminate sampling
errors and to confirm that the claimed effects are not the result of random processes. The
statistical significance measures how far observations are from being just random and
indicates whether observations represent actual properties of the population [Cum12].
The z-score is a measure to test the significance of a given value with respect to the
expected from a normal distribution. It normalizes the variable x according to the average
µ and the standard deviation σ in the following way:
$$z = \frac{x - \mu}{\sigma} \tag{4.3}$$
The value of z indicates how common an observation is with respect to the probabilistic
space. A low z-score (in absolute value) means that the observation is common and that the
probability for it to happen is high. A high z-score means that the observation is not that
frequent and that its probability of occurrence is low. The probability of occurrence of a
normalized variable is called the p-value. The p-value is inversely related to the amount of
information contained in an observation: the less probable it is, the more information it contains [Sha01].
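The relation between z-score, probability and information can be made concrete with a few helper functions. This sketch rests on two assumptions that go beyond the text: the p-value is taken as the two-sided tail probability of a standard normal variable, and the information content is measured as Shannon's surprisal in bits. The function names are hypothetical.

```python
import math


def z_score(x, mu, sigma):
    """Normalize x by the average mu and the standard deviation sigma (Eq. 4.3)."""
    return (x - mu) / sigma


def two_sided_p_value(z):
    """Probability that a standard normal variable is at least as extreme as z."""
    return math.erfc(abs(z) / math.sqrt(2))


def surprisal_bits(p):
    """Shannon information content, in bits, of an event with probability p."""
    return -math.log2(p)


if __name__ == "__main__":
    for x in (100, 115, 130, 145):
        z = z_score(x, mu=100, sigma=15)
        p = two_sided_p_value(z)
        print(x, z, p, surprisal_bits(p))
```

Running the loop shows the pattern described above: as the observation moves away from the average, the z-score grows, the probability shrinks and the information content increases.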
The same applies to patterns: we need to know to what extent the observed patterns are not
just a matter of chance, and how much information is contained in the measure. A
common technique to test the significance of the structure of networks or dynamical models
is to suppose that the measured configuration is just one realization of a stochastic process.
To test this, we can rewire the network edges and create several independent
configurations from the same space of possibilities. Then we can compare our measure with
the results from the reshuffled networks, and estimate its statistical significance.
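One possible realization of this test, for an undirected network without self-loops or repeated edges, is sketched below. It makes several assumptions that are not stated in the text: the reshuffling preserves every node's degree through double edge swaps, the measure being tested is the number of triangles, and significance is reported as a z-score against the rewired ensemble. All names and default parameters are hypothetical.

```python
import random
import statistics


def count_triangles(edges):
    """Number of triangles in an undirected simple graph given as an edge list."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    # Every triangle is counted once from each of its three edges.
    return sum(len(neighbours[u] & neighbours[v]) for u, v in edges) // 3


def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization by repeated double edge swaps."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:  # try both orientations of the second edge
            c, d = d, c
        # Propose the new edges a-d and c-b; reject self-loops and duplicates.
        if len({a, b, c, d}) < 4:
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def significance(edges, measure, n_random=200, seed=0):
    """z-score of `measure` on the observed graph against its rewired versions."""
    rng = random.Random(seed)
    null = [measure(rewire(edges, 10 * len(edges), rng)) for _ in range(n_random)]
    mu, sigma = statistics.mean(null), statistics.pstdev(null)
    if sigma == 0:
        raise ValueError("all rewired networks give the same value")
    return (measure(edges) - mu) / sigma
```

A call such as `significance(edge_list, count_triangles)` then indicates how many standard deviations the observed number of triangles lies from what the degree sequence alone would produce.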
4.2 Twitter Datasets
Twitter is an online social network with over 200 million users around the globe. Its main
feature consists in allowing people to post and exchange text messages limited to 140
characters [JSFT09]. People use it from personal computers and, increasingly, from mobile
devices. According to recent research on user trends [Com11], most people participate
in social media away from personal computers. Each message contains information about
its author, creation date, device source, text body and sometimes geolocation. By default
messages are public on Twitter, but users have the option to make them private and share
them with selected contacts.
There are several mechanisms for users to interact on Twitter. The first of these is the
followers mechanism. It allows users to passively receive all the messages posted by those
whom they follow, as well as to deliver their own messages to their own followers. In this sense, it
establishes the Twitter followers network, where users are connected to each other
through links that determine the explicit paths along which messages are delivered. Twitter’s
global followers network is a directed graph where non-reciprocal relations are admitted.
Previous studies have reported complex properties in this network [KLPM10], like degree
distribution with power law behavior, small mean distance between nodes and modular
structure.
An important mechanism on Twitter is the retweet, which is used to retransmit messages
from other sources. It is the most popular mechanism to propagate received messages and
allows individual messages to travel throughout the network. By retweeting a message, users
deliver specific information to their own followers, while at the same time endorsing ideas
and gaining visibility in the network [BGL10]. The study of retweet cascades has served to
characterize user profiles [GAC+10], measure influence [CHBG10] and propose spreading
models [XLZ+12].
All messages on Twitter may be identified using keywords called hashtags. This mechanism
organizes conversations, and individuals use it to exchange ideas on specific subjects. It is
responsible for generating the trending topics, and people use it to discuss and exchange ideas
without the necessity of having any explicit relationship. Recently, the statistical analysis
of hashtag usage has enabled predictions about social relations [RTU11] and collective
attention [LGRC12].
4.2.1 Data Gathering
Twitter has several Application Programming Interfaces (APIs) that let people programmatically
interact with the online service. These APIs are used to gather the data. There are three
main Twitter APIs:
1. The Search API³ queries messages from a temporary index of recent tweets, posted
within the last week. Queries must contain a keyword to look for in the
message's text. Its limits are specified in terms of query complexity and
frequency, instead of as a percentage of the main stream.
2. The Stream API⁴ delivers real-time data, providing about 1% of the
main stream. It may track keywords, users or geolocated messages.
3. Finally, the REST API⁵ is used to programmatically perform functions like posting
messages or following people by means of applications. It also allows downloading
user-related information like profiles or followers lists.
4.2.2 Datasets
Using the Twitter Search API, we have built several datasets from public access messages.
Many of the datasets are related to events, like political protests, electoral campaigns or
historical announcements. We have queried the Twitter databases by looking for messages
that contain keywords (or hashtags) that identify the events. In this section we will describe
each of these datasets. Their properties can be found in Table 4.1.
The main analyses conducted in this thesis are related to two datasets regarding Venezuelan
politics. Venezuela ranks thirteenth in the world in Twitter penetration [Sem12]. Close to
3 million Venezuelans participate in this online social network, which is equivalent to
almost 10% of the country's population⁶. The political usage of Twitter in Venezuela is of
great importance and has played a fundamental role in recent Venezuelan history
[MV12, NT12]. The late President Hugo Chavez was considered to be the second most
influential world leader on Twitter [Cou12], preceded only by the US President Barack Obama.
The opposition to the late President also finds in social media a channel to speak freely
to its supporters and to protest against the Government [MLB12].
³ https://dev.twitter.com/docs/using-search
⁴ https://dev.twitter.com/streaming/overview
⁵ https://dev.twitter.com/rest/public
⁶ http://www.ine.gov.ve/
The first dataset we considered is related to a Venezuelan political protest that took
place exclusively by digital means on December 16th, 2010. The event consisted in posting
messages identified with the hashtag #SOSInternetVE. We downloaded all the messages
that included this hashtag between December 14th and 19th, 2010 (two days before and after
the protest). In total we found 421,602 messages, written by 77,706 users. It is remarkable
that 42% of the messages were retweets and 60% were sent from smartphones.
Then, we considered a conversation about the late Venezuelan President Hugo Chavez on
Twitter. The conversation includes the day of the announcement of the President's death,
as well as the schedule for new elections. In total we downloaded over 16,383,490 messages
written by 3,173,090 users for a two-month period, from February 4th, 2013 (29 days before
the death announcement) to April 4th, 2013 (26 days after the death announcement).
Messages were posted in more than 159 countries (according to the 0.4% of messages that
are geographically located). Our analysis is based on those messages that represent retweets
or retransmissions, which correspond to 49% of the downloaded messages, and more
specifically on those that form the giant components of the retweet networks, which come
from 57% of the original set of users.
In order to generalize results, we have also considered other datasets related to
conversations of diverse nature, such as sports, news, political protests and electoral
campaigns. One of these datasets is related to a political scandal that took place in the
Spanish parliament in 2012, due to some inappropriate comments from a congresswoman that
echoed loudly on the social networks. This dataset was built by downloading 35,835 messages
from 23,498 users, using the hashtag #Andreafabra, from July 12th, 2012, to July 23rd, 2012.
Another dataset concerns a conversation about a Venezuelan baseball team. It was built by
downloading 142,808 messages that contained the team's name, leones, posted by 46,608 users
during a 3-week period from Dec. 22nd, 2010, to Jan. 12th, 2011. We have also constructed
a dataset regarding the announcement of the Spanish separatist band, ETA, declaring the
end of the armed struggle. We downloaded 617,545 messages posted by 241,292 users during
a ten-day period from Oct. 10th to 25th, 2011. We have also built another dataset concerning
the 2011 Arab Spring, by downloading 7,433,542 messages that contained the keyword
(and hashtag) Egypt, posted by 1,180,715 users during a 5-week period, from Jan. 12th,
2011, to Feb. 17th, 2011. During this period the former Egyptian president Mubarak was
overthrown by the social demonstrations. One dataset concerning the 2012 US presidential
elections was built by gathering all the messages that contained the word Gingrich during a
one-week period from Feb. 29th, 2012, to Mar. 3rd, 2012. This dataset is composed of 93,063
messages and 43,061 users. Another dataset regarding the same elections was built by
collecting messages mentioning Obama during the first televised debate, from Oct. 3rd, 2012,
to Oct. 5th, 2012. This dataset is composed of 6,818,782 messages and 2,265,799 users.
The last of these datasets is related to the 2011 Spanish electoral process. It has been built
with all the messages that contained the keyword (and hashtag) 20N, which was used by all
parties in reference to the election day on Nov. 20th, 2011. This dataset covers the
period from Oct. 29th, 2011, to Nov. 27th, 2011 and it is composed of 389,988 messages
and 123,710 users. In [BMLB12], we characterized the interactions between users and
politicians during these elections and found that the mass media accounts widely dominated
the attention received through the retweet mechanism, while politicians ruled the mentions
scenario.
Identifier      Messages      Users       Dates
Andreafabra     35,835        23,498      Jul. 12th to 23rd, 2012
Gingrich        93,063        43,061      Feb. 29th, 2012 to Mar. 3rd, 2012
Leones          142,808       46,608      Dec. 22nd, 2010 to Jan. 12th, 2011
20N             389,988       123,710     Oct. 29th, 2011 to Nov. 27th, 2011
SOSInternetVE   421,602       77,706      Dec. 14th to 19th, 2010
ETA             617,545       241,292     Oct. 10th to 25th, 2011
Obama           6,818,782     2,265,799   Oct. 3rd, 2012 to Oct. 5th, 2012
Egypt           7,433,542     1,180,715   Jan. 12th, 2011 to Feb. 17th, 2011
Chavez          16,383,490    3,173,090   Feb. 4th, 2013 to Apr. 5th, 2013
Geolocated      500,000,000   -           Oct. 1st, 2013 to Jan. 31st, 2014
Table 4.1: Description of the studied datasets.
Most of these datasets are related to events that occurred offline, such as televised debates,
electoral processes or historical happenings. In Fig. 4.1 we present the temporal evolution
of the Twitter activity during three of these events: the Spanish election (in panel A), the
Obama debate (B) and the Egyptian revolts (C). It can be noticed that during these events
there is a burst of activity, characterized by an abrupt growth followed by a smooth
decay. This pattern is remarkably ubiquitous regardless of the number of people participating
or the number of messages sent. As shown in panels A, B and C, the height of the activity peak
can span several orders of magnitude, and yet the curves still present a similar shape.
Moreover, the scale independence is also temporal. The gradual decrease of activity after
the peak can last from a couple of hours, as shown in panel C, up to several days, as shown
by the enveloping curve in panel D.
Finally, we also studied geolocated messages from the Twitter Stream API. Unlike in
the previous datasets, these messages are not filtered by keywords but by having
the geolocation option enabled. In general, geolocated messages represent around 3% of the
Stream API messages. However, since these messages represent a minority of the overall
stream, the Stream API provides 90% of them [MPLC13]. In summary, we collected roughly
500 million geolocated tweets between October 1, 2013 and January 31, 2014 from across all
latitudes and longitudes.
4.2.3 Representativity
In order to conduct research with Twitter data, we must consider the following facts. Due to
the technological nature of Twitter, its users tend in general to be younger than the average
person and to live in denser, more urban areas [DB13, MLA+11]. Also, not all countries use
Twitter in the same way. For instance, it is banned in countries like China or Iran.
Therefore, a random sample of Twitter users may not necessarily be representative of the
whole society [GA12]. However, Twitter datasets are so massive that they enable the
observation of tendencies and patterns in the behavior of millions of people and their
interactions [Mil11].
4.3 Mobile Phones Datasets
Mobile phone datasets are made up of Call Detail Records (CDRs). These are produced in
the communication provider's databases by any phone call or SMS. A CDR usually contains
information about the origin and destination phone numbers, the starting time and duration
of the call, and the antenna that is serving the subscriber. For telephone service providers,
CDRs are critical for the production of the monthly bill. However, they also contain a wealth
of information for identifying individuals and their usual behavior and location.
Over the last few years, due to the exponential increase in the penetration of mobile
phones, new opportunities for obtaining such indicators have emerged. In particular, the
use of mobile phones as sensors of human behavior has yielded important research findings
in large-scale social dynamics analysis in areas such as human mobility [GHB08],
information diffusion [OSH+07], social development [BLT+11], epidemiology [WET+12] and
disaster response [BWB11].
Studies that use mobile phone data usually anonymize the personal data in the CDRs,
replacing phone numbers with random identifiers. However, recent research has shown that
individual trajectories are so unique that just a few locations are enough to identify any
individual [MHVB13]. Although this fact is relevant for user privacy, most of the scientific
interest does not lie in tracking individuals, but rather in finding collective behavioral
patterns that explain social processes. For that reason, data is usually aggregated either
socially or geographically.
Figure 4.1: Temporal evolution of Twitter activity (messages/hour) corresponding to datasets:
(A) 20N, (B) Egypt, (C) Obama and (D) Chavez, described in Table 4.1. All panels display
the impact of events on Twitter activity. All four present a burst of activity
when the event takes place, which gradually decreases down to previous levels. Panels (A), (B) and
(C) have similar patterns despite spanning three orders of magnitude on the y-axis. The envelope
curve in panel (D) presents the same pattern across a different time scale. The gradual decrease
of activity spans several days. The inset curve corresponds to the activity during the area
shaded in green, on a linear scale.
4.3.1 Datasets
We first analyzed the CDR data provided by France Telecom/Orange Côte d'Ivoire within the
framework of the Data for Development (D4D) Challenge [BEC+12]. The data was collected
for 150 days, from December 1, 2011 until April 28, 2012. The set of collected CDRs contains
2.5 billion calls and SMS exchanges between around five million anonymized users. In this
thesis we work on the following datasets from the D4D project:
1. Antenna-to-antenna: This dataset includes the aggregated number and duration of
calls between any pair of antennas per hour. This means that each record of the
dataset contains the number and duration of all calls made from one antenna to the
other during each hour of the observation period. Therefore, there is no user-level data
in this dataset.
2. Individual trajectories: This dataset regards the movement of people between the
antennas during calls. It contains the trajectories of 50,000 individuals among antennas.
Each record indicates the time and location of a user whenever they started or
received a call. In order to preserve privacy, the identity of all users was randomized
every two weeks.
3. Antenna location: The locations of the antennas were provided together with the datasets.
However, a random displacement was added to the actual locations, in order to protect
the company's sensitive information.
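As an illustration of the antenna-to-antenna aggregation described in item 1, the following minimal sketch groups individual call records by hour and antenna pair. The record layout (timestamp, origin antenna, destination antenna, duration in seconds) and the function name are assumptions made for this example; the D4D data was delivered already aggregated.

```python
from collections import defaultdict
from datetime import datetime

def aggregate_antenna_pairs(cdrs):
    """Aggregate (timestamp, origin, destination, duration_s) records into
    {(hour, origin, destination): (number_of_calls, total_duration_s)}."""
    totals = defaultdict(lambda: [0, 0])
    for timestamp, origin, destination, duration in cdrs:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        entry = totals[(hour, origin, destination)]
        entry[0] += 1
        entry[1] += duration
    return {key: tuple(value) for key, value in totals.items()}
```

Because the keys contain only antennas and hours, no user identifier survives the aggregation, which is the property the text relies on.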
We also analyzed CDR data from the mobile operator Telefonica⁷ in Mexico. Among all
the data contained in a CDR, our study uses the anonymized originating and destination
numbers, the date and the duration of the call, as well as the latitude and longitude of the
serving antennas. We analyzed a total of nine months, from July 2009 to March 2010. In
order to protect privacy, all the information presented is aggregated above the user level.
No contract or personal data was collected, accessed or utilized for this study. No authors
of this study participated in the extraction of the dataset.
⁷ www.telefonica.es
4.4 Additional Sources of Information
In order to complement the data collected from Twitter and the mobile phone datasets, we also
analyzed the following sources of information.
1. Global Administrative Areas Database
The GADM⁸ provides GIS-compatible maps of administrative areas worldwide. GADM
was used to classify the antenna locations on the map and associate them with
administrative boundaries in Venezuela, Ivory Coast and Mexico.
2. Language Map from Ethnologue
The Ethnologue: Languages of the World⁹ is a reference work cataloging all of the
world's known living languages. We have used the ethnic and language maps of Ivory
Coast [Lew09] in order to classify antenna locations and map them to ethnic groups
and languages.
3. African Infrastructure Knowledge Program
The African Infrastructure Knowledge Program¹⁰ from the African Development Bank
provides GIS-compatible maps of transport, communication, power, sanitation and
water infrastructure. We have used the maps of main roads in Ivory Coast.
4. Electoral Data
The results from the national and regional 2013 elections in Venezuela¹¹ have been
used to compare the results from the Twitter analysis with the offline context.
5. Census Data
The most recent official census of Venezuela¹² has been considered to estimate the
Twitter penetration in Venezuela. Also, the most recent official census of Mexico¹³
has been used to assess the representativeness and validate the population distribution
inferred from the mobile phone data.
⁸ http://www.gadm.org/
⁹ http://www.ethnologue.com/
¹⁰ http://www.infrastructureafrica.org/
¹¹ http://www.cne.gob.ve
¹² http://www.ine.gov.ve/
¹³ http://www.censo2010.org.mx/
6. Satellite Imagery Data
Multispectral, medium resolution (15 to 60 meters) Landsat 7 ETM+¹⁴ satellite images
have been used for detecting and delimiting floods. The temporal resolution of this
data source is 16 days, so it helps to approximate the flooded area with reasonable
accuracy, at least before and after the flooding happened. The spatial resolution is
high enough to segment broad floods, river overflows or lake leakages. The satellite
imagery data allows us to spatially delimit the affected regions with better accuracy than
the vague approximations that could be inferred retroactively from news or historical
documents.
7. Precipitation Data
The Tropical Rainfall Measuring Mission project¹⁵ provides high resolution (3 hours
of temporal resolution and 0.25 squared degrees of spatial resolution) measurements of
precipitation levels worldwide. The spatial resolution of this data is lower than that of
the satellite images used to segment the floods, but high enough to obtain a realistic
precipitation level in the affected area. On the other hand, the temporal resolution is
adequate to generate a time series comparable to the CDR data.
¹⁴ http://earthexplorer.usgs.gov/
¹⁵ http://trmm.gsfc.nasa.gov/
Chapter 5
HUMAN BEHAVIOR DURING
POLITICAL MOBILIZATION
In this chapter, we analyze user behavior from the Twitter activity during a political
mobilization process, namely the Venezuelan protest #SOSInternetVE (see section 4.2.2).
We characterize users according to their role in the information diffusion process [MLB12].
We build two kinds of networks to represent the phenomenon. First, we construct networks
that represent who receives whose messages, which we identify as the social substratum
on which the information may flow. Second, we build the information diffusion networks,
relating who forwards whose messages, in order to represent the effective channels through
which information actually flowed within the social substratum. Then, based on graph
theory (see chapter 2), we calculate and correlate several measures to understand the social
structure and the dynamical patterns that emerge from the studied conversation.
The organization of this chapter is as follows. In section 5.1 we present the temporal
evolution of the protest activity and in section 5.2 we study the individual user behavior.
Then, from sections 5.3 to 5.6 we discuss the structures formed by the users when they
interact with each other, either passively or actively. Next, in section 5.7, we describe the
underlying user behavior behind such structures. And finally in section 5.8 we show these
structures from the mesoscale point of view.
5.1 Temporal Behavior
We first analyze the temporal evolution of the Twitter activity related to the protest,
measured by the number of messages posted per minute. At the top of Fig. 5.1 we present the
evolution of the message rate for the period December 14th-19th, 2010. This series has a
similar shape to the Twitter time series modeled in the study of Yang [YL11] during critical
events. It can be noticed that at the beginning of December 14th, 2010, the studied hashtag
did not even exist on the Twitter servers. Then, after its first appearance on the same day,
some user activity was recorded. Yet it is on December 16th, 2010, when the protest actually
takes place, that the trending topic bursts and reaches its highest point, showing features
of critical phenomena. However, after December 18th, 2010, much of the interest is lost and
the trend decays very fast, as expected for trending topics on Twitter [AHSW11].
Figure 5.1: Top: Time evolution of the message rate (messages/minute) of the Venezuelan protest
#SOSInternetVE. Arrows indicate some of the times when the protest convener participated.
Bottom: Time evolution of the accumulated percentage of messages (dashed line) and participant
users (solid line).
The protest growth can be seen more clearly at the bottom of Fig. 5.1, where we have
plotted the accumulated percentage of messages (dashed line) and users (solid line) as a
function of time. It is remarkable that the system grew from 22% to 87% in terms of users,
and from 12% to 84% in terms of messages, in a time frame of 7 hours. This interval is
highlighted around the afternoon of December 16th, 2010 in Fig. 5.1, and coincides with the
main burst.
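A minimal sketch of how the two curves of Fig. 5.1 could be computed, assuming that each message is available as a hypothetical (timestamp, user_id) pair. The function names are illustrative and do not correspond to the pipeline used in this thesis.

```python
from collections import Counter
from datetime import datetime, timedelta

def messages_per_minute(messages):
    """Message rate: number of messages posted in each minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts, _ in messages)

def cumulative_percentages(messages):
    """Return (timestamp, % of messages so far, % of distinct users so far)
    for every message, in chronological order."""
    ordered = sorted(messages, key=lambda message: message[0])
    total_messages = len(ordered)
    total_users = len({user for _, user in ordered})
    seen_users = set()
    curve = []
    for n, (ts, user) in enumerate(ordered, start=1):
        seen_users.add(user)
        curve.append((ts,
                      100.0 * n / total_messages,
                      100.0 * len(seen_users) / total_users))
    return curve
```

The growth reported above corresponds to reading the two percentages of this curve at the start and at the end of the highlighted 7-hour window.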
Furthermore, it can be noticed that the number of users that participate in the protest
saturates faster than the number of messages at all times. This is a typical feature of local
interest conversations [KLPM10], where users post messages repeatedly on the same topic.
For example, by the end of December 14th, 2010, the day the protest was called, 15%
of the users had already participated. However, the messages they had posted did not even
reach 7% of the total amount.
Figure 5.2: Complementary cumulative distribution of the user activity during the Venezuelan
protest #SOSInternetVE. Solid line is the fit to an exponentially truncated power law,
$P(x > x^*) \propto x^{-\beta} e^{-x/c}$, where $\beta = 0.880 \pm 0.001$ and $c = 65.0 \pm 0.6$ at the last day.
5.2 Individual Behavior
The user activity $A_i$ is defined as the sum of the original and retransmitted messages
sent by each participant $i$. In Fig. 5.2 we show the evolution of the cumulative distribution
of the number of messages sent (posted) per user over the different days that the protest
lasted. It can be noticed that the distribution can be fitted to an exponentially truncated
power law of the form $P(x > x^*) \propto x^{-\beta} e^{-x/c}$, where $\beta = 0.880 \pm 0.001$
and $c = 65.0 \pm 0.6$ at the last day. It is remarkable that there is a clear distinction
between the days before and after the main burst (see Fig. 5.1), which reflects the
criticality of the phenomenon. However, within each of the two stages the users presented
the same behavior, in the sense that they are distributed in the same way on all the days
before the protest, and likewise on all the days after the protest.
Figure 5.3: In (top) and out (bottom) degree complementary cumulative distributions of the
followers network from the Venezuelan protest #SOSInternetVE.
This distribution indicates a certain degree of complexity in the phenomenon and
heterogeneity in the user behavior. Before the main burst, 60% of the participants had sent
fewer than a couple of messages, 1% over 30 messages, and about 0.01% had posted over 100
messages. On the other hand, at the last day of the protest, 50% of the users had also sent a
couple of messages at most, while 1% sent over 60 messages, and just about 0.0013% posted
over 600 messages. This result shows that the percentage of most active users decreases
rapidly as the system grows.
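The percentages quoted above are read directly from the complementary cumulative distribution. A minimal sketch of that computation follows, assuming the activity of each user is given as a plain list of message counts. The helper names are hypothetical, and the second function only evaluates the unnormalized truncated power law form; it does not perform the fit reported in Fig. 5.2.

```python
import math
from collections import Counter

def ccdf(values):
    """Empirical complementary cumulative distribution P(X > x),
    evaluated at each distinct observed value x."""
    n = len(values)
    counts = Counter(values)
    remaining = n
    result = []
    for x in sorted(counts):
        remaining -= counts[x]       # observations strictly greater than x
        result.append((x, remaining / n))
    return result

def truncated_power_law(x, beta, c):
    """Unnormalized form x^(-beta) * exp(-x / c)."""
    return x ** (-beta) * math.exp(-x / c)
```

For example, a statement such as "1% sent over 60 messages" corresponds to the value of `ccdf` at x = 60 being 0.01.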
5.3 Followers Network
In the same way that users differ greatly in how many messages they post, these messages
also have different relevance in the development of the conversation. On Twitter, not all
users have the same level of visibility in the message stream, because the number of
recipients, and possible readers, strongly depends on the source's in degree in the followers
network. This social substratum may be analyzed by constructing a graph of the protest
participants, linking the users according to who follows whom. The result is a directed
and unweighted network composed of 77,706 nodes and 5,761,331 links, displaying the
structure through which information is delivered and might be spread. The edge direction
goes from the follower to the message source, thus information flows in the opposite sense
of the edges. The attention received can be measured by means of the in degree $k_{in}$. The
attention paid is measured by the out degree $k_{out}$, indicating the number of people whom
the user follows.
Figure 5.4: Scatter plot of in and out degree of the followers network from the Venezuelan protest
#SOSInternetVE. Dots represent users.
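The two degrees defined above can be sketched as follows, assuming the followers network is given as a hypothetical list of (follower, followed) pairs, so that each edge points from the follower to the message source:

```python
from collections import Counter

def follower_degrees(follows):
    """Return (k_in, k_out) for a list of (follower, followed) pairs.
    k_in[u]  = number of users who follow u (attention received).
    k_out[u] = number of users whom u follows (attention paid)."""
    edges = set(follows)             # the network is unweighted
    k_in = Counter(followed for _, followed in edges)
    k_out = Counter(follower for follower, _ in edges)
    return k_in, k_out
```

Duplicate pairs are collapsed because the network is unweighted; users that never appear on one side of an edge simply get degree zero on that side.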
Both in and out degrees follow power law distributions, as shown in Fig. 5.3. In terms of
the in degree, the distribution indicates that over 50% of the users are followed by fewer
than 15 users, while just 1% of the users have over 1,000 followers and around 0.01% of the
users have over 20,000 followers. For the out degree distribution, we found that over 50% of
the users follow fewer than 40 users, while 1% of the participants follow over 600 users and
0.01% follow over 9,500 users. This distribution presents an exponent within the expected
range for human actions [New05].
As can be seen in Table 5.1, the mean distance between nodes in this network is $d_F = 2.2$.
This value indicates the presence of the small world effect [WS98]. Previous studies
performed on the Twitter global follower graph state that the mean distance between users
is 4.12 [KLPM10]. This fact is related to the presence of users that act like hubs,
concentrating a large quantity of incoming and outgoing links. Our result is lower than the
previously reported value, due to the special characteristics of the event and
its participants. For example, the protest convener, which is a TV station, is followed by
over 52% of the participants, linking half of the total population.
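For reference, the mean distance between nodes can be computed with a breadth-first search from every node, as in the minimal sketch below. Whether the value in Table 5.1 was obtained on the directed or the undirected version of the graph is not stated here, so the `directed` flag is an assumption of this example; unreachable pairs are ignored.

```python
from collections import deque

def mean_distance(nodes, edges, directed=False):
    """Average shortest-path length over all reachable ordered pairs."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        if not directed:
            adjacency[v].add(u)
    total = pairs = 0
    for source in nodes:
        distance = {source: 0}
        queue = deque([source])
        while queue:                 # breadth-first search from source
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in distance:
                    distance[w] = distance[u] + 1
                    queue.append(w)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else float("nan")
```

A star graph illustrates the effect of a hub such as the protest convener: with one hub and four leaves the mean distance is 1.6, since every pair of leaves is only two steps apart. For networks of the size studied here, running the search from a random sample of sources is a common shortcut.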
Based on the degree correlation shown in Fig. 5.4, we found that user profiles are highly
heterogeneous and that the network is very asymmetrical. It is remarkable that there are
some users, corresponding to the scattered points located below the dotted diagonal, that are
widely followed but do not follow many people. At the same time we found other users who
are more reciprocal and stay near the dotted diagonal, especially for $k_{in} > 1,000$
followers, where practically no users are found above the diagonal. Finally, there are some
users, corresponding to the region densely populated above the diagonal, who follow more
people than follow them. These users represent the majority of the participants.
Network     Nodes    Edges       Mean distance   Density       Degree assortativity
Followers   77,706   5,761,331   2.22            1.42 × 10⁻³   -0.10
Retweets    54,423   231,485     3.40            1.25 × 10⁻⁴   -0.15
Table 5.1: Followers and retweet network properties from the Venezuelan protest #SOSInternetVE.
5.4 Retweets Network
The second network is built according to who retransmits whose messages. It is a network
that emerges from the users' interactions. The nodes are users that retransmitted messages
to their own followers, as well as users whose messages were retransmitted. This network
indicates the effective links through which the information actually flows inside the active
social substratum. In principle it might seem to be a subgraph of the followers network,
but it is not, since on Twitter people are able to retransmit any message, regardless of
whether they have any type of relation with the source user. The resulting network is also a
directed graph, where edges are weighted according to the number of times a user retweeted
the source user. In total, by December 19th, 2010, the graph is composed of 54,423 nodes
and 231,485 links. The difference with respect to the number of nodes in the followers graph
shows that 30% of the users behave much more passively than the others. Furthermore, we
found that 75% of the participants were not retweeted at all.
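A minimal sketch of how such a weighted network and its in and out strengths could be built, assuming the retweets are available as a hypothetical list of (retweeter, source) events, one per retweet:

```python
from collections import Counter

def retweet_network(retweets):
    """Return (weights, s_in, s_out) for a list of (retweeter, source) events.
    weights[(retweeter, source)] = number of times retweeter retweeted source.
    s_in[u]  = number of times u has been retweeted.
    s_out[u] = number of times u has retransmitted."""
    weights = Counter(retweets)
    s_in, s_out = Counter(), Counter()
    for (retweeter, source), weight in weights.items():
        s_out[retweeter] += weight
        s_in[source] += weight
    return weights, s_in, s_out
```

The 30% of passive users quoted above follows from the node counts of the two networks: (77,706 - 54,423) / 77,706 ≈ 0.30.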
In the retransmission network, we have analyzed the strength of each user. The
in strength represents the number of times a user has been retweeted. Its distribution
follows a power law, as shown at the top of Fig. 5.5. Such a distribution indicates the
presence of highly connected hubs, which explains why the mean distance between nodes is
$d_R = 3.4$, which is also a very low value. On the other hand, the out strength shows the
number of times a single user has retransmitted. Its distribution is better fitted by an
exponentially truncated power law, as shown at the bottom of Fig. 5.5. The truncation value,
near 500, is related to the limitation of human actions stated by the Dunbar number
theory [GPV11]. This theory states that people are only able to maintain stable relationships
with fewer than 200 people. The reason why we found a higher value is that a retweet does
not strictly imply a mutual relation between people. In fact, it is an
individual choice that has a very low cost in money, time and personal energy, which makes
it easy to happen.
Figure 5.5: In (top) and out (bottom) strength complementary cumulative distributions of the
retransmission network of the Venezuelan protest #SOSInternetVE. Solid line is the fit to an
exponentially truncated power law $P(S_{out} > S^*_{out}) \propto S_{out}^{-\beta} e^{-S_{out}/c}$,
where $\beta = 0.890 \pm 0.002$ and $c = 61.0 \pm 1.2$.
The difference between the in and out strength distributions is related to the way that
we have designed the network. While the out strength is due to one person's activity, the in
strength distribution is due to the aggregation of several individual efforts. Such aggregation
is responsible for the emergence of extreme cases and a higher complexity level in the final
distribution. From the in strength distribution, shown at the top of Fig. 5.5, it can be
noticed that over 60% of the users that participated in the retransmission process gained
fewer than 3 retransmissions, while 1% gained more than 150 retransmissions, and only 0.01%
gained over 5,000 retransmissions. Analogously, for the out strength distribution, we found
that over 60% of the users who retransmitted messages retransmitted fewer than 3 messages,
while 1% of them retweeted over 60 messages, and less than 0.01% retransmitted more than
300 messages.
We also calculated the edge weight distribution and found that it follows a power law,
as shown in Fig. 5.6. The edge weight represents the number of times that a single user
retweeted another user. The figure shows that only 10% of the edges present a weight higher
than 2. However, we found that nearly 0.001% of the edges have a weight higher than 80.
This indicates that the majority of users retweeted any other individual user only a couple
of times, yet a small fraction of them maintained a closer tie with other users, in the sense
that they retweeted their messages close to 100 times. On the other hand, the retransmission
network also presents the same asymmetries found in the followers network. For example,
the 10 most retransmitted users received more than 20% of all retransmissions, while writing
less than 0.4% of all messages.
Figure 5.6: Edge weight complementary cumulative distribution of the retransmission network
from the Venezuelan protest #SOSInternetVE.
It is remarkable that the retransmission network is much less dense than the followers
network, as stated in Table 5.1. This indicates that inside the web of contacts there is a
finer structure where the information actually travels. The reason for this result is that
retransmitting implies an active behavior, in contrast to the passivity of the following
relation. This shows how users are more selective when it comes to taking action.
64
Figure 5.7: Visualization of the retweet network that emerges from message propagation on the
followers network. (A) Subgraph of the retweet network (green) superimposed on the corresponding
followers network (black), from the #SOSInternetVE dataset. A subset of 1000 random
nodes (yellow and red) is shown. The node size is proportional to the respective in-degree
in the followers network. (B, C and D) Example of the formation of the retweet network from
independent retweet cascades on an artificial followers network. (B) Two users (red
nodes) post independent messages, which are received by their followers (gray). (C)
Some users retweet the message (yellow), which then reaches their followers (gray). (D)
The final shape of the cascades on the network, composed only of the activated nodes (red
and yellow) connected by the green links. The white nodes and gray links represent the rest of the
substratum (followers network), which was not activated. (E) Schema of a single cascade.
The black circles delimit the cascade layers.
5.5 Degree Assortativity
In order to unveil how such heterogeneous users interacted with each other, we calculated
the degree assortativity coefficient [New03a, FFGP10] for the followers network (rF) and
the retweet network (rR). Both turned out to be disassortative: rF = −0.10 and
rR = −0.15 (see Table 5.1), which reveals the asymmetric shape of these networks. The
hubs that concentrate many of the incoming links are often targeted by regular users, who
do not receive much of the collective attention. Although social networks have been reported
to be assortative [New03a], this pattern changes in the online world, where disassortativity
is usually found [HW09]. The reason is that in the online world regular
people are able to relate to and communicate with popular accounts, either by following them
or by retweeting their messages in the case of Twitter.
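To make the meaning of a negative coefficient concrete, the sketch below computes the degree assortativity of a small undirected graph as the Pearson correlation between the degrees found at both ends of each edge. It is only an illustration: the networks analyzed here are directed, and the exact variant used for rF and rR is not reproduced. The function name and the toy star graph are assumptions. The function is undefined for graphs where all degrees are equal, since the variance vanishes.

```python
from math import sqrt

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each
    edge of an undirected graph (illustrative, simplified variant)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count every edge in both directions so that the measure is symmetric
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy example: one hub connected to four low-degree nodes
star = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
r_star = degree_assortativity(star)
print(r_star)
```

A star, where a single hub is linked only to low-degree nodes, is the extreme disassortative case; the values measured here (−0.10 and −0.15) indicate a much weaker tendency in the same direction.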
5.6 Retweet Cascades
The retweet network can also be seen as the aggregation of independent retweet cascades,
each of which occurs when a single message is retransmitted by a user to its followers,
allowing them, and in turn their own followers, to do the same. An example of the resulting structure
is shown in Fig. 5.7 A, where a subset of the retweet network (green edges) has been plotted,
superimposed on the respective subgraph of the followers network (gray edges). The red nodes
represent those who posted an original message and the yellow nodes represent the message
propagators (those who retweeted). It can be noticed that the retweet network represents a
subset of the followers graph where messages are actually being propagated. This graph
shows that people are more selective about actively interacting with their declared contacts
than about just receiving updates from them [HRW09].
In order to explain the dynamical process behind these cascades, a scheme of the evolution
of two cascades on an artificial followers network is sketched in panels B to D of
Fig. 5.7. In panel B two independent messages are posted by the red nodes
and received by their followers (gray nodes). Some of these followers retransmitted the messages
(yellow nodes) through the green edges, and others did not (white nodes), as shown in
panel C. Accordingly, in panel D some of the followers of followers retransmitted the message
(also yellow nodes), and the final shape of the cascades can be appreciated. To summarize it
schematically, a single retweet cascade from the dataset is presented in Fig. 5.7 E. The white
nodes do not belong to the cascade, as we only consider those who actively participated in
the retransmission process. Using this schema, some of the main cascade properties are
explained in the remainder of this section, such as the number of retransmissions gained by user,
as well as the cascade size, depth and rate of retransmission.
The first property we analyzed is the number of retweets gained by user, Ri, which
may also be considered as the in-strength of node i in the retweet network. This quantity
may increase either from cascades originally seeded by i or from cascades where i acted
as a propagator. For example, for the cascade shown in Fig. 5.7 E, Ri would take the
following values: R0 = 15, which is the total number of users who retweeted the message
originally posted by node 0, either directly (nodes 1 to 11) or indirectly (nodes 12
to 15). Accordingly, R8 = 2, since node 8 has been retweeted by nodes 14 and 15;
R1 = R4 = 1, since nodes 1 and 4 have been retweeted by nodes 12 and 13 respectively; and
finally R2 = R3 = R5 = R6 = R7 = R9 = R10 = R11 = 0, as no one retweeted them.
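The values of this example can be reproduced with a short sketch. It encodes the cascade of Fig. 5.7 E as a map from each propagator to the node it retweeted from, and credits every node with all the users downstream of it. This reading of Ri as "downstream users" is an assumption that matches the example (depth 2); the function name retweets_gained is hypothetical.

```python
# Cascade of Fig. 5.7 E as child -> parent (the node each user retweeted from)
parent = {n: 0 for n in range(1, 12)}          # nodes 1..11 retweeted the author (node 0)
parent.update({12: 1, 13: 4, 14: 8, 15: 8})    # second-layer retweets

def retweets_gained(parent):
    """R_i = number of cascade members downstream of node i (assumed reading)."""
    nodes = set(parent) | set(parent.values())
    gained = {n: 0 for n in nodes}
    for node in parent:
        # Walk up to the author, crediting every ancestor on the way
        ancestor = parent[node]
        while True:
            gained[ancestor] += 1
            if ancestor not in parent:   # reached the author
                break
            ancestor = parent[ancestor]
    return gained

R = retweets_gained(parent)
print(R[0], R[8], R[1], R[4])
```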
Another property analyzed is the cascade size, which is defined as the total number of
nodes that have been activated in the context of a given cascade. In the example shown
in Fig. 5.7 E the resulting cascade size would be 16, as we have 1 author (node 0) plus
15 propagators (nodes 1 to 15). In the studied conversation, this property is distributed
following a power law, as presented in Fig. 5.8 A. This indicates that most of the
cascades are extremely small, as more than half of them (60%) are composed of at most
2 persons besides the author, and just a small fraction are large, since around 5% of them
have more than 10 users, and 0.03% have more than 100 participants.
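For reference, an empirical complementary cumulative distribution like the one used for the cascade sizes can be computed as in the sketch below. The sample sizes are hypothetical, and the convention P(X ≥ x) is an assumption; the figures may use P(X > x) instead.

```python
def ccdf(values):
    """Return the sorted unique values x and the fraction of samples >= x."""
    ordered = sorted(values)
    n = len(ordered)
    xs, probs = [], []
    for i, x in enumerate(ordered):
        if not xs or x != xs[-1]:
            xs.append(x)
            probs.append((n - i) / n)   # samples at positions i..n-1 are >= x
    return xs, probs

# Hypothetical cascade sizes (author included), not taken from the dataset
sizes = [1, 1, 1, 1, 2, 2, 3, 16]
xs, probs = ccdf(sizes)
print(list(zip(xs, probs)))
```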
In order to understand the structure of the cascades, we have divided them into layers, as shown
by the black circles in Fig. 5.7 E. The cascade layer indicates the number of hops from
a propagator node to the source node through the cascade links. The users in
layer l = n are those who retransmitted the message coming from a user of the
previous layer l = n − 1. In Fig. 5.7 E, the message author (red node) stands alone in
layer l = 0, while in the subsequent layers we find the nodes who retweeted the message,
namely nodes 1 to 11 in layer l = 1 and nodes 12 to 15 in layer l = 2.
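The layer of each node can be obtained with a breadth-first search from the message author along the cascade links. The following sketch does so for the cascade of Fig. 5.7 E; the data structure and the function name cascade_layers are illustrative assumptions.

```python
from collections import deque

# Cascade of Fig. 5.7 E as parent -> children (who retweeted from whom)
children = {0: list(range(1, 12)), 1: [12], 4: [13], 8: [14, 15]}

def cascade_layers(children, source):
    """Breadth-first search from the source; layer = hops along cascade links."""
    layer = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layer:
                layer[child] = layer[node] + 1
                queue.append(child)
    return layer

layer = cascade_layers(children, source=0)
size = len(layer)                     # activated nodes, author included
deepest_layer = max(layer.values())   # farthest layer with an activated node
print(size, deepest_layer)
```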
The cascade depth d corresponds to the farthest layer from the message source in which
a node has been activated. In the example shown in Fig. 5.7 E, it takes the value
d = 2. In the analyzed conversation, the probability of a cascade having a certain depth,
P(d), is presented in Fig. 5.8 B. Cascades of depth d = 0 represent original messages
that were not retweeted by anyone, and account for close to 80% of the total.
Only 17% of the cascades have exactly one layer of retransmission (d = 1), and this fraction
decreases exponentially as we move farther from the message's source, reaching a maximum
depth of d = 6 layers with a very low likelihood (∼ 10^−5). This indicates that the retweet
cascades found in this conversation are quite shallow, which might seem counterintuitive, as
[Figure 5.8, logarithmic panels: (A) CCDF versus users per cascade; (B) P(d) versus depth d; (C) λl versus layer l.]
Figure 5.8: Retweets cascades statistical properties. (A) Complementary cumulative density
function of the number of users per cascade, (B) Cascade depth distribution P (d) and (C) Re-
transmission rate by layer λl in terms of retweets over followers. The data correspond to the
#SOSInternetVE dataset.
Topic            rF,A    rF,R    rR,A
SOSInternetVE    0.07    0.57    0.17
Table 5.2: Pearson correlation (r) by user of the number of followers (F), retweets (R) and activity
(A).
we would expect retransmissions to increase with the message's visibility, which should
increase with each retransmission. However, shallow cascades have been detected on Twitter
in works on influence dynamics [BHMW11] and on the prediction of URL propagation [GAC+10], as well as
in different media, like the flow of emails inside a corporation [WWT+11]. It has
been shown that information tends to lose its capacity to attract attention as we move
farther from the author's social surroundings, and hence the probability that a cascade grows
decreases with the distance from the source node [WHAT04].
Finally, the rate of retransmission at each layer, λl, is estimated by averaging the ratio
between the number of users who retransmitted a message and the number of individuals
who received it at each layer, taking into account the followers network information.
The results, shown in Fig. 5.8 C, indicate that λl ∼ 0.01 for l > 1, while in the first
layer the average retransmission ratio reached up to 5% (λ1 ∼ 0.05) of the exposed users.
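One possible way to compute such a rate is sketched below. It assumes that the users exposed at layer l are the followers of the users in layer l − 1, that the ratio is averaged over cascades, and that a layer with exposed users but no retransmission contributes a ratio of zero. These choices, the toy data and the function name are assumptions made for illustration, since the estimation details are not fully specified above.

```python
def retransmission_rate(cascades, followers):
    """Average, per layer l, of (users in layer l) / (followers of layer l-1 users)."""
    ratios = {}   # layer -> list of per-cascade ratios
    for layers in cascades:              # layers[l] = set of users in layer l
        for l in range(1, len(layers) + 1):
            exposed = set()
            for user in layers[l - 1]:
                exposed |= followers.get(user, set())
            if not exposed:
                continue                 # nobody received the message at this layer
            retweeters = layers[l] if l < len(layers) else set()
            ratios.setdefault(l, []).append(len(retweeters) / len(exposed))
    return {l: sum(r) / len(r) for l, r in ratios.items()}

# Hypothetical followers network: user -> set of followers
followers = {"a": {"b", "c", "d", "e"}, "b": {"f", "g"}, "x": {"y", "z"}, "y": {"w"}}
# Two hypothetical cascades, each given as a list of layers
cascades = [[{"a"}, {"b"}, {"f"}], [{"x"}, {"y"}]]
rates = retransmission_rate(cascades, followers)
print(rates)
```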
5.7 Analysis of User Behavior
In this section, we discuss the way users behaved in the #SOSInternetVE dataset, based
on their activity and role in the followers and retweet networks. In Appendix A, we
present similar results obtained from analyzing the #20N and ETA datasets (see section
4.2.2).
Table 5.2 presents the Pearson correlation coefficients between each user's number of followers F (measured
as kin in the followers network), retweets gained R and activity A.
It can be noticed that there is practically no correlation between the number of followers and the activity
(rF,A = 0.07), which means that the number of messages posted is independent of
the user's position in the followers network. However, there is a strong correlation between the
number of followers and the retransmissions gained (rF,R = 0.57), which means that the most
retransmitted users tend to be the most followed ones as well. Besides, there is a weak positive
correlation between the number of retransmissions and the activity (rR,A = 0.17),
which indicates that the chances of being retransmitted increase with every message posted
Figure 5.9: Analysis of the user behavior. (A) Scatter plot of retransmissions obtained by user
versus its activity and colored by its number of followers. (B) Scatter plot of retransmissions
obtained by user versus its number of followers and colored by its activity. (C) Scatter plot of
retransmissions obtained by user versus the ratio between the number of followers and followees,
and colored by its activity. (D) Scatter plot of retransmissions made by user versus its number of
followers and colored by its activity. Dots represent users. Data correspond to the #SOSInternetVE
dataset.
for all users.
In Fig. 5.9 A, we present a scatter plot of the retweets gained by user as a function of
its activity, colored by the user's Kin in the followers network. It is important to clarify
that the users that appear in this plot were retransmitted at least once. These users
represent 25% of the participants, as stated in section 5.4. It can be clearly noticed that the
most retransmitted users are also the most followed ones (red dots), independently of their
activity. In fact, if a popular account increases its activity, its retransmission level grows
nonlinearly, as for the most retransmitted user, who gained more than 10,000 retransmissions.
However, some less followed users (green or yellow dots) may also gain a significant number
of retransmissions, but only by means of a considerable increase in their own activity. These
users are located around the straight line of slope 1, and the retransmissions they gain are
proportional to their activity. Finally, some less followed users (blue dots in Fig. 5.9
below the dashed line), who are the vast majority of the population, needed to post an enormous
number of messages to gain, if any, a few retransmissions at most.
In Fig. 5.9 B, we present a scatter plot of the retweets gained by user as a function of
its number of followers and colored by its activity. It can be noticed that the most active
users (red dots) do not have the largest number of followers. However, these active users
gain as many retweets as the popular users (blue dots next to red dots), who have the largest
number of followers but send far fewer messages.
In Fig. 5.9 C, we present the in-strength, Sin, of the retransmission network as a function
of the ratio between the in- and out-degrees, Kin/Kout, of the followers network. The users
are represented by points colored by their activity, that is, the number of messages posted. This
representation lets us separate the popular accounts, where Kin/Kout > 1, from the
non-popular accounts, where Kin/Kout < 1, and the reciprocal users, where Kin/Kout ∼ 1. It can
be noticed that the popular accounts may get a high number of retransmissions while having
low activity. Meanwhile, the reciprocal users may get the same number of retransmissions
as the popular accounts, but they must employ much more activity.
In Fig. 5.9 D, we present a scatter plot of the retweets made by user as a function of its
number of followers and colored by its activity. It can be noticed that the most active users
(red dots) are those who retweet the most and do not have the largest number of followers.
Again, we see that those with the largest number of followers are less active and make a few
retweets at most.
These results let us classify users into three categories: information producers, active
consumers and passive consumers. The information producers are the widely followed users
who gain an enormous number of retransmissions even though they have low activity. These
Figure 5.10: Community structure for the follower graph. Circles represent communities of
users and their size is proportional to the amount of users that belong to the community. Edges
represent the inter-community links, either followers (Left) or retransmissions (Right), and their
width is proportional to the amount of edges, normalized by the size of the outgoing community.
The data correspond to the #SOSInternetVE dataset.
users do not tend to follow many people, nor retransmit many messages. We found that
these accounts belong to traditional mass media agents like TV, journalists, politicians and
celebrities. On the other hand, active consumers are users with high reciprocity in their relations.
Their audience and retransmission rate tend to grow in proportion to the activity they
employ. They are key in the information diffusion process, because they boost the content
and serve as the propagators of the information producers. Finally, passive consumers are
the largest group of users, who practically do not participate in the propagation process.
They consume more information than they produce. They are characterized by having
a low activity rate, not retransmitting many messages and receiving messages from many more
people than they have in their audiences.
5.8 Mesoscale Communities
In order to gain more insight into the structure and behavior of the Twitter users during
the protest, we have calculated the mesoscale structure of both networks. In this section
we describe the communities detected in our networks using the algorithm described
in [BGLL08]. We chose this algorithm, which is based on modularity optimization, due to its
Figure 5.11: Community structure for the retransmission graph. Nodes represent communities
and edges represent the inter-community links. The node size is proportional to the number
of people that compose the community and the edge width is proportional to the number
of inter-community links normalized by the size of the community. The data correspond to the
#SOSInternetVE dataset.
Community Collective
0 Comedy accounts
1 Show business celebrities
2 Opposition media
3 Opposition politicians
4 International media
5 Government favorable politicians and media
Table 5.3: Main collectives around which each follower community is formed from the Venezuelan
protest #SOSInternetVE.
capacity to reveal mesoscale structure in large graphs with good computing performance.
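The quantity that this kind of algorithm optimizes is the modularity Q of a partition. As a minimal sketch, the function below only evaluates Q for a given partition of a small undirected, unweighted graph; it does not implement the optimization procedure of [BGLL08], and the toy graph and names are assumptions made for illustration.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph:
    Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ],
    with L_c the internal edges, d_c the total degree of c and m the edge count."""
    m = len(edges)
    internal = {}   # community -> number of edges with both ends inside
    degree = {}     # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {n: "A" for n in range(6)}
print(modularity(edges, split), modularity(edges, merged))
```

Splitting the two triangles into separate communities yields a higher Q than keeping all nodes together, which is the kind of gain that a modularity-based method searches for.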
On the follower graph, we found six main communities that grouped over 98% of the
population. We identified the most followed users in each community in order to understand
why people grouped together. We found that these structures are formed
by users around central accounts that belong to similar collectives. Specifically, we found
communities around opposition media and journalists, opposition politicians, entertainment
celebrities, international media, comedy accounts and government favorable politicians, as
described in Table 5.3.
This structure is shown on the left side of Fig. 5.10. Each node represents a different
community and its size is proportional to the number of users that compose it. We
found that the largest communities are formed around the comedy accounts (0), celebrities
(1), opposition media and journalists (2) and opposition politicians (3). Meanwhile, the
smallest ones are formed around international media (4) and government favorable users
(5). The edges shown in Fig. 5.10 represent the inter-community links, and indicate the
existence of users who follow or are followed by users from other communities. The edge
width is proportional to the number of individual inter-community links, normalized by the
size of the outgoing community. As pointed out in section 5.1, messages go from
the source to its followers. Thus the information flows in the opposite direction to the edges.
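The normalized edge widths described above can be computed as in the following sketch, which counts the directed links between communities and divides each count by the size of the community where the links originate. Interpreting the "outgoing community" as the community of the edge's source node is an assumption, as are the toy data and the function name.

```python
from collections import Counter

def inter_community_weights(edges, community):
    """Number of links between each ordered pair of communities,
    normalized by the size of the community where the links originate."""
    size = Counter(community.values())
    links = Counter()
    for u, v in edges:                       # directed edge u -> v
        cu, cv = community[u], community[v]
        if cu != cv:                         # keep only inter-community links
            links[(cu, cv)] += 1
    return {pair: count / size[pair[0]] for pair, count in links.items()}

# Hypothetical users, communities and directed links
community = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "a"), ("a", "b")]
weights = inter_community_weights(edges, community)
print(weights)
```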
It can be noticed that there is a tight relation between the communities formed around the
opposition media, opposition politicians, celebrities and comedy accounts. These collectives
seem to have dominated the protest, especially the opposition media community (group 2 in
Fig. 5.10), which concentrates the largest number of users and incoming links. Therefore its
messages are widely received throughout the network. Group 3 is certainly smaller than
other groups, which is remarkable because it concentrates many of the opposition
politicians and the event consisted of an opposition political protest. However, this group
is strongly related to other communities and, even though it presents a large number of
outgoing links, its messages are also widely spread. In contrast, group 5, which represents
the government favorable accounts, seems to follow many outside users, yet only a small
fraction of the participants seem to follow them. This means that most of their messages
remain inside their community and are hardly read by the rest of the network.
Nevertheless, we found the same user behavior in all communities, in the sense that all
of them are formed around popular accounts that belong to traditional mass media agents,
no matter whether they are opposition or government favorable, or even non-Venezuelan users like
group 4. On the right side of Fig. 5.10 we also show how the followers communities retransmit
messages from other communities. This behavior has been pointed out by Grabowicz et al.
[GRM+12], who demonstrated that retweets transcend the friends communities and serve as
bridges for messages to spread throughout the network.
On the other hand, we also calculated the community structure of the retransmission
graph. In this case we found 34 main communities containing more than 96% of the population.
This network showed a completely different mesoscale structure, as can be observed in
Fig. 5.11. However, a user behavior similar to that of the follower network has been detected. Each
of these communities contains at its core at least one popular account, like the information
producers described above, which is highly followed and retransmitted. In Table 5.4 we
present some information about these popular accounts, like their nicknames and professions.
Once again we found communities formed around traditional mass media agents, such as
TV stations, newspapers, journalists and politicians, as well as humorists, civilian activists,
student leaders, community managers and micro-bloggers. This result indicates that people
behave selectively when retransmitting messages in comparison to just receiving them.
Once again the nodes represent the communities detected and their sizes are proportional
to the number of users composing each of them. The edges represent the inter-community
links, and their width is proportional to the number of inter-community links found, normalized
by the size of the outgoing community. These links exist because some users
retransmitted messages from another community. It can be noticed that communities are also
asymmetrical with respect to inter-community retransmissions, and also present different
profiles. For example, community number four, which is formed around the Venezuelan
micro-blogger @cualrevolucion and the Cuban blogger @yoanisanchez, is highly retransmitted
by all other communities, while it hardly retransmitted other communities.
Community Popular account Collective
0 @nelsonbocaranda Journalist
1 @rctv contigo TV Station
2 @elnacionalweb Newspaper
3 @indiferencia Community manager
4 @cualrevolucion, @yoanisanchez micro-bloggers
5 @ucabistas Student leaders
6 @erikadlv Journalist
7 @vvperiodistas TV Station
8 @kikobautista Journalist
9 @edoilustrado Political comics
10 @globovision TV Station
11 @palabrasdecersar Humorist micro-blogger
12 @rmh1947 Government favorable activist
13 @leopoldolopez Politician
14 @carlossicilia Humorist
15 @alberto ravell Journalist
16 @gabycastellanos Community manager
17 @ecualink Ecuadorian magazine
18 @leonardo padron Writer
19 @EUTrafico Newspaper
20 @2010misterchip Sports journalist
Table 5.4: Most retransmitted account at each retransmission community from the Venezuelan
protest #SOSInternetVE.
5.9 Summary
In summary, we studied the Venezuelan protest #SOSInternetVE, which took place exclusively
on Twitter. We have analyzed the structure and behavior of the participant users
based on their information exchange interactions. For this we have constructed the followers
network, to represent the social substratum where information may flow, and the retweet
graph, where messages actually travel. Most of the degree distributions of both networks
follow power laws and the mean distances between nodes turned out to be very small. Then,
based on the network structure, we identified three types of user behavior that determine
the dynamics of the information flow: information producers, active consumers and passive
consumers. We found some users who cause a lot of activity inside the network while posting
few messages, while others must post many messages in order to get retransmitted.
We also found a large fraction of very passive users who neither retransmit nor
get retransmitted at all. We also carried out a community analysis to describe the mesoscale
structure of the networks. We found that people are organized around different collectives.
The most central users of each of these collectives are very popular and usually
they also generate smaller retransmission communities that emerge from the propagation dynamics.
This shows that people are more selective when it comes to taking an active part in the
conversation. We noticed that although online social media seem to be a purely social
phenomenon, traditional media agents still enjoy a lot of power and influence over people,
whom they use to boost and enhance their messages.
Chapter 6
EFFICIENCY OF HUMAN
ACTIVITY AS A MEASURE OF
INFLUENCE
In this chapter we address the following question: what can Twitter users do to increase
their influence? We explore two avenues for this: topology and activity. We introduce a
new index to measure the influence of users on Twitter, called user efficiency [MBLB14]. It
is based on the ratio between the emergent spreading process and the activity employed by
the user, quantified as the amount of retransmissions gained per user by message posted.
We study this property by means of a quantitative analysis of the structural and dynamical
patterns emergent from human interactions during six conversations on Twitter. We found
a universal behavior in the relation between the individual efforts, managed by the user,
and the collective reaction to such efforts, which is an emergent property of the underlying
network. In general, this universality indicates that influence can be increased by means of
the activity, but in a very expensive and inefficient way. We propose a model to explain
the user efficiency based on biased independent cascades on networks. We study this model
to understand the effects of different factors, like the topology of the underlying network
and user activity distribution, on the resulting distributions of efficiency. We found that
the emergence of a select group of highly efficient users depends on the heterogeneity of the
underlying network, rather than on the individual behavior.
The present chapter is organized as follows. First we introduce our measure of user
efficiency in section 6.1. Then in section 6.2 we show the universal behavior of such measure
across different datasets. Next we introduce a computational model to explain the obtained
distributions in section 6.3. Finally, we apply the model to the datasets and explore the
Figure 6.1: Scatter plot of the user in degree vs out degree in the followers network, colored by the
respective user efficiency. Dots represent users. Data correspond to the #SOSInternetVE dataset.
effects of the activity and underlying graph properties in section 6.4.
6.1 User Efficiency
The fact that not all participants must employ the same amount of effort to accomplish
the same level of retransmissions implies that users have an individual efficiency in getting
their messages spread by others. We define user efficiency, η, as the ratio of the
collective response to the individual effort [MBLB14]. It is a metric of influence in the
network, quantified as the number of retransmissions gained by a user per message
posted, defined according to the following expression:
$$\eta_i = \frac{R_i}{A_i} \qquad (6.1)$$
where Ri is the number of retweets gained by user i, and Ai is the amount of messages
posted or retweeted by the user i. The users whose η > 1 get more retweets than the
[Figure 6.2 panels: (A) PDF versus user efficiency; (B) CCDF versus user efficiency; (C) empirical quantiles versus lognormal quantiles, with series KFin < 10, < 100, < 1000, < 10000 and < 100000.]
Figure 6.2: User efficiency probability density function (A) and complementary cumulative density
function (B). The red dots correspond to the empirical results, the black solid line represents the
lognormal fit and the black dashed line represents a power law fit. Quantile-Quantile plot (C) of the
user efficiency distribution, filtered by the in degree in the followers network KFin. The distributions
correspond to the #SOSInternetVE dataset.
number of messages posted and therefore are more efficient at spreading their information in
the network. Consequently, these users gain more influence in comparison to those with
η < 1, who had to employ larger efforts to obtain similar outcomes.
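Equation 6.1 translates directly into code. The sketch below computes η for a few hypothetical users; the user names, their counts and the function name are assumptions, and users without any activity are skipped to avoid dividing by zero.

```python
def user_efficiency(retweets_gained, activity):
    """eta_i = R_i / A_i for every user who posted at least one message."""
    return {user: retweets_gained.get(user, 0) / posted
            for user, posted in activity.items() if posted > 0}

# Hypothetical users: A_i = messages posted, R_i = retweets gained
activity = {"media": 4, "active": 50, "passive": 10}
retweets = {"media": 400, "active": 50, "passive": 1}
eta = user_efficiency(retweets, activity)
print(eta)
```

In this toy case the first user gains 100 retweets per message (η >> 1), the second gains as many retweets as messages posted (η = 1) and the third is barely retransmitted (η < 1).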
In Fig. 6.1, we present a scatter plot of the users' degrees in the followers network, kin and
kout, colored by their efficiency η, from the #SOSInternetVE dataset. It may be noticed
that the users who present an efficiency η > 1 (green, yellow, orange and red dots) are mostly
located below the dashed line of slope one, which means that their audiences (kin) are larger
than their sources of information (kout), which implies a certain level of popularity in the
network. This holds especially for those with η >> 1 (orange and red dots), who may be followed by more
than 10^4 users while following fewer than 10 users. Meanwhile, the users who present a
low efficiency (blue dots) tend to receive messages from many more sources than the size of
their audiences (kout > kin), and also have a smaller number of followers. This means that
these users hear more information from the network than the network hears from them.
However, the mean efficiency value seems to be close to 1 (Ri ∼ Ai), as shown in the user
efficiency η distribution presented in Fig. 6.2 A, which means that on average most of the
users who got retweeted gained as many retransmissions as the number of messages they posted.
Besides, the users with η >> 1 represent a minority of the population, as clearly
shown in the complementary cumulative distribution of η in Fig. 6.2 B. It can be noticed that
less than 2% of the retweeted population gained more than 10 retransmissions per message
sent (dashed line in Fig. 6.2 B), 0.2% gained over 100 retransmissions per message sent
(dotted line in Fig. 6.2 B) and just one user gained over 1000 retransmissions with a single
post.
In order to further understand the η distribution, we have superimposed in Fig. 6.2 A-B
the corresponding lognormal curve, with the mean and variance taken from the empirical
observations (see Table 6.1). It is known that lognormal distributions arise from multiplicative
growth processes, like branching processes, as may be explained by the central
limit theorem on the logarithmic scale [Mit04]. Examples of these processes are found in
viral marketing campaigns [IE11a, IE11b], where the number of leaves grows multiplicatively
as the branches split, like the cascades shown in chapter 5. It can be noticed that the initial
part of the distribution fits the lognormal curve quite well, but right after its maximum the
distribution changes its scaling behavior, apparently to a power law, which we have also
superimposed in Fig. 6.2 A with a dashed line. This means that there is a higher concentration
of users who gain a large number of retransmissions per message posted than
expected for a lognormal distribution. These highly efficient users correspond to the hubs
of the followers network, as can be appreciated in Fig. 6.2 C, where we show the
quantile-quantile plot of the η distribution against the lognormal distribution,
filtered by the number of followers. If η followed a lognormal distribution, all the points
would lie on a straight line, which is actually the case for the users with fewer than
1000 followers. However, as we include the most followed users, the curve begins to change its
behavior, suggesting that the underlying network topology is responsible for this deviation.
This point is further analyzed in section 6.4.
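The multiplicative argument mentioned above can be made explicit with a short generic derivation. It is a sketch under stated assumptions rather than a result of this work: the independent positive factors x_k (with finite variance of ln x_k), their number n, and the reading of µ and σ as the mean and standard deviation of ln η are introduced here only for illustration.

```latex
% Assumption: the efficiency results from n independent positive factors x_k
\eta = \prod_{k=1}^{n} x_k
\quad\Longrightarrow\quad
\ln \eta = \sum_{k=1}^{n} \ln x_k .

% By the central limit theorem, for large n the sum is approximately Gaussian,
% \ln \eta \sim \mathcal{N}(\mu, \sigma^2), so \eta is approximately lognormal:
f(\eta) = \frac{1}{\eta\,\sigma\sqrt{2\pi}}
          \exp\!\left( -\frac{(\ln \eta - \mu)^2}{2\sigma^2} \right),
\qquad \eta > 0 .
```

Under this reading, plotting the quantiles of ln η against Gaussian quantiles gives a straight line whenever η is lognormal, which is the test applied in Fig. 6.2 C.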
In summary, we have seen two kinds of users who may gain a significant number of
retransmissions. The first are the highly connected users in the followers network,
who have no need to follow other people and who, with a high efficiency, gain a much larger
number of retweets than the messages they post. The second are the less well
connected users, who may also gain a lot of retweets, but in a less efficient way, since they
need to post many more messages than the highly efficient ones.
Keyword          Messages     Users        µη       ση
Andreafabra      35,835       23,498       0.15     1.05
Gingrich         93,063       43,061       −0.08    1.13
Leones           142,808      46,608       −0.08    1.09
20N              389,988      123,710      −0.49    1.08
SOSInternetVE    421,602      77,706       −0.79    1.21
Obama            6,818,782    2,265,799    0.14     1.15
Egypt            7,433,542    1,180,715    −0.80    1.33
Table 6.1: Properties of the studied datasets and their resulting user efficiency distribution prop-
erties.
6.2 Universality
In order to identify whether this distribution is restricted to the present case study or
rather is a consequence of a universal feature of the interaction mechanism, we
have calculated the user efficiency (η) for other conversations on Twitter. Specifically, we
performed the analysis over six different datasets, described in chapter 4, whose features
may be found in Table 6.1. All of them belong to different contexts and their sizes span
several orders of magnitude in terms of the number of posted messages and participant users.
In Fig. 6.3 we present the user activity distributions of these datasets, plotted in ascending
order according to their size (from A to F). It can be noticed that they follow a power law
over the first orders of magnitude. However, the curves are truncated beyond a certain point
due to individual constraints, as previously explained in section 5.2. Moreover, in Fig.
6.4 we present the distributions of retweets obtained by user for the same datasets. It can be
noticed that these distributions show a power law behavior over their whole range. As shown
in section 5.4, this happens because the retweets obtained are an emergent property that
results from the aggregation of many individual actions.
The η distributions that emerge from these datasets are presented in Fig. 6.5.
It can be noticed that the lognormal distribution emerges even when the smallest datasets
are considered (Fig. 6.5 A-B). However, as the size of the dataset increases, the effect of
the presence of highly efficient users becomes more evident in the distributions, which present a
shape very similar to the one found for the #SOSInternetVE conversation (Fig. 6.2 A).
Given the fact that the sizes of the datasets range from four to six orders of magnitude
and correspond to topics of different nature, it is remarkable that the resulting distributions
Figure 6.3: Complementary cumulative density function of the user activity, from several Twitter
conversations, increasingly ordered according to the number of messages (A-F): (A) Andreafabra,
(B) Gringich, (C) Leones, (D) 20N, (E) Obama, and (F) Egypt. The black dashed line represents
a power law fit and the red dots correspond to the measured distributions.
Figure 6.4: Complementary cumulative density function of the retweets obtained by user, from
several Twitter conversations, increasingly ordered according to the number of messages (A-F): (A)
Andreafabra, (B) Gringich, (C) Leones, (D) 20N, (E) Obama, and (F) Egypt. The black dashed
line represents a power law fit and the red dots correspond to the measured distributions.
Figure 6.5: Probability density function of the user efficiency on several Twitter conversations,
ordered increasingly according to the number of messages (A-F): (A) Andreafabra, (B) Gringich,
(C) Leones, (D) 20N, (E) Obama, and (F) Egypt. The properties of these conversations may be
found in Table 6.1. The black solid line represents the lognormal fit, the black dashed line represents
a power law fit and the red dots correspond to the measured distributions.
[Figure 6.6 plot. Panels A and C: distribution of the User Efficiency. Panels B and D: CCDF of the Retweets Ri. Legend: Empirical; Model (Followers net).]
Figure 6.6: Model results for the user efficiency distribution (left column) and the distribution of
retweets gained by user (right column), compared with the empirical results. The model has been
applied to the followers network from the #SOSInternetVE dataset (top panel) and the #20N
dataset (bottom panel).
present a very similar shape. This ubiquity of the resulting patterns strongly suggests the
existence of a universal behavior in the relation between the individual effort, which is
controlled by the user, and the collective reaction to that effort, which is an emergent
property of the underlying network. This raises the following question: what factors cause
such a distribution to emerge? In the next section we propose a model to explain the
emergence of the observed distribution.
6.3 Model
In order to model the propagation of retweets that took place in the #SOSInternetVE
conversation, we propose a spreading mechanism based on independent cascades [GLM01]
taking place on the followers network. In this model, activating a node is analogous to
posting a message, and it allows the node's neighbors to activate as well, which is analogous
to retransmitting the received message, following the cascade schema shown in Fig. 5.7.
Each message may trigger an independent cascade regardless of the author's previous
activations. Moreover, nodes may participate in several cascades at the same time.
In the context of a given cascade, when a node i has been activated, it has a single chance
to activate each of its neighbors (followers) j, located l layers away from the message
source. The spreading probability therefore depends on the distance l: the probability that
a node j retransmits a message l layers away from the source is determined by the probability
that the cascade grows vertically to a depth of at least l layers, P(d ≥ l), and by the
probability that it grows inside layer l, given by λl.
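The mechanism described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the simulation code used in this work: the dictionary `followers`, the lists `p_depth` (standing for P(d ≥ l)) and `lam` (standing for λl), and the choice of multiplying both quantities to obtain the activation probability at layer l are assumptions of the example, as is the restriction that a node activates at most once within a single cascade.

```python
import random


def run_cascade(followers, source, p_depth, lam, rng):
    """Simulate one cascade and return {node: layer} for every activated node.

    followers[u] lists the users who follow u (they receive u's messages).
    p_depth[l] stands for P(d >= l) and lam[l] for the rate lambda_l.
    ASSUMPTION: a follower at layer l activates with probability
    p_depth[l] * lam[l]; the text does not spell out how the two combine.
    """
    d_max = len(p_depth) - 1
    layer_of = {source: 0}
    frontier = [source]
    for l in range(1, d_max + 1):
        p = p_depth[l] * lam[l]
        next_frontier = []
        for u in frontier:
            for v in followers.get(u, []):
                # Each active node gets a single chance per follower.
                if v not in layer_of and rng.random() < p:
                    layer_of[v] = l
                    next_frontier.append(v)
        frontier = next_frontier
    return layer_of


# Hypothetical toy network and probabilities, for illustration only.
toy_followers = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
result = run_cascade(toy_followers, "a", [1.0, 0.5, 0.2], [1.0, 0.3, 0.1],
                     random.Random(42))
```

Calling `run_cascade` once per posted message, each time with a fresh call, reproduces the idea that every message triggers its own independent cascade.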
The user activity Ai is the total number of messages posted by i: those posted as a source
in layer l = 0 (Ai,0) plus all the retweets made by i at l steps from the message source
(Ai,l | l > 0), in the following way:
$$A_i = A_{i,0} + \sum_{l=1}^{d_{\max}} A_{i,l} \qquad (6.2)$$
where dmax is the maximum cascade depth allowed. On the one hand, Ai,0 is an independent
random variable with density distribution P(A0), and represents the initial conditions
for the spreading process. On the other hand, Ai,l | l > 0 is not independent: it is a
consequence of the propagation of other nodes' activity. Among other factors, this quantity
depends on the number of messages received by i, which is proportional to the number of
people whom i follows on the underlying followers network (ki,out).
From this perspective, we define the retransmissions gained by user i in the following
way:
$$R_i = \sum_{l=0}^{d_{\max}-1} R_{i,l} \qquad (6.3)$$
where Ri,l represents the retweets gained by node i due to its activations at layer l in
all the cascades. This means that a node i may gain retransmissions both from the messages
it originally posted (Ri,0) and from the messages it retweeted l layers away from the source
(Ri,l). On this basis, the value of Ri,l depends on the number of i's followers, as well as
on the followers of those followers, and so on, until reaching the maximum depth considered
for a possible node activation, given by dmax. Hence the upper limit of the sum in eq. 6.3
is one layer below this value.
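As a purely illustrative worked example of eqs. 6.2 and 6.3, consider a hypothetical user i and dmax = 2. All the numbers below are invented for the example and are not taken from the datasets; the efficiency is taken as the ratio of retweets gained to activity, in the form used later in eq. 6.6.

```latex
% Hypothetical user i with d_max = 2 (illustrative numbers only)
\begin{align*}
A_i    &= A_{i,0} + A_{i,1} + A_{i,2} = 3 + 2 + 1 = 6 \\
R_i    &= R_{i,0} + R_{i,1} = 4 + 2 = 6 \\
\eta_i &= R_i / A_i = 6 / 6 = 1
\end{align*}
```

Note that the sum for Ri stops at layer dmax − 1 = 1: a retweet made at the deepest allowed layer counts towards the activity of the user, but cannot itself be retransmitted.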
[Figure 6.7 plot. Panels A and C: distribution of the User Efficiency. Panels B and D: CCDF of the Retweets Ri. Legend: Empirical; Model (Followers net); Model (Random net).]
Figure 6.7: Effects of the underlying network topology on the model results in terms of the
user efficiency distribution (left column) and the distribution of retweets gained by user (right
column). The model has been applied to the followers networks (blue crosses) and their randomized
versions (red x symbols). Two datasets have been considered: #SOSInternetVE (top panel) and #20N
(bottom panel). In all cases, a heterogeneous initial activity distribution P(A0) ∝ A0^(−1.4) has
been considered.
6.4 Results
We first applied the model through computational simulations. For this purpose, we defined
the underlying network where the propagation process takes place, as well as the initial
user activity distribution P(A0). The messages are then spread taking into account the
probability that a cascade reaches l layers, P(d ≥ l), and the retransmission rate in a
given layer, λl. Finally, after all the initial activations have been performed and the
triggered cascades have died out, we calculate the efficiency η of each user according to
eq. 6.1, as well as the corresponding density distribution.
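The initial condition of such a simulation can be sketched as follows. The snippet draws an integer number of original messages A0 for each user from a discrete power law; it is a sketch under assumptions, not the code used in this work. In particular, the function name, the truncation value `a_max` and the default exponent are hypothetical choices made for the example.

```python
import random


def sample_initial_activity(n_users, exponent=1.4, a_max=1000, rng=None):
    """Draw an integer A0 >= 1 per user with P(A0) proportional to A0**(-exponent).

    ASSUMPTION: the distribution is truncated at a_max so that it can be
    normalized by simple enumeration; a_max is a hypothetical parameter.
    """
    rng = rng or random.Random()
    support = range(1, a_max + 1)
    weights = [a ** (-exponent) for a in support]
    return rng.choices(support, weights=weights, k=n_users)


# Example: initial activity for 10,000 hypothetical users.
initial_activity = sample_initial_activity(10_000, rng=random.Random(0))
```

Each sampled value would then seed that many independent cascades from the corresponding user, after which the efficiency can be computed from the accumulated retweets and activity.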
We applied the model to two followers networks from the considered datasets. One of
these networks corresponds to the #SOSInternetVE dataset and the other is constructed
from the #20N dataset (see Fig. 6.5 D). The resulting user efficiency and retweet
distributions are shown in the top and bottom panels of Fig. 6.6, respectively. These results
correspond to the average of 50 model realizations. In both cases, the system was initially
excited using a heterogeneous user activity distribution of the form P(A0) ∝ A0^(−1.4),
and the spreading probabilities were taken from the cascade characterization given in Fig.
5.8. The resulting efficiency distributions in Fig. 6.6 A and C (blue crosses) are in very
good agreement with the empirical data (open circles) in both cases. In fact, the
distributions also reproduce the different scaling behavior on the right side of the curve.
The resulting retweet distributions in Fig. 6.6 B and D (blue crosses) are also in very good
agreement with the empirical data (open circles). These results show that the analyzed
distributions reflect the dynamical process behind message spreading, which on Twitter
takes place through the retweet mechanism in independent cascades, where the probability
that a cascade grows decays as the message travels through the network, independently of
the social context. Having validated the spreading mechanism, we can use the model to
isolate the effect of the different factors that determine the user efficiency patterns, such
as the heterogeneity of the underlying network topology and the characteristics of the
individual user behavior (activity distribution).
First, we analyze the effects of the heterogeneity of the underlying network topology on
the spreading process. To this end we applied the model to two different kinds of substrata:
the followers networks from the #SOSInternetVE and #20N datasets, and their randomized
versions. The randomized networks were built to remove hubs and create homogeneous user
profiles, by rewiring the edges so that the degree distribution follows a Normal curve
instead of a power law, while maintaining the average number of edges per node.
The resulting η distributions, after exciting the system with the same heterogeneous
P(A0), are plotted with red x symbols in Fig. 6.7 A and C, respectively. The distributions
from these homogeneous networks behave differently from both the empirical observations
and the modelled distributions on the followers networks. There is a slightly lower density
of low-efficiency users but, more importantly, the highest values of the distribution are
almost two orders of magnitude below the empirical values, and the distribution apparently
follows a lognormal behavior. However, the retweet distributions in Fig. 6.7 B and D (red
x symbols) still present power law behavior, due to the heterogeneity of P(A0), although
the probabilities of retweet are lower. In both cases, this means that in a homogeneous
society users could gain an extremely high number of retweets only by investing an
enormous amount of initial activity as well, since the user efficiency is strongly limited
by the available connections on the underlying network.
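One possible way of building such a homogeneous substratum is sketched below. It is not the rewiring procedure used in this work, whose details are not given here: the sketch simply redraws the number of followers of every node from a Normal distribution centered at the original mean degree and picks the followers uniformly at random. The function name and the parameter `sigma_fraction` (the standard deviation as a fraction of the mean) are hypothetical, and the average number of edges per node is preserved only in expectation.

```python
import random


def randomize_network(followers, sigma_fraction=0.1, rng=None):
    """Build a homogeneous counterpart of a followers network.

    followers[u] lists the followers of u; every node must appear as a key.
    ASSUMPTION: degrees are redrawn from a Normal distribution with the
    original mean degree and a standard deviation of sigma_fraction * mean.
    """
    rng = rng or random.Random()
    nodes = list(followers)
    n = len(nodes)
    mean_degree = sum(len(f) for f in followers.values()) / n
    sigma = sigma_fraction * mean_degree
    randomized = {}
    for u in nodes:
        k = round(rng.gauss(mean_degree, sigma))
        k = max(0, min(n - 1, k))  # keep the degree feasible
        others = [v for v in nodes if v != u]  # no self-loops
        randomized[u] = rng.sample(others, k)
    return randomized
```

The list of candidate followers is rebuilt for every node, which is quadratic in the number of nodes; this keeps the sketch short but would need to be optimized for networks of the size considered here.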
[Figure 6.8 plot. Panels A and C: distribution of the User Efficiency. Panels B and D: CCDF of the Retweets Ri. Legend: Empirical; Model (Followers net); Model (Random net).]
Figure 6.8: Effects of the individual user behavior on the model results in terms of the user
efficiency distribution (left column) and the distribution of retweets gained by user (right column).
The model has been applied to the followers networks (blue crosses) and their randomized versions
(red x symbols). Two datasets have been considered: #SOSInternetVE (top panel) and #20N (bottom
panel). In all cases, a homogeneous activity distribution P(A0) = 1/6, where A0 ∈ [1, 6], has been
considered.
Second, to study the effects of the individual user behavior, given by the initial activity
distribution, we again applied the model to both followers networks (the case study
#SOSInternetVE and the #20N dataset) and their randomized versions, but this time
considering a homogeneous P(A0) of the form P(A0) = 1/6 with A0 ∈ [1, 6], instead
of the heterogeneous one previously considered. The results of applying this homogeneous
user behavior to the heterogeneous followers networks are shown as blue crosses in Fig.
6.8. The resulting user efficiency distributions in Fig. 6.8 A and C present the same
behavior on the right side of the curve as the empirical observations (open circles), even
though the considered user behavior is radically different from the empirical one. The
retweet distributions (Fig. 6.8 B and D) also coincide quite well with the empirical
observations and hardly change in comparison with the distributions obtained when users
posted messages heterogeneously. However, if we replace the substrata with their randomized
versions, the model results no longer reproduce the empirical behavior and all the
distributions lose their heterogeneity (red x symbols in Fig. 6.8). This confirms that the
emerging patterns do not depend on the way users post original messages, but are instead
a consequence of their heterogeneous connections on the underlying network.
In the case of Twitter, the followers network also represents the way in which the
collective attention is organized. On this basis, the model shows that if the collective
attention is distributed heterogeneously among the population, the way users post messages
has no further effect on the efficiency distribution or on the retweet distribution, since
the high aggregation of users around the influential ones is what produces such large
collective reactions. In turn, if users paid attention to each other homogeneously, as in the
randomized version of the followers network, the retweets gained by user would reflect the
frequency and number of posted messages, and the efficiency in gaining such retweets would
be strongly limited by the properties of the underlying substratum. However, although in a
homogeneous society it would be more difficult to find extreme cases of highly efficient
users, the density of extremely inefficient users also decreases when the attention is
shared homogeneously among the collective. This indicates that for some users to gain
attention from the collective, others must lose it at the same time.
6.5 Analytical Solution
In this section we provide an analytical solution to the model of user efficiency. For this
purpose, we will define the quantities Ai,l from eq. 6.2 and Ri,l from eq. 6.3, for l > 0.
Ai,l is defined in the following way:
Figure 6.9: Results from the analytical model of user efficiency, considering cascades up to three
layers of depth in the followers network from the #SOSInternetVE dataset. Resulting η average
(A) and standard deviation (B) from evaluating the model with 0.2 < P (d > 0) < 1.0 (x-axis)
and 0.05 < r0 < 0.3 (color). The dashed lines indicate the empirical values. (C) Resulting
η distribution from applying the analytical model to the followers network with the empirical
activity distribution P (A0) by setting P (d > 0) = 0.775 and r0 = 0.15. The white dots represent
the empirical distribution of user efficiency and the triangles represent the distribution obtained
from the analytical model.
$$A_{i,l}\big|_{l>0} = \langle A_{l-1} \rangle \, k_{i,\mathrm{out}} \, \lambda_l \, P(d \geq l) \qquad (6.4)$$
where 〈Al−1〉 is the mean activity of the nodes in layer l − 1 over all the cascades,
ki,out is the out-degree of user i (the number of users whom i follows), P(d ≥ l) is the
probability that a cascade has a depth of at least l layers, and λl is the retransmission
rate at layer l. Thus the activity of a node at any layer depends on the expected activity
of all the nodes in the previous layer, on the node's connectivity and on the network's
permeability, given by the probability that a cascade grows vertically and horizontally.
Ri,l is defined as follows:
$$R_{i,l} = A_{i,l} \sum_{n=l}^{d_{\max}} K_{i,\mathrm{in}}(n) \, P(d \geq n) \prod_{m=0}^{n} \lambda_m \qquad (6.5)$$
where Ki,in(n) is the sum of the in-degrees (kin,j) of the nodes j that are n layers away
from i, following the edge direction, with Ki,in(0) = ki,in, the node's own in-degree.
The resulting user efficiency ηi would be:
$$\eta_i = \frac{\sum_{l=0}^{d_{\max}-1} R_{i,l}}{\sum_{l=0}^{d_{\max}} A_{i,l}} \qquad (6.6)$$
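Eqs. 6.4 to 6.6 can be transcribed almost literally into code. The following Python sketch does so under several assumptions that are ours and are not stated in the text: 〈Al−1〉 is taken as the mean over all the nodes of the network, the nodes "n layers away from i" are those at shortest-path distance n along follower links, and the input names (`followers`, `a0`, `p_depth` for P(d ≥ l), `lam` for λl) are hypothetical. It is meant to make the structure of the equations explicit, not to reproduce the results of Fig. 6.9.

```python
from math import prod


def layers_in_degree(followers, i, d_max):
    """K_in(i, n) for n = 0..d_max: summed in-degree of the nodes n hops from i."""
    seen = {i}
    frontier = [i]
    sums = []
    for _ in range(d_max + 1):
        sums.append(sum(len(followers.get(u, [])) for u in frontier))
        nxt = []
        for u in frontier:
            for v in followers.get(u, []):
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return sums


def analytical_efficiency(followers, a0, p_depth, lam):
    """Evaluate eqs. 6.4-6.6; a0 maps every node to its original activity."""
    d_max = len(p_depth) - 1
    nodes = list(a0)
    k_out = {i: 0 for i in nodes}  # number of users that i follows
    for u in nodes:
        for v in followers.get(u, []):
            k_out[v] += 1
    activity = {i: [a0[i]] for i in nodes}
    for l in range(1, d_max + 1):  # eq. 6.4
        mean_prev = sum(activity[i][l - 1] for i in nodes) / len(nodes)
        for i in nodes:
            activity[i].append(mean_prev * k_out[i] * lam[l] * p_depth[l])
    eta = {}
    for i in nodes:
        k_in = layers_in_degree(followers, i, d_max)
        reach = [k_in[n] * p_depth[n] * prod(lam[: n + 1])
                 for n in range(d_max + 1)]
        # eq. 6.5 for each layer l, summed over l = 0..d_max-1 as in eq. 6.3
        gained = sum(activity[i][l] * sum(reach[l:]) for l in range(d_max))
        total = sum(activity[i])
        eta[i] = gained / total if total > 0 else None  # eq. 6.6
    return eta
```

Users with no activity at all have an undefined efficiency, which the sketch reports as `None`.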
We applied eq. 6.6 to the followers network from the #SOSInternetVE dataset, using the
actual original activity of each node, which follows a heterogeneous distribution. The
resulting distribution for dmax = 2 is plotted in Fig. 6.9 C. The analytical model results
are in very good agreement with the observed data. In order to reproduce the distribution,
we had to increase the probabilities that cascades grow vertically and horizontally on the
first layer to P(d > 0) = 0.775 and r0 = 0.15, respectively. These values differ from the
empirical values, which did not reproduce the empirical distribution.
To obtain these probabilities, we evaluated eq. 6.6 over the ranges 0.2 < P(d > 0) < 1.0
and 0.05 < r0 < 0.3. The resulting average and standard deviation of η are shown in Fig.
6.9 A and B, respectively. The dashed lines indicate the empirical values. We first noticed
that the empirical average of η is obtained within the range of P(d > 0) marked with a gray
shadow in Fig. 6.9 A. Then, from this range, we found the value of r0 at the intersection
with the dashed line in Fig. 6.9 B.
6.6 Summary
In summary, we have modeled the efficiency of users at spreading their opinions during
Twitter conversations, and found that the emergent patterns are remarkably influenced by
the underlying network topology. We have shown evidence of the robust-but-vulnerable
property of complex networks: they appear to be robust to most external excitations, as
most people post messages that do not travel at all, but vulnerable to selected excitations,
as the activity of the highly efficient users has a remarkable impact on the resulting
patterns [Wat02]. This effect can also be measured through a macroscopic property, the
percentage of retweets among all posted messages. In the protest, 47% of the messages were
retweets, while our simulations gave 45 ± 3% for the followers network and 40.3 ± 0.1% for
the randomized version. These additional 5% of retransmissions were only possible due to
the complex organization of the network.
Chapter 7
MEASURING POLITICAL POLARIZATION
In this chapter we propose a methodology to study polarization in social media and quantify
its effects. To this end, we introduce a computational model to estimate opinions [MBLBss]
from a contagion process on social networks, together with a new index [MBLBss] to quantify
the extent of polarization in the obtained opinions.
The model iteratively estimates the opinions of the majority by fixing the opinions of a
minority of influential individuals and mapping the communication fluxes among the
population. Its dynamics are similar to those of the DeGroot model [DeG74], with the
addition of some users acting as “zealots” [Mob03, MMR07]. In the absence of polarization,
the expected distribution of opinions is a normal distribution centered at a neutral opinion.
As polarization emerges, however, the distribution becomes bimodal, with two peaks around
the two dominant and confronted opinions [DW07].
Our measure of polarization is inspired by the electric dipole moment, a measure of the
overall polarity of a system of charges. For two opposite point charges, the electric dipole
moment increases with the distance between the charges. Analogously, the polarization of
two equally populated groups depends on how distant their views are. We apply this index to
measure the polarization of the opinion distributions obtained with the proposed opinion
estimation model.
At the end of the chapter, we show how to apply our methodology to online data gathered
from Twitter in order to estimate individual opinions and measure the emergent political
polarization. The data correspond to online conversations around the announcement of the
death of the late Venezuelan President, Hugo Chavez. We found good agreement between our
results and offline data.
7.1 A Model to Estimate Opinions in a Social Network
We present a model to estimate the opinions of individuals who interact on a social network.
In it we distinguish two types of individuals, elite and listeners. The former have a fixed
opinion and act as seeds of influence, while the opinion of the latter depends on their
social interactions. The model is fully specified by the following assumptions:
1. Initial Conditions: The world is abstracted as a directed network, G, in which each
individual is represented by a node and links account for influence rather than friendship or
any other kind of relationship. We define two different subsets of nodes: S, accounting for
the elite, and L, accounting for the listeners. Additionally, we endow each elite node with a
parameter, Xs, that determines her opinion value and remains constant for the duration of
the model. Xs lies in the range −1 ≤ Xs ≤ 1, where 1 and −1 represent the two extreme and
confronted poles. Finally, we assign an initially neutral opinion, Xl(0) = 0, to all listeners.
2. Opinion Generation: At each iteration, the elite nodes, S, propagate their opinions
through the established network, G, influencing the listeners, L. Each listener iteratively
updates her opinion value to the mean opinion value of her neighbors. Thus the opinion at
time step t of a given listener i is given by the following expression:
$$X_i(t) = \frac{\sum_j A_{ij} X_j(t-1)}{k_i^{\mathrm{out}}} \qquad (7.1)$$
where Aij represents the elements of the network adjacency matrix, which is 1 if and only
if there is a link from j to i, and k_i^out corresponds to her out degree. The process is
repeated until all nodes converge to their respective Xi values, which lie in the range
−1 ≤ Xi ≤ 1. Convergence is defined by a threshold Th such that |Xi(t + 1) − Xi(t)| < Th.
The results of the model are given as a density distribution of the nodes' opinion values,
p(X). Note that the opinion of an individual does not depend on her own opinion in the
previous step. This is because we are estimating an opinion that was a priori unknown,
rather than studying the evolution of opinions.
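The two assumptions above can be sketched as a short iterative procedure. The Python snippet below is an illustrative sketch, not the implementation used in this work. The names `influencers` and `elite_opinion` are hypothetical; `influencers[i]` lists the nodes j for which Aij = 1, and the denominator of eq. 7.1 is taken to be the number of such nodes, so that the update is the mean opinion of the neighbors, as stated in the text. Listeners without any neighbor are assumed to remain neutral.

```python
def estimate_opinions(influencers, elite_opinion, threshold=1e-4, max_iter=1000):
    """Iterate eq. 7.1 until every listener changes by less than `threshold`.

    influencers[i] lists the nodes j with A_ij = 1 (j influences i).
    elite_opinion maps each elite node to its fixed X_s in [-1, 1].
    """
    nodes = set(influencers) | set(elite_opinion)
    for sources in influencers.values():
        nodes.update(sources)
    # Elite start at X_s; listeners start at the neutral opinion 0.
    opinion = {n: elite_opinion.get(n, 0.0) for n in nodes}
    for _ in range(max_iter):
        updated = dict(opinion)
        for i in nodes:
            sources = influencers.get(i, [])
            if i in elite_opinion or not sources:
                continue  # elite stay fixed; isolated listeners stay neutral
            updated[i] = sum(opinion[j] for j in sources) / len(sources)
        change = max(abs(updated[n] - opinion[n]) for n in nodes)
        opinion = updated
        if change < threshold:
            break
    return opinion


# Hypothetical toy network with one elite node at each pole.
toy = {"u": ["red", "blue"], "v": ["red", "u"], "w": ["blue"]}
opinions = estimate_opinions(toy, {"red": -1.0, "blue": 1.0})
```

In this toy case the listener "u", exposed equally to both poles, stays neutral, while "w", exposed only to the blue elite node, adopts its opinion.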
The dynamics of the model are illustrated in Fig. 7.2, where we present a schema of the
influence spreading process. Panel A shows the instantiation of the model, where each
elite node has been colored according to her opinion (red, Xs = −1; blue, Xs = +1).
Panels B-E show the dynamics of the influence process from the initialization (B) to the
final converged state (E). Panels F and G show two empirical networks corresponding to
a non-polarized (F) and a polarized (G) case. We also illustrate the dynamics of the model
in Video B.1, which is described in Appendix B.
Figure 7.1: Schema explaining the proposed polarization index µ. (A) Density distribution of
opinions. gc stands for the gravity center of each pole, A stands for the population associated to
each ideology, and d stands for the pole distance. (B) Visualization of the polarization index, µ,
for three different situations.
7.2 A Measure of Polarization
We say that a population is perfectly polarized when it is divided into two groups of the
same size with opposite opinions. We therefore propose a measure of polarization that
quantifies both effects for the X ∈ [−1, 1] distribution obtained from our model. The
definition is inspired by the electric dipole moment, a measure of the overall polarity of a
system of charges. In the simplest case of two point charges of opposite signs (−q and +q),
the electric dipole moment is proportional to the distance between the charges. This is
analogous to a simple scenario of two persons with different ideologies, in which the
polarization depends on how conflicting their points of view are (i.e. the distance between
the two ideologies).
We begin by calculating the population associated with each opinion (positive and
negative). We define A− as the relative population holding negative opinions (X < 0) and,
likewise, A+ as the relative population holding positive opinions (X > 0). Both variables
can be expressed as:
$$A^{-} = \int_{-1}^{0} p(X)\,dX = P(X < 0), \qquad (7.2)$$
$$A^{+} = \int_{0}^{1} p(X)\,dX = P(X > 0) \qquad (7.3)$$
So we can express the normalized difference in population sizes, ∆A, as
$$\Delta A = |A^{+} - A^{-}| = |P(X > 0) - P(X < 0)| \qquad (7.4)$$
Next we quantify the distance between the positive and negative opinions; in other words,
we measure how different the opinions of the two sides are. To this end we determine the
gravity centers of the positive and negative opinions, which can be written as
$$gc^{-} = \frac{\int_{-1}^{0} p(X)\,X\,dX}{\int_{-1}^{0} p(X)\,dX}, \qquad (7.5)$$
$$gc^{+} = \frac{\int_{0}^{1} p(X)\,X\,dX}{\int_{0}^{1} p(X)\,dX} \qquad (7.6)$$
and define the pole distance, d, as the normalized distance between the two gravity
centers. Hence it can be expressed as:
$$d = \frac{|gc^{+} - gc^{-}|}{|X_{\max} - X_{\min}|} = \frac{|gc^{+} - gc^{-}|}{2} \qquad (7.7)$$
This formula gives d = 0 when there is no separation between the gravity centers, i.e.
there are no longer two differentiated groups and everyone shares a similar opinion; and
d = 1 when the two opinions are extreme and perfectly opposed.
Finally, we can use eqs. 7.4 and 7.7 to write down a general formula to measure polarization
as a function of the difference in size between both populations, ∆A, and the pole
distance, d. Thus we define the polarization index as:
$$\mu = (1 - \Delta A)\, d \qquad (7.8)$$
This formula gives µ = 1 when the distribution is perfectly polarized. In this case the
opinion distribution consists of two Dirac delta functions centered at −1 and +1,
respectively. Conversely, µ = 0 means that the opinions are not polarized: the resulting
distribution of opinions would either take the form of a Gaussian distribution centered at a
neutral opinion, or be entirely concentrated at one of the poles, implying that the population (A)
Figure 7.2: Schema of the influence spreading process in the opinion estimation model. (A)
Seed nodes in the network, colored according to their respective ideology. (B) The network at
t = 0, before the seeds start to propagate their influence. (C) State of the network at t = 1.
(D) State of the network at t = n/2. (E) Final state of the network at t = n. (F) and (G)
Visualizations of two examples of the result of applying the opinion estimation model to the
Venezuelan dataset, for non-polarized (F) and polarized (G) days. See the video B.1 described in
Appendix B.
of the other pole would be reduced to zero and ∆A = 1. In between, the polarization lies
within the range 0 < µ < 1 for one of three reasons: i) the population sizes associated
with each opinion are equal, but the pole distance d is lower than 1; ii) d is equal to 1,
but the population sizes associated with each opinion differ, so that there is a majority
sharing a similar opinion; iii) a combination of i and ii. Fig. 7.1 A illustrates the basic
concepts of the proposed polarization index: it shows the area associated with each
opinion, the corresponding gravity centers and the pole distance for a standard case of a
perfectly bimodal distribution. Panel B of this figure shows a non-polarized distribution
(µ = 0), a perfectly polarized one (µ = 1) and an intermediate case.
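For a finite sample of opinions, the integrals in eqs. 7.2 to 7.6 reduce to counts and averages, so the index of eq. 7.8 can be computed directly. The following Python sketch illustrates this; the function name is hypothetical, and returning µ = 0 when one of the poles is empty is a convention adopted here because the gravity center of an empty pole is undefined.

```python
def polarization_index(opinions):
    """Polarization index mu (eq. 7.8) of a sample of opinions in [-1, 1]."""
    negative = [x for x in opinions if x < 0]
    positive = [x for x in opinions if x > 0]
    if not negative or not positive:
        # One pole is empty, so its gravity center is undefined.
        # CONVENTION of this sketch: report no polarization.
        return 0.0
    n = len(opinions)
    delta_a = abs(len(positive) - len(negative)) / n  # eq. 7.4
    gc_neg = sum(negative) / len(negative)            # eq. 7.5
    gc_pos = sum(positive) / len(positive)            # eq. 7.6
    d = abs(gc_pos - gc_neg) / 2                      # eq. 7.7
    return (1 - delta_a) * d                          # eq. 7.8


# Two equally sized groups at the extremes: perfectly polarized.
mu_extreme = polarization_index([-1, -1, 1, 1])
# Same group sizes but moderate opinions: lower pole distance.
mu_moderate = polarization_index([-0.5, -0.5, 0.5, 0.5])
```

The two toy samples correspond to reasons i) and the perfectly polarized case discussed above; an unbalanced sample such as `[-1, 1, 1, 1]` illustrates reason ii).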
7.3 Study of Polarization on Retweet Networks
In order to measure the extent of polarization in Twitter conversations, we propose the
following methodology. First, we build social networks from the conversation to be analyzed,
like the ones described in section 4.2.2. Then, we apply the model proposed in section 7.1
to the networks in order to obtain the distribution of opinions of the population. Finally,
we quantify the polarization present in these opinion distributions by means of the index
we proposed in section 7.2.
The social networks we considered are the retweet networks from Twitter conversations.
These user-to-user interaction networks represent the channels through which information
actually flows on Twitter, as shown in chapters 5 and 6. Moreover, the retweet mechanism has
been reported to be the most polarized on Twitter [CRF+11, CGFM12] and is typically used to
actively endorse ideas [BGL10, BMLB12].
In this section, we apply our opinion estimation model and polarization index to Twitter
data regarding the late Venezuelan President Hugo Chavez. The dataset used in this study
was described in section 4.2.2 under the keyword Chavez. First, we present the networks'
properties. Then, we define the elite of collective attention and use them as seeds to
apply the opinion model to all networks. Finally, we discuss the effects of edge direction
and the offline-online relationship of our results.
7.3.1 Retweets Networks
Since this conversation covers a two-month period, we built one independent retweet
network for each day of the observation period (56 networks). A single network contains
several retransmission cascades, seeded and propagated by the conversation participants.
When these cascades are aggregated, several disconnected network components emerge. In
Fig. 7.3 we present a visualization of the retweet network for an arbitrary day. Most of
the components are composed of two or three users at most (see gray graphs in Fig. 7.3),
while there is a single component, called the Giant Component (GC), whose size is of the
same order as the whole network (see colored graph in Fig. 7.3). We apply the polarization
detection methodology to these GCs and refer to them as the retweet networks in the
following sections.
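Extracting the giant component of a daily retweet network amounts to finding its largest connected component. The sketch below does this with a breadth-first search, ignoring the edge direction when grouping users; this choice (weak connectivity), the function names and the representation of the network as a list of (retweeter, author) pairs are assumptions of the example.

```python
from collections import defaultdict, deque


def components(retweets):
    """Yield the node sets of the weakly connected components.

    retweets is an iterable of (retweeter, author) pairs; the edge direction
    is ignored when grouping users into components.
    """
    neighbours = defaultdict(set)
    for u, v in retweets:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = set()
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = {start}
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        yield component


def giant_component(retweets):
    """Return the node set of the largest component."""
    return max(components(retweets), key=len)


# Hypothetical toy day: one larger component and two isolated pairs.
edges = [("a", "b"), ("b", "c"), ("d", "c"), ("x", "y"), ("p", "q")]
gc_nodes = giant_component(edges)
```

The sizes returned by `components` give the component size distribution, and the size of `gc_nodes` divided by the total number of users gives the fraction of the network contained in the GC.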
In the left panel of Fig. 7.4, we present the distribution of component sizes for three
different days. The distribution follows a power law, and the GC is much larger than the
remaining components. In the right panel of Fig. 7.4 we present the time evolution of the
GC properties. First, Fig. 7.4 A shows the time evolution of the ratio between the number
of nodes in the GC and the number of nodes in the respective network. The GC contained
around 80% of the network nodes throughout the observation period. The number of users in
the GC varied widely during the unfolding events. Fig. 7.4 B shows the time evolution of
the GC size, measured as the total number of nodes in each network. The GC size fluctuated
around a median value of 20,000 users (gray dashed line), and grew up to 1 million users
during the death announcement (orange stripe). This temporal behavior is typical of
Figure 7.3: Visualization of the retweet network at day D− 29. The Giant Component has been
colored in blue and red, while the rest of components have been colored in gray.
Figure 7.4: (Left) Distributions of the component sizes of the retweet networks from the Twitter
conversation about the Venezuelan President Hugo Chavez for three days: D − 29, D and D + 20,
where D represents the day of the main occurrence. (Right) Time evolution of the Giant Component
(GC) of the retweet networks: (A) Ratio between the number of nodes that make up the GC and
the number of nodes in the respective networks. (B) Time evolution of the size of the whole
network and of the GC, in terms of nodes. (C) Relative number of messages posted from inside
Venezuela by the geolocalized users in the GC. The orange stripe represents the day D and the
state funeral period.
breaking news topics [YL11], with a bursty increase during the main occurrence and a slow
decay that may last for several days. During the burst the conversation went viral and many
international users joined it from all around the globe (see Fig. 7.5 and the video B.2
described in Appendix B). This is reflected in the share of geolocated messages posted from
inside Venezuela, given in Fig. 7.4 C. The Venezuelan share of messages represented about
80% of the analyzed content for most of the observation period (dashed line), with the
exception of the day of the death announcement, when it reached its lowest point, close to
20% of the messages.
The retweet networks characterize the way in which the collective attention is organized
during an event on Twitter. The out-strength (sout) indicates the amount of attention paid
by a given user in the conversation, while the in-strength (sin) indicates the amount of
attention received by a user from the rest of the network. The former is measured by the
number of retweets made by the participant, and the latter by the number of retweets gained
by the participant. In Fig. 7.6 we have superimposed the in-strength (left) and out-strength
(right) complementary cumulative density functions (CCDF) of all the constructed networks.
In both cases the distributions display power law behavior, with the in-strength
distributions being broader than the out-strength distributions.
To understand how people distributed their attention, we studied the evolution of the
Gini coefficient [CV12] of these two distributions. The Gini coefficient is commonly used to
measure inequality in people’s income, and indicates the heterogeneity of a distribution. It
takes the value 1 when the population is perfectly unequal, indicating that hubs concentrate
all the links. In turn, it takes the value 0 when the population is perfectly equal, indicating
that links are equally distributed among nodes. Here we propose to use it as an indicator of
how people’s attention is distributed among the information sources. The results are
shown in Fig. 7.6 C. The Gini index of incoming links is very close
to 1 during the whole observation period (blue curve in Fig. 7.6 C), which means that hubs
concentrate practically all of the collective attention. In contrast, the Gini
index for outgoing links is closer to 0 (red curve in Fig. 7.6 C), indicating that the attention
given is less unequally distributed among users than the attention received.
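As an illustration of the indicator described above, the following minimal Python sketch computes the Gini coefficient of a list of strengths. The function name `gini` and the toy inputs are assumptions made for this example; it is not the code used to produce Fig. 7.6 C. Note that for a finite sample of n values the maximum attainable value is (n − 1)/n, which approaches 1 for large networks.

```python
def gini(values):
    """Gini coefficient of non-negative numbers (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch: no attention, no inequality
    # Standard formula on sorted data: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n, i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy example: one hub receives every retweet versus a perfectly even split.
hub_dominated = gini([0, 0, 0, 10])
evenly_spread = gini([5, 5, 5, 5])
```

In this toy case the hub-dominated list gives the maximum value for four nodes, while the even split gives zero.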
Moreover, in order to understand the way these heterogeneous users interacted with each
other, we studied the evolution of the directed degree assortativity [New03a, HW09]. The
results are shown in Fig. 7.6 D. The out-in degree assortativity (green) is
negative for the whole observation period. This means that the content posted by the influential
hubs is usually retweeted by users who are not that connected. On the other hand, the
out-out degree assortativity (blue) is clearly positive for the whole observation period, which
Figure 7.5: Visualization of geolocated messages from the Chavez conversation on three days
from different periods: before the announcement (top), during the announcement (middle), after
the announcement (bottom). The dots represent geolocalized messages. The label indicates the
day of observation, D being the day of the announcement.
Figure 7.6: Evolution of the topological properties of the retweet networks emergent at each day
of the observation period, in terms of: (A) Out strength complementary cumulative distribution,
(B) In strength complementary cumulative distribution, (C) Gini index evolution of the strength
distributions. (D) Directed degree assortativity evolution. The orange stripe represents the day of
the main occurrence. In A and B, the blue curves correspond to the first days and the red curves
correspond to the last days.
Figure 7.7: Conditional probability density function of the accumulated in-strength (Sin) given
the participation rate (ρ), from the Twitter conversation about the Venezuelan President Hugo
Chavez. The color corresponds to the density of users. The red line indicates the average
accumulated in-strength value Sin for a given participation rate ρ.
means that very active users are usually retweeted by other very active users. That effect is
related to the cascades shown in chapter 5. The other two assortativities, in-out and in-in,
are close to zero, which means that no major correlation is detected.
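The four directed degree assortativities discussed above can be sketched as Pearson correlations taken over the edges of a network. The snippet below is a minimal illustration under stated assumptions: edges are written as (source, target) pairs, unweighted degrees are used, and the helper names `pearson` and `directed_assortativity` are introduced here for the example only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-empty lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation undefined; reported as 0 in this sketch
    return cov / sqrt(vx * vy)


def directed_assortativity(edges, source_kind, target_kind):
    """Correlate the source's `source_kind` degree with the target's
    `target_kind` degree over all edges; kinds are "in" or "out"."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    deg = {"out": out_deg, "in": in_deg}
    xs = [deg[source_kind].get(u, 0) for u, _ in edges]
    ys = [deg[target_kind].get(v, 0) for _, v in edges]
    return pearson(xs, ys)


# Toy network: three users point to a hub "h", which points back to two of them.
toy_edges = [("a", "h"), ("b", "h"), ("c", "h"), ("h", "a"), ("h", "b")]
r_out_in = directed_assortativity(toy_edges, "out", "in")
```

In this toy network every low out-degree source points to the high in-degree hub and vice versa, so the out-in coefficient is strongly negative.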
Participation
To further understand the relationship between the individual activity and the attention
received, we will aggregate the observation period by characterizing the individuals according
to their rate of participation and total amount of retweets gained. The participation rate is
defined as:
\[ \rho = \frac{\rho_i}{T} \tag{7.9} \]
where ρi is the number of days on which user i actively participated in the retweet process
and T is the total length of the observation period. The total number of retweets gained by
user i is measured as:
\[ S_{in} = \sum_{t=0}^{T} s_{in}(t) \tag{7.10} \]
where sin(t) is the in-strength of node i at day t. If the user did not actively participate
at day t, then sin(t) = 0.
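The two quantities defined in eqs. 7.9 and 7.10 can be sketched in a few lines of Python. This is a minimal illustration, assuming that each daily network is given as a list of (retweeter, author, weight) triples and that a user counts as active on a day if they appear in any edge of that day; the function name and data layout are assumptions of this example.

```python
from collections import defaultdict


def participation_and_attention(daily_edges):
    """Return (rho, S_in): participation rate and accumulated in-strength per user.

    daily_edges: one list per day of (retweeter, author, weight) triples.
    """
    T = len(daily_edges)              # length of the observation period in days
    active_days = defaultdict(int)    # days on which each user appears
    s_in_total = defaultdict(int)     # retweets gained, summed over days
    for edges in daily_edges:
        seen_today = set()
        for retweeter, author, weight in edges:
            seen_today.update((retweeter, author))
            s_in_total[author] += weight
        for user in seen_today:
            active_days[user] += 1
    rho = {u: d / T for u, d in active_days.items()}
    S_in = {u: s_in_total.get(u, 0) for u in active_days}
    return rho, S_in


# Toy observation period of three days.
days = [
    [("a", "h", 2), ("b", "h", 1)],
    [("a", "h", 1)],
    [("c", "b", 1)],
]
rho, S_in = participation_and_attention(days)
```

Users who never gain a retweet still appear in `S_in` with a value of zero, mirroring the convention sin(t) = 0 for inactive days.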
The conditional probability density function of the accumulated in-strength Sin given
a participation rate ρ, P (Sin|ρ), is shown in Fig. 7.7. This distribution indicates the total
amount of attention received by users according to their participation rate. The
largest density of users (red and orange dots in Fig. 7.7) participated on fewer than 20%
(ρ < 0.2) of the days and presents a small in-strength value (Sin < 10), which means that
most of them received a small amount of the collective attention. However, the average
conditional value of Sin, given by 〈Sin|ρ〉, grows with the participation
rate ρ, indicating that the more days people participate, the more attention they receive
(see red line in Fig. 7.7). In fact, there is a very small set of users at the upper right corner
of Fig. 7.7 who participated almost every day and present an extremely high Sin (up to
almost 100,000). This minority of highly influential users captured most of the collective
attention throughout the observation period, and are considered to be the opinion leaders.
In summary, we have seen that while most participants hardly gain attention, there is
a very small set of users who captured most of the collective attention.
7.3.2 Elite nodes
The opinion estimation model described in section 7.1 defines a set of influential users called
the elite. These users act as seeds of opinion and help to infer the opinions of the
majority of listeners. In the Twitter conversation, we consider those users who often
participate and concentrate large amounts of retweets to be the elite of the collective attention.
Their messages were widely forwarded by the conversation participants on a daily basis, which
makes them leaders of the information diffusion process. In this section we describe
their properties and the way they behaved in the conversation.
We have defined three sets of elite users, according to how actively they
participated in the conversation and the attention they received from the rest of the participants.
The first set is composed of the top 65 most influential users, who gained an extremely high
number of retweets, independently of their participation rate (Sin > 10,000 and ρ > 0.0).
These users correspond to the yellow rectangle at the top of Fig. 7.7, and represent accounts
of politicians, news media and journalists. The second set includes those users who
gained a considerable number of retweets by participating widely over time (Sin > 1000 and
ρ > 0.89). This set of 136 users includes those who captured a large part of the collective
attention by actively participating throughout the observation period (green rectangle
in Fig. 7.7). The third set includes those users who did not necessarily receive much
attention, even after having participated widely over time (Sin > 10 and ρ > 0.82). This set of
635 users includes those who were very active in the conversation but did not necessarily capture
much of the collective attention (black rectangle in Fig. 7.7), as well as the most influential
ones. In fact, most of the users in the smaller sets are contained in the larger sets, as some
rectangles clearly overlap in Fig. 7.7.
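The selection of the three elite sets amounts to thresholding Sin and ρ. A minimal sketch follows; the thresholds are the ones quoted above, while the dictionary layout, the function name `select_elite` and the toy users are assumptions of this example. The comparison is written as "greater than or equal", following the "minimum values" wording of Table 7.1; the text writes the conditions with strict inequalities, so the boundary treatment is an assumption.

```python
# Thresholds (minimum S_in, minimum rho) for the three elite sets.
ELITE_THRESHOLDS = {1: (10000, 0.0), 2: (1000, 0.89), 3: (10, 0.82)}


def select_elite(S_in, rho, min_s_in, min_rho):
    """Users whose accumulated in-strength and participation rate reach the thresholds."""
    return {
        user
        for user, s in S_in.items()
        if s >= min_s_in and rho.get(user, 0.0) >= min_rho
    }


# Hypothetical users, for illustration only.
toy_S_in = {"news": 25000, "pol": 1500, "fan": 40, "lurker": 2}
toy_rho = {"news": 0.3, "pol": 0.95, "fan": 0.9, "lurker": 0.05}

elites = {
    k: select_elite(toy_S_in, toy_rho, s_min, r_min)
    for k, (s_min, r_min) in ELITE_THRESHOLDS.items()
}
```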
In order to analyze the elite’s behavior, we have built networks with these three sets of
influential users. Each network is built by merging the edges among the respective nodes
throughout the observation period. That is, we build a network that represents the union
of all daily networks, restricted to the corresponding set of elite nodes. In Table 7.1 we
present the topological properties of the elite networks.
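The merging step can be sketched as follows. This is a minimal illustration, assuming daily edge lists of (retweeter, author, weight) triples and assuming that the weights of repeated edges are summed when the daily networks are merged; the text does not state how weights are combined, so that choice and the function name are assumptions.

```python
def elite_union_network(daily_edges, elite):
    """Union of the daily networks, keeping only edges between elite nodes."""
    merged = {}
    for edges in daily_edges:
        for source, target, weight in edges:
            if source in elite and target in elite:
                key = (source, target)
                merged[key] = merged.get(key, 0) + weight  # assumed: weights add up
    return merged


# Toy example with two days and an elite of two users.
toy_days = [
    [("a", "b", 1), ("c", "a", 2)],
    [("a", "b", 3), ("b", "a", 1)],
]
toy_elite_net = elite_union_network(toy_days, {"a", "b"})
```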
First of all, we have found that these networks present a segregated structure. The
networks present a clearly defined community structure according to the modularity
optimization algorithm [BGLL08]. For the three networks the modularity is positive and high
(Q1 = 0.43, Q2 = 0.38, Q3 = 0.35), which indicates that the communities in these graphs are
well segregated from each other. Moreover, the members of each community share political
preference. The second and third networks presented one community (or C-node) in favor of the
late President (officialism) and another one identified with the opposition parties (against
Elite NW   Sin     ρ      Nodes   Edges   Off. C-nodes   Opp. C-nodes   Q      r
1          10000   0.00   67      334     1 (25)         3 (42)         0.43   0.77
2          1000    0.89   136     1567    1 (48)         1 (88)         0.38   0.88
3          10      0.82   635     28245   1 (197)        1 (438)        0.35   0.91
Table 7.1: Topological properties of the elite networks. The Sin and ρ columns represent minimum
values. Off. C-nodes indicates the number of network communities related to the officialism, and
Opp. C-nodes indicates the number of communities related to the opposition. The numbers in
parentheses indicate the number of nodes in each pole. Q stands for modularity. r stands for the
Pearson coefficient of mixing patterns by ideology.
the late President). In particular, the first network presented one community identified with
the officialism and three with the opposition. To study the preference of interaction by
political affinity, we analyzed the networks’ mixing patterns [New03a], given by the Pearson
coefficient r in Table 7.1. In all three cases the assortativity values are very high
(r1 = 0.77, r2 = 0.88, r3 = 0.91), which evidences that the interactions in these networks are
strongly polarized.
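For reference, the mixing-pattern coefficient by a discrete attribute can be sketched as below. The sketch assumes the standard form r = (Tr e − Σ a_i b_i) / (1 − Σ a_i b_i), where e is the normalized matrix of edge counts between attribute values, applied to an undirected edge list. The function name, the input layout and the toy labels are assumptions of this example, not the code behind Table 7.1.

```python
def attribute_assortativity(edges, label):
    """Assortativity coefficient by a categorical node attribute (undirected edges)."""
    counts = {}
    for u, v in edges:
        # Count each undirected edge in both directions so that e is symmetric.
        for pair in ((label[u], label[v]), (label[v], label[u])):
            counts[pair] = counts.get(pair, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return float("nan")
    kinds = {k for pair in counts for k in pair}
    e = {pair: c / total for pair, c in counts.items()}
    a = {k: sum(e.get((k, other), 0.0) for other in kinds) for k in kinds}
    trace = sum(e.get((k, k), 0.0) for k in kinds)
    sum_ab = sum(a[k] ** 2 for k in kinds)  # b equals a because e is symmetric
    if sum_ab >= 1.0:
        return float("nan")  # a single attribute value: coefficient undefined
    return (trace - sum_ab) / (1.0 - sum_ab)


# Toy labels: two officialism and two opposition accounts.
toy_label = {"a": "off", "b": "off", "c": "opp", "d": "opp"}
r_segregated = attribute_assortativity([("a", "b"), ("c", "d")], toy_label)
r_mixed = attribute_assortativity([("a", "c"), ("b", "d")], toy_label)
```

Values close to 1, such as those reported in Table 7.1, correspond to the first toy case, in which edges stay within the same pole.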
To further understand the polarized structures of these networks, we present a visual-
ization of the three elite networks in the bottom row of Fig. 7.8. The nodes have been
colored according to the determined political affinity (red for the officialism and blue for the
opposition). It can be noticed that the larger the network (from left to right) the clearer and
more defined the poles are. This may also be noticed in the adjacency matrices represented
in the top row of Fig. 7.8. We have colored the edges to distinguish the interactions within
poles or between them. Red dots indicate an edge between two nodes from the officialism
block, blue indicates edges within the opposition block and pale yellow represents edges that
connect two different blocks. The matrices present a clearly defined
block diagonal structure. This indicates that most edges remain within the same
block (over 90% of edges in all cases), and that connections between blocks are scarce.
7.3.3 Estimating Opinions
In the present section, we apply the opinion estimation model described in section
7.1 to each of the 56 daily retweet networks. The elite’s influence is defined by
a fixed opinion which depends on the political pole: Xs = −1 for the officialism pole and
Xs = +1 for the opposition pole. The rest of the nodes iteratively estimate their own
opinion Xi(t) by applying eq. 7.1 until reaching convergence (|Xi(t) − Xi(t−1)| < 10^{−3}).
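The iterative procedure can be sketched as follows. Eq. 7.1 is not restated in this section, so the update rule used here, in which every non-elite node takes the mean opinion of its neighbors at the previous step, is an assumption of this sketch, as are the function names, the initial opinion of 0 for non-elite nodes and the synchronous update. Only the fixed elite opinions (−1 and +1) and the convergence threshold of 10^{−3} are taken from the text.

```python
def undirected_neighbors(edges):
    """Adjacency sets of an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def estimate_opinions(neighbors, seeds, tol=1e-3, max_iter=10000):
    """Iterate opinions until the largest change between two steps is below tol.

    neighbors: dict node -> iterable of neighbor nodes (every node is a key).
    seeds: dict elite node -> fixed opinion (-1 officialism, +1 opposition).
    """
    x = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(max_iter):
        new_x = {}
        for node, nbrs in neighbors.items():
            if node in seeds or not nbrs:
                new_x[node] = x[node]  # elite opinions stay fixed
            else:
                # Assumed form of eq. 7.1: mean opinion of the neighbors.
                new_x[node] = sum(x[n] for n in nbrs) / len(nbrs)
        delta = max((abs(new_x[n] - x[n]) for n in x), default=0.0)
        x = new_x
        if delta < tol:
            break
    return x


# Toy chain: officialism seed - a - b - opposition seed.
chain = undirected_neighbors([("s_off", "a"), ("a", "b"), ("b", "s_opp")])
opinions = estimate_opinions(chain, {"s_off": -1.0, "s_opp": 1.0})
```

In the toy chain the two listeners settle near −1/3 and +1/3, each closer to the seed it is attached to.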
Figure 7.8: Adjacency matrices (top) and corresponding visualization (bottom) of the considered
elite networks. (A) Corresponds to the seed with Sin ≥ 10000 and ρ ≥ 0. (B) Corresponds to the
seed with Sin ≥ 1000 and ρ ≥ 0.89. (C) Corresponds to the seed with Sin ≥ 10 and ρ ≥ 0.82. Nodes
have been sorted in ascending order of their opinions Xs. The color indicates the average value
of the nodes’ opinions Xij at both sides of the edge i − j.
In Fig. 7.9 a schematic of two possible networks and their expected outcomes is presented. The
elite users have been represented as red and blue nodes in the networks of Fig. 7.9 A and
E. If polarization is present, like the case shown in the top row of Fig. 7.9, the network will
display a two island structure (Fig. 7.9 B), the adjacency matrix will display two diagonal
blocks of nodes well connected within, but segregated from each other (Fig. 7.9 C) and
the estimated opinions distribution will be bimodal (Fig. 7.9 D). Meanwhile, if there is no
polarization in the graph, like in the bottom row of Fig. 7.9, the network will present a single
island structure (Fig. 7.9 F), the adjacency matrix will display homogeneous connections
among nodes (Fig. 7.9 G) and the estimated opinion distribution will be unimodal (Fig.
7.9 H).
In order to show the model results more clearly, we have colored the edges of the adjacency
matrices in Fig. 7.9 C and G in proportion to the average opinion of the two connected nodes
(i and j), defining the opinion adjacency matrix AXij in the following way:
\[ A^{X}_{ij} = \frac{X_i + X_j}{2} \tag{7.11} \]
Red and blue dots represent edges between users of the same ideology, while pale blue
Figure 7.9: Visualization of two cases of possible retweet networks and expected outcomes. The
top row represents a polarized case and the bottom row represents a nonpolarized case. Panels
A and E show the position of the elite nodes, colored in each network. Panels B and F show
the respective networks, coloring the nodes with their estimated opinion. Panels C and G show
the opinion adjacency matrices AXij. The colored dots in the matrices represent interactions:
blue and red dots indicate interactions within the same group; pale blue and yellow dots indicate
interactions across groups. Nodes have been sorted in ascending order of their estimated
opinion Xi. Panels D and H represent the resulting opinion distributions.
and yellow dots represent interactions between nodes of different ideologies. In the polarized
case, the elite’s opinions will not mix, given the scarce number of inter-group connections, and
the resulting nodes’ opinions will gather at the extreme values. As a consequence, the matrix
will display two diagonal blocks, respectively colored in red and blue (see Fig. 7.9 C).
In contrast, in the depolarized case, the elite’s opinions will mix, given the existence of
connections between the poles, and the nodes’ opinions will gather homogeneously around a
single value such as zero. Consequently, the adjacency matrix will display a larger number of
inter-ideological interactions, shown by the non-diagonal structure of yellow and pale blue
dots (see Fig. 7.9 G).
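The construction of the opinion adjacency matrix of eq. 7.11, with nodes sorted by estimated opinion, can be sketched as below. The function name, the use of `None` for absent edges and the toy opinions are assumptions of this example.

```python
def opinion_adjacency(edges, opinions):
    """Return (order, matrix): nodes sorted by opinion and the matrix of eq. 7.11.

    matrix[i][j] holds (X_i + X_j) / 2 for connected pairs and None otherwise.
    """
    order = sorted(opinions, key=opinions.get)  # ascending estimated opinion
    index = {node: i for i, node in enumerate(order)}
    n = len(order)
    matrix = [[None] * n for _ in range(n)]
    for u, v in edges:
        value = (opinions[u] + opinions[v]) / 2.0
        matrix[index[u]][index[v]] = value
        matrix[index[v]][index[u]] = value  # undirected: symmetric entries
    return order, matrix


# Toy opinions: two officialism-leaning nodes and one opposition-leaning node.
toy_opinions = {"a": -1.0, "b": 1.0, "c": -0.8}
toy_order, toy_matrix = opinion_adjacency([("a", "c"), ("a", "b")], toy_opinions)
```

Entries near −1 or +1 correspond to the red and blue dots within a pole, while entries near 0 correspond to the pale dots that connect the two poles.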
Obtaining Opinions and Measuring Polarization
The results of applying the model to the undirected versions of the retweet networks, using
each of the three sets of elite nodes presented in section 7.3.2, are shown in Fig. 7.10.
The three elites yield similar results. During the days preceding the
announcement (from D − 29 to D − 1), X presents a bimodal distribution in which the
officialism population (negative side of the X distribution) is considerably smaller than
the opposition (positive side of the X distribution). This means that during this period
the conversation was polarized, but predominantly monopolized by the opposition. Hence,
although the pole distance reached values over 0.9 (Fig. 7.11 B), the polarization index
averaged just under 0.4 (Fig. 7.11 C). Then a shift in the emergent patterns of the conversation
took place on the day of the president’s death announcement (day D). During this day X loses
its bimodal distribution, and the resulting p(X) has a single peak closer to neutral values,
minimizing the pole distance. All this means that the conversation was not so polarized.
Therefore, the polarization index drops to µ ≈ 0.
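The interplay between the population difference ∆A, the pole distance d and the polarization index µ can be illustrated with a short sketch. The definitions are not restated in this section, so the forms used here are assumptions: ∆A is taken as the absolute difference between the fractions of nodes with positive and negative opinions, d as half the distance between the mean opinions of the two sides, and µ = (1 − ∆A) d. Under these assumed forms, a large pole distance combined with very unequal populations gives a moderate µ, which is the qualitative behavior described above.

```python
def polarization_index(opinions):
    """Return (delta_a, d, mu) for a list of opinions in [-1, 1].

    Assumed definitions (not restated in this section):
      delta_a = |A+ - A-|, with A+ and A- the fractions of positive/negative opinions
      d       = |gc+ - gc-| / 2, with gc the mean opinion of each side
      mu      = (1 - delta_a) * d
    """
    n = len(opinions)
    if n == 0:
        return 0.0, 0.0, 0.0
    pos = [x for x in opinions if x > 0]
    neg = [x for x in opinions if x < 0]
    delta_a = abs(len(pos) - len(neg)) / n
    gc_pos = sum(pos) / len(pos) if pos else 0.0
    gc_neg = sum(neg) / len(neg) if neg else 0.0
    d = abs(gc_pos - gc_neg) / 2.0
    return delta_a, d, (1.0 - delta_a) * d


balanced = polarization_index([-1.0, -1.0, 1.0, 1.0])
lopsided = polarization_index([1.0, 1.0, 1.0, -1.0])
one_sided = polarization_index([0.1, 0.1, 0.1, 0.1])
```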
The explanation for the change during day D is the abrupt growth of information cascades
when people react to critical events [BWB11]. The cascades interconnected the previously
segregated modules into a single-island structure many times bigger than the usual size
of the network. Besides, a large number of users from all around the globe joined the
conversation, making the topic international rather than local to Venezuela. During this
day the percentage of users tweeting from Venezuela (≈ 20%) was very low in comparison
to the rest of the days (around 80% on average). Hence, our set of Venezuelan elite users was
not capable of polarizing this majority of worldwide users.
After day D, the conversation gradually recovers its bimodal distribution of opinions as
attention turns back to being primarily Venezuelan. Moreover, the polarization
reaches its maximum from day D+12 (marked with the dashed line in Fig. 7.11 C) onwards,
Figure 7.10: Time evolution of estimated opinions (Xi) probability density functions (p(X)) for
the Venezuelan conversation. These distributions respectively result from applying the model to
the retweet networks using the elites No. 1 (top panel), No. 2 (middle panel) and No. 3 (bottom
panel) described in section 7.3.2. Labels indicate the day of observation, D standing for the day of
the President’s death. Colors indicate the number of participants.
Figure 7.11: Time evolution of the polarization index µ (C), and the variables associated with
it: pole distance d (B) and the difference in population sizes (A) for the Venezuelan conversation
in the undirected version of the networks. The magenta line represents the average of the results
from applying the model with the three elite sets from section 7.3.2. The gray shadow shows the
standard deviation. The orange stripe indicates the day of the main event.
the day that the new leader of the officialism entered the conversation. The new leader joined
Twitter together with a large number of new participants from the officialism, which brought the
previously asymmetrical ∆A closer to zero. From this day onwards X presents a bimodal
distribution in which the populations of both sides are similar. Therefore, the polarization
index averages values around 0.8.
We have also analyzed the opinion distributions according to their statistical properties, such
as the average, standard deviation and kurtosis. The mean value (Fig.
7.12 A) was positive until the introduction of the new leader at the dashed line. That
happened because the opposition had a larger participation than the officialism, until both
populations became equal in size and the mean value dropped to zero. Accordingly, the standard
deviation (Fig. 7.12 B) fluctuated from its lowest point during the main announcement to
its highest values during the most polarized days. Finally, the kurtosis took values typical
of bimodal distributions (below the horizontal dashed line in Fig. 7.12 C) for almost all days,
with the exception of the main announcement, when it showed a well-defined positive value,
indicating a depolarized structure.
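The three statistics can be computed as in the minimal sketch below. It assumes that the kurtosis is the excess kurtosis (fourth standardized moment minus 3), which is consistent with negative values for bimodal distributions and positive values for peaked unimodal ones; the function name and the toy samples are assumptions of this example.

```python
def opinion_moments(xs):
    """Return (mean, standard deviation, excess kurtosis) of a non-empty sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # population variance
    std = var ** 0.5
    if var == 0:
        return mean, std, float("nan")  # kurtosis undefined for constant samples
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return mean, std, m4 / var ** 2 - 3.0


# Two balanced poles at the extremes versus a sample concentrated around zero.
polarized_stats = opinion_moments([-1.0, -1.0, 1.0, 1.0])
neutral_stats = opinion_moments([0.0] * 8 + [-1.0, 1.0])
```

The polarized toy sample has zero mean, maximal standard deviation and negative excess kurtosis, while the concentrated sample has positive excess kurtosis.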
In order to further understand the relationship between the structure of the networks and
the opinions obtained, in Fig. 7.13 we present the time evolution of the opinion adjacency
matrices from the retweet networks. For this plot we have only considered the results from
the elite No. 1 from section 7.3.2. We have represented the matrices as explained in Fig. 7.9
C and G. Nodes have been ordered according to their estimated opinion Xi and edges have
been colored as dots, according to the value AXij defined in eq. 7.11.
Before the announcement (from D − 29 to D − 1) the matrices
show well-defined two-block structures, where the blue block is larger than the red block.
This means that inter-block connections (pale yellow dots) are very scarce and thus
the networks are polarized, although a single group seems to monopolize the conversation
due to its larger relative size. Then, during the week of the main announcement (from D
to D + 5) the matrix transitions from a fully connected structure to a segregated one, by
gradually reducing the inter-module connections and increasing the number of internal edges
in both modules. That stage represents the week when the event gained international relevance
and many outsiders joined the conversation. The gradual decrease of such participation is
reflected in the gradual unveiling of the polarized core of the conversation. Finally, during
the polarized days (from D + 13 to D + 25), the matrix again shows the well-defined two-block
structure, where connections within modules are abundant but connections across modules are
scarce.
This shows that although the officialism pole remarkably increased its size
Figure 7.12: Time evolution of the statistical properties of the Xi distribution in terms of (A)
Average, (B) Standard deviation and (C) Kurtosis. The orange stripe represents the day of the
main occurrence (D) and the state funeral period. The magenta line represents the average of the
results from applying the model with the three elite sets from section 7.3.2. The gray shadow
represents the standard deviation.
Figure 7.13: Time evolution of the opinion adjacency matrices AXij from the Twitter conversation
about the Venezuelan President Hugo Chavez. Nodes have been plotted in ascending order
of their estimated opinion Xi. The label indicates the day of observation (from D − 29
to D + 26). The color indicates the average value of the nodes’ opinions at both sides of the edge
i − j.
during the last stage, the networks’ structure consistently showed very few inter-modular
interactions and remained polarized.
Effects of Rewiring Edges
In order to further understand the effects of the topological properties of the networks on
the resulting opinion distributions, we have applied the opinion estimation model to rewired
versions of the undirected retweet networks. To randomize the networks, we have
rewired the edges while preserving each node’s degree. That is, we randomly exchange edges
between nodes in order to create new network configurations. Our goal is to discriminate
whether the resulting opinion distributions are the result of the effects of the elite on any
random network, or whether the actual networks show genuinely polarized structures around
the elite.
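A common way to rewire edges while keeping every node’s degree is the double edge swap: two edges (a, b) and (c, d) are picked at random and replaced by (a, d) and (c, b), unless this would create a self-loop or a duplicate edge. The sketch below illustrates that technique; it is an assumption that this matches the exact randomization procedure used here, and the function name, the `seed` parameter and the toy graph are introduced for the example only. The input is assumed to be a simple undirected graph without repeated edges.

```python
import random


def rewire_preserving_degree(edges, n_swaps, seed=None):
    """Degree-preserving randomization of a simple undirected graph by double edge swaps."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # give up eventually on graphs with few valid swaps
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # the swap would create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # the swap would duplicate an existing edge
        edge_set.difference_update({frozenset((a, b)), frozenset((c, d))})
        edge_set.update({frozenset((a, d)), frozenset((c, b))})
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Toy graph: a ring of eight nodes with three chords.
toy_graph = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (1, 5), (2, 6)]
toy_rewired = rewire_preserving_degree(toy_graph, n_swaps=20, seed=1)
```

Repeating the procedure with different seeds would give an ensemble of randomized networks over which the model results can be averaged.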
The average results of applying the opinion estimation model to 200 rewired versions of
each of the retweet networks are presented in Fig. 7.14 with dashed black lines, together with
the corresponding results from the original networks in solid green lines. The
opinion distributions from the rewired networks present a single, smoother increase
near the neutral opinion Xi = 0. This means that if edges are randomly redistributed
among nodes, the polarization in the network is lost and the resulting networks present
single-island structures. This effect is noticeable when we compare the opinion distributions
from the rewired networks with the original behavior during the most polarized days (from
D + 12 onwards). The curves show a remarkably different behavior, which means that the
way nodes are connected in these polarized structures is far from being the result of a
random configuration. Instead, such differences indicate the existence of strong correlations
and conditioning in the user behavior. In contrast, on day D, both the original and rewired
versions of the network give the same opinion distribution. Such similarity confirms
that the user interactions on this day occurred without the conditioning of political
preference, as if the nodes’ interactions happened independently and randomly.
7.3.4 Contagion by Influence
The retweet mechanism is directed by nature. The edge direction is related to the influence
that one user exerts on another. Therefore, in order to unveil the actual contagion by
influence, we apply the model to the same networks, but considering the direction of
the edges. In this way, nodes only propagate their opinions to those they directly
influenced, that is, to those who retweeted their messages.
Figure 7.14: Effects of rewiring edges on the results of the opinion estimation model. Time
evolution of the cumulative distribution functions (CDF) of the estimated opinions (Xi) resulting
from applying the opinion estimation model to the undirected networks (solid) and the
corresponding rewired versions (dashed). Columns are ordered from Monday to Sunday. The labels
indicate the corresponding day of observation, from D − 29 to D + 26, D being the day of the
President’s death announcement. The distributions for the rewired
networks represent the average over 200 realizations. These curves correspond to the results from
applying the model with the elite No. 3 described in section 7.3.2.
Figure 7.15: Time evolution of the estimated opinions (Xi) probability density functions (p(X))
for the Venezuelan conversation. Labels indicate the day of observation, D standing for the day of
the President’s death. Colors indicate the number of participants. These curves are the average of
the results from applying the model with the three elite sets from section 7.3.2.
The resulting opinion distributions, obtained by averaging the results from the three
elites presented in Table 7.1, are shown in Fig. 7.15. Almost all distributions behave
similarly to those previously obtained, when we did not consider the
direction of the edges in the networks (see Fig. 7.10). Moreover, the new distributions are
more extremely polarized, since they present a more clearly defined bimodal shape. Even
during the days where the undirected results indicated single-island structures (D + 1 to
D + 2), in the directed case we see two peaks, one at each extreme of the distribution. This
is reflected in the polarization index, which is generally higher than in the undirected case,
reaching almost 0.9 at the most polarized stage (from D+12 onwards in Fig. 7.16 C). Similarly,
the pole distance d (Fig. 7.16 B) is much closer to 1 than in the undirected case, indicating
that people’s opinions are separated by close to their maximum distance.
In order to compare the results from applying the model to both kinds of networks, we
present in Fig. 7.17 the time evolution of the cumulative distribution functions
(CDF) of the nodes’ Xi, resulting from the opinion estimation model on the directed network
(solid) and undirected network (dashed). The color indicates the kurtosis value of each
distribution, which is negative for polarized, bimodal distributions (red curves) and
positive for depolarized, unimodal distributions (from yellow to blue curves). If the
network is polarized, the distribution displays two sudden increases of users near the
extreme values, and practically no increase in the central values (see D − 29). In
contrast, if the network is not polarized, the distribution only displays a single, continuous
Figure 7.16: Time evolution of the polarization index µ (C), and the variables associated with it:
the pole distance d (B) and the difference in population sizes (A) for the Venezuelan conversation.
The magenta line represents the average of the results from applying the model with the three elite
sets from section 7.3.2. The gray shadow shows the standard deviation.
and smoother growth (see D).
The patterns of the CDF are very similar for the majority of
days in both kinds of networks. However, the distributions from the undirected version of
the networks present a smoother growth than the directed version, even when the network
is polarized (see D − 26 or D − 15). This means that the polarization of the participants as a
whole is lower than the polarization of those users directly influenced by the opinion leaders.
This is particularly noticeable during the week of the death announcement (from D to
D+5). During these days, the apparently depolarized networks contain a highly polarized
sub-network, directly influenced by the elite nodes. Therefore, the networks in general present a
highly polarized baseline embedded in the unconditioned popular interactions.
In summary, if we consider the users that are directly influenced by the elite, we see that
polarization is much stronger in the network, defining a polarized social baseline. However, if
we consider the whole network, we see that the emergent polarization is lower and sometimes
nonexistent. Therefore, in order to detect those users who are influenced the most by the
elite, we must consider the direction of the edges.
7.3.5 Offline Polarization
So far we have shown a strong polarization around Venezuelan online political discussions
on Twitter. To further understand the basis of such online polarization, in this section we
explain the relationship between the Twitter activity and the polarization present in
Venezuelan society as a whole. First, we discuss the results of the
elections called after the President’s death. Second, we show the territorial impact
of the Venezuelan polarization in social media.
Electoral Polarization
After the president died on March 5th 2013, new elections were called in Venezuela. In
these elections, the candidate from the officialism (50.6%) and the candidate from
the opposition parties (49.1%) together gathered 99.7% of the votes. This shows the high degree
of political polarization in the Venezuelan electorate. Moreover, it confirms that polarized
societies leave little space for moderate voices, as independent candidates only gathered
the remaining 0.3% of votes. Yet according to recent polls [Hin13], Venezuelan citizens not
identified with any party represent about 25% of the population, evidencing that polarization
leads to an over-representation of the most powerful groups.
In Fig. 7.18 we present the way that votes from officialism and opposition are distributed
among the population. More specifically we show the relative number of voting stations
Figure 7.17: Effects of the edges’ direction on the results of the opinion estimation model. Time
evolution of the cumulative distribution functions (CDF) of the estimated opinions (Xi) resulting
from the opinion estimation model on the directed network (solid) and undirected network (dashed).
Columns are ordered from Monday to Sunday. The color indicates the kurtosis values of the
distributions. The labels indicate the corresponding day of observation, from D − 29 to D + 26,
D being the day of the President’s death announcement. These curves are the average of the results
from applying the model with the three elite sets from section 7.3.2.
Figure 7.18: Electoral polarization in Venezuela. Distribution of voting stations according to the
winning party and the location of the station, for the 2013 Venezuelan Presidential elections.
where the officialism (red) or the opposition (blue) won, according to the geographical
location of the voting station. The location is an indirect measure of socio-economic level,
since we are able to classify voting stations in the following way:
• Rural: Mostly poor inland villages [IFfAD09].
• Urban informal: Informal settlements in cities, or slums [UH03].
• Undefined: Urban areas that might or might not be considered slums.
• Urban formal: Proper urban neighbourhoods, from middle class up.
• Abroad: Venezuelan emigrants voting at consulates and embassies, who
tend to be people from higher classes [Fre11].
We see that there is a strong correlation between the voting patterns and the economic level
of the voter, since the officialism widely wins at the voting stations placed in poorer areas
(located at the left side), while the contrary occurs with the opposition, which gets stronger
as we consider the wealthier regions (located at the right side).
This result shows how political support in Venezuela is completely channeled by
the two major options, which find their voters in a mutually exclusive way. The voting
preferences appear aligned with social class. Of course, as Fidel Castro famously said to
Hugo Chavez after having lost the 2007 Referendum: “there are not 4 million oligarchs in
Venezuela”, which means that the opposition also finds space in the poorer areas. In fact, the
disproportionate rejection that the officialism gets in the wealthiest regions has
been reported to be stronger than the disproportionate support it receives from
lower classes [Lup10], which is also noticeable in Fig. 7.18.
Territorial Polarization
To further understand the relationship between our findings on Twitter and the electoral
results, in this section we explore the territorial distribution of the analyzed
interactions. More specifically, we analyze the way these messages were posted in the capital city
of Venezuela, Caracas, taking into account only the tweets from the most polarized days
presented in section 7.3.3. In Fig. 7.19 we present the map of the five municipalities that
make up the city, bordered in green. The labels correspond to the municipality names and
the color indicates the party of the respective mayor: the officialism in Libertador and
the opposition in Chacao, Sucre, Baruta and El Hatillo, according to the 2013 Venezuelan
local elections.
In the map we have colored the urbanized areas in yellow and the informally
populated regions (slums) in pink. The contour lines represent the location of the mass of messages
identified with each ideology. These contours correspond to the electoral
results: the municipalities governed by the opposition contain the highest
concentration of users identified with this pole, and the same effect occurs on the officialism
side of the political spectrum. Moreover, the area with the highest concentration of users
aligned with the officialism corresponds to the part of the city with the largest concentration
of informal and poorer neighborhoods (pink areas), while the opposition
users are concentrated in the region of highest formal urban development.
This result shows that the political conflict in Venezuela has a strong territorial
facet. Territorial segregation is related to the degree of intolerance of people towards coexisting
with those who are different [Sch71]. The consequences of such territorial polarization have
been reported to be highly harmful for city life [GG03], as public spaces become political
insignia and free circulation is affected by the fear of being identified as an opponent. As
a result, the city loses its role as a place of social encounter and gives way to a warlike language,
where spaces are no longer democratic, but territories of the parties to a conflict.
7.3.6 Discussion
Venezuela has shown considerable evidence of polarization in multiple social dimensions.
The political and electoral polarization is accompanied by strong class and territorial polarization as
well. These types of polarization are well reflected in the Twitter activity, which is distributed
accordingly in geographical and socioeconomic terms. These social arrangements are not
isolated from each other; instead, there is a strong relationship between them.
The poorer informal neighborhoods emerged in Caracas, and in practically all of Latin
America, during the twentieth century, due to migrations from rural to urban areas [Gal73].
Migrants were looking for employment and a better life, which they did not always find,
and over time the social gap between people living in the same city widened to astonishing
levels [Ber97]. For instance, in Rio de Janeiro, Brazil, some neighbourhoods present
Human Development Indexes (HDI) similar to those of Northern Europe, while others a few
kilometres away show Sub-Saharan equivalents¹.
It is known that the larger the income gap is, the stronger the resulting political
polarization [MPR02]. In Venezuela, however, several other kinds of social segregation processes
took place at the same time, increasing the divergence of people's criteria. The consequent
conditioning of the inhabitants due to HDI differences turned the society into two well-differentiated
populations, even with territorial borders. For many authors, this social segregation served
as the basis for the political polarization catalyzed by the late President Hugo Chavez
[EH02].
7.4 Summary
In this chapter, we have proposed a methodology to detect political polarization in social
networks. The methodology consists of a contagion model to infer people's opinions and a
new index to measure the degree of polarization in the opinions obtained. We apply this
methodology to detect polarization in user interactions on the online social network Twitter,
around a conversation of political interest, such as the announcement of the sudden death
of a nation’s President in office. We found that the conversation was polarized due to the
influence of an elite of opinion leaders.
¹ Tabela N 1172, http://portalgeo.rio.rj.gov.br/
Figure 7.19: Mass of tweets in the city of Caracas. Contour levels (from inside to outside: 0.25,
0.20, 0.15, 0.10) represent the mass of tweets identified as in favor of the government (red) and
against it (blue). Areas bordered in green correspond to the five municipalities that make up the
city. White regions display unpopulated areas, yellow regions represent populated areas and pink
regions correspond to the informal and poorer neighborhoods (slums). The label color indicates the
ruling party in each municipality, according to the 2013 Venezuelan local elections: red represents
the officialism party in Libertador and blue indicates opposition parties in Chacao, Sucre, Baruta
and El Hatillo.
Chapter 8
URBAN COLLECTIVE PATTERNS
In this chapter we explore urban dynamical patterns around the world. We analyze geolocated
Twitter activity to characterize the cyclical behavior of urban routines. We found that
urban rhythms can be classified into three kinds of behavior, determined by the combination
of morning and afternoon activity.
Recent studies have found that individual activities combine into regular cycles of collective
behavior [CGW+08, PSR12]. Such patterns of collective behavior are also found in the
biological activity of living organisms, like heartbeats or respiration. This synchrony is not
simply due to external factors like light and dark, or to biological factors like circadian
rhythms. It arises out of complex relationships and fulfills a particular function in society, which
has great economic and social benefits. Our economic system is based upon the contributions
of multiple workers: the completion of tasks within a given time frame depends upon the
availability of other workers, either simultaneously or in the correct sequence [VDAVH04].
The functioning of complex systems, like human societies, depends not only upon the
functionalities of their members but also upon the coordination of people's actions. Many
important societal aspects, such as economic activities, could not develop
if individuals behaved independently from each other. Although people seem to behave randomly
and unpredictably, this does not mean that their actions are independent from each
other. Collective activities can only take place when there are interdependencies among
individual actions. Such interdependencies condition people's decisions and diminish
individuals' freedom of will, in order to favor the system's ability to gain capabilities as a
whole.
Figure 8.1: World Twitter Activity. Geographical density of Twitter activity (number of tweets)
during one average day, in logarithmic scale. Red and orange indicate a high concentration of
activity, while blue and green indicate a lower concentration of tweets, and black indicates the
absence of activity. Insets: Average week of Twitter activity in several cities ($a_{c,d}(t)$).
8.1 World Activity
We first analyzed the geographical distribution of the world activity (see Video B.3,
described in Appendix B). We built a map with a representative day of activity by
averaging the number of geolocated tweets across latitudes and longitudes. For this purpose,
we defined a matrix $T_{ij}$ that aggregates the geolocations of the messages in a grid of
0.25 squared degrees of spatial resolution per hour. We map the coordinates
$(\mathrm{lon}_m, \mathrm{lat}_m)$ of each message $m$ to indexes $i, j$ in the matrix as:
\[ i = \lfloor 4(\mathrm{lon}_m + 180) \rfloor \tag{8.1a} \]
\[ j = \lfloor 4(\mathrm{lat}_m + 90) \rfloor \tag{8.1b} \]
where $\lfloor \cdot \rfloor$ represents the floor function. Then, we count all the tweets that fall into
each cell.
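The mapping of eq. 8.1 can be illustrated with a minimal sketch. The function names, the `cells_per_degree` parameter and the sample coordinates are assumptions introduced here for illustration; they are not taken from the original analysis code.

```python
import math
from collections import Counter

def grid_index(lon, lat, cells_per_degree=4):
    """Map a (lon, lat) pair in degrees to grid indexes (i, j), following eq. 8.1."""
    i = math.floor(cells_per_degree * (lon + 180.0))
    j = math.floor(cells_per_degree * (lat + 90.0))
    return i, j

def count_tweets(coordinates):
    """Count how many (lon, lat) points fall into each grid cell."""
    return Counter(grid_index(lon, lat) for lon, lat in coordinates)

# Hypothetical coordinates: two points near Madrid and one near New York City.
sample = [(-3.70, 40.42), (-3.71, 40.41), (-74.01, 40.71)]
cells = count_tweets(sample)
```

With these sample points, the two Madrid coordinates share one cell and the New York City point falls into a different one.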
We aggregated the tweets for each week $w$ of the observation period and each day $d$ of
the week, and built a respective matrix $T_{ij}$ for each hour $t$ of the day. Then, we aggregated
all the hourly grids, $T_{ij,d,w}(t)$, into daily grids $T_{ij,d,w} = \sum_t T_{ij,d,w}(t)$ that contain the activity
of each day of the observation period. Finally, we averaged across all days and weeks of
the observation period and built an average daily grid, $T'_{ij}$, in the following way:
\[ T'_{ij} = \frac{1}{W} \frac{1}{D} \sum_{w} \sum_{d} T_{ij,d,w} \tag{8.2} \]
where $W$ is the total number of weeks in the observation period and $D$ is the number
of days in each week (7).
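The aggregation of eq. 8.2 can be sketched as follows. The data layout (a dictionary keyed by week, day, hour and grid cell) and the function name are assumptions made for this example, not the layout used in the thesis.

```python
from collections import Counter

def average_daily_grid(hourly_counts, n_weeks, n_days=7):
    """Return the average daily grid T'_ij of eq. 8.2 as a dict {(i, j): value}.

    hourly_counts maps (w, d, t, i, j) -> number of tweets (hypothetical layout).
    """
    total = Counter()
    for (w, d, t, i, j), n in hourly_counts.items():
        total[(i, j)] += n  # accumulates the sums over t, d and w
    return {cell: n / (n_weeks * n_days) for cell, n in total.items()}

# Hypothetical counts for two weeks of observation.
counts = {
    (0, 0, 10, 705, 521): 14,
    (1, 3, 11, 705, 521): 14,
    (0, 1, 9, 0, 0): 7,
}
grid = average_daily_grid(counts, n_weeks=2)
```

Cells that never receive a tweet are simply absent from the result, which corresponds to the black regions of the map.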
In Fig. 8.1 we show the resulting geographical density of tweets during the average day.
Red and orange regions indicate a high concentration of activity, while blue and green regions
indicate a lower concentration of tweets. Black regions indicate the absence of activity.
Twitter is not used homogeneously across the world. Regions like the
Americas, Europe, the Middle East and South-East Asia seem to concentrate many more Twitter
users than countries like China or India, which present much less activity than expected
for their large populations. Moreover, we can notice the different demographic densities.
For instance, in the US, vast black void regions in the west of the country coexist with
densely populated red regions in the east. That effect is also noticeable in Europe, where
the west is much more active than the east, as well as in Korea, where north and south
present remarkable differences. In fact, the red spots indicate the presence of active large
and medium cities. Next, we analyze some of these cities by aggregating their localized
behavior into temporal series.
8.2 Urban Dynamics
We have analyzed the dynamics of 52 main cities across the world by looking at the variation
of the number of tweets per hour. For this purpose, we built a temporal series representing
an average week of Twitter activity per city $c$: $a_{c,d}(t)$. An average week is composed of
representative days $d$ (from 1 to 7), each of which is composed of hours $t$ (from 0 to 23).
In order to build $a_{c,d}(t)$, we first determined the grid slots that cover the city,
according to the city coordinates and eq. 8.1. Then we sequentially collected the number of
tweets in the selected slots and built a temporal series of tweets per hour, $n_{c,d,w}(t)$, where $w$
indexes the observed weeks (total $W$). The number of tweets $n_{c,d,w}(t)$
from city $c$, in hour $t$ of day $d$ and week $w$, was normalized according to:
\[ n'_{c,d,w}(t) = \frac{n_{c,d,w}(t) - \langle n_{c,d,w}(t) \rangle}{\sigma(n_{c,d,w}(t))} \tag{8.3} \]
where $\langle n_{c,d,w}(t) \rangle = (1/24) \sum_t n_{c,d,w}(t)$ is the average and $\sigma(n_{c,d,w}(t))$ is the standard
deviation. The Twitter activity of the representative week, made of seven representative days $d$,
was given by:
\[ a_{c,d}(t) = \frac{1}{W} \sum_{w} n'_{c,d,w}(t) \tag{8.4} \]
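A minimal sketch of eqs. 8.3 and 8.4 for a single city is given below. The nested-list layout, the function names, the use of the population standard deviation and the treatment of days with constant activity are assumptions of this example; the text does not specify them.

```python
from statistics import mean, pstdev

def normalize_day(counts):
    """Standardize the 24 hourly counts of one day (eq. 8.3)."""
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0] * len(counts)  # assumption: a flat day maps to zeros
    return [(n - mu) / sigma for n in counts]

def average_week(counts_by_week):
    """Average the normalized series over weeks (eq. 8.4).

    counts_by_week[w][d] is the list of 24 hourly counts of day d in week w.
    Returns a[d][t] for d in 0..6 and t in 0..23.
    """
    n_weeks = len(counts_by_week)
    normalized = [[normalize_day(day) for day in week] for week in counts_by_week]
    return [
        [sum(normalized[w][d][t] for w in range(n_weeks)) / n_weeks for t in range(24)]
        for d in range(7)
    ]

# Hypothetical data: the second week has twice the volume of the first one.
day = list(range(24))
weeks = [[day] * 7, [[2 * n for n in day]] * 7]
a = average_week(weeks)
```

Because each day is standardized before averaging, weeks with very different tweet volumes contribute equally to the representative week.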
In Fig. 8.2 we show the temporal behavior of all these cities; some of them are also
shown as insets in Fig. 8.1. All series cycle between valleys and peaks
of activity during weekdays. The valleys of activity occur in the early morning hours, when
most people are sleeping, while the peaks of activity occur during the day, either in the
morning or in the afternoon, while people go to work or return home. Depending on the height
of these peaks, we have identified different kinds of behavior. Some cities, like New York
City or Jakarta, display a single large peak (green curves). Other cities, like Sao Paulo or
Santiago, show several small peaks of activity during the morning before a large peak in the
afternoon (blue curves). Finally, cities like London or Moscow display two peaks of activity
of similar size (yellow curves).
8.3 Dynamical Classes of Behavior
In order to further understand the dynamical patterns of the cities, we applied clustering
and multidimensional scaling algorithms to the temporal series. Specifically, we applied
the k-means algorithm in order to find clusters of cities' temporal series [Mac67]. For this
purpose, we interpret each hour of the temporal series as an independent dimension, so that
each city represents a single point in a multidimensional space (24 × 7 = 168 dimensions). The clustering
algorithm groups cities that have similar behavior, and are thus closer to each other,
and separates them from those that do not share the same behavior, and are thus farther apart. In order to find
the best number of clusters, we calculated the silhouette profile [Rou87] and found that it
is maximized at 3 clusters.
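The procedure can be sketched with a small, self-contained example. This is only an illustration under stated assumptions: it uses a deterministic farthest-first initialization rather than random seeds, all function names are hypothetical, and the toy points are two-dimensional instead of the 168-dimensional city series.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, n_iter=100):
    """Lloyd's k-means; returns one cluster label per point."""
    # Assumption: farthest-first initialization, to keep the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append([sum(x) / len(members) for x in zip(*members)])
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def mean_silhouette(points, labels):
    """Average silhouette width over all points (requires at least 2 clusters)."""
    clusters = sorted(set(labels))
    total = 0.0
    for p, l in zip(points, labels):
        own = [dist(p, q) for q, m in zip(points, labels) if m == l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(own) / (len(own) - 1)  # own includes dist(p, p) = 0
        b = min(
            sum(dist(p, q) for q, m in zip(points, labels) if m == c) / labels.count(c)
            for c in clusters if c != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidates=range(2, 7)):
    """Number of clusters that maximizes the average silhouette."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get)

# Toy data: three well-separated groups of points.
toy = [[0, 0], [0, 1], [1, 0], [10, 0], [10, 1], [11, 0], [0, 10], [0, 11], [1, 10]]
```

For the city analysis, each point would be the 168-value series $a_{c,d}(t)$ of one city, flattened over days and hours.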
The average behavior of the three clusters is shown in the top panel of Fig. 8.3. Colors
correspond to the clustering results. The difference between the three classes is due to the
combination of morning and afternoon peaks, respectively marked with a red x symbol and a red
circle. Concretely, we found the following behaviors:
1. The third class (Fig. 8.3 A) presents two large peaks of similar sizes: one in the
morning (red x symbol) and another one in the afternoon (red circle).
2. The second class (Fig. 8.3 B) presents a medium-sized peak in the morning (red
x symbol), followed by a very large peak in the afternoon (red circle).
3. The first class (Fig. 8.3 C) presents an almost imperceptible small peak in the morning
(red x symbol) and a very large peak in the afternoon (red circle).
In order to visualize these clusters, we performed a dimensionality reduction based
on multidimensional scaling (MDS) [BG05]. The results are shown in the bottom panel of
Fig. 8.3. The MDS algorithm projects the points from the multidimensional space into a
two-dimensional one, while preserving the distances between the elements as closely as possible. The new dimensions
do not necessarily have a physical meaning. However, we interpret them as
modality in the daily pattern (x-axis) and symmetry (y-axis). The cluster on the left (green)
is highly symmetrical and presents a single peak, while the cluster on the right (yellow) is symmetrical
and presents two peaks. The third cluster (blue) is not symmetrical and presents a
larger afternoon peak than morning peak.
It is remarkable that these clusters share cultural and regional affinity. Looking at
the series in the insets of Fig. 8.1 and in Fig. 8.2, we can see that the clustering
results (shown by the colors) are related to geography and culture. For instance, most
European and African cities are in the yellow cluster, while North American and East Asian
cities are in the green cluster, and the blue cluster mainly corresponds to South American
cities.
8.4 Summary
In summary, we have seen that the Twitter activity from urban areas has a pulsing behavior,
due to the cycles of work, recreation and sleep. We found that there are three classes of
behavior, based on the combination of morning and afternoon peaks of activity.
Figure 8.2: Temporal behavior of 52 cities across all continents. Each series shows the representative
week of Twitter activity for one city ($a_{c,d}(t)$). Color indicates the result of the clustering classifier.
Figure 8.3: Clustering of cities according to their temporal behavior. Colors indicate the results
of the k-means clustering algorithm. Axes correspond to collapsed dimensions obtained using multidimensional
scaling algorithms. On the top panel we show the average behavior of each class (from A to C).
We have marked the morning and afternoon peaks of activity with a red x symbol and
a circle, respectively.
Chapter 9
INFERRING HUMAN BEHAVIOR
FROM MOBILE PHONE DATA
The analysis of human data exhaust to improve social well-being is a very timely subject that
has attracted the attention of many researchers, as well as governmental and international
organizations, in recent years. In countries with limited economic resources, these
sources of information represent opportunities to gain intelligence about their social systems
without the need to deploy expensive fieldwork. For instance, mobile phone data or
Call Detail Records (CDR) proved to be an accurate source of data to estimate human
migrations after the cholera outbreaks in Haiti in 2010 [BLT+11]. In Kenya, a similar
approach may remarkably reduce the spread of contagious diseases like malaria [WET+12]
by identifying sources and sinks of human displacements. Furthermore, recent studies using
CDR data have shown the ability to measure the impact of earthquakes on communication
patterns [BLT+11, MFMFM13] and to build predictive models of potential areas of disruption
following an earthquake [KEH10]. These studies are very important, since their results may
benefit a large part of the human population by improving the efficacy and
efficiency of governmental strategic planning processes.
In this chapter, we infer human behavioral patterns from CDR data. We first study the
communication patterns in a developing country, by looking into how regional areas interact
with each other [MCB+ss, MCB+13]. Then, we explore the potential of CDR analysis, in
order to measure the impact of natural disasters on people’s behavior. For this purpose, we
develop a framework to combine CDR data with other data sources, in order to characterize
communication patterns and to detect abnormal variations in the usual behavior [PMT+14].
9.1 Characterizing Communication and Mobility Patterns in a Developing Country
In this section, we analyze mobile phone data to understand the structure of regional and
ethnic interactions in Ivory Coast [MCB+ss, MCB+13]. We construct and analyze complex
social networks at several layers of interactions, such as calling activity and human mobility.
We show the role of underlying forces, like culture or economy, that influence and determine
the Ivorian regional and national communication patterns.
9.1.1 Context
In recent decades, African countries have gone through several armed conflicts among
different ethnic and religious groups. The borders arbitrarily traced by Europeans for the
administrative convenience of the former colonial order split and joined ethnic groups into new
countries, forcing them to coexist within previously nonexistent frontiers. Asymmetries in
economic and geographical benefits between different ethnic groups have led some countries
to different levels of social polarization, which have eventually resulted in civil wars. Recent
studies have shown that violence emerges between ethnic groups when their territories are
not well defined [ML07], or when a group is large enough to prevail over others,
but not strong enough to maintain order. Ivory Coast is no exception.
In less than two decades the Ivorians have engaged in two internal armed conflicts, due
to asymmetries between the country's inhabitants. Therefore, characterizing and understanding
their ethnic relationships is crucial to consolidate peace and to strengthen the social
cohesion needed for any further economic development.
Ivory Coast presents a complex society composed of more than 60 different ethnic groups.
Although French is the official and most broadly spoken language across the country, each
ethnic group has its own native language. These many diverse languages are classified
into four large linguistic families: Kwa, Kru, Mande and Gur [Lew09]. The territories of
these four linguistic families are well defined in the four quadrants of the country, as shown
in Fig. 9.1.
In summary, the Kwa group is located in the southeast of the country. This is
the most economically developed region, where the capital city and other major cities are
located, as well as the main Ivorian airport and seaport. The Kru group is located in the
southwest, also on the Atlantic coast. The second seaport of Ivory Coast is located in
this region, which brings economic benefits to these people. The Mande group is found in
the northeast of the country, and the northwest region is occupied by the Gur family.
The northern regions occupied by the Mande and Gur groups are the least populated
and least economically developed areas of the country.
Figure 9.1: Ethno-linguistic map of Ivory Coast. Figure adapted from [Lew09].
Network                Nodes   Edges       Density   Clustering
Calls Network          1,215   1,284,311   0.87      0.95
Trajectories Network   1,215   187,102     0.13      0.58
Table 9.1: Properties of the Calls and Human Trajectories Networks.
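As a quick consistency check, the density values in Table 9.1 can be recovered from the node and edge counts. The sketch assumes the usual definition of density for a directed network without self-loops (edges divided by N(N-1)); the text does not state which definition was used, and the function name is hypothetical.

```python
def directed_density(n_nodes, n_edges):
    """Fraction of the N(N-1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))

calls_density = directed_density(1215, 1284311)
trajectories_density = directed_density(1215, 187102)
```

Rounded to two decimals, both values agree with the Density column of the table.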
9.1.2 Characterizing Populated Areas
In order to characterize populated areas in Ivory Coast, we studied the structure of the human
trajectories network at the meso-scale level. This network displays people's mobility
patterns within a given territory. It is built from the aggregation of individual trajectories.
Each trajectory is defined as the sequence of antennas that served a particular user
over time. Antennas represent nodes, and an edge is created between two antennas, i and j,
if a user makes two consecutive calls, first from antenna i and later from antenna j. The
edges are directed, from i to j, and weighted according to the total number of times that users
performed the same trajectory. The resulting network has 1,215 nodes and 187,102 edges.
It is a sparse network with a high clustering coefficient (see Table 9.1). A visualization of
the dynamical growth of this graph during an arbitrary day is presented in Video B.4,
described in Appendix B.
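The construction of the weighted, directed edges can be sketched as follows. The record format (user, timestamp, antenna), the function name and the choice of ignoring consecutive calls made from the same antenna are assumptions of this example, not details given in the text.

```python
from collections import Counter

def trajectory_edges(calls):
    """Build weighted directed edges {(i, j): weight} from call records.

    calls is an iterable of (user, timestamp, antenna) tuples (hypothetical format).
    """
    by_user = {}
    for user, timestamp, antenna in calls:
        by_user.setdefault(user, []).append((timestamp, antenna))
    edges = Counter()
    for events in by_user.values():
        events.sort()  # order each user's calls in time
        for (_, i), (_, j) in zip(events, events[1:]):
            if i != j:  # assumption: staying at the same antenna adds no edge
                edges[(i, j)] += 1
    return edges

# Hypothetical records for two users and two antennas, A and B.
records = [
    ("u1", 1, "A"), ("u1", 2, "B"), ("u1", 3, "B"), ("u1", 4, "A"),
    ("u2", 1, "A"), ("u2", 5, "B"),
]
edges = trajectory_edges(records)
```

In this toy example the trajectory from A to B is observed twice (once per user) and the one from B to A once.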
By applying a community detection algorithm based on modularity optimization [BGLL08],
we found that the trajectories network can be partitioned into 100 network communities, which
are shown in Fig. 9.2 together with the map of Ivory Coast. Communities cover a
limited territorial area, not necessarily contained within the same regional borders, and are
related to urban and rural settlements. There is a larger density of
antennas and communities in the south of the country, while in the north scattered
antennas form a few communities. This difference in the density of antennas and communities
is consistent with demographic information that reports the south of Ivory
Coast as more densely populated.
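To make the optimized quantity concrete, the sketch below computes the modularity Q of a given partition. It is not the optimization algorithm of [BGLL08] itself, and it assumes an undirected weighted graph for simplicity, whereas the trajectories network is directed; the function name and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges maps (u, v) -> weight, with each undirected edge listed once;
    community maps node -> community label.
    """
    m = sum(edges.values())  # total edge weight
    internal = {}            # weight of edges inside each community
    degree = {}              # summed node degrees of each community
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + w
        degree[cv] = degree.get(cv, 0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + w
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

# Toy graph: two triangles joined by a single edge.
toy_edges = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (3, 4): 1, (4, 5): 1, (3, 5): 1, (2, 3): 1}
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
```

Splitting the toy graph into its two triangles gives a higher Q than keeping all nodes in one community, which is the kind of gain a modularity-optimization method searches for.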
The density of edges also displays the same structure. A snapshot of the trajectories
network is presented in Fig. 9.3. The nodes are located at the antennas' geographical
coordinates and edges are colored in blue. The width of each edge is proportional to the
edge's weight, which means that the most intense edges represent the most
frequently used trajectories. The main cities (black circles) and southern regions concentrate a larger
number of edges than the north, indicating a remarkable difference in the amount
of human displacements between the two regions. Apart from demographic density, these
patterns also result from the underlying infrastructure and economic activity. In Fig. 9.3
we have superimposed the main roads of the country in red. Most trajectories show
a remarkable correspondence with the available roads. Some of them seem to be more frequently
used, like the ones linking the north with the south of the country, while others are less
frequently used, like the transverse road in the north. The fact that some infrastructures
are more frequently used than others may be a consequence of the fact that the region with more activity
shown in Fig. 9.3 corresponds to the zone of cocoa plantations. Ivory Coast is the largest
cocoa producer in the world, with 36% of the global share [ICO12].
Figure 9.2: Mapping the community structure of the trajectories network of Ivory Coast. Antennas
represent nodes and are plotted in different colors and shapes, according to the community
they belong to, as obtained from the community detection algorithm.
Figure 9.3: Mapping the structure of the trajectories network on the Ivory Coast geographical
map. The blue lines represent the edges of the network and their width is proportional to the edge
weight. The main roads of Ivory Coast are superimposed as red lines. The locations
of the country's main cities are marked with black circles.
Figure 9.4: Mapping the closeness-centrality property of the trajectories network in Ivory Coast.
The edges have been colored according to the mean closeness centrality of the two connected
nodes. The red regions indicate higher closeness-centrality, the yellow and pale blue regions indicate
medium centrality, and the dark blue regions indicate lower closeness-centrality.
Figure 9.5: Mapping the linguistic identity of the trajectories network of Ivory Coast. The edges
have been colored according to the linguistic group to which the most connected antenna of each
community belongs. There are four major linguistic families, represented in yellow (northwest),
purple (northeast), green (southwest) and blue (southeast). Black circles indicate the location of
the major cities.
It has been stated that the economic development of large regions can be characterized
and understood by means of cellphone activity patterns [EMC10]. Accordingly, in this
study we have analyzed the closeness centrality of the antennas in the trajectories
network. This network property is inversely proportional to the average distance from a
node to the rest of the network in terms of connections. It provides information about the
central or peripheral behavior of nodes or regions according to all human displacements.
In Fig. 9.4 we present the trajectories network, coloring the edges according to the mean
closeness centrality of the antennas they connect. Red regions are highly central, yellow and pale blue
regions are intermediate, and dark blue regions are peripheral. The
most central area (red) corresponds to the main city and its adjoining regions, while the
most peripheral regions are located in the north and west (blue). This is in agreement
with international reports [ECd08] that identify the north and the west of the country as
the least developed areas.
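The closeness centrality described above can be sketched with a breadth-first search. The example assumes unweighted hop distances and only averages over the nodes reachable from the source; the adjacency-list format and the function names are hypothetical, and the thesis does not state which variant was used.

```python
from collections import deque

def closeness(adj, source):
    """Inverse of the average hop distance from source to the nodes it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def edge_value(adj, i, j):
    """Mean closeness of the two endpoints, as used to color the edges in Fig. 9.4."""
    return (closeness(adj, i) + closeness(adj, j)) / 2

# Toy network: a chain of three antennas, A - B - C, linked in both directions.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
```

In the chain, the middle antenna B reaches every other node in one hop and is therefore the most central, while A and C are peripheral.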
9.1.3 Ethnic Interactions
In order to understand the ethnic composition of this graph, we have taken into account
the ethnic and linguistic identity of each network community. For this purpose, we mapped
each community to its geographically closest ethnic group, according to the location of the
community's most connected antenna and the ethno-linguistic map shown in Fig. 9.1. In Fig.
9.5 we present the trajectories network, coloring the edges according to the linguistic family.
The most densely connected areas, like the capital city or the cities in the
center of the country (black circles), concentrate links from different linguistic areas, while
most regions mainly present trajectories within their own linguistic family.
After mapping the ethnic groups, we constructed a second network that takes into account
a new layer of interaction: the antenna-to-antenna calling information (see
section 4.3.1). In this network, the nodes represent the 100 communities found in the trajectories
network (see section 9.1.2), whose ethnic identity is already known. The edges
correspond to the calls made from one community to another. The edge direction
goes from the emitter community to the receiver community, and the weight is equal to
the number of calls found in the datasets.
In order to get a clearer view of the way that ethnic groups communicate with each other,
we present in Fig. 9.6 A the weighted adjacency matrix of the ethnic groups calling network,
normalized by row. This normalization provides relative information about the destination
and origin of outgoing and incoming calls by group. The diagonal entries of the matrix are
higher than the other elements, indicating that most outgoing calls remain in the same
community. In fact, the preference of people to communicate with similar ones increases with
the scale of aggregation. When we aggregate the communities by ethnic group and linguistic
family (Fig. 9.6 B and C), the assortativity coefficient [New03a] of each matrix increases from
r ∼ 0.5 to r ∼ 0.8 (Fig. 9.6 D), with r = 1 being the case of absolute segregation. This increase
indicates that there is a higher segregation between ethnic groups when we consider their
linguistic family.
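Both operations can be sketched on a small matrix of call counts. The assortativity formula used here is the standard one for a mixing matrix, r = (Tr e − Σ a_i b_i) / (1 − Σ a_i b_i), with e the matrix normalized to sum to one; computing it on raw call counts, as well as the function names and the toy values, are assumptions of this example.

```python
def row_normalize(matrix):
    """Divide each row by its sum, so that entries are shares of outgoing calls."""
    out = []
    for row in matrix:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out

def assortativity(matrix):
    """Assortativity coefficient r of a square mixing matrix of call counts."""
    total = sum(sum(row) for row in matrix)
    n = len(matrix)
    e = [[x / total for x in row] for row in matrix]
    trace = sum(e[i][i] for i in range(n))
    a = [sum(e[i]) for i in range(n)]                       # share of calls leaving group i
    b = [sum(e[i][j] for i in range(n)) for j in range(n)]  # share of calls reaching group j
    ab = sum(x * y for x, y in zip(a, b))
    return (trace - ab) / (1 - ab)

# Hypothetical call counts between two groups.
calls = [[8, 2], [2, 8]]
shares = row_normalize(calls)
r = assortativity(calls)
```

A purely diagonal matrix gives r = 1 (absolute segregation), while a matrix with identical entries gives r = 0.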
Moreover, not all families behave the same way. The southern families (numbers 1 and 2
in Fig. 9.6 C) present a larger proportion of calls directed to their own linguistic family, in
comparison to the northern families (numbers 3 and 4 in Fig. 9.6 C), whose activity directed
to other linguistic families is relatively larger. In Fig. 9.7, we present the intra-family
flux (calls directed to the same linguistic family) and inter-family flux (calls directed to a
different linguistic family) of calls. In the figure, the symbols represent communities from
the trajectories network and the color corresponds to the linguistic family they belong to.
The further a community is located below the dashed line of slope 1, the higher its
internal family traffic in comparison to its external traffic. Most of the southern ethnic groups
(blue and green dots) are farther from the diagonal line than the northern ones (yellow and
red dots). This means that the internal traffic in southern ethnic groups is much higher
than their external one, while in northern families the external traffic is comparable to
the internal one.
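The two fluxes compared in Fig. 9.7 can be computed per community as in the sketch below. The dictionary formats, the function name and the toy values are assumptions made for illustration.

```python
def family_flux(calls, family):
    """Split each community's outgoing calls into intra- and inter-family flux.

    calls maps (source_community, target_community) -> number of calls;
    family maps community -> linguistic family (hypothetical formats).
    """
    intra, inter = {}, {}
    for (src, dst), n in calls.items():
        bucket = intra if family[src] == family[dst] else inter
        bucket[src] = bucket.get(src, 0) + n
    return intra, inter

# Hypothetical example with three communities in two families.
toy_calls = {
    ("c1", "c1"): 50, ("c1", "c2"): 30, ("c1", "c3"): 5,
    ("c3", "c1"): 20, ("c3", "c3"): 25,
}
toy_family = {"c1": "Kwa", "c2": "Kwa", "c3": "Gur"}
intra, inter = family_flux(toy_calls, toy_family)
```

In this toy case, community c1 keeps most of its traffic inside its own family, while the internal and external traffic of c3 are of comparable size.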
The external calling traffic from the northern ethnic groups is directed selectively towards
their adjoining southern families. In Fig. 9.6 C, we see that families 1 and 4 are more densely
connected with each other than with the rest of the families. The same happens with families
2 and 3, which are also more connected with each other than with the rest of the families.
This observation is in good agreement with the mobility patterns shown in Fig. 9.3, where
the vertical roads seem to have a higher significance than the horizontal ones, as well as with
the patterns shown in Fig. 9.5, where we showed that the mobility of the northern families
towards the south is stronger with the adjoining regions.
Figure 9.6: Normalized adjacency matrices of the calls network corresponding to the community
structure from the trajectories network (A), ethnic group aggregation (B) and linguistic family
aggregation (C). Assortativity coefficient of selectiveness to call on the local scale (community),
subregional scale (ethnic group) and regional scale (linguistic family) (D).
Figure 9.7: Scatter plot of intra linguistic family flux (calls directed to an antenna in the same
linguistic family as the emitter antenna) versus inter linguistic family flux (calls directed to an
antenna in a different linguistic family than the emitter antenna). Symbols represent communities
from the trajectories network and the color indicates the linguistic family to which the community
belongs. The dashed line has slope 1.
9.1.4 Effects of Selectiveness in the Calling Behavior
To further understand the selectiveness in the communication patterns between the east and
west sides of the country, we built a third network taking into account another layer of social
interactions. Specifically, we built a network from the calling behavior at the tower level,
extracting only information from the first dataset described in section 4.3.1. The nodes in
this network also represent single antennas, and an edge is created from antenna i to
antenna j when a user served by antenna i makes a call to another user
served by antenna j. The result is a directed and weighted network, where
the weight of each edge represents the total number of calls made from antenna i to
antenna j over the whole observation period. It is almost a fully connected network, with
an extremely high clustering coefficient (see Table 9.1).
The calls network is composed of 19 communities of antennas, according to the
modularity optimization algorithm [BGLL08]. The distribution of these communities across
the geography of Ivory Coast is shown in Fig. 9.8. The communities show a relationship with
the administrative areas marked with gray lines, although in some cases these human borders
do not correspond to the political ones. We present an animation of the dynamics
of this network in Video B.5, described in Appendix B, together with a visualization
of the influence that the 19 communities have on each other.
To capture the influence that each community has on the rest of the antennas in the
network, we analyzed the density of calls directed to each community from the rest
of the antennas. To quantify this preference, we measured the density of calls between
communities and classified them using a k-means clustering algorithm [Mac67]. The results
are presented in Fig. 9.9, where we have plotted the antennas in different colors according
to the classifier results. We found that the country is divided between the east side and the
west side of the map, as previously suggested by Fig. 9.6 C.
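As a rough illustration of this classification step, the sketch below describes each antenna by the share of its outgoing calls directed to each community and groups the resulting profiles with a plain Lloyd-style k-means. The feature definition, the seeding with the first k points, and all names and numbers are assumptions made for the example; the thesis does not specify these details.

```python
def call_profile(weights, community, antenna, n_communities):
    """Share of the antenna's outgoing calls directed to each community."""
    totals = [0.0] * n_communities
    for (src, dst), w in weights.items():
        if src == antenna:
            totals[community[dst]] += w
    total = sum(totals)
    return [t / total for t in totals] if total else totals


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy network: two western antennas (community 0), two eastern ones (community 1).
community = {"w1": 0, "w2": 0, "e1": 1, "e2": 1}
weights = {("w1", "w2"): 8, ("w1", "e1"): 2, ("w2", "w1"): 9, ("w2", "e2"): 1,
           ("e1", "e2"): 7, ("e1", "w1"): 3, ("e2", "e1"): 10}
antennas = ["w1", "e1", "w2", "e2"]
profiles = [call_profile(weights, community, a, 2) for a in antennas]
labels = kmeans(profiles, k=2)
```

With these made-up weights, the two western antennas end up in one cluster and the two eastern antennas in the other, which is the kind of east-west split discussed in the text.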
Figure 9.8: Mapping the community structure of the calls network of Ivory Coast. Antennas
represent nodes and are plotted in different colors and shapes, according to the community
assigned to them by the community detection algorithm.
Figure 9.9: Mapping the classification results of antennas according to the way the calls network
communities are related. A k-means clustering classifier has been applied to the community
structure of the calls network.
9.1.5 Summary
In summary, we have characterized the interactions and resulting structure of the diverse
geographical and social areas of Ivory Coast. We found that on a local and subregional scale,
the ethno-linguistic factor determines the interaction patterns, while on a wider scale, the
available infrastructure and economic factors have a major influence on the social dynamics.
As a result the Ivorian communication map is organized in two interacting regions located
at the east and west side of the country. On each side the northern ethnic groups seem to
be influenced by the southern ethnic groups. This study shows how CDR data can be used
to understand the social composition of societies and the way that cultural exchange takes
place. It also reveals that the peripheral and poorer communities seem to be more influenced
by the wealthier ones than the other way around. Given the recent history of violence in Ivory
Coast, studies of this kind could help identify whether the conditions are set for social unrest.
9.2 Flooding through the Lens of Mobile Phone Activity
In this section, we explore the potential of analyzing CDR data for characterizing the
reaction of populations to natural disasters, using the 2009 floods in Tabasco, Mexico, as a
case study. To this end, we develop a multimodal data integration framework that
facilitates combining CDR data with other data sources (remote sensing, rainfall activity,
census and civil protection information) in order to quantitatively characterize changes in
communication patterns during the floods [PMT+14]. The ultimate goal is to contribute to
the development of real-time decision-support tools based on CDR data, so that
governments, international organizations and humanitarian actors can enhance their responses.
Natural disasters such as floods or earthquakes affect hundreds of millions of people
worldwide every year1. Effectiveness of humanitarian response is limited, in part, by the lack
of timely and accurate information about the patterns of movement and communication of
the affected population. Specifically, there is a need for dynamic in-situ information across
the event timeline: a baseline for understanding the previous and usual behavior, real-time
measurements of the behavior during the disaster, and the capacity to track return to normal
behavioral patterns during the recovery phase.
1EM-DAT database: http://emdat.be/disaster-trends
Figure 9.10: Left: Visualization of the precipitation data obtained from the NASA TRMM on
November 2nd, 2009. The red square encloses the observed region. Right: Accumulated rainfall
during the first two weeks of November 2009 (jet colormap) over the Tabasco area. The flood
segmentation is shown by the white shade. The area corresponds to the red square in the left panel.
9.2.1 Context
The state of Tabasco is located to the south of the Gulf of Mexico, covering 24,738 km²
(1.3% of the total national area). Due to its location and topographical features, Tabasco is
subject to frequent flooding events, such as those that occurred in 2007, 2008 and 2009. On
October 28th, 2009, a cold front (Nr. 9) entered northwest Mexico and reached Tabasco on
the 31st, where it remained for four days. It rained intensely until November 3rd over
the west of Tabasco, within the Tonala basin. The National Meteorological Service (SMN)
recorded 800 mm of accumulated rain in three days, four times the regular accumulated rain
level for November. In Fig. 9.10 we present a visualization of the precipitation data obtained
from the NASA TRMM on November 2nd, 2009, together with the accumulated rainfall
during the first two weeks of November 2009 in the region of Tabasco. The rainfall levels
in the right panel have been colored from the highest (red) to the lowest values (blue). The
flood segmentation generated from the Landsat-7 images is shown as a white shade.
As the Tonala basin lacks hydraulic infrastructure for controlling river floods, the rain
water flowed freely to the coastal plains, causing flooding. The greatest damage occurred in
the Huimanguillo and Cardenas municipalities. On November 3rd, after the heavy rain,
a state of emergency was declared in Huimanguillo and Cardenas. Response activities
were coordinated by Civil Protection and the System for Integral Development of Families (DIF),
with contributions from other state and federal entities, such as the Federal Preventive
Figure 9.11: Left: map of the 2010 census (green bars) vs the CDR-based population estimation (purple
bars) in several cities of Tabasco (red=affected cities, blue=other cities) and surroundings. Right:
The plot shows a linear correlation between the CDR-based estimate and the real census (R² = 0.97).
Police and the National Water Commission (CONAGUA). On November the 11th, a state
of emergency was declared in Comalcalco, Cunduacan, and Paraıso municipalities.
In January 2010, the National Center for Disaster Prevention (CENAPRED) carried
out a mission to assess the damage caused by the floods, together with the Planificacion
State Secretariat and Civil Protection. They interviewed over 16 state and federal agents
in charge of coordinating recovery actions. CENAPRED collected all the information and
compiled a report on the impact of the floods. According to the report, in economic terms,
the total losses in the state of Tabasco reached 190 million USD, 50% of which were due to
damage to road infrastructure; 16% were related to productive activities (agriculture and
ranching); and 7% of losses corresponded to social damage (dwelling, health, education).
The floods also had a significant emotional and psychological impact on people’s lives. The
CENAPRED report states that the total human, social and economic losses caused by the
2007, 2008 and 2009 stationary floods highlight the vulnerability of Tabasco to such natural
events. Furthermore, this recurring situation hinders the state from achieving total recovery
after each disaster. Hence it is recommended that resources be invested in designing and
implementing mitigation plans and prevention actions rather than in covering post-event
costs.
9.2.2 Assessing the Representativeness of CDR data
We considered a subset of the CDRs provided by the Spanish company Telefonica2 comprising
only those mobile users (social baseline) who made calls from Tabasco during the month
prior to the onset of the reported floods on November 1st, 2009 (baseline period). In order
to evaluate how representative these data are of the real population of Tabasco, we have
compared the population distribution derived from the CDR data with the 2010 census of
Tabasco, used as the ground truth.
The social baseline has been characterized by assigning a home antenna tower (HAT)
to each user, meaning the antenna tower most used at night during the baseline (BL) period
[BCH+13]. The number of users per city (or administrative boundary) was inferred by
cross-referencing the users’ HAT with the GADM database. We then compared the 2010 census
information with the CDR population estimation for the main cities of the regions affected by
the 2009 floods: Cardenas, Huimanguillo, Paraiso, Comalcalco, Cunduacan and other nearby
cities (see Fig. 9.11). The results showed a linear relation between both variables, with a
relatively homogeneous telecom penetration of around 20% in the affected region. Hence,
this analysis provides preliminary results that support the assumption of a homogeneous
representativeness of the communication activity and mobility patterns extracted from CDRs in
the affected cities.
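The home antenna tower assignment and the penetration estimate can be sketched as follows. The night window (22:00 to 06:00), the record layout (user, tower, hour) and every value in the toy data, including the census figures, are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict


def home_antenna_towers(records, night_start=22, night_end=6):
    """Return {user: tower most used at night}.
    The night window is an assumption; the text only says 'at night'."""
    usage = defaultdict(Counter)
    for user, tower, hour in records:
        if hour >= night_start or hour < night_end:
            usage[user][tower] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in usage.items()}


def users_per_city(hat, tower_city):
    """Count users per city by mapping each user's HAT to its city."""
    return dict(Counter(tower_city[tower] for tower in hat.values()))


def penetration(cdr_population, census_population):
    """Ratio between CDR-based users and census inhabitants for each city."""
    return {city: cdr_population.get(city, 0) / census_population[city]
            for city in census_population}


# Toy records: (user, tower, hour of day). All values are made up.
records = [("u1", "T1", 23), ("u1", "T1", 2), ("u1", "T2", 14), ("u1", "T2", 15),
           ("u1", "T2", 16), ("u2", "T2", 1), ("u3", "T1", 22)]
tower_city = {"T1": "Cardenas", "T2": "Huimanguillo"}   # hypothetical mapping
census = {"Cardenas": 10, "Huimanguillo": 5}            # hypothetical census counts

hat = home_antenna_towers(records)
cdr_population = users_per_city(hat, tower_city)
rates = penetration(cdr_population, census)
```

Note that user u1 calls more often from tower T2 during the day but is still assigned to T1, because only night-time activity counts towards the home tower.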
9.2.3 Population Response to Floods
For the analysis, the CDR data of the baseline has been aggregated by day and by antenna
to understand how the floods modulated the normal communication patterns observed at
the antenna level. In particular, we measured the number of users placing or receiving calls
at each antenna on each day. We refer to this raw measurement as the antenna
communication activity x(t) (see Fig. 9.12). To detect abnormalities in this activity, we propose
the antenna variation metric, which relies on the comparison of x(t) against its characteristic
variation obtained during the baseline period. Mathematically, the antenna variation metric,
xnorm(t), is defined as the z-score of x(t) with respect to the normal distribution characterizing
the baseline pattern as follows:
\[ x_{norm}(t) = \frac{|x(t) - \mu_{BL}|}{\sigma_{BL}} \tag{9.1} \]
where the average and standard deviation (µBL, σBL) statistically characterize the
activity during the BL period (the month before the flooding onset). A graphical scheme of this
2www.telefonica.com/
Figure 9.12: Time evolution of the number of unique users per cell tower x(t). The gray stripes
indicate the Flood and Christmas periods where stronger variations are observed. The labels at
the top-right of each chart indicate the municipality where the tower is located. Towers have been
ordered and colored according to the maximum degree of variation during floods in decreasing
order.
Figure 9.13: Scheme of the Antenna Variation metric for cell towers. The black curve represents
the raw signal x(t). The gray stripe indicates the Flood period. The red line indicates the average
value (µBL) of users served during the Baseline period. The pink stripe indicates the standard
deviation (σBL) from the average value during the Baseline period. The blue line indicates the
deviation from the average value at a given day. Our measure of antenna variation results from the
ratio of this deviation (blue line) to the standard deviation σBL.
Figure 9.14: Time evolution of the Antenna Variation metric (xnorm) for the considered towers.
The gray stripes indicate the Flood and Christmas periods. Color is proportional to the degree of
variation during the flooding period. The antennas show a spike of activity
during the floods (left shaded region), as well as during Christmas and New Year’s Eve.
Figure 9.15: Impact Map of Tabasco for the 2009 floods. Circles represent antennas and their size
is proportional to the variation metric during the floods. The dark blue segmentation represents
the flooded region. The color of municipalities is proportional to the number of affected people.
The map shows the most critical day featuring the highest values of the antenna variation metric.
measure is presented in Fig. 9.13. A static z-score has been previously used to characterize
calling behaviors in large-scale, time-sensitive emergency events like bombings, earthquakes
or brief storms [BWB11]. Here, we have computed xnorm(t) from the beginning of the BL
period until the end of January (approx. 2 months after the rainfall ended), generating time
series of this z-score for the antennas in the affected areas.
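A minimal sketch of the antenna variation metric of Eq. (9.1) is given below. It assumes that the daily series starts at the beginning of the baseline period and that the population standard deviation is used; the thesis does not state which estimator was applied, and the numbers are invented.

```python
from statistics import mean, pstdev


def antenna_variation(series, baseline_days):
    """Eq. (9.1): |x(t) - mu_BL| / sigma_BL for every day in the series.
    The baseline is taken as the first `baseline_days` entries."""
    baseline = series[:baseline_days]
    mu = mean(baseline)
    sigma = pstdev(baseline)  # population std; the choice of estimator is an assumption
    if sigma == 0:
        raise ValueError("baseline has zero variance, metric is undefined")
    return [abs(x - mu) / sigma for x in series]


# Toy daily counts of unique users at one antenna (made-up values):
# four baseline days followed by a normal day and a flood day.
x = [98, 102, 98, 102, 100, 150]
xnorm = antenna_variation(x, baseline_days=4)
```

Here the baseline has a mean of 100 users and a standard deviation of 2, so the last day, with 150 users, deviates by 25 baseline standard deviations.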
In Fig. 9.14 we present the temporal evolution of the antenna variation metric xnorm(t),
derived from the CDRs, at all towers. The series have been colored according to their maximum
variation during the floods (gray shaded region at the left). Some
antennas display an extremely high variation during the floods, up to 25 times higher than
their usual variations. These antennas are located in the most affected areas. The spatial
distribution of the maximum value of the antenna variation metric is shown in an impact map
(see Fig. 9.15) that combines the metric with other contextual indicators: the municipalities
have been colored according to the official number of affected people, and the
segmentation of the flooded area is overlaid. The impact map is consistent with our ground
truth evidence (flood segmentation and civil protection records), since the antenna activity
spikes in the most affected municipalities: Cardenas and Huimanguillo. Furthermore, we also
present the daily variations of the antennas along the observation period in Video B.5,
described in Appendix B.
During the floods, the distribution of the maximum of the antenna variation metric is
wider than the BL period distribution, featuring more antennas with a higher variation metric
(see Fig. 9.16). The real-time nature of mobile phone signals allows us to compare
social patterns against their modulating factors. Here, we compare the proposed metric with
rainfall levels. These precipitation levels are obtained from the daily-resolution
rainfall estimates of the NASA TRMM project. The six hottest antennas that also feature different
metric profiles have been taken to observe the rainfall levels at the antenna level (see Fig.
9.17 Top). As shown, the typical delay between the maximum level of precipitation and the
peak in the variations of the hot antenna indicator is 4 days. One possible explanation is
that a population might not react in a way that alters the communication activity globally
even under extreme climatological conditions. Instead, the response captured in the
communication activity could have occurred due to the initial flooding effects, after the rivers and
water reserves overflowed around November 5th and 6th, as reported in the news.
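The delay between the rainfall peak and the peak of the antenna variation metric can be computed as the difference between the days on which each daily series reaches its maximum. The sketch below uses invented series for a single antenna, indexed by day from November 1st; it illustrates the comparison only and does not reproduce the measured data.

```python
def peak_day(series):
    """Index of the first day on which the series reaches its maximum."""
    return max(range(len(series)), key=lambda t: series[t])


def peak_delay(rainfall, variation):
    """Days elapsed between the rainfall peak and the variation-metric peak."""
    return peak_day(variation) - peak_day(rainfall)


# Made-up daily series, day 0 = November 1st.
rainfall_mm = [40, 120, 300, 90, 20, 5, 0]
xnorm = [1.0, 1.2, 2.0, 3.0, 6.0, 9.0, 20.0]
delay = peak_delay(rainfall_mm, xnorm)
```

In this toy case the rainfall peaks on day 2 and the metric on day 6, giving a delay of 4 days.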
The civil protection warning was issued on the day of maximum precipitation
(November 3rd). It would be expected that this warning would result in a spike in communications
activity, but this reaction can only be observed in two antennas located along Federal Road
180D that eventually suffered an outage (see Fig. 9.17 Bottom). These sudden variations
Figure 9.16: Distribution of the maximum of the antenna variation metric for the BL period
(gray) and floods (red). The curves show the percentage of antennas (y-axis) whose maximum
variation metric value (xnorm) is higher than a given value (x-axis).
and the following outage may indicate the point of the highest rain impact, likely
causing a severe traffic jam on 180D. The increase of the antenna occupancy time due to the
jam would eventually generate the shown communication activity peaks (although further
analysis would be required).
On the other hand, the maximum of the antenna variation in the antennas with higher
population happens on November 6th, when the rain was already subsiding. Several sources
also raised the estimates of the affected population from 50,000 to 100,000 people that
day. Thus, the hypothesis would be that for gradual-onset disasters (due to a cumulative
effect of some potential factor), the proposed metric might provide an estimation of the
population’s awareness and subsequent reaction rather than a means to detect the onset of
the event. The delayed spike in antenna variation in this case may indicate that while the
civil protection warning did not produce a sufficient level of awareness in the population,
the initial consequences of the flooding did.
9.2.4 Summary
In summary, we have proposed a methodology based on integrated analysis of CDRs with
several data sources, including remote sensing imagery and rainfall information. We tested
the representativeness of the CDR data observing a homogeneous penetration of mobile
phones in the affected cities. We found abnormal communication activity that could be
used to measure the impact of the disaster. The population’s reaction, in terms of increased
communication, took place when the emergency was declared, rather than during the
previous alert stage, as would have been expected. This could be an indicator of the skepticism
or lack of awareness of the population regarding the heightened risk of floods. If this is the case, a systematic
study of the reasons for such behavior is recommended, since lack of awareness of a hazard
implies an increase in vulnerability to its effects.
Figure 9.17: Top: Antenna variation metric (red) vs the precipitation level (blue) for the six
hottest antennas (A to F). The dashed line shows the emergency warning date as reported in the
news. Bottom: Map featuring the position and date (e.g. 6N is November 6th) where the maximum
of the antenna variation metric was observed.
Chapter 10
Conclusions
In this thesis, we have shown that several societal processes can be understood by analyzing
the data derived from people’s interactions with electronic media together with the
mathematical and computational tools from complexity science. We have proposed methodologies
to treat large volumes of unstructured data resulting from human activity on social media
and through mobile phones. We have been able to unveil people’s collective behavior and to
retrieve structural and dynamical information about the underlying social systems from the
raw data.
Next, we present the conclusions obtained from our studies:
1. We developed methods to characterize and understand the social systems’ structure,
functioning and time evolution. To this end, we abstracted the systems as complex
networks and analyzed the evolution of their properties. We have applied this analysis
to several Twitter conversations during different events, finding similar patterns across
diverse contexts.
(a) We have shown that the user activity distributions on several Twitter conversations
typically scale as fat-tailed distributions, truncated by individual
constraints and physical limitations. This means that the conversations are usually fed
by a small group of very active persons, while the large majority of users hardly
participates.
(b) During events, we have identified that the temporal behavior of the collective
activity is explosive and bursty. Most of the related information is posted during
the most critical hours of the event, when the topic captures the interest of the
majority of participants. We have shown that bursts present very similar shapes,
independently of the number of users and messages, which may span across several
orders of magnitude.
(c) We have shown that the user interactions on Twitter can be well defined by two
networks associated with the mechanisms provided by this online service for users
to receive and forward information. Both networks are directed and the direction of
the edges indicates the flow of attention and information.
i. One network emerges from the followers mechanism. In this network nodes
are linked by who receives whose messages, and its structure displays the
social substratum where information may flow during conversations.
ii. The other network emerges from the retweet mechanism. In this graph, users
are linked according to who forwarded (or retweeted) whose content, and it
represents the information diffusion graph where messages actually traveled during
conversations.
(d) We have found that both followers and retweet networks present complex
properties. The degree and strength distributions follow power laws in most cases,
where the distribution resulting from the aggregation of collective behaviors presents
a broader tail than the distributions emerging from individual actions. Besides,
the average shortest path between nodes turns out to be very small, since the few
hubs that connect most of the networks gather an extremely significant number
of connections.
(e) We have shown that the directed assortativity of these networks varies according
to the direction considered. In general, the out-in relationship is disassortative,
meaning that non-popular accounts usually direct their edges to popular accounts,
for example by selecting them as sources of information or to propagate their messages across
the network. Meanwhile, the out-out relationship is positively assortative,
meaning that the active users are linked to each other.
(f) We showed that the retweet mechanism can also be understood as information
cascades taking place on the followers network. We found that the size distribution
of cascades decays as a power law, indicating that while most cascades hardly
include more than a couple of participants, a few cascades are much larger.
We have determined that the probability of a cascade growing decays exponentially
as it moves farther from the original message source, in agreement with previous
works.
(g) At the mesoscale, we found that people are organized around influential accounts
from different collectives, like journalists, politicians or traditional media. The
followers network presents larger and denser communities, while the retweet
network presents smaller communities with fewer edges. In the communities of the
followers network, the most central users are very popular and usually influence
the emergence of smaller retransmission communities due to the propagation of
their content.
(h) We have shown that people are more selective when it comes to taking an active
part in the conversation, like retransmitting a message, than when just passively
participating, like receiving and reading information from other sources.
(i) Moreover, our results indicate that although online social media seem to be a
purely social phenomenon, traditional media agents still enjoy a lot of power and
influence over people, whom they use to boost and enhance their messages.
2. The characterization of the social systems allowed us to understand the way
users interact and influence each other during events and conversations. Based on the
networks’ structure, we have classified users, as elements of the system, according to their
relationship with the environment and their role in the collective functioning.
(a) We have shown that there are three types of user behavior that determine the
dynamics of the information flow: Information Producers, Active Consumers and
Passive Consumers.
i. Information producers represent a very small group of highly influential users
who dominate the collective attention and catalyze the information diffusion
process. These users cause a lot of activity inside the network while posting a small
number of messages.
ii. Active consumers usually retransmit a large number of messages, gaining
influence in proportion to their activity. These users act like social
bridges delivering messages from other people to their own sub-networks.
iii. Finally, passive consumers are those who hardly participate, and neither retransmit
messages nor get retransmitted at all. These users represent the large majority
of the population while their activity represents less than half of the stream
of messages.
3. We have introduced a new measure of influence in the network called user efficiency,
defined as the ratio of the retransmissions gained to the messages posted. We have
also proposed a computational model to explain the distributions of user efficiency and
to explore the effects of the underlying network’s topological properties and the way
users post messages. We show that users can compensate their topological deficits
by means of modifying their behavior in order to be influential in the conversations.
However, this process is very costly for the user.
(a) We found that the user efficiency distribution follows a lognormal distribution with
a fatter tail than expected, due to the effects of the extremely connected hubs.
On average, most of the users who get retweeted gain as many retransmissions as
messages posted. However, a minority of them, occupying a privileged position
in the followers network, achieve a very high level of retransmission with little
effort.
(b) We showed that the user efficiency distribution is universal across several Twitter
conversations. We demonstrated that the same distribution emerges from several
conversations of diverse nature and cultural context, whose sizes in users and
participants varied across several orders of magnitude.
(c) The user efficiency distributions have been explained by modeling the underlying
rules of the message spreading process by means of a computational model, based
on independent cascades taking place on the followers networks. The cascades are
biased so that their probability of growth decays as the message travels farther
from the original source.
(d) The developed computational model revealed that the emergence of a small frac-
tion of highly efficient users results from the heterogeneity of the underlying net-
work, rather than the differences in the individual user behavior. Therefore, the
changes in the activity behavior are not significant if the underlying network
presents a scale-free structure.
(e) When considering homogeneous networks, we have shown that the retransmissions
gained by a user are mainly proportional to their activity, meaning that there is
no influential set of highly efficient users in this kind of graph. In fact, a
homogeneously organized society would need a much larger population to reach
the same level of efficiency in diffusing information that we get from complex and
heterogeneous organization.
(f) Our results show that regular users can compensate for their topological deficits by
changing their behavior. However, since the activity must be increased
in a very costly and even unaffordable way, such enhancement would be achieved
far less efficiently than by the users with high connectivity.
(g) We conclude that although individuals may have remarkable psychological and
contextual differences, the dynamical patterns are due to simple and universal
interaction mechanisms.
4. We have proposed a methodology to infer the degree of polarization in social inter-
actions. The methodology consists of a polarization index and a model to estimate
opinions in social networks. We have illustrated how to apply this methodology by de-
tecting and measuring the polarization on a Twitter conversation related to the recent
death of the former Venezuelan president Hugo Chavez.
(a) We have introduced a new way to measure and quantify the degree of polarization
of a social group based on the concepts of physics and inspired by the electric
dipole moment. We have shown that the polarization of two equally populated
groups depends on how distant their views are, just like the electric dipole moment
increases with the distance between the charges.
(b) We have shown that the opinions of a large number of participants on Twitter
conversations can be inferred with a social contagion model, in which a minority of
influential individuals -called elite- propagate their opinions through the emergent
retweet networks.
(c) Our methodology can detect different degrees of polarization, depending on the
structure of the network. If the network is polarized around the elite, then we are
able to detect a two-island structure. Instead, if the network is not polarized,
then we observe a single-island structure.
(d) We applied this methodology to a Twitter conversation regarding the death an-
nouncement of the former Venezuelan president Hugo Chavez. We found that
the polarization degree varied according to developing external events. Based on
these results, we have identified the following periods:
i. Before the main announcement, we found the networks to be polarized around
the two political poles. However, the polarization index did not present
maximum values since one pole was larger than the other.
ii. During the main announcement, we found the conversation to have no polar-
ization. We found single island structures with a remarkable participation of
international users.
iii. After the main announcement, the polarization emerged in the conversation
again and the networks showed two island structures. At this stage, both
poles reached similar sizes and the polarization index presented maximum
values.
(e) The Venezuelan elite were not capable of polarizing the network when the
conversation stopped being local to Venezuela and became international. The
more international users we detected, the lower the degree of polarization we found.
(f) However, by applying the model allowing the flow of information only in the
direction of who-influences-who, we found a social baseline that presented a higher
degree of polarization across the whole conversation.
(g) We contrasted our results against offline data, such as municipality governments
or socioeconomic factors, finding a good correlation between the online and offline
polarization.
(h) We have shown that a minority of elite users were able to influence the whole
online social network, resulting in a highly politically polarized conversation. This
means that most users are exposed to opinions they already favor, and
cross-ideological interactions hardly occur.
5. We have also analyzed the temporal behavior of aggregated Twitter activity in urban
areas across the world. We characterized the kinetics of Twitter activity in over 50
cities by building time series of average behavior. We have shown that cities can
be classified into three classes of behavior according to combinations of morning and
afternoon activity.
(a) We found that cities present a collective cyclic behavior, due to daily routines and
collective activities. This behavior consists of periodic minima of activity during
the early morning and peaks of activity during the daytime.
(b) We have identified three classes of dynamical behavior, based on morning and
afternoon activity.
i. One class presents two peaks of similar size: one before noon and another
before night. We showed that most of these cities are located in Europe,
Middle East and Africa.
ii. Another class presents two peaks of different size: a smaller one before noon and a
larger one before night. We showed that most of these cities are located in South
America.
iii. The last class presents a single peak before night. Most of these cities turned out
to be in North America and East Asia.
6. Moreover, we have analyzed mobile phone activity from Ivory Coast in
order to infer human behavior from calling and mobility patterns. In this study,
we have characterized the way geographical regions interact with each other. We
have found that the communication patterns are correlated with the transport
infrastructure, economic development and cultural identity.
(a) We have shown that communication patterns in a developing country can be
characterized by the construction of networks of people calling each other and
moving through antennas. The networks are directed and weighted according to
the number of recurrences.
(b) We showed that the calls network behaves like a fully connected network with an
extremely high clustering coefficient, while the mobility network is sparser and
presents a lower clustering coefficient.
(c) At the mesoscale, we have shown that the calls network presents fewer and larger
communities, while the mobility network presents more but smaller communities.
We found that the communities in both networks are related to regions and
populated areas, like cities or villages.
(d) We showed that the mobility network is a reflection of the transport infrastructure.
Besides, we found that economic development is related to the closeness
centrality property of this network.
(e) We found that the communities from the calls network are clustered in two
regions located at both sides of the country. Our evidence suggests that this
division is due to cultural factors, like the spoken language, as well as economic
factors.
7. Finally, we have studied the effects of natural disasters, like the 2009 floods in
Tabasco, Mexico, on people's collective behavior. In this study we proposed a
methodology to integrate mobile phone data with other data sources, in order to
enhance the information managed by local governments and international agencies during
emergencies. The results show that mobile phone activity could be a complementary
source of information to estimate the impact of natural disasters in near real
time. Our conclusions from this study are the following:
(a) We have shown that mobile phone data is representative enough to estimate measurements
over the full population, since we observed a good correlation between the number
of users and the number of inhabitants per region.
(b) We have shown that the mobile phone activity presents a bursty and hetero-
geneous behavior during natural disasters at the antennas close to the affected
areas.
(c) We have shown that these abnormal variations can be detected by normalizing
the behavior at each antenna during the emergency with its usual behavior.
(d) Our findings showed that the relevant information comes from the antenna-level
aggregation of cell phone traffic and not from individual records. Therefore, we
have shown that user privacy is not compromised.
(e) We conclude that popular reactions to catastrophes could be incorporated
into an evolving emergency management strategy and into policy evaluation.
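As an illustration of conclusion (c), the following minimal sketch normalizes the activity of one antenna during an emergency by its usual behavior. It assumes that the usual behavior is summarized by the mean and sample standard deviation of a baseline period, so that the variation is a z-score. This choice, the function names and the threshold value are assumptions made for illustration and are not necessarily the metric used in this work.

```python
from statistics import mean, stdev


def antenna_variation(baseline, observed):
    """Return the z-score of each observed count relative to the baseline.

    baseline: activity counts of one antenna during normal days.
    observed: activity counts of the same antenna during the emergency.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation, needs >= 2 values
    if sigma == 0:
        # A constant baseline gives no scale to normalize against.
        raise ValueError("baseline has no variability")
    return [(x - mu) / sigma for x in observed]


def flag_anomalies(baseline, observed, threshold=3.0):
    """Return the indices of observed values whose |z-score| exceeds threshold.

    The threshold of 3 standard deviations is a hypothetical default.
    """
    scores = antenna_variation(baseline, observed)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]
```

For instance, calling `flag_anomalies([100, 110, 90, 105, 95], [100, 180])` flags only the second observation, since it lies far above the usual activity of that antenna. Because the computation only uses per-antenna aggregates, no individual record is needed.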
Appendix A
User Behavior
In this appendix, we show the characterization of the user behavior in different datasets. We
present the results from applying the same experiments performed in section 5.7 to two other
Twitter conversations described in section 4.2.2. More specifically, we present the results
from the 20N dataset in Fig. A.1 and the results from the ETA dataset in Fig. A.2.
The patterns obtained from these datasets are very similar to the
results obtained in section 5.7 from the #SOSINternetVE dataset. In Fig. A.1 A and A.2
A, we show that the most retransmitted users are also the most followed ones (red dots),
independently of their activity. In Fig. A.1 B and A.2 B, we show that the most active users
(red dots) do not have the largest number of followers. However, these active users may gain
as many retweets as the popular users. In Fig. A.1 C and A.2 C, we show that the most
active users (red dots) are reciprocal and mainly located at Kin/Kout ∼ 1. Meanwhile, we
see that popular users are asymmetrical and present Kin > Kout. Finally, in Fig. A.1 D and
A.2 D, we show that the most active users (red dots) are those who retweet the most and
do not have the largest number of followers. Also, we see that the most followed ones hardly
retweet other users.
In summary, we have shown that our characterization of the user behavior is not
constrained to a single dataset, but rather seems to be a general property of Twitter
conversations. Again, we found that there are three kinds of users. One group is composed of
highly followed users, who post a few messages and obtain a high number of retweets. Another
group is composed of less followed users, who are very active, make a lot of retweets and
obtain retweets in proportion to their activity. Finally, there is a third set of less followed
users that hardly participate and consequently hardly gain retweets in the conversation.
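The three kinds of users described above can be sketched as a simple rule-based classification. This is only an illustration: the `User` fields, the group labels and both cut-off values are hypothetical, and the results in this appendix come from the scatter plots rather than from fixed thresholds.

```python
from dataclasses import dataclass


@dataclass
class User:
    followers: int        # in-degree, Kin
    followees: int        # out-degree, Kout
    posts: int            # activity in the conversation
    retweets_gained: int  # retransmissions obtained


def in_out_ratio(user):
    """Return Kin/Kout; users with no followees get an infinite ratio."""
    if user.followees == 0:
        return float("inf")
    return user.followers / user.followees


def classify(user, follower_cut=10_000, activity_cut=20):
    """Assign a user to one of three groups (cut-offs are hypothetical)."""
    if user.followers >= follower_cut:
        return "popular"  # highly followed users
    if user.posts >= activity_cut:
        return "active"   # less followed but very active users
    return "passive"      # less followed users that hardly participate
```

Under these assumed cut-offs, `classify(User(500_000, 200, 5, 3_000))` returns `"popular"`, while a reciprocal user such as `User(300, 310, 80, 75)` is labeled `"active"` and has `in_out_ratio` close to 1.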
Figure A.1: Analysis of the user behavior. (A) Scatter plot of retransmissions obtained by user
versus its activity and colored by its number of followers. (B) Scatter plot of retransmissions
obtained by user versus its number of followers and colored by its activity. (C) Scatter plot of
retransmissions obtained by user versus the ratio between the number of followers and followees,
and colored by its activity. (D) Scatter plot of retransmissions made by user versus its number of
followers and colored by its activity. Dots represent users. Data correspond to the 20N dataset.
Figure A.2: Analysis of the user behavior. (A) Scatter plot of retransmissions obtained by user
versus its activity and colored by its number of followers. (B) Scatter plot of retransmissions
obtained by user versus its number of followers and colored by its activity. (C) Scatter plot of
retransmissions obtained by user versus the ratio between the number of followers and followees,
and colored by its activity. (D) Scatter plot of retransmissions made by user versus its number of
followers and colored by its activity. Dots represent users. Data correspond to the ETA dataset.
Appendix B
Videos
In this appendix, we present the videos that we have made to illustrate some of our results.
For each video, we present a figure composed of three arbitrarily chosen snapshots. We also
provide a description of the video in each figure caption.
Specifically, we present the following videos:
1. Evolution of the opinion estimation model in Fig. B.1. In this video we show the
evolution of the opinion estimation model in a sample network. At the beginning of
the video, all nodes except the elite are colored in white. Then, as the video goes
on, nodes iteratively adopt new opinions and change their colors accordingly.
2. Worldwide Twitter reaction to the announcement of Hugo Chavez's death in Fig.
B.2. In this video we show the evolution of geolocated messages per minute during a
24h period that includes the announcement of the former Venezuelan president's death.
At the beginning of the video we see some scattered messages, mainly concentrated
in Venezuela. Then, once the news is released, we notice an explosion of activity
worldwide.
3. Worldwide Twitter activity in Fig. B.3. In this video we show the dynamics of Twitter
activity worldwide, during one arbitrary week. In the video we can notice a global wave
of activity going from east to west on a daily basis. We show that people periodically
go to sleep and become active during the day.
4. Human trajectories network evolution in Ivory Coast in Fig. B.4. In this video we
show the evolution of the human mobility network in Ivory Coast during an arbitrary
day. We also show the location of the network communities in the map.
5. Calls network evolution in Ivory Coast in Fig. B.5. In this video we show the evolution
of the mobile phones’ calls network in Ivory Coast during an arbitrary day. We also
show the influence that the network communities play on each other.
6. Time-lapse of the Tabasco impact map in Fig. B.6. In this video we show the temporal
evolution of the antenna variation metric before, during and after the 2009 floods
in Tabasco, Mexico.
We have built these videos almost exclusively by means of Python scripts. In general,
a video is composed of a sequence of frames. In our videos, each frame is built as an
independent plot and saved as an independent figure. Then, we compiled all figures into a
single video using the ffmpeg program (https://www.ffmpeg.org/). The only requirement is
that the figure files must be numbered in the order in which they will appear in the final video.
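The procedure above can be sketched as follows. This is a minimal illustration and not the original scripts: the file naming scheme, the function names, the frame rate and the use of matplotlib for plotting are assumptions. Frames are saved with zero-padded numbers so that their order is unambiguous, and ffmpeg then reads them through a numbered file pattern.

```python
import subprocess
from pathlib import Path


def frame_path(out_dir, index, digits=5):
    """Return the zero-padded file name of frame number `index`."""
    return Path(out_dir) / f"frame_{index:0{digits}d}.png"


def save_frames(series, out_dir):
    """Save one independent figure per element of `series`.

    Assumes matplotlib is installed; each element is plotted as a line.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to files, no display needed
    import matplotlib.pyplot as plt

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, values in enumerate(series):
        fig, ax = plt.subplots()
        ax.plot(values)
        fig.savefig(frame_path(out_dir, i))
        plt.close(fig)  # free memory before drawing the next frame


def ffmpeg_command(out_dir, output, fps=25, digits=5):
    """Build the ffmpeg call that compiles the numbered frames into a video."""
    pattern = str(Path(out_dir) / f"frame_%0{digits}d.png")
    return ["ffmpeg", "-y", "-framerate", str(fps), "-i", pattern,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", str(output)]


def compile_video(out_dir, output, fps=25):
    """Run ffmpeg; requires the ffmpeg program to be available on the PATH."""
    subprocess.run(ffmpeg_command(out_dir, output, fps), check=True)
```

A typical use would be `save_frames(data, "frames")` followed by `compile_video("frames", "video.mp4")`. Zero padding matters because, without it, a file such as `frame_10.png` would not match the fixed-width pattern that ffmpeg expects.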
Figure B.1: Evolution of the opinion estimation model. Nodes are colored according to their
opinion Xi. Initially, all nodes' opinions are zero; thus, they are colored in white. Afterwards,
nodes with an opinion below zero are red and those above zero are blue. The elite is hidden in the
network and spreads its opinions iteratively. We see how the network is increasingly colored
at each time step. Because the network is polarized around the elite, the red and blue colors are
not mixed.
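The iterative coloring shown in this video can be illustrated with a minimal sketch. It assumes that elite nodes keep a fixed opinion and that every other node synchronously adopts the mean opinion of its neighbors at each time step. The update rule, the function names and the toy network are assumptions made for illustration, so this should not be read as the exact model definition.

```python
def step(opinions, neighbors, elite):
    """Return the opinions after one synchronous update.

    opinions: dict mapping node -> opinion X_i.
    neighbors: dict mapping node -> list of neighboring nodes.
    elite: set of nodes whose opinions never change.
    """
    updated = dict(opinions)
    for node, nbrs in neighbors.items():
        if node in elite or not nbrs:
            continue  # elite and isolated nodes keep their opinion
        updated[node] = sum(opinions[n] for n in nbrs) / len(nbrs)
    return updated


def color(opinion, eps=1e-9):
    """Map an opinion to the color code used in the video."""
    if opinion < -eps:
        return "red"
    if opinion > eps:
        return "blue"
    return "white"


# Toy chain a - b - c - d with one elite node at each end (hypothetical).
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
elite = {"a", "d"}
opinions = {"a": -1.0, "b": 0.0, "c": 0.0, "d": 1.0}
after_one_step = step(opinions, neighbors, elite)
```

In this toy chain, node b turns red and node c turns blue after the first step, which mirrors how the two colors spread from the elite without mixing.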
Figure B.2: Worldwide Twitter reaction to the announcement of Hugo Chavez's death. Each yellow
circle represents a geolocated tweet. The video spans a 24h period. We show a counter indicating
the remaining time before the announcement and the time after it. It can be noticed that at the
moment of the announcement the whole world reacted massively to the news by posting related
messages.
Figure B.3: Worldwide Twitter activity. In this video we present the worldwide Twitter activity
during an arbitrary week. We plot all geolocated tweets as white dots in the map. It can be noticed
that there is a wave of activity from the east to the west side of the globe as days evolve. Also, it
is noticeable that the activity decreases to its minimum levels during early mornings.
Figure B.4: Human trajectories network evolution in Ivory Coast. In this video, we present the
dynamical growth of the human trajectories network during an arbitrary day. Dots represent users
moving across the country from antenna to antenna. The edge color is related to the network
community to which the target node belongs. It can be noticed that the network grows in a sparse
way, mostly connecting nodes that are geographically close to each other. Other regions like the
capital city (bottom right) concentrate most of the long distance edges.
Figure B.5: Calls network evolution in Ivory Coast. In this video, we present the dynamical
growth of the calls network during a period of 12 hours on an arbitrary day. Dots represent calls,
traveling from one antenna to the other at each hour. The edge color is related to the network
community to which the target node belongs. It can be noticed that there is an explosion of calls
after 6am, showing the dense structure of the network.
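The networks shown in Figs. B.4 and B.5 are directed and weighted according to the number of recurrences between antennas. A minimal sketch of that construction is given below. It assumes the records are already reduced to (source antenna, target antenna) pairs; the function names and the decision to discard records within a single antenna are choices made for illustration only.

```python
from collections import Counter


def build_weighted_digraph(records):
    """Count how many times each directed antenna pair occurs.

    records: iterable of (source_antenna, target_antenna) pairs.
    Returns a Counter mapping (source, target) -> edge weight.
    """
    weights = Counter()
    for source, target in records:
        if source != target:  # assumption: ignore records within one antenna
            weights[(source, target)] += 1
    return weights


def out_strength(weights):
    """Sum the weights of the outgoing edges of each node."""
    strength = Counter()
    for (source, _target), weight in weights.items():
        strength[source] += weight
    return strength


def density(weights, n_nodes):
    """Fraction of the possible directed edges that are present."""
    return len(weights) / (n_nodes * (n_nodes - 1))
```

A density close to 1 corresponds to the nearly fully connected behavior reported for the calls network, while lower values correspond to a sparser structure like that of the mobility network.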
Figure B.6: Time-lapse of the Tabasco impact map. The video displays the absolute value of the
antenna variation metric from October 2009 to January 2010, as in the temporal series. Each antenna is
represented by a circle with color and size proportional to the daily metric value. The segmented
flooded area has been colored in light blue. It can be noticed that the antennas near the flooded area
dramatically increased their variation during the floods. A similar effect is noticeable during Christmas
and New Year's Eve, when all antennas present extremely large variation.
Bibliography
[ACFO13] D. Acemoglu, G. Como, F. Fagnani, and A. Ozdaglar, Opinion fluctuations
and disagreement in social networks, Mathematics of Operations Research 38
(2013), no. 1, 1–27.
[AG05] L. A. Adamic and N. Glance, The political blogosphere and the 2004 U.S.
election: Divided they blog, Proceedings of LinkKDD, 2005.
[AHSW11] S. Asur, B. A. Huberman, G. Szabo, and C. Wang, Trends in social media:
Persistence and decay, CoRR abs/1102.1402 (2011).
[AJB99] R. Albert, H. Jeong, and A-L Barabasi, Internet: Diameter of the world-wide
web, Nature 401 (1999), no. 6749, 130–131.
[AMV+14] J. Adebayo, T. Musso, K. Virdee, C. Friedman, and Y. Bar-Yam, An explo-
ration of social identity: The structure of the bbc news-sharing community on
twitter, Complexity 19 (2014), no. 5, 55–63.
[AO11] D. Acemoglu and A. Ozdaglar, Opinion dynamics and learning in social net-
works, Dynamic Games and Applications 1 (2011), no. 1, 3–49.
[APR99] J. Abello, P. M. Pardalos, and M. G. C. Resende, On very large maximum
clique problems, AMS-DIMACS Series in Discrete Mathematics and Theoret-
ical Computer Science 50 (1999), 119–130.
[AS11] R. Alonso-Sanz, Discrete systems with memory, vol. 75, World Scientific,
2011.
[ASBS00] L. A. Amaral, A. Scala, M. Barthelemy, and H. E. Stanley, Classes of small-
world networks., Proc Natl Acad Sci 97 (2000), no. 21, 11149–11152.
[AW12] S. Aral and D. Walker, Identifying influential and susceptible members of
social networks, Science 337 (2012), no. 6092, 337–341.
[BA99] A-L Barabasi and R Albert, Emergence of Scaling in Random Networks, Sci-
ence 286 (1999), no. 5439, 509–512.
[Bar05] A-L Barabasi, The origin of bursts and heavy tails in human dynamics, Nature
435 (2005), 207.
[Bar12] A-L Barabasi, Network Science Project,
http://barabasilab.neu.edu/networksciencebook, 2012.
[BB07] D. Baldassarri and P. Bearman, Dynamics of Political Polarization, American
Sociological Review 72 (2007), no. 5, 784–811.
[BBPSV04] A. Barrat, M. Barthelemy, R. Pastor-Satorras, and A. Vespignani, The archi-
tecture of complex weighted networks, Proceedings of the National Academy
of Sciences of the United States of America 101 (2004), no. 11, 3747–3752.
[BCH+13] R. Becker, R. Caceres, K. Hanson, S. Isaacman, J-M Loh, M. Martonosi,
J. Rowland, S. Urbanek, A. Varshavsky, and C. Volinsky, Human mobility
characterization from cellular network data, Commun. ACM 56 (2013), no. 1,
74–82.
[BEC+12] V. D. Blondel, M. Esch, C. Chan, F. Clerot, P. Deville, E. Huens, F. Morlot,
Z. Smoreda, and C. Ziemlicki, Data for development: the d4d challenge on
mobile phone data, CoRR abs/1210.0137 (2012).
[Ber97] A. Berry, The income distribution threat in latin america, Latin American
Research Review 32 (1997), no. 2, pp. 3–40 (English).
[Bet13] L. M. A. Bettencourt, The origins of scaling in cities, Science 340 (2013),
no. 6139, 1438–1441.
[BG05] I. Borg and P.J.F. Groenen, Modern Multidimensional Scaling: Theory and
Applications, Springer, 2005.
[BG08] D. Baldassarri and A. Gelman, Partisans without Constraint: Political Po-
larization and Trends in American Public Opinion, American Journal of So-
ciology 114 (2008), no. 2, 408–446.
[BGL10] D. Boyd, S. Golder, and G. Lotan, Tweet, tweet, retweet: Conversational
aspects of retweeting on twitter., HICSS, IEEE Computer Society, 2010, pp. 1–
10.
[BGLL08] V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, Fast unfolding
of communities in large networks, J. Stat. Mech (2008), P10008.
[BH48] R. Bellman and T. E Harris, On the theory of age-dependent stochastic branch-
ing processes, Proc. Nat. Acad. Sci. USA 34 (1948), no. 12, 601.
[BHMW11] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, Everyone’s an
influencer: quantifying influence on twitter, Proceedings of the fourth ACM
international conference on Web search and data mining (New York, NY,
USA), WSDM ’11, ACM, 2011, pp. 65–74.
[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata,
A. Tomkins, and J. Wiener, Graph structure in the web, Comput. Netw. 33
(2000), 309–320.
[BKO11] D. Bindel, J. Kleinberg, and S. Oren, How bad is forming your own opin-
ion?, Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual
Symposium (2011), 57–66.
[BLM+06] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D-U. Hwang, Complex
networks : Structure and dynamics, Phys. Rep. 424 (2006), no. 4-5, 175–308.
[BLT+11] L. Bengtsson, X. Lu, A. Thorson, R. Garfield, and J. von Schreeb, Improved
response to disasters and outbreaks by tracking population movements with
mobile phone network data: A post-earthquake geospatial study in haiti, PLoS
Med 8 (2011), no. 8, e1001083.
[BMBL14] J. Borondo, A. J. Morales, R. M. Benito, and J. C. Losada, Mapping the on-
line communication patterns of political conversations, Physica A: Statistical
Mechanics and its Applications 414 (2014), 403–413.
[BMLB12] J. Borondo, A. J. Morales, J. C. Losada, and R. M. Benito, Characterizing
and modeling an electoral campaign in the context of Twitter: 2011 Spanish
Presidential election as a case study., Chaos 22 (2012), no. 2, 023138.
[BMZ11] J. Bollen, H. Mao, and X-J Zeng, Twitter mood predicts the stock market., J.
Comput. Science 2 (2011), no. 1, 1–8.
[BS09] E. Bullmore and O. Sporns, Complex brain networks: graph theoretical anal-
ysis of structural and functional systems, Nature Reviews Neuroscience 10
(2009), no. 3, 186–198.
[BTW87] P. Bak, C. Tang, and K. Wiesenfeld, Self-organized criticality. an explanation
of 1/f noise, Physical Review Letters 59 (1987), 381–384.
[BWB11] J. P. Bagrow, D. Wang, and A-L Barabasi, Collective Response of Human
Populations to Large-Scale Emergencies, PLOS ONE 6 (2011), no. 3, e17680.
[BY97] Y. Bar-Yam, Dynamics of complex systems, vol. 213, Addison-Wesley Read-
ing, MA, 1997.
[BYB13] Y. Bar-Yam and M. Bialik, Beyond big data: Identifying important informa-
tion for real world challenges, arXiv in press (2013).
[Cas96] M. Castells, Rise of the network society, 1st ed., Blackwell Publishers, Inc.,
Cambridge, MA, USA, 1996.
[CBBV06] V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, The role of the
airline transportation network in the prediction and predictability of global
epidemics, Proceedings of the National Academy of Sciences of the United
States of America 103 (2006), no. 7, 2015–2020.
[CCG+02] Q. Chen, H. Chang, R. Govindan, S. Jamin, S. Shenker, and W. Willinger,
The origin of power-laws in internet topologies revisited., INFOCOM, 2002.
[CE09] A. Cheng and M. Evans, An in-depth look inside the twitter world.,
http://www.sysomos.com/insidetwitter, 2009.
[CF07] N. A. Christakis and J. H. Fowler, The spread of obesity in a large social
network over 32 years, New England journal of medicine 357 (2007), no. 4,
370–379.
[CF08] N. A. Christakis and J. H. Fowler, The collective dynamics of smoking in a
large social network, New England journal of medicine 358 (2008), no. 21,
2249–2258.
[CFHB+05] R. Criado, J. Flores, B. Hernandez-Bermejo, J. Pello, and M. Romance, Ef-
fective measurement of network vulnerability under random and intentional
attacks, Journal of Mathematical Modelling and Algorithms 4 (2005), no. 3,
307–316.
[CFMF13] M. Conover, E. Ferrara, F. Menczer, and A. Flammini, The digital evolution
of occupy wall street, PLOS ONE 8 (2013), no. 5, e64679.
[CGFM12] M. D. Conover, B. Goncalves, A. Flammini, and F. Menczer, Partisan asym-
metries in online political activity, EPJ Data Science 1 (2012), no. 1, 1–19
(English).
[CGW+08] J. Candia, M. C. Gonzalez, P. Wang, T. Schoenharl, G. Madey, and A.-L.
Barabasi, Uncovering individual and collective human dynamics from mobile
phone records, Journal of Physics A 41 (2008), no. 22, 224015.
[CH03] R. Cohen and S. Havlin, Scale-Free Networks Are Ultrasmall, Phys. Rev. Lett.
90 (2003), 058701.
[CHBG10] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi, Measuring user
influence in Twitter: The million follower fallacy, 4th International AAAI
Conference on Weblogs and Social Media (ICWSM), 2010.
[Cho13] K. Chodorow, Mongodb: the definitive guide, O’Reilly Media, Inc., 2013.
[Com11] Inc. ComScore, Social networking on-the-go: U.s. mobile social media audi-
ence grows 37 percent in the past year, Tech. report, 2011.
[Cou12] Digital Policy Council, World leader rankings on twitter, Research Note, 2012.
[CPRVP09] R. Criado, J. Pello, M. Romance, and M Vela-Perez, A node-based multiscale
vulnerability of complex networks, International Journal of Bifurcation and
Chaos 19 (2009), no. 02, 703–710.
[CRF+11] M. D. Conover, J. Ratkiewicz, M. Francisco, B. Goncalves, A. Flammini, and
F. Menczer, Political polarization on twitter, 2011.
[Cro06] D. Crockford, The application/json media type for javascript object notation
(json), RFC 4627, IETF, 7 2006.
[Cum12] G. Cumming, Understanding the new statistics : effect sizes, confidence in-
tervals, and meta-analysis, Multivariate applications series, Routledge Aca-
demic, London, 2012.
[CV12] Lidia Ceriani and Paolo Verme, The origins of the gini index: extracts from
variabilita e mutabilita (1912) by corrado gini, The Journal of Economic In-
equality 10 (2012), no. 3, 421–443.
[DB13] M. Duggan and J. Brenner, The demographics of social media users, 2012,
vol. 14, Pew Research Center’s Internet & American Life Project, 2013.
[DBM13] C. Doerr, N. Blenn, and P. Mieghem, Lognormal infection times of online
information spread, CoRR abs/1305.5235 (2013).
[DD87] C. J Date and H. Darwen, A guide to the sql standard, vol. 3, Addison-Wesley
New York, 1987.
[DeG74] M. H. DeGroot, Reaching a consensus, Journal of the American Statistical
Association 69 (1974), no. 345, 118–121.
[DG08] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large
clusters, Communications of the ACM 51 (2008), no. 1, 107–113.
[DGL13] P. Dandekar, A. Goel, and D. T. Lee, Biased assimilation, homophily, and
the dynamics of polarization, Proc. Nat. Acad. Sci. (2013).
[Dia90] L. J. Diamond, Three Paradoxes of Democracy, Journal of Democracy 1
(1990), no. 3, 48–60.
[Dia97] J. M. Diamond, Guns, germs, and steel: The fates of human societies, W.W.
Norton, New York, 1997.
[DO14] F. D’Orazio and J. Owens, White paper: How stuff spreads 2: How videos go
viral part 1., Tech. report, 2014.
[Dow57] A. Downs, An economic theory of political action in a democracy, The Journal
of Political Economy (1957), 135–150.
[Dun92] R. I. M. Dunbar, Neocortex size as a constraint on group size in primates,
Journal of Human Evolution 22 (1992), no. 6, 469–493.
[DW07] A. K. Dixit and J. W. Weibull, Political polarization, Proceedings of the
National Academy of Sciences 104 (2007), no. 18, 7351–7356.
[DYB03] G. F. Davis, M. Yoo, and W. E. Baker, The small world of the American
Corporate Elite, 1982-2001, Strategic Organization 1 (2003), 301–326.
[ECd08] Communaute Europeenne and Republique Cote d’Ivoire, Document de strate-
gie pays et programe indicatif national pour la periode 2008-2013, Tech. re-
port, UE, 2008.
[EEBL11] P. Expert, T. S. Evans, V. D. Blondel, and R. Lambiotte, Uncovering space-
independent communities in spatial networks, Proceedings of the National
Academy of Sciences 108 (2011), no. 19, 7663–7668.
[EH02] S. Ellner and D. Hellinger, Venezuelan politics in the Chavez era: Class,
polarization and conflict, Lynne Rienner Publishers, 2002.
[EMC10] N. Eagle, M. Macy, and R. Claxton, Network diversity and economic devel-
opment, Science 328 (2010), no. 5981, 1029–1031.
[EP06] N. Eagle and A. Pentland, Reality mining: sensing complex social systems,
Personal and ubiquitous computing 10 (2006), no. 4, 255–268.
[ER60] P. Erdos and A. Renyi, On the evolution of random graphs, Publication of the
Mathematical Institute of the Hungarian Academy of Sciences, 1960, pp. 17–
61.
[FC08] J. H. Fowler and N. A. Christakis, Dynamic spread of happiness in a large
social network: longitudinal analysis over 20 years in the framingham heart
study, Bmj 337 (2008).
[FFGP10] J. G. Foster, D. V. Foster, P. Grassberger, and M. Paczuski, Edge direc-
tion and the structure of networks, Proceedings of the National Academy of
Sciences 107 (2010), no. 24, 10815–10820.
[FGH12] M. Fernandez, J. Galeano, and C. A. Hidalgo, Bipartite networks provide new
insights on international trade markets, Networks and Heterogeneous Media
7 (2012), no. 3, 399–413.
[FJ90] N. E. Friedkin and E. C. Johnsen, Social influence and opinions, Journal of
Mathematical Sociology 15 (1990), no. 3-4, 193–206.
[For10] S. Fortunato, Community detection in graphs, Physics Reports 486 (2010),
no. 3-5, 75 – 174.
[Fre11] A. Freitez, La emigracion desde Venezuela durante la ultima decada, Temas
de Coyuntura (2011), no. 63, 11–38.
[GA12] D. Gayo-Avello, ”i wanted to predict elections with twitter and all i got was
this lousy paper” – a balanced survey on election prediction using twitter data,
CoRR abs/1204.6441 (2012).
[GAC+10] W. Galuba, K. Aberer, D. Chakraborty, Z. Despotovic, and W. Kellerer,
Outtweeting the twitterers - predicting information cascades in microblogs,
Proceedings of the 3rd conference on Online social networks (Berkeley, CA,
USA), WOSN’10, USENIX Association, 2010, pp. 3–3.
[Gal73] E.H. Galeano, Open veins of latin america: Five centuries of the pillage of a
continent, Modern reader paperback. 308, Monthly Review Press, 1973.
[GG03] M. P. Garcia-Guadilla, Politizacion y polarizacion de la sociedad civil vene-
zolana: Las dos caras frente a la democracia, Espacio Abierto 12 (2003),
no. 001, 31–62.
[GHB08] M. C. Gonzalez, C. A. Hidalgo, and A-L. Barabasi, Understanding individual
human mobility patterns, Nature 453 (2008), no. 7196, 779–782.
[GHKV07] M. C. Gonzalez, H. J. Herrmann, J Kertesz, and T Vicsek, Community struc-
ture and ethnic preferences in school friendship networks, Physica A: Statis-
tical mechanics and its applications 379 (2007), no. 1, 307–316.
[GI95] J. W. Grossman and P. D. F. Ion, On a portion of the well known collaboration
graph, Congressus Numerantium 108 (1995), 129–131.
[GIT09] B. D. Gomperts, M. K. IJsbrand, and P.E.R. Tatham, Copyright, Signal
Transduction (Second Edition), Academic Press, San Diego, second edition
ed., 2009, pp. iv –.
[GJ10a] B. Golub and M. O. Jackson, Naive learning in social networks and the wis-
dom of crowds, American Economic Journal: Microeconomics (2010), 112–
149.
[GJ10b] B. Golub and M. O. Jackson, Using selection bias to explain the observed
structure of internet diffusions, Proc. Nat. Acad. Sci. USA 107 (2010), no. 24,
10833–10836.
[GKK11] V. Gomez, H.J Kappen, and A. Kaltenbrunner, Modeling the structure and
evolution of discussion cascades, Proceedings of the 22nd ACM conference on
Hypertext and hypermedia, ACM, 2011, pp. 181–190.
[GLM01] J. Goldenberg, B. Libai, and E. Muller, Talk of the network: A complex
systems look at the underlying process of word-of-mouth, Marketing Letters
(2001).
[GMSS12] D. Garcia, F. Mendez, U. Serdult, and F. Schweitzer, Political polarization
and popularity in online participatory media: An integrated approach, Pro-
ceedings of the First Edition Workshop on Politics, Elections and Data (New
York, NY, USA), PLEAD ’12, ACM, 2012, pp. 3–10.
[GN02] M. Girvan and M. E. J. Newman, Community structure in social and biological
networks, PNAS 99 (2002), no. 12, 7821–7826.
[GPG12] L. J Gilarranz, J. M Pastor, and J. Galeano, The architecture of weighted
mutualistic networks, Oikos 121 (2012), no. 7, 1154–1162.
[GPV11] B. Goncalves, N. Perra, and A. Vespignani, Modeling users’ activity on twitter
networks: Validation of dunbar’s number, PLoS ONE 6 (2011), no. 8.
[Gra73] M. Granovetter, The Strength of Weak Ties, The American Journal of Soci-
ology 78 (1973), no. 6, 1360–1380.
[Gra78] M. Granovetter, Threshold models of collective behavior, The American Jour-
nal of Sociology 83 (1978), no. 6, 1420–1443.
[GRM+12] P. A. Grabowicz, J. J. Ramasco, E. Moro, J.M. Pujol, and C.M. Eguiluz,
Social features of online networks: The strength of intermediary ties in online
social media, PLoS ONE 7 (2012), no. 1, e29358.
[Hin13] Hinterlaces, Monitor pais, 12 2013.
[HKBH07] CA Hidalgo, B. Klinger, A.L. Barabasi, and R. Hausmann, The product space
conditions the development of nations, Science 317 (2007), no. 5837, 482.
[HL75] R. A. Holley and T. M. Liggett, Ergodic theorems for weakly interacting infi-
nite systems and the voter model., The annals of probability (1975), 643–663.
[HRW09] B. A. Huberman, D. M. Romero, and F. Wu, Social networks that matter:
Twitter under the microscope, First Monday 14 (2009), no. 1.
[HSB+13] B. Hawelka, I. Sitko, E. Beinat, S. Sobolevsky, P. Kazakopoulos, and
C. Ratti, Geo-located twitter as the proxy for global mobility patterns., CoRR
abs/1311.0680 (2013).
[Huc01] R. Huckfeldt, The social communication of political expertise, American Jour-
nal of Political Science (2001), 425–438.
[HW09] H-B Hu and X-F Wang, Disassortative mixing in online social networks, EPL
(Europhysics Letters) 86 (2009), no. 1, 18003.
[HZGMBY13] A. Herdagdelen, W. Zuo, A.S. Gard-Murray, and Y. Bar-Yam, An exploration
of social identity: The geography and politics of news-sharing communities in
twitter, Complexity 19 (2013), 10–20.
[ICO12] ICCO International Cocoa Organization, Annual report 2011/2012, Tech. re-
port, ICCO, 2012.
[IE11a] J. L. Iribarren and E. Moro, Affinity paths and information diffusion in social
networks, Social Networks 33 (2011), no. 2, 134–142.
[IE11b] J. L. Iribarren and E. Moro, Branching dynamics of viral information spread-
ing, Phys. Rev. E 84 (2011), 046116.
[IFfAD09] IFAD International Fund for Agriculture Development, Enabling poor rural
people to overcome poverty in the Bolivarian Republic of Venezuela, 2009.
[IJBZ08] I. J. Benczik, S. Z. Benczik, B. Schmittmann, and R. K. P. Zia, Lack of con-
sensus in social systems, Europhys. Lett 82 (2008), 48006.
[Jac10] M. O. Jackson, Social and economic networks, Princeton University Press
(2010).
[JCZB06] P. F. Jonsson, T. Cavanna, D. Zicha, and P. A. Bates, Cluster analysis of
networks generated through homology: automatic identification of important
protein communities involved in cancer metastasis., BMC Bioinformatics 7
(2006), 2.
[JKKK12] H.H Jo, M. Karsai, J. Kertesz, and K. Kaski, Circadian pattern and burstiness
in mobile phone communication, New Journal of Physics 14 (2012), no. 1,
013055+.
[JMBO01] H. Jeong, S.P. Mason, A.-L. Barabasi, and Z.N. Oltvai, Lethality and central-
ity in protein networks, Nature 411 (2001).
[JSFT09] A. Java, X. Song, T. Finin, and B. Tseng, Why we twitter: An analysis of a
microblogging community, Advances in Web Mining and Web Usage Analysis,
Springer, 2009, pp. 118–138.
[Kaw13] T. Kawamoto, A stochastic model of the tweet diffusion on the Twitter net-
work, Physica A: Statistical Mechanics and its Applications (2013).
[KEH10] A. Kapoor, N. Eagle, and E. Horvitz, People, quakes, and communications:
Inferences from call dynamics about a seismic event and its influences on
a population., AAAI Spring Symposium: Artificial Intelligence for Develop-
ment, AAAI, 2010.
[Kel58] H. C. Kelman, Compliance, identification, and internalization: Three pro-
cesses of attitude change, Journal of conflict resolution (1958), 51–60.
[KKK02] L. Kullmann, J. Kertesz, and K. Kaski, Time-dependent cross-correlations
between different stock returns: A directed network of influence, Phys. Rev.
E 66 (2002), 026125.
[KKT03] D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the spread of influ-
ence through a social network, KDD ’03: Proceedings of the ninth ACM
SIGKDD international conference on Knowledge discovery and data mining,
ACM Press, 2003, pp. 137–146.
[KLPM10] H. Kwak, C. Lee, H. Park, and S. Moon, What is twitter, a social network or
a news media?, WWW ’10: Proceedings of the 19th international conference
on World wide web (New York, NY, USA), ACM, 2010, pp. 591–600.
[KM27] W. O. Kermack and Ag McKendrick, A Contribution to the Mathematical
Theory of Epidemics, Proceedings of the Royal Society of London. Series A,
Containing Papers of a Mathematical and Physical Character 115 (1927),
no. 772, 700–721.
[KOS11] A. S. King, F. J. Orlando, and D. B. Sparks, Ideological Extremity and Pri-
mary Success: A Social Network Approach, 2011 MPSA Conference (2011).
[Kra00] U. Krause, A discrete nonlinear and non-autonomous model of consensus
formation, Communications in difference equations (2000), 227–236.
[Kra09] D. Krackhardt, A plunge into networks, Science 326 (2009), 47–48.
[KSA+10] M. Kolar, L. Song, A. Ahmed, E. P. Xing, et al., Estimating time-varying
networks, The Annals of Applied Statistics 4 (2010), no. 1, 94–123.
[KSESM12] K. Klemm, M.A. Serrano, V.M. Eguiluz, and M. San-Miguel, A measure of
individual role in collective dynamics, Scientific Reports 2 (2012), no. 292.
[LBP13] S.Y Liu, A. Baronchelli, and N. Perra, Contagion dynamics in time-varying
metapopulation networks, Physical Review E 87 (2013), no. 3, 032805.
[LeB96] LeBon, G., The Crowd: A Study of the Popular Mind, New York Macmillan
Co., 1896.
[Lew09] M. P. Lewis, Ethnologue: Languages of the world, 16 ed., SIL International,
2009.
[LGRC12] J. Lehmann, B. Goncalves, J. J. Ramasco, and C. Cattuto, Dynamical classes
of collective attention in twitter, Proceedings of the 21st international confer-
ence on World Wide Web (New York, NY, USA), WWW ’12, ACM, 2012,
pp. 251–260.
[LNK07] D. Liben-Nowell and J. Kleinberg, The link-prediction problem for social net-
works, Journal of the American society for information science and technology
58 (2007), no. 7, 1019–1031.
[LNL94] B. Latane, A. Nowak, and J. H Liu, Measuring emergent social phenomena:
Dynamism, polarization, and clustering as order parameters of social systems,
Behavioral science 39 (1994), no. 1, 1–24.
[LPA+09] D. Lazer, A. Pentland, L. Adamic, S. Aral, A-L Barabasi, D. Brewer,
N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King,
M. Macy, D. Roy, and M. Van Alstyne, Social science: Computational social
science, Science 323 (2009), no. 5915, 721–723.
[LSAA11] A. Livne, M. P. Simmons, E. Adar, and L. A. Adamic, The party is over
here: Structure and content in the 2010 election, ICWSM (Lada A. Adamic,
Ricardo A. Baeza-Yates, and Scott Counts, eds.), The AAAI Press, 2011.
[Lup10] N. Lupu, Who votes for chavismo?: Class voting in Hugo Chavez’s Venezuela,
Latin American Research Review 45 (2010), no. 1, 7–32.
[Lus03] D. Lusseau, The emergent properties of a dolphin social network, Proceedings
of the Royal Society of London. Series B: Biological Sciences 270 (2003),
no. Suppl 2, S186–S188.
[Mac67] J. B. MacQueen, Some methods for classification and analysis of multivariate
observations, Proc. of the fifth Berkeley Symposium on Mathematical Statis-
tics and Probability (L. M. Le Cam and J. Neyman, eds.), vol. 1, University
of California Press, 1967, pp. 281–297.
[MBLB14] A. J. Morales, J. Borondo, J. C. Losada, and R. M. Benito, Efficiency of hu-
man activity on information spreading on Twitter, Social Networks 39 (2014),
1–11.
[MBLBss] A. J. Morales, J. Borondo, J.C. Losada, and R.M. Benito, Measuring Politi-
cal Polarization: Twitter shows the two sides of Venezuela, Chaos (2014, In
press).
[MCB+13] A. J. Morales, W. Creixell, J. Borondo, J.C. Losada, and R.M. Benito, Under-
standing Ethnical Interactions in Ivory Coast, 3rd International Conference
on the Analysis of Mobile Phone Datasets, 2013.
[MCB+ss] A. J. Morales, W. Creixell, J. Borondo, J.C. Losada, and R.M. Benito, Char-
acterizing Ethnic Interactions from Human Communication Patterns in Ivory
Coast, Networks and Heterogeneous Media (2014, In press).
[MFMFM13] B. Moumni, V. Frias-Martinez, and E. Frias-Martinez, Characterizing social
response to urban earthquakes using cell-phone network data: the 2012
Oaxaca earthquake, Proceedings of the 2013 ACM Conference on Pervasive and
Ubiquitous Computing Adjunct Publication, ACM, 2013, pp. 1199–1208.
[MHVB13] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, Unique in the
crowd: The privacy bounds of human mobility, Scientific reports 3 (2013).
[Mil63] S. Milgram, Behavioral study of obedience, Journal of Abnormal and Social
Psychology 67 (1963), no. 4, 371–378.
[Mil11] G. Miller, Social Scientists Wade Into the Tweet Stream, Science 333 (2011),
no. 6051, 1814–1815.
[Mit04] M. Mitzenmacher, A brief history of generative models for power law and
lognormal distributions, Internet Mathematics 1 (2004), no. 2, 226–251.
[ML07] M. Lim, R. Metzler, and Y. Bar-Yam, Global pattern formation and ethnic/cultural
violence, Science 317 (2007).
[MLA+11] A. Mislove, S. Lehmann, Y-Y Ahn, J-P Onnela, and J. N. Rosenquist,
Understanding the demographics of Twitter users, ICWSM (Lada A. Adamic,
Ricardo A. Baeza-Yates, and Scott Counts, eds.), The AAAI Press, 2011.
[MLB12] A. J. Morales, J. C. Losada, and R.M. Benito, Users structure and behavior
on an online social network during a political protest, Physica A: Statistical
Mechanics and its Applications 391 (2012), no. 21, 5244–5253.
[MML10] G. Miritello, E. Moro, and R. Lara, The dynamical strength of social ties in
information spreading, CoRR abs/1011.5367 (2010).
[MMR07] M. Mobilia, A. Petersen, and S. Redner, On the role of zealotry in the voter
model, Journal of Statistical Mechanics: Theory and Experiment 8 (2007),
P08029.
[Mob03] M. Mobilia, Does a single zealot affect an infinite group of voters?, Physical
review letters 91 (2003), no. 2, 028701.
[Mor51] J. L. Moreno, Sociometry, experimental method and the science of society,
Beacon House, Inc., 1951.
[MPLC13] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Carley, Is the sample good enough?
Comparing data from Twitter's Streaming API with Twitter's Firehose, Proceedings
of the 7th International AAAI Conference on Weblogs and Social Media,
The AAAI Press, 2013.
[MPR02] N. McCarty, K. Poole, and H. Rosenthal, Political polarization and income
inequality.
[MR13] M. D. Makowsky and J. Rubin, An agent-based model of centralized institu-
tions, social network technology, and revolution, PLOS ONE 8 (2013), e80380.
[MS02] J. Montoya and R.S. Sole, Small world patterns in food webs, Journal of
Theoretical Biology 214 (2002), no. 3, 405–412.
[MSMA08] R. Dean Malmgren, Daniel B. Stouffer, Adilson E. Motter, and Luís A. N.
Amaral, A Poissonian explanation for heavy tails in e-mail communication,
Proc. Nat. Acad. Sci. USA 105 (2008), no. 47, 18153–18158.
[MV12] E. Minaya and K. Vyas, When Chavez tweets, Venezuelans listen, Wall Street
Journal (April 25, 2012).
[NDXT11] N. P. Nguyen, T. N. Dinh, Y. Xuan, and M. T. Thai, Adaptive algorithms for
detecting community structure in dynamic social networks, INFOCOM, 2011
Proceedings IEEE, IEEE, 2011, pp. 2282–2290.
[New02a] M. E. J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002),
no. 20, 208701.
[New02b] M. E. J. Newman, Spread of epidemic disease on networks, Phys. Rev. E 66
(2002), 016128.
[New03a] M. E. J. Newman, Mixing patterns in networks, Physical Review E 67 (2003),
no. 2, 026126.
[New03b] M. E. J. Newman, The structure and function of complex networks, SIAM
review 45 (2003), no. 2, 167–256.
[New05] M. E. J. Newman, Power laws, Pareto distributions and Zipf ’s law, Contem-
porary Physics 46 (2005), no. 5, 323–351.
[New06] M. E. J. Newman, Modularity and community structure in networks, Proc.
Natl. Acad. Sci. USA 103 (2006), 8577.
[NFB02] M. E. J. Newman, S. Forrest, and J. Balthrop, Email networks and the spread
of computer viruses, Phys. Rev. E 66 (2002), 035101.
[NMR05] M. J. Neely, E. Modiano, and C. E. Rohrs, Dynamic power allocation and
routing for time-varying wireless networks, Selected Areas in Communica-
tions, IEEE Journal on 23 (2005), no. 1, 89–103.
[NP03] M. E. J. Newman and J. Park, Why social networks are different from other
types of networks, Phys. Rev. E 68 (2003), 036122.
[NT12] J. Nigel and F. Toro, Facebook gives a platform to the challenger of Chavez,
2012.
[NWS02] M. E. J. Newman, D. J. Watts, and S. H. Strogatz, Random graph models of
social networks, Proc. Natl. Acad. Sci. USA 99 (2002), no. Suppl 1, 2566–2572.
[OSH+07] J.P. Onnela, J. Saramaki, J. Hyvonen, G. Szabo, D. Lazer, K. Kaski,
J. Kertesz, and A. L. Barabasi, Structure and tie strengths in mobile commu-
nication networks, Proc. Natl. Acad. Sci. USA 104 (2007), no. 18, 7332–7336.
[Pen08] A. Pentland, Honest signals: How they shape our world, The MIT Press,
2008.
[Pen14] A. Pentland, Social physics: How good ideas spread-the lessons from a new
science, Penguin Group (USA) Incorporated, 2014.
[PGPSV12] N. Perra, B. Goncalves, R. Pastor-Satorras, and A. Vespignani, Activity driven
modeling of time varying networks, Scientific reports 2 (2012).
[PMT+14] D. Pastor, A. J. Morales, Y. Torres, J. Bauer, A. Wadhwa, C. Castro-Correa,
A. Calderón-Mariscal, L. Romanoff, J. Lee, A. Rutherford, V. Frias-Martinez,
N. Oliver, E. Frias-Martinez, and M. Luengo-Oroz, Flooding through the lens
of mobile phone activity, IEEE Global Humanitarian Technology Conference
(GHTC), 2014.
[PSR12] A. Pielow, R. Sioshansi, and M. C. Roberts, Modeling short-run electricity
demand with long-term growth rates and consumer price elasticity in
commercial and industrial sectors, Energy 46 (2012), no. 1, 533–540.
[PSV01] R. Pastor-Satorras and A. Vespignani, Epidemic dynamics and endemic states
in complex networks, Phys. Rev. E 63 (2001), 066117.
[PSV02] R. Pastor-Satorras and A. Vespignani, Epidemic dynamics in finite size scale-
free networks, Phys. Rev. E 65 (2002), 035108.
[RB10] M. Rosvall and C. T. Bergstrom, Multilevel compression of random walks on
networks reveals hierarchical organization in large integrated systems, CoRR
abs/1010.0431 (2010).
[Red98] S. Redner, How popular is your paper? an empirical study of the citation
distribution, European Physical Journal B 4 (1998), no. 2, 131–134.
[RFF+10] J. Ratkiewicz, S. Fortunato, A. Flammini, F. Menczer, and A. Vespignani,
Characterizing and modeling the dynamics of online popularity, Physical re-
view letters 105 (2010), no. 15, 158701.
[RGAH11] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman, Influence and
passivity in social media, Proceedings of the ECML/PKDD 2011, 2011.
[RLH11] L. E. C. Rocha, F. Liljeros, and P. Holme, Simulated epidemics in an empirical
spatiotemporal network of 50,185 sexual contacts, PLoS Comp Biol 7 (2011),
e1001109.
[RMFC10] J. N. Rosenquist, J. Murabito, J. H. Fowler, and N. A. Christakis, The spread
of alcohol consumption behavior in a large social network, Annals of Internal
Medicine 152 (2010), no. 7, 426–433.
[RMM+10] K. K. Rachuri, M. Musolesi, C. Mascolo, P. J. Rentfrow, C. Longworth,
and A. Aucinas, EmotionSense: a mobile phones based adaptive platform
for experimental social psychology research, Proceedings of the 12th ACM
international conference on Ubiquitous computing (New York, NY, USA),
Ubicomp ’10, ACM, 2010, pp. 281–290.
[Rou87] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and valida-
tion of cluster analysis, Journal of computational and applied mathematics
20 (1987), 53–65.
[RTU11] D. M. Romero, C. Tan, and J. Ugander, Social-Topical Affiliations: The
Interplay between Structure and Popularity, arXiv:1112.1115 (2011).
[San07] A. Santiago, Modelos Generalizados de Enlace Preferencial en Redes
Complejas Heterogéneas, Ph.D. thesis, Universidad Politécnica de Madrid, 2007.
[SAR08] V. Sood, T. Antal, and S. Redner, Voter models on heterogeneous networks,
Physical Review E 77 (2008), no. 4, 041121.
[SB08] A. Santiago and R. M. Benito, An extended formalism for preferential
attachment in heterogeneous complex networks, Europhysics Letters (2008).
[Sch71] T. C. Schelling, Dynamic models of segregation, J. Math. Sociol. 1 (1971),
no. 2, 143–186.
[SCL00] M. L. Sachtjen, B. A. Carreras, and V. E. Lynch, Disturbances in a power
transmission system, Phys. Rev. E 61 (2000), 4877–4882.
[SEM05] K. Suchecki, V. M. Eguiluz, and M. San Miguel, Voter model dynamics in
complex networks: Role of dimensionality, disorder, and degree distribution,
Physical Review E 72 (2005), no. 3, 036132.
[Sem12] Semiocast, Twitter reaches half a billion accounts, more than 140 millions in
the U.S., WWW page, 2012.
[Sha01] C. E. Shannon, A mathematical theory of communication, ACM SIGMOBILE
Mobile Computing and Communications Review 5 (2001), no. 1, 3–55.
[Shi95] R. J. Shiller, Conversation, information, and herd behavior, The American
Economic Review (1995), 181–185.
[Sim62] H. A. Simon, The architecture of complexity, Proceedings of the American
Philosophical Society 106 (1962), no. 6, 467–482.
[SJN+07] C. J. Stam, B. F. Jones, G. Nolte, M. Breakspear, and P. Scheltens,
Small-world networks and functional connectivity in Alzheimer’s disease, Cerebral
Cortex 17 (2007), no. 1, 92–99.
[SNK08] K. Saito, R. Nakano, and M. Kimura, Prediction of information diffusion
probabilities for independent cascade model, KES (3) (Ignac Lovrek,
Robert J. Howlett, and Lakhmi C. Jain, eds.), Lecture Notes in Computer
Science, vol. 5179, Springer, 2008, pp. 67–75.
[SOM10] T. Sakaki, M. Okazaki, and Y. Matsuo, Earthquake shakes Twitter users:
real-time event detection by social sensors, Proceedings of the 19th International
Conference on World Wide Web (New York, NY, USA), WWW ’10, ACM,
2010, pp. 851–860.
[SS12] P. Sobkowicz and A. Sobkowicz, Two-year study of emotion and communica-
tion patterns in a highly polarized political discussion forum, Social Science
Computer Review 30 (2012), no. 4, 448–469.
[TL13] G. Tang and F. L. F. Lee, Facebook use and political participation: The impact
of exposure to shared political information, connections with public political
actors, and network structural heterogeneity, Social Science Computer Review
31 (2013), no. 6, 763–773.
[TUGB12] J. L. Toole, M. Ulm, M. C. Gonzalez, and D. Bauer, Inferring land use from
mobile phone activity, Proceedings of the ACM SIGKDD international work-
shop on urban computing, ACM, 2012, pp. 1–8.
[UBMK12] J. Ugander, L. Backstrom, C. Marlow, and J. Kleinberg, Structural diversity
in social contagion, Proceedings of the National Academy of Sciences 109
(2012), no. 16, 5962–5966.
[UH03] UN-HABITAT, The challenge of slums - global report on human settlements
2003, Tech. report, UN, 2003.
[VDAVH04] W. Van-Der-Aalst and K. M. Van-Hee, Workflow management: models, meth-
ods, and systems, MIT press, 2004.
[VH86] E. Von Hippel, Lead users: a source of novel product concepts, Management
science 32 (1986), no. 7, 791–805.
[Wat02] D. J. Watts, A simple model of global cascades on random networks, Proceed-
ings of the National Academy of Sciences 99 (2002), no. 9, 5766–5771.
[Wat04] D. J. Watts, The "new" science of networks, Annual Review of Sociology 30
(2004), 243–270.
[WET+12] A. Wesolowski, N. Eagle, A. J. Tatem, D. L. Smith, A. M. Noor, R. W. Snow,
and C. O. Buckee, Quantifying the Impact of Human Mobility on Malaria,
Science 338 (2012), no. 6104, 267–270.
[WF01] A. Wagner and D. A. Fell, The small world inside large metabolic networks,
Proc R Soc Lond B Biol Sci 268 (2001), no. 1478, 1803–1810.
[WG75] H. W. Watson and F. Galton, On the probability of the extinction of families,
The Journal of the Anthropological Institute of Great Britain and Ireland 4
(1875), 138–144.
[WH04] D. M. Wilkinson and B. A. Huberman, A method for finding communities of
related genes, Proc. Natl. Acad. Sci. USA 101 (2004), no. Suppl 1, 5241–5248.
[WHAT04] F. Wu, B. A. Huberman, L. A. Adamic, and J. R. Tyler, Information flow
in social groups, Physica A: Statistical Mechanics and its Applications 337
(2004), no. 1-2, 327–335.
[Whi09] T. White, Hadoop: the definitive guide, O’Reilly Media, Inc., 2009.
[WRB06] S. Wuchty, E. Ravasz, and A-L Barabasi, The architecture of biological net-
works, Complex systems science in biomedicine, Springer, 2006, pp. 165–181.
[WS98] D. J. Watts and S. H. Strogatz, Collective dynamics of ‘small-world’ networks,
Nature 393 (1998), no. 6684, 440–442.
[WWT+11] D. Wang, Z. Wen, H. Tong, C-Y Lin, C. Song, and A-L Barabasi,
Information spreading in context, Proceedings of the 20th International Conference on
World wide web (New York, NY, USA), WWW ’11, ACM, 2011, pp. 735–744.
[XC05] J. Xu and H. Chen, Criminal network analysis and visualization, Communi-
cations of the ACM 48 (2005), no. 6, 100–107.
[XLZ+12] F. Xiong, Y. Liu, Z-J Zhang, J. Zhu, and Y. Zhang, An information diffusion
model based on retweeting mechanism for online social media, Physics Letters
A 376 (2012), no. 30–31, 2103–2108.
[YL11] J. Yang and J. Leskovec, Patterns of temporal variation in online media,
Proceedings of the Fourth ACM International Conference on Web Search and
Data Mining (New York, NY, USA), WSDM ’11, ACM, 2011, pp. 177–186.
[ZCH+12] Z. D. Zhao, S. M. Cai, J. Huang, Y. Fu, and T. Zhou, Scaling behavior of
online human activity, EPL (Europhysics Letters) 100 (2012), no. 4, 48004.
[ZFT+08] Y. Zhang, A. J. Friend, A. L. Traud, M. A. Porter, J. H. Fowler, and P. J.
Mucha, Community structure in congressional cosponsorship networks, Phys-
ica A: Statistical Mechanics and its Applications 387 (2008), no. 7, 1705–
1712.