ANÁLISIS Y MODELIZACIÓN DE LA DINÁMICA EMERGENTE...

UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS AGRÓNOMOS

ANÁLISIS Y MODELIZACIÓN DE LA DINÁMICA EMERGENTE DURANTE EL PROCESO DE DIFUSIÓN DE INFORMACIÓN EN LAS REDES SOCIALES DE INTERNET

ALFREDO JOSE MORALES GUZMAN

Ingeniero en Telecomunicación

Máster en Física de Sistemas Complejos

TESIS DOCTORAL

2014

GRUPO DE SISTEMAS COMPLEJOS

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS AGRÓNOMOS

ANALYZING AND MODELING THE EMERGENT DYNAMICS DURING THE INFORMATION DIFFUSION PROCESS ON INTERNET SOCIAL NETWORKS

ALFREDO JOSE MORALES GUZMAN

Telecommunications Engineer

MSc in Physics of Complex Systems

Advisor:

ROSA MARIA BENITO ZAFRILLA

PhD in Chemistry Sciences

2014

To my mother Kalena, for being my example

ACKNOWLEDGMENTS

First of all, I would like to thank Dr. Rosa María Benito Zafrilla for her tireless work as the advisor of this thesis. Over these years, with great patience and tenacity, she has firmly taught me the craft of scientific research and the standards of excellence. In particular, I will be forever grateful to her for giving me that first opportunity which, without my realizing it, changed the course of my life for good.

I would also like to thank my professors, collaborators and colleagues at the Grupo de Sistemas Complejos of the Universidad Politécnica de Madrid. Without their teaching, contributions, advice and support, the work carried out over these years would not have been the same. With special affection I would like to mention professors Juan Carlos Losada, Werner Creixell (visiting), Javier Galeano, Ramón Alonso, Miguel A. Porras and Ana Tarquis, as well as my laboratory colleagues Javier Borondo, Fabio Revuelta, Izaskun Oregui, Pedro Benítez, Henar Hernández, Johan Martínez and Maxi Fernández. I must also thank the Universidad Politécnica de Madrid for awarding me the UPM-BSCH scholarship, without which completing this work would have been impossible.

Likewise, I would like to thank the members of the New England Complex Systems Institute, where I had the pleasure of completing a research stay. In particular, I would like to thank Prof. Yaneer Bar-Yam for giving me the opportunity to collaborate with the institute, as well as Prof. Hiroki Sayama for his contributions to the research. I would also like to mention my colleagues there: Debra Gorfine, Francisco Prieto, Joe Norman, Maya Bialik, Vaibhav Vavilala, Molly Wexler-Romig, Vincent Wong, Lili and Katriel Friedman.

I also want to thank my collaborators at United Nations Global Pulse, Telefónica Digital and the Centro de Innovación en Tecnología para el Desarrollo Humano of the Universidad Politécnica de Madrid for giving me the opportunity to work with them and learn from them on a joint project. In particular, I want to thank and mention Miguel A. Luengo-Oroz, David Pastor, Yolanda Torres, Vanessa Frías-Martínez and Enrique Frías-Martínez.


In addition, I want to thank all the people, friends and family who accompanied me during this long journey. First, I want to mention my father, parents-in-law, siblings, siblings-in-law, grandmother, aunts, nephews and nieces, and cousins, whose unconditional affection gave me the strength I needed to set out on this path. I also want to thank my lifelong friends Zhandra, Edu, Patricia, Sergio, Laura, Andrei, Iuri, César and Carolina, whose support and company made the journey more enjoyable.

Finally, I want to give my absolute thanks to my wife, Vanessa Pechiaia, honorary coauthor of this thesis. Her inexhaustible support and love were the foundation on which this work was built. To her, my deepest gratitude for making this another of the best stages of my life. Lastly, I am honored to say that this work is dedicated to my mother, the fundamental pillar of my life. She was the first person to encourage me to take this path and to give me her complete confidence that I would travel it successfully. Lacking words capable of expressing my profound admiration, I will be eternally grateful to her for being my role model and a constant source of inspiration.

From the bottom of my heart, thank you all.


SUMMARY

In our daily activity, today's society constantly interacts through electronic devices and telecommunication services, such as the telephone, e-mail, bank transactions or online social networks. Without knowing it, we leave massive traces of our activity in the databases of service providers. These new data sources are large enough to allow patterns of human behavior to be observed at large scales. As a result, there has been a recent and unprecedented explosion of studies of social systems driven by data analysis and computational processes.

In this thesis we develop computational and mathematical methods to analyze social systems through the combined study of data derived from human activity and the theory of complex networks. Our goal is to characterize and understand the systems that emerge from social interactions in new technological spaces, such as the social network Twitter and mobile telephony. We analyze these systems by constructing complex networks and time series, studying their structure, functioning and evolution over time. We also investigate the nature of the observed patterns through the mechanisms that govern the interactions between individuals, and we measure the impact of critical events on the behavior of the system. To this end, we have proposed models that explain the global structures and the emergent dynamics with which information flows through the system.

For the studies of the social network Twitter, we have based our analyses on specific conversations, such as political protests, major events or electoral processes. From the messages of the conversations, we identify the participating users and build networks of interactions among them. Specifically, we build one network to represent who receives whose messages and another to represent who propagates whose messages. In general, we have found that these structures have complex properties, such as explosive growth and scale-free degree distributions. Based on the topology of these networks, we have identified three types of users that determine the flow of information according to their activity and influence.

To measure the influence of users in the conversations, we have introduced a new measure called user efficiency. Efficiency is defined as the number of retransmissions obtained per message sent, and it measures the effect of individual efforts on the collective reaction. We have observed that the distribution of this property is ubiquitous across several Twitter conversations, regardless of their size or context. We therefore suggest that there is universality in the relationship between individual efforts and collective reactions on Twitter. To explain the factors that determine the emergence of the efficiency distribution, we have developed a computational model that simulates the propagation of messages on the Twitter social network, based on the mechanism of independent cascades. This model allows us to measure the effect on the efficiency distribution of both the topology of the underlying social network and the way in which users send messages. The results indicate that the emergence of a select group of highly efficient users depends on the heterogeneity of the underlying network and not on individual behavior.
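As a rough illustration of the independent cascade mechanism mentioned above, the following Python sketch propagates a single message over a toy followers graph. The function name `independent_cascade`, the `followers` dictionary and the single retransmission probability `p` are assumptions made for this example only; the model actually developed in the thesis is presented in Chapter 6 and may differ in its details.

```python
import random


def independent_cascade(followers, seed, p, rng):
    """Simulate one message spreading by independent cascades.

    followers: dict mapping each user to the list of users who receive
               that user's messages (hypothetical toy input).
    seed:      user who posts the original message.
    p:         probability that a follower retransmits a received message
               (a single global value, assumed here for simplicity).
    rng:       a random.Random instance, passed in for reproducibility.
    Returns the set of users who posted or retransmitted the message.
    """
    activated = {seed}
    frontier = [seed]
    while frontier:
        next_frontier = []
        for user in frontier:
            for follower in followers.get(user, []):
                # Each newly activated user gets one independent chance
                # to activate each follower that is still inactive.
                if follower not in activated and rng.random() < p:
                    activated.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return activated


if __name__ == "__main__":
    toy_followers = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    print(independent_cascade(toy_followers, "a", 0.5, random.Random(42)))
```

With `p = 1.0` every user reachable from the seed joins the cascade, and with `p = 0.0` only the seed does, which gives two simple sanity checks for the sketch.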

In addition, we have developed techniques to infer the degree of political polarization in social networks. We propose a methodology to estimate opinions in social networks and to measure the degree of polarization in the opinions obtained. We have designed a model in which we study the effect of the opinion of a small group of influential users, called the elite, on the opinions of the majority of users. The model yields a distribution of opinions on which we measure the degree of polarization. We apply our methodology to measure polarization in message diffusion networks during a Twitter conversation in a politically polarized society. The results obtained correspond closely to offline data. With this study, we have shown that the proposed methodology is able to determine different degrees of polarization depending on the structure of the network.

Finally, we have studied human behavior from mobile phone data. On the one hand, we have characterized the impact of natural disasters, such as floods, on collective behavior. We find that communication patterns change abruptly in the areas affected by the catastrophe. We thereby show that the impact on the region could be measured almost in real time and without deploying efforts on the ground. On the other hand, we have studied patterns of human activity and mobility to characterize the interactions between regions of a developing country. We find that the networks of calls and human trajectories have community structures associated with regions and urban centers.


In summary, we have shown that it is possible to understand complex social processes through the analysis of human activity data and the theory of complex networks. Throughout the thesis, we have verified that social phenomena such as influence, political polarization or the reaction to critical events are reflected in the structural and dynamical patterns of the networks built from data on conversations in online social networks or mobile telephony.


ABSTRACT

During our daily routines, we constantly interact with electronic devices and telecommunication services. Without realizing it, we leave massive traces of our activity in the service providers' databases. These new data sources are large enough to enable the observation of human behavioral patterns at large scales. As a result, there has been an unprecedented explosion of data-driven social research.

In this thesis, we develop computational and mathematical methods to analyze social systems by combining the study of human activity data with the theory of complex networks. Our goal is to characterize and understand the systems that emerge from human interactions in new technological spaces, such as the online social network Twitter and mobile phones. We analyze these systems by constructing complex networks and time series, and by studying their structure, functioning and temporal evolution. We also investigate the nature of the observed patterns through the mechanisms that rule the interactions among individuals, as well as the impact of critical events on the system's behavior. For this purpose, we have proposed models that explain the global structures and the emergent dynamics of information flow in the system.

In the studies of the online social network Twitter, we have based our analysis on specific conversations, such as political protests, important announcements and electoral processes. From the messages related to the conversations, we identify the participating users and build networks of interactions among them. Specifically, we build one network to represent who-receives-whose-messages and another to represent who-propagates-whose-messages. In general, we have found that these structures have complex properties, such as explosive growth and scale-free degree distributions. Based on the topological properties of these networks, we have identified three types of user behavior that determine the information flow dynamics through their influence.

In order to measure the users' influence on the conversations, we have introduced a new measure called user efficiency. It is defined as the number of retransmissions obtained per message posted, and it measures the effect of individual activity on collective reactions. We have observed that the probability distribution of this property is ubiquitous across several Twitter conversations, regardless of their size or social context. Therefore, we suggest that there is a universal behavior in the relationship between individual efforts and collective reactions on Twitter. In order to explain the factors that determine the user efficiency distribution, we have developed a computational model that simulates the diffusion of messages on Twitter, based on the mechanism of independent cascades. This model allows us to measure the impact on the emergent efficiency distribution of both the underlying network topology and the way that users post messages. The results indicate that the emergence of an exclusive group of highly efficient users depends on the heterogeneity of the underlying network rather than on individual behavior.
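To make the definition of user efficiency concrete, here is a minimal Python sketch that computes it from a list of messages. The input format (pairs of author and retweeted author) and the function name `user_efficiency` are hypothetical, and the choice to count only original messages in the denominator is an assumption of this example, not necessarily the convention used in the thesis.

```python
from collections import Counter


def user_efficiency(messages):
    """Retransmissions obtained per message posted, for each user.

    messages: iterable of (author, retweeted_author) pairs, where
              retweeted_author is None for an original message
              (hypothetical input format for this sketch).
    Only users who posted at least one original message are returned.
    """
    posted = Counter()     # original messages posted by each user
    retweeted = Counter()  # retransmissions obtained by each user
    for author, source in messages:
        if source is None:
            posted[author] += 1
        else:
            retweeted[source] += 1
    return {user: retweeted[user] / count for user, count in posted.items()}


if __name__ == "__main__":
    toy_messages = [
        ("a", None), ("a", None),            # user a posts two messages
        ("b", "a"), ("c", "a"), ("d", "a"),  # three users retweet a
        ("b", None),                         # user b posts one message
    ]
    print(user_efficiency(toy_messages))
```

In this toy conversation, user `a` obtains three retransmissions from two messages and user `b` obtains none from one message.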

Moreover, we have developed techniques to infer the degree of polarization in social networks. We propose a methodology to estimate opinions in social networks and to measure the degree of polarization in the resulting opinions. We have designed a model to study the effect of the opinions of a small group of influential users, called the elite, on the opinions of the majority of users. The model yields an opinion distribution on which we measure the degree of polarization. We apply our methodology to measure the polarization of graphs built from the message diffusion process during a Twitter conversation in a polarized society. The results are in very good agreement with offline and contextual data. With this study, we have shown that our methodology is capable of detecting several degrees of polarization depending on the structure of the networks.

Finally, we have also inferred human behavior from mobile phone data. On the one hand, we have characterized the impact of natural disasters, such as flooding, on collective behavior. We found that communication patterns are abruptly altered in the areas affected by the catastrophe. We thereby show that the impact of the disaster on the region could be measured almost in real time and without needing to deploy further efforts. On the other hand, we have studied human activity and mobility patterns in order to characterize regional interactions in a developing country. We found that the call and trajectory networks present community structure associated with regional and urban areas.

In summary, we have shown that it is possible to understand complex social processes by analyzing human activity data with the theory of complex networks. Throughout the thesis, we have demonstrated that social phenomena such as influence, polarization and reaction to critical events are reflected in the structural and dynamical patterns of the networks constructed from data on conversations in online social networks and mobile phones.


Contents

1 INTRODUCTION
1.1 Goals
1.2 Organization

2 COMPLEX NETWORKS
2.1 Definitions
2.2 Topological Properties
2.2.1 Degree Distribution
2.2.2 Geodesic Distance
2.2.3 Clustering
2.3 Types of Networks
2.3.1 Regular Networks
2.3.2 Random Networks
2.3.3 Small World Networks
2.3.4 Scale-free Networks
2.4 Community Structure
2.4.1 Detection Algorithms
2.5 Assortativity
2.6 Networks Models
2.6.1 Erdős-Rényi Model
2.6.2 Watts and Strogatz Model
2.6.3 Barabási-Albert Models
2.7 Dynamics on Networks
2.7.1 Disease Contagion
2.7.2 Social Contagion
2.7.3 Cascades on Networks
2.8 Social Networks
2.9 Time Varying Networks

3 COMPUTATIONAL SOCIAL SCIENCE
3.1 Human Activity
3.2 Socio-Technological Networks
3.3 Information Spreading
3.3.1 Models
3.4 Influence and Popularity
3.5 Polarization

4 DIGITAL TRACES AND COMPUTATIONAL METHODS
4.1 From Data to Knowledge
4.1.1 Data Management
4.1.2 Finding Patterns
4.1.3 Statistical Significance
4.2 Twitter Datasets
4.2.1 Data Gathering
4.2.2 Datasets
4.2.3 Representativity
4.3 Mobile Phones Datasets
4.3.1 Datasets
4.4 Additional Sources of Information

5 HUMAN BEHAVIOR DURING POLITICAL MOBILIZATION
5.1 Temporal Behavior
5.2 Individual Behavior
5.3 Followers Network
5.4 Retweets Network
5.5 Degree Assortativity
5.6 Retweet Cascades
5.7 Analysis of User Behavior
5.8 Mesoscale Communities
5.9 Summary

6 EFFICIENCY OF HUMAN ACTIVITY AS A MEASURE OF INFLUENCE
6.1 User Efficiency
6.2 Universality
6.3 Model
6.4 Results
6.5 Analytical Solution
6.6 Summary

7 MEASURING POLITICAL POLARIZATION
7.1 A Model to Estimate Opinions in a Social Network
7.2 A Measure of Polarization
7.3 Study of Polarization on Retweet Networks
7.3.1 Retweets Networks
7.3.2 Elite nodes
7.3.3 Estimating Opinions
7.3.4 Contagion by Influence
7.3.5 Offline Polarization
7.3.6 Discussion
7.4 Summary

8 URBAN COLLECTIVE PATTERNS
8.1 World Activity
8.2 Urban Dynamics
8.3 Dynamical Classes of Behavior
8.4 Summary

9 INFERRING HUMAN BEHAVIOR FROM MOBILE PHONE DATA
9.1 Characterizing Communication and Mobility Patterns in a Developing Country
9.1.1 Context
9.1.2 Characterizing Populated Areas
9.1.3 Ethnic Interactions
9.1.4 Effects of Selectiveness in the Calling Behavior
9.1.5 Summary
9.2 Flooding through the Lens of Mobile Phone Activity
9.2.1 Context
9.2.2 Assessing the Representativeness of CDR data
9.2.3 Population Response to Floods
9.2.4 Summary

10 Conclusions

A User Behavior

B Videos


List of Figures

2.1 Homogeneous vs. power-law distributions. (a) A homogeneous function and a power-law function with $\gamma = 2.1$. Both distributions have $\langle k \rangle = 10$. The curves in (a) are shown on a linear plot and in (b) on a log-log plot. (c) A random network with $\langle k \rangle = 3$ and $N = 50$. (d) A scale-free network with $\langle k \rangle = 3$. Figure adapted from [Bar12].

2.2 Complementary cumulative degree distributions for six different networks. (a) Collaboration network of mathematicians [GI95]; (b) citations between 1981 and 1997 to papers cataloged by the Institute for Scientific Information [Red98]; (c) a 300 million vertex subset of the World Wide Web, circa 1999 [BKM+00]; (d) the Internet at the level of autonomous systems, April 1999 [CCG+02]; (e) the power grid of the western United States [WS98]; (f) the interaction network of proteins in the metabolism of the yeast S. cerevisiae [JMBO01]. Networks (c), (d) and (f) appear to have power-law degree distributions, and (b) has a power-law tail but deviates from power-law behavior at small degree. Network (e) has an exponential degree distribution, and (a) appears to have two separate power-law regimes with different exponents. Figure adapted from [New03b].

2.3 A simple graph with three communities, enclosed by the dashed circles. Figure taken from [For10].

2.4 (a) Schematic of the Watts-Strogatz model. (b) Normalized average shortest path length $L$ and clustering coefficient $C$ as a function of the random rewiring parameter $p$ for the Watts-Strogatz model with $N = 1000$ and $k = 10$. Figure taken from [WS98].


2.5 (A) Degree distribution of networks generated by the Barabási-Albert model in linearly-binned (red symbols) and log-binned (green symbols) versions. The number of edges per new node is $m = 3$. Size of (A) $N = 100{,}000$, (B) $N = 100$, (C) $N = 10{,}000$ and (D) $N = 1{,}000{,}000$. The straight line has slope $\gamma = 3$, corresponding to the degree distribution of the resulting networks. Figure adapted from [Bar12].

2.6 Comparison of disease spreading on homogeneous random graphs and scale-free networks. The fraction of infected nodes displays a distinct phase transition (or epidemic threshold) in the case of a homogeneous random graph, but not for the scale-free network. Figure taken from [Wat04].

2.7 Schematic representation of a cascade on a network. The red and yellow nodes belong to the cascade. The white nodes belong to the network but are not part of the cascade. The cascade layers have been marked in gray.

2.8 Schematic representation of the activity-driven network creation model. Red nodes show the active nodes at each time $T$. The bottom plot represents the final aggregated structure of the network. Figure adapted from [PGPSV12].

3.1 Bursts of individual activity on an e-commerce site. The left panel shows the temporal behavior of four individuals, in which bursts of activity (color stripes) coexist with long periods of inactivity (white periods). The x-axis represents time and the colored lines represent individual actions. The right panel shows the distribution of inter-action waiting times for each of the four users. Figure adapted from [ZCH+12].

3.2 Collective response to a critical event. The top panel shows the emergent networks between affected users at three times during an event. The bottom panel shows the call pattern between the same users a week before the event, indicating that the cascades observed during the event are extraordinary. Figure adapted from [BWB11].

3.3 Emergent networks from the propagation of four videos on Twitter. In panels (A) and (B) the local influential leaders played a remarkable role in the diffusion process, whereas in panels (C) and (D) the influence of hubs was much stronger. Figure adapted from [DO14].


4.1 Temporal evolution of Twitter activity (messages/hour) for the datasets (A) 20N, (B) Egypt, (C) Obama and (D) Chavez, described in Table 4.1. All panels display the impact of events on Twitter activity. All four present a burst of activity when the event takes place, which gradually decreases to previous levels. Panels (A), (B) and (C) have similar patterns despite spanning three orders of magnitude on the y-axis. The envelope curve in panel (D) presents the same pattern on a different time scale: the gradual decrease of activity spans several days. The inset curve shows the activity during the green shaded area on a linear scale.

5.1 Top: Time evolution of the message rate (messages/minute) of the Venezuelan protest #SOSInternetVE. Arrows indicate some of the times when the protest convener participated. Bottom: Time evolution of the accumulated percentage of messages (dashed line) and participant users (solid line).

5.2 Complementary cumulative distribution of the user activity during the Venezuelan protest #SOSInternetVE. The solid line is the fit to an exponentially truncated power law, $P(x > x^*) \propto x^{-\beta} e^{-x/c}$, where $\beta = 0.880 \pm 0.001$ and $c = 65.0 \pm 0.6$ on the last day.

5.3 In-degree (top) and out-degree (bottom) complementary cumulative distributions of the followers network from the Venezuelan protest #SOSInternetVE.

5.4 Scatter plot of in-degree and out-degree of the followers network from the Venezuelan protest #SOSInternetVE. Dots represent users.

5.5 In-strength (top) and out-strength (bottom) complementary cumulative distributions of the retransmission network of the Venezuelan protest #SOSInternetVE. The solid line is the fit to an exponentially truncated power law, $P(S_{out} > S_{out}^*) \propto S_{out}^{-\beta} e^{-S_{out}/c}$, where $\beta = 0.890 \pm 0.002$ and $c = 61.0 \pm 1.2$.

5.6 Complementary cumulative distribution of the edge weights of the retransmission network from the Venezuelan protest #SOSInternetVE.


5.7 Visualization of the retweet network that emerges from message propagation on the followers network. (A) Subgraph of the retweet network (green) superimposed on the corresponding followers network (black), from the #SOSInternetVE dataset. The figure shows a subset of 1000 random nodes (yellow and red). The node size is proportional to the respective in-degree on the followers network. (B, C and D) Example of the formation of the retweet network from independent retweet cascades on an artificial followers network. (B) Two users (red nodes) post independent messages, which are received by their followers (gray). (C) Some users retweet the message (yellow), and the message reaches their followers (gray). (D) The final shape of the cascades on the network, composed only of the activated nodes (red and yellow) connected by the green links. The white nodes and gray links represent the rest of the substratum (followers network) that did not activate. (E) Schema of a single cascade. The black circles mark the cascade layers.

5.8 Statistical properties of retweet cascades. (A) Complementary cumulative density function of the number of users per cascade, (B) cascade depth distribution $P(d)$ and (C) retransmission rate by layer $\lambda_l$ in terms of retweets over followers. The data correspond to the #SOSInternetVE dataset.

5.9 Analysis of user behavior. (A) Scatter plot of retransmissions obtained by each user versus the user's activity, colored by number of followers. (B) Scatter plot of retransmissions obtained by each user versus number of followers, colored by activity. (C) Scatter plot of retransmissions obtained by each user versus the ratio between the number of followers and followees, colored by activity. (D) Scatter plot of retransmissions made by each user versus number of followers, colored by activity. Dots represent users. Data correspond to the #SOSInternetVE dataset.

5.10 Community structure for the follower graph. Circles represent communities of users and their size is proportional to the number of users that belong to the community. Edges represent the inter-community links, either followers (left) or retransmissions (right), and their width is proportional to the number of edges, normalized by the size of the outgoing community. The data correspond to the #SOSInternetVE dataset.

5.11 Community structure for the retransmission graph. Nodes represent

communities and edges represent the inter-community links. Node size is

proportional to the number of people in the community and

edge width is proportional to the number of inter-community links

normalized by the size of the community. The data correspond to the #SOS-

InternetVE dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.1 Scatter plot of the user in degree vs out degree in the followers network, colored

by the respective user efficiency. Dots represent users. Data correspond to

the #SOSInternetVE dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.2 User efficiency probability density function (A) and complementary cumula-

tive density function (B). The red dots correspond to the empirical results,

the black solid line represents the lognormal fit and the black dashed line

represents a power law fit. Quantile-Quantile plot (C) of the user efficiency

distribution, filtered by the in degree in the followers network KFin. The dis-

tributions correspond to the #SOSInternetVE dataset. . . . . . . . . . . . . 81

6.3 Complementary cumulative density function of the user activity, from sev-

eral Twitter conversations, increasingly ordered according to the number of

messages (A-F): (A) Andreafabra, (B) Gringich, (C) Leones, (D) 20N, (E)

Obama, and (F) Egypt. The black dashed line represents a power law fit and

the red dots correspond to the measured distributions. . . . . . . . . . . . . 84

6.4 Complementary cumulative density function of the retweets obtained by user,

from several Twitter conversations, increasingly ordered according to the num-

ber of messages (A-F): (A) Andreafabra, (B) Gringich, (C) Leones, (D) 20N,

(E) Obama, and (F) Egypt. The black dashed line represents a power law fit

and the red dots correspond to the measured distributions. . . . . . . . . . . 85

6.5 Probability density function of the user efficiency on several Twitter conver-

sations, ordered increasingly according to the number of messages (A-F): (A)

Andreafabra, (B) Gringich, (C) Leones, (D) 20N, (E) Obama, and (F) Egypt.

The properties of these conversations may be found in Table 6.1. The black

solid line represents the lognormal fit, the black dashed line represents a power

law fit and the red dots correspond to the measured distributions. . . . . . . 86

6.6 Model results to the user efficiency distribution (left column) and retweets

gained by user distribution (right column), with the empirical results. The

model has been applied to the followers network from the #SOSInternetVE

dataset (top panel) and the #20N dataset (bottom panel). . . . . . . . . . . 87

6.7 Effects of the underlying network topology on the model results in terms of the

user efficiency distribution (left column) and retweets gained by user distribu-

tion (right column). The model has been applied to the followers network (blue

crosses) and their randomized versions (red x symbols). Two datasets have

been considered: #SOSInternetVE (top panel) and #20N (bottom panel).

In all cases, a heterogeneous initial activity distribution P(A0) ∝ A0^(−1.40) has

been considered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6.8 Effects of the individual user behavior on the model results in terms of the user

efficiency distribution (left column) and retweets gained by user distribution

(right column). The model has been applied to the followers network (blue

crosses) and their randomized versions (red x symbols). Two datasets have

been considered: #SOSInternetVE (top panel) and #20N (bottom panel). In

all cases, a homogeneous activity distribution P(A0) = 1/6, where A0 ∈ [1, 6],

has been considered. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.9 Results from the analytical model of user efficiency, considering cascades up

to three layers of depth in the followers network from the #SOSInternetVE

dataset. Resulting η average (A) and standard deviation (B) from evaluating

the model with 0.2 < P (d > 0) < 1.0 (x-axis) and 0.05 < r0 < 0.3 (color). The

dashed lines indicate the empirical values. (C) Resulting η distribution from

applying the analytical model to the followers network with the empirical

activity distribution P (A0) by setting P (d > 0) = 0.775 and r0 = 0.15.

The white dots represent the empirical distribution of user efficiency and the

triangles represent the distribution obtained from the analytical model. . . . 93

7.1 Schema explaining the proposed polarization index µ. (A) Density distribu-

tion of opinions. gc stands for the gravity center of each pole, A stands for

the population associated to each ideology, and d stands for the pole distance.

(B) Visualization of the polarization index, µ, for three different situations. . 99

7.2 Schema of the influence spreading process in the opinion estimation model.

(A) Displays the seed nodes in the network, colored according to their re-

spective ideology. (B) Displays the network at t = 0, before seeds start to

propagate their influence. (C) Shows the state of the network at t = 1. (D)

shows the state of the network at t = n/2. (E) Displays the final state of the

network at t = n. (F) and (G) Visualizations of two example results

of applying the opinion estimation model to the Venezuelan dataset for nonpolarized

(F) and polarized (G) days. See the video B.1 described in the Appendix B . 101

7.3 Visualization of the retweet network at day D − 29. The Giant Component

has been colored in blue and red, while the rest of components have been

colored in gray. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.4 (Left) Distributions of the components size of the retweet networks from the

Twitter conversation about the Venezuelan President Hugo Chavez for three

days: D − 29, D and D + 20, where D represents the day of the main occur-

rence. (Right) Time evolution of the Giant Component (GC) of the retweets

networks: (A) Ratio between the number of nodes that form the GC and

the number of nodes in the respective networks. (B) Time evolution of the

whole network and GC size in terms of nodes. (C) Relative number of mes-

sages inside Venezuela from the geolocalized users in the GC. The orange

stripe represents the day D and the state funeral period. . . . . . . . . . . . 104

7.5 Visualization of geolocated messages from the Chavez conversation on three

days from different periods: before the announcement (top), during the an-

nouncement (middle), after the announcement (bottom). The dots represent

geolocated messages. The label indicates the day of observation, D being

the day of the announcement. . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.6 Evolution of the topological properties of the retweet networks emergent at

each day of the observation period, in terms of: (A) Out strength comple-

mentary cumulative distribution, (B) In strength complementary cumulative

distribution, (C) Gini index evolution of the strength distributions. (D) Di-

rected degree assortativity evolution. The orange stripe represents the day of

the main occurrence. In A and B, the blue curves correspond to the first days

and the red curves correspond to the last days. . . . . . . . . . . . . . . . . . 107

7.7 Conditioned probability density function of the accumulated in-strength (Sin)

given the participation rate (ρ), from the Twitter conversation about the

Venezuelan President Hugo Chavez. The color corresponds to the density of

users. The red line indicates the average accumulated in-strength value Sin

for a given participation rate ρ. . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.8 Adjacency matrices (top) and corresponding visualization (bottom) of the

considered elite networks. (A) Corresponds to the seed with Sin ≥ 10000

and ρ ≥ 0. (B) Corresponds to the seed with Sin ≥ 1000 and ρ ≥ 0.89.

(C) Corresponds to the seed with Sin ≥ 10 and ρ ≥ 0.82. Nodes have been

sorted in ascending order of their opinions Xs. The color indicates the

average value of the node’s opinions Xij at both sides of the edge i− j. . . . 112

7.9 Visualization of two cases of possible retweet networks and expected outcomes.

The top row represents a polarized case and the bottom row represents a

nonpolarized case. Panels A and E show the position of the elite nodes,

colored in each network. Panels B and F show the respective networks,

coloring the nodes with their estimated opinion. Panels C and G show the

opinion adjacency matrices AXij. The colored dots in the matrices represent

interactions: blue and red dots indicate interactions within the same group;

pale blue and yellow dots indicate interactions across groups. Nodes have

been sorted in ascending order of their estimated opinion Xi. Panels D

and H represent the resulting opinion distributions. . . . . . . . . . . . . . . 113

7.10 Time evolution of estimated opinions (Xi) probability density functions (p(X))

for the Venezuelan conversation. These distributions respectively result from

applying the model to the retweet networks using the elites No. 1 (top panel),

No. 2 (middle panel) and No. 3 (bottom panel) described in section 7.3.2. La-

bels indicate the day of observation, D standing for the day of the President’s

death. Colors indicate the number of participants. . . . . . . . . . . . . . . . 115

7.11 Time evolution of the polarization index µ (C), and the variables associated

with it: pole distance d (B) and the difference in population sizes (A) for

the Venezuelan conversation in the undirected version of the networks. The

magenta line represents the average of the results from applying the model

with the three elite users from section 7.3.2. The gray shadow shows the

standard deviation. The orange stripe indicates the day of main event. . . . 116

7.12 Time evolution of the statistical properties of the Xi distribution in terms of

(A) Average, (B) Standard deviation and (C) Kurtosis. The orange stripe

represents the day of the main occurrence (D) and the state funeral period.

The magenta line represents the average of the results from applying the model

with the three elite users from section 7.3.2. The gray shadow represents the

standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

7.13 Time evolution of the opinion adjacency matrices AXij from the Twitter

conversation about the Venezuelan President Hugo Chavez. Nodes have been

plotted in ascending order according to their estimated opinion Xi. The label

indicates the day of observation (from D− 29 to D+ 26). The color indicates

the average value of the node’s opinions at both sides of the edge i− j. . . . 119

7.14 Effects of rewiring edges in the results of the opinion estimation model. Time

evolution of estimated opinion (Xi) cumulative probability density functions

(CDF) resulting from the opinion estimation model to the undirected networks

(solid) and corresponding rewired versions (dashed). The label indicates the

day of observation (from D−29 to D+26). Columns are ordered from Monday

to Sunday. The labels indicate the corresponding day of observation, from

D − 29 to D + 26, D being the day of the President’s death announcement.

The distributions for the rewired networks represent the average over 200

realizations. These curves correspond to the results from applying the model

with the elite No. 3 described in 7.3.2. . . . . . . . . . . . . . . . . . . . . . 121

7.15 Time evolution of the estimated opinions (Xi) probability density functions

(p(X)) for the Venezuelan conversation. Labels indicate the day of observa-

tion, D standing for the day of the President’s death. Colors indicate the

number of participants. These curves are the average of the results from

applying the model with the three elite users from section 7.3.2. . . . . . . . 122

7.16 Time evolution of the polarization index µ (C), and the variables associated

with it: the pole distance d (B) and the difference in population sizes (A) for

the Venezuelan conversation. The magenta line represents the average of the

results from applying the model with the three elite users from section 7.3.2.

The gray shadow shows the standard deviation. . . . . . . . . . . . . . . . . 123

7.17 Effects of edges’ direction in the results of the opinion estimation model. Time

evolution of estimated opinion (Xi) cumulative probability density functions

(CDF) resulting from the opinion estimation model on the directed network

(solid) and undirected network (dashed). The label indicates the day of ob-

servation (from D − 29 to D + 26). Columns are ordered from Monday to

Sunday. The color indicates the kurtosis values of the distributions. The la-

bels indicate the corresponding day of observation, from D − 29 to D + 26,

D being the day of the President’s death announcement. These curves are

the average of the results from applying the model with the three elite users

from section 7.3.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

7.18 Electoral polarization in Venezuela. Distribution of voting stations

according to the winning party and the location of the station, based on the 2013

Venezuelan Presidential elections. . . . . . . . . . . . . . . . . . . . . . . . . 126

7.19 Mass of tweets in the city of Caracas. Contour levels (from inside to outside

0.25, 0.20, 0.15, 0.10) represent the mass of tweets identified as in favor of

the government (red) and against it (blue). Areas bordered in green

correspond to the five municipalities that make up the city. White regions display

unpopulated areas, yellow regions represent populated areas and pink regions

correspond to the informal and poorer neighborhoods (slums). The label color

indicates the ruling party at each municipality, according to the 2013 Venezue-

lan local elections: red represents the officialism party at Libertador and blue

indicates opposition parties at Chacao, Sucre, Baruta and El Hatillo. . . . . 129

8.1 World Twitter Activity. Geographical density of Twitter activity (number of

tweets) during one average day in logarithmic scale. Red and orange indicate

a high concentration of activity, while blue and green indicate a lower concen-

tration of tweets, and black indicates the absence of activity. Insets: Average

week of Twitter activity on several cities (ac,d(t)). . . . . . . . . . . . . . . . 132

8.2 Temporal behavior of 52 cities across all continents. Series represent the

representative week of Twitter activity for each city (ac,i(t)). Color indicates

the result of the clustering classifier. . . . . . . . . . . . . . . . . . . . . . . . 136

8.3 Clustering of cities according to their temporal behavior. Colors indicate

the results of k-means clustering algorithm. Axes correspond to collapsed

dimensions using multidimensional-scaling algorithms. On the top panel we

show the average behavior of each class (from A to C). We have respectively

marked the morning and afternoon peaks of activity with a red x symbol and

a circle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

9.1 Ethno-linguistic map of Ivory Coast. Figure adapted from [Lew09] . . . . . 141

9.2 Mapping the community structure of the trajectories network of Ivory Coast.

Antennas represent nodes and are plotted in different colors and shapes,

according to the community they belong to, as obtained from the community detection

algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

9.3 Mapping the structure of the trajectories network on the Ivory Coast geo-

graphical map. The blue lines represent the edges of the network and their

width is proportional to the edge weight. The main roads of

Ivory Coast are superimposed as red lines. The locations of the country’s

main cities are marked with black circles. . . . . . . . . . . . . . . . . . . . . 143

9.4 Mapping the closeness-centrality property of the trajectories network in Ivory

Coast. The edges have been colored according to the closeness centrality mean

value of the two connected nodes. The red regions indicate higher closeness-

centrality, the yellow and pale blue regions indicate medium centrality, and

the dark blue regions indicate lower closeness-centrality. . . . . . . . . . . . . 144

9.5 Mapping the linguistic identity of the trajectories network of Ivory Coast.

The edges have been colored according to the linguistic group to which the

most connected antenna in each community belongs. There are four major

linguistic families represented in yellow (northwest), purple (northeast), green

(southwest) and blue (southeast). Black circles indicate the location of the

major cities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

9.6 Normalized adjacency matrices of the calls network corresponding to the com-

munity structure from the trajectories network (A), ethnic group aggregation

(B) and linguistic family aggregation (C). Assortativity coefficient of selec-

tiveness to call on local scale (community), subregional scale (ethnic group)

and regional scale (linguistic family) (D). . . . . . . . . . . . . . . . . . . . . 147

9.7 Scatter plot of intra linguistic family flux (calls directed to an antenna in the

same linguistic family as the emitter antenna) versus inter linguistic family flux

(calls directed to an antenna in a different linguistic family than the emitter

antenna). Symbols represent communities from the trajectories network and

the color indicates the linguistic family to which the community belongs. The

dashed line has slope 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

9.8 Mapping the community structure of the calls network of Ivory Coast.

Antennas represent nodes and are plotted in different colors and shapes, according

to the community they belong to, as obtained from the community detection algorithm. 150

9.9 Mapping the classification results of antennas according to the way the calls

network communities are related. A k-means clustering classifier has been

applied to the community structure of the calls network. . . . . . . . . . . . 150

9.10 Left: Visualization of the precipitation data obtained from the NASA TRMM

on November 2nd, 2009. The red square encloses the observed region. Right:

Accumulated rainfall during the first two weeks of November 2009 (jet

colormap) over the Tabasco area. The flood segmentation is shown by the white

shade. The area corresponds to the red square in the left panel. . . . . . . . . 152

9.11 Left: map of 2010 census (green bars) vs CDR-based population

estimation (purple bars) in several cities of Tabasco (red=affected cities, blue=other

cities) and surroundings. Right: The plot shows linear correlation between

the CDR census and the real census (r-square 0.97). . . . . . . . . . . . . . . 153

9.12 Time evolution of the number of unique users per cell tower x(t). The gray

stripes indicate the Flood and Christmas periods where stronger variations are

observed. The labels at the top-right of each chart indicate the municipality

where the tower is located. Towers have been ordered and colored according

to the maximum degree of variation during floods in decreasing order. . . . . 155

9.13 Scheme of the Antenna Variation metric for cell towers. The black curve

represents the raw signal x(t). The gray stripe indicates the Flood period.

The red line indicates the average value (µBL) of users served during the

Baseline period. The pink stripe indicates the standard deviation (σBL) from

the average value during the Baseline period. The blue line indicates the

deviation from the average value at a given day. Our measure of antenna

variation results from the ratio of the blue line divided by the green line. . . 156

9.14 Time evolution of the Antenna Variation metric (xnorm) for the considered

towers. The gray stripes indicate the Flood and Xmas periods. Color is

proportional to the degree of variation during the flooding period. It can be

noticed that antennas have a spike of activity during the floods (left shadowed

region), as well as during Christmas and New Year’s Eve. . . . . . . . . . . . 157

9.15 Impact Map of Tabasco for the 2009 floods. Circles represent antennas and

their size is proportional to the variation metric during the floods. The dark

blue segmentation represents the flooded region. The color of municipalities

is proportional to the number of affected people. The map shows the most

critical day featuring the highest values of the antenna variation metric. . . . 158

9.16 Distribution of the maximum of the antenna variation metric for the BL period

(gray) and floods (red). The curves show the percentage of antennas (y-axis)

whose maximum variation metric value (xnorm) is higher than a given value

(x-axis). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

9.17 Top: Antenna variation metric (red) vs the precipitation level (blue) for the

six hottest antennas (A to F). The dashed line shows the emergency warning

date as notified in the news. Bottom: Map featuring the position and date

(e.g. 6N is 6th November) where the maximum of the antenna variation metric

was observed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

A.1 Analysis of the user behavior. (A) Scatter plot of retransmissions obtained







correspond to the 20N dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 172

A.2 Analysis of the user behavior. (A) Scatter plot of retransmissions obtained







correspond to the ETA dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 173

B.1 Evolution of the opinion estimation model. Nodes are colored according to

their opinion Xi. In principle, all nodes’ opinions are zero; thus, they are

colored in white. However, nodes with an opinion below zero are red and

above zero are blue. The elite is hidden in the network and will spread their

opinions iteratively. We see how the network is increasingly colored at each

time step. Because the network is polarized around the elite, the red and blue

colors are not mixed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

B.2 Worldwide Twitter reaction to the announcement of Hugo Chavez’s death.

Each yellow circle represents a geolocated tweet. The video spans a 24 h period.

We show a counter indicating the remaining time before the announcement

and the time after it. It can be noticed that at the moment of the announce-

ment the whole world reacted massively to the news by posting related mes-

sages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

B.3 Worldwide Twitter activity. In this video we present the worldwide Twitter

activity during an arbitrary week. We plot all geolocated tweets as white dots

in the map. It can be noticed that there is a wave of activity from the east

to the west side of the globe as days evolve. Also, it is noticeable that the

activity decreases to its minimum levels during early mornings. . . . . . . . 179

B.4 Human trajectories network evolution in Ivory Coast. In this video, we present

the dynamical growth of the human trajectories network during an arbitrary

day. Dots represent users moving across the country from antenna to antenna.

The edge color is related to the network community to which the target node

belongs. It can be noticed that the network grows in a sparse way, mostly

connecting nodes that are geographically close to each other. Other regions

like the capital city (right bottom) concentrate most of the long distance edges. 180

B.5 Calls network evolution in Ivory Coast. In this video, we present the

dynamical growth of the calls network during a period of 12 hours on an arbitrary

day. Dots represent calls, traveling from one antenna to the other at each

hour. The edge color is related to the network community to which the target

node belongs. It can be noticed that there is an explosion of calls after

6am, showing the dense structure of the network. . . . . . . . . . . . . . . . 180

B.6 Time-lapse of the Tabasco impact map. The video displays the absolute value

of the antenna variation metric from Oct, 2009 to Jan, 2010 as in the temporal

series. Each antenna is represented by a circle with color and size proportional

to the daily metric value. The segmented flooded area has been colored in light

blue. It can be noticed that the antennas near the flooding area dramatically

increased their variation during the floods. This effect is also noticeable during

Christmas and New Year’s Eve, when all antennas present extremely large

variation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

List of Tables

4.1 Description of the studied datasets. . . . . . . . . . . . . . . . . . . . . . . . 51

5.1 Followers and retweet network properties from the Venezuelan protest #SOS-

InternetVE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5.2 Pearson correlation (r) by user of the number of followers (F), retweets (R)

and activity (A). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.3 Main collectives around which each follower community is formed from the

Venezuelan protest #SOSInternetVE. . . . . . . . . . . . . . . . . . . . . . . 74

5.4 Most retransmitted account at each retransmission community from the Venezue-

lan protest #SOSInternetVE. . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.1 Properties of the studied datasets and their resulting user efficiency distribu-

tion properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

7.1 Elite networks topological properties. Sin and ρ columns represent minimum

values. Off. C-node indicates the number of network communities related

to the officialism, and Opp. C-nodes indicates the number of communities

related to the opposition. The numbers in the parentheses indicate the number

of nodes in each pole. Q stands for modularity. r stands for the Pearson

coefficient of mixing patterns by ideology. . . . . . . . . . . . . . . . . . . . . 111

9.1 Properties of the Calls and Human Trajectories Networks. . . . . . . . . . . 142

Chapter 1

INTRODUCTION

Nowadays, we interact on a daily basis with electronic devices and services, such as

mobile phones, e-mail or online social networks. The increasing integration of technological

solutions into people’s lives is certainly affecting the way people relate to each other and,

consequently, the properties of the social system. Historically, the exchange of information

among social groups has influenced and determined the course of events across societies

[Dia97]. In fact, the development of societies is associated with the number and diversity of

exchange connections. The recent explosion of information technologies has enabled the

emergence of an unprecedented global society, one in which distances no longer

exist and where previously isolated events may trigger worldwide reactions within moments.

Researchers say that the world is heading towards a networked society, where Internet-

based solutions are emerging as alternatives to traditional centralist institutions [Cas96].

For instance, social media allows people to broadcast information very cheaply on a

global scale, to the detriment of mass media companies, which no longer hold a monopoly on

information. Also, the large number of collaborative online working sites and the increasing

activity of international freelancing are signs that corporations are no longer needed to

conduct large businesses. In fact, new virtual currencies are already working as an alternative to

traditional financial systems, bypassing intermediaries and international monetary institutions.

As a consequence, current business and political models must choose between adapting

to the new times or becoming extinct.

The current challenge is to characterize the social systems that emerge from these new

technological spaces and to understand their rules of behavior [LPA+09]. For this purpose, we

must enhance our ability to measure these systems in their actual dimensions. Fortunately,

the mentioned explosion of information technologies is providing the data required for these

analyses. When consuming these services, we are unconsciously leaving traces of our activity

as a by-product in the providers’ databases. Individually, these records contain detailed

information about the user activity and may serve for billing processes. It is natural to

think that users will have a unique profile determined by their own habits and customs.

However, these databases are so large that they have the dimensions required to enable

the observation of large scale human behavioral patterns. In fact, they are unveiling the

characteristics of societies as a whole physical system, rather than a collection of isolated

individuals [Pen14]. Besides, these datasets have the advantage of being real measurements

of people’s actual behavior, instead of the result of some sparse observations and honesty-

based questionnaires.

Human society is a complex system. Many social phenomena strongly depend upon

the way people behave and how collective actions combine in society

[Bet13]. Like in other complex systems, there are global properties that emerge from the

relationships between the individuals, rather than the properties of the individuals themselves

[BY97]. The elements in complex systems do not behave independently of each other,

but neither do they behave fully coherently. Instead, individuals create interdependencies in their

actions, which take the form of collective behaviors. During this process, the individuals lose

independence in their behavior, while the system gains properties and capabilities

at larger scales. As a result, the emergence of a collective behavior increases the system’s

complexity.

In today’s societies, we can find traces of the collective behavior that enables

larger scale patterns in the data derived from human activity. That information is embedded

and unstructured in the raw data. Therefore, in order to retrieve this knowledge, we must

treat the data properly [BYB13]. On the one hand, we cannot explain the system through

the individual states, since doing so would require so many descriptors that the system would be too

complicated to understand. On the other hand, we cannot reduce the overall behavior

to mere statistics either; by doing so, we would lose the heterogeneity and diversity typical

of social systems. Therefore, we need frameworks to observe the system in all its complexity.

The theory of complex networks is an adequate tool to treat and analyze this kind of

systems [New03b, Wat04]. Networks are mathematical structures composed of a set of nodes

linked to each other by a set of edges that represent relationships or interactions between the

system’s elements. By analyzing systems in the form of networks, we can understand their

structure, their dynamical evolution and the mechanisms responsible for pattern formation.

In general, networks are a common ground for analyzing complex systems in a variety of

scientific disciplines, in part because networks reveal the systems’ characteristics across several

scales. They can describe the systems’ global properties and their functioning as a whole.

At the same time, they can also describe local interactions, the role of individuals in their

environment and the connection patterns, which include structures at intermediate scales.

Recently, there has been an explosion of research into ways to retrieve societal knowledge from data. Most of these studies take advantage of the size, diversity and real-time nature of the data in order to revisit old sociological questions and to ask new ones. This way of studying social systems is unprecedented and is revealing the nature of societal phenomena. For instance, patterns in the diversity of connections can explain the economic development of cities [EMC10], as well as the emotional state of individuals [RMM+10]. Also, patterns of popularity can explain the economic value of stocks [BMZ11] or help locate earthquake epicenters [SOM10]. Finally, patterns of mobility can predict the propagation of infectious diseases [WET+12] or evaluate urban land use [TUGB12].

Beyond remarkably increasing our societal knowledge, these and many other scientific advances suggest that the analysis of data can be incorporated as valuable information for decision-making and policy evaluation processes, in both the private and public sectors. First, the analysis of data has the potential to show an unprecedented view of the impact of policies on the population, so that they can be revised and modified if needed. Second, it can also provide the knowledge to rethink the way our social and engineered systems function together, in order to design new rules of complex interactions for building better societies in the future.

1.1 Goals

In this thesis we develop computational and mathematical methods to analyze social systems by combining the study of data derived from human activity with the theory of complex networks. Our main goal is to characterize and model human behavior during people’s interactions in the new technological spaces, such as online social networks and mobile phones. We intend to understand the social systems that emerge from such interactions in terms of their structure, functioning and temporal evolution. To this end, we will analyze the systems as complex networks and propose metrics based on physical quantities to measure their characteristics. Furthermore, we will model the dynamics of information flow in the system and measure the impact of external and critical events on the system’s behavior.

In order to achieve these goals, we have defined the following targets:

1. To develop methods to characterize and understand the social system’s structure, functioning and time evolution. For this purpose, the following particular goals must be achieved:

(a) To represent the systems as complex networks, temporal variables and geographical information systems. Then, to characterize the system by analyzing the properties of these abstractions.

(b) To characterize and classify users, as elements of the system, according to their relationship with the environment and their role in the collective functioning.

(c) To understand how users influence each other and the way information flows among people. To this end, we will characterize users by their capacity to spread information.

(d) To detect and measure the degree of polarization in social networks. For this purpose, we will develop a methodology to infer opinions in social networks and to measure the polarization of the resulting opinion distributions.

(e) To characterize the impact of critical events on the collective behavior, that is, to analyze the way critical events influence the communication patterns. For this purpose, we will study the development of events like political protests, news events and natural disasters.

(f) To characterize the geographical distribution of human activity, developing methods to measure interactions among geographically located social systems, like urban areas or regions.

2. To develop dynamical models to explain global properties of the system, that is, to explain the nature of the observed patterns through the dynamical mechanisms that rule the interactions among individuals. To this end, we propose to achieve the following specific goals:

(a) To model the propagation of information across social networks as independent

cascades. Then, to explain the effects of the underlying networks’ topology and

user behavior on the information flow dynamics.

(b) To model the flow of opinions in a social network. Then, to explore the effects of

an influential minority’s opinion on the majority of users.

3. In order to achieve the previous goals, we must first develop a computer platform for the collection, storage, querying and treatment of the data. To this end, we propose to reach the following goals:

(a) To develop applications to collect data from online social networks’ servers. These

applications must be able to authenticate with servers and to manage queries

automatically.

(b) To develop methodologies to store and query the data. We will implement traditional solutions like MySQL and develop applications based on Map-Reduce algorithms.

(c) To develop software applications for the mathematical and statistical treatment

of the data, computational modeling and simulations, as well as visualization

techniques.

1.2 Organization

The thesis is organized in 10 chapters. After this introduction, in chapter 2 we review the most important concepts of complex network theory used in this thesis. We present the main properties of complex networks, as well as the models proposed for network generation and dynamics on networks. In chapter 3 we describe previous work in computational social science that is relevant to this thesis. We review the dynamics of human activity, the properties of socio-technological networks and information spreading processes. In chapter 4 we introduce the concept of digital traces and discuss the computational methods used to treat the data in order to retrieve patterns and knowledge. We also go through the datasets we have analyzed, their properties and the techniques followed to build them.

In chapter 5, we present the analysis of user behavior during political mobilization. We analyze a Venezuelan protest that took place exclusively on Twitter in December 2010. We characterize the user behavior by describing the networks of collective attention and information flow. We study the social networks’ topological properties, finding community structure and highly connected hubs. We classify users according to their role in the information flow dynamics and identify three different kinds of user behavior. We show that traditional media still hold a large influence on social media.

In chapter 6 we define a new measure of influence in the network called user efficiency.

It characterizes the relationship between the activity employed by users and the emergent

collective response to such activity. On this basis, we propose a model to understand the

emergence of the user efficiency distribution, based on independent cascades taking place on

networks. We show that this measure and the underlying mechanism are universal across

Twitter conversations from different and diverse social contexts.

In chapter 7 we introduce a methodology to quantify the degree of polarization in social

media conversations. We propose an opinion estimation model in which a minority of influential individuals propagate their opinions through social networks. The model results in an opinion probability density function that represents the state of the population. Next, we propose an index to quantify to what extent the resulting distribution is polarized. We apply this model to study the time evolution of user behavior on Twitter during an important event for politics in Venezuela: the announcement of the death of the President in office in 2013.

In chapter 8 we characterize the dynamics of human activity in urban environments. We analyze the kinetics of Twitter activity across several cities worldwide. We characterize the cyclic behavior of daily routines by constructing time series. We show that cities can be classified into three kinds of dynamical behavior, based on morning and afternoon activity.

In chapter 9 we study human behavior from mobile phone data. First, we analyze the way social factors affect regional communication patterns. We propose a methodology to infer regional relationships by means of human mobility and calling activity patterns. We show that the regional communication patterns in Ivory Coast are correlated with ethnic and economic factors. Then, we study the reaction of a population to a natural disaster. We investigate the viability of using mobile phone data, combined with other sources of information, to characterize the floods that occurred in Tabasco, Mexico in 2009. We propose methods to evaluate the population’s behavior during the tragedy. We show that the analysis of data could help in the evaluation of policies and resource allocation strategies.

Finally, in chapter 10 we briefly summarize our results and present our conclusions. We also present two appendices with supplementary information and additional visualizations. First, in appendix A we generalize some results from chapter 5. Second, in appendix B we present the videos that we have made to illustrate some of our results.

Chapter 2

COMPLEX NETWORKS

Many natural systems can be modeled in the form of complex networks [New03b, Wat04, BLM+06]. In this abstraction, the system’s elements are represented as nodes, which are linked to each other according to the relationships between them. In general, networks are a common ground to visualize and explain systems across different scales. For instance, it is possible to measure hierarchies in the systems’ structure, correlations in the connections or the relationship between the local elements’ behavior and the global system’s properties. Recent research has shown that many of these patterns are universal across previously unrelated disciplines, such as biology [WRB06], sociology [NP03], economics [HKBH07, FGH12] or ecology [GPG12]. These systems typically present complex properties like small-world structures and scale-free degree distributions. Such complexity has remarkable effects on the system’s functioning as a whole, such as robustness to random failures and vulnerability to targeted attacks [CFHB+05, CPRVP09]. Therefore, the complexity of these systems must be fully understood in order to properly characterize them and predict their behavior.

Over the last decade, there has been an explosion in the modeling of systems as complex networks. In some cases, the network structure is evident given the existence of physical or explicit connections, like the Internet [AJB99], flights connecting airports [BBPSV04], neurons [BS09] or social interactions [MLB12, MBLB14, BMLB12, BMBL14]. However, other kinds of phenomena can also be modeled in the form of networks, for example by linking elements according to correlated behaviors [KKK02] or common functions in the system [BS09]. In all cases, complex networks are a powerful tool to understand the structure of these systems as well as the evolution of their dynamical processes.

In this chapter we will review the main concepts of network science. We will give some basic definitions and discuss the main topological properties of complex networks. We will study network generation models, which explain the emergence of the properties of real networks, and dynamical processes on networks, like information spreading or opinion formation.

2.1 Definitions

In this section we define a set of basic concepts of network science. These concepts are referred to throughout the thesis and must be clarified in order to understand the following sections and chapters.

Nodes: Nodes or vertices are the simplest representation of the elements in a system.

Edges: Links or edges represent relationships between nodes.

Directed Edges: Edges are directed when the direction of the relationship is relevant in the system representation, and undirected otherwise. For instance, in a scientific collaboration network, the direction of the edges has no effect, since the interaction takes place in the same way for both nodes. However, interactions like phone calls must be analyzed taking into account the direction of the edge, since making a phone call is not the same as receiving one.

Weighted Edges: An edge can also be weighted with a value according to a given property in the system representation [BBPSV04]. For instance, in commercial networks, one could weight the edge between a seller and a buyer by the sum of all the transactions made between them.

Network: The network or graph, G, is a mathematical structure that consists of a set of nodes, V, and a set of edges, E, each relating a pair of nodes i and j. A directed graph takes into account the direction of the edges. In weighted graphs, edges carry a weight that may differ from one.

Adjacency Matrix: The adjacency matrix, $A_{ij}$, is the simplest representation of a graph. Its elements take the value 1 if there is a connection between i and j (from i to j in the directed case), and 0 otherwise. In undirected networks $A_{ij} = A_{ji}$ and the adjacency matrix is symmetric. In directed networks $A_{ij}$ may differ from $A_{ji}$, so the adjacency matrix may be asymmetric.

Multiplex: The multiplex is the representation of a system where the same set of nodes may be linked by different types of relationships, which are modeled as networks in different layers. For instance, people can have many types of relationships, like networks of family, friends or coworkers.

Degree: A node’s degree is the number of edges that connect it to the rest of the network. In terms of the adjacency matrix $A_{ij}$, the degree $k_i$ of node i is defined as $k_i = \sum_j A_{ij}$. Its value indicates the connectivity of the node in the network.

Directed Degree: Since $A_{ij}$ may be asymmetric in directed networks, two types of degree are considered: the out-degree $k_{out,i} = \sum_j A_{ij}$ and the in-degree $k_{in,i} = \sum_j A_{ji}$. In these networks, the total degree is $k_i = k_{out,i} + k_{in,i}$.

Strength: The node’s strength $S_i$ is similar to the node’s degree, but it also takes into account the weights of the edges. Its value indicates how strongly the node is connected to the rest of the network, beyond the absolute number of edges. In the case of directed and weighted networks, it is common to consider the strength in both directions of the edges: $S_{in,i}$ and $S_{out,i}$.

Path: A path or trajectory is a sequence of nodes, each connected to the next by an edge, that joins a pair of nodes in the network.

Shortest Path: A path with the minimum number of edges between a pair of nodes in the network.

Distance: Length of the shortest path between a pair of nodes ($l_{ij}$).

Diameter: The diameter of the network is the length of the longest of the shortest paths in the network.

Components: A component is a set of nodes in which every node is reachable from every other through some path. The giant component has a size of the same order as the whole network.

Connected and Disconnected Graph: A graph is said to be connected when all nodes belong to the same component, and disconnected when there is more than one component in the network. The distance between nodes from different components is infinite.
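The definitions of adjacency matrix, degree and strength can be made concrete with a small sketch. The edge list below is hypothetical toy data and the helper names (`build_matrices`, `row_sums`, `column_sums`) are ours; the convention $A_{ij} = 1$ for an edge from i to j is assumed so that row sums give out-degrees, in agreement with the definitions above.

```python
def build_matrices(n, edges):
    """Return the adjacency matrix A and the weight matrix W of a directed graph.

    A[i][j] = 1 if there is an edge from node i to node j, else 0.
    W[i][j] holds the weight of that edge (0.0 when there is no edge).
    """
    A = [[0] * n for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        A[i][j] = 1
        W[i][j] = w
    return A, W


def row_sums(M):
    # Sum over j of M[i][j]: out-degree (or out-strength) of each node i.
    return [sum(row) for row in M]


def column_sums(M):
    # Sum over j of M[j][i]: in-degree (or in-strength) of each node i.
    return [sum(row[i] for row in M) for i in range(len(M))]


# Hypothetical toy data: (source, target, weight).
edges = [(0, 1, 2.0), (1, 2, 1.0), (2, 0, 3.0), (0, 2, 1.0)]
A, W = build_matrices(3, edges)

k_out = row_sums(A)
k_in = column_sums(A)
k_total = [o + i for o, i in zip(k_out, k_in)]
s_out = row_sums(W)
s_in = column_sums(W)

print("k_out:", k_out, "k_in:", k_in, "k_total:", k_total)
print("s_out:", s_out, "s_in:", s_in)
```

For an undirected network the same functions apply after storing every edge in both directions, in which case row and column sums coincide.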

2.2 Topological Properties

In this section we will review the main topological properties of complex networks. These

properties describe the structure of the network which strongly determines the functioning

of the system.

2.2.1 Degree Distribution

The degree distribution P(k), the probability that a randomly chosen node in the network has degree k, is the first approach to understanding the structure of a network. The statistical properties of P(k) determine several of the system’s emergent properties, such as the system’s robustness or vulnerability to failures and attacks. In the case of directed networks, two distributions are considered according to the direction of the edges (outgoing or incoming). When the network is weighted, it is common to work with the strength distribution, or with the in- and out-strength distributions when the network is also directed.

2.2.2 Geodesic Distance

The shortest path length between two nodes, $l_{ij}$, is the minimum number of edges that separate them from each other in the network. The geodesic distance, L, indicates the average value of the shortest path lengths between all pairs of nodes. It is a network measure that indicates how distant nodes are from one another on average:

$$ L = \frac{1}{N(N-1)} \sum_{i,j \in V,\, i \neq j} l_{ij} \tag{2.1} $$
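As a minimal sketch of Eq. (2.1), assuming an unweighted, undirected and connected graph stored as a dictionary of neighbor lists, the shortest path lengths can be obtained with a breadth-first search from every node. The function names and the toy path graph are illustrative choices, not part of the methods used in the thesis.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path_length(adj):
    """Geodesic distance L of Eq. (2.1); defined only for connected graphs."""
    n = len(adj)
    total = 0
    for i in adj:
        dist = bfs_distances(adj, i)
        if len(dist) < n:
            raise ValueError("graph is disconnected: some distances are infinite")
        total += sum(dist.values())  # the term l_ii = 0 adds nothing
    return total / (n * (n - 1))


def diameter(adj):
    """Longest shortest path length over all pairs of reachable nodes."""
    return max(max(bfs_distances(adj, i).values()) for i in adj)


# Toy example: a path graph 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(average_shortest_path_length(path_graph), diameter(path_graph))
```

Running one search per node costs on the order of N(N + E) operations, which is affordable for small graphs; for very large networks the average is usually estimated from a sample of source nodes.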

2.2.3 Clustering

The concept of clustering is related to the number of a node’s neighbors that are connected with each other, forming triangles. It is a local measure quantified as the proportion of existing triangles in relation to all the possible ones, defined as [WS98]:

$$ c_i = \frac{\sum_{j,m} A_{ij} A_{jm} A_{mi}}{k_i (k_i - 1)} \tag{2.2} $$

It can also be a global measure, quantified as the average clustering coefficient of all nodes in the network:

$$ C = \frac{1}{N} \sum_{i \in V} c_i \tag{2.3} $$
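Equations (2.2) and (2.3) can be evaluated directly from the adjacency matrix of an undirected network. The sketch below assumes a symmetric 0/1 matrix without self-loops and, as a convention of ours, assigns $c_i = 0$ to nodes with fewer than two neighbors, for which Eq. (2.2) is undefined.

```python
def local_clustering(A, i):
    """Local clustering coefficient c_i of Eq. (2.2) for an undirected graph."""
    n = len(A)
    k = sum(A[i])
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbors exists
    # Ordered pairs (j, m) of neighbors of i that are themselves connected.
    closed = sum(A[i][j] * A[j][m] * A[m][i] for j in range(n) for m in range(n))
    return closed / (k * (k - 1))


def average_clustering(A):
    """Global clustering coefficient C of Eq. (2.3)."""
    return sum(local_clustering(A, i) for i in range(len(A))) / len(A)


# Toy example: a triangle 0-1-2 plus a pendant node 3 attached to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print([local_clustering(A, i) for i in range(4)], average_clustering(A))
```

Both the numerator and the denominator count ordered pairs of neighbors, so the factor of two cancels and $c_i$ stays between 0 and 1.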

2.3 Types of Networks

In this section we will review the most common types of networks. We will explain the structural properties of these networks and relate them to real case studies. We will place special emphasis on scale-free networks.

2.3.1 Regular Networks

Lattices or grids are networks whose connections form a regular tiling. They are neither random nor disordered. Instead, all nodes repeat the same regular and coherent pattern, which could form triangles, squares, hexagons, etc. Apart from the nodes at the borders, all nodes in these networks have the same degree. The clustering coefficient is typically high when the pattern contains triangles, because neighbors are regularly connected with other neighbors. The average shortest path length is also high, since distant regions are only reachable after hopping across many nodes. This kind of network is common in materials science, for modeling bonds between atoms in crystalline materials. Lattices also serve as the substrate for dynamical models like cellular automata [AS11].

2.3.2 Random Networks

In random networks the connections between the nodes are random and independent from each other. The degree distributions are bell-shaped (binomial, and approximately Poisson for large sparse networks). This means that most nodes have a number of connections within a few standard deviations of the mean. Therefore, the degree of the majority of nodes fluctuates close to an average value. In these networks, the average shortest path length is usually low. The clustering coefficient is also low and even decreases with the network’s size. That happens because connections are independent from each other. Therefore, the probability of two neighbors being connected is the same as the probability of two independently chosen nodes in the network being connected. As a consequence, the larger the graph, the lower the probability of finding triangles.

2.3.3 Small World Networks

A network is said to be small world when the average shortest path length, L, scales as the logarithm of the network size, N, in the form:

$$ L \propto \log N \tag{2.4} $$

This means that L grows much more slowly than the number of nodes in the network. This property is related to the famous six-degrees-of-separation experiment performed by Milgram in the 1960s [Mil63]. The experiment suggested for the first time that two randomly chosen strangers are separated by only about six individuals on average. More recently, new technologies have revealed that the average shortest path length between social media users is even below that value [MLB12]. It is also a pattern frequently found in many real systems [ASBS00], such as biological [WF01], ecological [MS02], neuronal [SJN+07] or economic networks [DYB03]. Moreover, small-world networks are also characterized by a high clustering coefficient.

Figure 2.1: Homogeneous vs. power-law distributions. (a) A homogeneous function and a power-law function with $\gamma = 2.1$. Both distributions have $\langle k \rangle = 10$. The curves are shown on a linear plot in (a) and on a log-log plot in (b). (c) A random network with $\langle k \rangle = 3$ and N = 50. (d) A scale-free network with $\langle k \rangle = 3$. Figure adapted from [Bar12]

2.3.4 Scale-free Networks

Scale-free networks are characterized by not having a characteristic scale for the nodes’ degree. They are often called heterogeneous networks because the nodes’ degree does not fluctuate homogeneously close to an average value. Instead, most nodes have a lower-than-average degree while a very few of them, called hubs, have a degree far above the average. This means that the majority of nodes are poorly connected while only a few of them are extremely connected and link most of the network. An example of these distributions is the power law:

$$ P(k) \sim k^{-\gamma} \tag{2.5} $$

where typically $2 < \gamma < 3$. In this range of $\gamma$, the second moment of the distribution (and hence the standard deviation) diverges. That means that the fluctuations around the average value grow without bound with the size of the network. This might seem like a contradiction to traditional statistical techniques, where the more independent samples there are, the more accurate the estimation of the moments, like the average or the standard deviation. However, in scale-free networks there is no independence among the nodes’ behavior. Instead, there are interdependencies and correlations in their connections that lead to the emergence of extremely connected cases.
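The behavior of the moments for $2 < \gamma < 3$ can be checked numerically. The following sketch computes the first and second moments of a discrete power law $P(k) \propto k^{-\gamma}$ truncated at a maximum degree `k_max`. Using `k_max` as a stand-in for the growth of the largest degree with network size is an assumption made only for illustration, as are the function name and the value $\gamma = 2.5$.

```python
def truncated_power_law_moments(gamma, k_max, k_min=1):
    """First and second moments of P(k) ~ k**(-gamma) on k_min..k_max."""
    ks = range(k_min, k_max + 1)
    weights = [k ** (-gamma) for k in ks]
    norm = sum(weights)  # normalization constant of the truncated distribution
    mean = sum(k * w for k, w in zip(ks, weights)) / norm
    second = sum(k * k * w for k, w in zip(ks, weights)) / norm
    return mean, second


gamma = 2.5  # illustrative value inside the range 2 < gamma < 3
results = {}
for k_max in (10**2, 10**3, 10**4, 10**5):
    results[k_max] = truncated_power_law_moments(gamma, k_max)
    mean, second = results[k_max]
    print(f"k_max={k_max:>6}  <k>={mean:.3f}  <k^2>={second:.1f}")
```

The average degree settles to a finite value as the cut-off increases, while the second moment keeps growing, roughly as the square root of `k_max` for this value of $\gamma$.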

On a double logarithmic scale, scale-free distributions look like a straight line across several orders of magnitude. This means that they do not have the characteristic cut-off of scale-defined distributions. In Fig. 2.1 we present an example of two distributions plotted on linear and logarithmic scales. It can be seen that the scale-defined distribution (green curve) rapidly converges to zero in a sharp cut-off, not much farther than the average value. Meanwhile, the scale-free distribution (red curve) has a fat tail that spans several orders of magnitude, indicating that there is no defined scale to represent this population.

Several examples of real scale-free networks can be found in the literature. In Fig. 2.2, we present the degree distributions of six networks of different nature, like scientific collaborations, the network of web pages, the physical connections of the Internet, and protein interactions. The fact that such different kinds of phenomena present a universal structure is remarkable. Such universality has radically changed the way scientists look at natural systems and the emergence of their properties.

The implications of scale-free distributions for the systems’ functioning are also remarkable. First, the extremely connected hubs act as shortcuts between distant regions of the network, giving rise to very short average path lengths. In fact, depending on the value of $\gamma$, the geodesic distance may behave like $L \propto \log \log N$ [CH03]. This effect is called ultra small world. Second, because the probability of finding a poorly connected node is very high, the network’s functioning is robust to random failures. However, as a counterpart, these networks are very vulnerable to targeted attacks, since the failure of hubs compromises the structure of the network as a whole.

2.4 Community Structure

Often referred to as the meso-scale, the community structure indicates the existence of groups of nodes within the network that share a larger number of edges among themselves than with the rest of the nodes in the network [For10]. This means that networks with communities have a hierarchical structure where nodes can be classified in groups, which are densely connected internally and sparsely connected with other groups. In Fig. 2.3 we present a schematic representation of a network with community structure.

Figure 2.2: Complementary cumulative degree distributions for six different networks. (a) Collaboration network of mathematicians [GI95]; (b) Citations between 1981 and 1997 to papers cataloged by the Institute for Scientific Information [Red98]; (c) A 300 million vertex subset of the World Wide Web, circa 1999 [BKM+00]; (d) The Internet at the level of autonomous systems, April 1999 [CCG+02]; (e) The power grid of the western United States [WS98]; (f) The interaction network of proteins in the metabolism of the yeast S. cerevisiae [JMBO01]. Networks (c), (d) and (f) appear to have power-law degree distributions, and (b) has a power-law tail but deviates from it at small degree. Network (e) has an exponential degree distribution and (a) appears to possibly have two separate power-law regimes with different exponents. Figure adapted from [New03b]

Figure 2.3: A simple graph with three communities, enclosed by the dashed circles. Figure taken

from [For10]

The analysis of the meso-scale is important for several reasons. First, it allows us to understand the structure of the network at intermediate scales between the most local and the most global ones. In such a hierarchy, large-scale structures are often assembled from smaller, previously assembled subparts, such as cells and organisms [Sim62]. Second, the communities enhance the nodes’ characterization according to their role in the sub-structure. Some of the nodes play a central role, keeping the module together, while others may play a bridge role, connecting different modules.

It is intuitive to think about community structure in social networks, since people have a tendency to form groups with their coworkers, friends or families. However, non-evident community structure has been detected in networks of different nature. For instance, in ecological networks it has been reported that dolphins interact with each other within communities due to racial issues [Lus03]. Also, in functional networks, protein-to-protein interactions are grouped in communities according to their function in the cell [JCZB06], and genes are organized in community structures according to common ends [WH04].

2.4.1 Detection Algorithms

The detection of communities in graphs has been a hot topic of research in recent years. The broad definition of communities has resulted in a wide range of interpretations and therefore many different ways to detect them. A very popular algorithm to find community structure is the modularity optimization method [BGLL08]. First, Newman introduced the concept of modularity to quantify the number of edges within and between groups, in comparison to what would be expected in a random case [New06]. Then, Blondel et al. introduced an algorithm based on modularity optimization in order to find the best partition of the network [BGLL08]. At the beginning of the algorithm, every node is considered an independent community. Then, the algorithm iteratively proposes network partitions until no further increase in modularity is found. It is a powerful algorithm capable of finding the community structure of very large networks in a short time.
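To make the quantity being optimized concrete, the sketch below evaluates the modularity of a given partition of an undirected, unweighted network, using the standard expression $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, where m is the number of edges and $c_i$ is the community of node i. It only scores a partition; it is not an implementation of the optimization algorithm of [BGLL08]. The helper names and the toy graph are hypothetical.

```python
def adjacency_from_edges(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph without self-loops."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    return A


def modularity(A, communities):
    """Modularity Q of a partition; communities[i] is the label of node i."""
    n = len(A)
    degrees = [sum(row) for row in A]
    two_m = sum(degrees)  # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # Observed edge minus the expectation for a random rewiring
                # that preserves the degrees.
                q += A[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy example: two triangles joined by a single edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = adjacency_from_edges(6, edges)

q_two_groups = modularity(A, [0, 0, 0, 1, 1, 1])
q_one_group = modularity(A, [0, 0, 0, 0, 0, 0])
print(q_two_groups, q_one_group)
```

Splitting the toy graph into its two triangles gives a positive modularity, whereas placing all nodes in a single community gives zero, which is why an optimizer would prefer the first partition.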

Other methods are based on different ideas. For example, random walk algorithms are based on the principle that a random surfer would be trapped within a community of nodes given the density of shared edges [RB10]. In this algorithm, we first let a random walker surf the network for a period of time. Then, we identify the communities as the clusters of nodes that were most frequently visited from one another. Moreover, the concept of edge betweenness has been proposed to determine the edges that could be connecting communities [GN02]. According to this method, if the edges with high betweenness are systematically removed from the network, the communities will eventually disconnect from one another. Finally, data mining techniques, such as clustering algorithms [Mac67], have been proposed to determine communities. These algorithms treat nodes as vectors in a multidimensional space with as many dimensions as there are nodes in the network. Then, we calculate distances between nodes such that the more similar their connections, the closer the nodes are. Finally, we identify clusters of nodes that are closer to each other than to the rest of the network.

2.5 Assortativity

The assortative mixing patterns quantify the tendency of nodes to be connected to other

nodes that are similar or dissimilar to them [New02a]. It is measured by the correlation

coefficient, r, which indicates if there is a tendency in the way that nodes are connected to

each other, or whether nodes are independently mixed among each other. It is a network

metric that measures who is connected to whom and to which extent.

The degree assortativity r indicates whether pairs of connected nodes have correlated degrees. It is defined as:

$$ r = \frac{\sum_{i \in E} (j_i - \langle j_i \rangle)(k_i - \langle k_i \rangle)}{\sqrt{\sum_{i \in E} (j_i - \langle j_i \rangle)^2}\, \sqrt{\sum_{i \in E} (k_i - \langle k_i \rangle)^2}} \tag{2.6} $$

where $j_i$ and $k_i$ are the degrees of the nodes at the two ends of edge i, and $\langle \cdot \rangle$ denotes the average over all edges. In the case of directed edges, $j_i$ and $k_i$ may respectively represent the in- or out-degree [FFGP10]. Therefore, there are four kinds of directed assortativity according to the possible combinations: $r_{in-in}$, $r_{out-in}$, $r_{in-out}$ and $r_{out-out}$.

The graph is positively assortative if r > 0, meaning that the hubs and most connected nodes tend to be connected to each other with a greater probability than to the less connected nodes. Instead, a negative value of assortativity, or disassortativity (r < 0), means that hubs are preferentially connected to the poorly connected nodes rather than to each other. If there is no correlation in the nodes’ connections then r ∼ 0 and we could say that the connections between hubs and non-hubs occur independently.
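A minimal sketch of the degree assortativity for an undirected network is given below. It computes the Pearson correlation between the degrees at the two ends of every edge, and it assumes that each undirected edge is counted once in each orientation so that the result does not depend on how the edge list is written. The function name and the toy graphs are illustrative.

```python
from math import sqrt


def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of the edges."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Assumption: count every undirected edge in both orientations.
    js, ks = [], []
    for u, v in edges:
        js += [degree[u], degree[v]]
        ks += [degree[v], degree[u]]

    mean_j = sum(js) / len(js)
    mean_k = sum(ks) / len(ks)
    cov = sum((j - mean_j) * (k - mean_k) for j, k in zip(js, ks))
    var_j = sum((j - mean_j) ** 2 for j in js)
    var_k = sum((k - mean_k) ** 2 for k in ks)
    if var_j == 0 or var_k == 0:
        raise ValueError("all nodes have the same degree: r is undefined")
    return cov / (sqrt(var_j) * sqrt(var_k))


# Toy examples: a star (one hub linked to three leaves) and a path of four nodes.
star = [(0, 1), (0, 2), (0, 3)]
path = [(0, 1), (1, 2), (2, 3)]
print(degree_assortativity(star), degree_assortativity(path))
```

In the star every edge joins the hub to a poorly connected node, so the graph is as disassortative as possible; in regular graphs, where all degrees are equal, the coefficient is undefined and the sketch raises an error.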

Analogously to the degree correlation, we can define other types of mixing patterns. For example, networks may also present patterns of connectivity according to any discrete characteristic of the nodes, like language, sex or race. These patterns can be quantified by a matrix $e_{ij}$ that measures the fraction of edges that connect nodes of type i to nodes of type j. Then the correlation coefficient, r, is defined as:

$$ r = \frac{\mathrm{Tr}\, e - \| e^2 \|}{1 - \| e^2 \|} \tag{2.7} $$

where $\| x \|$ means the sum of all elements in the matrix x. This formula gives r = 0 when there is no assortative mixing and r = 1 when there is perfect assortative mixing. In the case of r = 0, the connections are mixed randomly and independently of the node types.
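The coefficient of Eq. (2.7) can be evaluated from the mixing matrix in a few lines. In this sketch $e^2$ is taken to be the matrix product of e with itself, and the two mixing matrices are hypothetical examples for a network with two types of nodes.

```python
def attribute_assortativity(e):
    """Assortativity coefficient of Eq. (2.7) for a mixing matrix e.

    e[i][j] is the fraction of edges that connect nodes of type i to
    nodes of type j, so the elements of e must add up to 1.
    """
    n = len(e)
    trace = sum(e[i][i] for i in range(n))
    # ||e^2||: sum of all the elements of the matrix product e * e.
    sum_e2 = sum(
        e[i][k] * e[k][j]
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )
    return (trace - sum_e2) / (1 - sum_e2)


# Hypothetical mixing matrices for two node types.
perfect = [[0.5, 0.0], [0.0, 0.5]]        # edges only within each type
random_mix = [[0.25, 0.25], [0.25, 0.25]]  # types mixed independently
print(attribute_assortativity(perfect), attribute_assortativity(random_mix))
```

When all edges fall on the diagonal of e the coefficient reaches its maximum, and when the matrix factorizes into the product of its row and column sums the numerator vanishes.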

2.6 Network Models

In this section we review the most important network models. These models explain the

emergence of real networks’ properties, by means of defining a set of underlying rules of

behavior.

2.6.1 Erdős-Rényi Model

The Erdős-Rényi model (ER) is one of the first models to study the properties and generation of random graphs [ER60]. Originally proposed by Paul Erdős and Alfréd Rényi in 1959, the model consists in independently connecting a set of nodes with a previously defined number of edges or with a probability of connection q. Depending on the value of q, the network goes from a sparse and disconnected network (q → 0) to a fully connected one (q → 1). Between these extreme cases, there is a critical probability $q_c$ above which a giant component emerges. During the process, nodes are independently connected with a number of edges that fluctuates homogeneously around an average value, within a range set by the standard deviation. The results are homogeneous networks with bell-shaped degree distributions (binomial, and approximately Poisson for large sparse networks), small average shortest path lengths and low clustering coefficients.
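A minimal sketch of the variant in which every pair of nodes is connected independently with probability q is shown below. The function name, the seed and the values of n and q are illustrative assumptions; the variant with a fixed number of edges is not covered.

```python
import random


def erdos_renyi(n, q, seed=None):
    """Random graph where each pair of nodes is linked with probability q.

    Returns a dictionary that maps each node to the set of its neighbors.
    """
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):  # visit every unordered pair exactly once
            if rng.random() < q:
                adj[i].add(j)
                adj[j].add(i)
    return adj


# Illustrative parameters: the expected average degree is q * (n - 1).
n, q = 200, 0.05
graph = erdos_renyi(n, q, seed=1)
degrees = [len(neighbors) for neighbors in graph.values()]
mean_degree = sum(degrees) / n
print("expected <k> =", q * (n - 1), " observed <k> =", mean_degree)
print("min degree =", min(degrees), " max degree =", max(degrees))
```

The observed degrees stay within a narrow band around the average, in contrast with the hubs of scale-free networks. The double loop visits all pairs of nodes, so this construction is only practical for networks of moderate size.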

2.6.2 Watts and Strogatz Model

The first model to explain small world networks was proposed by Duncan Watts and Steven

Strogatz [WS98] in 1998. It is a random graph generation model that explains the collective

dynamics behind some topological patterns found in real networks, such as small average

shortest path lengths and high clustering coefficients together. The model does not consider

a network growth process. Instead, it consists of rewiring a fraction of the existing edges in
order to drive the network to a complex regime on the border between order and disorder.

The process is illustrated in Fig. 2.4(a). It begins with a highly ordered network, such
as a grid or lattice, where the rewiring probability p is null (p = 0). Networks of this kind
typically have a very high clustering coefficient and a very long average shortest path length.
Then some edges are randomly rewired with a given probability p, introducing a certain
amount of disorder into the network. In the extreme case of p = 1 all edges have been rewired
and the network behaves like an ER graph, where both the clustering coefficient and the average
shortest path length are very low. Somewhere in between, as shown in Fig. 2.4(b), there is a
range of values of p where the average shortest path length dramatically decreases (black squares)
without losing the high clustering property (white squares). This happens because a few new
shortcuts rapidly connect distant nodes, long before enough edges have been rewired to
destroy the local clustering, which is the last property to diminish in the process.
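The rewiring process can be sketched as follows. This is a simplified illustration and not the exact procedure of [WS98]: it starts from a ring lattice where each node is linked to its k nearest neighbors and moves one end of each lattice edge to a random node with probability p. All function names and parameter values are assumptions made for the example.

```python
import random
from collections import deque


def watts_strogatz(n, k, p, seed=None):
    """Ring lattice of n nodes with k neighbors each, rewired with probability p."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for step in range(1, k // 2 + 1):
            v = (u + step) % n
            adj[u].add(v)
            adj[v].add(u)
    for step in range(1, k // 2 + 1):
        for u in range(n):
            v = (u + step) % n
            if v in adj[u] and rng.random() < p:
                # Move the far end of the edge to a node not yet linked to u.
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj


def average_clustering(adj):
    """Mean fraction of linked neighbor pairs (nodes with degree < 2 count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest path length over all reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


lattice = watts_strogatz(200, 6, 0.0, seed=1)
small_world = watts_strogatz(200, 6, 0.1, seed=1)
for name, graph in (("p = 0", lattice), ("p = 0.1", small_world)):
    print(name, round(average_clustering(graph), 3), round(average_path_length(graph), 2))
```

Comparing both outputs shows the effect discussed above: a small rewiring probability shortens the paths considerably while the clustering coefficient remains close to that of the lattice.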

The main limitation of this model for explaining the behavior of real systems lies in the
unrealistic degree distributions it produces, which do not reproduce the scale-free
distributions of real networks. However, the model explains how some properties of
real systems, like the small world effect and high clustering, are the result of a behavior of
the nodes and their relationships that is neither completely ordered nor completely random.

2.6.3 Barabási-Albert Models

The Barabási-Albert model (BA) is a network growth model that explains the emergence
of fat-tailed degree distributions like the ones found in real networks [BA99]. It is based on
two mechanisms: population growth and preferential attachment. The first mechanism is
based on the observation that networks grow in time as new nodes are added to the system.
The second mechanism states that these new nodes tend to connect with previously
well connected nodes rather than with less connected ones. The combination of both mechanisms
gives rise to the heterogeneity of the resulting degree distribution.

Figure 2.4: (a) Schematic of the Watts-Strogatz model. (b) Normalized average shortest path
length L and clustering coefficient C as a function of the random rewiring parameter p for the
Watts-Strogatz model with N = 1000 and k = 10. Figure taken from [WS98].

Figure 2.5: (A) Degree distribution of networks generated by the Barabási-Albert model in
linearly-binned (red symbols) and log-binned version (green symbols). The number of edges per
new node is m = 3. Size of (A) N = 100,000, (B) N = 100, (C) N = 10,000 and (D) N = 1,000,000.
The straight line has slope γ = 3, corresponding to the degree distribution of the resulting networks.
Figure adapted from [Bar12].

The model specifies that the probability that a new node j connects an edge to a node
i already in the network is proportional to the degree of i, ki, in the following way:

$$ P_{j \to i} = \frac{k_i}{\sum_l k_l} \qquad (2.8) $$

This means that the more connections a node has, the higher its probability of gaining new
ones. This mechanism is usually called rich-get-richer or preferential attachment. Barabási
and Albert [BA99] showed that the emergent degree distributions converge to a power law of
exponent 3. Therefore the networks that emerge from this model are scale-free, which
implies that while most nodes have a small number of connections, a very few
nodes concentrate the largest share of them. In Fig. 2.5 we show the resulting degree
distribution of a numerical simulation of the BA model.
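A numerical simulation of this kind could be sketched as follows. This is not the code used to produce Fig. 2.5: the initial complete graph of m + 1 nodes, the function name and the rejection of repeated targets are assumptions made here for simplicity.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Grow a network of n nodes where each new node attaches m edges (Eq. 2.8)."""
    rng = random.Random(seed)
    # Assumption: the growth starts from a complete graph of m + 1 nodes.
    degree = {u: m for u in range(m + 1)}
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears in `stubs` once per edge end, so a uniform draw from
    # the list selects node i with probability k_i / sum_l k_l.
    stubs = [u for edge in edges for u in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Repeated targets are redrawn, a small deviation from Eq. 2.8.
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.append((new, target))
            degree[target] += 1
            stubs.extend((new, target))
        degree[new] = m
    return degree, edges


degree, edges = barabasi_albert(10000, 3, seed=1)
ranked = sorted(degree.values(), reverse=True)
print("largest degrees:", ranked[:5])
print("median degree:", ranked[len(ranked) // 2])
```

The gap between the largest degrees and the median degree gives a first impression of the heterogeneity of the resulting distribution.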

The power law emerges due to the correlation in the nodes’ behavior (collective behavior).

These nodes do not distribute their edges independently among the rest of nodes in the

network. Instead they prefer to connect with the well connected ones. Therefore, the new

nodes’ decisions of choosing whom to connect will depend upon the decisions previously taken

by those nodes who are already connected in the network. This creates a time-dependent
phenomenon, where the aggregation of individual contributions leads to the emergence of
extremely large cases.

BA networks present the small world effect, since the extremely connected hubs link
large parts of the network. However, the clustering coefficient is typically low in these
networks. This is a limitation for modeling real systems, where high clustering coefficients are
typically found. Nevertheless, BA networks are widely used to evaluate dynamical phenomena
in heterogeneous networks.

In recent years, the preferential attachment model has been generalized in order
to explain other properties of complex networks. For instance, in some models nodes have
a set of attributes and properties of their own that identify them and influence the
rules of connection. This idea is based on real social systems: people not only
decide to connect with those who are popular, but also look to connect with those who are
similar to them in certain ways. The heterogeneous preferential attachment model [San07, SB08]
proposes a formalism to bias the probability of connection with an affinity value based on the
nodes' attributes. This means that the rich-get-richer mechanism is weighted by an affinity
function that increases or decreases the probability of connection between two nodes based
on their attributes.

The affinity function is the rule that nodes apply when deciding whom to connect to. It
can be either a global or a local rule. In both cases, every node has its own attributes.
However, depending on the rule, nodes compare themselves with others in either a general or an
individual way. In the global case, the affinity between nodes is defined by a single global rule
that all nodes apply equally. In the local case, in contrast, each node may have an individual
function to determine its own affinities. The latter implementation increases the heterogeneity
of users, in terms of their characteristics and behavior. The heterogeneous preferential
attachment formalism has been used to model different kinds of real networks, such as
networks of politicians' interactions on Twitter [BMLB12].
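The idea of biasing preferential attachment with an affinity can be illustrated with a minimal sketch. The text does not give the exact expression used in [San07, SB08], so the product of degree and affinity used below, the scalar attributes in [0, 1] and both function names are assumptions made only for illustration.

```python
def connection_probabilities(degrees, attributes, new_attribute, affinity):
    """Probability of linking to each existing node: degree weighted by affinity."""
    weights = [
        k * affinity(new_attribute, attribute)
        for k, attribute in zip(degrees, attributes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def global_affinity(a, b):
    """Hypothetical global rule: the closer two attributes, the higher the affinity."""
    return 1.0 - abs(a - b)


degrees = [5, 1, 1]
attributes = [0.9, 0.1, 0.2]
# A new node with attribute 0.1 is drawn towards similar nodes, not only hubs.
print(connection_probabilities(degrees, attributes, 0.1, global_affinity))
# With a constant affinity the expression reduces to Eq. 2.8.
print(connection_probabilities(degrees, attributes, 0.1, lambda a, b: 1.0))
```

A local rule could be sketched in the same way by storing one affinity function per node instead of passing a single function shared by all of them.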

2.7 Dynamics on Networks

Dynamical processes among nodes are studied by applying models on networks
with a previously defined topology. In this section we explore the most important
dynamical processes taking place on networks, such as contagion processes and cascading
effects.

2.7.1 Disease Contagion

The most popular model for disease contagion processes is the SIR model [KM27], first
introduced by Kermack and McKendrick in 1927. The SIR model is named after
the three possible states that nodes may adopt: Susceptible, Infected and Recovered. Its
dynamics consist of the temporal change of the nodes' states. Susceptible nodes become infected
at a rate βi. Infected nodes recover at a rate βr. Recovered nodes may either remain immune
or become susceptible again at a rate βs. The evolution of the quantities S(t), I(t) and R(t),
which measure the nodes in each state at time t, determines whether there is an
epidemic outbreak or whether it can be controlled.
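These dynamics can be sketched on a network with a discrete-time simulation. The following code is an illustrative assumption rather than the formulation of [KM27]: the rates βi, βr and βs are treated as probabilities per time step, βi applies independently to each infected neighbor, and all names are hypothetical.

```python
import random


def sir_step(adj, state, beta_i, beta_r, beta_s, rng):
    """Advance all nodes one time step, updating them synchronously."""
    new_state = dict(state)
    for node, s in state.items():
        if s == "S":
            infected = sum(1 for nb in adj[node] if state[nb] == "I")
            # Probability that at least one infected neighbor transmits.
            if rng.random() < 1.0 - (1.0 - beta_i) ** infected:
                new_state[node] = "I"
        elif s == "I":
            if rng.random() < beta_r:
                new_state[node] = "R"
        elif rng.random() < beta_s:
            # Recovered nodes lose their immunity with probability beta_s.
            new_state[node] = "S"
    return new_state


def run_sir(adj, seeds, beta_i, beta_r, beta_s=0.0, steps=100, seed=None):
    """Return the list of (S, I, R) counts, one entry per time step."""
    rng = random.Random(seed)
    state = {node: "S" for node in adj}
    for node in seeds:
        state[node] = "I"
    history = []
    for _ in range(steps):
        counts = {"S": 0, "I": 0, "R": 0}
        for s in state.values():
            counts[s] += 1
        history.append((counts["S"], counts["I"], counts["R"]))
        if counts["I"] == 0:
            break
        state = sir_step(adj, state, beta_i, beta_r, beta_s, rng)
    return history


# Hypothetical example: a ring of 100 nodes with a single initial infection.
ring = {u: [(u - 1) % 100, (u + 1) % 100] for u in range(100)}
history = run_sir(ring, seeds=[0], beta_i=0.3, beta_r=0.2, seed=1)
print(history[-1])
```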

The critical value of the infection rate depends on the network topology [PSV01, PSV02].
Standard random networks need a critical infection rate higher than zero in order to cause
an outbreak (Fig. 2.6). Below that critical limit the disease does not spread widely
and may even become extinct. However, Pastor-Satorras et al. [PSV01] showed that the
critical infection rate decreases to zero in scale-free networks. This means
that an initial infection of only a few nodes can compromise the network as a whole. This
phenomenon happens because the diffusion process occurs much more rapidly in scale-free
networks, due to the effects of hubs and small average shortest path lengths.

Recently, this model has been applied to many kinds of real networks, like networks of sexual
contacts [New02b, RLH11] and air transit networks [CBBV06]. These studies give
much more realistic results about the true dynamics of actual epidemics and provide a better
basis for designing response strategies.

2.7.2 Social Contagion

Social contagion is the process by which people make decisions influenced by the decisions
taken by other people. For instance, adopting a trend, acquiring a product or forming
an opinion are part of social contagion processes. Unlike the spread of diseases in epidemic
models, the spread of ideas is not typically negative. Therefore, the strategies of social
contagion are commonly intended to reach as many people as possible, rather than to gather
the information needed to prevent an outbreak.

The Threshold Model is one of the first models proposed to understand the diffusion of ideas
among people [Gra78]. Proposed by Mark Granovetter in 1978, it defines a
collective interacting behavior among agents, based on the tipping point ideas from Schelling's
segregation model [Sch71]. In the model, agents require a critical mass of neighbors who have
already adopted the new state before deciding to do so themselves. This means that nodes change
their state once the absolute or relative number of neighbors that have already adopted the new
state exceeds a threshold value. Individual thresholds may be the same for all nodes or
vary according to a probability distribution. Watts applied the threshold model to
complex networks in 2002 [Wat02]. He found that hubs influence the spread of adoption in
two different ways. On the one hand, hubs influence a very large number of nodes when they
adopt a new state, due to their high connectivity. On the other hand, as the threshold is a fraction
of a node's connections, it is more difficult for hubs to reach the number of adopting neighbors
needed in order to change their own state.

Figure 2.6: Comparison of disease spreading on a homogeneous random graph and on scale-free
networks. The fraction of infected nodes displays a distinct phase transition (or epidemic threshold)
in the case of a homogeneous random graph, but not for the scale-free network. Figure taken from
[Wat04].
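The two roles of hubs can be illustrated with a minimal sketch of a fractional threshold rule. The code assumes that a node adopts the new state when the fraction of its adopting neighbors reaches its threshold and that adoptions are irreversible. The function name and the star-shaped example are hypothetical.

```python
def threshold_cascade(adj, thresholds, seeds):
    """Return the set of nodes that end up adopting the new state."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node in adj:
            if node in active or not adj[node]:
                continue
            adopters = sum(1 for nb in adj[node] if nb in active)
            # Adopt when the fraction of adopting neighbors reaches the threshold.
            if adopters / len(adj[node]) >= thresholds[node]:
                active.add(node)
                changed = True
    return active


# Hypothetical star network: node 0 is the hub, nodes 1 to 4 are leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
thresholds = {node: 0.5 for node in star}
print(threshold_cascade(star, thresholds, seeds=[1]))  # a single leaf adopts first
print(threshold_cascade(star, thresholds, seeds=[0]))  # the hub adopts first
```

In this example a single adopting leaf is not enough to convince the hub, whereas the hub's adoption reaches every leaf.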

Opinion formation processes are also part of social contagion. A popular model of opinion
formation is the voter model [HL75]. In this model, nodes can only adopt binary states, based
on the states of their neighbors and a set of interaction rules. These interaction rules are
based on connections, whether in grids or networks, and on probabilities of interaction. The
mechanism is the following:

1. First, we select a node from the network at random, according to a given probability distribution.

2. Then, the chosen node selects one of its neighbors, according to another probability distribution.

3. Finally, the first node adopts the state of the selected neighbor, and another node is
chosen to interact.

These interactions are iteratively repeated until the nodes reach consensus. Consensus is
said to be reached when no more changes occur in the nodes' states. Recently, the voter
model has been applied to complex networks [SAR08, SEM05]. The results indicate that
the heterogeneity of the network and the existence of hubs facilitate reaching consensus.
Other scholars have been interested in the effects of a set of nodes called "zealots", whose
opinions remain constant over time [Mob03, MMR07]. They have found that the presence
of zealots in the network prevents the population from reaching consensus.
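The three steps of the voter model and the stopping condition can be sketched as follows. The text leaves the selection probabilities open, so uniform random choices are assumed here. The function name, the binary encoding of the states as 0 and 1 and the small complete network are also illustrative.

```python
import random


def voter_model(adj, state, max_steps=100000, seed=None):
    """Run the voter model; return the final states and the number of steps used."""
    rng = random.Random(seed)
    state = dict(state)
    nodes = list(adj)
    ones = sum(state.values())  # number of nodes currently in state 1
    for step in range(max_steps):
        if ones == 0 or ones == len(nodes):
            return state, step  # consensus: no further change is possible
        node = rng.choice(nodes)  # step 1: pick a node (uniformly, by assumption)
        if not adj[node]:
            continue
        neighbor = rng.choice(adj[node])  # step 2: pick one of its neighbors
        if state[node] != state[neighbor]:
            ones += state[neighbor] - state[node]
            state[node] = state[neighbor]  # step 3: copy the neighbor's state
    return state, max_steps


# Hypothetical example: a complete network of 6 nodes with alternating opinions.
complete = {u: [v for v in range(6) if v != u] for u in range(6)}
initial = {u: u % 2 for u in range(6)}
final, steps = voter_model(complete, initial, seed=3)
print(final, steps)
```

Zealots could be added to this sketch by skipping the update of step 3 for a fixed subset of nodes.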

Other models of opinion formation do not consider binary states, but a continuous
spectrum of possible opinions. For instance, the DeGroot model [DeG74] describes how a group
of individuals might reach a shared opinion by iteratively updating their opinion as the
average of their current opinion and the opinions of their neighbors. In this model there is no
external data and nodes are only able to form opinions based on observing their neighbors'
opinions. The result is an explicitly determined distribution of the opinions reached. The
shape of the distribution indicates the way consensus has been achieved, whether opinions
merge into a single view or there are multiple points of view among people. Recently, the
DeGroot model has been used to study the conditions under which consensus is achieved
[AO11, GJ10a, Jac10]. However, as consensus is rarely reached in the real world [Kra09, IJBZ08],
variants of this model that can lead to a diversity of opinions have been proposed
[BKO11, ACFO13, Kra00, FJ90]. For example, by weighting edges and biasing the way nodes
interpret their incoming information, the divergence and polarization of people's opinions may
be reproduced.
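The iterative averaging of the DeGroot model can be sketched in a few lines. The equal weights given to a node and to each of its neighbors, as well as the function names and the three-node example, are assumptions made for illustration.

```python
def uniform_weights(adj):
    """Row-stochastic matrix: each node weights itself and its neighbors equally."""
    n = len(adj)
    rows = []
    for i in range(n):
        group = set(adj[i]) | {i}
        rows.append([1.0 / len(group) if j in group else 0.0 for j in range(n)])
    return rows


def degroot(weights, opinions, steps):
    """Repeatedly replace every opinion by the weighted average of the opinions."""
    x = list(opinions)
    for _ in range(steps):
        x = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    return x


# Hypothetical example: three nodes in a line, with opinions between 0 and 1.
line = {0: [1], 1: [0, 2], 2: [1]}
weights = uniform_weights(line)
for steps in (0, 1, 5, 200):
    print(steps, [round(v, 4) for v in degroot(weights, [0.0, 0.5, 1.0], steps)])
```

In this connected example all opinions approach a common value. Variants that weight or bias the incoming opinions, as mentioned above, would modify the matrix of weights.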

2.7.3 Cascades on Networks

Occasionally, during contagion processes, a node's adoption of a new state triggers a
sequence of reactions among its neighbors, and neighbors of neighbors, in the shape of cascades
(see Fig. 2.7). During cascades, individuals show a herd-like behavior, making decisions
solely based on the actions of others. Many real systems constantly show cascading behavior,
such as crashes in the stock market [Shi95], failures in the electrical system [SCL00],
biological processes [GIT09] or viral marketing campaigns [GLM01]. The dynamics of cascades
are related to the avalanches of Bak's self-organized criticality [BTW87]. In both systems,
the propagation of actions occurs due to long range correlations among the different nodes
or agents.

One kind of cascade on networks results from the threshold model. The adoption of a new
state by a given node can provoke a cascade if that adoption causes the threshold of any of
its neighbors to be exceeded. Watts showed that the size distribution of this kind of cascade
follows a power law when the network connectivity is limited [Wat02]. Besides, he showed that
heterogeneity plays an ambiguous role in the propagation of cascades. On the one
hand, heterogeneity in the threshold distribution makes the network more vulnerable to
the occurrence of large scale cascades. On the other hand, heterogeneity in the
degree distribution makes the network more robust against their propagation. This occurs
because, although hubs trigger many more cascades than average nodes, they are less
likely to propagate already existing ones.

Figure 2.7: Schematic representation of a cascade on a network. The red and yellow nodes belong
to the cascade. The white nodes belong to the network but are not part of the cascade. The cascade
layers have been marked in gray.

The cascades that result from the threshold model depend on the overall state of the
social contagion process. However, there is another kind of cascade which occurs
independently of the nodes' history. These cascades are analyzed through the independent cascade
model [GLM01]. In this model, every node's adoption may trigger an independent cascade,
regardless of whatever happened before the adoption. When a node becomes active, it has a single
chance to activate each of its neighbors with a given transmission probability. Saito et al. [SNK08]
proposed a model to predict the optimal diffusion probabilities by maximizing the
likelihood of possible episodes.
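The independent cascade mechanism can be sketched as a layer-by-layer process. A single transmission probability shared by all edges is assumed here, and the function name and the small example network are hypothetical.

```python
import random


def independent_cascade(adj, seeds, probability, seed=None):
    """Return the cascade as a list of layers of newly activated nodes."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    layers = [list(seeds)]
    while frontier:
        next_frontier = []
        for node in frontier:
            # A newly active node gets a single chance per inactive neighbor.
            for neighbor in adj[node]:
                if neighbor not in active and rng.random() < probability:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        if next_frontier:
            layers.append(next_frontier)
        frontier = next_frontier
    return layers


# Hypothetical example: a small tree rooted at node 0.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1], 5: [2]}
print(independent_cascade(tree, seeds=[0], probability=0.5, seed=2))
```

Each inner list corresponds to one cascade layer, in the sense of the layers marked in Fig. 2.7.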

Kempe et al. [KKT03] proposed a general framework where the threshold model and the
independent cascade model are included as particular cases. In their general cascade model,
the independent cascades behave as the threshold model, with the difference that the nodes'
threshold is reduced to one and there is a probability of propagating the cascade once the
threshold is exceeded. They also proposed an algorithm, based on a combinatorial optimization
process, to find the initial group of individuals that will produce the largest cascades in a
social network.

2.8 Social Networks

Social groups can be analyzed in the form of networks. In this abstraction, the nodes
represent persons and the edges represent social relationships of any kind, such as friendship,
family or work. The analysis of social groups in the form of networks is a well known concept
in the social sciences [Mor51]. However, until recently, large scale social structures were not
addressed in this way. Many examples of social networks can be found in the literature,
such as actors performing in common movies [NWS02, ASBS00], scientists collaborating on
papers [BBPSV04], criminals acting in gangs [XC05] or people interacting through the Internet
[MLB12, MBLB14, BMLB12, BMBL14]. Most of these studies analyze the social structure
at different scales, the dynamical processes taking place on the networks and the effects of
people's behavior on the social structure in which they are embedded.

Social networks typically present complex properties, such as scale-free degree
distributions and the small-world effect. The degree distributions behave like power laws, the average
shortest path length is small and the clustering coefficient is usually much greater than
would be expected for a random process. Moreover, social networks tend to be positively
assortative [New02a]: popular people tend to relate to other popular people, while
unpopular people tend to befriend other unpopular people. Positive assortativity has been found
in cases like scientific collaborations, movie actors or boards of directors, in contrast to
networks like the Internet, airport networks or semantic networks, which tend to be disassortative.

A very important property of social networks is their community structure. Social networks
are usually subdivided into groups of nodes that are more closely related to each other than to the
rest of the network. Communities can be related to working disciplines [RB10], racial issues
[GHKV07], language spoken [BGLL08] or mobility patterns [EEBL11]. Newman showed that
the community structure is responsible for other properties, such as positive assortativity
together with a high clustering coefficient [NP03]. During the growth process of a network
divided into communities, the new edges tend to stay within the same group, so the clustering
coefficient does not decrease as the network grows.

2.9 Time Varying Networks

In real systems, not all networks can be modeled as continuously growing processes where edges
remain invariant after their creation. Some systems change their structure dynamically, as
elements appear and disappear, or relationships are built and destroyed. There is a class of
networks that takes into consideration this dynamical nature of real processes. Networks of this
kind are called time-varying networks, and their edges are created, removed and rewired according
to the nodes' behavior. They have recently been used to explain the evolution of online
interactions among people [PGPSV12], genetic processes [KSA+10], wireless routing strategies
[NMR05] and contagion processes [LBP13]. These studies have found remarkable differences
with respect to their static counterparts, showing the importance of considering the dynamical
nature of real processes.

Many real systems are determined by the nodes' dynamical behavior. Perra et al.
[PGPSV12] proposed an activity driven model where the network formation depends on
the nodes' activity. The model takes into account the different dynamics of activation on
social networks, such as in social media or scientific collaboration. In Fig. 2.8 we illustrate
three different time steps of the evolution of the model. At each time step, active nodes (red)
create a new set of edges (white). The bottom visualization shows the aggregated network
at the end of the process. This model can explain the emergence
of hubs in networks based on the heterogeneity of the nodes' activity distribution, rather than
on the preferential attachment mechanism. It is a clear example of how the complex
structural properties of the system at larger scales result from the complexity of
individual behavior, and it also demonstrates that structure and dynamics in complex systems
are intimately related to each other.
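A minimal sketch of an activity driven process is given below. The details are assumptions that go beyond the description above: each node becomes active with a probability equal to its activity, an active node creates m edges towards uniformly chosen nodes, and the edges of a time step are discarded before the next one. All names and values are hypothetical.

```python
import random


def activity_driven(activities, m, steps, seed=None):
    """Return the per-step edge sets and the aggregated set of edges."""
    rng = random.Random(seed)
    n = len(activities)
    snapshots = []
    aggregated = set()
    for _ in range(steps):
        edges = set()  # the edges of each time step start from scratch
        for node in range(n):
            if rng.random() < activities[node]:
                others = [v for v in range(n) if v != node]
                for target in rng.sample(others, m):
                    edges.add((min(node, target), max(node, target)))
        snapshots.append(edges)
        aggregated |= edges
    return snapshots, aggregated


# Hypothetical example: one very active node among many rarely active ones.
activities = [0.9] + [0.01] * 99
snapshots, aggregated = activity_driven(activities, m=2, steps=50, seed=1)
degree = {}
for u, v in aggregated:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print("aggregated degree of node 0:", degree.get(0, 0))
print("largest aggregated degree among the other nodes:",
      max((k for node, k in degree.items() if node != 0), default=0))
```

In the aggregated network the most active node tends to accumulate the largest degree even though targets are chosen uniformly, which is the sense in which hubs can emerge without preferential attachment.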

Figure 2.8: Schematic representation of the activity driven network creation model. Red nodes
show the active nodes at each time T. The bottom plot represents the final aggregated structure
of the network. This figure has been adapted from [PGPSV12].

Chapter 3

COMPUTATIONAL SOCIAL SCIENCE

Over the last decades, fields like biology or physics have been revolutionized by increasing
amounts of data obtained from electronic measurements. The social sciences, however, were
particularly slow to join this trend. The technical challenges of surveying at larger scales impeded
data driven studies and experiments in social disciplines. Fortunately, that
reality has recently changed, as humans now constantly interact with electronic devices.
These devices act like social cathodes, recording our activity and enabling previously unexplored
research and knowledge. Computational social science is the emerging field of data driven
analysis aimed at understanding societal phenomena [LPA+09]. The theory of complex systems [BY97]
is a powerful framework to understand this kind of phenomena. In order to understand the
social system we cannot ignore the complexity of our behavior and of the relationships we
build. Complex systems tools, like complex networks [New03b, Wat04, BLM+06], explain
patterns of societies, like the structure of the system, the way it evolves and the underlying
mechanisms responsible for pattern formation.

Until now, the social sciences relied on surveys that do not meet the requirements
of data driven analysis. Traditional surveys usually represent snapshots of a few
hundred randomly sampled individuals, which show neither the structural nor the dynamical
patterns of the social system as a whole. The complexity of the social system is simplified and
the diversity of the population is not captured. Computational social science provides a
new perspective on social processes. It enables sociologists and other experts to revise old concepts
and answer new questions. The new data sources have brought many benefits. First, the
fact that these datasets are automatically collected dramatically reduces the cost and effort
of gathering data. This is, for instance, very helpful for less developed countries,
which can gain information in an affordable way from databases that have already been collected
[BLT+11]. Second, this human activity represents a new dimension that enhances the
characterization of societies. For example, phone call behavior explains properties ranging from
individual ones, like emotions [RMM+10], up to large scale patterns, such as economic
development [EMC10]. Third, the temporal evolution of the data makes it possible to detect
patterns and find trends. For instance, stock prices have been predicted from people's mood
on the Internet [BMZ11]. Finally, modeling techniques unveil the nature of the system.
Computational and mathematical models help to control and predict its behavior at different
times and under different conditions.

There are two ways to conduct experiments in computational social science. One is to
analyze large datasets from service providers, like emails, social media, mobile phones and
e-commerce, and then retrospectively look for patterns in the collective behavior of the
population in order to discover the micro-macro connections of the social process. The other
is to build living labs with people and conduct large scale experiments [VH86]. Living
labs are techniques for research on user-centric behavior by sensing, designing and
validating complex solutions in real scenarios.

In the present chapter we will discuss some of the most important advances in the field

of computational social science. We will review the main characteristics of online human

activity and their applications to understand society in section 3.1. We will introduce the

concept of socio-technological networks in section 3.2. Finally, from sections 3.3 to 3.5, we

will review the dynamical processes that take place in these new spaces, like social contagion,

influence propagation and social polarization.

3.1 Human Activity

In the context of computational social science, human activity refers to the use
of electronic communication services. As people use these technologies, a
picture of their social interactions emerges. The emergent patterns are complex and
heterogeneous. For instance, there is no characteristic rate of activity. Instead, there are
long periods of inactivity, followed by fewer and shorter moments of highly intense activity called
bursts [Bar05, ZCH+12]. This behavior is captured by the distribution of the time between two
consecutive actions, which follows a power law (see Fig. 3.1). Barabási [Bar05] modeled this
effect as a queuing process where tasks or actions are executed consecutively. In the model,
most tasks are rapidly executed one after another, but a few experience very
long waiting times before being executed.
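A queuing process of this kind can be sketched as follows. The details are not given in the text and are assumed here: a list of fixed length holds tasks with random priorities, the highest priority task is executed with probability p and a random one otherwise, and every executed task is replaced by a new one. The function name and the parameter values are illustrative.

```python
import random


def priority_queue_waiting_times(length=2, p=0.999, steps=20000, seed=None):
    """Return the waiting time, in steps, of every executed task."""
    rng = random.Random(seed)
    # Each task is stored as (priority, time at which it entered the list).
    tasks = [(rng.random(), 0) for _ in range(length)]
    waits = []
    for t in range(1, steps + 1):
        if rng.random() < p:
            # Execute the task with the highest priority.
            index = max(range(length), key=lambda i: tasks[i][0])
        else:
            # Occasionally execute a randomly chosen task instead.
            index = rng.randrange(length)
        waits.append(t - tasks[index][1])
        tasks[index] = (rng.random(), t)  # a new task takes the free slot
    return waits


waits = priority_queue_waiting_times(seed=1)
ranked = sorted(waits)
print("median waiting time:", ranked[len(ranked) // 2])
print("longest waiting time:", ranked[-1])
```

Comparing the median with the longest waiting time gives a first impression of the heterogeneity described above. A histogram of the waiting times on logarithmic axes would be the natural next step.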

On a larger scale, we are part of societies. Studies using electronic media data report that
individual activities like commuting, working and sleeping on a daily basis combine into
area-wide pulsing patterns. Measurements like the number of calls [CGW+08], electricity
consumption [PSR12] or emails sent [WWT+11] display regular cycles of high activity during
work hours and low activity during rest hours. These urban rhythms have been reported as
responsible for the heterogeneity in the queuing times modeled by Barabási [MSMA08].
However, the burstiness of human activity has proved to be robust to seasonal and daily
variations [JKKK12]. On the other hand, collective activity in large areas also explains the
economic development of a region. Eagle et al. [EMC10] found that regions with diverse
communication patterns tend to be wealthier than regions with insular communications.

Figure 3.1: Bursts of individual activity on an e-commerce site. In the left panel we represent
the temporal behavior of four individuals, showing that bursts of activity (color stripes) coexist
with long periods of inactivity (white periods). The x-axis represents time and the colored lines
represent individual actions. In the right panel we show the distribution of inter-action waiting
times for each of the four users. Figure adapted from [ZCH+12].

Figure 3.2: Collective response to a critical event. In the top panel we show the emergent
networks between affected users during an event at three times. In the bottom panel we show the
call pattern between the same users a week before the event, indicating that the cascades observed
during the event are extraordinary. Figure adapted from [BWB11].

The regular patterns of collective behavior are disrupted during critical events like natural
disasters or episodes of social unrest [BWB11, MLB12]. During these events, populations
encounter unfamiliar conditions and their reactions determine the outcome of the crisis.
Electronic media make it possible to measure and understand the impact of such events on the
social system. Recent studies have characterized the disruption of the collective patterns by
comparing the behavior during the event to the usual one [BWB11]. They found abrupt variations
in the activity, closely related to the emergence of extraordinary information cascades (see Fig.
3.2). As the emergency occurs, people tend to communicate it to others, triggering chain
reactions. The geographical distribution of the activity also describes the event. For instance,
during earthquakes Twitter activity makes it possible to locate the epicenter with extraordinary
accuracy, by geographically measuring the volume of related messages [SOM10]. Episodes of
social unrest are another kind of collective disruption [LeB96]. Recently, many social
movements have been analyzed by means of human activity data. The propagation of action and
influence across networks during episodes of social unrest, including how leaders attract
and influence followers, has been described [MLB12, MR13]. Efforts have also been made to
understand the evolution of some of these movements and to investigate possible reasons for
their eventual decay [CFMF13].

Another way to analyze human activity is by means of mobility patterns. By looking
at geolocalized data it is possible to analyze the laws that govern people's movement. Eagle
and Pentland [EP06] predicted the movements of students and employees in a university,
based on individual characteristics such as their studies or employment level. On a larger
scale, González et al. [GHB08] studied thousands of anonymous mobile phone users and
unveiled that humans follow simple and predictable mobility patterns. They found
that human trajectories present a strong temporal and spatial regularity, characterized by
a significant probability of returning to a few highly visited locations. Moreover,
electronic media like Twitter have revealed global migrations and the actual exchange of people
between countries [HSB+13].

3.2 Socio-Technological Networks

Telecommunication solutions, like phones or the Internet, require underlying technological
networks in order to transfer data and provide services. When consuming these
services, people interact and communicate with other people, creating social networks that
emerge from the exploitation of the technological resources. In these socio-technological
networks, information is shared and ideas flow among people. The characterization of these
networks makes it possible to understand large scale patterns of society. Two things can be
addressed with socio-technological networks: the topological properties of the social structure
and the dynamical properties of the interaction processes. On the one hand, the social
structure is defined as the network of social contacts, established either by making calls or by
sending messages. On the other hand, dynamical processes are due to user activations in the
network. These activations are usually not independent from each other and present large scale
patterns like the emergence of information cascades.

These networks present complex properties. Fat-tailed degree distributions have been
frequently found, indicating that the number of contacts per person is scale-free. This has
been shown in email networks [NFB02], mobile phone networks [APR99, OSH+07] and Twitter
networks [MLB12, MBLB14, BMLB12, BMBL14, KLPM10]. The small world property is
also common in these networks. The average shortest path length has been calculated
in email networks and Twitter networks, and it is below six degrees of separation in systems
with thousands of millions of participants. Moreover, these networks have a negative degree
correlation [HW09]. This happens because the new interaction mechanisms allow regular
people to interact with famous people.

Huberman et al. [HRW09] showed that people behave quite selectively when truly
interacting with their contacts. Each person has a subset of friends with whom they interact much
more frequently than with the rest of their acquaintances. In fact, Kleinberg et al. [LNK07]
showed that people even start new relationships with their friends' friends. This behavior
constitutes a "network that matters" contained within the overall social structure. This effect
is due to limitations in people's attention, since there is a maximum number of relationships
that we can manage simultaneously. As originally measured by Dunbar [Dun92] and later
confirmed with Twitter data [GPV11], it is difficult for people to manage more than a couple
of hundred relationships at the same time.

Figure 3.3: Emergent networks from the propagation of four videos on Twitter. In panels (A)

and (B) the local influential leaders performed a remarkable role in the diffusion process, whereas

in panels (C) and (D) the influence of hubs was much stronger. Figure adapted from [DO14].

Another property of socio-technological networks is the community structure. These

graphs typically present communities of users whose interactions occur more frequently

among themselves than with the rest of the network [JSFT09]. Due to the dynamical na-

ture of human interactions, these communities are not static but rather emerge, evolve and

disappear in time [NDXT11]. In general, these communities are formed due to similari-

ties in people’s characteristics and dramatically impact the way information spreads across

the network [GRM+12]. For instance, linguistic communities are usually found in socio-technological

networks, as people usually interact with those who speak the same language

[BMBL14]. Also, people form communities according to their interests. Recent studies

showed that news sharing networks on Twitter have communities of those who are interested

in either global, national or local scale issues [HZGMBY13, AMV+14]. Besides,

other characteristics, like political affinity, determine the emergence of communities during

political discussions and electoral campaigns [BMLB12, BMBL14].

3.3 Information Spreading

The exchange of information during social interactions leads to the contagion of ideas, opin-

ions and adoptions between people. The underlying laws that govern the contagion and flow

of ideas are similar to those that rule information spreading processes on social networks. In

the context of social media, recent studies have revealed that most of the information posted

in these networks is hardly propagated by the participants. Around 71% of the messages do

not travel any farther than the author’s timeline [CE09]. However, some authors manage to

spread their content in a wide variety of proportions, due to a combination of several factors,

such as their popularity, posting frequency, and novelty or resonance of the posted content

[RGAH11]. Fig. 3.3 shows the networks that emerged from the propagation of four videos on

Twitter. In these networks, nodes are users and links represent the diffusion of messages. It

can be noticed that the shape of networks A and B is radically different from that of C and D. In A

and B, local leaders played an influential role, making the diffusion really viral. Meanwhile,

in C and D, the diffusion happened mainly due to the popularity of hubs.

In 1973, Granovetter argued that while most information is shared within sets of

strongly tied individuals (or communities), weak ties link communities and spread the infor-

mation across the whole network [Gra73]. More recently, Onnela et al. [OSH+07] showed

that millions of mobile phone users do behave like social bridges, allowing information to flow

across communities in the social network. They showed that social networks are robust to

removing the strong intracommunity ties, but are destroyed if intercommunity weak ties are

removed. Also, the heterogeneity of human temporal behavior slows the diffusion process.

The temporal bursts of activity are typically trapped within the communities, producing fast

local cascades but reducing the diffusion at larger scales [MML10]. Moreover, other factors

also play a fundamental role in the process of social contagion. For example, the diversity of

contacts who already adopted the trend strongly determines whether users imitate them. Ugander

et al. [UBMK12] showed that people’s decision to join a social media site depends not only

upon the absolute number of friends who already joined, but also on the diverse kinds of

social groups that these friends represent, such as family, coworkers, etc.

3.3.1 Models

Many researchers have modeled the information spreading processes as branching processes.

These were first introduced by Galton and Watson in the 19th century to model the

survival of family names [WG75]. A branching process is a stochastic model where a set of

individuals Z_n, at generation n, produces a random number of new individuals Z_{n+1}, at

generation n + 1. The individual multiplications occur independently of each other

and of previous generations. Other branching models, like the Bellman-Harris process [BH48],

propose that individuals live for a random period of time independently of each other. Then,

once an individual’s lifetime has expired, a random number of new individuals is produced.
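
To make the definition of the Galton-Watson process concrete, the following minimal sketch simulates the generation sizes Z_0, Z_1, ..., Z_n. The offspring distribution, the function name `galton_watson` and the seed are illustrative assumptions, not values taken from the cited works.

```python
import random


def galton_watson(offspring_probs, generations, z0=1, rng=None):
    """Simulate the generation sizes Z_0, ..., Z_n of a Galton-Watson process.

    offspring_probs[k] is the (assumed) probability that one individual
    produces exactly k new individuals in the next generation.
    """
    rng = rng or random.Random()
    possible_counts = list(range(len(offspring_probs)))
    sizes = [z0]
    for _ in range(generations):
        current = sizes[-1]
        if current == 0:
            sizes.append(0)  # the process is extinct and stays extinct
            continue
        # Every individual reproduces independently of the others.
        children = rng.choices(possible_counts, weights=offspring_probs, k=current)
        sizes.append(sum(children))
    return sizes


if __name__ == "__main__":
    # Hypothetical offspring distribution with mean 0.7 (subcritical).
    print(galton_watson([0.5, 0.3, 0.2], generations=10, rng=random.Random(42)))
```

With a mean number of offspring below one, as in this hypothetical example, the process dies out with probability one, which resembles the many small cascades observed in social media.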

More recently, branching processes have been used to model several information spreading

processes on social media, inspiring network growth models characterized by the information

retrieved from data [IE11b, WWT+11, GJ10b]. Barabasi et al. [WWT+11] studied the flow

of emails in a corporation by means of a biased Galton-Watson model to reproduce the bushy

but shallow cascades. Golub [GJ10b] instead modeled deep and narrow email chains with a

similar process. Doerr et al. [DBM13] found that the adoption times of individuals in the

inter-arrival time of Twitter messages, or the propagation time of stories on a social media

site, can be explained through a convolution of log-normally distributed observation and

reaction times of the individual participants. Moro et al. [IE11b] studied the dynamics of

information flow from the individual activity patterns of the nodes’ branching dynamics. The

model includes a temporal variable that determines the time lapse between generations, based

on the observations of human activity patterns. They conclude that the heterogeneity in

user behavior gives rise to extreme events. Therefore the size of the emergent networks

from the diffusion process depends upon the activity of the spreaders.

Information cascades in social media have also been modeled as BA networks, biasing

the seed’s probability to gain more connections and linking all new nodes with a single

edge [GKK11]. This model reproduces very well the cascades’ shapes across different online

social networks, which are characterized by the model parameters. Wang et al. [AHSW11]

have proposed that the number of tweets in trending topics grows in a multiplicative way.

Kawamoto [Kaw13] proposed a dynamical model with stochastic branching growth to predict

cascades’ sizes taking into account the network of followers. The stochastic parameter in

the model follows a log-normal curve with a fatter tail. Zhang et al. [XLZ+12] proposed to

modify roles in the SIR model in order to include a message forward mechanism. In the new

state, contacted, nodes know about the message but have not forwarded it yet. Like in other

SIR models, the network heterogeneity diminishes the spreading probabilities.

3.4 Influence and Popularity

Social influence is the process by which individuals change their behavior as a result of social

interactions with other people. During these interactions, individuals exchange information

and adapt their opinions and beliefs based on the information received. This process may

happen consciously or not, through processes of persuasion and leadership [Kel58]. For

example, social interactions are known to be responsible for the transmission of social be-

haviors like eating [CF07], drug abuse [RMFC10, CF08] or emotions [FC08]. In fact, these

studies show that we are influenced not only by our acquaintances, but also by friends of

friends, and many other people in the network. In the context of social media, a user’s

influence can be measured by the propagation of their content across the network. However,

from a marketing perspective, social influence is seen as the capacity of some users to

encourage the adoption of a trend or product among their contacts [AW12].

Bakshy et al. [BHMW11] defined influence in social media according to the size of

the information cascades that users produce. They found that word-of-mouth in social

media functions by means of many small cascades seeded by regular users, rather than the

cascades produced by popular users. On the other hand, Aral et al. [AW12] found that

social properties like age, sex and marital status play a fundamental role in the adoption of

products. They found that influential people are less likely to be influenced, and that these

users are usually clustered in communities. Therefore, they think that influential people with

influential friends are a good target to start campaigns. Most of these studies are structural,

explaining influence either by the nodes’ properties or the connection patterns. However,

influence is also an interplay between dynamics and structure. In the context of the SIR

model, Klemm et al. [KSESM12] defined a measure of influence called spreading efficiency,

as the expected infected fraction of a network, when a node is initially infected and the rest

of the network is susceptible.
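
As an illustration of this definition, the sketch below estimates the spreading efficiency of a node by averaging the outbreak size over many SIR realizations. The discrete-time update, the one-step recovery and the infection probability `beta` are simplifying assumptions made here for brevity; they are not necessarily the choices made in [KSESM12].

```python
import random


def sir_outbreak_size(adjacency, seed, beta, rng):
    """Run one discrete-time SIR outbreak; return the fraction of nodes reached."""
    infected = {seed}
    recovered = set()
    while infected:
        newly_infected = set()
        for node in infected:
            for neighbor in adjacency[node]:
                if neighbor in infected or neighbor in recovered:
                    continue  # only susceptible nodes can be infected
                if rng.random() < beta:
                    newly_infected.add(neighbor)
        recovered |= infected  # assumption: nodes recover after one time step
        infected = newly_infected
    return len(recovered) / len(adjacency)


def spreading_efficiency(adjacency, seed, beta, runs=1000, rng=None):
    """Monte Carlo estimate of the expected infected fraction seeded at `seed`."""
    rng = rng or random.Random()
    total = sum(sir_outbreak_size(adjacency, seed, beta, rng) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    # Hypothetical star network: node 0 is the hub.
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    rng = random.Random(1)
    print("hub :", spreading_efficiency(star, 0, beta=0.5, rng=rng))
    print("leaf:", spreading_efficiency(star, 1, beta=0.5, rng=rng))
```

In this toy network the hub is expected to reach a larger fraction of nodes than a leaf, since a leaf must first infect the hub before the message can reach anyone else.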

Popularity is related to having a large number of connections. Cha et al. [CHBG10]

found that popular users on Twitter may not be the most influential. They say that influence

also includes being retransmitted and mentioned in the conversations. Both influence and

popularity are the result of the way people pay attention to each other. Ratkiewicz et

al. [RFF+10] showed in 2010 that the evolution of the collective attention is complex and

characterized by a bursty behavior, since the popularity of contents has abrupt growths

due to external factors. They modeled this process as preferential increase mechanisms

combined with random popularity shifts. This means that the popular contents tend to be

more popular in the future, boosted by the occurrence of external events. During this process,

the resonance of the content is extremely important in order to attract people’s attention

[RGAH11]. However, all content has an effective period to catch the collective attention

before losing that capacity. The novelty of information decays quite rapidly, stretching the

effective time to attract the collective attention [AHSW11].

3.5 Polarization

Polarization is a social phenomenon that frequently appears in many collectives and societies,

when individual beliefs diverge during dynamical opinion formation processes [BG08]. A

reason for such divergence is the tendency of people to discuss in a biased way, looking for and

accepting opinions that reinforce their own beliefs, as well as rejecting those positions that

contradict them [DGL13]. However, the heterogeneity of individuals’ backgrounds, beliefs and

needs also plays a fundamental role in the emergence of polarization. People’s diverse criteria

strongly condition the evaluation of issues like policy outcomes or the appreciation of the

state of a nation or any society [DW07].

Political polarization is understood as the alignment of individual ideologies with extreme

positions. A usual case called elite polarization [ZFT+08] occurs among leaders between

parties, or even within the same party, when mutually exclusive positions are assumed and

radically different solutions are proposed to common problems. Although it evidently seems

harmful for institutional cohesion and stability [Dia90], some scholars have pointed it out as

beneficial for democracies, arguing that the electorate is able to recognize their leaders’ ideological

positions more accurately [BB07]. In any case, elite polarization has been reported

to be much stronger than popular polarization, as regular people tend to assume more mod-

erate positions than politicians or opinion leaders [GMSS12, SS12]. Popular polarization

emerges when the ideological divergence occurs in societies as a whole. A reflection of it

occurs in electoral processes when the political options are reduced into two or a few parties

with extreme positions. Spatial segregation is another kind of polarization. This is related

to the Schelling model [Sch71], where agents move among slots in a grid, according to their

tolerance to coexist with others. The results are social segregation patterns in the

spatial configuration. During the process, an initially homogeneously distributed population

ends up grouped in clusters of others with similar characteristics.

More recently, political processes have been analyzed by means of data exhaust from

human communications through electronic devices [TL13]. In the context of the U.S. elections,

Adamic et al. [AG05] found a divided blogosphere where the two main political parties

(liberal and conservative) mainly cited blogs from their own community, with very few cross

interactions. Such division was reflected in their own articles, as each side showed

interest in different subjects. In Twitter conversations, Conover et al. [CRF+11] found

the retransmission mechanism to be the most polarized one. They proposed a method to

determine the political valence of certain keywords in the content of messages, based on how

parties use them in a mutually exclusive way. Moreover, Livne et al. [LSAA11] used Twitter

data to determine which party exploited the tool more effectively. They found that partisans

of each party show interest in specific topics, with little overlap. In general, these studies

show that there is a remarkable lack of debate in social media, and that people are usually

exposed to similar opinions.

Other researchers have proposed ways to measure the degree of polarization in the population.

From a dynamical point of view, Dandekar et al. [DGL13] measure whether

polarization is emerging in a population at each time step by looking at the evolution of dis-

agreement. Also, Weibull et al. [DW07] measure whether the distributions of opinions are

turning bimodal. Baldassarri et al. [BB07] measure the bi-modality of opinion distributions

by means of the kurtosis and variance. From a different perspective, Latane et al. [LNL94]

measure polarization based on the change of group sizes. They claim that polarization occurs

when a minority group of individuals grows in comparison to the majority group, thus

measuring polarization based on the growth rate. From a structural point of view, Conover

et al. [CRF+11] compare the polarization in Twitter networks by means of the modularity

measure. Modularity has also been used to measure polarization in congressional networks

[ZFT+08] and networks of modeled interactions [BB07]. From a different perspective, King

et al. [KOS11] defined Twitter users as a collection of binary decisions of either following or

not an influential set of politicians. They measured the distance between users, so that the

more coincidences in the follow decisions, the closer users are. They found two large clusters

of users and measured the polarization based on how close users are within a cluster, and

how far the clusters are from each other.
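
As a hedged illustration of the structural approach, the sketch below computes the modularity of a given partition of an undirected, unweighted network, using the standard form Q = sum over communities of [l_c / m - (d_c / 2m)^2], where l_c is the number of internal links, d_c the total degree of community c and m the total number of links. The cited studies may rely on directed or weighted variants; the toy network and the names used here are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted network.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    """
    m = len(edges)
    internal_links = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_links[cu] = internal_links.get(cu, 0) + 1
    return sum(
        internal_links.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


if __name__ == "__main__":
    # Hypothetical polarized network: two triangles joined by a single link.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    sides = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    print(modularity(edges, sides))
```

A partition that splits this toy network into its two triangles yields a positive Q, whereas placing all nodes in one group yields Q = 0; larger values indicate fewer cross-group interactions relative to chance.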

Chapter 4

DIGITAL TRACES AND

COMPUTATIONAL METHODS

Every time we consume telecommunication services we leave behind digital records of our

activity in the service providers’ databases. This data exhaust is a growing by-product

of today’s lifestyle. The datasets are huge in volume. Their nature is diverse and

includes several dimensions of human societies like friendships, debates, dating, payments,

etc. The analysis of data derived from human activity makes it possible to measure

social systems in their actual dimensions [Pen08]. However, most of the information

is unstructured, embedded in text messages, images or videos. Many of the traditional

querying methods are limited in their ability to retrieve information from this kind of data. Therefore, a

new set of methodologies and technologies has recently been proposed to query the data

[Whi09, Cho13], based on new principles of data treatment.

In this chapter we will discuss methods to treat data, obtain patterns and retrieve un-

structured information. First, we will show how to store and query data depending on its

structure. Second, we will review some of the techniques for pattern recognition. Finally,

we will describe and characterize the datasets we have built and analyzed in this thesis, as

well as review Twitter’s main functionalities and limitations, and how to gather data from its

servers.

4.1 From Data to Knowledge

The process of retrieving information from datasets requires several interdisciplinary skills.

First, we need computational knowledge, such as being able to store, query and manage the digital

stream of data. Also, we must be able to abstract the data and mathematically treat it,

in order to quantify and measure it. Finally, we must extract the information contained in

the form of patterns, by means of recognition techniques. These include statistical, computa-

tional, modeling and visualization methods. The ultimate goal is to understand the meaning

of patterns and the behavior of the system, in order to be able to propose dynamical models

to explain, reproduce or predict the observed behavior.

The information is unstructured within the data. The methodology applied in this the-

sis to retrieve it consists in abstracting the data and constructing intermediate structures,

which are easier to manage and where information is condensed. Such structures can be density

functions, multi-dimensional samples or complex networks. Most of them result from

the aggregation of the collective behavior, either by looking for overall kinetics or internal

structures due to relationships. Finally, by means of pattern recognition techniques, we

can reduce the complexity of the system and find the best descriptions for its structure and

dynamical change.

4.1.1 Data Management

The first step for data analytics is to build the infrastructure necessary to store and query the

data. In this section we will present the methodologies followed to manage these processes,

according to the structure of the data model.

All databases ultimately rely on large files, stored on the server’s hard drives, containing

enormous amounts of bytes. The data contained in these files are usually organized

by following a data model, where entries are defined as a set of values that correspond to

attributes or fields. The nature of the information and the way fields are organized define

the data structure and consequently the methodology followed to treat it.

There are two kinds of data: structured and unstructured. Structured data are given

in tables of rows and columns, while unstructured data do not follow a predefined model.

Take an online message, for example: one could build a table with a field called text and

have all the posts in a single table. But traditional ways to query data will not answer

questions such as which topic is the most influential or which is the least popular. That kind of

information is unstructured across the records.

Relational Tables

Relational tables are the way structured data are usually stored [DD87]. Tables follow

a schema composed of fields of any data type, such as numbers, text strings,

dates or booleans. These fields are table columns and the rows are entries that fill the columns

with a set of values. The number of columns is defined, but the number of rows is not. Each

row is identified by a unique key. Keys serve as indexes for the database query engine,

as in SQL databases, to optimize searches. Also, they can be used as columns in other tables, creating

relationships between entries of different tables. The ultimate goal of these methods is to

retrieve fields across tables joined by their related identifiers.
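
The relationship between tables through keys can be sketched with Python's built-in sqlite3 module. The schema below (tables `users` and `messages`, and their contents) is a hypothetical example and not the data model used in this thesis.

```python
import sqlite3


def messages_per_user():
    """Join two related tables by their keys and count messages per user."""
    conn = sqlite3.connect(":memory:")  # temporary in-memory database
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,  -- unique key of each row
            name    TEXT
        );
        CREATE TABLE messages (
            message_id INTEGER PRIMARY KEY,
            user_id    INTEGER REFERENCES users(user_id),  -- key of another table
            text       TEXT
        );
        """
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?, ?)",
        [(10, 1, "hello"), (11, 1, "again"), (12, 2, "hi")],
    )
    rows = conn.execute(
        """
        SELECT users.name, COUNT(messages.message_id)
        FROM users JOIN messages ON messages.user_id = users.user_id
        GROUP BY users.user_id
        ORDER BY users.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(messages_per_user())
```

The query retrieves a field from one table (the user name) together with an aggregate over the related table (the number of messages), joined by the shared identifier.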

Non-Relational Methods

Big Data is not about querying tables with very large amounts of entries. It is about

finding patterns in the entries at different, usually larger, scales [BYB13], and retrieving the

unstructured information contained in the raw data. For this purpose, data must be processed

with other kinds of methods. New architectures, like Hadoop [Whi09] or MongoDB [Cho13],

provide mechanisms to build flexible models for the diversity of data found nowadays. These

methodologies usually allow parallel processing, which enables a better scaling than SQL

tables. Queries can be performed on multi-server environments, where several stations query

in parallel across millions of entries. Also, the models are file-based. The data entries are

typically stored in text files in the form of key/value dictionaries, like the JSON format

[Cro06], where keys identify the columns from the data model and values hold the content of the

record. This way of organizing data allows records to be nested and the data model to be

modified dynamically.
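
A short sketch with Python's built-in json module shows what nesting and dynamic modification mean in practice. The record and its field names are invented for illustration and do not reproduce the actual format of Twitter messages.

```python
import json

# Hypothetical message stored as one line of a text file.
raw_line = '{"id": 1, "text": "hello world", "user": {"name": "alice", "followers": 10}}'

record = json.loads(raw_line)  # parse into a key/value dictionary

# Nested records are reached by chaining keys.
author = record["user"]["name"]

# The data model can change on the fly: add a field that other records lack.
record["lang"] = "en"

serialized = json.dumps(record, sort_keys=True)  # back to a line of text

if __name__ == "__main__":
    print(author)
    print(serialized)
```

No schema has to be declared in advance: each record carries its own keys, so entries with different fields can coexist in the same file.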

Map and Reduce

Map and Reduce is an alternative methodology to query data [DG08]. It is based on

two procedures: map and reduce. The goal is to divide the problem into several smaller

sub-problems which can run in parallel and later be summarized.

The map procedure turns the original data into intermediate structures. These are usually

relations between keys and values that contain a partially processed amount of data. The

reduce procedure gathers all the results of the mappers and combines them into a single result.

For this purpose, the reduce procedure can either sum, accumulate or process the partial

results with any mathematical function. For instance, if we want to count all the visible stars

in the sky, we could have one single station counting region by region; or instead divide the

sky into a thousand pieces, and assign the same task to a thousand processes. These processes

are the mappers. They will count the stars in their sky portion alone, which should be

possible for them to count in a reasonable amount of time. Then another process will gather

and combine all the partial results into a single one. Furthermore, let us suppose we want

to classify the observed stars according to their type. In this case, each mapper can count

the number of each type of star in their own sky portion. Accordingly, the reduce algorithm

can sum up all the partial counts into a single final one.

This methodology enables parallel computing, which is very efficient in terms of time

and computational cost. The parallel computing can take place on a single station, by

means of multi-threading the code, or on several workstations, as in computing clouds. A

typical implementation of this algorithm is composed of the following three methods:

1. Manager: This is the main method where we open the data files, create the mappers,

execute them, wait for their partial results and call the reduce algorithm.

2. MapFunction: This is the mappers’ method, where we build the partial results and

deliver them to the manager process. The communication between processes can be

done by means of a queue object.

3. ReduceFunction: This method aggregates all the partial results into a single one.

In Python¹, new processes are created with the Process class from the multiprocessing

package². Its constructor creates a process object which receives as arguments the

target function to execute and the corresponding arguments. Once started, the new process executes

the target function and terminates at its conclusion.
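
The three methods can be sketched as follows for the star-counting example, using `Process` and `Queue` from the multiprocessing package. This is a minimal illustration under our own assumptions (the function names, the helper `count_types` and the toy data are hypothetical), not the code used to process the datasets of this thesis.

```python
from collections import Counter
from multiprocessing import Process, Queue


def count_types(sky_portion):
    """Partial result of one mapper: number of stars of each type in its portion."""
    return Counter(sky_portion)


def map_function(sky_portion, queue):
    """Mapper: build the partial result and deliver it to the manager."""
    queue.put(count_types(sky_portion))


def reduce_function(partial_counts):
    """Reducer: aggregate all the partial counts into a single one."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def manager(sky_portions):
    """Create one mapper per portion, run them, collect and reduce the results."""
    queue = Queue()
    mappers = [
        Process(target=map_function, args=(portion, queue))
        for portion in sky_portions
    ]
    for mapper in mappers:
        mapper.start()
    # Read the results before joining, so that no mapper blocks on a full queue.
    partial_counts = [queue.get() for _ in mappers]
    for mapper in mappers:
        mapper.join()
    return reduce_function(partial_counts)
```

From a script, the manager would be called inside an `if __name__ == "__main__":` guard, for example `manager([["dwarf", "giant"], ["giant"], ["dwarf", "supergiant"]])`, which returns the total count of each star type.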

4.1.2 Finding Patterns

In order to find the patterns hidden in data, a set of variables must be defined first. These

variables characterize the information, and the raw data are aggregated into

them. The information is then retrieved by means of statistical analysis

and pattern recognition. In this section, we will review the most important techniques for

pattern recognition that we have used in this thesis.

Probability Distributions

The first approach to analyze a variable is to characterize its distribution. Such a characterization

allows us to determine the expected outcomes of different experiments, determine

ranges of the sample space and find anomalies in the data. A probability distribution is a

function that indicates the likelihood of a random variable to take on given values, defined in

a probabilistic space. These values could be categorical or numerical, contained in discrete or

continuous random variables. The probability of a variable to take on a given range of values

is given by the integral of the density function over that range. The curve of a probability

distribution is nonnegative and the integral across the whole space is equal to one, just as

the probability of any of the possible events to happen.

¹ http://docs.python.or

² http://docs.python.org/2.7/library/multiprocessing.html

After defining variables in data, the probability density functions are built by counting

the frequency of each outcome. An easy way is to first build the cumulative distribution

function. To do so, we sort the data entries from lowest to highest and plot

them against a vector from 0 to 1 with the same number of equally spaced points as the

original sample. Then we can take the derivative of this function and find the probability

density function of the variable.
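
This construction can be sketched in a few lines. The convention of assigning the cumulative probability (i + 1)/n to the i-th sorted entry and the simple finite-difference derivative are assumptions made for this illustration; in practice the derivative is noisy and is usually smoothed or binned.

```python
def empirical_cdf(sample):
    """Sorted values and their cumulative probabilities (equally spaced up to 1)."""
    xs = sorted(sample)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps


def density_from_cdf(xs, ps):
    """Finite-difference derivative of the cumulative function.

    Returns the midpoints between consecutive distinct values and the
    estimated density at each midpoint.
    """
    midpoints, densities = [], []
    prev_x, prev_p = xs[0], ps[0]
    for x, p in zip(xs[1:], ps[1:]):
        if x > prev_x:
            midpoints.append((x + prev_x) / 2)
            densities.append((p - prev_p) / (x - prev_x))
            prev_x, prev_p = x, p
        else:
            prev_p = p  # repeated value: the cumulative probability jumps here
    return midpoints, densities


if __name__ == "__main__":
    xs, ps = empirical_cdf([3, 1, 2, 4])  # hypothetical sample
    print(xs, ps)
    print(density_from_cdf(xs, ps))
```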

Depending on the distribution, statistical moments like the average and the standard deviation

are good enough to characterize the population. Homogeneous distributions like

Gaussian curves or uniform distributions meet this criterion. However, these statistical

moments may diverge in heterogeneous distributions like power laws. Therefore, when char-

acterizing the distribution of complex systems variables it is important to understand the

structure of the system, before making wrong assumptions.

Correlations

The correlation of two or more variables indicates whether or not there is a relationship

of dependence between them. There are several ways to measure the strength and direction

of such dependence. The strength measures how strongly or weakly variables are correlated.

The direction indicates if the variables are correlated in the same direction or opposite

directions.

There are several correlation coefficients to quantify the degree of the variables’ relationship.

The Pearson coefficient is one of the most popular and is the one used in this thesis. This

coefficient is often denoted as r and measures the linear relationship between two variables.

It is obtained as the ratio of the covariance of the two variables, X and Y, to the product

of their standard deviations, σx and σy, in the following way:

\[ r = \frac{E[(X - \mu_x)(Y - \mu_y)]}{\sigma_x \sigma_y} \tag{4.1} \]

where E[] is the expected value function, and µx and µy are the average values of X and Y. r ∼ 1 when

there is an increasing relationship between the variables, which means that the larger the

first, the larger the second. If, instead, the relationship of the two variables is inverse, and

the larger the first the smaller the second, then r ∼ −1. Finally if there is no correlation

between the variables and they seem to behave independently, then r ∼ 0.
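
As a minimal sketch of Eq. 4.1, the coefficient can be computed from two samples by replacing the expected values with sample averages. The function name `pearson_r` and the example data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson coefficient: covariance divided by the product of the deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when one of the variables is constant")
    return covariance / (sigma_x * sigma_y)


if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    print(pearson_r(xs, [2, 4, 6, 8]))  # increasing relationship
    print(pearson_r(xs, [8, 6, 4, 2]))  # inverse relationship
```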

It is also possible to compare series of random variables. In this case a different kind

of correlation is applied, such as the cross-correlation. Statistically, it is the

correlation of all points of both time series at different times, defined as the convolution of

both signals:

\[ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{4.2} \]

A shifted correlation would indicate the evolution of the joint behavior of two series,

highlighting the instants where the series have independent or coherent behavior.
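
For sampled series, one common way to obtain a shifted correlation is to compute the Pearson coefficient between one series and a lagged copy of the other. The sketch below follows that discrete approach, which is an assumption of this illustration rather than a direct evaluation of the integral in Eq. 4.2; the helper names and the toy series are hypothetical.

```python
import math


def _pearson(a, b):
    """Pearson coefficient of two equally long samples (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(xs, ys, lag):
    """Correlation between xs[t] and ys[t + lag]; lag may be negative."""
    if lag >= 0:
        a, b = xs[: len(xs) - lag], ys[lag:]
    else:
        a, b = xs[-lag:], ys[: len(ys) + lag]
    return _pearson(a, b)


if __name__ == "__main__":
    xs = [1, 3, 2, 5, 4, 7, 6, 9]
    ys = [0, 0] + xs[:-2]  # ys repeats xs with a delay of two steps
    for lag in range(-3, 4):
        print(lag, round(lagged_correlation(xs, ys, lag), 3))
```

In this toy example the correlation peaks at a lag of two steps, which recovers the delay between the two series.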

Data Clustering

The clustering process aims to find similarity between samples of a set of data. It is the

process of grouping a set of observations in such a way that those that belong to the same group

are more similar to each other than to those that belong to other groups. This broad definition of clusters

has given rise to several interpretations and therefore several kinds of algorithms to solve the

problem. Most of them are based on concepts of distance between observations and defining

criteria to determine which element is closer to which.

An example used in this thesis is the k-means clustering algorithm [Mac67]. In this

algorithm clusters are represented by central vectors. These are not necessarily part of the

dataset and their quantity is fixed to a previously defined number k. The algorithm therefore

finds the k cluster centers such that the squared distances from the observations to their cluster

centers are minimized. Its main limitations are that the number k must be given in advance and that it is not

very accurate at determining cluster borders, since it has been optimized to find centers and

not borders. However, methods like the silhouette [Rou87] allow us to measure the quality of

partitions, based on how close elements are to their centers versus how far apart they are

from other centers.
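
A minimal sketch of k-means is given below, using the usual iterative heuristic (Lloyd's iteration): assign each observation to its closest center, then move each center to the mean of its members. The random initialization from the data points, the stopping rule and the toy data are assumptions of this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, rng=None):
    """Return the k cluster centers and the cluster index of each point."""
    rng = rng or random.Random()
    centers = rng.sample(points, k)  # assumption: start from k random data points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, k=2, rng=random.Random(0)))
```

Note that the result depends on the initial centers; with poorly separated data it is common to repeat the procedure from several initializations and keep the best partition.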

Multidimensional scaling (MDS) is a methodology to visualize the similarity of entries in

the dataset [BG05]. It understands the objects as a set of variables in a multi-dimensional

space. The model consists in reorganizing the elements in order to reduce the dimensionality

while preserving the original distances between the elements as much as possible. Therefore, the multiple

dimensions are reduced to a few, and new coordinates are assigned to elements in the new

space. Two dimensions allow the entries to be plotted as scattered dots, revealing clusters of

closer elements.

Networks

Another technique to find patterns in data is the theory of complex networks [BLM+06]. By

means of the construction and analysis of networks, the structure and dynamics of many

natural, social and technological systems have been revealed. This methodology unveils

patterns at different scales by aggregating the relationships occurring at the local scale. The

idea is to understand the complex system abstracted in the form of a network, by means of the

graph’s topological properties. This means characterizing the way the system is structured,

for example by unveiling mixing patterns, as well as modeling the dynamical processes that take place

on the network, such as information spreading.

4.1.3 Statistical Significance

A challenge when measuring patterns in related random variables is to discriminate sampling

errors and to confirm that the claimed effects are not the result of random processes. The

statistical significance measures how far observations are from being just random and

indicates whether observations represent actual properties in the population [Cum12].

The z-score is a measure to test the significance of a given value with respect to the

expected from a normal distribution. It normalizes the variable x according to the average

µ and the standard deviation σ in the following way:

\[ z = \frac{x - \mu}{\sigma} \tag{4.3} \]

The value of z indicates how common an observation is with respect to the probabilistic

space. A low z-score means that the observation is common and that the probability for it

to happen is high. A high z-score means that the observation is not that frequent and that

its probability of occurrence is low. The probability of occurrence of a normalized variable is

called p-value. The lower the p-value, the larger the amount of information contained

in an observation: the less probable it is, the more information it contains [Sha01].
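
The following sketch relates the three quantities discussed above. It assumes a normal distribution, takes the p-value as the two-sided tail probability P(|Z| >= |z|), and measures information as -log2(p) in bits; these conventions and the function names are choices made for this illustration.

```python
import math


def z_score(x, mu, sigma):
    """Normalize x by the average mu and the standard deviation sigma (Eq. 4.3)."""
    return (x - mu) / sigma


def two_sided_p_value(z):
    """Probability of an observation at least as extreme as z under a normal law."""
    return math.erfc(abs(z) / math.sqrt(2))


def information_bits(p):
    """Information content of an event of probability p, in bits."""
    return -math.log2(p)


if __name__ == "__main__":
    # Hypothetical observation: x = 130 with mu = 100 and sigma = 15.
    z = z_score(130, 100, 15)
    p = two_sided_p_value(z)
    print(z, p, information_bits(p))
```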

The same happens with patterns: we need to know to what extent the observed patterns are not

just a matter of chance and how much information is contained in the measure. A

common technique to test the significance of the structure of networks or dynamical models

is to suppose that the measured configuration is just a realization of a stochastic process.

To test this, we can rewire the network edges and create several independent

configurations from the same space of possibilities. Then we can compare our measure with

the results from the reshuffled networks, and estimate its statistical significance.
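
One possible realization of this test is sketched below. It assumes an undirected network without self-loops, reshuffles it with degree-preserving double-edge swaps, and uses the number of triangles as the measured property; the choice of null model, of the measure and of all function names is ours and serves only as an example.

```python
import random
import statistics


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    result = {}
    for u, v in edges:
        result[u] = result.get(u, 0) + 1
        result[v] = result.get(v, 0) + 1
    return result


def count_triangles(edges):
    """Number of triangles: every triangle is seen once from each of its edges."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return sum(len(neighbors[u] & neighbors[v]) for u, v in edges) // 3


def rewire(edges, swaps, rng):
    """Reshuffle edges with double-edge swaps that keep every node's degree."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < swaps and attempts < 100 * swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible ways to swap
        if len({a, b, c, d}) < 4:
            continue  # the swap would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would create a repeated edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def triangle_z_score(edges, samples=200, swaps=None, rng=None):
    """z-score of the observed triangle count against the reshuffled networks."""
    rng = rng or random.Random()
    swaps = swaps or 10 * len(edges)
    observed = count_triangles(edges)
    null = [count_triangles(rewire(edges, swaps, rng)) for _ in range(samples)]
    mu, sigma = statistics.mean(null), statistics.pstdev(null)
    return (observed - mu) / sigma if sigma > 0 else float("nan")
```

A large positive z-score would indicate that the observed network has many more triangles than expected from its degree sequence alone.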

4.2 Twitter Datasets

Twitter is an online social network with over 200 million users around the globe. Its main
feature is that it allows people to post and exchange text messages limited to 140
characters [JSFT09]. People use it from personal computers and, increasingly, from mobile
devices. According to recent research on user trends [Com11], most people participate
in social media away from personal computers. Each message contains information about
its author, creation date, source device, text body and sometimes geolocation. By default
messages are public on Twitter, but users have the option to make them private and share
them only with selected contacts.

There are several mechanisms for users to interact on Twitter. The first of these is the
followers mechanism. It allows users to passively receive all the messages posted by those
they follow, as well as to deliver their own messages to their own followers. In this sense,
it establishes the Twitter followers network, where users are connected to each other
through links that determine the explicit paths along which messages are delivered.
Twitter’s global followers network is a directed graph where non-reciprocal relations are
allowed. Previous studies have reported complex properties in this network [KLPM10], such
as a degree distribution with power-law behavior, a small mean distance between nodes and a
modular structure.

An important mechanism on Twitter is the retweet, which is used to retransmit messages
from other sources. It is the most popular way to propagate received messages and allows
individual messages to travel throughout the network. By retweeting a message, users
deliver specific information to their own followers, while at the same time endorsing ideas
and gaining visibility in the network [BGL10]. The study of retweet cascades has served to
characterize user profiles [GAC+10], measure influence [CHBG10] and propose spreading
models [XLZ+12].

Any message on Twitter may be labeled using keywords called hashtags. This mechanism
organizes conversations, and individuals use it to exchange ideas on specific subjects. It
is responsible for generating the trending topics, and it lets people discuss without
needing any explicit relationship. Recently, the statistical analysis of hashtag usage has
enabled predictions about social relations [RTU11] and collective attention [LGRC12].

4.2.1 Data Gathering

Twitter has several Application Programming Interfaces (APIs) that let people interact
programmatically with the online service. These APIs are used to gather the data. There are
three main Twitter APIs:

1. The Search API³ queries messages from a temporal index of recent tweets, posted
within the last week. Queries must contain a keyword to look for in the message’s
text. Its limits are specified in terms of query complexity and frequency, rather
than as a percentage of the main stream.

2. The Stream API⁴ delivers real-time data, providing about 1% of the main stream.
It may track keywords, users or geolocated messages.

3. Finally, the REST API⁵ is used to programmatically perform functions such as
posting messages or following people by means of applications. It also allows
downloading user-related information such as profiles or follower lists.

4.2.2 Datasets

Using the Twitter Search API, we have built several datasets from public access messages.

Many of the datasets are related to events, like political protests, electoral campaigns or

historical announcements. We have queried the Twitter databases by looking for messages

that contain keywords (or hashtags) that identify the events. In this section we will describe

each of these datasets. Their properties can be found in Table 4.1.

The main analyses conducted in this thesis concern two datasets related to Venezuelan
politics. Venezuela ranks thirteenth in the world in Twitter penetration [Sem12]. Close to
3 million Venezuelans participate in this online social network, which is equivalent to
almost 10% of the country’s population⁶. The political usage of Twitter in Venezuela is of
great importance and has played a fundamental role in recent Venezuelan history [MV12,
NT12]. The late President Hugo Chavez was considered to be the second most influential
world leader on Twitter [Cou12], preceded only by the US President Barack Obama. The group
that opposes the late President also finds in social media a channel to speak freely to its
supporters and protest against the Government [MLB12].

³ https://dev.twitter.com/docs/using-search
⁴ https://dev.twitter.com/streaming/overview
⁵ https://dev.twitter.com/rest/public
⁶ http://www.ine.gov.ve/

The first dataset we considered is related to a Venezuelan political protest that took
place exclusively by digital means on December 16th, 2010. The event consisted of posting
messages identified with the hashtag #SOSInternetVE. We downloaded all the messages
that included this hashtag between December 14th and 19th, 2010 (two days before and after
the protest). In total we found 421,602 messages, written by 77,706 users. It is remarkable
that 42% of the messages were retweets and 60% were sent from smartphones.

Then, we considered a conversation about the late Venezuelan President Hugo Chavez on
Twitter. The conversation includes the day of the announcement of the President’s death,
as well as the scheduling of new elections. In total we downloaded 16,383,490 messages
written by 3,173,090 users over a two-month period, from February 4th, 2013 (29 days
before the death announcement) to April 4th, 2013 (26 days after the death announcement).
Messages were posted in more than 159 countries (according to the 0.4% of messages that
were geolocated). Our analysis is based on the messages that are retweets or
retransmissions, which correspond to 49% of the downloaded messages, and more specifically
on those that make up the giant components of the retweet networks, which come from 57% of
the original set of users.

In order to generalize results, we have also considered other datasets related to
conversations of diverse nature such as sports, news, political protests and electoral
campaigns. One of these datasets is related to a political scandal that took place in the
Spanish parliament in 2012 due to some inappropriate comments from a congresswoman that
echoed loudly on the social networks. This dataset was built by downloading 35,835 messages
from 23,498 users, using the hashtag #Andreafabra, from July 12th, 2012, to July 23rd,
2012. Another dataset concerns a conversation about a Venezuelan baseball team. It was
built by downloading 142,808 messages that contained the team’s name, leones, posted by
46,608 users during a three-week period from Dec. 22nd, 2010, to Jan. 12th, 2011. We have
also constructed a dataset regarding the announcement by the Basque separatist group ETA
declaring the end of the armed struggle. We downloaded 617,545 messages posted by 241,292
users during a ten-day period from Oct. 10th to 25th, 2011. We have also built another
dataset concerning the 2011 Arab Spring, by downloading 7,433,542 messages that contained
the keyword (and hashtag) Egypt, posted by 1,180,715 users during a five-week period, from
Jan. 12th, 2011, to Feb. 17th, 2011. During this period the former Egyptian president
Mubarak was overthrown by the social demonstrations. One dataset concerning the 2012 US
presidential elections was built by gathering all the messages that contained the word
Gingrich during a week-long period from Feb. 29th, 2012, to Mar. 3rd, 2012. This dataset is
composed of 93,063 messages and 43,061 users. Another dataset regarding the same elections
was built by collecting messages mentioning Obama during the first televised debate, from
Oct. 3rd, 2012, to Oct. 5th, 2012. This dataset is composed of 6,818,782 messages and
2,265,799 users. The last of these datasets is related to the 2011 Spanish electoral
process. It was built with all the messages that contained the keyword (and hashtag) 20N,
which was used by all parties in reference to the election day on Nov. 20th, 2011. This
dataset covers the period from Oct. 29th, 2011, to Nov. 27th, 2011, and is composed of
389,988 messages and 123,710 users. In [BMLB12], we characterized the interactions between
users and politicians during these elections and found that the mass media accounts widely
dominated the attention received through the retweet mechanism, while politicians ruled the
mentions scenario.

Identifier       Messages       Users        Dates
Andreafabra      35,835         23,498       Jul. 12th to 23rd, 2012
Gingrich         93,063         43,061       Feb. 29th, 2012 to Mar. 3rd, 2012
Leones           142,808        46,608       Dec. 22nd, 2010 to Jan. 12th, 2011
20N              389,988        123,710      Oct. 29th, 2011 to Nov. 27th, 2011
SOSInternetVE    421,602        77,706       Dec. 14th to 19th, 2010
ETA              617,545        241,292      Oct. 10th to 25th, 2011
Obama            6,818,782      2,265,799    Oct. 3rd, 2012 to Oct. 5th, 2012
Egypt            7,433,542      1,180,715    Jan. 12th, 2011 to Feb. 17th, 2011
Chavez           16,383,490     3,173,090    Feb. 4th, 2013 to Apr. 5th, 2013
Geolocated       500,000,000    -            Oct. 1st, 2013 to Jan. 31st, 2014

Table 4.1: Description of the studied datasets.

Most of these datasets are related to events that occurred offline, such as televised
debates, electoral processes or historical happenings. In Fig. 4.1 we present the temporal
evolution of the Twitter activity during three of these events: the Spanish election (in
panel A), the Egyptian revolts (B) and Obama’s debate (C). It can be noticed that during
these events there is a burst of activity, characterized by an abrupt growth followed by a
smooth decay. This pattern is remarkably ubiquitous regardless of the number of people
participating or the number of messages sent. As shown in panels A, B and C, the height of
the activity peak can span several orders of magnitude, and yet the curves still present a
similar shape. Moreover, the scale independence is also temporal. The gradual decrease of
activity after the peak can last from a couple of hours, as shown in panel C, up to several
days, as shown by the envelope curve in panel D.
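An activity curve of the kind discussed above can be obtained by counting messages in hourly bins. The following Python sketch illustrates the idea; the function name and the sample timestamps are hypothetical, and the thesis does not state how its curves were actually computed.

```python
from collections import Counter
from datetime import datetime

def messages_per_hour(timestamps):
    """Count messages in each hourly bin; returns {hour_start: count}."""
    bins = Counter(t.replace(minute=0, second=0, microsecond=0) for t in timestamps)
    return dict(sorted(bins.items()))

# Hypothetical timestamps: a quiet hour, a burst, and the start of the decay.
sample = [
    datetime(2011, 11, 20, 19, 5),
    datetime(2011, 11, 20, 20, 1),
    datetime(2011, 11, 20, 20, 30),
    datetime(2011, 11, 20, 20, 59),
    datetime(2011, 11, 20, 21, 15),
]
series = messages_per_hour(sample)
peak_hour = max(series, key=series.get)
print(peak_hour, series[peak_hour])
```

Plotting such a series on a logarithmic y-axis makes it possible to compare bursts whose peaks differ by orders of magnitude.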

Finally, we also studied geolocated messages from the Twitter Stream API. Unlike in
the previous datasets, these messages are not filtered by keywords but by having the
geolocation option enabled. In general, geolocated messages represent around 3% of the
Stream API messages. However, since these messages represent a minority of the overall
stream, the Stream API provides 90% of them [MPLC13]. In summary, we collected roughly 500
million geolocated tweets between October 1, 2013 and January 31, 2014 from across all
latitudes and longitudes.

4.2.3 Representativity

In order to conduct research with Twitter data, we must consider the following facts. Due
to the technological nature of Twitter, its users tend in general to be younger than the
average person and to live in denser, more urban areas [DB13, MLA+11]. Also, Twitter is not
used equally in all countries. For instance, it is banned in countries like China or Iran.
Therefore, a random sample of Twitter users may not necessarily be representative of the
whole society [GA12]. However, Twitter datasets are so massive that they enable the
observation of tendencies and patterns in the behavior of millions of people and their
interactions [Mil11].

4.3 Mobile Phones Datasets

Mobile phone datasets are made of Call Detail Records (CDRs). These are generated in the
communication provider’s databases by every phone call or SMS. A CDR usually contains
information about the origin and destination phone numbers, the starting time and duration
of the call, and the antenna that is serving the subscriber. For telephone service
providers, CDRs are critical for the production of the monthly bill. However, their
information is also valuable for identifying individuals and their usual behavior and
location.

Over the last few years, due to the exponential increase in the penetration of mobile

phones, new opportunities for obtaining such indicators have emerged. In particular, the

use of mobile phones as sensors of human behavior has yielded important research findings

in large-scale social dynamics analysis in areas such as human mobility [GHB08], informa-

tion diffusion [OSH+07], social development [BLT+11], epidemiology [WET+12] and disaster

response [BWB11].

Studies that use mobile phone data usually anonymize the personal data in the CDRs,
replacing phone numbers with random identifiers. However, recent research has shown that
individual trajectories are so unique that just a few locations are enough to identify any
individual [MHVB13]. Although this fact is relevant for user privacy, most of the
scientific interest does not lie in tracking individuals, but rather in finding collective
behavioral patterns

Figure 4.1: Temporal evolution of Twitter activity (messages/hour) corresponding to datasets:
(A) 20N, (B) Egypt, (C) Obama and (D) Chavez, described in Table 4.1. All panels display the
impact of events on Twitter activity. All four present a burst of activity when the event
takes place, which gradually decreases down to previous levels. Panels (A), (B) and (C) have
similar patterns despite spanning three orders of magnitude on the y-axis. The envelope
curve in panel (D) presents the same pattern on a different time scale: the gradual decrease
of activity spans several days. The inset curve corresponds to the activity during the area
shaded in green, shown on a linear scale.

that explain social processes. For that purpose, data is usually aggregated either socially
or geographically.

4.3.1 Datasets

We first analyzed the CDR data provided by France Telecom/Orange Côte d’Ivoire within the
framework of the Data for Development (D4D) Challenge [BEC+12]. The data was collected
over 150 days, from December 1, 2011 until April 28, 2012. The set of collected CDRs
contains 2.5 billion calls and SMS exchanges between around five million anonymized users.
In this thesis we work on the following datasets from the D4D project:

1. Antenna-to-antenna: This dataset includes the aggregated number and duration of
calls between any pair of antennas per hour. That is, each record of the dataset
contains the number and duration of all calls made from one antenna to another
during each hour of the observation period. Therefore, there is no user-level data
in this dataset.

2. Individual trajectories: This dataset concerns the movement of people between
antennas during calls. It contains the trajectories of 50,000 individuals among
antennas. Each record indicates the time and location of a user whenever they
started or received a call. In order to preserve privacy, the identities of all
users were re-randomized every two weeks.

3. Antenna location: The locations of the antennas were provided together with the
datasets. However, a random displacement was added to the actual locations in order
to protect the company’s sensitive information.
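To illustrate how hourly antenna-to-antenna records can be turned into a weighted directed network, the following Python sketch sums the number and duration of calls for every ordered pair of antennas. The record layout, the field names and the sample values are assumptions made for illustration; they do not reproduce the actual D4D file format.

```python
from collections import defaultdict

def aggregate_antenna_links(records):
    """Sum hourly (origin, destination) records into weighted directed links.

    Each record is a tuple (hour, origin_antenna, destination_antenna,
    n_calls, total_duration_s); this field layout is assumed for illustration.
    """
    links = defaultdict(lambda: [0, 0])
    for _hour, origin, dest, n_calls, duration in records:
        links[(origin, dest)][0] += n_calls
        links[(origin, dest)][1] += duration
    return {pair: tuple(totals) for pair, totals in links.items()}

# Hypothetical hourly records between antennas 1 and 2.
records = [
    ("2011-12-01 00", 1, 2, 3, 180),
    ("2011-12-01 01", 1, 2, 1, 40),
    ("2011-12-01 01", 2, 1, 2, 95),
]
print(aggregate_antenna_links(records))
```

Keeping the hour in each record instead of discarding it would allow the same aggregation to be repeated per time window.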

We also analyzed CDR data from the mobile operator Telefonica⁷ in Mexico. Among all

the data contained in a CDR, our study uses the anonymized originating and destination

numbers, the date and the duration of the call, as well as the latitude and longitude of the

serving antennas. We analyzed a total of nine months, from July, 2009 to March, 2010. In

order to protect privacy, all the information presented is aggregated above the user level.

No contract or personal data was collected, accessed or utilized for this study. No authors

of this study participated in the extraction of the dataset.

⁷ www.telefonica.es

4.4 Additional Sources of Information

In order to complement the data collected from Twitter and the mobile phone datasets, we
also analyzed the following sources of information.

1. Global Administrative Areas Database

The GADM⁸ provides GIS-compatible maps of administrative areas worldwide. GADM
was used to classify the antenna locations on the map and associate them with
administrative boundaries in Venezuela, Ivory Coast and Mexico.

2. Language Map from Ethnologue

The Ethnologue: Languages of the World⁹ is a reference work cataloging all of the
world’s known living languages. We have used the ethnic and language maps of Ivory
Coast [Lew09] in order to classify antenna locations and map them to ethnic groups
and languages.

3. African Infrastructure Knowledge Program

The African Infrastructure Knowledge Program¹⁰ from the African Development Bank
provides GIS-compatible maps of transport, communication, power, sanitation and
water infrastructure. We have used the maps of main roads in Ivory Coast.

4. Electoral Data

The results from the national and regional 2013 elections in Venezuela¹¹ have been
used to compare the results from the Twitter analysis with the offline context.

5. Census Data

The most recent official census of Venezuela¹² has been considered to estimate the
Twitter penetration in Venezuela. Also, the most recent official census of Mexico¹³
has been used to assess the representativeness and validate the population
distribution inferred from the mobile phone data.

⁸ http://www.gadm.org/
⁹ http://www.ethnologue.com/
¹⁰ http://www.infrastructureafrica.org/
¹¹ http://www.cne.gob.ve
¹² http://www.ine.gov.ve/
¹³ http://www.censo2010.org.mx/

6. Satellite Imagery Data

Multispectral, medium-resolution (15 to 60 meters) Landsat 7 ETM+¹⁴ satellite images
have been used for detecting and delimiting floods. The temporal resolution of this
data source is 16 days, so it helps to approximate the flooded area with reasonable
accuracy, at least before and after the flooding happened. The spatial resolution is
high enough to segment broad floods, river overflows or lake leakages. The satellite
imagery allows us to spatially delimit the affected regions with better accuracy than
the vague approximations that could be inferred retroactively from news or historical
documents.

7. Precipitation Data

The Tropical Rainfall Measuring Mission project¹⁵ provides high-resolution
precipitation levels worldwide (3 hours of temporal resolution and 0.25 degrees of
spatial resolution). The spatial resolution of this data is lower than that of the
satellite images used to segment the floods, but high enough to obtain a realistic
precipitation level in the affected area. On the other hand, the temporal resolution
is adequate to generate a time series comparable to the CDR data.

¹⁴ http://earthexplorer.usgs.gov/
¹⁵ http://trmm.gsfc.nasa.gov/

Chapter 5

HUMAN BEHAVIOR DURING POLITICAL MOBILIZATION

In this chapter, we analyze user behavior from Twitter activity during a political
mobilization process, namely the Venezuelan protest #SOSInternetVE (see section 4.2.2).
We characterize users according to their role in the information diffusion process [MLB12].
We build two kinds of networks to represent the phenomenon. First, we construct networks
that represent who receives whose messages, which we identify as the social substratum
on which the information may flow. Second, we build the information diffusion networks,
relating who forwards whose messages, in order to represent the effective channels through
which information actually flowed within the social substratum. Then, based on graph
theory (see chapter 2), we calculate and correlate several measures to understand the
social structure and the dynamical patterns that emerge from the studied conversation.

The organization of this chapter is as follows. In section 5.1 we present the temporal

evolution of the protest activity and in section 5.2 we study the individual user behavior.

Then, from sections 5.3 to 5.6 we discuss the structures formed by the users when they

interact with each other, either passively or actively. Next, in section 5.7, we describe the

underlying user behavior behind such structures. And finally in section 5.8 we show these

structures from the mesoscale point of view.

5.1 Temporal Behavior

We first analyze the temporal evolution of the Twitter activity related to the protest,
measured by the number of messages posted per minute. At the top of Fig. 5.1 we present the
evolution of the message rate for the period December 14th-19th, 2010. This series has a

Figure 5.1: Top: Time evolution of the message rate (messages/minute) of the Venezuelan protest
#SOSInternetVE. Arrows indicate some of the times when the protest convoker participated.
Bottom: Time evolution of the accumulated percentage of messages (dashed line) and
participant users (solid line).

similar shape to the Twitter time series modeled in the study by Yang [YL11] during
critical events. It can be noticed that at the beginning of December 14th, 2010, the
studied hashtag did not even exist on the Twitter servers. Then, after its first appearance
on the same day, some user activity was recorded. Yet it is on December 16th, 2010, when
the protest actually takes place, that the trending topic bursts and reaches its highest
point, showing features of critical phenomena. However, after December 18th, 2010, much of
the interest is lost and the trend decays very fast, as expected for trending topics on
Twitter [AHSW11].

The protest growth can be seen more clearly at the bottom of Fig. 5.1, where we have

plotted the accumulated number of messages (dashed line) and users (solid line) as a function

of time. It is remarkable that the system grew from 22% to 87%, in terms of users, and 12%

to 84%, in terms of messages, in a time frame of 7 hours, which has been highlighted around

the afternoon of December 16th, 2010 in Fig. 5.1, and coincides with the main burst.
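The accumulated curves at the bottom of Fig. 5.1 can be sketched in Python as follows. This is a minimal sketch assuming that each message is available as a (timestamp, user) pair; the function name and the data layout are hypothetical.

```python
def cumulative_growth(messages):
    """Cumulative % of messages and of distinct users after each message.

    `messages` is a list of (timestamp, user) pairs; the layout is assumed.
    """
    messages = sorted(messages)
    total_messages = len(messages)
    total_users = len({user for _, user in messages})
    seen, curve = set(), []
    for count, (timestamp, user) in enumerate(messages, start=1):
        seen.add(user)
        curve.append((timestamp,
                      100.0 * count / total_messages,
                      100.0 * len(seen) / total_users))
    return curve

# Hypothetical conversation: user "a" keeps posting after both users joined.
for row in cumulative_growth([(1, "a"), (2, "b"), (3, "a"), (4, "a")]):
    print(row)
```

In this toy case the user curve reaches 100% while the message curve is still at 50%, since users who have already joined keep posting.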

Furthermore, it can be noticed that the number of users that participate in the protest
saturates faster than the number of messages at all times. This is a typical feature of
local interest conversations [KLPM10], where users post messages repeatedly on the same
topic. For example, by the end of the day on which the protest was convoked, December 14th,
2010, 15% of the users had already participated. However, the messages they had posted did
not even reach 7%

Figure 5.2: Complementary cumulative distribution of the user activity during the Venezuelan
protest #SOSInternetVE. Solid line is the fit to an exponentially truncated power law,
$P(x > x^{*}) \propto x^{-\beta} e^{-x/c}$, where $\beta = 0.880 \pm 0.001$ and
$c = 65.0 \pm 0.6$ on the last day.

of the total amount.

5.2 Individual Behavior

The user activity $A_i$ is defined as the sum of the original and retransmitted messages
sent by each participant i. In Fig. 5.2 we show the evolution of the cumulative
distribution of the number of messages sent (posted) per user, on the different days that
the protest lasted. It can be noticed that the distribution can be fitted to an
exponentially truncated power law of the form $P(x > x^{*}) \propto x^{-\beta} e^{-x/c}$,
where $\beta = 0.880 \pm 0.001$ and $c = 65.0 \pm 0.6$ on the last day. It is remarkable
that there is a clear distinction between the days before and after the main burst (see
Fig. 5.1), which reflects the criticality of the phenomenon. However, within each of the
two stages the users presented the same behavior, in the sense that they are distributed in
the same way on all the days before the protest, and likewise on all the days after it.

This distribution indicates a certain degree of complexity in the phenomenon and
heterogeneity in the user behavior. Before the main burst, 60% of the participants had sent
fewer than a couple of messages, 1% over 30 messages, and about 0.01% had posted over 100

Figure 5.3: In (top) and out (bottom) degree complementary cumulative distributions of the

followers network from the Venezuelan protest #SOSInternetVE.

messages. On the other hand, on the last day of the protest, 50% of the users had still
sent a couple of messages at most, while 1% had sent over 60 messages, and just about
0.0013% had posted over 600 messages. This result shows that the percentage of most active
users decreases rapidly as the system grows.
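The complementary cumulative distribution used in this section can be computed empirically as the fraction of users whose activity exceeds each observed value. The following Python sketch shows one way to do so; the function name and the toy activity values are hypothetical.

```python
from collections import Counter

def ccdf(values):
    """Empirical complementary cumulative distribution P(X > x).

    Returns a sorted list of (x, fraction of values strictly greater than x).
    """
    counts = Counter(values)
    n = len(values)
    remaining = n
    points = []
    for x in sorted(counts):
        remaining -= counts[x]
        points.append((x, remaining / n))
    return points

# Hypothetical activity: number of messages sent by each of eight users.
activity = [1, 1, 1, 2, 2, 5, 9, 40]
print(ccdf(activity))
```

Statements such as "1% of the users sent over 60 messages" correspond to reading the value of this curve at x = 60.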

5.3 Followers Network

Just as users post messages quite differently from one another, these messages also have
different relevance in the development of the conversation. On Twitter, not all users
have the same level of visibility in the message stream, because the number of recipients,
and possible readers, strongly depends on the source’s in degree in the followers network.
This social substratum may be analyzed by constructing a graph with the protest
participants, linking the users according to who follows whom. The result is a directed,
unweighted network composed of 77,706 nodes and 5,761,331 links, displaying the
structure through which information is delivered and might be spread. The edge direction
goes from the follower to the message source; thus information flows in the direction
opposite to the edges. The attention received can be measured by means of the in degree
$k_{in}$. The attention paid is measured by the out degree $k_{out}$, indicating the number
of people whom the

Figure 5.4: Scatter plot of in and out degree of the followers network from the Venezuelan protest

#SOSInternetVE. Dots represent users.

user follows.
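The edge convention described above (from the follower to the message source) can be made concrete with a short Python sketch that computes the in and out degree of every user. The function name and the toy edge list are hypothetical.

```python
from collections import Counter

def degrees(follow_edges):
    """In and out degree of each user in a directed followers network.

    Each edge (follower, source) points from the follower to the account
    being followed, so k_in counts followers and k_out counts followees.
    """
    k_in = Counter(source for _, source in follow_edges)
    k_out = Counter(follower for follower, _ in follow_edges)
    return k_in, k_out

# Hypothetical toy network: three users follow "tv"; "tv" follows "ana".
edges = [("ana", "tv"), ("luis", "tv"), ("eva", "tv"), ("tv", "ana"), ("eva", "ana")]
k_in, k_out = degrees(edges)
print(k_in["tv"], k_out["tv"])
```

With this convention a message posted by "tv" reaches three users, even though "tv" itself receives messages from only one account.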

Both in and out degrees follow power law distributions, as shown in Fig. 5.3. In terms of
the in degree, the distribution indicates that over 50% of the users are followed by fewer
than 15 users, while just 1% of the users have over 1,000 followers and around 0.01% of the
users have over 20,000 followers. For the out degree distribution, we found that over 50%
of the users follow fewer than 40 users, while 1% of the participants follow over 600 users
and 0.01% follow over 9,500 users. This distribution presents an exponent within the
expected range for human actions [New05].

As can be seen in Table 5.1, the mean distance between nodes in this network is
$d_F = 2.2$. This value indicates the presence of the small world effect [WS98]. Previous
studies performed on the Twitter global follower graph report a mean distance between users
of 4.12 [KLPM10]. This fact is related to the presence of users that act like hubs,
concentrating a large quantity of incoming and outgoing links. However, our result is
lower than the previously reported value, due to the special characteristics of the event
and its participants. For example, the protest convoker, which is a TV station, is followed
by over 52% of the participants, linking half of the total population.
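For reference, a mean distance of this kind can be computed with a breadth-first search from every node. The Python sketch below follows edge direction and averages over reachable ordered pairs only; whether the thesis used this convention or treated the graph as undirected is not stated, so this choice, like the function name, is an assumption.

```python
from collections import deque

def mean_distance(edges):
    """Mean shortest-path length over all reachable ordered node pairs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set())
    total, pairs = 0, 0
    for start in neighbors:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")

# Hypothetical directed cycle of three nodes.
print(mean_distance([(0, 1), (1, 2), (2, 0)]))
```

The exact computation costs one search per node, so for networks with millions of links the average is often estimated from a random sample of starting nodes instead.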

Based on the degrees correlation shown in Fig. 5.4, we found that user profiles are highly

heterogeneous and that the network is very asymmetrical. It is remarkable that there are

some users, corresponding to the scattered points located below the dotted diagonal, that are

Network      Nodes     Edges        Mean distance   Density                   Degree assortativity
Followers    77,706    5,761,331    2.22            $1.42 \times 10^{-3}$     -0.10
Retweets     54,423    231,485      3.40            $1.25 \times 10^{-4}$     -0.15

Table 5.1: Followers and retweet network properties from the Venezuelan protest #SOSInternetVE.

widely followed but do not follow many people. At the same time we found other users, who
are more reciprocal and stay near the dotted diagonal, especially for $k_{in} > 1,000$
followers, where practically no users are found above the diagonal. Finally, there are some
users, corresponding to the region densely populated above the diagonal, who follow more
people than follow them. These users represent the majority of the participants.

5.4 Retweets Network

The second network is built according to who retransmits whose messages. It is a network
that emerges from the users’ interactions. The nodes are users that retransmitted messages
to their own followers, as well as users whose messages were retransmitted. This network
indicates the effective links through which the information actually flows inside the
active social substratum. In principle it might seem to be a subgraph of the followers
network, but it is not, since on Twitter people are able to retransmit any message, even if
they hold no relation of any type with the source user. The resulting network is also a
directed graph, where edges are weighted according to the number of times a user retweeted
the source user. In total, by December 19th, 2010, the graph is composed of 54,423 nodes
and 231,485 links. The difference with respect to the number of nodes in the followers
graph shows that 30% of the users behaved much more passively than the others. Furthermore,
we found that 75% of the participants were not retweeted at all.

In the retransmissions network, we have analyzed the strength of each user. The in strength
represents the number of times a user has been retweeted. Its distribution follows a power
law, as shown at the top of Fig. 5.5. Such a distribution indicates the presence of highly
connected hubs, which explains why the mean distance between nodes is $d_R = 3.4$, which is
also a very low value. On the other hand, the out strength gives the number of times a
single user has retransmitted. Its distribution is better fitted by an exponentially
truncated power law, as shown at the bottom of Fig. 5.5. The truncation value, near 500, is
related to the limitation for human actions as stated in the Dunbar number

Figure 5.5: In (top) and out (bottom) strength complementary cumulative distributions of the
retransmission network of the Venezuelan protest #SOSInternetVE. Solid line is the fit to an
exponentially truncated power law
$P(S_{out} > S^{*}_{out}) \propto S_{out}^{-\beta} e^{-S_{out}/c}$, where
$\beta = 0.890 \pm 0.002$ and $c = 61.0 \pm 1.2$.

theory [GPV11]. This theory states that people are only able to maintain stable
relationships with fewer than 200 people. The reason why we found a higher value lies in
the fact that a retweet does not strictly imply a mutual relation between people. In fact,
it is an individual choice that has a very low cost in money, time and personal energy,
which makes it easy to happen.

The difference between the in and out strength distributions is related to the way we have
designed the network. While the out strength is due to one person’s activity, the in
strength is due to the aggregation of several individual efforts. Such aggregation is
responsible for the emergence of extreme cases and a higher complexity level in the final
distribution. From the in strength distribution, shown at the top of Fig. 5.5, it can be
noticed that over 60% of the users that participated in the retransmission process gained
fewer than 3 retransmissions, while 1% gained more than 150 retransmissions, and only 0.01%
gained over 5,000 retransmissions. Analogously, for the out strength distribution, we found
that over 60% of the users who retransmitted messages did so for fewer than 3 messages,
while 1% of them retweeted over 60 messages, and less than 0.01% retransmitted more than
300 messages.
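The construction of the weighted retweet network and of the in and out strengths can be sketched in Python as follows, assuming that each retweet is available as a (retweeter, source) pair. The function name and the sample events are hypothetical.

```python
from collections import Counter

def retweet_network(retweets):
    """Weighted directed edges and node strengths from retweet events.

    Each event is (retweeter, source). The edge weight is the number of
    times the retweeter retransmitted the source.
    """
    weights = Counter(retweets)
    s_in, s_out = Counter(), Counter()
    for (retweeter, source), w in weights.items():
        s_out[retweeter] += w   # times this user retransmitted others
        s_in[source] += w       # times this user was retransmitted
    return weights, s_in, s_out

# Hypothetical retweet events.
events = [("a", "tv"), ("a", "tv"), ("b", "tv"), ("b", "a")]
weights, s_in, s_out = retweet_network(events)
print(weights[("a", "tv")], s_in["tv"], s_out["b"])
```

The out strength of a user depends only on that user's own activity, while the in strength adds up the choices of many different users, which is the asymmetry discussed in the text.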

We also calculated the edge’s weight distribution and found that it follows a power law

Figure 5.6: Edge’s weight complementary cumulative distribution of the retransmission network

from the Venezuelan protest #SOSInternetVE.

as shown in Fig. 5.6. The edge weight represents the number of times that a single user
retweeted another user. The figure shows that only 10% of the edges have a weight higher
than 2. However, we found that nearly 0.001% of the edges have a weight higher than 80.
This indicates that the majority of users retweeted any given user only a couple of
times, yet a small fraction of them maintained a closer tie with other users, in the sense that
they retweeted their messages close to 100 times. On the other hand, the retransmission
network also presents the same asymmetries found in the followers network. For example,
the 10 most retransmitted users received more than 20% of all retransmissions while writing less
than 0.4% of all messages.
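The edge weights and node strengths discussed above can be illustrated with a minimal sketch. It assumes that each retweet is available as a (retweeter, retweeted user) pair and that edges point from the retweeter to the retweeted user, so that the in strength counts retweets gained. The function name and the toy events are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def retweet_network(retweets):
    """Aggregate (retweeter, retweeted) events into edge weights and node strengths."""
    # Weight of the edge retweeter -> retweeted: number of times the pair occurs.
    weights = Counter(retweets)
    s_in, s_out = Counter(), Counter()
    for (src, dst), w in weights.items():
        s_out[src] += w  # retweets made by src (out strength)
        s_in[dst] += w   # retweets gained by dst (in strength)
    return weights, s_in, s_out

# Hypothetical toy events, not taken from the dataset.
events = [("ana", "news"), ("ana", "news"), ("bob", "news"), ("bob", "ana")]
weights, s_in, s_out = retweet_network(events)
```

In this toy example the in strength of one node aggregates the activity of several retweeters, whereas each out strength depends on a single user's activity.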

It is remarkable that the retransmission network is much less dense than the followers
network, as reported in Table 5.1. This indicates that inside the web of contacts there is a
finer structure through which the information actually travels. The reason for this result is that
retransmitting implies an active behavior, in contrast to the passivity of the following relation.
This shows that users are more selective when it comes to taking action.

Figure 5.7: Visualization of the retweet network that emerges from message propagation on the
followers network. (A) Subgraph of the retweet network (green) superimposed on the corresponding
followers network (black), from the #SOSInternetVE dataset. A subset of 1000 random
nodes (yellow and red) is presented. The node size is proportional to the respective in degree
in the followers network. (B, C and D) Example of the formation of the retweet network from
independent retweet cascades on an artificial followers network. (B) Two users (red
nodes) post independent messages, which are received by their followers (gray). (C)
Some users retweet the message (yellow) and the message reaches their followers (gray). (D)
Final shape of the cascades on the network, composed only of the activated nodes (red
and yellow) connected by the green links. The white nodes and gray links represent the rest of the
substratum (followers network), which did not activate. (E) Schema of a single cascade.
The black circles mark the cascade layers.

5.5 Degree Assortativity

In order to unveil how such heterogeneous users interacted with each other, we calculated
the degree assortativity coefficient [New03a, FFGP10] for the followers network (rF) and
the retweets network (rR). Both turned out to be disassortative: rF = −0.10 and
rR = −0.15 (see Table 5.1), which reveals the asymmetric shape of these networks. The
hubs that concentrate much of the incoming links are often targeted by regular users, who
do not receive much of the collective attention. Although social networks have been reported
to be assortative [New03a], this pattern changes in the online world, where disassortativity
is usually found [HW09]. The reason is that in the online world regular
people are able to relate to and communicate with popular accounts, either by following them
or by retweeting their messages in the case of Twitter.
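To illustrate what a negative coefficient means, the following sketch computes a degree assortativity as the Pearson correlation between the degrees found at the two ends of each edge. It is a simplified, undirected version written for illustration only. The text does not state which in or out degrees were combined for the directed networks, so the function below is an assumption and not the exact procedure used here.

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge.

    Simplified undirected sketch; it is undefined (division by zero) when all
    nodes have the same degree.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each undirected edge contributes both orientations, so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys contain the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var

# A star is the extreme disassortative case: one hub linked only to low degree nodes.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
r_star = degree_assortativity(star)
```

A hub surrounded by poorly connected nodes pushes the coefficient towards its negative extreme, which is the qualitative pattern described above for popular accounts targeted by regular users.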

5.6 Retweet Cascades

The retweet network can also be seen as the aggregation of independent retweet cascades,
each of which occurs when a single message is retransmitted by a user to their followers,
allowing them, and in turn their own followers, to do the same. An example of the resulting structure
is shown in Fig. 5.7 A, where a subset of the retweet network (green edges) has been plotted,
superimposed on the respective subgraph of the followers network (gray edges). The red nodes
represent those who posted an original message and the yellow nodes represent the message
propagators (those who retweeted). It can be seen that the retweet network is a
subset of the followers graph, where messages are actually being propagated. This graph
shows that people are more selective when actively interacting with their declared contacts
than when just receiving updates from them [HRW09].

In order to explain the dynamical process behind these cascades, a scheme of the evolution
of two cascades on an artificial followers network is sketched in panels B to D of
Fig. 5.7. In panel B two independent messages are posted by the red nodes
and received by their followers (gray nodes). Some of these followers retransmitted the messages
(yellow nodes) through the green edges, and others did not (white nodes), as shown in
panel C. Likewise, in panel D some of the followers of followers retransmitted the message
(also yellow nodes), and the final shape of the cascades can be appreciated. To summarize
schematically, a single retweet cascade from the dataset is presented in Fig. 5.7 E. The white
nodes do not belong to the cascade, as we only consider those who actively participated in
the retransmission process. Using this schema, some of the main cascade properties will be
explained in the remainder of this section, such as the number of retransmissions gained by each user,
as well as the cascade size, depth and rate of retransmission.

The first property we analyzed is the number of retweets gained by user, Ri, which
may also be considered as the in strength of node i in the retweet network. This quantity
may increase both from cascades originally seeded by i and from cascades where i acted
as a propagator. For example, for the cascade shown in Fig. 5.7 E, Ri would take the
following values: R0 = 15, which is the total number of users who retweeted the message
originally posted by node 0, either directly (nodes 1 to 11) or indirectly (nodes 12
to 15). Likewise, R8 = 2, since node 8 has been retweeted by nodes 14 and 15;
R1 = R4 = 1, since nodes 1 and 4 have been retweeted by nodes 12 and 13 respectively; and
finally R2 = R3 = R5 = R6 = R7 = R9 = R10 = R11 = 0, as no one retweeted them.
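The worked example above can be reproduced with a short sketch. It assumes the cascade is stored as a mapping from each node to the node it retweeted (None for the author), and that every retweet is credited to all users upstream of it. The second assumption generalizes the value R0 = 15 given in the text and may not be the exact bookkeeping used for the dataset.

```python
def retweets_gained(parent):
    """Count, for every node, how many nodes retweeted it directly or indirectly."""
    gained = {node: 0 for node in parent}
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:  # climb towards the author
            gained[ancestor] += 1
            ancestor = parent[ancestor]
    return gained

# Cascade of Fig. 5.7 E as described in the text:
# parent[node] is the user whose message was retweeted by node.
parent = {0: None}
parent.update({n: 0 for n in range(1, 12)})   # nodes 1 to 11 retweet the author
parent.update({12: 1, 13: 4, 14: 8, 15: 8})   # nodes 12 to 15 retweet a propagator
R = retweets_gained(parent)
```

With this representation the values quoted in the text are recovered: 15 for node 0, 2 for node 8, 1 for nodes 1 and 4, and 0 for the remaining first layer nodes.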

Another property analyzed is the cascade size, defined as the total number of
nodes that have been activated in a given cascade. In the example shown
in Fig. 5.7 E the resulting cascade size is 16, as we have 1 author (node 0) plus
15 propagators (nodes 1 to 15). In the studied conversation, this property is distributed
following a power law, as presented in Fig. 5.8 A. This indicates that most of the
cascades are extremely small, as more than half of them (60%) comprise at most
2 people besides the author, and just a small fraction are large, since around 5% of them
have more than 10 users, and 0.03% have more than 100 participants.

In order to understand the structure of the cascades, we have divided them into layers, as shown
with the black circles in Fig. 5.7 E. The cascade layer indicates the number of hops from
a propagator node to the source node through the cascade links. The users in
layer l = n are those who retransmitted the message coming from a user of the
previous layer l = n − 1. In Fig. 5.7 E, the message author (red node) stands alone in
layer l = 0, while in the subsequent layers we find the nodes who retweeted the message,
namely nodes 1 to 11 in layer l = 1 and nodes 12 to 15 in layer l = 2.
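As an illustration of the layer assignment and of the cascade size, the sketch below counts hops to the author for the cascade of Fig. 5.7 E. The representation of the cascade as a node to parent mapping is an assumption made for this example.

```python
from collections import Counter

def cascade_layers(parent):
    """Return {node: layer}, where layer is the number of hops to the author."""
    layer = {}

    def hops(node):
        if node not in layer:
            layer[node] = 0 if parent[node] is None else hops(parent[node]) + 1
        return layer[node]

    for node in parent:
        hops(node)
    return layer

# parent[node] is the user whose message was retweeted by node (None for the author).
parent = {0: None, **{n: 0 for n in range(1, 12)}, 12: 1, 13: 4, 14: 8, 15: 8}
layers = cascade_layers(parent)
size = len(parent)                          # author plus propagators
users_per_layer = Counter(layers.values())  # number of users found in each layer
```

For this cascade the size is 16, with 1 user in layer 0, 11 users in layer 1 and 4 users in layer 2, in agreement with the description above.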

The cascade depth d corresponds to the farthest layer from the message source in which
a node has been activated. In the example shown in Fig. 5.7 E, it takes the value
d = 2. For the analyzed conversation, the probability that a cascade has a certain depth,
P(d), is presented in Fig. 5.8 B. Cascades of depth d = 0 represent original messages
that were not retweeted by anyone, which account for close to 80% of them. In turn,
only 17% of the cascades have just one layer of retransmission (d = 1), and this quantity
decreases exponentially as we move farther from the message's source, reaching a maximum
depth of d = 6 layers with a very low likelihood (∼ 10^−5). This indicates that the retweet
cascades found in this conversation are quite shallow, which might seem counterintuitive, as

Figure 5.8: Statistical properties of the retweet cascades. (A) Complementary cumulative distribution
function (CCDF) of the number of users per cascade. (B) Cascade depth distribution P(d). (C)
Retransmission rate by layer λl, in terms of retweets over followers. The data correspond to the
#SOSInternetVE dataset.

Topic            rF,A    rF,R    rR,A
SOSInternetVE    0.07    0.57    0.17

Table 5.2: Pearson correlation (r) by user between the number of followers (F), retweets (R) and activity
(A).

we would expect retransmissions to increase with the message's visibility, which should
increase with each retransmission. However, shallow cascades have been detected on Twitter
in works on influence dynamics [BHMW11] and on the prediction of URL propagation [GAC+10], as well as
in other media, like the flow of emails inside a corporation [WWT+11]. It has
been shown that information tends to lose its capacity to attract attention as we move
farther from the author's social surroundings, and hence the probability that a cascade grows
decreases with the distance from the source node [WHAT04].
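The depth of a cascade and the empirical depth distribution P(d) can be sketched as follows. The node to parent representation and the toy collection of cascades are assumptions made for illustration and do not reproduce the dataset.

```python
from collections import Counter

def cascade_depth(parent):
    """Depth d: largest number of hops from any activated node to the author."""
    depth = 0
    for node in parent:
        hops, current = 0, node
        while parent[current] is not None:
            current = parent[current]
            hops += 1
        depth = max(depth, hops)
    return depth

def depth_distribution(cascades):
    """Empirical P(d): fraction of cascades that reach each depth."""
    counts = Counter(cascade_depth(c) for c in cascades)
    return {d: n / len(cascades) for d, n in sorted(counts.items())}

# Hypothetical toy collection: three messages that were never retweeted,
# one cascade of depth 1 and one cascade of depth 2.
cascades = [
    {0: None}, {0: None}, {0: None},
    {0: None, 1: 0},
    {0: None, 1: 0, 2: 1},
]
P = depth_distribution(cascades)
```

Messages that nobody retweeted enter the distribution as cascades of depth 0, which is why that value dominates P(d).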

Finally, the rate of retransmission at each layer, λl, is estimated by averaging the ratio
between the number of users who retransmitted a message and the number of individuals
who received it at each layer, taking into account the followers network information.
The results, shown in Fig. 5.8 C, indicate that λl ∼ 0.01 for l > 1, while in the first
layer the average retransmission ratio reaches 5% (λ1 ∼ 0.05) of the exposed users.
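A minimal sketch of this estimate is given below. It assumes that, for every cascade and layer, the number of users who retweeted and the number of followers who were exposed have already been counted from the followers network. The input format and the numbers are hypothetical.

```python
from collections import defaultdict

def retransmission_rate(cascades):
    """Average over cascades, per layer, of retweeters divided by exposed users."""
    ratios = defaultdict(list)
    for cascade in cascades:
        for layer, (retweeters, exposed) in cascade.items():
            if exposed > 0:  # layers nobody was exposed to carry no information
                ratios[layer].append(retweeters / exposed)
    return {layer: sum(r) / len(r) for layer, r in sorted(ratios.items())}

# Hypothetical counts per cascade: {layer: (retweeters, exposed followers)}.
cascades = [
    {1: (5, 100), 2: (1, 200)},
    {1: (3, 100)},
]
rates = retransmission_rate(cascades)
```

Averaging ratios per cascade, as done here, weights every cascade equally. Pooling all retweeters and all exposed users before dividing would be a different estimator.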

5.7 Analysis of User Behavior

In this section, we discuss the way users behaved in the #SOSInternetVE dataset, based
on their activity and role in the followers and retweets networks. In Appendix A, we
present similar results obtained from analyzing the #20N and ETA datasets (see section
4.2.2).

In Table 5.2, the Pearson coefficients between the users' number of followers F (measured
as kin in the followers network), retweets gained R and activity A are presented.
It can be seen that there is practically no correlation between the number of followers and the
activity (rF,A = 0.07), which means that the number of messages posted is independent of
the user's position in the followers network. However, there is a strong correlation between the
number of followers and the retransmissions gained (rF,R = 0.57), which means that the most
retransmitted users tend to be the most followed ones as well. Besides, there is a positive
correlation between the number of retransmissions and the activity (rR,A = 0.17),
which indicates that the chances of being retransmitted increase with every message posted

Figure 5.9: Analysis of the user behavior. (A) Scatter plot of the retransmissions obtained by each user
versus their activity, colored by their number of followers. (B) Scatter plot of the retransmissions
obtained by each user versus their number of followers, colored by their activity. (C) Scatter plot of the
retransmissions obtained by each user versus the ratio between their numbers of followers and followees,
colored by their activity. (D) Scatter plot of the retransmissions made by each user versus their number of
followers, colored by their activity. Dots represent users. Data correspond to the #SOSInternetVE
dataset.

for all users.
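For reference, the Pearson coefficient used in Table 5.2 can be computed per user as in the sketch below. The three lists are hypothetical values chosen only to show the call. They are not taken from the dataset and do not reproduce the coefficients of the table.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-user values: followers (F), retweets gained (R), activity (A).
followers = [10, 200, 3000, 50, 800]
retweets = [0, 5, 120, 1, 30]
activity = [40, 3, 5, 12, 7]
r_FR = pearson(followers, retweets)
r_FA = pearson(followers, activity)
```

Since the quantities involved are heavy-tailed, a few very popular accounts can dominate a Pearson coefficient, which is worth keeping in mind when reading the values of the table.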

In Fig. 5.9 A, we present a scatter plot of the retweets gained by each user as a function of
their activity, colored by the user's Kin in the followers network. It is important to clarify
that the users that appear in this plot were retransmitted at least once. These users
represent 25% of the participants, as stated in section 5.4. It can be clearly seen that the
most retransmitted users are also the most followed ones (red dots), independently of their
activity. In fact, if a popular account increases its activity, its retransmission level grows
nonlinearly, as for the most retransmitted user, who gained more than 10,000 retransmissions.
However, some less followed users (green or yellow dots) may also gain a significant number
of retransmissions, but by means of a considerable increase in their own activity. These
users are located around the straight line of slope 1, and their retransmissions gained are
proportional to their activity. Finally, some less followed users (blue dots in Fig. 5.9
below the dashed line), who are the vast majority of the population, needed to post an enormous
number of messages to gain, if any, a few retransmissions at most.

In Fig. 5.9 B, we present a scatter plot of the retweets gained by each user as a function of
their number of followers, colored by their activity. It can be seen that the most active
users (red dots) do not have the largest number of followers. However, these active users
gain as many retweets as the popular users (blue dots next to red dots), who have the largest
number of followers but send far fewer messages.

In Fig. 5.9 C, we present the in strength, Sin, of the retransmission network as a function
of the ratio between the in and out degrees, Kin/Kout, of the followers network. The users
are represented by points colored by their activity, that is, the number of messages posted. This
representation lets us separate the popular accounts, where Kin/Kout > 1, from the non-popular
accounts, where Kin/Kout < 1, and the reciprocal users, where Kin/Kout ∼ 1. It can
be seen that the popular accounts may get a high number of retransmissions while having
low activity. Meanwhile, the reciprocal users can get the same number of retransmissions
as the popular accounts, but they must employ much more activity.
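The separation by Kin/Kout can be written as a small rule. The text only states Kin/Kout > 1, < 1 and ∼ 1, so the width of the band around 1 (the `tolerance` parameter) and the handling of users who follow nobody are assumptions introduced for this sketch.

```python
def account_type(k_in, k_out, tolerance=0.5):
    """Label an account by the ratio of followers (k_in) to followees (k_out)."""
    if k_out == 0:
        # The ratio is undefined; a followed user who follows nobody is treated as popular.
        return "popular" if k_in > 0 else "undefined"
    ratio = k_in / k_out
    if abs(ratio - 1.0) <= tolerance:  # hypothetical band for "ratio close to 1"
        return "reciprocal"
    return "popular" if ratio > 1.0 else "non-popular"

labels = [account_type(10000, 10), account_type(100, 110), account_type(5, 400)]
```

Because the ratio spans several orders of magnitude, a band defined on the logarithm of the ratio would be an equally reasonable choice.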

In Fig. 5.9 D, we present a scatter plot of the retweets made by each user as a function of their
number of followers, colored by their activity. It can be seen that the most active users
(red dots) are those who retweet the most and do not have the largest number of followers.
Again, we see that those with the largest number of followers are less active and make a few
retweets at most.

These results let us classify users into three categories: information producers, active
consumers and passive consumers. The information producers are the widely followed users
who gain an enormous number of retransmissions even though they have low activity. These

Figure 5.10: Community structure of the followers graph. Circles represent communities of
users and their size is proportional to the number of users that belong to the community. Edges
represent the inter-community links, either followers (left) or retransmissions (right), and their
width is proportional to the number of links, normalized by the size of the outgoing community.
The data correspond to the #SOSInternetVE dataset.

users do not tend to follow a lot of people, nor retransmit many messages. We found that
these accounts belong to traditional mass media agents like TV, journalists, politicians and
celebrities. On the other hand, active consumers are users with high reciprocity in their relations.
Their audience and retransmission rate tend to grow in proportion to the activity
they employ. They are key in the information diffusion process, because they boost the content
and serve as the propagators of the information producers. Finally, passive consumers are
the largest group of users, who practically do not participate in the propagation process.
They consume more information than they produce. They are characterized by having a
low activity rate, not retransmitting many messages and receiving messages from many more
people than they have in their audiences.

5.8 Mesoscale Communities

In order to gain more insight into the structure and behavior of the Twitter users during
the protest, we have calculated the mesoscale structure of both networks. In this section
we describe the communities detected in our networks with the algorithm described
in [BGLL08]. We chose this algorithm, which is based on modularity optimization, due to its

Figure 5.11: Community structure of the retransmission graph. Nodes represent communities
and edges represent the inter-community links. The node sizes are proportional to the number
of people that make up the community and the edge widths are proportional to the number
of inter-community links normalized by the size of the community. The data correspond to the
#SOSInternetVE dataset.

Community Collective

0 Comedy accounts

1 Show business celebrities

2 Opposition media

3 Opposition politicians

4 International media

5 Government favorable politicians and media

Table 5.3: Main collectives around which each follower community is formed from the Venezuelan

protest #SOSInternetVE.

capacity to reveal mesoscale structure in large graphs with good computing performance.
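To make the quantity being optimized concrete, the sketch below evaluates the modularity Q of a given partition for an undirected, unweighted graph. It is not the optimization algorithm of [BGLL08] itself, only the score that a modularity-based method tries to maximize. The toy graph and partitions are hypothetical.

```python
from collections import Counter

def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = Counter()    # edges with both ends in the same community
    degree_sum = Counter()  # total degree of the nodes of each community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (fraction of internal edges - expected fraction).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles yields a higher Q than keeping all nodes in a single group, which is the kind of comparison a modularity optimization performs repeatedly on much larger graphs.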

On the follower graph, we found six main communities that grouped over 98% of the
population. We identified the most followed users in each community in order to understand
why people grouped together. We found that these structures are formed
by users around central accounts that belong to similar collectives. Specifically, we found
communities around opposition media and journalists, opposition politicians, entertainment
celebrities, international media, comedy accounts and government favorable politicians, as
described in Table 5.3.

This structure is shown on the left side of Fig. 5.10. Each node represents a different
community and its size is proportional to the number of users that compose it. We
found that the largest communities are formed around the comedy accounts (0), celebrities
(1), opposition media and journalists (2) and opposition politicians (3). Meanwhile, the
smallest ones are formed around international media (4) and government favorable users
(5). The edges shown in Fig. 5.10 represent the inter-community links, and indicate the
existence of users who follow or are followed by users from other communities. The edge
width is proportional to the number of individual inter-community links, normalized by the
size of the outgoing community. As pointed out in section 5.1, messages go from
the source to its followers. Thus the information flows in the opposite direction to the edges.
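The normalized inter-community weights behind the edge widths can be sketched as follows. It assumes that an edge (u, v) means that u follows v and that "outgoing community" refers to the community of u. Both the data layout and the toy values are hypothetical.

```python
from collections import Counter

def inter_community_weights(edges, community):
    """Links from community a to community b, divided by the number of users in a."""
    size = Counter(community.values())
    links = Counter()
    for u, v in edges:
        a, b = community[u], community[v]
        if a != b:  # links inside a community are not drawn
            links[(a, b)] += 1
    return {pair: n / size[pair[0]] for pair, n in links.items()}

# Hypothetical toy data: an edge (u, v) means that u follows v.
community = {"u1": 0, "u2": 0, "u3": 0, "u4": 0, "v1": 1, "v2": 1}
edges = [("u1", "v1"), ("u2", "v1"), ("v2", "u1"), ("u3", "u4")]
w = inter_community_weights(edges, community)
```

The normalization makes communities of very different sizes comparable: two links leaving a community of four users weigh the same as one link leaving a community of two.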

It can be seen that there is a close relation between the communities formed around the
opposition media, opposition politicians, celebrities and comedy accounts. These collectives
seem to have dominated the protest, especially the opposition media community (group 2 in
Fig. 5.10), which concentrates the largest number of users and incoming links. Therefore its
messages are widely received throughout the network. Group 3 is certainly smaller than
other groups, which is remarkable because it concentrates many of the opposition
politicians and the event consisted of an opposition political protest. However, this group
is strongly related to other communities and, even though it presents a large number of
outgoing links, its messages are also widely spread. In contrast, group 5, which represents
the government favorable accounts, seems to follow a lot of outside users, yet only a small
fraction of the participants seem to follow them. This means that most of their messages
remain inside their community and are hardly read by the rest of the network.

Nevertheless, we found the same user behavior in all communities, in the sense that all
of them are formed around popular accounts that belong to traditional mass media agents,
no matter whether they are opposition or government favorable, or even non-Venezuelan users like
group 4. On the right side of Fig. 5.10 we also show how the followers communities retransmit
messages from other communities. This behavior has been pointed out by Grabowicz et al.
[GRM+12], who demonstrated that retweets transcend the friend communities and serve as
bridges for messages to spread throughout the network.

On the other hand, we also calculated the community structure of the retransmission
graph. In this case we found 34 main communities containing more than 96% of the population.
This network showed a completely different mesoscale structure, as can be observed in
Fig. 5.11. However, a user behavior similar to that of the follower network has been detected. Each
of these communities contains at its core at least one popular account, like the information
producers described above, which is highly followed and retransmitted. In Table 5.4 we
present some information about these popular accounts, like their nicknames and professions.
Once again we found communities formed around traditional mass media agents, such as
TV stations, newspapers, journalists and politicians, as well as humorists, civilian activists,
student leaders, community managers and micro-bloggers. This result indicates that people
behave more selectively when retransmitting messages than when just receiving them.

As before, the nodes represent the communities detected and their sizes are proportional
to the number of users in each of them. The edges represent the inter-community
links, and their width is proportional to the number of inter-community links found, normalized
by the size of the outgoing community. These links exist because some users
retransmitted messages from another community. It can be seen that communities are also
asymmetrical with respect to inter-community retransmissions, and also present different
profiles. For example, community number four, which is formed around the Venezuelan
micro-blogger @cualrevolucion and the Cuban blogger @yoanisanchez, is highly retransmitted
by all other communities, while it hardly retransmitted other communities.

Community   Popular account                  Collective
0           @nelsonbocaranda                 Journalist
1           @rctv contigo                    TV Station
2           @elnacionalweb                   Newspaper
3           @indiferencia                    Community manager
4           @cualrevolucion, @yoanisanchez   Micro-bloggers
5           @ucabistas                       Student leaders
6           @erikadlv                        Journalist
7           @vvperiodistas                   TV Station
8           @kikobautista                    Journalist
9           @edoilustrado                    Political comics
10          @globovision                     TV Station
11          @palabrasdecersar                Humorist micro-blogger
12          @rmh1947                         Government favorable activist
13          @leopoldolopez                   Politician
14          @carlossicilia                   Humorist
15          @alberto ravell                  Journalist
16          @gabycastellanos                 Community manager
17          @ecualink                        Ecuadorian magazine
18          @leonardo padron                 Writer
19          @EUTrafico                       Newspaper
20          @2010misterchip                  Sports journalist

Table 5.4: Most retransmitted account in each retransmission community from the Venezuelan
protest #SOSInternetVE.

5.9 Summary

In summary, we studied the Venezuelan protest #SOSInternetVE, which took place exclusively
on Twitter. We have analyzed the structure and behavior of the participating users
based on their information exchange interactions. For this we have constructed the followers
network to represent the social substratum, where information may flow, and the retweets
graph, where messages actually travel. Most of the degree distributions of both networks
follow power laws and the mean distances between nodes turned out to be very small. Then,
based on the network structure, we identified three types of user behavior that determine
the dynamics of the information flow: information producers, active consumers and passive
consumers. We found some users that cause a lot of activity inside the network by posting a
small number of messages, while others must post many messages in order to get retransmitted.
We also found a large fraction of very passive users who neither retransmit nor
get retransmitted at all. We also carried out a community analysis to describe the mesoscale
structure of the networks. We found that people are organized around different collectives.
The most central users of each of these collectives are very popular and usually
they also generate smaller retransmission communities that emerge from the propagation dynamics.
This shows that people are more selective when it comes to taking an active part in the
conversation. We noticed that although online social media seem to be a purely social
phenomenon, traditional media agents still enjoy a lot of power and influence over people,
whom they use to boost and enhance their messages.

Chapter 6

EFFICIENCY OF HUMAN ACTIVITY AS A MEASURE OF INFLUENCE

In this chapter we address the following question: what can Twitter users do to increase
their influence? We explore two avenues for this: topology and activity. We introduce a
new index to measure the influence of users on Twitter, called user efficiency [MBLB14]. It
is based on the ratio between the emergent spreading process and the activity employed by
the user, quantified as the number of retransmissions gained by the user per message posted.
We study this property by means of a quantitative analysis of the structural and dynamical
patterns that emerge from human interactions during six conversations on Twitter. We found
a universal behavior in the relation between the individual efforts, managed by the user,
and the collective reaction to such efforts, which is an emergent property of the underlying
network. In general, this universality indicates that influence can be increased by means of
activity, but in a very expensive and inefficient way. We propose a model to explain
the user efficiency based on biased independent cascades on networks. We study this model
to understand the effects of different factors, like the topology of the underlying network
and the user activity distribution, on the resulting distributions of efficiency. We found that
the emergence of a select group of highly efficient users depends on the heterogeneity of the
underlying network, rather than on the individual behavior.

The present chapter is organized as follows. First we introduce our measure of user

efficiency in section 6.1. Then in section 6.2 we show the universal behavior of such measure

across different datasets. Next we introduce a computational model to explain the obtained

distributions in section 6.3. Finally, we apply the model to the datasets and explore the

Figure 6.1: Scatter plot of the user in degree vs out degree in the followers network, colored by the

respective user efficiency. Dots represent users. Data correspond to the #SOSInternetVE dataset.

effects of the activity and underlying graph properties in section 6.4.

6.1 User Efficiency

The fact that not all the participants must employ the same amount of effort to reach
the same level of retransmissions implies that users have an individual efficiency in getting
their messages spread by others. We define user efficiency, η, as the ratio of the
collective response to the individual effort [MBLB14]. It is a metric of influence in the
network, quantified as the number of retransmissions gained by a user per message
posted, according to the following expression:

\eta_i = \frac{R_i}{A_i} \qquad (6.1)

where Ri is the number of retweets gained by user i, and Ai is the amount of messages

posted or retweeted by the user i. The users whose η > 1 get more retweets than the

Figure 6.2: User efficiency probability density function (A) and complementary cumulative distribution
function (B). The red dots correspond to the empirical results, the black solid line represents the
lognormal fit and the black dashed line represents a power law fit. Quantile-Quantile plot (C) of the
user efficiency distribution (empirical quantiles versus lognormal quantiles), filtered by the in degree
in the followers network, KFin, with thresholds KFin < 10, 100, 1000, 10000 and 100000. The distributions
correspond to the #SOSInternetVE dataset.

number of messages posted and therefore are more efficient at spreading their information in
the network. Consequently, these users gain more influence than those with
η < 1, who had to employ larger efforts to obtain similar outcomes.
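The definition of Eq. 6.1 translates directly into code. The sketch below assumes two per-user counters, retweets gained (R) and messages posted or retweeted (A). The user names and values are hypothetical.

```python
def user_efficiency(retweets_gained, activity):
    """eta_i = R_i / A_i for every user who posted at least one message."""
    return {user: retweets_gained.get(user, 0) / a
            for user, a in activity.items() if a > 0}

# Hypothetical counts, not taken from the dataset.
R = {"media": 500, "fan": 20, "lurker": 0}
A = {"media": 5, "fan": 40, "lurker": 2}
eta = user_efficiency(R, A)
efficient = sorted(user for user, e in eta.items() if e > 1)
```

In this toy example only the account that gains more retweets than the messages it posts falls in the η > 1 group. The account with 40 messages and 20 retweets has η = 0.5, that is, it needed two messages per retransmission gained.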

In Fig. 6.1, we present a scatter plot of the users' degrees in the followers network, kin and
kout, colored by their efficiency η, from the #SOSInternetVE dataset. It may be seen
that the users who present an efficiency η > 1 (green, yellow, orange and red dots) are mostly
located below the dashed line of slope one, which means that their audiences (kin) are larger
than their sources of information (kout), which implies a certain level of popularity in the
network. This holds especially for those with η >> 1 (orange and red dots), who may be followed by more
than 10^4 users while following fewer than 10 users. Meanwhile, the users who present a
low efficiency (blue dots) tend to receive messages from many more sources than the size of
their audiences (kout > kin), and also have fewer followers. This means that
these users listen to more information from the network than they are listened to.

However, the mean efficiency value seems to be close to 1 (Ri ∼ Ai), as shown in the user
efficiency η distribution presented in Fig. 6.2 A, which means that on average most of the
users who got retweeted gained as many retransmissions as the number of messages they posted.
Besides, the users with η >> 1 represent a minority of the population, as clearly
shown in the complementary cumulative distribution of η in Fig. 6.2 B. It can be seen that
less than 2% of the retweeted population gained more than 10 retransmissions per message
sent (dashed line in Fig. 6.2 B), 0.2% gained over 100 retransmissions per message sent
(dotted line in Fig. 6.2 B) and just one user gained over 1000 retransmissions with a single
post.

In order to further understand the η distribution, we have superimposed in Fig. 6.2 A-B
the corresponding lognormal curve, with the mean and variance taken from the empirical
observations (see Table 6.1). It is known that lognormal distributions arise from multiplicative
growth processes, like branching processes, as they may be explained by the central
limit theorem on the logarithmic scale [Mit04]. Examples of these processes are found in
viral marketing campaigns [IE11a, IE11b], where the number of leaves grows multiplicatively
as the branches split, like the cascades shown in chapter 5. It can be seen that the initial
part of the distribution fits the lognormal curve quite well, but right after its maximum the
distribution changes its scaling behavior, apparently to a power law, which we have also
superimposed in Fig. 6.2 A with a dashed line. This means that there is a higher concentration
of users who gain a large number of retransmissions per message posted than
expected for a lognormal distribution. These highly efficient users correspond to the hubs
of the followers network, as can be appreciated in Fig. 6.2 C, where we have plotted the
Quantile-Quantile plot of the η distribution against the lognormal distribution,
filtered by the number of followers. If η followed a lognormal distribution, all the points
would lie on a straight line, which actually happens for the users who have fewer than
1000 followers. But as we consider the most followed users, the curve begins to change its
behavior, suggesting that the underlying network topology is responsible for such deviation.
This point will be further analyzed in section 6.4.
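The comparison with a lognormal distribution can be sketched as follows: the moments are taken from the logarithm of η, and the Quantile-Quantile points pair the standardized empirical quantiles with those of a standard normal distribution. The use of the natural logarithm, the exclusion of users with η = 0 and the plotting positions (i + 0.5)/n are assumptions made for this example. The text does not specify the base used for the values in Table 6.1.

```python
from math import log
from statistics import NormalDist, fmean, pstdev

def lognormal_moments(etas):
    """Mean and standard deviation of log(eta), ignoring users with eta = 0."""
    logs = [log(e) for e in etas if e > 0]
    return fmean(logs), pstdev(logs)

def qq_points(etas):
    """Pairs (normal quantile, standardized empirical quantile of log(eta))."""
    logs = sorted(log(e) for e in etas if e > 0)
    mu, sigma = fmean(logs), pstdev(logs)
    n = len(logs)
    standard_normal = NormalDist()
    return [(standard_normal.inv_cdf((i + 0.5) / n), (x - mu) / sigma)
            for i, x in enumerate(logs)]

# Hypothetical efficiencies, not taken from the dataset.
etas = [0.1, 1.0, 10.0]
mu, sigma = lognormal_moments(etas)
points = qq_points(etas)
```

If log(η) were normally distributed the points would fall close to the diagonal, and an excess of highly efficient users would show up as points bending above it at the upper end.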

In summary, we have seen two kinds of users who may gain a significant number of
retransmissions. The first are the highly connected users in the followers network,
who have no need to follow other people and who, with a high efficiency, gain many more
retweets than their own posted messages. The second are other, less well
connected users, who may also gain a lot of retweets, but in a less efficient way, since they
need to post many more messages than the highly efficient ones.


Keyword          Messages     Users        µη       ση
Andreafabra      35,835       23,498       0.15     1.05
Gingrich         93,063       43,061       −0.08    1.13
Leones           142,808      46,608       −0.08    1.09
20N              389,988      123,710      −0.49    1.08
SOSInternetVE    421,602      77,706       −0.79    1.21
Obama            6,818,782    2,265,799    0.14     1.15
Egypt            7,433,542    1,180,715    −0.80    1.33

Table 6.1: Properties of the studied datasets and their resulting user efficiency distribution
properties.

6.2 Universality

In order to determine whether this distribution is specific to the present case study or is
instead a consequence of a universal feature of the interaction mechanism, we have calculated
the user efficiency (η) for other conversations on Twitter. Specifically, we performed the
analysis over six different datasets, described in chapter 4, whose features may be found in
Table 6.1. All of them belong to different contexts, and their sizes span several orders of
magnitude in terms of the number of posted messages and participating users. In Fig. 6.3 we
present the user activity distributions of these datasets, plotted in ascending order of size
(from A to F). They follow a power law behavior over the first orders of magnitude. However, the
curves are truncated beyond a certain point due to individual constraints, as previously
explained in section 5.2. Moreover, in Fig. 6.4 we present the distributions of retweets
obtained per user for the same datasets. These distributions show a power law behavior over
their whole range. As shown in section 5.4, this happens because the retweets obtained are an
emergent property that results from the aggregation of many individual actions.

The resulting η distributions for these datasets are presented in Fig. 6.5. The lognormal
distribution emerges even for the smallest datasets (Fig. 6.5 A-B). However, as the size of the
dataset increases, the effect of the highly efficient users becomes more evident in the
distributions, which present a shape very similar to the one found for the #SOSInternetVE
conversation (Fig. 6.2 A).

Given that the sizes of the datasets span from four to six orders of magnitude and that they
correspond to topics of different nature, it is remarkable that the resulting distributions


Figure 6.3: Complementary cumulative distribution function of the user activity for several
Twitter conversations, ordered by increasing number of messages (A-F): (A) Andreafabra,
(B) Gingrich, (C) Leones, (D) 20N, (E) Obama, and (F) Egypt. The black dashed line represents
a power law fit and the red dots correspond to the measured distributions.


Figure 6.4: Complementary cumulative distribution function of the retweets obtained per user for
several Twitter conversations, ordered by increasing number of messages (A-F): (A) Andreafabra,
(B) Gingrich, (C) Leones, (D) 20N, (E) Obama, and (F) Egypt. The black dashed line represents
a power law fit and the red dots correspond to the measured distributions.


Figure 6.5: Probability density function of the user efficiency for several Twitter
conversations, ordered by increasing number of messages (A-F): (A) Andreafabra, (B) Gingrich,
(C) Leones, (D) 20N, (E) Obama, and (F) Egypt. The properties of these conversations may be
found in Table 6.1. The black solid line represents the lognormal fit, the black dashed line
represents a power law fit and the red dots correspond to the measured distributions.


[Plot panels A-D, logarithmic axes. A and C: PDF versus User Efficiency. B and D: CCDF versus
Retweets R_i. Legend: Empirical, Model (Followers net).]

Figure 6.6: Model results for the user efficiency distribution (left column) and the
distribution of retweets gained per user (right column), compared with the empirical results.
The model has been applied to the followers network from the #SOSInternetVE dataset (top
panels) and the #20N dataset (bottom panels).

present a very similar shape. This ubiquity of the resulting patterns strongly suggests the
existence of a universal behavior in the relation between the individual effort, managed by the
user, and the collective reaction to that effort, which is an emergent property of the
underlying network. This raises the following question: what factors cause the emergence of
such a distribution? In the next section we propose a model to explain it.

6.3 Model

In order to model the propagation of retweets that took place in the #SOSInternetVE
conversation, we propose a spreading mechanism based on independent cascades [GLM01] taking
place on the followers network. In this model, a node is activated in analogy to posting a
message, which allows its neighbors to activate in turn, in analogy to retransmitting the
received message, following the cascade schema shown in Fig. 5.7. Each message may trigger an
independent cascade regardless of the author's previous activations. In addition, nodes may
participate in several cascades at the same time.

In the context of a given cascade, when a node i has been activated, it has a single chance to
activate each of its neighbors (followers) j, located l layers away from the message source.
The spreading probability thus depends on the distance l: the probability that a node j
retransmits a message l layers away from the source is determined by the probability that the
cascade grows vertically to a depth of at least l layers, P(d ≥ l), and by the probability that
it grows inside layer l, given by λ_l.

The user activity A_i results from all the messages posted by i: as a source in layer l = 0
(A_{i,0}), plus all the retweets made by i at l steps away from the message source (A_{i,l},
l > 0), in the following way:

    A_i = A_{i,0} + \sum_{l=1}^{d_{\max}} A_{i,l}    (6.2)

where d_max is the maximum cascade depth allowed. On the one hand, A_{i,0} is an independent
random variable with density distribution P(A_0), and represents the initial conditions of the
spreading process. On the other hand, A_{i,l} for l > 0 is not independent; rather, it is a
consequence of the propagation of other nodes' activity. Among other factors, this quantity
depends on the number of messages received by i, which is proportional to the number of people
whom i follows in the underlying followers network (k_{i,out}).

From this perspective, we define the retransmissions gained by user i in the following way:

    R_i = \sum_{l=0}^{d_{\max}-1} R_{i,l}    (6.3)

where R_{i,l} represents the retweets gained by node i due to its activations at layer l in all
the cascades. This means that a node i may gain retransmissions both from the messages it
originally posted (R_{i,0}) and from the messages it retweeted l layers away from the source
(R_{i,l}). On this basis, the value of R_{i,l} depends on the number of i's followers, as well
as on the followers of its followers, and so on, up to the maximum depth considered for a
possible node activation, given by d_max. Hence the upper limit of the sum in eq. 6.3 is one
layer below this value.


[Plot panels A-D, logarithmic axes. A and C: PDF versus User Efficiency. B and D: CCDF versus
Retweets R_i. Legend: Empirical, Model (Followers net), Model (Random net).]

Figure 6.7: Effects of the underlying network topology on the model results, in terms of the
user efficiency distribution (left column) and the distribution of retweets gained per user
(right column). The model has been applied to the followers networks (blue crosses) and their
randomized versions (red x symbols). Two datasets have been considered: #SOSInternetVE (top
panels) and #20N (bottom panels). In all cases, a heterogeneous initial activity distribution
P(A_0) ∝ A_0^{−1.4} has been considered.

6.4 Results

We first evaluated the model through computational simulations. For this purpose, we defined
the underlying network on which the propagation process takes place, as well as the initial user
activity distribution P(A_0). The messages are then spread taking into account the probability
that a cascade reaches l layers, P(d ≥ l), and the retransmission rate in a given layer, λ_l.
Finally, after all the initial activations have been performed and the triggered cascades have
died out, we calculate the efficiency η of each user according to eq. 6.1, as well as the
corresponding density distribution.
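
The simulation procedure can be sketched in Python as follows. This is a minimal illustration
under stated assumptions, not the code used to produce the results of this chapter. The function
names and data layout are hypothetical. The retweet probability at layer l is passed as a single
number, assumed here to combine P(d ≥ l) and λ_l. Each retweet is credited to every upstream
node in its cascade path, which is one possible reading of the definition of R_{i,l}. The
efficiency is taken as η_i = R_i / A_i, consistent with eq. 6.6.

```python
import random
from collections import defaultdict


def simulate_cascades(followers, initial_activity, p_layer, rng):
    """Spread every original message as an independent cascade.

    followers: dict mapping a node to the list of nodes that follow it.
    initial_activity: dict mapping a node to its number of original messages.
    p_layer: p_layer[l - 1] is the assumed retweet probability at layer l.
    Returns the activity A_i and the retweets gained R_i of every node.
    """
    activity = defaultdict(int)
    retweets = defaultdict(int)
    for source, n_messages in initial_activity.items():
        for _ in range(n_messages):
            activity[source] += 1
            reached = {source}
            frontier = [(source, (source,))]  # (node, path from the source)
            for p in p_layer:
                next_frontier = []
                for node, chain in frontier:
                    # an active node has a single chance per follower
                    for follower in followers.get(node, ()):
                        if follower in reached:
                            continue
                        if rng.random() < p:
                            reached.add(follower)
                            activity[follower] += 1
                            for upstream in chain:
                                retweets[upstream] += 1
                            next_frontier.append((follower, chain + (follower,)))
                frontier = next_frontier
                if not frontier:
                    break  # the cascade died out
    return activity, retweets


def efficiency(activity, retweets):
    """User efficiency: retweets gained per message posted or retweeted."""
    return {i: retweets.get(i, 0) / a for i, a in activity.items() if a > 0}
```

For example, on a chain where node 1 follows node 0 and node 2 follows node 1, setting every
layer probability to 1 makes each message from node 0 travel the whole chain.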

We applied the model to two followers networks from the considered datasets. One of these
networks corresponds to the #SOSInternetVE dataset and the other is built from the #20N dataset
(see Fig. 6.5 D). The resulting user efficiency and retweets distributions are shown in the top
and bottom panels of Fig. 6.6, respectively. These results correspond to the average over 50
model realizations. In both cases, the system was initially excited using a heterogeneous user
activity distribution of the form P(A_0) ∝ A_0^{−1.4}, and the spreading probabilities were
taken from the cascade characterization given in Fig. 5.8. The resulting efficiency
distributions in Fig. 6.6 A and C (blue crosses) are in very good agreement with the empirical
data (open circles) in both cases. In fact, the distributions also reproduce the different
scaling behavior on the right side of the curve. In addition, the resulting retweets
distributions in Fig. 6.6 B and D (blue crosses) are also in very good agreement with the
empirical data (open circles). These results show that the analyzed distributions reflect the
dynamical process behind message spreading, which happens on Twitter by means of the retweet
mechanism in independent cascades, where the probability that a cascade grows decays as the
message travels through the network, independently of the social context. Having validated the
spreading mechanism, we can use the model to control the effect of the different factors that
determine the user efficiency patterns, such as the heterogeneity of the underlying network
topology and the characteristics of the individual user behavior (activity distribution).

First, we analyze the effects of the heterogeneity of the underlying network topology on the
spreading process. To this end, we applied the model to two different kinds of substrata: the
followers networks from the #SOSInternetVE and #20N datasets, and their randomized versions.
These randomized networks were built to avoid the presence of hubs and to create homogeneous
user profiles, by rewiring the edges so that the degree distribution follows a Normal curve
instead of a power law, while maintaining the average number of edges per node. The resulting η
distributions, after exciting the system with the same heterogeneous P(A_0), are plotted with
red x symbols in Fig. 6.7 A and C, respectively. The distributions from these homogeneous
networks behave differently from both the empirical observations and the ones modelled on the
followers networks. There is a slightly lower density of low-efficiency users but, more
importantly, the highest values of the distribution are almost two orders of magnitude below the
empirical values, apparently following a lognormal behavior. However, the retweets
distributions in Fig. 6.7 B and D (red x symbols) still present power law behavior, due to the
heterogeneity of P(A_0), although the retweet probabilities are lower. In both cases, this
means that a homogeneous society would allow users to gain an extremely high number of retweets
only by employing an enormous amount of initial activity as well, since the user efficiency is
strongly limited by the available connections in the underlying network.
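
One simple way to build such a homogeneous substratum is sketched below in Python. The exact
rewiring procedure is not detailed in the text, so this is only an assumption: both endpoints of
every edge are redrawn uniformly at random, which keeps the number of nodes and edges (and hence
the average number of edges per node) and yields a narrow, hub-free degree distribution. The
function names are hypothetical.

```python
import random
from collections import Counter


def randomize_network(nodes, edges, rng):
    """Redraw both endpoints of every edge uniformly at random.

    Assumes the number of distinct edges does not exceed the number of
    possible ordered pairs of distinct nodes.
    """
    nodes = list(nodes)
    n_edges = len(set(edges))
    rewired = set()
    while len(rewired) < n_edges:
        source, target = rng.choice(nodes), rng.choice(nodes)
        if source != target:  # avoid self-loops
            rewired.add((source, target))
    return sorted(rewired)


def max_in_degree(edges):
    """Largest number of incoming edges of any node."""
    return max(Counter(target for _, target in edges).values())
```

Applied to a star network in which 200 users all point to a single hub, the rewired version
keeps the 200 edges but no node concentrates them.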


[Plot panels A-D, logarithmic axes. A and C: PDF versus User Efficiency. B and D: CCDF versus
Retweets R_i. Legend: Empirical, Model (Followers net), Model (Random net).]

Figure 6.8: Effects of the individual user behavior on the model results, in terms of the user
efficiency distribution (left column) and the distribution of retweets gained per user (right
column). The model has been applied to the followers networks (blue crosses) and their
randomized versions (red x symbols). Two datasets have been considered: #SOSInternetVE (top
panels) and #20N (bottom panels). In all cases, a homogeneous activity distribution
P(A_0) = 1/6, with A_0 ∈ [1, 6], has been considered.


Second, to study the effects of the individual user behavior, given by the initial activity
distribution, we again applied the model to both followers networks (the #SOSInternetVE case
study and the #20N dataset) and their randomized versions, but this time considering a
homogeneous P(A_0) of the form P(A_0) = 1/6 with A_0 ∈ [1, 6], instead of the heterogeneous one
considered previously. The results of applying this homogeneous user behavior to the
heterogeneous followers networks are shown with blue crosses in Fig. 6.8. The resulting user
efficiency distributions in Fig. 6.8 A and C present the same behavior on the right side of the
curve as the empirical observations (open circles), even though the considered user behavior is
radically different from the empirical one. In addition, the retweets distributions (Fig. 6.8 B
and D) also agree quite well with the empirical observations and hardly change in comparison
with the distributions obtained when users posted messages heterogeneously. However, if we
change the substrata to their randomized versions, the model results no longer reproduce the
empirical behavior and all the distributions lose their heterogeneity (red x symbols in
Fig. 6.8). This confirms that the emerging patterns do not depend on the way users post
original messages, but are instead a consequence of their heterogeneous connections in the
underlying network.
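
The two initial conditions compared in this section can be sampled as in the following Python
sketch. It is an illustration only: the function names are hypothetical and the upper cutoff
`a_max` of the power law is an assumption needed to normalize the discrete distribution, since
no cutoff is stated in the text.

```python
import random


def sample_heterogeneous_activity(n_users, rng, exponent=1.4, a_max=1000):
    """Draw A_0 from a discrete power law P(A_0) proportional to A_0^(-exponent)."""
    values = range(1, a_max + 1)  # a_max is an assumed cutoff
    weights = [a ** -exponent for a in values]
    return rng.choices(values, weights=weights, k=n_users)


def sample_homogeneous_activity(n_users, rng, low=1, high=6):
    """Draw A_0 uniformly from the integers low..high, i.e. P(A_0) = 1/6."""
    return [rng.randint(low, high) for _ in range(n_users)]
```

In the heterogeneous case most users post a single message while a few post many; in the
homogeneous case no user posts more than six.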

In the case of Twitter, the followers network also represents the way in which collective
attention is organized. On this basis, the model has shown that if collective attention is
distributed heterogeneously among the population, the way users post messages has no further
effect on either the efficiency distribution or the retweets distribution, since the high
aggregation of users around the influential ones is what produces such large collective
reactions. In turn, if users paid attention to each other homogeneously, as in the randomized
version of the followers network, then the retweets gained per user would reflect the frequency
and number of posted messages, and the efficiency in gaining such retweets would be strongly
limited by the properties of the underlying substratum. However, although in a homogeneous
society it would be more difficult to find extreme cases of highly efficient users, the density
of extremely low-efficiency users also decreases when attention is shared homogeneously among
the collective. This indicates that, in order for some users to gain attention from the
collective, others must lose it at the same time.

6.5 Analytical Solution

In this section we provide an analytical solution to the model of user efficiency. For this
purpose, we define the quantities A_{i,l} from eq. 6.2 and R_{i,l} from eq. 6.3, for l > 0.
A_{i,l} is defined in the following way:


Figure 6.9: Results from the analytical model of user efficiency, considering cascades up to three

layers of depth in the followers network from the #SOSInternetVE dataset. Resulting η average

(A) and standard deviation (B) from evaluating the model with 0.2 < P (d > 0) < 1.0 (x-axis)

and 0.05 < r0 < 0.3 (color). The dashed lines indicate the empirical values. (C) Resulting

η distribution from applying the analytical model to the followers network with the empirical

activity distribution P (A0) by setting P (d > 0) = 0.775 and r0 = 0.15. The white dots represent

the empirical distribution of user efficiency and the triangles represent the distribution obtained

from the analytical model.


    A_{i,l} = \langle A_{l-1} \rangle \, k_{i,out} \, \lambda_l \, P(d \ge l), \quad l > 0    (6.4)

where ⟨A_{l−1}⟩ is the mean activity of the nodes in layer l − 1 over all the cascades,
k_{i,out} is the out degree of user i (the number of users whom i follows), P(d ≥ l) is the
probability that a cascade has a depth of at least l layers, and λ_l is the retransmission rate
at layer l. In this sense, the activity of a node at any layer depends on the expected activity
of all nodes in the previous layer, the node's connectivity and the network's permeability,
given by the probability that a cascade grows vertically and horizontally.

R_{i,l} is defined as follows:

    R_{i,l} = A_{i,l} \sum_{n=l}^{d_{\max}} K_{i,in}(n) \, P(d \ge n) \prod_{m=0}^{n} \lambda_m    (6.5)

where K_{i,in}(n) is the sum of the in degrees (k_{in,j}) of the nodes j that are n layers away
from i, following the edge direction, with K_{i,in}(0) = k_{i,in}, the node's own in degree.

The resulting user efficiency η_i would be:

    \eta_i = \frac{\sum_{l=0}^{d_{\max}-1} R_{i,l}}{\sum_{l=0}^{d_{\max}} A_{i,l}}    (6.6)
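
As a reading aid, eqs. 6.4 to 6.6 can be evaluated literally for a single node with the Python
sketch below. It is only an illustration: the function name and argument layout are
hypothetical, all inputs (mean layer activities, degrees and probabilities) are assumed to be
known beforehand, and the numbers used to exercise it are invented rather than taken from the
datasets.

```python
from math import prod


def analytical_efficiency(a0, mean_activity, k_out, big_k_in, p_depth, lam):
    """Evaluate eqs. 6.4-6.6 for one node.

    a0: original activity A_{i,0}.
    mean_activity[l - 1]: mean activity of layer l - 1, for l = 1..d_max.
    k_out: out degree k_{i,out}.
    big_k_in[n]: K_{i,in}(n), for n = 0..d_max.
    p_depth[n]: P(d >= n), for n = 0..d_max.
    lam[m]: retransmission rate at layer m, for m = 0..d_max.
    """
    d_max = len(p_depth) - 1
    # eq. 6.4: activity at layers l > 0
    activity = [a0] + [
        mean_activity[l - 1] * k_out * lam[l] * p_depth[l]
        for l in range(1, d_max + 1)
    ]

    def gained(l):
        # eq. 6.5: retweets gained from the activations at layer l
        reach = sum(
            big_k_in[n] * p_depth[n] * prod(lam[: n + 1])
            for n in range(l, d_max + 1)
        )
        return activity[l] * reach

    retweets = [gained(l) for l in range(d_max)]
    # eq. 6.6: ratio of retweets gained to total activity
    return sum(retweets) / sum(activity)
```

With d_max = 1, A_{i,0} = 2, ⟨A_0⟩ = 3, k_{i,out} = 4, K_{i,in} = (10, 20), P(d ≥ n) = (1, 0.5)
and λ = (0.5, 0.5), the sketch gives A_{i,1} = 3, R_{i,0} = 2 (5 + 2.5) = 15 and η_i = 15/5 = 3.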

We applied eq. 6.6 to the followers network from the #SOSInternetVE dataset, using the actual
data for the original activity per node, which follows a heterogeneous distribution. The
corresponding distribution obtained with d_max = 2 is plotted in Fig. 6.9 C. The analytical
model results are in very good agreement with the observed data. In order to reproduce the
distribution, we had to increase the probabilities that the cascades grow vertically and
horizontally on the first layer to P(d > 0) = 0.775 and r_0 = 0.15, respectively. These values
differ from the empirically measured ones, which did not reproduce the empirical distribution.

In order to obtain these probabilities, we evaluated eq. 6.6 while scanning
0.2 < P(d > 0) < 1.0 and 0.05 < r_0 < 0.3. The resulting average and standard deviation of η
are shown in Fig. 6.9 A and B, respectively. The dashed lines indicate the empirical values. We
first noticed that the empirical average value of η is obtained within the range of P(d > 0)
marked with a gray shadow in Fig. 6.9 A. Then, within this range, we found the value of r_0 at
the intersection with the dashed line in Fig. 6.9 B.


6.6 Summary

In summary, we have been able to model the efficiency with which users spread their opinions
during Twitter conversations, and found that the emergent patterns are remarkably influenced by
the underlying network topology. We have shown evidence of the robust but vulnerable property
of complex networks: they appear to be robust to most external excitations, as most people post
messages that do not travel at all, but vulnerable to selected excitations, as the activity of
the highly efficient users has a remarkable impact on the resulting patterns [Wat02]. This
effect is also measured through the macroscopic property of the percentage of retweets among
all posted messages. In the protest, 47% of the messages were retweets, while our simulations
gave 45 ± 3% for the followers network and 40.3 ± 0.1% for the randomized version. This
additional 5% of retransmissions was only possible due to the complex organization of the
network.


Chapter 7

MEASURING POLITICAL POLARIZATION

In this chapter we propose a methodology to study polarization in social media and quantify
its effects. To this end, we introduce a computational model to estimate opinions [MBLBss]
from a contagion process on social networks, together with a new index [MBLBss] to quantify
the extent of polarization in the obtained opinions.

The model iteratively estimates the opinions of the majority by fixing the opinion of a
minority of influential individuals and mapping the communication fluxes among the population.
Its dynamics are similar to the DeGroot model [DeG74], with the introduction of some users
acting like "zealots" [Mob03, MMR07]. In the absence of polarization, the expected resulting
distribution of opinions would be a normal distribution centered at a neutral opinion. However,
as polarization emerges, the resulting distribution shifts to a bimodal distribution with two
peaks emerging around the two dominant and confronted opinions [DW07].

Our measure of polarization is inspired by the electric dipole moment, a measure of a charge
system's overall polarity. For two opposite point charges, the electric dipole moment increases
with the distance between the charges. Analogously, the polarization of two equally populated
groups depends on how distant their views are. We apply this index to measure the polarization
in the opinion distributions obtained with the proposed opinion estimation model.

At the end of the chapter, we show how to apply our methodology to online data gathered from
Twitter in order to estimate individual opinions and measure the emergent political
polarization. The data correspond to online conversations around the announcement of the death
of the late Venezuelan President Hugo Chavez. We found good agreement between our results and
offline data.


7.1 A Model to Estimate Opinions in a Social Network

We present a model to estimate the opinions of individuals who interact on a social network.
In it we distinguish two types of individuals: elite and listeners. The former have a fixed
opinion and act as seeds of influence, while the opinion of the latter depends on their social
interactions. The model is fully specified by the following assumptions:

1. Initial Conditions: The world is abstracted as a directed network, G, in which each
individual is represented by a node and links account for influence rather than friendship or
other kinds of relationship. We define two different subsets of nodes: S, accounting for the
elite, and L, accounting for the listeners. Additionally, we endow each elite node with a
parameter, X_s, that determines her opinion value and remains constant for the duration of the
model. X_s lies in the range −1 ≤ X_s ≤ 1, where 1 and −1 represent the two extreme and
confronted poles. Finally, we assign an initially neutral opinion, X_l(0) = 0, to all listeners.

2. Opinion Generation: At each iteration, the elite nodes, S, propagate their opinions through
the established network, G, influencing the listeners, L. Each listener iteratively updates her
opinion value as the mean opinion value of her neighbors. Thus the opinion at time step t of a
given listener i is given by the following expression:

    X_i(t) = \frac{\sum_j A_{ij} X_j(t-1)}{k_i^{out}}    (7.1)

where A_{ij} represents the elements of the network adjacency matrix, which is 1 if and only if
there is a link from j to i, and k_i^{out} corresponds to her out degree. The process is
repeated until all nodes converge to their respective X_i value, lying in the range
−1 ≤ X_i ≤ 1. Convergence is defined with a threshold Th such that |X_i(t + 1) − X_i(t)| < Th.
The results of the model are thus given as a density distribution of the nodes' opinion values,
p(X). Note that the opinion of an individual does not depend on her own opinion in the previous
step. This is because we are estimating opinions that were a priori unknown, rather than
studying the evolution of opinions.
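
The two assumptions above can be sketched in Python as follows. This is a minimal illustration,
not the implementation used in this thesis. The function name and data layout are hypothetical,
the update is assumed to be synchronous, each listener averages over the nodes listed as
influencing her, and listeners without any influencing neighbor are assumed to stay neutral.

```python
def estimate_opinions(influencers, elite_opinions, threshold=1e-6, max_iter=10000):
    """Iterate eq. 7.1 until every opinion changes by less than the threshold.

    influencers: dict mapping a node i to the list of nodes j that influence it.
    elite_opinions: dict mapping each elite node to its fixed opinion in [-1, 1].
    """
    nodes = set(influencers) | set(elite_opinions)
    for sources in influencers.values():
        nodes.update(sources)
    # elite nodes keep their opinion; listeners start neutral, X_l(0) = 0
    opinions = {node: elite_opinions.get(node, 0.0) for node in nodes}
    for _ in range(max_iter):
        updated = dict(opinions)
        for node in nodes:
            sources = influencers.get(node, ())
            if node in elite_opinions or not sources:
                continue
            # mean opinion of the influencing neighbors at the previous step
            updated[node] = sum(opinions[j] for j in sources) / len(sources)
        change = max(abs(updated[n] - opinions[n]) for n in nodes)
        opinions = updated
        if change < threshold:
            break
    return opinions
```

For instance, a listener influenced by one elite node of each pole remains neutral, while a
listener influenced only by nodes close to one pole converges to that pole.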

The dynamics of the model are illustrated in Fig. 7.2, where we present a schema of the
influence spreading process. Panel A visualizes the instantiation of the model, where each
elite node has been colored according to her opinion (red, X_s = −1; blue, X_s = +1). Panels
B-E show the dynamics of the influence process from the initialization (B) to the final
converged state (E). Panels F and G visualize two empirical networks corresponding to a
non-polarized (F) and a polarized (G) case. We also illustrate the dynamics of the model in
Video B.1, which is described in Appendix B.


Figure 7.1: Schema explaining the proposed polarization index µ. (A) Density distribution of

opinions. gc stands for the gravity center of each pole, A stands for the population associated to

each ideology, and d stands for the pole distance. (B) Visualization of the polarization index, µ,

for three different situations.

7.2 A Measure of Polarization

We say that a population is perfectly polarized when it is divided into two groups of the same
size with opposite opinions. Hence we propose a measure of polarization that quantifies both
effects for the X ∈ [−1, 1] distribution obtained from our model. This definition is inspired
by the electric dipole moment, a measure of a charge system's overall polarity. In the simplest
case of two point charges of opposite signs (−q and +q), the electric dipole moment is
proportional to the distance between the charges. This is analogous to a simple scenario
consisting of two persons with different ideologies, where the polarization depends on how
conflicting their points of view are (i.e. the distance between the two ideologies).

We begin by calculating the population associated with each opinion (positive and negative).
For this we define A^- as the relative population with negative opinions (X < 0). By the same
token, we define A^+ as the relative population with positive opinions (X > 0). Both variables
can be expressed as:

    A^- = \int_{-1}^{0} p(X) \, dX = P(X < 0),    (7.2)

    A^+ = \int_{0}^{1} p(X) \, dX = P(X > 0).    (7.3)

So we can express the normalized difference in population sizes, ∆A, as

    \Delta A = |A^+ - A^-| = |P(X > 0) - P(X < 0)|    (7.4)

Next we quantify the distance between the positive and negative opinions; in other words, we
measure how different the opinions of the two sides are. To this end, we determine the gravity
centers of the positive and negative opinions, which can be written as

    gc^- = \frac{\int_{-1}^{0} p(X) \, X \, dX}{\int_{-1}^{0} p(X) \, dX},    (7.5)

    gc^+ = \frac{\int_{0}^{1} p(X) \, X \, dX}{\int_{0}^{1} p(X) \, dX},    (7.6)

and define the pole distance, d, as the normalized distance between the two gravity centers.
Hence it can be expressed as:

    d = \frac{|gc^+ - gc^-|}{|X_{\max} - X_{\min}|} = \frac{|gc^+ - gc^-|}{2}    (7.7)

This formula gives d = 0 when there is no separation between the gravity centers, i.e.

there are no longer two differentiated groups and everyone shares a similar opinion; and

d = 1 when the two opinions are extreme and perfectly opposed.

Finally, we can use eqs. 7.4 and 7.7 to write down a general formula to measure polarization as
a function of the difference in size between both populations, ∆A, and the pole distance, d.
Thus we define the polarization index as:

    \mu = (1 - \Delta A) \, d    (7.8)

This formula gives µ = 1 when the distribution is perfectly polarized. In this case the opinion
distribution function consists of two Dirac deltas centered at −1 and +1, respectively.
Conversely, µ = 0 means that the opinions are not polarized and the resulting distribution of
opinions would either take the form of a Gaussian distribution centered at a neutral opinion,
or be entirely concentrated at one of the poles, implying that the population (A)


Figure 7.2: Schema of the influence spreading process in the opinion estimation model. (A)
Displays the seed nodes in the network, colored according to their respective ideology. (B)
Displays the network at t = 0, before the seeds start to propagate their influence. (C) Shows
the state of the network at t = 1. (D) Shows the state of the network at t = n/2. (E) Displays
the final state of the network at t = n. (F) and (G) Visualizations of two examples of the
result of applying the opinion estimation model to the Venezuelan dataset, for non-polarized
(F) and polarized (G) days. See Video B.1, described in Appendix B.

of the other pole would be reduced to zero and ∆A = 1. In between, polarization can lie within
the range 0 < µ < 1 for three reasons: i) the population sizes associated with each opinion are
equal, but the pole distance d is lower than 1; ii) despite d being equal to 1, the population
sizes associated with each opinion are different, and therefore there is a majority sharing a
similar opinion; iii) a combination of i and ii. Fig. 7.1 A illustrates the basic concepts of
the proposed polarization index, as it visualizes the area associated with each opinion, the
corresponding gravity centers and the pole distance for a standard case of a perfectly bimodal
distribution. In panel B of this figure we visualize a non-polarized distribution (µ = 0), a
perfectly polarized one (µ = 1) and a case in between.
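
For a finite sample of estimated opinions, eqs. 7.2 to 7.8 can be approximated by replacing the
integrals with sample fractions and sample means, as in the Python sketch below. This discrete
version is an assumption made for illustration. In particular, the sketch returns µ = 0 when
one of the poles is empty, since the corresponding gravity center is then undefined.

```python
def polarization_index(opinions):
    """Discrete estimate of the polarization index mu from opinion values in [-1, 1]."""
    negatives = [x for x in opinions if x < 0]
    positives = [x for x in opinions if x > 0]
    if not negatives or not positives:
        return 0.0  # assumed convention: a single pole means no polarization
    total = len(opinions)
    a_minus = len(negatives) / total           # eq. 7.2
    a_plus = len(positives) / total            # eq. 7.3
    delta_a = abs(a_plus - a_minus)            # eq. 7.4
    gc_minus = sum(negatives) / len(negatives) # eq. 7.5
    gc_plus = sum(positives) / len(positives)  # eq. 7.6
    d = abs(gc_plus - gc_minus) / 2            # eq. 7.7
    return (1 - delta_a) * d                   # eq. 7.8
```

Two equally sized groups at −1 and +1 give µ = 1; two equally sized groups at −0.5 and +0.5
give µ = 0.5 (reason i); one user at −1 against three at +1 gives ∆A = 0.5, d = 1 and µ = 0.5
(reason ii).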

7.3 Study of Polarization on Retweet Networks

In order to measure the extent of polarization in Twitter conversations, we propose the
following methodology. First, we build social networks from the conversation to be analyzed,
like the ones described in section 4.2.2. Then, we apply the model proposed in section 7.1 to
these networks in order to obtain the distribution of opinions of the population. Finally, we
quantify the polarization present in these opinion distributions by means of the index
proposed in section 7.2.

The social networks we considered are the retweet networks from Twitter conversations. These
user-to-user interaction networks represent the channels through which information actually
flows on Twitter, as shown in chapters 5 and 6. Moreover, the retweet mechanism has been
reported as the most polarized on Twitter [CRF+11, CGFM12] and it is typically used to actively
endorse ideas [BGL10, BMLB12].

In this section, we apply our opinion estimation model and polarization index to Twitter

data regarding the late Venezuelan President Hugo Chavez. The dataset used in this study

was described in section 4.2.2 under the keyword Chavez. First, we will present the networks’

properties. Then, we will define the elite of collective attention, and use them as seeds to

apply the opinion model to all networks. Finally, we will discuss the effects of edge direction

and the offline-online relationship of our results.

7.3.1 Retweets Networks

Since this conversation covers a two-month period, we have built one independent retweet
network for each day of the observation period (56 networks). A single network contains several
retransmission cascades, seeded and propagated by the conversation participants. When these
cascades are aggregated, several disconnected network components emerge. In Fig. 7.3 we present
a visualization of the retweet network on an arbitrary day. Most of the components consist of
two or three users at most (see the gray graphs in Fig. 7.3), while there is a single
component, called the Giant Component (GC), whose size is of the same order as the whole
network (see the colored graph in Fig. 7.3). We will apply the polarization detection
methodology to these GCs and refer to them as the retweet networks in the following sections.
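
Extracting the Giant Component from a daily list of retweet edges can be sketched in Python
with a breadth-first search. The sketch assumes that components are defined by ignoring the
edge direction (weak connectivity), which is not stated explicitly in the text, and the
function name is hypothetical.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the node set of the largest weakly connected component."""
    neighbors = defaultdict(set)
    for u, v in edges:
        # ignore the edge direction when grouping nodes into components
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen = set()
    largest = set()
    for start in neighbors:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return largest
```

For the edges (1, 2), (2, 3) and (4, 5), the function returns the nodes {1, 2, 3}, leaving the
two-user component aside.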

In the left panel of Fig. 7.4, we present the distribution of the component sizes for three
different days. The distribution follows a power law behavior, where the GC is much larger
than the rest of the components. In the right panel of Fig. 7.4 we present the time evolution
of the GC properties. First, in Fig. 7.4 A we present the time evolution of the ratio between
the number of nodes in the GC and the number of nodes in the respective network. The GC
contained around 80% of the network nodes throughout the observation period. The number of
users in the GC varied widely during the unfolding events. Then, in Fig. 7.4 B we present the
time evolution of the GC size, measured as the total number of nodes in each network. The GC
size fluctuated around a median value of 20,000 users (gray dashed line), and grew up to 1
million users during the death announcement (orange stripe). This temporal behavior is typical
of

Figure 7.3: Visualization of the retweet network at day D− 29. The Giant Component has been

colored in blue and red, while the rest of components have been colored in gray.

Figure 7.4: (Left) Distributions of the components size of the retweet networks from the Twitter

conversation about the Venezuelan President Hugo Chavez for three days: D− 29, D and D+ 20,

where D represents the day of the main occurrence. (Right) Time evolution of the Giant Component

(GC) of the retweets networks: (A) Ratio between the number of nodes that conform the GC and

the number of nodes in the respective networks. (B) Time evolution of the whole network and GC

size in terms of nodes. (C) Relative number of messages inside Venezuela from the geolocalized

users in the GC. The orange stripe represents the day D and the state funeral period.

breaking news topics [YL11], with a bursty increase during the main occurrence and a slow

decay that may last for several days. During the burst the conversation went viral and many

international users joined the conversation from all around the globe (see Fig. 7.5 and the

video B.2 described in Appendix B). This is reflected in the share of geolocated messages
sent from inside Venezuela, given in Fig. 7.4 C. The Venezuelan share of messages
represented about 80% of the analyzed content for most of the observation period (dashed
line), with the exception of the day of the death announcement, when it reached its lowest
point, close to 20%.

The retweet networks characterize the way that the collective attention is organized

during an event on Twitter. The out-strength (sout) indicates the amount of attention paid

by a given user in the conversation, while the in-strength (sin) indicates the amount of

attention received by a user from the rest of the network. The first is measured by the

number of retweets made by the participant, and the second is given by the number of

retweets gained by the participant. In Fig. 7.6 we have superimposed the in-strength (left)

and out-strength (right) complementary cumulative distribution functions (CCDF) for each of
the constructed networks. In both cases the distributions display power law behavior, with
the in-strength distributions being broader than the out-strength distributions.

To understand how people distributed their attention, we studied the evolution of the

Gini coefficient [CV12] of these two distributions. The Gini coefficient is commonly used to
measure inequality in people’s income, and indicates the heterogeneity of a distribution. It
takes the value 1 when the population is perfectly unequal, indicating that hubs concentrate
all the links. In turn, it takes the value 0 when the population is perfectly equal,
indicating that links are equally distributed among nodes. Here we propose to use it as an
indicator of how people’s attention is distributed among the information sources. The
results are shown in Fig. 7.6 C. The Gini index of incoming links is very close to 1 during
the whole observation period (blue curve in Fig. 7.6 C), which means that hubs concentrate
practically all of the collective attention. In contrast, the Gini index for outgoing links
is closer to 0 (red curve in Fig. 7.6 C), indicating that the attention given is more evenly
distributed among users than the attention received.
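
To make the use of the Gini coefficient concrete, the sketch below computes it for a list of node strengths with the standard rank-weighted formula for sorted data. The function name `gini` and the two toy strength lists are illustrative assumptions, not data from the Chavez conversation.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    Returns 0 for a perfectly equal sample; for a finite sample of size n the
    maximum is (n - 1) / n, which approaches 1 as n grows.
    """
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Hypothetical strengths: attention received is concentrated on one hub,
# while attention given is spread almost evenly.
in_strengths = [0, 0, 0, 0, 1, 1, 2, 96]
out_strengths = [10, 12, 12, 13, 13, 13, 13, 14]
gini_in = gini(in_strengths)
gini_out = gini(out_strengths)
```

With these toy values the in-strength Gini is far larger than the out-strength Gini, which mirrors qualitatively the gap between the two curves in Fig. 7.6 C.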

Moreover, in order to understand the way these heterogeneous users interacted with each

other, we studied the evolution of the directed degree assortativity [New03a, HW09]. The
results are shown in Fig. 7.6 D. The out-in degree assortativity (green) is negative for the
whole observation period. This means that the content posted by the influential hubs is
usually retweeted by users who are not highly connected. On the other hand, the out-out
degree assortativity (blue) is clearly positive for the whole observation period, which

Figure 7.5: Visualization of geolocated messages from the Chavez conversation on three days

from different periods: before the announcement (top), during the announcement (middle), after

the announcement (bottom). The dots represent geolocalized messages. The label indicates the

day of observation, D being the day of the announcement.

Figure 7.6: Evolution of the topological properties of the retweet networks emergent at each day

of the observation period, in terms of: (A) Out strength complementary cumulative distribution,

(B) In strength complementary cumulative distribution, (C) Gini index evolution of the strength

distributions. (D) Directed degree assortativity evolution. The orange stripe represents the day of

the main occurrence. In A and B, the blue curves correspond to the first days and the red curves

correspond to the last days.

Figure 7.7: Conditional probability density function of the accumulated in-strength (Sin)
given the participation rate (ρ), from the Twitter conversation about the Venezuelan
President Hugo Chavez. The color corresponds to the density of users. The red line indicates
the average accumulated in-strength value Sin for a given participation rate ρ.

means that very active users are usually retweeted by other very active users. That effect
is related to the cascades shown in chapter 5. The other two assortativities, in-out and
in-in, are close to zero, which means that no major correlation is detected.
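
The four directed assortativities can be illustrated as Pearson correlations, taken over the edges, between one kind of degree at the source node and one kind of degree at the target node. The sketch below is a minimal version under two assumptions that are not stated in the text: edges are unique `(retweeter, retweeted)` pairs, and the "out-in" label refers to the out-degree of the source and the in-degree of the target. The function name is hypothetical.

```python
from collections import Counter
from math import sqrt

def degree_assortativity(edges, source_kind="out", target_kind="in"):
    """Pearson correlation, over edges, between the source node's degree of
    `source_kind` and the target node's degree of `target_kind`."""
    degree = {
        "out": Counter(u for u, _ in edges),  # retweets made
        "in": Counter(v for _, v in edges),   # retweets received
    }
    xs = [degree[source_kind][u] for u, _ in edges]
    ys = [degree[target_kind][v] for _, v in edges]
    n = len(edges)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation in degrees, so no correlation can be measured
    return cov / sqrt(var_x * var_y)

# Hypothetical toy network: three users retweet a hub, one of them also
# retweets a barely visible account.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "x")]
r_out_in = degree_assortativity(edges, "out", "in")
```

In this toy network the most active retweeter is the one linked to the least retweeted account, so the out-in coefficient is negative.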

Participation

To further understand the relationship between the individual activity and the attention

received, we will aggregate the observation period by characterizing the individuals according

to their rate of participation and total amount of retweets gained. The participation rate is

defined as:

$\rho = \rho_i / T$ (7.9)

where ρi is the number of days that the user i actively participated in the retweet process

and T is the total length of the observation period. The total number of retweets gained by
user i is measured as:

$S_{in} = \sum_{t=0}^{T} s_{in}(t)$ (7.10)

where sin(t) is the in-strength of node i on day t. If the user did not actively participate
on day t, then sin(t) = 0.
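
The two quantities defined in eqs. 7.9 and 7.10 can be computed from the daily networks as in the following sketch. It assumes that a user counts as active on a given day if the user appears in that day's network, either retweeting or being retweeted; the text does not spell out this criterion. The weighted edge format `(retweeter, retweeted, n_retweets)` and the function name are also assumptions of this example.

```python
from collections import defaultdict

def participation_and_attention(daily_edges):
    """Return (rho, S_in) dictionaries keyed by user.

    daily_edges holds one list per day of (retweeter, retweeted, n_retweets).
    """
    T = len(daily_edges)              # length of the observation period in days
    active_days = defaultdict(int)    # rho_i: number of days user i is present
    S_in = defaultdict(int)           # accumulated in-strength
    for edges in daily_edges:
        present = set()
        for retweeter, retweeted, n_retweets in edges:
            present.update((retweeter, retweeted))
            S_in[retweeted] += n_retweets   # days without activity add nothing
        for user in present:
            active_days[user] += 1
    rho = {user: days / T for user, days in active_days.items()}
    return rho, {user: S_in[user] for user in rho}

# Hypothetical three-day observation period.
daily_edges = [
    [("a", "hub", 2), ("b", "hub", 1)],
    [("a", "hub", 1)],
    [("c", "a", 3)],
]
rho, S_in = participation_and_attention(daily_edges)
```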

The conditional probability density function of the accumulated in-strength Sin given
a participation rate ρ, P (Sin|ρ), is shown in Fig. 7.7. This distribution indicates the total
amount of attention received by users according to their participation rate. The largest
density of users (red and orange dots in Fig. 7.7) participated on less than 20% of the days
(ρ < 0.2) and presents a small in-strength value (Sin < 10), which means that most of them
received a small amount of the collective attention. However, there is a direct relation
between the average conditional value of Sin, given by 〈Sin|ρ〉, and the participation rate
ρ, indicating that the more days people participate, the more attention they receive

(see red line in Fig. 7.7). In fact, there is a very small set of users at the upper right corner

in Fig. 7.7, who participated almost every day and present an extremely high Sin (up to

almost 100,000). This minority of highly influential users captured most of the collective

attention throughout the observation period, and are considered to be the opinion leaders.

In summary, we have seen that while most participants hardly gain attention, there is

a very small set of users who captured most of the collective attention.

7.3.2 Elite nodes

The opinion estimation model described in section 7.1 defines a set of influential users called

elite. These users act as seeds of opinion and help to infer the opinions of the majority
of listeners. In the Twitter conversation, we consider those users who often participate
and concentrate large amounts of retweets to be the elite of the collective attention.
Their messages were widely forwarded by the conversation participants on a daily basis, which

makes them leaders of the information diffusion process. In this section we will describe

their properties and the way they behaved in the conversation.

We have defined three sets of elite users, according to how much they have actively

participated in the conversation and the attention they received from the rest of participants.

The first set is composed of the top 65 most influential users, who gained an extremely high
number of retweets, independently of their participation rate (Sin > 10,000 and ρ > 0.0).
These users correspond to the yellow rectangle at the top of Fig. 7.7, and represent accounts
of politicians, news media and journalists. The second set includes those who gained a
considerable number of retweets by participating on most days (Sin > 1000 and ρ > 0.89).
This set of 136 users includes those who captured a large part of the collective attention
by actively participating throughout the observation period (green rectangle in Fig. 7.7).
The third set includes those users who did not necessarily receive much attention, even
after participating on most days (Sin > 10 and ρ > 0.82). This set of 635 users includes
those who were very active in the conversation but did not necessarily capture much of the
collective attention (black rectangle in Fig. 7.7), as well as the most influential ones.
In fact, most of the users in the smaller sets are contained in the larger sets, as some
rectangles clearly overlap in Fig. 7.7.
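
The three elite definitions amount to rectangular thresholds in the (ρ, Sin) plane. A minimal sketch of that selection follows; the function name `select_elite`, the account names and their values are hypothetical, and the strict inequalities follow the text above (the caption of Fig. 7.8 uses ≥ instead).

```python
def select_elite(rho, s_in, min_s_in, min_rho):
    """Users whose accumulated in-strength and participation rate both
    exceed the given minimum values."""
    return {user for user in rho
            if s_in.get(user, 0) > min_s_in and rho[user] > min_rho}

# Hypothetical users; the thresholds are the ones quoted in the text.
rho = {"leader": 0.95, "news": 0.30, "regular": 0.90, "casual": 0.05}
s_in = {"leader": 50000, "news": 20000, "regular": 40, "casual": 2}

elite_1 = select_elite(rho, s_in, min_s_in=10000, min_rho=0.0)
elite_2 = select_elite(rho, s_in, min_s_in=1000, min_rho=0.89)
elite_3 = select_elite(rho, s_in, min_s_in=10, min_rho=0.82)
```

In this toy case the second set is contained in both the first and the third, which illustrates how the rectangles can overlap.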

In order to analyze the elite’s behavior, we have built networks with these three sets of

influential users. The networks are built by merging the edges among the respective nodes
throughout the observation period. That is, each elite network is the union of all daily
networks, restricted to the corresponding set of elite nodes. In Table 7.1 we present the

topological properties of the elite networks.

First of all, we have found that these networks present a segregated structure. The

networks present a clearly defined community structure according to the modularity opti-

mization algorithm [BGLL08]. For the three networks the modularity is positive and high

(Q1 = 0.43, Q2 = 0.38, Q3 = 0.35), which indicates that the communities in these graphs are

well segregated from each other. Moreover, the communities’ members share political
preference. The second and third networks each presented one community (or C-node) in favor
of the late President (officialism) and another one identified with the opposition parties (against

Elite NW   Sin     ρ      Nodes   Edges   Off. C-nodes   Opp. C-nodes   Q      r
1          10000   0.00   67      334     1 (25)         3 (42)         0.43   0.77
2          1000    0.89   136     1567    1 (48)         1 (88)         0.38   0.88
3          10      0.82   635     28245   1 (197)        1 (438)        0.35   0.91

Table 7.1: Elite networks topological properties. Sin and ρ columns represent minimum values.

Off. C-nodes indicates the number of network communities related to the officialism, and Opp.

C-nodes indicates the number of communities related to the opposition. The numbers in the

parentheses indicate the number of nodes in each pole. Q stands for modularity. r stands for the

Pearson coefficient of mixing patterns by ideology.

the late President). In particular, the first network presented one community identified
with the officialism and three with the opposition. To study the preference of interaction
by political affinity, we analyzed the networks’ mixing patterns [New03a], given by the
Pearson coefficient r in Table 7.1. In all three cases the assortativity values are very high
(r1 = 0.77, r2 = 0.88, r3 = 0.91), which shows that the interactions in these networks are
strongly polarized.
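
For a categorical attribute such as political affinity, the assortativity coefficient of mixing patterns can be written as r = (Σ e_cc − Σ a_c b_c) / (1 − Σ a_c b_c), where e_cd is the fraction of edges going from class c to class d, and a_c and b_c are the fractions of edge origins and edge ends in class c. The sketch below implements this expression for a directed edge list; the function name, the node labels and the toy network are hypothetical.

```python
from collections import defaultdict

def attribute_assortativity(edges, label):
    """Assortativity coefficient of a categorical node attribute.

    edges: list of (source, target) pairs; label: dict mapping node -> class.
    """
    m = len(edges)
    e = defaultdict(float)   # e[(c, d)]: fraction of edges from class c to class d
    for u, v in edges:
        e[(label[u], label[v])] += 1.0 / m
    classes = {label[node] for edge in edges for node in edge}
    trace = sum(e[(c, c)] for c in classes)                    # within-class share
    a = {c: sum(e[(c, d)] for d in classes) for c in classes}  # edge origins
    b = {c: sum(e[(d, c)] for d in classes) for c in classes}  # edge ends
    expected = sum(a[c] * b[c] for c in classes)               # random expectation
    if expected == 1.0:
        raise ValueError("undefined when all edge ends belong to a single class")
    return (trace - expected) / (1.0 - expected)

# Hypothetical elite network: two triangles joined by a single edge.
label = {"o1": "off", "o2": "off", "o3": "off",
         "p1": "opp", "p2": "opp", "p3": "opp"}
edges = [("o1", "o2"), ("o2", "o3"), ("o3", "o1"),
         ("p1", "p2"), ("p2", "p3"), ("p3", "p1"),
         ("o1", "p1")]
r = attribute_assortativity(edges, label)
```

Here six of the seven edges stay within a pole, which gives r = 0.72; values such as those in Table 7.1 correspond to an even smaller share of cross-pole edges.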

To further understand the polarized structures of these networks, we present a visual-

ization of the three elite networks in the bottom row of Fig. 7.8. The nodes have been

colored according to the determined political affinity (red for the officialism and blue for the

opposition). It can be noticed that the larger the network (from left to right) the clearer and

more defined the poles are. This may also be noticed in the adjacency matrices represented

in the top row of Fig. 7.8. We have colored the edges to distinguish the interactions within

poles or between them. Red dots indicate an edge between two nodes from the officialism

block, blue indicates edges within the opposition block and pale yellow represents edges that

connect two different blocks. It can be noticed that the matrices present a clearly defined

block diagonal structure. This indicates that most of the blocks’ edges remain within the
same block (over 90% of edges in all cases), and that connections between blocks are scarce.

7.3.3 Estimating Opinions

In the present section, we will apply the opinion estimation model described in section
7.1 to each of the 56 daily retweet networks. The elite’s influence is defined by
a fixed opinion which depends on the political pole: Xs = −1 for the officialism pole and
Xs = +1 for the opposition pole. The rest of the nodes iteratively estimate their own
opinion Xi(t) by applying eq. 7.1 until convergence is reached ($|X_i(t) - X_i(t-1)| < 10^{-3}$).
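
The iteration can be sketched as follows. Eq. 7.1 is not reproduced in this section, so the sketch assumes that it is a neighbor-averaging rule in which every non-elite node takes the mean opinion of its neighbors at the previous step, starting from a neutral opinion of 0; this assumption should be checked against section 7.1. The function name, the undirected treatment of the edges and the toy network are also choices made for this example.

```python
from collections import defaultdict

def estimate_opinions(edges, seeds, tol=1e-3, max_iter=10_000):
    """Iterate opinions on an undirected network until every node changes by
    less than `tol`. `seeds` maps elite nodes to fixed opinions (-1 or +1)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    opinion = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(max_iter):
        updated = dict(opinion)
        for node, nbrs in neighbors.items():
            if node in seeds:
                continue   # elite opinions stay fixed
            # ASSUMED update rule: mean of the neighbors' previous opinions
            updated[node] = sum(opinion[nb] for nb in nbrs) / len(nbrs)
        change = max(abs(updated[n] - opinion[n]) for n in opinion)
        opinion = updated
        if change < tol:   # convergence criterion quoted in the text
            break
    return opinion

# Hypothetical chain: officialism seed - a - b - opposition seed.
edges = [("off_seed", "a"), ("a", "b"), ("b", "opp_seed")]
seeds = {"off_seed": -1.0, "opp_seed": +1.0}
opinion = estimate_opinions(edges, seeds)
```

Under this assumed rule the two listeners settle near −1/3 and +1/3, each leaning towards the closest seed.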

Figure 7.8: Adjacency matrices (top) and corresponding visualization (bottom) of the considered

elite networks. (A) Corresponds to the seed with Sin ≥ 10000 and ρ ≥ 0. (B) Corresponds to the

seed with Sin ≥ 1000 and ρ ≥ 0.89. (C) Corresponds to the seed with Sin ≥ 10 and ρ ≥ 0.82. Nodes

have been sorted in ascending order of their opinions Xs. The color indicates the average value

of the node’s opinions Xij at both sides of the edge i− j.

Fig. 7.9 presents a schematic of two possible networks and their expected outcomes. The

elite users have been represented as red and blue nodes in the networks of Fig. 7.9 A and

E. If polarization is present, like the case shown in the top row of Fig. 7.9, the network will

display a two island structure (Fig. 7.9 B), the adjacency matrix will display two diagonal

blocks of nodes well connected within, but segregated from each other (Fig. 7.9 C) and

the estimated opinions distribution will be bimodal (Fig. 7.9 D). Meanwhile, if there is no

polarization in the graph, like in the bottom row of Fig. 7.9, the network will present a single

island structure (Fig. 7.9 F), the adjacency matrix will display homogeneous connections

among nodes (Fig. 7.9 G) and the estimated opinion distribution will be monomodal (Fig.

7.9 H).

In order to show the model results more clearly, we have colored the edges of the adjacency
matrices in Fig. 7.9 C and G according to the average opinion of the two connected nodes
(i and j), defining the opinion adjacency matrix $A^X_{ij}$ in the following way:

$A^X_{ij} = \frac{X_i + X_j}{2}$ (7.11)

Red and blue dots represent edges between users of the same ideology, while pale blue

Figure 7.9: Visualization of two cases of possible retweet networks and expected outcomes. The

top row represents a polarized case and the bottom row represents a nonpolarized case. Panels

A and E show the position of the elite nodes, colored in each network. Panels B and F show

the respective networks, coloring the nodes with their estimated opinion. Panels C and G show

the opinion adjacency matrices AXij . The colored dots in the matrices represent interactions:

blue and red dots indicate interactions within the same group; pale blue and yellow dots indicate

interactions across groups. Nodes have been sorted in ascending order of their estimated

opinion Xi. Panels D and H represent the resulting opinion distributions.

and yellow dots represent interactions between nodes of different ideologies. In the polarized

case, the elite’s opinions will not mix, given the scarcity of inter-group connections, and
the resulting nodes’ opinions will gather at the extreme values. As a consequence, the matrix

will display two diagonal blocks, respectively colored in red and blue (see Fig. 7.9 C).

In contrast, in the depolarized case, the elite’s opinions will mix given the existence of
connections between the poles, and the nodes’ opinions will gather homogeneously around a
single value such as zero. Consequently, the adjacency matrix will display a larger number of

inter-ideological interactions, shown by the non-diagonal structure of yellow and pale blue

dots (see Fig. 7.9 G).

Obtaining Opinions and Measuring Polarization

The results of applying the model to the undirected versions of the retweet networks, using

the three sets of elite nodes presented in section 7.3.2, are shown in Fig. 7.10.
The three elites yield similar results. During the days preceding the

announcement (from D − 29 to D − 1), X presents a bimodal distribution in which the

officialism population (negative side of the X distribution) is considerably smaller than

the opposition (positive side of the X distribution). This means that during this period

the conversation was polarized, but predominantly monopolized by the opposition. Hence,

although the pole distance reached values over 0.9 (Fig. 7.11 B), the polarization index
averaged just under 0.4 (Fig. 7.11 C). Then a shift in the emergent patterns of the
conversation took place on the day of the President’s death announcement (day D). During
this day X loses its bimodal distribution, and the resulting p(X) has a single peak close to
neutral values, minimizing the pole distance. This means that the conversation was far less
polarized on that day, and the polarization index accordingly drops to µ ≈ 0.
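
The interplay between population asymmetry and pole distance can be illustrated numerically. The definitions of ∆A, d and µ are given in section 7.2 and are not repeated here; the sketch below assumes that ∆A is the absolute difference between the fractions of nodes with positive and negative opinions, that d is half the distance between the mean opinions of the two sides, and that µ = (1 − ∆A) d. These forms, the function name and the toy opinion lists are assumptions to be checked against section 7.2.

```python
def polarization_index(opinions):
    """Return (delta_A, d, mu) for a list of opinions in [-1, 1].

    ASSUMED definitions: delta_A = |A+ - A-|, d = |gc+ - gc-| / 2 and
    mu = (1 - delta_A) * d, where gc+ and gc- are the mean opinions of
    the positive and negative sides. Exactly neutral nodes are ignored.
    """
    negative = [x for x in opinions if x < 0]
    positive = [x for x in opinions if x > 0]
    if not negative or not positive:
        return 1.0, 0.0, 0.0   # convention: a single side means no polarization
    n = len(negative) + len(positive)
    delta_A = abs(len(positive) - len(negative)) / n
    gc_negative = sum(negative) / len(negative)
    gc_positive = sum(positive) / len(positive)
    d = abs(gc_positive - gc_negative) / 2.0
    return delta_A, d, (1.0 - delta_A) * d

# Hypothetical populations: same pole distance, different balance.
unbalanced = [-0.9] * 2 + [0.9] * 6   # one side three times larger
balanced = [-0.9] * 4 + [0.9] * 4
delta_u, d_u, mu_u = polarization_index(unbalanced)
delta_b, d_b, mu_b = polarization_index(balanced)
```

Under these assumed definitions, both toy populations have a pole distance of 0.9, but the unbalanced one yields µ = 0.45 while the balanced one yields µ = 0.9, which is the kind of contrast described above between a large pole distance and a moderate polarization index.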

The explanation for the change during day D is the abrupt growth of information cascades

when people react to critical events [BWB11]. The cascades interconnected the previously

segregated modules into a single-island structure many times bigger than the usual size

of the network. In addition, a large number of users from all around the globe joined the
conversation, making the topic international rather than local to Venezuela. During this
day the percentage of users tweeting from Venezuela (≈ 20%) was very low in comparison
to the rest of the days (around 80% on average). Hence, our sets of Venezuelan elite users
were not capable of polarizing this majority of worldwide users.

After day D, the conversation gradually recovers its bimodal distribution of opinions as

the conversation returns to primarily Venezuelan attention. Moreover, the polarization

reaches its maximum from day D+12 (marked with the dashed line in Fig. 7.11 C) onwards,

Figure 7.10: Time evolution of estimated opinions (Xi) probability density functions (p(X)) for

the Venezuelan conversation. These distributions respectively result from applying the model to

the retweet networks using the elites No. 1 (top panel), No. 2 (middle panel) and No. 3 (bottom

panel) described in section 7.3.2. Labels indicate the day of observation, D standing for the day of

the President’s death. Colors indicate the number of participants.

Figure 7.11: Time evolution of the polarization index µ (C), and the variables associated with

it: pole distance d (B) and the difference in population sizes (A) for the Venezuelan conversation

in the undirected version of the networks. The magenta line represents the average of the results

from applying the model with the three elite users from section 7.3.2. The gray shadow shows the

standard deviation. The orange stripe indicates the day of main event.

the day that the new leader of the officialism entered the conversation. The new leader
joined Twitter together with a large number of new participants from the officialism, which
brought the previously asymmetrical ∆A closer to zero. From this day onwards X presents a
bimodal distribution in which the populations of both sides are similar. Therefore, the
polarization index averages around 0.8.

We have also analyzed the opinion distributions in terms of their statistical properties:
the average, standard deviation and kurtosis. The mean value (Fig. 7.12 A) was positive
until the introduction of the new leader at the dashed line. This happened because the
opposition had a larger participation than the officialism, until both populations became
similar in size and the mean value dropped to zero. Accordingly, the standard deviation
(Fig. 7.12 B) fluctuated from its lowest point during the main announcement to its highest
values during the most polarized days. Finally, the kurtosis indicated bimodal behavior
(values below the horizontal dashed line in Fig. 7.12 C) for almost all days, with the
exception of the main announcement, when it showed a well defined positive value, indicating
a depolarized structure.
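
The use of kurtosis as a bimodality indicator can be checked on toy data. The sketch below computes the excess kurtosis (fourth standardized moment minus 3), assuming that this is the convention behind the dashed reference line in Fig. 7.12 C; the function name and the two opinion samples are hypothetical.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined for a constant sample")
    return m4 / m2 ** 2 - 3.0

# Hypothetical opinion samples.
polarized = [-1.0] * 5 + [1.0] * 5           # two poles at the extremes
depolarized = [0.0] * 8 + [-1.0, 1.0]        # single peak at neutral values
k_polarized = excess_kurtosis(polarized)
k_depolarized = excess_kurtosis(depolarized)
```

The two-pole sample gives an excess kurtosis of −2, the lowest value any distribution can take, while the sample peaked at zero gives a positive value.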

In order to further understand the relationship between the structure of the networks and

the opinions obtained, in Fig. 7.13 we present the time evolution of the opinion adjacency

matrices from the retweet networks. For this plot we have only considered the results from

elite No. 1 from section 7.3.2. We have represented the matrices as explained in Fig. 7.9
C and G. Nodes have been ordered according to their estimated opinion Xi and edges have
been drawn as colored dots, according to the value $A^X_{ij}$ defined in eq. 7.11.

Before the announcement (from D − 29 to D − 1) the matrices show well defined two-block
structures, where the blue block is larger than the red block. The inter-block connections
(pale yellow dots) are very scarce and thus the networks are polarized, although a single
group seems to monopolize the conversation due to its larger relative size. Then, during
the week of the main announcement (from D to D + 5), the matrix transitions from a fully
connected structure to a segregated one, by gradually reducing the inter-module connections
and increasing the number of internal edges in both modules. That stage represents the week
when the event gained international relevance and many outsiders joined the conversation.
The gradual decrease of such participation is reflected in the gradual unveiling of the
polarized core of the conversation. Finally, during the polarized days (from D + 13 to
D + 25), the matrix again shows the well defined two-block structure, where connections
within modules are abundant but connections across modules are scarce.

This shows that although the officialism pole remarkably increased its size

Figure 7.12: Time evolution of the statistical properties of the Xi distribution in terms of (A)

Average, (B) Standard deviation and (C) Kurtosis. The orange stripe represents the day of the

main occurrence (D) and the state funeral period. The magenta line represents the average of the

results from applying the model with the three elite users from section 7.3.2. The gray shadow

represents the standard deviation.

Figure 7.13: Time evolution of the opinion adjacency matrices AXij from the Twitter conversa-

tion about the Venezuelan President Hugo Chavez. Nodes have been plotted in ascendant order

according to their estimated opinion Xi. The label indicates the day of observation (from D − 29

to D + 26). The color indicates the average value of the node’s opinions at both sides of the edge

i− j.

during the last stage, the networks’ structure consistently showed very few inter-modular
interactions and remained polarized.

Effects of Rewiring Edges

In order to further understand the effects of the topological properties of the networks on
the resulting opinion distributions, we have applied the opinion estimation model to rewired
versions of the undirected retweet networks. To randomize the networks, we have rewired the
edges while preserving the nodes’ degrees, that is, we randomly exchanged edges between
nodes in order to create new network configurations. Our goal is to determine whether the
resulting opinion distributions merely reflect the effect of the elite on any random
network, or whether the actual networks show genuinely polarized structures around
the elite.
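
A common way to rewire edges while keeping every node's degree is the double edge swap: two edges (a, b) and (c, d) are picked at random and replaced by (a, d) and (c, b), unless this would create a self-loop or a duplicate edge. The sketch below is one possible implementation for a simple undirected network; it is offered as an illustration of the idea and is not necessarily the exact procedure used in this work. The function names, the number of swaps and the toy network are assumptions.

```python
import random
from collections import Counter

def rewire_preserving_degree(edges, n_swaps, seed=None):
    """Attempt `n_swaps` double edge swaps on a simple undirected edge list."""
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(edge) for edge in edge_list}
    done, tries = 0, 0
    while done < n_swaps and tries < 100 * n_swaps:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c          # random orientation: both pairings are possible
        if a == d or c == b:
            continue             # the swap would create a self-loop
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in edge_set or new_2 in edge_set:
            continue             # the swap would duplicate an existing edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edge_list):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edge_list for node in edge)

# Hypothetical toy network: a ring of 10 nodes with 5 chords.
edges = [(i, (i + 1) % 10) for i in range(10)] + [(i, i + 5) for i in range(5)]
rewired = rewire_preserving_degree(edges, n_swaps=30, seed=1)
```

Repeating the call with different seeds would produce an ensemble of randomized networks, to which the opinion estimation model can then be applied and the results averaged.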

The average results of applying the opinion estimation model to 200 rewired versions of

each of the retweet networks are presented in Fig. 7.14 with dashed black lines, together with

the corresponding results from the original networks in solid green lines. The cumulative
opinion distributions from the rewired networks present a single, smoother increase near
the neutral opinion Xi = 0. This means that if edges are randomly redistributed among
nodes, the polarization in the network is lost and the resulting networks present
single-island structures. This effect is noticeable when we compare the opinion distributions

from the rewired networks with the original behavior during the most polarized days (from

D + 12 onwards). The curves show a remarkably different behavior. This means that the

way that nodes are connected in these polarized structures is far from being the result of a

random configuration. Instead, such differences indicate the existence of strong correlations

and conditioning in the user behavior. In contrast, on day D, both the original and rewired
versions of the network give the same opinion distribution. Such similarity indicates
that the user interactions on this day were not conditioned by political preference, but
rather occurred as if the nodes interacted independently and randomly.

7.3.4 Contagion by Influence

The retweet mechanism is directed by nature. The edge direction is related to the influence
that one user exerts on another. Therefore, in order to unveil the actual contagion by
influence, we will apply the model to the same networks, but considering the direction of
the edges. In this way, nodes will only propagate their opinions to those whom they directly
influenced, that is, to those who retweeted their messages.

Figure 7.14: Effects of rewiring edges on the results of the opinion estimation model. Time
evolution of the cumulative distribution functions (CDF) of the estimated opinions (Xi)
resulting from applying the opinion estimation model to the undirected networks (solid) and
the corresponding rewired versions (dashed). Columns are ordered from Monday to Sunday. The
labels indicate the corresponding day of observation, from D − 29 to D + 26, D being the day
of the President’s death announcement. The distributions for the rewired networks represent
the average over 200 realizations. These curves correspond to the results from applying the
model with the elite No. 3 described in section 7.3.2.

Figure 7.15: Time evolution of the estimated opinions (Xi) probability density functions (p(X))

for the Venezuelan conversation. Labels indicate the day of observation, D standing for the day of

the President’s death. Colors indicate the number of participants. These curves are the average of

the results from applying the model with the three elite users from section 7.3.2.

The resulting opinion distributions, obtained by averaging the results from the three

elites presented in Table 7.1, are shown in Fig. 7.15. Almost all distributions behave
similarly to those previously obtained without considering the direction of the edges in
the networks (see Fig. 7.10). Moreover, the new distributions are more strongly polarized,
since they present a more clearly defined bimodal shape. Even during the days when the
undirected results indicated single-island structures (D + 1 to D + 2), in the directed
case we see two peaks, one at each extreme of the distribution. This is reflected in the
polarization index, which is generally higher than in the undirected case, reaching almost
0.9 at the most polarized stage (from D + 12 onwards in Fig. 7.16 C). Similarly, the pole
distance d (Fig. 7.16 B) is much closer to 1 than in the undirected case, indicating that
people’s opinions are separated by nearly the maximum distance.

In order to compare the results from applying the model to both kinds of networks, we
present in Fig. 7.17 the time evolution of the cumulative distribution functions (CDF) of
the nodes’ Xi, resulting from the opinion estimation model on the directed network (solid)
and undirected network (dashed). The color indicates the kurtosis of the distributions,
which is negative for polarized, bimodal distributions (red curves) and positive for
depolarized, unimodal distributions (from yellow to blue curves). If the network is
polarized, the CDF displays two sudden increases near the extreme values, and practically
no increase in the central values (see D − 29). In contrast, if the network is not
polarized, the CDF only displays a single, continuous

Figure 7.16: Time evolution of the polarization index µ (C), and the variables associated with it:

the pole distance d (B) and the difference in population sizes (A) for the Venezuelan conversation.

The magenta line represents the average of the results from applying the model with the three elite

users from section 7.3.2. The gray shadow shows the standard deviation.

and smoother growth (see D).

The patterns of the CDF are very similar for the majority of days in both kinds of
networks. However, the distributions from the undirected version of the networks present a
smoother growth than those from the directed version, even when the network is polarized
(see D − 26 or D − 15). This means that the polarization of the participants as a whole is
lower than the polarization of those users directly influenced by the opinion leaders.
This is especially noticeable during the week of the death announcement (from D to D + 5).
During these days, the apparently depolarized networks contain a highly polarized
sub-network, directly influenced by the elite nodes. Therefore, the networks in general
present a highly polarized baseline embedded in the unconditioned popular interactions.

In summary, if we consider the users that are directly influenced by the elite, we see that

polarization is much stronger in the network, defining a polarized social baseline. However, if

we consider the whole network, we see that the emergent polarization is lower and sometimes

nonexistent. Therefore, in order to detect those users who are influenced the most by the

elite, we must consider the direction of the edges.

7.3.5 Offline Polarization

So far we have shown a strong polarization around Venezuelan online political discussions

on Twitter. To further understand the basis of such online polarization, in this section we

will explain the relationship between the Twitter activity and the polarization present in
Venezuelan society as a whole. First, we will discuss the results of the elections called
after the President’s death. Second, we will show the territorial impact of the Venezuelan
polarization on social media.

Electoral Polarization

After the President died on March 5th 2013, new elections were called in Venezuela. In

these elections, the candidate from the officialism (50.6%) together with the candidate from

the opposition parties (49.1%) gathered over 99.7% of votes. This shows the high degree

of political polarization in the Venezuelan electorate. Moreover, it confirms that polarized

societies leave little space for moderate voices, as independent candidates only gathered

the remaining 0.3% of votes. Yet according to recent polls [Hin13], Venezuelan citizens not
identified with any party represent about 25% of the population, which suggests that
polarization leads to an over-representation of the most powerful groups.

Figure 7.17: Effects of the edges' direction on the results of the opinion estimation model. Time
evolution of the cumulative distribution functions (CDF) of the estimated opinion (Xi) resulting
from the opinion estimation model on the directed network (solid) and the undirected network
(dashed). The labels indicate the day of observation, from D − 29 to D + 26, where D is the day
of the President's death announcement. Columns are ordered from Monday to Sunday. The color
indicates the kurtosis of the distributions. These curves are the average of the results from
applying the model with the three elite users from section 7.3.2.

Figure 7.18: Electoral polarization in Venezuela. Distribution of voting stations according to the
winning party and the location of the station, in the 2013 Venezuelan presidential elections.

In Fig. 7.18 we present the way that votes for the officialism and the opposition are distributed
among the population. More specifically, we show the relative number of voting stations
where the officialism (red) or the opposition (blue) won, according to the geographical
location of the voting station. The location is an indirect measure of socioeconomic level,
since we are able to classify voting stations in the following way:

• Rural: Mostly poor inland villages [IFfAD09].

• Urban informal: Informal settlements in cities, or slums [UH03].

• Undefined: Urban areas that may or may not be considered slums.

• Urban formal: Formal urban neighbourhoods, from the middle class upwards.

• Abroad: Venezuelan emigrants voting at consulates and embassies, who tend to belong to
higher classes [Fre11].

We see that there is a strong correlation between the voting patterns and the economic level
of the voters: the officialism wins widely at the voting stations placed in poorer areas
(located on the left side), while the opposition gets stronger as we consider wealthier
regions (located on the right side).

This result shows how political support in Venezuela is almost entirely captured by
the two major options, which find their voters in a mutually exclusive way. The voting
preferences appear aligned with social class. Of course, as Fidel Castro famously said to
Hugo Chavez after the latter lost the 2007 Referendum, "there are not 4 million oligarchs in
Venezuela", which means that the opposition also finds space in the poorer areas. In fact, the
disproportionate rejection that the officialism gets in the wealthiest regions has
been reported to be stronger than the disproportionate support it receives from
lower classes [Lup10], which is also noticeable in Fig. 7.18.

Territorial Polarization

To further understand the relationship between our findings on Twitter and the electoral
results, in this section we explore the territorial distribution of the analyzed
interactions. More specifically, we analyze where these messages were posted in the capital
city of Venezuela, Caracas, taking into account only the tweets from the most polarized days
presented in section 7.3.3. In Fig. 7.19 we present the map of the five municipalities that
make up the city, bordered in green. The labels correspond to the municipality names and
their color indicates the party of the respective mayor: the officialism in Libertador and
the opposition in Chacao, Sucre, Baruta and El Hatillo, according to the 2013 Venezuelan
local elections.

In the map we have colored the urbanized areas in yellow and the informally populated
regions (slums) in pink. The contour lines represent the location of the mass of messages
identified with each ideology. It can be noticed that these contours correspond to the
electoral results: the municipalities governed by the opposition contain the highest
concentration of users identified with this pole, and the same effect occurs on the officialism
side of the political spectrum. Moreover, the area with the highest concentration of users
aligned with the officialism corresponds to the part of the city with the largest concentration
of informal and poorer neighborhoods (pink areas), while the opposition users are
concentrated in the region of highest formal urban development.

This result shows that the political conflict in Venezuela has a strong territorial
facet. Territorial segregation is related to the degree of intolerance of people towards
coexisting with those who are different [Sch71]. The consequences of such territorial
polarization have been reported to be highly harmful for city life [GG03], as public spaces
become political insignia and free circulation is affected by the fear of being identified as
an opponent. As a result, the city loses its role as a place of social encounter and gives way
to a warlike language, where spaces are no longer democratic, but territories of the parties
to a conflict.

7.3.6 Discussion

Venezuela has shown considerable evidence of polarization in multiple social dimensions.
The political and electoral polarization is accompanied by a strong class and territorial
polarization. These types of polarization are well reflected in the Twitter activity, which is
distributed accordingly in geographical and socioeconomic terms. These social arrangements
are not isolated from each other; instead, there is a strong relationship between them.

The poorer informal neighborhoods emerged in Caracas, and in most of Latin America,
during the twentieth century, due to migrations from rural to urban areas [Gal73].
Migrants were looking for employment and a better life, which they did not always find, and
over time the social gap between people living in the same city increased up to astonishing
levels [Ber97]. For instance, in Rio de Janeiro, Brazil, some neighbourhoods present
Human Development Indexes (HDI) similar to those of Northern Europe, while others, a few
kilometres away, show values equivalent to those of Sub-Saharan Africa¹.

It is known that the larger the income gap is, the stronger the resulting political
polarization [MPR02]. In Venezuela, however, several other kinds of social segregation
processes took place at the same time, increasing the divergence of people's criteria. The
consequent conditioning of the inhabitants due to HDI differences turned the society into two
well differentiated populations, even with territorial borders. According to many authors,
this social segregation served as the basis for the political polarization catalyzed by the
late President Hugo Chavez [EH02].

7.4 Summary

In this chapter, we have proposed a methodology to detect political polarization in social
networks. The methodology consists of a contagion model to infer people's opinions and a
new index to measure the degree of polarization in the opinions obtained. We applied this
methodology to detect polarization in user interactions on the online social network Twitter,
around a conversation of political interest: the announcement of the sudden death
of a nation's President in office. We found that the conversation was polarized due to the
influence of an elite of opinion leaders.

¹ Tabela N 1172, http://portalgeo.rio.rj.gov.br/

Figure 7.19: Mass of tweets in the city of Caracas. Contour levels (from inside to outside: 0.25,
0.20, 0.15, 0.10) represent the mass of tweets identified as in favor of the government (red) and
against it (blue). Areas bordered in green correspond to the five municipalities that make up the
city. White regions display unpopulated areas, yellow regions represent populated areas and pink
regions correspond to the informal and poorer neighborhoods (slums). The label color indicates the
ruling party in each municipality, according to the 2013 Venezuelan local elections: red represents
the officialism party in Libertador and blue indicates opposition parties in Chacao, Sucre, Baruta
and El Hatillo.

Chapter 8

URBAN COLLECTIVE PATTERNS

In this chapter we explore urban dynamical patterns around the world. We analyze
geolocated Twitter activity to characterize the cyclical behavior of urban routines. We find
that urban rhythms can be classified into three kinds of behavior, determined by the
combination of morning and afternoon activity.

Recent studies have found that individual activities combine into regular cycles of
collective behavior [CGW+08, PSR12]. Such patterns of collective behavior are also found in the
biological activity of living organisms, like heartbeats or respiration. This synchrony is not
simply due to external factors like light and dark, or to biological factors like circadian
rhythms. It arises out of complex relationships and fulfills a particular function in society,
with great economic and social benefits. Our economic system is based upon the contributions
of multiple workers: the completion of tasks within a given time frame depends upon the
availability of other workers, either simultaneously or in the correct sequence [VDAVH04].

The functioning of complex systems, like human societies, depends not only upon the
functionalities of their members but also upon the coordination of people's actions. Many
important societal aspects, such as economic activities, could not develop if individuals
behaved independently from each other. Although people seem to behave randomly and
unpredictably, this does not mean that their actions are independent from each other.
Collective activities can only take place when there are interdependencies among the
individual actions. Such interdependencies condition people's decisions and diminish
individuals' freedom of will, in order to favor the system's ability to gain capabilities as a
whole.

Figure 8.1: World Twitter Activity. Geographical density of Twitter activity (number of tweets)
during one average day, in logarithmic scale. Red and orange indicate a high concentration of
activity, while blue and green indicate a lower concentration of tweets, and black indicates the
absence of activity. Insets: Average week of Twitter activity in several cities ($a_{c,d}(t)$).

8.1 World Activity

We first analyzed the geographical distribution of the world activity (see Video B.3,
described in Appendix B). We built a map of a representative day of activity by
averaging the number of geolocated tweets across latitudes and longitudes. For this purpose,
we defined a matrix $T_{ij}$ that aggregates the geolocations of the messages, per hour, in a
grid of cells of 0.25 × 0.25 degrees. We map the coordinates
$(\mathrm{lon}_m, \mathrm{lat}_m)$ of each message $m$ to the indexes $i, j$ of the matrix as:

\begin{align}
i &= \lfloor 4(\mathrm{lon}_m + 180) \rfloor \tag{8.1a} \\
j &= \lfloor 4(\mathrm{lat}_m + 90) \rfloor \tag{8.1b}
\end{align}

where $\lfloor \cdot \rfloor$ represents the floor function. Then, we count all the tweets that
are mapped to each cell $(i, j)$.
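As a minimal sketch of Eq. 8.1 (not the code used in this work), the mapping from coordinates to grid cells could be written as follows. The names `grid_index` and `count_tweets`, the sparse dictionary used in place of the full matrix, and the sample coordinates are illustrative assumptions.

```python
import math
from collections import Counter

CELLS_PER_DEGREE = 4  # cells of 0.25 degrees, as implied by the factor 4 in Eq. 8.1


def grid_index(lon, lat):
    """Map a (lon, lat) pair in degrees to the grid indexes (i, j) of Eq. 8.1."""
    i = math.floor(CELLS_PER_DEGREE * (lon + 180.0))
    j = math.floor(CELLS_PER_DEGREE * (lat + 90.0))
    return i, j


def count_tweets(coordinates):
    """Count the messages falling in each cell (a sparse stand-in for T_ij)."""
    return Counter(grid_index(lon, lat) for lon, lat in coordinates)


# Hypothetical sample: two messages in the same cell and one in a distant cell.
sample = [(-3.70, 40.42), (-3.68, 40.45), (-66.90, 10.48)]
counts = count_tweets(sample)
```

Only the cells that receive at least one message are stored, which is convenient because most of the 1440 × 720 cells of such a grid are empty.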

We aggregated the tweets for each week $w$ of the observation period and each day $d$ of
the week, and built a matrix $T_{ij,d,w}(t)$ for each hour $t$ of the day. Then, we aggregated
all the hourly grids into daily grids, $T_{ij,d,w} = \sum_t T_{ij,d,w}(t)$, that contain the
activity of each day of the observation period. Finally, we averaged across all days and weeks
of the observation period and built an average daily grid, $T'_{ij}$, in the following way:

\begin{equation}
T'_{ij} = \frac{1}{W} \frac{1}{D} \sum_{w} \sum_{d} T_{ij,d,w} \tag{8.2}
\end{equation}

where $W$ is the total number of weeks in the observation period and $D$ is the number
of days in each week (7).
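The two aggregation steps can be illustrated with a short sketch. It assumes that each grid is stored as a sparse dictionary `{(i, j): count}`; the function names and the toy data are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict


def daily_grid(hourly_grids):
    """Sum the hourly grids T_ij,d,w(t) of one day into a daily grid T_ij,d,w."""
    total = defaultdict(float)
    for grid in hourly_grids:  # one sparse grid {(i, j): count} per hour t
        for cell, count in grid.items():
            total[cell] += count
    return dict(total)


def average_daily_grid(daily_grids, n_weeks, n_days=7):
    """Average the daily grids over all weeks and days of the week (Eq. 8.2)."""
    total = defaultdict(float)
    for grid in daily_grids:  # one daily grid per (d, w) pair
        for cell, count in grid.items():
            total[cell] += count
    return {cell: value / (n_weeks * n_days) for cell, value in total.items()}


# Toy data: a day with two active hours, repeated over 2 weeks of 7 days.
day = daily_grid([{(705, 521): 2}, {(705, 521): 3, (452, 401): 1}])
average = average_daily_grid([day] * 14, n_weeks=2)
```

Because every day in the toy data is identical, the average daily grid equals the single daily grid.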

In Fig. 8.1 we show the resulting geographical density of tweets during the average day.
Red and orange regions indicate a high concentration of activity, while blue and green regions
indicate a lower concentration of tweets. Black regions indicate the absence of activity. It
can be noticed that Twitter is not homogeneously used across the world. Regions like the
Americas, Europe, the Middle East and South-East Asia seem to concentrate many more Twitter
users than countries like China or India, which present much less activity than expected
for their large populations. Moreover, we can notice the different demographic densities.
For instance, in the US, vast black regions in the west of the country coexist with
densely populated red regions in the east. That effect is also noticeable in Europe, where
the west is much more active than the east, as well as in Korea, where north and south
present remarkable differences. In fact, the red spots indicate the presence of active large
and medium-sized cities. Next, we analyze some of these cities by aggregating their localized
behavior into temporal series.

8.2 Urban Dynamics

We have analyzed the dynamics of 52 major cities across the world by looking at the variation
of the number of tweets per hour. For this purpose, we built a temporal series representing
an average week of Twitter activity per city $c$: $a_{c,d}(t)$. An average week is composed of
representative days $d$ (from 1 to 7), each of which is composed of hours $t$ (from 0 to 23).
In order to build $a_{c,d}(t)$, we first determined the grid cells that cover the city,
according to the city coordinates and Eq. 8.1. Then we sequentially collected the number of
tweets in the selected cells and built a temporal series of tweets per hour, $n_{c,d,w}(t)$,
where $w$ indexes the observed weeks ($W$ in total). The number of tweets $n_{c,d,w}(t)$
from city $c$, in hour $t$ of day $d$ and week $w$, was normalized according to:

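\begin{equation}
n'_{c,d,w}(t) = \frac{n_{c,d,w}(t) - \langle n_{c,d,w}(t) \rangle}{\sigma(n_{c,d,w}(t))} \tag{8.3}
\end{equation}

where $\langle n_{c,d,w}(t) \rangle = (1/24) \sum_t n_{c,d,w}(t)$ is the average and
$\sigma(n_{c,d,w}(t))$ is the standard deviation. The Twitter activity of the representative
week, of seven representative days $d$, was given by:

\begin{equation}
a_{c,d}(t) = \frac{1}{W} \sum_{w} n'_{c,d,w}(t) \tag{8.4}
\end{equation}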
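Eqs. 8.3 and 8.4 amount to standardizing the 24 hourly counts of each day and then averaging the same weekday over all weeks. The sketch below illustrates this for one city and one weekday. The function names and the counts are made up, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import fmean, pstdev


def normalize_day(counts):
    """Standardize the 24 hourly counts of one day (Eq. 8.3)."""
    mean = fmean(counts)
    std = pstdev(counts)  # assumption: population standard deviation
    return [(n - mean) / std for n in counts]  # a constant series (std = 0) is not handled


def representative_day(days_by_week):
    """Average the normalized series of the same weekday over all weeks (Eq. 8.4)."""
    normalized = [normalize_day(day) for day in days_by_week]
    n_weeks = len(normalized)
    return [sum(values) / n_weeks for values in zip(*normalized)]


# Made-up hourly counts for one weekday; the second week has three times more tweets.
week1 = [1, 1, 1, 1, 1, 1, 2, 5, 8, 9, 9, 10, 12, 11, 10, 10, 11, 13, 14, 15, 14, 10, 6, 3]
week2 = [3 * n for n in week1]
profile = representative_day([week1, week2])
```

The normalization removes differences in overall volume, so both weeks contribute the same shape to the representative day even though their tweet counts differ by a factor of three.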

In Fig. 8.2 we show the temporal behavior of all these cities; some of them are also
shown as insets in Fig. 8.1. It can be noticed that all series cycle between valleys and peaks
of activity during weekdays. The valleys of activity occur in the early morning hours, when
most people are sleeping, while the peaks of activity occur during the day, either in the
morning or in the afternoon, when people go to work or return home. Depending on the height
of these peaks, we have identified different kinds of behavior. Some cities, like New York
City or Jakarta, display a single large peak (green curves). Other cities, like Sao Paulo or
Santiago, show several small peaks of activity during the morning before a large peak in the
afternoon (blue curves). Finally, cities like London or Moscow display two peaks of activity
of similar size (yellow curves).

8.3 Dynamical Classes of Behavior

In order to further understand the dynamical patterns of the cities, we applied clustering
and multidimensional scaling algorithms to the temporal series. Specifically, we applied
the k-means algorithm in order to find clusters of cities' temporal series [Mac67]. For this
purpose, we interpret each hour of the temporal series as an independent dimension, so that
each city represents a single point in a multidimensional space (24 × 7 = 168 dimensions). The
clustering algorithm groups cities that behave similarly, and are thus closer to each other,
and separates them from those that behave differently, and are thus farther apart. In order to
find the best number of clusters, we calculated the silhouette profile [Rou87] and found that
it is maximized at 3 clusters.
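The selection of the number of clusters can be sketched as follows. This is an illustrative, standard-library implementation of Lloyd's k-means iterations and of the mean silhouette value, not the code used in this work. The deterministic farthest-point initialization, the function names and the two-dimensional toy points (standing in for the 168-dimensional city series) are all assumptions made for the example.

```python
import math


def farthest_point_init(points, k):
    """Pick k initial centers deterministically (maximin rule, an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(axis) / len(members) for axis in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_number_of_clusters(points, candidates):
    """Return the k with the highest mean silhouette, plus all the scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy data: three well separated groups of four points each.
toy = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 0), (10, 1), (11, 0), (11, 1),
       (0, 20), (0, 21), (1, 20), (1, 21)]
best_k, scores = best_number_of_clusters(toy, range(2, 5))
```

On the toy data the silhouette is highest for three clusters, mirroring the criterion described above. With real series, a randomized initialization repeated several times would be a more common choice than the deterministic rule used here.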

The average behavior of each of the three clusters is shown in the top panel of Fig. 8.3.
Colors correspond to the clustering results. The difference between the three classes is due
to the combination of morning and afternoon peaks, which are marked with red symbols.
Concretely, we found the following behaviors:

1. The third class (Fig. 8.3 A) presents two large peaks of similar size: one in the
morning (red x symbol) and another one in the afternoon (red circle).

2. The second class (Fig. 8.3 B) presents a medium-sized peak in the morning (red
square), followed by a very large peak in the afternoon (red x symbol).

3. The first class (Fig. 8.3 C) presents an almost imperceptible small peak in the morning
(red square) and a very large peak in the afternoon (red x symbol).

In order to visualize these clusters, we performed a dimensionality reduction based
on multidimensional scaling (MDS) [BG05]. The results are shown in the bottom panel of
Fig. 8.3. The MDS algorithm projects the points from the multidimensional space onto a
two-dimensional one, while preserving the distances between the elements as closely as
possible. The new dimensions do not necessarily have a physical meaning. However, we interpret
them as the modality of the daily pattern (x-axis) and its symmetry (y-axis). The cluster on
the left (green) is highly symmetrical and presents a single peak, while the cluster on the
right (yellow) is symmetrical and presents two peaks. The third cluster (blue) is not
symmetrical and presents a larger afternoon peak than morning peak.

It is remarkable that these clusters share cultural and regional affinity. If we look at
the series in the insets of Fig. 8.1 and in Fig. 8.2, we can see that the clustering
results (shown by the colors) are related to geography and culture. For instance, most
European and African cities are in the yellow cluster, while North American and East Asian
cities are in the green cluster, and the blue cluster mainly corresponds to South American
cities.

8.4 Summary

In summary, we have seen that the Twitter activity of urban areas has a pulsing behavior,
due to the cycles of work, recreation and sleep. We found that there are three classes of
behavior, based on the combination of morning and afternoon peaks of activity.

Figure 8.2: Temporal behavior of 52 cities across all continents. Each series shows the
representative week of Twitter activity of a city ($a_{c,d}(t)$). Color indicates the result of
the clustering classifier.

Figure 8.3: Clustering of cities according to their temporal behavior. Colors indicate the results
of the k-means clustering algorithm. Axes correspond to the dimensions collapsed by the
multidimensional scaling algorithm. In the top panel we show the average behavior of each class
(from A to C). We have marked the morning and afternoon peaks of activity with a red x symbol
and a circle, respectively.

Chapter 9

INFERRING HUMAN BEHAVIOR

FROM MOBILE PHONE DATA

The analysis of human data exhaust to improve social well-being is a very timely subject that
has attracted the attention of many researchers, as well as governmental and international
organizations, over the last years. In countries with limited economic resources, these
sources of information represent opportunities to gain intelligence about their social systems
without the need to deploy expensive fieldwork. For instance, mobile phone data, or
Call Detail Records (CDR), proved to be an accurate source of data to estimate human
migrations after the cholera outbreaks in Haiti in 2010 [BLT+11]. In Kenya, a similar
approach may remarkably reduce the spread of infectious diseases like malaria [WET+12]
by identifying sources and sinks of human displacements. Furthermore, recent studies using
CDR data have shown the ability to measure the impact of earthquakes on communication
patterns [BLT+11, MFMFM13] and to build predictive models of potential areas of disruption
following an earthquake [KEH10]. These studies are very important, since their results may
benefit a large part of the human population by improving the efficacy and
efficiency of governmental processes of strategic planning.

In this chapter, we infer human behavioral patterns from CDR data. We first study the

communication patterns in a developing country, by looking into how regional areas interact

with each other [MCB+ss, MCB+13]. Then, we explore the potential of CDR analysis, in

order to measure the impact of natural disasters on people’s behavior. For this purpose, we

develop a framework to combine CDR data with other data sources, in order to characterize

communication patterns and to detect abnormal variations in the usual behavior [PMT+14].

9.1 Characterizing Communication and Mobility Patterns in a Developing Country

In this section, we analyze mobile phone data to understand the structure of regional and

ethnic interactions in Ivory Coast [MCB+ss, MCB+13]. We construct and analyze complex

social networks at several layers of interactions, such as calling activity and human mobility.

We show the role of underlying forces, like culture or economy, that influence and determine

the Ivorian regional and national communication patterns.

9.1.1 Context

In recent decades, African countries have gone through several armed conflicts among
different ethnic and religious groups. The borders arbitrarily traced by Europeans for the
administrative convenience of the former colonial order split and joined ethnic groups into new
countries, forcing them to coexist within previously nonexistent frontiers. Asymmetries in
economic and geographical benefits between different ethnic groups have led some countries
to different levels of social polarization, which have eventually resulted in civil wars. Recent
studies have shown that violence emerges between ethnic groups when their territories are
not well defined [ML07], or when a group is large enough to prevail over others,
but not strong enough to maintain order. Ivory Coast is no exception.
In less than two decades the Ivorians have engaged in two internal armed conflicts, due
to asymmetries between the country's inhabitants. Therefore, the characterization and
understanding of their ethnic relationships is crucial to consolidate peace and to strengthen
the social cohesion needed for any further economic development.

Ivory Coast presents a complex society composed of more than 60 different ethnic groups.
Although French is the official language and is broadly spoken across the country, each
ethnic group has its own native language. These many and diverse languages are classified
into four large linguistic families: Kwa, Kru, Mande and Gur [Lew09]. The territories of
these four linguistic families are well defined in the four quadrants of the country, as shown
in Fig. 9.1.

In summary, the Kwa group is located in the southeast of the country. This is
the most economically developed region, where the capital city and other major cities are
located, as well as the main Ivorian airport and seaport. The Kru group is located in the
southwest, also on the Atlantic coast. The second seaport of Ivory Coast is located in
this region, which brings economic benefits to its people. The Mande group is found in
the northeast of the country, and the northwest region is occupied by the Gur family.
The northern regions occupied by the Mande and Gur groups are the least populated
and least economically developed areas of the country.

Figure 9.1: Ethno-linguistic map of Ivory Coast. Figure adapted from [Lew09].

Network                Nodes    Edges        Density    Clustering
Calls Network          1,215    1,284,311    0.87       0.95
Trajectories Network   1,215    187,102      0.13       0.58

Table 9.1: Properties of the Calls and Human Trajectories Networks.

9.1.2 Characterizing Populated Areas

In order to characterize populated areas in Ivory Coast, we studied the structure of the human
trajectories network at the meso-scale level. This network displays people's mobility
patterns within a given territory. It is built by aggregating individual trajectories.
Each trajectory is defined as the time-ordered sequence of antennas that served a particular
user. Antennas represent nodes, and an edge is created between two antennas, i and j,
if a user makes two consecutive calls, first from antenna i and later from antenna j. The
edges are directed, from i to j, and weighted according to the number of times that the same
trajectory was performed, over all users. The resulting network has 1,215 nodes and 187,102
edges. It is a sparse network with a high clustering coefficient (see Table 9.1). A
visualization of the dynamical growth of this graph during an arbitrary day is presented in
Video B.4, described in Appendix B.
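The construction of the trajectories network can be sketched as follows. The record layout `(user, timestamp, antenna)`, the function names and the sample calls are hypothetical. Skipping consecutive calls made from the same antenna is also an assumption, since the text does not say how such cases were treated. The density is computed as the fraction of possible directed edges, which is consistent with the values in Table 9.1.

```python
from collections import Counter


def trajectory_edges(calls):
    """Build directed, weighted edges (i, j) from call records.

    `calls` is an iterable of (user, timestamp, antenna) tuples.
    """
    by_user = {}
    for user, _, antenna in sorted(calls, key=lambda c: (c[0], c[1])):
        by_user.setdefault(user, []).append(antenna)
    edges = Counter()
    for antennas in by_user.values():
        for origin, destination in zip(antennas, antennas[1:]):
            if origin != destination:  # assumption: no self-loops
                edges[(origin, destination)] += 1
    return edges


def density(n_nodes, n_edges):
    """Fraction of the possible directed edges (without self-loops) that exist."""
    return n_edges / (n_nodes * (n_nodes - 1))


# Hypothetical records: (user, timestamp, antenna).
calls = [("u1", 1, "A"), ("u1", 2, "B"), ("u1", 3, "B"), ("u1", 4, "C"),
         ("u2", 5, "B"), ("u2", 1, "A")]
edges = trajectory_edges(calls)
```

With the figures reported above, `density(1215, 187102)` rounds to 0.13, the value listed for the trajectories network in Table 9.1.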

By applying the community detection algorithm based on modularity optimization [BGLL08],
we found that the trajectories network can be partitioned into 100 network communities, which
are shown in Fig. 9.2 together with the map of Ivory Coast. Each community covers a
limited territorial area, not necessarily contained within the same regional borders, and is
related to urban and rural settlements. It can be noticed that there is a larger density of
antennas and communities in the south of the country, while in the north scattered
antennas form only a few communities. This difference in the density of antennas and
communities is consistent with demographic information that reports the south of Ivory
Coast as more densely populated.

The density of edges also displays the same structure. A snapshot of the trajectories
network is presented in Fig. 9.3. The nodes are located at the antennas' geographical
coordinates and edges are colored in blue. The width of each edge is proportional to its
weight, which means that the most intense edges represent the most frequently used
trajectories. The main cities (black circles) and southern regions concentrate a larger
number of edges than the north, indicating a remarkable difference in the amount
of human displacement between the two regions. Apart from demographic density, these
patterns also result from the underlying infrastructure and economic activity. In Fig. 9.3
we have superimposed the main roads of the country in red. Most trajectories show
a remarkable correspondence with the available roads. Some roads seem to be more frequently
used, like the ones linking the north with the south of the country, while others are less
frequently used, like the transverse road in the north. The fact that some infrastructures
are more frequently used than others may be because the region with the most activity
shown in Fig. 9.3 corresponds to the zone of cocoa plantations. Ivory Coast is the largest
cocoa producer in the world, with 36% of the global share [ICO12].

Figure 9.2: Mapping the community structure of the trajectories network of Ivory Coast.
Antennas represent nodes and are plotted in different colors and shapes, according to the
community to which they belong, as obtained from the community detection algorithm.

Figure 9.3: Mapping the structure of the trajectories network on the geographical map of
Ivory Coast. The blue lines represent the edges of the network and their width is proportional
to the edge weight. The main roads of Ivory Coast are superimposed as red lines. The locations
of the country's main cities are marked with black circles.

Figure 9.4: Mapping the closeness centrality of the trajectories network in Ivory Coast.
The edges have been colored according to the mean closeness centrality of the two connected
nodes. Red regions indicate higher closeness centrality, yellow and pale blue regions indicate
medium centrality, and dark blue regions indicate lower closeness centrality.

Figure 9.5: Mapping the linguistic identity of the trajectories network of Ivory Coast. The edges
have been colored according to the linguistic group of the most connected antenna of each
community. There are four major linguistic families, represented in yellow (northwest),
purple (northeast), green (southwest) and blue (southeast). Black circles indicate the location of
the major cities.

It has been stated that the economic development of large regions can be characterized
and understood by means of cellphone activity patterns [EMC10]. Accordingly, in this
study we have analyzed the closeness centrality of the antennas in the trajectories
network. This network property is inversely proportional to the average distance, in terms of
connections, from a node to the rest of the network. It provides information about the
central or peripheral behavior of nodes or regions according to all human displacements.
In Fig. 9.4 we present the trajectories network, coloring the edges according to the mean
closeness centrality of the antennas they connect. Red regions are highly central, yellow and
pale blue regions are intermediate, and dark blue regions are peripheral. It can be noticed
that the most central area (red) corresponds to the main city and its adjoining regions, while
the most peripheral regions are located in the north and west (blue). This is in agreement
with international reports [ECd08] that identify the north and the west of the country as
the least developed areas.
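As an illustration of this measure, the sketch below computes the closeness centrality of a node as the inverse of its mean shortest-path length, counted in hops, to the nodes it can reach, and then averages the value of the two endpoints of an edge, as done for the coloring of Fig. 9.4. Ignoring the edge weights and following only outgoing edges are assumptions of this example; the variant used in the original analysis may differ. The three-antenna graph is hypothetical.

```python
from collections import deque


def closeness_centrality(adjacency, source):
    """Inverse of the mean hop distance from `source` to the nodes it reaches."""
    distances = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over outgoing edges
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    reached = len(distances) - 1
    if reached == 0:
        return 0.0  # isolated node: no distances to average
    return reached / sum(distances.values())


def edge_centrality(adjacency, i, j):
    """Mean closeness centrality of the two antennas connected by an edge."""
    return (closeness_centrality(adjacency, i) + closeness_centrality(adjacency, j)) / 2


# Hypothetical chain of three antennas with trajectories in both directions.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
central = closeness_centrality(chain, "B")
peripheral = closeness_centrality(chain, "A")
```

In the chain, the middle antenna reaches every other node in one hop and obtains the highest value, while the antennas at the ends are more peripheral.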

9.1.3 Ethnic Interactions

In order to understand the ethnic composition of this graph, we have taken into account
the ethnic and linguistic identity of each network community. For this purpose, we mapped
each community to its geographically closest ethnic group, according to the location of the
community's most connected antenna and the ethno-linguistic map shown in Fig. 9.1. In
Fig. 9.5 we present the trajectories network, coloring the edges according to the linguistic
family. It can be seen that the most densely connected areas, like the capital city or the
cities in the center of the country (black circles), concentrate links from different
linguistic areas, while most regions mainly present trajectories within their own linguistic
family.

After mapping the ethnic groups, we constructed a second network that takes into account
a new layer of interaction: the antenna-to-antenna calling information (see
section 4.3.1). In this network, the nodes represent the 100 communities found in the
trajectories network (see section 9.1.2), whose ethnic identity is already known. The edges
correspond to the calls made from one community to another. The edge direction
goes from the emitter community to the receiver community, and the weight is equal to
the number of calls found in the datasets.

In order to get a clearer view of the way that ethnic groups communicate with each other,
we present in Fig. 9.6 A the weighted adjacency matrix of the ethnic groups calling network,
normalized by row. This normalization provides relative information about the destinations
of the calls made by each group. The diagonal entries of the matrix are
higher than the other elements, indicating that most outgoing calls remain in the same
community. In fact, the preference of people to communicate with similar ones increases with
the scale of aggregation. When we aggregate the communities by ethnic group and linguistic
family (Fig. 9.6 B and C), the assortativity coefficient [New03a] of each matrix increases from
r ∼ 0.5 to r ∼ 0.8 (Fig. 9.6 D), where r = 1 is the case of absolute segregation. This increase
indicates that the segregation between ethnic groups is higher when we consider their
linguistic family.
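Both operations can be illustrated with a short sketch. The assortativity coefficient is written here in its standard form for mixing by a discrete characteristic, $r = (\sum_i e_{ii} - \sum_i a_i b_i) / (1 - \sum_i a_i b_i)$, where $e_{ij}$ is the fraction of calls going from group $i$ to group $j$, and $a_i$ and $b_i$ are the row and column sums of $e$. Whether the original computation used the raw call counts, as assumed below, or the normalized matrix is not stated in the text. The function names and the toy matrices are hypothetical.

```python
def row_normalize(matrix):
    """Divide every row by its sum, so that each row gives destination shares."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def assortativity(matrix):
    """Assortativity coefficient of a square matrix of call counts between groups."""
    total = sum(sum(row) for row in matrix)
    e = [[value / total for value in row] for row in matrix]
    size = len(e)
    a = [sum(e[i]) for i in range(size)]                           # outgoing share per group
    b = [sum(e[i][j] for i in range(size)) for j in range(size)]   # incoming share per group
    trace = sum(e[i][i] for i in range(size))
    expected = sum(a[i] * b[i] for i in range(size))
    return (trace - expected) / (1 - expected)


# Toy call counts between two groups (rows: emitter, columns: receiver).
mixed = [[8, 2], [2, 8]]
shares = row_normalize(mixed)
r_mixed = assortativity(mixed)
r_segregated = assortativity([[10, 0], [0, 5]])
r_random = assortativity([[5, 5], [5, 5]])
```

In the toy matrices, calls that never leave their group give r = 1, calls that ignore group membership give r = 0, and the mixed case falls in between.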

Moreover, not all families behave the same way. The southern families (number 1 and 2

in Fig. 9.6 C) present a larger proportion of calls directed to their own linguistic family, in

comparison to the northern families (number 3 and 4 in Fig. 9.6 C), whose activity directed

to other linguistic families is relatively larger. In Fig. 9.7, we present the intra-family

flux (calls directed to the same linguistic family) and inter-family flux (calls directed to a

different linguistic family) of calls. In the figure the symbols represent communities from

the trajectories network and the color corresponds to the linguistic family they belong to.

The further the community is located below the dashed line of slope 1, the higher the family

internal traffic in comparison to the external traffic. Most of the southern ethnic groups

(blue and green dots) are farther from the diagonal line than the northern ones (yellow and

red dots). This means that the internal traffic in southern ethnic groups is much higher

than their external one, while in northern families the external traffic is comparable to

the internal one.
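
The intra-family and inter-family fluxes compared in Fig. 9.7 can be sketched as follows. This is a minimal example under assumed inputs: the tuple layout of the call list, the community labels and the family mapping are all hypothetical.

```python
from collections import defaultdict


def family_fluxes(calls, family_of):
    """Split the outgoing calls of each community into intra- and inter-family flux.

    calls: iterable of (source_community, target_community, number_of_calls).
    family_of: dict mapping each community to its linguistic family.
    """
    intra = defaultdict(int)
    inter = defaultdict(int)
    for source, target, n_calls in calls:
        if family_of[source] == family_of[target]:
            intra[source] += n_calls
        else:
            inter[source] += n_calls
    return dict(intra), dict(inter)


# Hypothetical toy data: three communities in two families.
family_of = {"c1": "south", "c2": "south", "c3": "north"}
calls = [("c1", "c2", 50), ("c1", "c3", 5), ("c3", "c1", 20), ("c3", "c3", 25)]
intra, inter = family_fluxes(calls, family_of)
```

In this toy case the internal traffic of community c1 is ten times its external traffic, whereas for c3 both fluxes are comparable.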

The external calling traffic from the northern ethnic groups is directed selectively towards


Figure 9.6: Normalized adjacency matrices of the calls network corresponding to the community

structure from the trajectories network (A), ethnic group aggregation (B) and linguistic family

aggregation (C). Assortativity coefficient of selectiveness to call on local scale (community), subre-

gional scale (ethnic group) and regional scale (linguistic family) (D).


Figure 9.7: Scatter plot of intra linguistic family flux (calls directed to an antenna in the same

linguistic family as the emitter antenna) versus inter linguistic family flux (calls directed to an

antenna in a different linguistic family than the emitter antenna). Symbols represent communities

from the trajectories network and the color indicates the linguistic family to which the community

belongs. The dashed line has slope 1.


their adjoining southern families. In Fig. 9.6 C, we see that the families 1 and 4 are more densely

connected among themselves than with the rest of families. The same happens with families

2 and 3, which are also more connected among themselves than with the rest of families.

Such observation is in good agreement with the mobility patterns shown in Fig. 9.3, where

the vertical roads seem to have a higher significance than the horizontal ones; as well as with

the patterns shown in Fig. 9.5, where we showed that the mobility of the northern families

to the south is stronger with the adjoining regions.

9.1.4 Effects of Selectiveness in the Calling Behavior

To further understand the selectiveness in the communication patterns between the east and

west sides of the country, we built a third network taking into account another layer of social

interactions. Specifically we built a network from the calling behavior at the tower level,

extracting only information from the first dataset described in section 4.3.1. The nodes in

this network also represent single antennas, and an edge is created from the antenna i to the

antenna j, when a user that is being served by the antenna i makes a call to another user

who is served by the antenna j. The result is a directed and weighted network, where

the weight of the edges represents the total number of calls made from the antenna i to the

antenna j along the whole observation period. It is almost a fully connected network with

an extremely high clustering coefficient (see Table 9.1).
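
The construction of this directed, weighted network can be sketched as below. The record layout (one pair of serving antennas per call) and the antenna identifiers are assumptions made for the example; the actual dataset format is the one described in section 4.3.1.

```python
from collections import Counter


def build_calls_network(records):
    """Return the edge weights of the calls network.

    records: iterable of (caller_antenna, callee_antenna), one entry per call.
    The weight of the edge (i, j) is the total number of calls from antenna i to antenna j.
    """
    weights = Counter()
    for caller_antenna, callee_antenna in records:
        weights[(caller_antenna, callee_antenna)] += 1
    return weights


def out_strength(weights):
    """Total number of calls placed from each antenna."""
    strength = Counter()
    for (source, _target), weight in weights.items():
        strength[source] += weight
    return strength


# Hypothetical call records over the whole observation period.
records = [("a1", "a2"), ("a1", "a2"), ("a2", "a1"), ("a1", "a3")]
weights = build_calls_network(records)
strength = out_strength(weights)
```

The edge (a1, a2) and the edge (a2, a1) are kept separately, since the direction goes from the emitter antenna to the receiver antenna.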

The calls network is composed of over 19 communities of antennas according to the

modularity optimization algorithm [BGLL08]. The distribution of these communities along

the geography of Ivory Coast is shown in Fig. 9.8. The communities show a relationship with

administrative areas marked with gray lines, although in some cases these human borders

do not correspond to the political ones. We present an animation with the dynamics

of this network in the Video B.5 described in the Appendix B, together with a visualization

of the influence that each of the 19 communities has on the others.

To capture the influence that each community has on the rest of antennas from the

network, we analyzed the density of calls directed to the given communities from the rest

of antennas. To quantify such preference, we have measured the density of calls between

communities and classified them using a k-means clustering algorithm [Mac67]. The results

are presented in Fig. 9.9, where we have plotted the antennas with different colors, according

to the classifier results. We found that the country is divided between the east side and west

side of the map, as previously suggested by Fig. 9.6 C.
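
The classification step can be illustrated with a small, self-contained version of Lloyd's k-means iteration. This is only a sketch under stated assumptions: each antenna is described by the share of its calls directed to each community, the number of clusters is set to two (suggested by the east and west division, but not stated in the text), and the initial centroids are chosen by hand so that the result is deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init_indices, iterations=20):
    """Plain Lloyd iteration: assign each point to the nearest centroid, then update centroids."""
    centroids = [list(points[i]) for i in init_indices]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        for c in range(k):
            members = [points[i] for i, label in enumerate(labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


# Hypothetical features: share of each antenna's calls directed to two groups of communities.
call_density = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = kmeans(call_density, k=2, init_indices=[0, 2])
```

Antennas that direct most of their calls to the same communities end up with the same label, which is the information plotted with colors in Fig. 9.9.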


Figure 9.8: Mapping the community structure of the calls network of Ivory Coast. Antennas

represent nodes and are plotted in different colors and shapes, according to the community they

belong to, as obtained from the community detection algorithm.

Figure 9.9: Mapping the classification results of antennas according to the way the calls net-

work communities are related. A k-means clustering classifier has been applied to the community

structure of the calls network.


9.1.5 Summary

In summary, we have characterized the interactions and resulting structure of the diverse

geographical and social areas of Ivory Coast. We found that on a local and subregional scale,

the ethno-linguistic factor determines the interaction patterns, while on a wider scale, the

available infrastructure and economic factors have a major influence on the social dynamics.

As a result the Ivorian communication map is organized in two interacting regions located

at the east and west side of the country. On each side the northern ethnic groups seem to

be influenced by the southern ethnic groups. This study shows how CDR data can be used

to understand the social composition of societies and the way that cultural exchange takes

place. It also reveals that the peripheral and poorer communities seem to be more influenced

by the wealthier ones than otherwise. Given the recent history of violence in Ivory Coast,

these studies could help to identify whether the conditions are set for social unrest.

9.2 Flooding through the Lens of Mobile Phone Activity

In this section, we explore the potential of analyzing CDR data for characterizing the re-

action of populations to natural disasters, using the Tabasco, Mexico floods in 2009 as a

case study. To this end, we develop a multimodal data integration framework that facil-

itates combining CDR data with other data sources (remote sensing, rainfall activity,

census and civil protection information) in order to quantitatively characterize changes in

communication patterns during the floods [PMT+14]. The ultimate goal is to contribute to

the development of real-time decision-support tools based on CDR data, in order for gov-

ernments, international organizations and humanitarian actors to enhance their responses.

Natural disasters such as floods or earthquakes affect hundreds of millions of people

worldwide every year1. Effectiveness of humanitarian response is limited, in part, by the lack

of timely and accurate information about the patterns of movement and communication of

the affected population. Specifically, there is a need for dynamic in-situ information across

the event timeline: a baseline for understanding the previous and usual behavior, real-time

measurements of the behavior during the disaster, and the capacity to track return to normal

behavioral patterns during the recovery phase.

1EM-DAT database: http://emdat.be/disaster-trends


Figure 9.10: Left: Visualization of the precipitation data obtained from the NASA TRMM on

November 2nd, 2009. The red square encloses the observed region. Right: Accumulated rainfall

during the first two weeks of November 2009 (jet colormap) over the Tabasco area. The flood

segmentation is shown by the white shade. The area corresponds to the red square in the left panel.

9.2.1 Context

The state of Tabasco is located to the south of the Gulf of Mexico, covering 24,738 km²

(1.3% of the total national area). Due to its location and topographical features, Tabasco is

subject to frequent flooding events, such as those that occurred in 2007, 2008 and 2009. On

28th October 2009, a cold front (Nr. 9) entered northwest Mexico and reached Tabasco on

the 31st, where it remained for four days. It rained intensely until November the 3rd over

the west of Tabasco, within the Tonala basin. The National Meteorological Service (SMN)

recorded 800 mm of accumulated rain in three days, four times the regular accumulated rain level

for November. In Fig. 9.10 we present a visualization of the precipitation data obtained

from the NASA TRMM on November 2nd, 2009, together with the accumulated rainfall

during the first two weeks of November, 2009 in the region of Tabasco. The rainfall levels

in the right panel have been colored from the highest (red) to the lowest values (blue). The

flood segmentation generated from the Landsat-7 images is shown as a white shade.

As the Tonala basin lacks hydraulic infrastructure for controlling river floods, the rain

water flowed freely to the coastal plains, causing flooding. The greatest damage occurred in

the Huimanguillo and Cardenas municipalities. On November the 3rd, after the heavy rain,

the state of emergency was declared in Huimanguillo and Cardenas. Response activities were

coordinated by Civil Protection and the system for Integral Development of Families (DIF),

with contributions from other state and federal entities, such as the Federal Preventive


Figure 9.11: Left: map of 2010 census (green bars) vs CDRs based population estimation (purple

bars) in several cities of Tabasco (red=affected cities, blue=other cities) and surroundings. Right:

The plot shows the linear correlation between the CDR-based population estimate and the real census (R² = 0.97).

Police and the National Water Commission (CONAGUA). On November the 11th, a state

of emergency was declared in Comalcalco, Cunduacan, and Paraiso municipalities.

In January 2010, the National Center for Disaster Prevention (CENAPRED) carried

out a mission to assess the damage caused by the floods, together with the Planificacion

State Secretariat and Civil Protection. They interviewed over 16 state and federal agents

in charge of coordinating recovery actions. CENAPRED collected all the information and

compiled a report on the impact of the floods. According to the report, in economic terms,

the total losses in the state of Tabasco reached 190 million USD, 50% of which were due to

damage to road infrastructure; 16% were related to productive activities (agriculture and

ranching); and 7% of losses corresponded to social damage (dwelling, health, education).

The floods also had a significant emotional and psychological impact on people's lives. The

CENAPRED report states that the total human, social and economic losses caused by the

2007, 2008 and 2009 stationary floods highlight the vulnerability of Tabasco to such natural

events. Furthermore, this recurring situation hinders the state from achieving total recovery

after each disaster. Hence it is recommended that resources be invested in designing and

implementing mitigation plans and prevention actions rather than in covering post-event

costs.


9.2.2 Assessing the Representativeness of CDR data

We considered a subset of the CDRs provided by the Spanish company Telefonica2 comprising

only those mobile users (social baseline) who made calls from Tabasco during the month

prior to the onset of the reported floods on November 1st, 2009 (baseline period). In order

to evaluate how representative these data are of the real population of Tabasco, we have

compared the population distribution derived from the CDR data with the 2010 census of

Tabasco, used as the ground truth.

The social baseline has been characterized by assigning the home antenna tower (HAT)

for each user, meaning the antenna tower most used at night during the baseline (BL) period

[BCH+13]. The number of users per city (or administrative boundary) was inferred by cross-

referencing the users' HATs with the GADM database. We then compared the 2010 census

information with the CDR population estimation for the main cities of the regions affected by

the 2009 floods: Cardenas, Huimanguillo, Paraiso, Comalcalco, Cunduacan and other nearby

cities (see Fig. 9.11). Results showed a linear relation between both variables with a relative

homogeneity of the telecom penetration in the affected region of around 20%. Hence,

this analysis provides preliminary results that support the assumption of a homogeneous

representativeness of communication activity and mobility patterns extracted from CDRs in

the affected cities.
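
The home antenna assignment and the comparison with the census can be sketched as follows. The record layout, the definition of night hours and all the numbers are assumptions made for the example; in the study the cities are obtained by cross-referencing with the GADM database, which is replaced here by a hypothetical dictionary.

```python
from collections import Counter, defaultdict

# Assumption: "night" is taken as 20:00-06:59; the text does not state the exact hours.
NIGHT_HOURS = set(range(20, 24)) | set(range(0, 7))


def home_antennas(records):
    """Assign to each user the antenna most used at night during the baseline period.

    records: iterable of (user_id, antenna_id, hour_of_day).
    """
    night_counts = defaultdict(Counter)
    for user, antenna, hour in records:
        if hour in NIGHT_HOURS:
            night_counts[user][antenna] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in night_counts.items()}


def users_per_city(hat, city_of_antenna):
    """Count users per city from their home antenna tower (HAT)."""
    return Counter(city_of_antenna[antenna] for antenna in hat.values())


def penetration(cdr_users, census):
    """Ratio of CDR-estimated users to census population for each city."""
    return {city: cdr_users.get(city, 0) / census[city] for city in census}


# Hypothetical toy data.
records = [("u1", "A", 22), ("u1", "A", 23), ("u1", "B", 12), ("u2", "B", 2), ("u3", "B", 3)]
city_of_antenna = {"A": "Cardenas", "B": "Huimanguillo"}
toy_census = {"Cardenas": 5, "Huimanguillo": 10}

hat = home_antennas(records)
cdr_users = users_per_city(hat, city_of_antenna)
ratios = penetration(cdr_users, toy_census)
```

A penetration ratio that is similar across cities, as in this toy case, is what supports the assumption of a homogeneous representativeness.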

9.2.3 Population Response to Floods

For the analysis, the CDR data of the baseline has been aggregated by day and by antenna

to understand how the floods modulated the normal communication patterns observed at

the antenna level. In particular, we measured the number of users placing or receiving calls

in each antenna and for each day. We refer to this raw measurement as the antenna com-

munication activity x(t) (see Fig. 9.12). To detect abnormalities in this activity, we propose

the antenna variation metric, which relies on comparing x(t) against its characteristic

variation obtained during the baseline period. Mathematically, the antenna variation metric,

xnorm(t), is defined as the z-score from x(t) referred to the normal distribution characterizing

the baseline pattern as follows:

\[ x_{norm}(t) = \frac{|x(t) - \mu_{BL}|}{\sigma_{BL}} \tag{9.1} \]

where the average and standard deviation (µBL, σBL) statistically characterize the activ-

ity during the BL period (the month before the flooding onset). A graphical scheme of this

2www.telefonica.com/


Figure 9.12: Time evolution of the number of unique users per cell tower x(t). The gray stripes

indicate the Flood and Christmas periods where stronger variations are observed. The labels at

the top-right of each chart indicate the municipality where the tower is located. Towers have been

ordered and colored according to the maximum degree of variation during floods in decreasing

order.


Figure 9.13: Scheme of the Antenna Variation metric for cell towers. The black curve represents

the raw signal x(t). The gray stripe indicates the Flood period. The red line indicates the average

value (µBL) of users served during the Baseline period. The pink stripe indicates the standard

deviation (σBL) from the average value during the Baseline period. The blue line indicates the

deviation from the average value on a given day. Our measure of antenna variation results from the

ratio of the blue line to the standard deviation (σBL).


Figure 9.14: Time evolution of the Antenna Variation metric (xnorm) for the considered towers.

The gray stripes indicate the Flood and Xmas periods. Color is proportional to the degree of

variation during the flooding period. It can be noticed that antennas have a spike of activity

during the floods (left shaded region), as well as during Christmas and New Year's Eve.


Figure 9.15: Impact Map of Tabasco for the 2009 floods. Circles represent antennas and their size

is proportional to the variation metric during the floods. The dark blue segmentation represents

the flooded region. The color of municipalities is proportional to the number of affected people.

The map shows the most critical day featuring the highest values of the antenna variation metric.


measure is presented in Fig. 9.13. A static z-score has been previously used to characterize

calling behaviors in large scale time sensitive emergency events like bombings, earthquakes

or brief storms [BWB11]. Here, we have computed xnorm(t) from the beginning of the BL

period until the end of January (approx. 2 months after the rainfall ended), generating time

series of this z-score for the antennas in the affected areas.
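
A minimal sketch of the antenna variation metric of Eq. 9.1 for a single antenna is given below. It assumes that the daily activity is stored as a list whose first entries correspond to the baseline period, and it uses the population standard deviation; the text does not specify which estimator was used, so this choice is an assumption, as are the toy values.

```python
from statistics import mean, pstdev


def antenna_variation(activity, baseline_days):
    """Return x_norm(t) = |x(t) - mu_BL| / sigma_BL for every day of the series.

    activity: daily number of users served by the antenna, starting with the baseline period.
    baseline_days: number of initial days that form the baseline (BL) period.
    """
    baseline = activity[:baseline_days]
    mu_bl = mean(baseline)
    sigma_bl = pstdev(baseline)  # assumption: population standard deviation
    if sigma_bl == 0:
        raise ValueError("baseline activity is constant, the metric is undefined")
    return [abs(x - mu_bl) / sigma_bl for x in activity]


# Hypothetical toy series: four baseline days followed by one anomalous day.
toy_activity = [10, 12, 8, 10, 30]
x_norm = antenna_variation(toy_activity, baseline_days=4)
```

In this toy series the last day deviates from the baseline average by about 14 baseline standard deviations, which is the kind of spike highlighted in Fig. 9.14.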

In Fig. 9.14 we present the temporal evolution of the antenna variation metric xnorm(t)

(derived from the CDRs) at all towers. Series have been colored according to their maximum

variation during the Floods (gray shaded region at the left). It can be noticed that some

antennas display an extremely high variation during the floods, up to 25 times higher than

their usual variations. These antennas are located in the most affected areas. The spatial dis-

tribution of the maximum value of the antenna variation metric is shown in an impact map

(see Fig. 9.15) that combines the metric with other contextual indicators: the municipalities

have been colored according to the official number of affected population and the segmen-

tation of the flooded area. The impact map is consistent with our ground truth evidence

(flood segmentation and civil protection records), since the antenna activity spikes in the

most affected municipalities: Cardenas and Huimanguillo. Furthermore, we also present the

daily variations of the antennas along the observation period in the Video B.5 described in

the Appendix B.

During the floods, the distribution of the maximum in the antenna variation metric is

wider than the BL period distribution, featuring more antennas with a higher variation metric

(see Fig. 9.16). The real-time nature of mobile phone signals allows us to compare so-

cial patterns against their modulating factors. Here, we compare the proposed metric with

rainfall levels. These precipitation levels are obtained from the NASA TRMM project's day-

resolution estimates of the rainfall. The six hottest antennas that also feature different

metric profiles have been taken to observe the rainfall levels at the antenna level (see Fig.

9.17 Top). As shown, the typical delay between the maximum level of precipitations and the

peak in the variations of the hot antenna indicator is 4 days. One possible explanation is

that a population might not react in a way that alters the communication activity globally

even under extreme climatological conditions. Instead, the response captured in the commu-

nication activity could have occurred due to the initial flooding effects, after the rivers and

water reserves overflowed around November 5th and 6th, as reported by several news sources.
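
The delay between the rainfall maximum and the peak of the antenna variation metric can be estimated, for one antenna, as the difference between the days on which each daily series reaches its maximum. The sketch below assumes that both series are aligned day by day; the values are invented for illustration.

```python
def peak_delay(rainfall, variation):
    """Days between the maximum of the rainfall series and the maximum of the metric.

    Both lists must be aligned so that index t refers to the same day.
    """
    rain_peak_day = max(range(len(rainfall)), key=rainfall.__getitem__)
    metric_peak_day = max(range(len(variation)), key=variation.__getitem__)
    return metric_peak_day - rain_peak_day


# Hypothetical daily series for one antenna.
toy_rainfall = [0, 5, 40, 10, 2, 1, 0, 0]
toy_variation = [1, 1, 2, 3, 5, 9, 20, 4]
delay = peak_delay(toy_rainfall, toy_variation)
```

A positive delay means that the communication activity peaks after the rainfall does.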

The civil protection warning was issued on the day of maximum precipitations (Novem-

ber 3rd). It would be expected that this warning would result in a spike in communications

activity, but this reaction can only be observed in two antennas located along Federal Road

180D that eventually suffered an outage (see Fig. 9.17 Bottom). These sudden variations


Figure 9.16: Distribution of the maximum of the antenna variation metric for the BL period

(gray) and floods (red). The curves show the percentage of antennas (y-axis) whose maximum

variation metric value (xnorm) is higher than a given value (x-axis).


and the following outage may indicate the point of the highest rain impact, likely caus-

ing a severe traffic jam on 180D. The increase of the antenna occupancy time due to the

jam would eventually generate the shown communication activity peaks (although further

analysis would be required).

On the other hand, the maximum of the antenna variation in the antennas with higher

population happens on November 6th when the rain was already vanishing. Several sources

also raised the estimates of the affected population from 50,000 to 100,000 people that

day. Thus, the hypothesis would be that for gradual-onset disasters (due to a cumulative

effect of some potential factor), the proposed metric might provide an estimation of the

population's awareness and subsequent reaction rather than a means to detect the onset of

the event. The delayed spike in antenna variation in this case may indicate that while the

civil protection warning did not produce a sufficient level of awareness in the population,

the initial consequences of the flooding did.

9.2.4 Summary

In summary, we have proposed a methodology based on integrated analysis of CDRs with

several data sources, including remote sensing imagery and rainfall information. We tested

the representativeness of the CDR data observing a homogeneous penetration of mobile

phones in the affected cities. We found abnormal communication activity that could be

used to measure the impact of the disaster. The population's reaction (in terms of increased

communication) took place when the emergency was declared, rather than during the previ-

ous alert stage, as would be expected. This could be an indicator of the skepticism or lack of awareness

of the population regarding the heightened risk of floods. If this is the case, a systematic

study of the reasons for such behavior is recommended, since lack of awareness of a hazard

implies an increase in vulnerability to its effects.


Figure 9.17: Top: Antenna variation metric (red) vs the precipitation level (blue) for the six

hottest antennas (A to F). The dashed line shows the emergency warning date as notified in the

news. Bottom: Map featuring the position and date (e.g. 6N is 6th November) where the maximum

of the antenna variation metric was observed.


Chapter 10

Conclusions

In this thesis, we have shown that several societal processes can be understood by analyzing

the data derived from people’s interactions with electronic media together with the mathe-

matical and computational tools from complexity science. We have proposed methodologies

to treat large volumes of unstructured data, resulting from human activity on social media

and through mobile phones. We have been able to unveil people’s collective behavior and to

retrieve structural and dynamical information about the underlying social systems from the

raw data.

Next, we present the conclusions obtained from our studies:

1. We developed methods to characterize and understand the social systems’ structure,

functioning and time evolution. To this end, we abstracted the systems as complex

networks and analyzed the evolution of their properties. We have applied this analysis

to several Twitter conversations during different events, finding similar patterns across

diverse contexts.

(a) We have shown that the user activity distributions on several Twitter conversa-

tions typically scale as fat-tailed distributions, truncated by individual con-

straints and physical limitations. This means that the conversations are usually fed

by a small group of very active persons, while the large majority of users hardly

participates.

(b) During events, we have identified that the temporal behavior of the collective

activity is explosive and bursty. Most of the related information is posted during

the most critical hours of the event, when the topic captures the interest of the

majority of participants. We have shown that bursts present very similar shapes,


independently of the number of users and messages, which may span across several

orders of magnitude.

(c) We have shown that the user interactions on Twitter can be well defined by two

networks associated to the mechanisms provided by this online service for users

to receive and forward information. Both networks are directed and the direction of

the edges indicates the flow of attention and information.

i. One network emerges from the followers mechanism. In this network nodes

are linked by who receives whose messages, and its structure displays the

social substratum where information may flow during conversations.

ii. The other network emerges from the retweet mechanism. In this graph, users

are linked according to who forwarded (or retweeted) whose content and repre-

sents the information diffusion graph where messages actually traveled during

conversations.

(d) We have found that both followers and retweet networks present complex prop-

erties. The degree and strength distributions follow power laws in most cases,

where the distributions resulting from the aggregation of collective behaviors present

a broader tail than the distributions emerging from individual actions. Besides,

the average shortest path between nodes turns out to be very small, since the few

hubs that connect most of the networks gather an extremely significant amount

of connections.

(e) We have shown that the directed assortativity of these networks varies according

to the direction considered. In general, the out-in relationship is disassortative,

meaning that non-popular accounts usually target their edges to popular accounts,

like selecting them as sources of information or to propagate their messages across

the network. Meanwhile, the out-out relationship is positively assortative, mean-

ing that the active users are linked among each other.

(f) We showed that the retweet mechanism can also be understood as information

cascades taking place on the followers network. We found that the size distribution

of cascades decays as a power law, indicating that while most cascades hardly

include more than a couple of participants, a few cascades are much larger.

We have determined that the probability that a cascade grows decays exponentially

as it moves farther from the original message source, in agreement with previous

works.

(g) In the mesoscale, we found that people are organized around influential accounts


from different collectives, like journalists, politicians or traditional media. The

followers network presents larger and denser communities, while the retweet net-

work presents smaller communities with fewer edges. In the communities of the

followers network, the most central users are very popular and usually influence

the emergence of smaller retransmission communities due to the propagation of

their content.

(h) We have shown that people are more selective when it comes to taking an active

part in the conversation, like retransmitting a message, rather than just passively

participating, like receiving and reading information from other sources.

(i) Moreover, our results indicate that although the online social media seem to be a

purely social phenomenon, traditional media agents still enjoy a lot of power and

influence over people, whom they use to boost and enhance their messages.

2. The characterization achieved of the social systems allowed us to understand the way

users interact and influence each other during events and conversations. Based on the

networks’ structure, we have classified users, as system’s elements, according to their

relationship with the environment and their role in the collective functioning.

(a) We have shown that there are three types of user behavior that determine the

dynamics of the information flow: Information Producers, Active Consumers and

Passive Consumers.

i. Information producers represent a very small group of highly influential users

who dominate the collective attention and catalyze the information diffusion

process. These users cause a lot of activity inside the network while posting a small

number of messages.

ii. Active consumers usually retransmit a large number of messages, gaining

influence in proportion to their activity. These users act like social

bridges delivering messages from other people to their own sub-networks.

iii. Finally, passive consumers are those who hardly participate: they neither retransmit mes-

sages nor get retransmitted at all. These users represent the large majority

of the population while their activity represents less than half of the stream

of messages.

3. We have introduced a new measure of influence in the network called user efficiency,

defined as the ratio of retransmissions gained to messages posted. We have

also proposed a computational model to explain the distributions of user efficiency and


to explore the effects of the underlying network’s topological properties and the way

users post messages. We show that users can compensate their topological deficits

by means of modifying their behavior in order to be influential in the conversations.

However, this process is very costly for the user.

(a) We found that the user efficiency distribution follows a lognormal distribution with

a fatter tail than expected, due to the effects of the extremely connected hubs.

On average, most of the users who get retweeted gain as many retransmissions as

messages posted. However, a minority of them, occupying a privileged position

in the followers network, accomplish a very high level of retransmission with little

effort.

(b) We showed that the user efficiency distribution is universal across several Twitter

conversations. We demonstrated that the same distribution emerges from several

conversations of diverse nature and cultural context, whose sizes in users and

participants varied across several orders of magnitude.

(c) The user efficiency distributions have been explained by modeling the underlying

rules of the message spreading process by means of a computational model, based

on independent cascades taking place on the followers networks. The cascades are

biased in order to decay their probability of growth as the message travels farther

from the original source.

(d) The developed computational model revealed that the emergence of a small frac-

tion of highly efficient users results from the heterogeneity of the underlying net-

work, rather than the differences in the individual user behavior. Therefore, the

changes in the activity behavior are not significant if the underlying network

presents a scale-free structure.

(e) When considering homogeneous networks, we have shown that the retransmissions

gained by users are mainly proportional to their activity, meaning that there is

not an influential set of highly efficient users in this kind of graph. In fact, a

homogeneously organized society would need a much larger population to find

the same level of efficiency to diffuse information that we get by complex and

heterogeneous organization.

(f) Our results show that regular users can compensate their topological deficits by

means of change in their behavior. However, since the activity must be increased

in a very costly and even unaffordable way, such enhancement would be achieved

far less efficiently than the users with high connectivity.


(g) We conclude that although individuals may have remarkable psychological and

contextual differences, the dynamical patterns are due to simple and universal

interaction mechanisms.

4. We have proposed a methodology to infer the degree of polarization in social inter-

actions. The methodology consists of a polarization index and a model to estimate

opinions in social networks. We have illustrated how to apply this methodology by de-

tecting and measuring the polarization on a Twitter conversation related to the recent

death of the former Venezuelan president Hugo Chavez.

(a) We have introduced a new way to measure and quantify the degree of polarization

of a social group based on the concepts of physics and inspired by the electric

dipole moment. We have shown that the polarization of two equally populated

groups depends on how distant their views are, just like the electric dipole moment

increases with the distance between the charges.

(b) We have shown that the opinions of a large number of participants on Twitter

conversations can be inferred with a social contagion model, in which a minority of

influential individuals (called the elite) propagate their opinions through the emergent

retweet networks.

(c) Our methodology can detect different degrees of polarization, depending on the

structure of the network. If the network is polarized around the elite, then we are

able to detect a two islands structure. Instead, if the network is not polarized,

then we appreciate a single island structure.

(d) We applied this methodology to a Twitter conversation regarding the death an-

nouncement of the former Venezuelan president Hugo Chavez. We found that

the polarization degree varied according to developing external events. Based on

these results, we have identified the following periods:

i. Before the main announcement, we found the networks to be polarized around

the two political poles. However, the polarization index did not present

maximum values since one pole was larger than the other.

ii. During the main announcement, we found the conversation to have no polar-

ization. We found single island structures with a remarkable participation of

international users.

iii. After the main announcement, the polarization emerged in the conversation

again and the networks showed two island structures. At this stage, both

poles reached similar sizes and the polarization index presented maximum

values.

(e) The Venezuelan elite were not capable of polarizing the network when the
conversation stopped being local to Venezuela and became international. The
more international users we detected, the lower the polarization degree we found.

(f) However, by applying the model allowing the flow of information only in the

direction of who-influences-who, we found a social baseline that presented a higher

degree of polarization across the whole conversation.

(g) We contrasted our results against offline data, such as municipality governments

or socioeconomic factors, finding a good correlation between the online and offline

polarization.

(h) We have shown that a minority of elite users were able to influence the whole

online social network, resulting in a highly politically polarized conversation. This
means that most users are exposed to opinions they already favor, and
cross-ideological interactions hardly occur.

5. We have also analyzed the temporal behavior of Twitter aggregated activity in urban

areas across the world. We characterized the kinetics of Twitter activity from over 50

cities by building time series of average behavior. We have shown that cities can
be classified into three classes of behavior according to combinations of morning and
afternoon activity.

(a) We found that cities present a collective cyclic behavior, due to daily routines and

collective activities. This behavior consists of periodic minima of activity during

the early morning and peaks of activity during the daytime.

(b) We have identified three classes of dynamical behavior, based on morning and

afternoon activity.

i. One class presents two peaks of similar size: one before noon and another

before night. We showed that most of these cities are located in Europe,

Middle East and Africa.

ii. Another class presents two peaks of different sizes: a smaller one before noon
and a larger one before night. We showed that most of these cities are located in
South America.

iii. The last class presents a single peak before night. Most of these cities turned
out to be in North America and East Asia.

6. Moreover, we have analyzed mobile phone activity from Ivory Coast in
order to infer human behavior from calling and mobility patterns. In this study,
we have characterized the way geographical regions interact with each other. We
have found that the communication patterns correlate with the transport
infrastructure, economic development and cultural identity.

(a) We have shown that communication patterns in a developing country can be

characterized by the construction of networks of people calling each other and

moving through antennas. The networks are directed and weighted according to

the number of recurrences.

(b) We showed that the calls network behaves like a fully connected network with an

extremely high clustering coefficient, while the mobility network is sparser and

presents a lower clustering coefficient.

(c) At the mesoscale, we have shown that the calls network presents fewer and larger

communities, while the mobility network presents more but smaller communities.
We found that the communities in both networks are related to regions and
populated areas, like cities or villages.

(d) We showed that the mobility network is a reflection of the transport infrastructure.

Besides, we found that economic development is related to the closeness
centrality of this network.

(e) We found that the communities from the calling network are clustered in two
regions located on opposite sides of the country. We have evidence to believe that this
division is due to cultural factors, like the spoken language, as well as economic
factors.

7. Finally, we have studied the effects of natural disasters on people’s collective
behavior, using the 2009 floods in Tabasco, Mexico, as a case study. During this study
we proposed a methodology to integrate mobile phone data with other data sources, in
order to enhance the information managed by local governments and international
agencies during emergencies. The results show that mobile phone activity could be a
complementary source of information to estimate the impact of natural disasters in
near real time. Our conclusions from this study are the following:

(a) We have shown that mobile phone data is representative enough to estimate
measurements over the full population, since we observed a good correlation between
the number of users and inhabitants per region.

(b) We have shown that the mobile phone activity presents a bursty and hetero-

geneous behavior during natural disasters at the antennas close to the affected

areas.

(c) We have shown that these abnormal variations can be detected by normalizing

the behavior at each antenna during the emergency with its usual behavior.

(d) Our findings showed that relevant information results from the antenna-level ag-

gregation of cell phone traffic and not from the individual records. Therefore, we

have shown that user privacy is not compromised.

(e) We conclude that popular reactions to catastrophes could be incorporated
into an evolving emergency management strategy and into policy evaluation.
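
To make conclusion 4(a) more concrete, the following minimal Python sketch computes a dipole-like polarization index from a list of opinions in the range [-1, 1]. The specific form used here, (1 - dA) * d, where dA is the difference between the population fractions of the two poles and d is half the distance between their mean opinions, is an assumption chosen to reproduce the qualitative properties stated above. The function name polarization_index is hypothetical, and neutral users (opinion exactly zero) are ignored for simplicity.

```python
from statistics import mean


def polarization_index(opinions):
    """Dipole-like polarization index for opinions in [-1, 1].

    Assumed form: (1 - dA) * d, where dA is the absolute difference
    between the fractions of positive and negative users, and d is the
    distance between the mean opinions of both poles, rescaled to [0, 1].
    """
    positive = [x for x in opinions if x > 0]
    negative = [x for x in opinions if x < 0]
    if not positive or not negative:
        # A single pole (or no polarized users at all) carries no dipole.
        return 0.0
    total = len(positive) + len(negative)
    population_gap = abs(len(positive) - len(negative)) / total
    # The opinion range is [-1, 1], so the largest possible distance is 2.
    pole_distance = (mean(positive) - mean(negative)) / 2.0
    return (1.0 - population_gap) * pole_distance


if __name__ == "__main__":
    print(polarization_index([1, 1, -1, -1]))          # balanced, distant poles
    print(polarization_index([0.5, 0.5, -0.5, -0.5]))  # balanced, closer poles
    print(polarization_index([1, 1, 1, -1]))           # one pole larger
```

Under these assumptions, two equally populated poles with opposite extreme opinions give the maximum value, while bringing the poles closer together or making one pole larger than the other lowers the index, in line with conclusions 4(a) and 4(d).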

Appendix A

User Behavior

In this appendix, we show the characterization of the user behavior in different datasets. We

present the results from applying the same experiments performed in section 5.7 to two other

Twitter conversations described in section 4.2.2. More specifically, we present the results

from the 20N dataset in Fig. A.1 and the results from the ETA dataset in Fig. A.2.

It can be noticed that the patterns obtained from these datasets are very similar to the

results obtained in section 5.7 from the #SOSINternetVE dataset. In Fig. A.1 A and A.2

A, we show that the most retransmitted users are also the most followed ones (red dots),

independently of their activity. In Fig. A.1 B and A.2 B, we show that the most active users

(red dots) do not have the largest number of followers. However, these active users may gain

as many retweets as the popular users. In Fig. A.1 C and A.2 C, we show that the most

active users (red dots) are reciprocal and mainly located at Kin/Kout ∼ 1. Meanwhile, we

see that popular users are asymmetrical and present Kin > Kout. Finally, in Fig. A.1 D and

A.2 D, we show that the most active users (red dots) are those who retweet the most and

do not have the largest number of followers. Also, we see that the most followed ones hardly

retweet other users.

In summary, we have shown that our characterization of the user behavior is not
constrained to a single dataset, but rather seems to be a general property of Twitter
conversations. Again, we found that there are three kinds of users. One group is composed
of highly followed users, who post few messages and obtain a high quantity of retweets.
Another group is composed of less followed users, who are very active, make a lot of
retweets and obtain retweets in proportion to their activity. Finally, there is a third
set of less followed users who hardly participate and consequently hardly gain retweets
in the conversation.

Figure A.1: Analysis of the user behavior. (A) Scatter plot of retransmissions obtained by user





followers and colored by its activity. Dots represent users. Data correspond to the 20N dataset.

Figure A.2: Analysis of the user behavior. (A) Scatter plot of retransmissions obtained by user





followers and colored by its activity. Dots represent users. Data correspond to the ETA dataset.

Appendix B

Videos

In this appendix, we present the videos that we have made to illustrate some of our results.

For each video, we present a figure composed of three arbitrarily chosen snapshots. We also
provide a description of the video in the figure’s caption.

Specifically, we present the following videos:

1. Evolution of the opinion estimation model in Fig. B.1. In this video we show the

evolution of the opinion estimation model in a sample network. At the beginning of

the video, all nodes except the elite are colored in white. Then, as the video goes
on, nodes iteratively adopt new opinions and change their colors accordingly.

2. Worldwide Twitter reaction to the announcement of Hugo Chavez’s death in Fig.
B.2. In this video we show the evolution of geolocated messages per minute during a
24 h period, including the announcement of the death of the former Venezuelan president.

At the beginning of the video we see some scattered messages, mainly concentrated

in Venezuela. Then, once the news is released, we notice an explosion of activity

worldwide.

3. Worldwide Twitter activity in Fig. B.3. In this video we show the dynamics of Twitter

activity worldwide, during one arbitrary week. In the video we can notice a global wave

of activity going from east to west on a daily basis. We show that people periodically
go to sleep and become active during the day.

4. Human trajectories network evolution in Ivory Coast in Fig. B.4. In this video we

show the evolution of the human mobility network in Ivory Coast during an arbitrary

day. We also show the location of the network communities in the map.

5. Calls network evolution in Ivory Coast in Fig. B.5. In this video we show the evolution

of the mobile phones’ calls network in Ivory Coast during an arbitrary day. We also

show the influence that the network communities play on each other.

6. Time-lapse of the Tabasco impact map in Fig. B.6. In this video we show the temporal

evolution of the antenna variation metric before, during and after the 2009 floods
that occurred in Tabasco, Mexico.

We have built these videos almost exclusively by means of Python scripts. In general,
a video is composed of a set of frames. In our videos, each frame is built as an
independent plot and saved as an independent figure. Then, we compiled all figures into
a single video using the ffmpeg program (https://www.ffmpeg.org/). The only requirement
is that the figure files must be numbered in the order in which they will appear in the
final video.
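
The frame-by-frame procedure described above can be sketched as follows. This is a minimal illustration and not the original scripts: the file naming scheme (frame_0000.png, frame_0001.png, ...), the function names, the use of matplotlib for the plots and the chosen ffmpeg options are all assumptions. The sketch requires matplotlib and an ffmpeg executable on the system path only when render_video is actually called.

```python
from pathlib import Path
import subprocess


def frame_path(out_dir, index, digits=4):
    """Zero-padded file name, so that frames sort in playback order."""
    return Path(out_dir) / f"frame_{index:0{digits}d}.png"


def ffmpeg_command(out_dir, video_path, fps=10, digits=4):
    """Build the ffmpeg call that compiles the numbered frames into a video."""
    pattern = Path(out_dir) / f"frame_%0{digits}d.png"
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-framerate", str(fps),   # input frames shown per second
        "-i", str(pattern),       # numbered image sequence
        "-pix_fmt", "yuv420p",    # widely supported pixel format
        str(video_path),
    ]


def render_video(series, out_dir="frames", video_path="video.mp4", fps=10):
    """Save one independent plot per frame, then call ffmpeg.

    `series` is a hypothetical list of sequences, one per frame.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to files, no display needed
    import matplotlib.pyplot as plt

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for index, values in enumerate(series):
        fig, ax = plt.subplots()
        ax.plot(values)
        fig.savefig(frame_path(out_dir, index))
        plt.close(fig)  # free memory before drawing the next frame
    subprocess.run(ffmpeg_command(out_dir, video_path, fps), check=True)
```

Zero-padding the frame index keeps the files in the order in which they must appear in the video, which is the only requirement mentioned above.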

Figure B.1: Evolution of the opinion estimation model. Nodes are colored according to their

opinion Xi. Initially, all nodes’ opinions are zero; thus, they are colored in white. As opinions
spread, nodes with an opinion below zero are red and those above zero are blue. The elite is
hidden in the network and spreads its opinions iteratively. We see how the network is increasingly colored

at each time step. Because the network is polarized around the elite, the red and blue colors are

not mixed.
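
The iterative coloring shown in this video can be mimicked with a small Python sketch. It assumes that each non-elite node repeatedly adopts the average opinion of the users it receives information from, while elite opinions remain fixed. This update rule, the function name estimate_opinions and the toy network are illustrative assumptions and not necessarily the exact model used to produce the video.

```python
def estimate_opinions(sources, elite, max_steps=100, tol=1e-6):
    """Propagate fixed elite opinions through a network.

    sources: dict mapping each node to the list of nodes it listens to.
    elite:   dict mapping elite nodes to their fixed opinion in [-1, 1].
    Returns a dict with the estimated opinion of every node.
    """
    nodes = set(sources) | set(elite)
    for listened in sources.values():
        nodes.update(listened)
    # All non-elite nodes start neutral (white in the video).
    opinions = {node: elite.get(node, 0.0) for node in nodes}
    for _ in range(max_steps):
        updated = {}
        for node in nodes:
            listened = sources.get(node, [])
            if node in elite or not listened:
                updated[node] = opinions[node]  # elite and isolated nodes keep their value
            else:
                updated[node] = sum(opinions[m] for m in listened) / len(listened)
        change = max(abs(updated[n] - opinions[n]) for n in nodes)
        opinions = updated
        if change < tol:
            break  # opinions have stopped changing
    return opinions


if __name__ == "__main__":
    toy_sources = {"a": ["E1"], "b": ["a"], "c": ["E2"], "d": ["a", "c"]}
    toy_elite = {"E1": 1.0, "E2": -1.0}
    print(estimate_opinions(toy_sources, toy_elite))
```

In this toy network, nodes that listen only to one side end up sharing that side's opinion, whereas a node that listens to both sides stays neutral.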

Figure B.2: Worldwide Twitter reaction to the announcement of Hugo Chavez’s death. Each yellow
circle represents a geolocated tweet. The video spans a 24 h period. We show a counter indicating

the remaining time before the announcement and the time after it. It can be noticed that at the

moment of the announcement the whole world reacted massively to the news by posting related

messages.

Figure B.3: Worldwide Twitter activity. In this video we present the worldwide Twitter activity

during an arbitrary week. We plot all geolocated tweets as white dots in the map. It can be noticed

that there is a wave of activity from the east to the west side of the globe as days evolve. Also, it

is noticeable that the activity decreases to its minimum levels during early mornings.

Figure B.4: Human trajectories network evolution in Ivory Coast. In this video, we present the

dynamical growth of the human trajectories network during an arbitrary day. Dots represent users

moving across the country from antenna to antenna. The edge color indicates the network
community to which the target node belongs. It can be noticed that the network grows in a sparse
way, mostly connecting nodes that are geographically close to each other. Some regions, like the
capital city (bottom right), concentrate most of the long-distance edges.

Figure B.5: Calls network evolution in Ivory Coast. In this video, we present the dynamical

growth of the calls network during a period of 12 hours on an arbitrary day. Dots represent calls,
traveling from one antenna to another at each hour. The edge color indicates the network
community to which the target node belongs. It can be noticed that there is an explosion of calls

after 6am, showing the dense structure of the network.

Figure B.6: Time-lapse of the Tabasco impact map. The video displays the absolute value of the

antenna variation metric from October 2009 to January 2010, as in the time series. Each antenna is
represented by a circle with color and size proportional to the daily metric value. The segmented
flooded area has been colored in light blue. It can be noticed that the antennas near the flooded area
dramatically increased their variation during the floods. This effect is also noticeable during Christmas
and New Year’s Eve, when all antennas present extremely large variation.
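
As a rough illustration of how an antenna's activity can be normalized by its usual behavior, the Python sketch below expresses each daily value as a number of standard deviations away from a baseline period. Treating the variation metric as a z-score is an assumption made for this example, as are the function name and the sample numbers; the exact definition used for the video may differ.

```python
from statistics import mean, pstdev


def antenna_variation(baseline, observed):
    """Normalize observed daily activity by the antenna's usual behavior.

    baseline: daily activity counts from a normal (non-emergency) period.
    observed: daily activity counts to be evaluated.
    Returns the deviation of each observed day from the baseline mean,
    measured in baseline standard deviations.
    """
    usual = mean(baseline)
    spread = pstdev(baseline)
    if spread == 0:
        raise ValueError("baseline activity is constant; cannot normalize")
    return [(value - usual) / spread for value in observed]


if __name__ == "__main__":
    # Hypothetical daily call counts for one antenna.
    normal_days = [8, 12, 8, 12]
    emergency_days = [10, 14, 30]
    print([abs(v) for v in antenna_variation(normal_days, emergency_days)])
```

Because each antenna is compared only with its own baseline, busy and quiet antennas become comparable, and taking the absolute value, as in the video, highlights both unusual increases and unusual drops.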

Bibliography

[ACFO13] D. Acemoglu, G. Como, F. Fagnani, and A. Ozdaglar, Opinion fluctuations

and disagreement in social networks, Mathematics of Operations Research 38

(2013), no. 1, 1–27.

[AG05] L. A. Adamic and N. Glance, The political blogosphere and the 2004 U.S.

election: Divided they blog, Proceedings of LinkKDD, 2005.

[AHSW11] S. Asur, B. A. Huberman, G. Szabo, and C. Wang, Trends in social media:

Persistence and decay, CoRR abs/1102.1402 (2011).

[AJB99] R. Albert, H. Jeong, and A-L Barabasi, Internet: Diameter of the world-wide

web, Nature 401 (1999), no. 6749, 130–131.

[AMV+14] J. Adebayo, T. Musso, K. Virdee, C. Friedman, and Y. Bar-Yam, An explo-

ration of social identity: The structure of the bbc news-sharing community on

twitter, Complexity 19 (2014), no. 5, 55–63.

[AO11] D. Acemoglu and A. Ozdaglar, Opinion dynamics and learning in social net-

works, Dynamic Games and Applications 1 (2011), no. 1, 3–49.

[APR99] J. Abello, P. M. Pardalos, and M. G. C. Resende, On very large maximum

clique problems, AMS-DIMACS Series in Discrete Mathematics and Theoret-

ical Computer Science 50 (1999), 119–130.

[AS11] R. Alonso-Sanz, Discrete systems with memory, vol. 75, World Scientific,

2011.

[ASBS00] L. A. Amaral, A. Scala, M. Barthelemy, and H. E. Stanley, Classes of small-

world networks., Proc Natl Acad Sci 97 (2000), no. 21, 11149–11152.

[AW12] S. Aral and D. Walker, Identifying influential and susceptible members of

social networks, Science 337 (2012), no. 6092, 337–341.

[BA99] A-L Barabasi and R Albert, Emergence of Scaling in Random Networks, Sci-

ence 286 (1999), no. 5439, 509–512.

[Bar05] A-L Barabasi, The origin of bursts and heavy tails in human dynamics, Nature

435 (2005), 207.

[Bar12] A-L Barabasi, Network Science Project, http://barabasilab.neu.edu/networksciencebook,
2012.

[BB07] D. Baldassarri and P. Bearman, Dynamics of Political Polarization, American

Sociological Review 72 (2007), no. 5, 784–811.

[BBPSV04] A. Barrat, M. Barthelemy, R. Pastor-Satorras, and A. Vespignani, The archi-

tecture of complex weighted networks, Proceedings of the National Academy

of Sciences of the United States of America 101 (2004), no. 11, 3747–3752.

[BCH+13] R. Becker, R. Caceres, K. Hanson, S. Isaacman, J-M Loh, M. Martonosi,

J. Rowland, S. Urbanek, A. Varshavsky, and C. Volinsky, Human mobility

characterization from cellular network data, Commun. ACM 56 (2013), no. 1,

74–82.

[BEC+12] V. D. Blondel, M. Esch, C. Chan, F. Clerot, P. Deville, E. Huens, F. Morlot,

Z. Smoreda, and C. Ziemlicki, Data for development: the d4d challenge on

mobile phone data, CoRR abs/1210.0137 (2012).

[Ber97] A. Berry, The income distribution threat in latin america, Latin American

Research Review 32 (1997), no. 2, pp. 3–40 (English).

[Bet13] L. M. A. Bettencourt, The origins of scaling in cities, Science 340 (2013),

no. 6139, 1438–1441.

[BG05] I. Borg and P.J.F. Groenen, Modern Multidimensional Scaling: Theory and

Applications, Springer, 2005.

[BG08] D. Baldassarri and A. Gelman, Partisans without Constraint: Political Po-

larization and Trends in American Public Opinion, American Journal of So-

ciology 114 (2008), no. 2, 408–446.

[BGL10] D. Boyd, S. Golder, and G. Lotan, Tweet, tweet, retweet: Conversational

aspects of retweeting on twitter., HICSS, IEEE Computer Society, 2010, pp. 1–

10.

[BGLL08] V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, Fast unfolding

of communities in large networks, J. Stat. Mech (2008), P10008.

[BH48] R. Bellman and T. E Harris, On the theory of age-dependent stochastic branch-

ing processes, Proc. Nat. Acad. Sci. USA 34 (1948), no. 12, 601.

[BHMW11] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, Everyone’s an

influencer: quantifying influence on twitter, Proceedings of the fourth ACM

international conference on Web search and data mining (New York, NY,

USA), WSDM ’11, ACM, 2011, pp. 65–74.

[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata,

A. Tomkins, and J. Wiener, Graph structure in the web, Comput. Netw. 33

(2000), 309–320.

[BKO11] D. Bindel, J. Kleinberg, and S. Oren, How bad is forming your own opin-

ion?, Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual

Symposium (2011), 57–66.

[BLM+06] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D-U. Hwang, Complex

networks : Structure and dynamics, Phys. Rep. 424 (2006), no. 4-5, 175–308.

[BLT+11] L. Bengtsson, X. Lu, A. Thorson, R. Garfield, and J. von Schreeb, Improved

response to disasters and outbreaks by tracking population movements with

mobile phone network data: A post-earthquake geospatial study in haiti, PLoS

Med 8 (2011), no. 8, e1001083.

[BMBL14] J. Borondo, A. J. Morales, R. M. Benito, and J. C. Losada, Mapping the online
communication patterns of political conversations, Physica A: Statistical
Mechanics and its Applications 414 (2014), 403–413.

[BMLB12] J. Borondo, A. J. Morales, J. C. Losada, and R. M. Benito, Characterizing

and modeling an electoral campaign in the context of Twitter: 2011 Spanish

Presidential election as a case study., Chaos 22 (2012), no. 2, 023138.

[BMZ11] J. Bollen, H. Mao, and X-J Zeng, Twitter mood predicts the stock market., J.

Comput. Science 2 (2011), no. 1, 1–8.

[BS09] E. Bullmore and O. Sporns, Complex brain networks: graph theoretical anal-

ysis of structural and functional systems, Nature Reviews Neuroscience 10

(2009), no. 3, 186–198.

[BTW87] P. Bak, C. Tang, and K. Wiesenfeld, Self-organized criticality. an explanation

of 1/f noise, Physical Review Letters 59 (1987), 381–384.

[BWB11] J. P. Bagrow, D. Wang, and A-L Barabasi, Collective Response of Human

Populations to Large-Scale Emergencies, PLOS ONE 6 (2011), no. 3, e17680.

[BY97] Y. Bar-Yam, Dynamics of complex systems, vol. 213, Addison-Wesley Read-

ing, MA, 1997.

[BYB13] Y. Bar-Yam and M. Bialik, Beyond big data: Identifying important informa-

tion for real world challenges, arXiv in press (2013).

[Cas96] M. Castells, Rise of the network society, 1st ed., Blackwell Publishers, Inc.,

Cambridge, MA, USA, 1996.

[CBBV06] V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, The role of the

airline transportation network in the prediction and predictability of global

epidemics, Proceedings of the National Academy of Sciences of the United

States of America 103 (2006), no. 7, 2015–2020.

[CCG+02] Q. Chen, H. Chang, R. Govindan, S. Jamin, S. Shenker, and W. Willinger,

The origin of power-laws in internet topologies revisited., INFOCOM, 2002.

[CE09] A. Cheng and M. Evans, An in-depth look inside the twitter world.,

http://www.sysomos.com/insidetwitter, 2009.

[CF07] N. A. Christakis and J. H. Fowler, The spread of obesity in a large social

network over 32 years, New England journal of medicine 357 (2007), no. 4,

370–379.

[CF08] N. A. Christakis and J. H. Fowler, The collective dynamics of smoking in a

large social network, New England journal of medicine 358 (2008), no. 21,

2249–2258.

[CFHB+05] R. Criado, J. Flores, B. Hernandez-Bermejo, J. Pello, and M. Romance, Ef-

fective measurement of network vulnerability under random and intentional

attacks, Journal of Mathematical Modelling and Algorithms 4 (2005), no. 3,

307–316.

[CFMF13] M. Conover, E. Ferrara, F. Menczer, and A. Flammini, The digital evolution

of occupy wall street, PLOS ONE 8 (2013), no. 5, e64679.

[CGFM12] M. D. Conover, B. Goncalves, A. Flammini, and F. Menczer, Partisan asym-

metries in online political activity, EPJ Data Science 1 (2012), no. 1, 1–19

(English).

[CGW+08] J. Candia, M. C. Gonzalez, P. Wang, T. Schoenharl, G. Madey, and A.-L.

Barabasi, Uncovering individual and collective human dynamics from mobile

phone records, Journal of Physics A 41 (2008), no. 22, 224015.

[CH03] R. Cohen and S. Havlin, Scale-Free Networks Are Ultrasmall, Phys. Rev. Lett.

90 (2003), 058701.

[CHBG10] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi, Measuring user

influence in Twitter: The million follower fallacy, 4th International AAAI

Conference on Weblogs and Social Media (ICWSM), 2010.

[Cho13] K. Chodorow, Mongodb: the definitive guide, O’Reilly Media, Inc., 2013.

[Com11] Inc. ComScore, Social networking on-the-go: U.s. mobile social media audi-

ence grows 37 percent in the past year, Tech. report, 2011.

[Cou12] Digital Policy Council, World leader rankings on twitter, Research Note, 2012.

[CPRVP09] R. Criado, J. Pello, M. Romance, and M Vela-Perez, A node-based multiscale

vulnerability of complex networks, International Journal of Bifurcation and

Chaos 19 (2009), no. 02, 703–710.

[CRF+11] M. D. Conover, J. Ratkiewicz, M. Francisco, B. Goncalves, A. Flammini, and

F. Menczer, Political polarization on twitter, 2011.

[Cro06] D. Crockford, The application/json media type for javascript object notation

(json), RFC 4627, IETF, 7 2006.

[Cum12] G. Cumming, Understanding the new statistics : effect sizes, confidence in-

tervals, and meta-analysis, Multivariate applications series, Routledge Aca-

demic, London, 2012.

[CV12] Lidia Ceriani and Paolo Verme, The origins of the gini index: extracts from

variabilita e mutabilita (1912) by corrado gini, The Journal of Economic In-

equality 10 (2012), no. 3, 421–443.

[DB13] M. Duggan and J. Brenner, The demographics of social media users, 2012,

vol. 14, Pew Research Center’s Internet & American Life Project, 2013.

[DBM13] C. Doerr, N. Blenn, and P. Mieghem, Lognormal infection times of online

information spread, CoRR abs/1305.5235 (2013).

[DD87] C. J Date and H. Darwen, A guide to the sql standard, vol. 3, Addison-Wesley

New York, 1987.

[DeG74] M. H. DeGroot, Reaching a consensus, Journal of the American Statistical

Association 69 (1974), no. 345, 118–121.

[DG08] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large

clusters, Communications of the ACM 51 (2008), no. 1, 107–113.

[DGL13] P. Dandekar, A. Goel, and D. T. Lee, Biased assimilation, homophily, and
the dynamics of polarization, Proc. Nat. Acad. Sci. (2013).

[Dia90] L. J. Diamond, Three Paradoxes of Democracy, Journal of Democracy 1

(1990), no. 3, 48–60.

[Dia97] J. M. Diamond, Guns, germs, and steel: The fates of human societies, W.W.

Norton, New York, 1997.

[DO14] F. D’Orazio and J. Owens, White paper: How stuff spreads 2: How videos go

viral part 1., Tech. report, 2014.

[Dow57] A. Downs, An economic theory of political action in a democracy, The Journal

of Political Economy (1957), 135–150.

[Dun92] R. I. M. Dunbar, Neocortex size as a constraint on group size in primates,

Journal of Human Evolution 22 (1992), no. 6, 469–493.

[DW07] A. K. Dixit and J. W. Weibull, Political polarization, Proceedings of the

National Academy of Sciences 104 (2007), no. 18, 7351–7356.

[DYB03] G. F. Davis, M. Yoo, and W. E. Baker, The small world of the American

Corporate Elite, 1982-2001, Strategic Organization 1 (2003), 301–326.

[ECd08] Communaute Europeenne and Republique Cote d’Ivoire, Document de strate-

gie pays et programe indicatif national pour la periode 2008-2013, Tech. re-

port, UE, 2008.

[EEBL11] P. Expert, T. S. Evans, V. D. Blondel, and R. Lambiotte, Uncovering space-

independent communities in spatial networks, Proceedings of the National

Academy of Sciences 108 (2011), no. 19, 7663–7668.

[EH02] S. Ellner and D. Hellinger, Venezuelan politics in the Chavez era: Class,

polarization and conflict, Lynne Rienner Publishers, 2002.

[EMC10] N. Eagle, M. Macy, and R. Claxton, Network diversity and economic devel-

opment, Science 328 (2010), no. 5981, 1029–1031.

[EP06] N. Eagle and A. Pentland, Reality mining: sensing complex social systems,

Personal and ubiquitous computing 10 (2006), no. 4, 255–268.

[ER60] P. Erdos and A. Renyi, On the evolution of random graphs, Publication of the

Mathematical Institute of the Hungarian Academy of Sciences, 1960, pp. 17–

61.

[FC08] J. H. Fowler and N. A. Christakis, Dynamic spread of happiness in a large

social network: longitudinal analysis over 20 years in the framingham heart

study, Bmj 337 (2008).

[FFGP10] J. G. Foster, D. V. Foster, P. Grassberger, and M. Paczuski, Edge direc-

tion and the structure of networks, Proceedings of the National Academy of

Sciences 107 (2010), no. 24, 10815–10820.

[FGH12] M. Fernandez, J. Galeano, and C. A. Hidalgo, Bipartite networks provide new

insights on international trade markets, Networks and Heterogeneous Media

7 (2012), no. 3, 399–413.

[FJ90] N. E. Friedkin and E. C. Johnsen, Social influence and opinions, Journal of

Mathematical Sociology 15 (1990), no. 3-4, 193–206.

[For10] S. Fortunato, Community detection in graphs, Physics Reports 486 (2010),

no. 3-5, 75 – 174.

[Fre11] A. Freitez, La emigracion desde Venezuela durante la ultima decada, Temas

de Coyuntura (2011), no. 63, 11–38.

[GA12] D. Gayo-Avello, ”i wanted to predict elections with twitter and all i got was

this lousy paper” – a balanced survey on election prediction using twitter data,

CoRR abs/1204.6441 (2012).

[GAC+10] W. Galuba, K. Aberer, D. Chakraborty, Z. Despotovic, and W. Kellerer,

Outtweeting the twitterers - predicting information cascades in microblogs,

Proceedings of the 3rd conference on Online social networks (Berkeley, CA,

USA), WOSN’10, USENIX Association, 2010, pp. 3–3.

[Gal73] E.H. Galeano, Open veins of latin america: Five centuries of the pillage of a

continent, Modern reader paperback. 308, Monthly Review Press, 1973.

[GG03] M. P. Garcia-Guadilla, Politizacion y polarizacion de la sociedad civil vene-

zolana: Las dos caras frente a la democracia, Espacio Abierto 12 (2003),

no. 001, 31–62.

[GHB08] M. C. Gonzalez, C. A. Hidalgo, and A-L. Barabasi, Understanding individual

human mobility patterns, Nature 453 (2008), no. 7196, 779–782.

[GHKV07] M. C. Gonzalez, H. J. Herrmann, J Kertesz, and T Vicsek, Community struc-

ture and ethnic preferences in school friendship networks, Physica A: Statis-

tical mechanics and its applications 379 (2007), no. 1, 307–316.

[GI95] J. W. Grossman and P. D. F. Ion, On a portion of the well known collaboration

graph, Congressus Numerantium 108 (1995), 129–131.

[GIT09] B. D. Gomperts, M. K. IJsbrand, and P.E.R. Tatham, Copyright, Signal

Transduction (Second Edition), Academic Press, San Diego, second edition

ed., 2009, pp. iv –.

[GJ10a] B. Golub and M. O. Jackson, Naive learning in social networks and the wis-

dom of crowds, American Economic Journal: Microeconomics (2010), 112–

149.

[GJ10b] B. Golub and M. O. Jackson, Using selection bias to explain the observed

structure of internet diffusions, Proc. Nat. Acad. Sci. USA 107 (2010), no. 24,

10833–10836.

[GKK11] V. Gomez, H.J Kappen, and A. Kaltenbrunner, Modeling the structure and

evolution of discussion cascades, Proceedings of the 22nd ACM conference on

Hypertext and hypermedia, ACM, 2011, pp. 181–190.

[GLM01] J. Goldenberg, B. Libai, and E. Muller, Talk of the network: A complex

systems look at the underlying process of word-of-mouth, Marketing Letters

(2001).

[GMSS12] D. Garcia, F. Mendez, U. Serdult, and F. Schweitzer, Political polarization

and popularity in online participatory media: An integrated approach, Pro-

ceedings of the First Edition Workshop on Politics, Elections and Data (New

York, NY, USA), PLEAD ’12, ACM, 2012, pp. 3–10.

[GN02] M. Girvan and M. E. J. Newman, Community structure in social and biological

networks, PNAS 99 (2002), no. 12, 7821–7826.

[GPG12] L. J Gilarranz, J. M Pastor, and J. Galeano, The architecture of weighted

mutualistic networks, Oikos 121 (2012), no. 7, 1154–1162.

[GPV11] B. Goncalves, N. Perra, and A. Vespignani, Modeling users’ activity on twitter

networks: Validation of dunbar’s number, PLoS ONE 6 (2011), no. 8.

[Gra73] M. Granovetter, The Strength of Weak Ties, The American Journal of Soci-

ology 78 (1973), no. 6, 1360–1380.

[Gra78] M. Granovetter, Threshold models of collective behavior, The American Jour-

nal of Sociology 83 (1978), no. 6, 1420–1443.

[GRM+12] P. A. Grabowicz, J. J. Ramasco, E. Moro, J.M. Pujol, and C.M. Eguiluz,

Social features of online networks: The strength of intermediary ties in online

social media, PLoS ONE 7 (2012), no. 1, e29358.

[Hin13] Hinterlaces, Monitor pais, 12 2013.

[HKBH07] CA Hidalgo, B. Klinger, A.L. Barabasi, and R. Hausmann, The product space

conditions the development of nations, Science 317 (2007), no. 5837, 482.

[HL75] R. A. Holley and T. M. Liggett, Ergodic theorems for weakly interacting infi-

nite systems and the voter model., The annals of probability (1975), 643–663.

[HRW09] B. A. Huberman, D. M. Romero, and F. Wu, Social networks that matter:

Twitter under the microscope, First Monday 14 (2009), no. 1.

[HSB+13] B. Hawelka, I. Sitko, E. Beinat, S. Sobolevsky, P. Kazakopoulos, and

C. Ratti, Geo-located twitter as the proxy for global mobility patterns., CoRR

abs/1311.0680 (2013).

[Huc01] R. Huckfeldt, The social communication of political expertise, American Jour-

nal of Political Science (2001), 425–438.

[HW09] H-B Hu and X-F Wang, Disassortative mixing in online social networks, EPL

(Europhysics Letters) 86 (2009), no. 1, 18003.

[HZGMBY13] A. Herdagdelen, W. Zuo, A.S. Gard-Murray, and Y. Bar-Yam, An exploration

of social identity: The geography and politics of news-sharing communities in

twitter, Complexity 19 (2013), 10–20.

[ICO12] ICCO International Cocoa Organization, Annual report 2011/2012, Tech. re-

port, ICCO, 2012.

[IE11a] J. L. Iribarren and E. Moro, Affinity paths and information diffusion in social
networks, Social Networks 33 (2011), no. 2, 134–142.

[IE11b] J. L. Iribarren and E. Moro, Branching dynamics of viral information spreading,
Phys. Rev. E 84 (2011), 046116.

[IFfAD09] IFAD International Fund for Agriculture Development, Enabling poor rural
people to overcome poverty in the Bolivarian Republic of Venezuela, 2009.

[IJBZ08] I. J. Benczik, S. Z. Benczik, B. Schmittmann, and R. K. P. Zia, Lack of
consensus in social systems, Europhys. Lett. 82 (2008), 48006.

[Jac10] M. O. Jackson, Social and economic networks, Princeton University Press

(2010).

[JCZB06] P. F. Jonsson, T. Cavanna, D. Zicha, and P. A. Bates, Cluster analysis of

networks generated through homology: automatic identification of important

protein communities involved in cancer metastasis., BMC Bioinformatics 7

(2006), 2.

[JKKK12] H.H Jo, M. Karsai, J. Kertesz, and K. Kaski, Circadian pattern and burstiness

in mobile phone communication, New Journal of Physics 14 (2012), no. 1,

013055.

[JMBO01] H. Jeong, S.P. Mason, A.-L. Barabasi, and Z.N. Oltvai, Lethality and central-

ity in protein networks, Nature 411 (2001).

[JSFT09] A. Java, X. Song, T. Finin, and B. Tseng, Why we twitter: An analysis of a

microblogging community, Advances in Web Mining and Web Usage Analysis,

Springer, 2009, pp. 118–138.

[Kaw13] T. Kawamoto, A stochastic model of the tweet diffusion on the Twitter net-

work, Physica A: Statistical Mechanics and its Applications (2013).

[KEH10] A. Kapoor, N. Eagle, and E. Horvitz, People, quakes, and communications:

Inferences from call dynamics about a seismic event and its influences on

a population., AAAI Spring Symposium: Artificial Intelligence for Develop-

ment, AAAI, 2010.

[Kel58] H. C. Kelman, Compliance, identification, and internalization: Three pro-

cesses of attitude change, Journal of conflict resolution (1958), 51–60.

[KKK02] L. Kullmann, J. Kertesz, and K. Kaski, Time-dependent cross-correlations

between different stock returns: A directed network of influence, Phys. Rev.

E 66 (2002), 026125.

[KKT03] D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the spread of influ-

ence through a social network, KDD ’03: Proceedings of the ninth ACM

SIGKDD international conference on Knowledge discovery and data mining,

ACM Press, 2003, pp. 137–146.

[KLPM10] H. Kwak, C. Lee, H. Park, and S. Moon, What is twitter, a social network or

a news media?, WWW ’10: Proceedings of the 19th international conference

on World wide web (New York, NY, USA), ACM, 2010, pp. 591–600.

[KM27] W. O. Kermack and A. G. McKendrick, A Contribution to the Mathematical

Theory of Epidemics, Proceedings of the Royal Society of London. Series A,

Containing Papers of a Mathematical and Physical Character 115 (1927),

no. 772, 700–721.

[KOS11] A. S. King, F. J. Orlando, and D. B. Sparks, Ideological Extremity and Pri-

mary Success: A Social Network Approach, 2011 MPSA Conference (2011).

[Kra00] U. Krause, A discrete nonlinear and non-autonomous model of consensus

formation, Communications in difference equations (2000), 227–236.

[Kra09] D. Krackhardt, A plunge into networks, Science 326 (2009), 47–48.

[KSA+10] M. Kolar, L. Song, A. Ahmed, E. P. Xing, et al., Estimating time-varying

networks, The Annals of Applied Statistics 4 (2010), no. 1, 94–123.

[KSESM12] K. Klemm, M.A. Serrano, V.M. Eguiluz, and M. San-Miguel, A measure of

individual role in collective dynamics, Scientific Reports 2 (2012), no. 292.

[LBP13] S.Y Liu, A. Baronchelli, and N. Perra, Contagion dynamics in time-varying

metapopulation networks, Physical Review E 87 (2013), no. 3, 032805.

[LeB96] G. Le Bon, The Crowd: A Study of the Popular Mind, New York Macmillan

Co., 1896.

[Lew09] M. P. Lewis, Ethnologue: Languages of the world, 16 ed., SIL International,

2009.

[LGRC12] J. Lehmann, B. Goncalves, J. J. Ramasco, and C. Cattuto, Dynamical classes

of collective attention in twitter, Proceedings of the 21st international confer-

ence on World Wide Web (New York, NY, USA), WWW ’12, ACM, 2012,

pp. 251–260.

[LNK07] D. Liben-Nowell and J. Kleinberg, The link-prediction problem for social net-

works, Journal of the American society for information science and technology

58 (2007), no. 7, 1019–1031.

[LNL94] B. Latane, A. Nowak, and J. H Liu, Measuring emergent social phenomena:

Dynamism, polarization, and clustering as order parameters of social systems,

Behavioral science 39 (1994), no. 1, 1–24.

[LPA+09] D. Lazer, A. Pentland, L. Adamic, S. Aral, A-L Barabasi, D. Brewer,

N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King,

M. Macy, D. Roy, and M. Alstyne, Social science: Computational social sci-

ence, Science 323 (2009), no. 5915, 721–723.

[LSAA11] A. Livne, M. P. Simmons, E. Adar, and L. A. Adamic, The party is over

here: Structure and content in the 2010 election., ICWSM (Lada A. Adamic,

Ricardo A. Baeza-Yates, and Scott Counts, eds.), The AAAI Press, 2011.

[Lup10] N. Lupu, Who votes for chavismo?: Class voting in Hugo Chavez’s Venezuela,

Latin American Research Review 45 (2010), no. 1, 7–32.

[Lus03] D. Lusseau, The emergent properties of a dolphin social network, Proceedings

of the Royal Society of London. Series B: Biological Sciences 270 (2003),

no. Suppl 2, S186–S188.

[Mac67] J. B. MacQueen, Some methods for classification and analysis of multivariate

observations, Proc. of the fifth Berkeley Symposium on Mathematical Statis-

tics and Probability (L. M. Le Cam and J. Neyman, eds.), vol. 1, University

of California Press, 1967, pp. 281–297.

[MBLB14] A. J. Morales, J. Borondo, J. C. Losada, and R. M. Benito, Efficiency of hu-

man activity on information spreading on Twitter, Social Networks 39 (2014),

1–11.

[MBLBss] A. J. Morales, J. Borondo, J.C. Losada, and R.M. Benito, Measuring Politi-

cal Polarization: Twitter shows the two sides of Venezuela, Chaos (2014, In

press).

[MCB+13] A. J. Morales, W. Creixell, J. Borondo, J.C. Losada, and R.M. Benito, Under-

standing Ethnical Interactions in Ivory Coast, 3rd International Conference

on the Analysis of Mobile Phone Datasets, 2013.

[MCB+ss] A. J. Morales, W. Creixell, J. Borondo, J.C. Losada, and R.M. Benito, Char-

acterizing Ethnic Interactions from Human Communication Patterns in Ivory

Coast, Networks and Heterogeneous Media (2014, In press).

[MFMFM13] B. Moumni, V. Frias-Martinez, and E. Frias-Martinez, Characterizing social

response to urban earthquakes using cell-phone network data: the 2012 oax-

aca earthquake, Proceedings of the 2013 ACM conference on Pervasive and

ubiquitous computing adjunct publication, ACM, 2013, pp. 1199–1208.

[MHVB13] Y.A. Montjoye, C. A Hidalgo, M. Verleysen, and V. D Blondel, Unique in the

crowd: The privacy bounds of human mobility, Scientific reports 3 (2013).

[Mil63] S. Milgram, Behavioral study of obedience, Journal of Abnormal and Social

Psychology 67 (1963), no. 4, 371–378.

[Mil11] G. Miller, Social Scientists Wade Into the Tweet Stream, Science 333 (2011),

no. 6051, 1814–1815.

[Mit04] M. Mitzenmacher, A brief history of generative models for power law and

lognormal distributions, Internet Mathematics 1 (2004), no. 2, 226–251.

[ML07] M. Lim, R. Metzler, and Y. Bar-Yam, Global pattern formation and ethnic/cultural

violence, Science 317 (2007).

[MLA+11] A. Mislove, S. Lehmann, Y-Y Ahn, J-P Onnela, and J. N. Rosenquist, Un-

derstanding the demographics of twitter users., ICWSM (Lada A. Adamic,

Ricardo A. Baeza-Yates, and Scott Counts, eds.), The AAAI Press, 2011.

[MLB12] A. J. Morales, J. C. Losada, and R.M. Benito, Users structure and behavior

on an online social network during a political protest, Physica A: Statistical

Mechanics and its Applications 391 (2012), no. 21, 5244 – 5253.

[MML10] G. Miritello, E. Moro, and R. Lara, The dynamical strength of social ties in

information spreading, CoRR abs/1011.5367 (2010).

[MMR07] M. Mobilia, A. Petersen, and S. Redner, On the role of zealotry in the voter

model., Journal of Statistical Mechanics: Theory and Experiment 8 (2007),

08029.

[Mob03] M. Mobilia, Does a single zealot affect an infinite group of voters?., Physical

review letters 91 (2003), no. 2, 028701.

[Mor51] J.L. Moreno, Sociometry, experimental method and the science of society.,

Beacon House, Inc., 1951.

[MPLC13] F. Morstatter, J Pfeffer, H Liu, and K M Carley, Is the sample good enough?

comparing data from twitters streaming api with twitters firehose, Proceedings

of The 7th International AAAI Conference on Weblogs and Social Media ,

The AAAI Press, 2013.

[MPR02] N. McCarty, K. Poole, and H. Rosenthal, Political polarization and income

inequality.

[MR13] M. D. Makowsky and J. Rubin, An agent-based model of centralized institu-

tions, social network technology, and revolution, PLOS ONE 8 (2013), e80380.

[MS02] J. Montoya and R.S. Sole, Small world patterns in food webs, Journal of

Theoretical Biology 214 (2002), no. 3, 405 – 412.

[MSMA08] R. Dean Malmgren, Daniel B. Stouffer, Adilson E. Motter, and Luıs A. N.

Amaral, A Poissonian explanation for heavy tails in e-mail communication,

Proc. Nat. Acad. Sci. USA 105 (2008), no. 47, 18153–18158.

[MV12] E. Minaya and K. Vyas, When Chavez tweets, Venezuelans listen, Wall Street

Journal (April 25, 2012).

[NDXT11] N. P Nguyen, T. N Dinh, Y. Xuan, and M.T Thai, Adaptive algorithms for

detecting community structure in dynamic social networks, INFOCOM, 2011

Proceedings IEEE, IEEE, 2011, pp. 2282–2290.

[New02a] M. E. J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002),

no. 20, 208701.

[New02b] M. E. J. Newman, Spread of epidemic disease on networks, Phys. Rev. E 66

(2002), 016128.

[New03a] M. E. J. Newman, Mixing patterns in networks, Physical Review E 67 (2003),

no. 2, 026126.

[New03b] M. E. J. Newman, The structure and function of complex networks, SIAM

review 45 (2003), no. 2, 167–256.

[New05] M. E. J. Newman, Power laws, Pareto distributions and Zipf ’s law, Contem-

porary Physics 46 (2005), no. 5, 323–351.

[New06] M. E. J. Newman, Modularity and community structure in networks, Proc.

Natl. Acad. Sci. USA 103 (2006), 8577.

[NFB02] M. E. J. Newman, S. Forrest, and J. Balthrop, Email networks and the spread

of computer viruses, Phys. Rev. E 66 (2002), 035101.

[NMR05] M. J. Neely, E. Modiano, and C. E. Rohrs, Dynamic power allocation and

routing for time-varying wireless networks, Selected Areas in Communica-

tions, IEEE Journal on 23 (2005), no. 1, 89–103.

[NP03] M. E. J. Newman and J. Park, Why social networks are different from other

types of networks, Phys. Rev. E 68 (2003), 036122.

[NT12] J. Nigel and F. Toro, Facebook gives a platform to the challenger of Chavez,

2012.

[NWS02] M. E. J. Newman, D. J. Watts, and S. Strogatz, Random graph models of

social networks, Proc. Natl. Acad. Sci. USA 99 (2002), no. 1, 2566–2572.

[OSH+07] J.P. Onnela, J. Saramaki, J. Hyvonen, G. Szabo, D. Lazer, K. Kaski,

J. Kertesz, and A. L. Barabasi, Structure and tie strengths in mobile commu-

nication networks, Proc. Natl. Acad. Sci. USA 104 (2007), no. 18, 7332–7336.

[Pen08] A. Pentland, Honest signals: How they shape our world, The MIT Press,

2008.

[Pen14] A. Pentland, Social physics: How good ideas spread-the lessons from a new

science, Penguin Group (USA) Incorporated, 2014.

[PGPSV12] N Perra, B. Goncalves, R. Pastor-Satorras, and A. Vespignani, Activity driven

modeling of time varying networks, Scientific reports 2 (2012).

[PMT+14] D. Pastor, A. J. Morales, Y. Torres, J. Bauer, A. Wadhwa, C. Castro-Correa,

A. Calderon-Mariscal, L. Romanoff, J. Lee, A. Rutherford, V. Frias-Martinez,

N. Oliver, E. Frias-Martinez, and M. Luengo-Oroz, Flooding through the lens

of mobile phone activity, IEEE Global Humanitarian Technology Conference

(GHTC), 2014.

[PSR12] A. Pielow, R. Sioshansi, and M. C. Roberts, Modeling short-run electricity

demand with long-term growth rates and consumer price elasticity in com-

mercial and industrial sectors, Energy 46 (2012), no. 1, 533 – 540.

[PSV01] R. Pastor-Satorras and A. Vespignani, Epidemic dynamics and endemic states

in complex networks, Phys. Rev. E 63 (2001), 066117.

[PSV02] R. Pastor-Satorras and A. Vespignani, Epidemic dynamics in finite size scale-

free networks, Phys. Rev. E 65 (2002), 035108.

[RB10] M. Rosvall and C. T. Bergstrom, Multilevel compression of random walks on

networks reveals hierarchical organization in large integrated systems, CoRR

abs/1010.0431 (2010).

[Red98] S. Redner, How popular is your paper? an empirical study of the citation

distribution, European Physical Journal B 4 (1998), no. 2, 131–134.

[RFF+10] J. Ratkiewicz, S. Fortunato, A. Flammini, F. Menczer, and A. Vespignani,

Characterizing and modeling the dynamics of online popularity, Physical re-

view letters 105 (2010), no. 15, 158701.

[RGAH11] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman, Influence and

passivity in social media, Proceedings of the ECML/PKDD 2011, 2011.

[RLH11] L. E. C. Rocha, F. Liljeros, and P. Holme, Simulated epidemics in an empirical

spatiotemporal network of 50,185 sexual contacts., PLoS Comp Biol 7 (2011),

e1001109.

[RMFC10] J. N. Rosenquist, J. Murabito, J. H. Fowler, and N. A. Christakis, The spread

of alcohol consumption behavior in a large social network, Annals of Internal

Medicine 152 (2010), no. 7, 426–433.

[RMM+10] K. K. Rachuri, M. Musolesi, C. Mascolo, P. J. Rentfrow, C. Longworth,

and A. Aucinas, EmotionSense: a mobile phones based adaptive platform

for experimental social psychology research, Proceedings of the 12th ACM

international conference on Ubiquitous computing (New York, NY, USA),

Ubicomp ’10, ACM, 2010, pp. 281–290.

[Rou87] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and valida-

tion of cluster analysis, Journal of computational and applied mathematics

20 (1987), 53–65.

[RTU11] D. M. Romero, C. Tan, and J. Ugander, Social-Topical Affiliations: The

Interplay between Structure and Popularity, arXiv:1112.1115 (2011).

[San07] A. Santiago, Modelos Generalizados de Enlace Preferencial en Redes Complejas

Heterogeneas, Ph.D. thesis, Universidad Politecnica de Madrid, 2007.

[SAR08] V. Sood, T. Antal, and S. Redner, Voter models on heterogeneous networks,

Physical Review E 77 (2008), no. 4, 041121.

[SB08] A. Santiago and R. M Benito, An extended formalism for preferential attach-

ment in heterogeneous complex networks, Europhysics Letters (2008).

[Sch71] T. C. Schelling, Dynamic models of segregation, J. Math. Sociol. 1 (1971),

no. 2, 143–186.

[SCL00] M. L. Sachtjen, B. A. Carreras, and V. E. Lynch, Disturbances in a power

transmission system, Phys. Rev. E 61 (2000), 4877–4882.

[SEM05] K. Suchecki, V. M. Eguiluz, and M. San Miguel, Voter model dynamics in

complex networks: Role of dimensionality, disorder, and degree distribution.,

Physical Review E 72 (2005), no. 3, 036132.

[Sem12] Semiocast, Twitter reaches half a billion accounts more than 140 millions in

the u.s., WWW page, 2012.

[Sha01] C. E. Shannon, A mathematical theory of communication, ACM SIGMOBILE

Mobile Computing and Communications Review 5 (2001), no. 1, 3–55.

[Shi95] R. J. Shiller, Conversation, information, and herd behavior, The American

Economic Review (1995), 181–185.

[Sim62] H. A. Simon, The architecture of complexity, Proceedings of the American

Philosophical Society 106 (1962), no. 6, 467–482.

[SJN+07] C. J. Stam, B. F. Jones, G. Nolte, M. Breakspear, and P. Scheltens, Small-

world networks and functional connectivity in alzheimer’s disease, Cerebral

Cortex 17 (2007), no. 1, 92–99.

[SNK08] K. Saito, R. Nakano, and M. Kimura, Prediction of information diffu-

sion probabilities for independent cascade model., KES (3) (Ignac Lovrek,

Robert J. Howlett, and Lakhmi C. Jain, eds.), Lecture Notes in Computer

Science, vol. 5179, Springer, 2008, pp. 67–75.

[SOM10] T. Sakaki, M. Okazaki, and Y. Matsuo, Earthquake shakes twitter users: real-

time event detection by social sensors, Proceedings of the 19th international

conference on World wide web (New York, NY, USA), WWW ’10, ACM,

2010, pp. 851–860.

[SS12] P. Sobkowicz and A. Sobkowicz, Two-year study of emotion and communica-

tion patterns in a highly polarized political discussion forum, Social Science

Computer Review 30 (2012), no. 4, 448–469.

[TL13] G. Tang and F. L. F. Lee, Facebook use and political participation: The impact

of exposure to shared political information, connections with public political

actors, and network structural heterogeneity, Social Science Computer Review

31 (2013), no. 6, 763–773.

[TUGB12] J. L Toole, M. Ulm, M. C Gonzalez, and D. Bauer, Inferring land use from

mobile phone activity, Proceedings of the ACM SIGKDD international work-

shop on urban computing, ACM, 2012, pp. 1–8.

[UBMK12] J. Ugander, L. Backstrom, C. Marlow, and J. Kleinberg, Structural diversity

in social contagion, Proceedings of the National Academy of Sciences 109

(2012), no. 16, 5962–5966.

[UH03] UN-HABITAT, The challenge of slums - global report on human settlements

2003, Tech. report, UN, 2003.

[VDAVH04] W. Van-Der-Aalst and K. M. Van-Hee, Workflow management: models, meth-

ods, and systems, MIT press, 2004.

[VH86] E. Von Hippel, Lead users: a source of novel product concepts, Management

science 32 (1986), no. 7, 791–805.

[Wat02] D. J. Watts, A simple model of global cascades on random networks, Proceed-

ings of the National Academy of Sciences 99 (2002), no. 9, 5766–5771.

[Wat04] D. J. Watts, The "new" science of networks, Annual Review of Sociology 30

(2004), 243–270.

[WET+12] A. Wesolowski, N. Eagle, A. J. Tatem, D. L. Smith, A. M. Noor, E. W. Snow,

and C. O. Buckee, Quantifying the Impact of Human Mobility on Malaria,

Science 338 (2012), no. 6104, 267–270.

[WF01] A. Wagner and D. A. Fell, The small world inside large metabolic networks.,

Proc R Soc Lond B Biol Sci 268 (2001), no. 1478, 1803–1810.

[WG75] H. W. Watson and F. Galton, On the probability of the extinction of families.,

The Journal of the Anthropological Institute of Great Britain and Ireland 4

(1875), 138–144.

[WH04] D. M. Wilkinson and B. A. Huberman, A method for finding communities of

related genes, Proc. Nat. Acad. of Sci. USA 10 (2004), no. 1073.

[WHAT04] F. Wu, B. A. Huberman, L. A. Adamic, and J. R. Tyler, Information flow

in social groups, Physica A: Statistical Mechanics and its Applications 337

(2004), no. 1-2, 327–335.

[Whi09] T. White, Hadoop: the definitive guide, O’Reilly Media, Inc., 2009.

[WRB06] S. Wuchty, E. Ravasz, and A-L Barabasi, The architecture of biological net-

works, Complex systems science in biomedicine, Springer, 2006, pp. 165–181.

[WS98] D. J. Watts and S. H. Strogatz, Collective dynamics of ‘small-world’ networks,

Nature 393 (1998), no. 6684, 409–10.

[WWT+11] D. Wang, Z. Wen, H. Tong, C-Y Lin, C Song, and A-L Barabasi, Informa-

tion spreading in context, Proceedings of the 20th international conference on

World wide web (New York, NY, USA), WWW ’11, ACM, 2011, pp. 735–744.

[XC05] J. Xu and H. Chen, Criminal network analysis and visualization, Communi-

cations of the ACM 48 (2005), no. 6, 100–107.

[XLZ+12] F Xiong, Y Liu, Z-J Zhang, J Zhu, and Y Zhang, An information diffusion

model based on retweeting mechanism for online social media, Physics Letters

A 376 (2012), no. 30–31, 2103–2108.

[YL11] J. Yang and J. Leskovec, Patterns of temporal variation in online media,

Proceedings of the Fourth ACM International Conference on Web Search and

Data Mining (New York, NY, USA), WSDM ’11, ACM, 2011, pp. 177–186.

[ZCH+12] Z. D. Zhao, S. M. Cai, J. Huang, Y. Fu, and T. Zhou, Scaling behavior of

online human activity, EPL (Europhysics Letters) 100 (2012), no. 4, 48004.

[ZFT+08] Y. Zhang, A. J. Friend, A. L. Traud, M. A. Porter, J. H. Fowler, and P. J.

Mucha, Community structure in congressional cosponsorship networks, Phys-

ica A: Statistical Mechanics and its Applications 387 (2008), no. 7, 1705–

1712.
