Network Analysis of the Spotify Artist Collaboration Graph

Tobin South
Supervised by Dr Lewis Mitchell and Prof Matt Roughan
University of Adelaide

Vacation Research Scholarships are funded jointly by the Department of Education and Training and the Australian Mathematical Sciences Institute.



Contents

1 Abstract
2 Introduction
3 Construction of Collaboration Network
  3.1 Networks
  3.2 Collection Methods
  3.3 Vertices and Edges
4 Exploratory Data Analysis
5 Graph Analysis
  5.1 Degree Distribution
  5.2 Friendship Paradox
  5.3 Eigenvector Centrality
6 Genre Subgraphs
  6.1 Genre Coexistence Hypergraph
7 Conclusion


1 Abstract

Music can be found in every culture, past and present, and comes in varieties that can often be hard to

categorise. The diversity of music in its many styles and forms inspires many questions about its nature. The

advent of the digital age presents a new opportunity for a data-driven analysis of music. Using the popular

music streaming service Spotify, a music-artist collaboration network is constructed through web calls. The

network is analysed for degree distribution, neighbour degree distribution, clustering and centrality. Classical

artists are found to be the most central to the whole network, while rap artists are found to be the most central

to the popular subgraph. The data also lends itself to constructing hypergraphs based on genre which allow

the relationship between musical genres to be examined and characterized.

2 Introduction

Music has been a part of human culture for at least 55,000 years, and has evolved into an important part of society

[1]. While music itself has been studied extensively, new music streaming and processing technologies allow a

unique data-driven analysis of music and listeners. Spotify¹ is one of the largest commercial music streaming

services in the world, with over 140 million users [2]. Being such a dominant force in the music consumption

market, Spotify is a unique source of data for the analysis of music, both modern and classical. This research

focuses on the analysis of the musical artist collaboration network.

3 Construction of Collaboration Network

3.1 Networks

Networks, known in mathematics as graphs, are representations of relationships and connections between objects. In the most general sense, a graph is an ordered pair G = (V, E) made of a set V of ‘vertices’ or ‘nodes’ and a set E of ‘edges’ which connect pairs of these vertices [3]. In this graph the edges are undirected, meaning that each pair of connected vertices does not have an orientation. The graph could instead be constructed as a directed graph, where pairs of nodes do have an orientation. This would not substantially change the overall results.

This graph is formed from vertices, which are artists who have published songs on Spotify, and edges which

connect these artists and indicate that artists have appeared in the same song or album together.

In total, the giant component of the artist collaboration network contains 1,250,065 vertices, which forms a

large portion of the network of just over 2 million artists. However, most artists have a very low popularity

and the majority of popular artists are included in the central core of the network. The artists outside the core
network have not been fully examined and are an avenue for future research.

¹ www.spotify.com


3.2 Collection Methods

The graph was collected using a breadth-first search through the collaboration network via Application Programming Interface (API) calls to Spotify’s developer service. Starting from a given artist, all of that artist’s songs were collected and used to extract the artist’s collaborators, who were added to the search queue. This process continued until the main component was collected. This method ignores nodes that cannot be reached via edge jumps. Sampling artists at random would be one way to measure the completeness of the collected network.
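The collection procedure described above can be sketched as a standard breadth-first crawl. This is a minimal illustration rather than the code used in the project: `fetch_collaborators` is a hypothetical callback standing in for the Spotify API requests, and the toy credit table replaces real responses.

```python
from collections import deque


def crawl_collaboration_graph(seed, fetch_collaborators):
    """Breadth-first crawl from `seed`.

    `fetch_collaborators(artist)` is a hypothetical callback that returns
    the artists credited alongside `artist`; in a real crawl it would wrap
    the API requests. Returns the visited artists and undirected edges.
    """
    visited = {seed}
    queue = deque([seed])
    edges = set()
    while queue:
        artist = queue.popleft()
        for other in fetch_collaborators(artist):
            if other == artist:
                continue  # ignore self-credits
            edges.add(frozenset((artist, other)))  # undirected edge
            if other not in visited:
                visited.add(other)
                queue.append(other)
    return visited, edges


# Toy stand-in for API responses (illustrative data only).
TOY_CREDITS = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": ["F"],  # separate component, never reached from "A"
    "F": ["E"],
}


def toy_fetch(artist):
    return TOY_CREDITS.get(artist, [])


artists, edges = crawl_collaboration_graph("A", toy_fetch)
print(sorted(artists), len(edges))
```

In the toy data the artists "E" and "F" are never visited, which mirrors the limitation noted above: a breadth-first crawl only recovers the component containing the starting artist.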

3.3 Vertices and Edges

Between these 1,250,065 artists are 4,393,615 edges, meaning there are, on average, 3.514 edges per artist. Primarily, these edges indicate that two artists have worked together on a song. This usually means that the artists have co-performed or have produced the song together in some regard. However, edges can take other forms, such as ‘appears-on’ edges, and these edges can also be directed. Typically, if two artists work together on a song, that song will appear on both artists’ profiles. However, artists can decide not to include songs on their profile and, importantly, the two artists need not both be involved in the actual making of the song. This relates to the ‘appears-on’ edges, which indicate that an artist has used some other artist’s work in their own without the other artist being involved. This usually takes one of three forms: sampling, remakes or remixes. Sampling is when artists use a section or multiple sections of another artist’s music within their own song. Remakes or remixes are where artists make their own take on another artist’s song; two examples are covers of songs in different styles and the use of classical music by other groups, such as an orchestra performing Mozart. While these directed edges are interesting, for the purposes of graph analysis all edges are treated as undirected and the directed data is transformed into a property referred to as an artist’s reciprocation coefficient. This denotes what fraction of an artist’s songs point to artists that point back at them.
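As a rough illustration of the reciprocation coefficient, the sketch below works at the level of artist-to-artist links rather than individual songs, which is a simplifying assumption; the report does not give the exact computation. The names `points_to`, `reciprocation_coefficient` and `to_undirected` are hypothetical.

```python
def reciprocation_coefficient(artist, points_to):
    """Fraction of `artist`'s outgoing links whose target links back.

    `points_to` maps each artist to the set of artists they point to.
    Artists with no outgoing links get 0.0 (a convention assumed here).
    """
    targets = points_to.get(artist, set())
    if not targets:
        return 0.0
    reciprocated = sum(1 for t in targets if artist in points_to.get(t, set()))
    return reciprocated / len(targets)


def to_undirected(points_to):
    """Collapse directed links into a set of undirected edges."""
    return {
        frozenset((a, b))
        for a, targets in points_to.items()
        for b in targets
        if a != b
    }


# Toy data: an orchestra performs Mozart (one-way) and works with a soloist.
toy_links = {
    "orchestra": {"mozart", "soloist"},
    "soloist": {"orchestra"},
    "mozart": set(),
}

print(reciprocation_coefficient("orchestra", toy_links))
print(len(to_undirected(toy_links)))
```

Here the orchestra points at two artists but only the soloist points back, so half of its links are reciprocated, while the undirected graph keeps both edges.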

Spotify provides other useful information about the artists, e.g., the popularity of the artist, which is given as a number between 0 and 100. Ed Sheeran alone has a popularity of 100, as he is the artist on Spotify with the highest number of total streams. Spotify defines an artist’s popularity as the artist’s total streams expressed as a percentage of Ed Sheeran’s, floored to an integer value:

$$\text{Popularity} = \left\lfloor 100 \times \frac{\text{Artist Total Streams}}{\text{Ed Sheeran Total Streams}} \right\rfloor$$

Further, Spotify also provides the number of followers the artist has on the streaming service. Followers on

Spotify are users that have chosen to see updates about the activity of an artist; however, following is not required

to listen to any artist or song.

Another valuable piece of information Spotify provides is the genres that the artist is placed in. This

classification is performed by Spotify using an internal methodology but provides very detailed classifications.


Each artist may be assigned several detailed genres. However, this classification is only performed for more
popular artists. Since most artists have a low popularity, 89.71% of artists are not classified into any genre.

4 Exploratory Data Analysis

Popularity is an interesting metric in its ability to capture what matters to Spotify most: what users are

listening to. However, the nature of the way popularity is calculated means that, as seen in Figure 1, most

artists have a very low popularity, and only a small number contribute the majority of music streams. Artists
with a popularity above 5 contribute 94.47% of all music streams but make up only 34.5% of the network,
with 215,718 artists.

Figure 1: (a) Complementary cumulative distribution function of the popularity of all artists in the graph. (b) Contribution of remaining artists to total popularity as low popularity nodes are removed.

While one may assume that the number of followers an artist has and their popularity would be highly correlated, the story is more complex. Figure 2 shows that higher popularity nodes generally have more followers, but this is not true for all high popularity nodes. The mechanisms that explain this difference are hard to identify, as they are related to Spotify’s different processes for incentivising users to follow artists and to listen to their music. One possibility is that the number of followers an artist has is more a measure of fan base size and engagement than of public awareness or listening. In Figure 2 the most populated point on the graph, by orders of magnitude, is 0 popularity and 0 followers, with 150,464 artists. Additionally, the nearby region with a popularity below 2 and fewer than 10 followers contains 318,319 artists.

Another point of investigation is the relationship between popularity and degree. The degree of a vertex is

the number of connections coming from it [3], which here is the number of collaborations the artist has made.

As observed in Figure 3, popularity and degree are not highly correlated. There are a number of medium-level


(50 - 80) popularity artists with very high degree, many of which are orchestras. These orchestras ‘collaborate’

with a large number of artists, as the orchestras are formed by many members, and the orchestras cite a number

of classical artists.

Figure 2: Popularity of artists against the number of artist followers

Figure 3: Popularity of artists against artist node degree


5 Graph Analysis

5.1 Degree Distribution

The degree distribution of artists is heavy-tailed, meaning very few artists have a very high degree. The two largest are Wolfgang Amadeus Mozart, with 11,302 edges, and Johann Sebastian Bach, with 10,856 edges. This is because another artist using their work is treated as a collaboration.

Figure 4: Complementary cumulative distribution function of the node degrees in the graph
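For readers unfamiliar with the plot type, a complementary cumulative distribution function (CCDF) of the kind shown in Figure 4 can be computed as in the sketch below. The convention P(X ≥ x) is an assumption (some authors use P(X > x)), and the degree list is toy data, not the Spotify degrees.

```python
from collections import Counter


def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for the observed values."""
    n = len(values)
    counts = Counter(values)
    remaining = n  # number of observations that are >= the current x
    result = []
    for x in sorted(counts):
        result.append((x, remaining / n))
        remaining -= counts[x]
    return result


# Toy degree sequence with one high-degree node (illustrative only).
toy_degrees = [1, 1, 1, 2, 2, 3, 10]
points = ccdf(toy_degrees)
for x, p in points:
    print(x, round(p, 3))
```

Plotting these pairs on log-log axes makes a heavy tail visible as a slow decay at large degree.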

5.2 Friendship Paradox

An interesting finding in networks is the so-called “friendship paradox” [4], the idea that, on average, most people have fewer friends than their friends have. In Figure 5(a) this is examined by adding together the number of collaborations that collaborators of an artist have made. There are more ‘friends of friends’ in total than there are friends, because the connections of your friends will likely also be connections of other friends. To deal with this discrepancy, Figure 5(b) normalises the results to examine what portion of artists have what levels of friends of friends. As can be seen, the node neighbour degree distribution is shifted slightly to the right compared to the node degree distribution. While this does support the friendship paradox, other social networks exhibit a starker right shift in the distribution [5].
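The comparison behind the friendship paradox can be illustrated on a toy graph by comparing each node's degree with the mean degree of its neighbours. This is a sketch under the assumption that the graph is stored as a dictionary of neighbour sets; the function name is hypothetical and the star graph is illustrative data.

```python
def degree_and_neighbour_degree(adjacency):
    """Return each node's degree and the mean degree of its neighbours.

    `adjacency` maps each node to the set of its neighbours (undirected).
    Isolated nodes are omitted from the neighbour-degree result.
    """
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    neighbour_degree = {
        v: sum(degree[u] for u in nbrs) / len(nbrs)
        for v, nbrs in adjacency.items()
        if nbrs
    }
    return degree, neighbour_degree


# Toy star graph: one hub collaborating with three otherwise isolated artists.
toy_graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}

degree, neighbour_degree = degree_and_neighbour_degree(toy_graph)
mean_degree = sum(degree.values()) / len(degree)
mean_neighbour_degree = sum(neighbour_degree.values()) / len(neighbour_degree)
fewer_than_friends = sum(1 for v in degree if degree[v] < neighbour_degree[v])
print(mean_degree, mean_neighbour_degree, fewer_than_friends)
```

In the star graph three of the four nodes have a lower degree than their neighbours on average, which is the paradox in its most extreme form.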

5.3 Eigenvector Centrality

A key motivation for the project was to probe the way humans analyse and evaluate music. A key component

is the idea of determining what music is ‘important’. One way to do this that is appropriate for this dataset is

through network centrality. Centrality can be measured using methods including betweenness, closeness, and
PageRank algorithms; however, this investigation focuses on eigenvector centrality.


Figure 5: (a) Joint histogram of node degree and total degree of all node neighbours. (b) Joint histogram of node degree and total degree of all node neighbours, normalised to the total number of connections.

Consider a node $v_i \in V$, where an edge $e_{i,j} \in E$ connects $v_i$ and $v_j$, $N_i$ is the set of neighbours of $v_i$, and $\lambda$ is a constant. The relative centrality score, $x_i$, of $v_i$ is:

$$x_i = \frac{1}{\lambda} \sum_{j \in N_i} x_j$$

This can be written as an eigenvector equation, where $A = [a_{i,j}]$ is the adjacency matrix of the graph, with $a_{i,j} = 1$ if $e_{i,j}$ exists and 0 otherwise. This equation then takes the form:

$$Ax = \lambda x$$

We chose eigenvector centrality as it is a robust metric for examining community structure [6]. Unlike other

metrics, it can be accurately approximated within reasonable computing times using numerical methods such

as power iteration. This is essential considering the size of the data in this investigation.
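Power iteration for the equation above can be sketched in a few lines. This is an illustrative implementation, not the one used in the project. It iterates with $A + I$ rather than $A$, a common variant that has the same eigenvectors but avoids oscillation on bipartite graphs; the function name, tolerance and iteration cap are assumptions, and the graph is assumed to be connected with at least one edge.

```python
import math


def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality by power iteration.

    `adjacency` maps each node to the set of its neighbours (undirected).
    Each step computes (A + I) x and rescales to unit Euclidean norm.
    """
    x = {v: 1.0 for v in adjacency}
    for _ in range(iterations):
        # (A + I) x: a node's own score plus the scores of its neighbours.
        new = {v: x[v] + sum(x[u] for u in adjacency[v]) for v in adjacency}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        change = max(abs(new[v] - x[v]) for v in adjacency)
        x = new
        if change < tol:
            break
    return x


# Toy path graph a - b - c: the middle node should be the most central.
toy_path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
centrality = eigenvector_centrality(toy_path)
print({v: round(score, 4) for v, score in centrality.items()})
```

Each iteration touches every edge a constant number of times, so the cost per step grows linearly with the number of edges, which is what makes the method practical on graphs with millions of edges.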

Eigenvector centrality found the most central artists to be classical artists, as shown in Figure 6(a), where size indicates

relative popularity and colour shows centrality, with red being high centrality and blue low centrality. This

finding is interesting in its own right as it indicates that classical artists tend to be connected to other central

or important artists. Part of this relationship can be explained by the high degree of many classical artists and

orchestras. There is, however, another interesting finding regarding these artists.

While calculating centrality on samples of the data for computational speed, an interesting change was observed.

As low popularity nodes are removed, the artists having high eigenvector centrality change to the artists in the

core of the rap genre. This is visualised in Figure 6(b). The transition is sharp, occurring at a critical popularity value just below 47. This finding has interesting implications regarding the structure of the network.

Figure 6: (a) Visualisation of the most eigenvector-central core of the network. (b) Change in eigenvector centrality as low popularity nodes are removed.

Figure 7: Joint histograms of node popularity distributions of rap and classical artists.

Such a stark change at a critical value is interesting. An early hypothesis was that critical nodes are removed
at that value. However, the removal of nodes near the critical region from the whole graph produces the same
results, suggesting that no single artist or small set of artists is solely responsible for this change. It should also be

noted that the central artists shown in Figure 6(b) are still included in the graph and are not removed in the

critical region. An explanation is suggested in Figure 7. The popularity distribution of artists in the classical
genre lies to the left of that of artists in the rap genre. As low popularity nodes are filtered, more classical artists are


removed than rap artists. This explains why centrality changes, but fails to explain why it happens so suddenly

at a critical value. Future work will examine how eigenvector centrality changes on simulated graphs as nodes

are removed, but this was not covered in the scope of this project.

6 Genre Subgraphs

Another investigation is the analysis of subgraphs produced by extracting artists from the network that belong

to a particular genre. The edges connecting these artists are still maintained. Figure 8(a) demonstrates the size

distribution of genres, with many genres having few artists. This can be explained by the granular nature of the

genre categorization that Spotify provides. The genres can be highly specific, e.g. ‘swedish pop punk’. However,

many genres exist with a large number of artists, such as pop, rap and other common genres. Furthermore,

while small genres can still contain quite popular artists, the average popularity of a genre has an increasing

lower bound as the size of the genre increases, as seen in Figure 8(b).

Figure 8: (a) Histogram of the number of artists classified into each genre. (b) Number of artists classified into each genre against the average popularity of artists within that genre.

These subgraphs can be analysed using the same techniques as for larger graphs, including the clustering of the

elements within. The clustering coefficient of a vertex $v_i$ with $k_i$ neighbours is:

$$C_i = \frac{2\,\left|\{e_{a,b} : v_a, v_b \in N_i,\ e_{a,b} \in E\}\right|}{k_i(k_i - 1)}$$
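The clustering coefficient defined above counts the edges among a vertex's neighbours and divides by the number of possible neighbour pairs. A minimal sketch follows, assuming the graph is stored as a dictionary of neighbour sets and that vertices with fewer than two neighbours are assigned a coefficient of 0 (a convention the report does not specify); the function names are hypothetical.

```python
from itertools import combinations


def local_clustering(adjacency, v):
    """Clustering coefficient C_i of vertex `v` in an undirected graph."""
    neighbours = adjacency[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention when no neighbour pairs exist
    # Count edges e_{a,b} with both endpoints among the neighbours of v.
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)


# Toy graph: a triangle a-b-c with one pendant vertex d attached to a.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(local_clustering(toy_graph, "a"))
print(average_clustering(toy_graph))
```

For vertex "a", only one of its three neighbour pairs is connected, giving a coefficient of 1/3, while "b" and "c" sit in a closed triangle and score 1.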

The average of $C_i$ gives a measure of how tightly connected the neighbourhoods of typical nodes in a graph are. Figure 9(a) shows that the size of a genre subgraph is not correlated with its average clustering coefficient. Figure 9(b) shows that average popularity in the subgraph is similarly not correlated with the average clustering. This suggests that clustering and closeness within a genre are not key elements for creating popular music.

Figure 9: (a) Average clustering coefficient of vertices against number of artists classified into each genre. (b) Average clustering coefficient of vertices against the average popularity of artists within that genre.

6.1 Genre Coexistence Hypergraph

Another interesting result from the data collected is that a hypergraph can be formed from the artist genres. This hypergraph is formed by connecting artists that share the same genre. A hyperedge indicates when artists are categorized into two genres simultaneously. Examining these hyperedges yields a graph of genre connections. The resulting graph indicates a measure of closeness or coexistence of genres, with weighted edges based on the number of artists that share the two genres. The whole graph is too large to visualise, but the normalised weights for the largest genres are shown in the heatmap of Figure 10.
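The construction of the weighted genre coexistence graph can be sketched as follows. The function name and toy artist data are hypothetical, and the weights here are raw counts; the normalisation used for Figure 10 is not specified in the text, so none is applied.

```python
from collections import Counter
from itertools import combinations


def genre_coexistence(artist_genres):
    """Count, for each pair of genres, the artists classified into both.

    `artist_genres` maps each artist to a list of genre labels.
    Returns a Counter keyed by alphabetically ordered genre pairs.
    """
    weights = Counter()
    for genres in artist_genres.values():
        # Each artist contributes one unit to every pair of their genres.
        for g1, g2 in combinations(sorted(set(genres)), 2):
            weights[(g1, g2)] += 1
    return weights


# Toy classification data (illustrative only).
toy_artists = {
    "artist_1": ["pop", "pop rap"],
    "artist_2": ["pop rap", "southern hip hop"],
    "artist_3": ["pop", "pop rap", "southern hip hop"],
    "artist_4": [],  # unclassified artists add no edges
}

weights = genre_coexistence(toy_artists)
for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

The resulting counts form the upper triangle of a symmetric genre adjacency matrix, which is the kind of structure a heatmap can display.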


Figure 10: Heatmap of the coexistence adjacency of the largest genres.

A secondary graph was formed from a similar process, showing genre collaboration. This has vertices as genres

and weighted edges indicating how many times an artist of one genre has collaborated with an artist of another

genre. This graph provides an alternative method to analyse the closeness of genres. We note that some genres,

e.g. ‘pop rap’ and ‘southern hip hop’, are ‘close’ as expected, while others, e.g. ‘electric house’ and ‘hip hop’,

are distant.

7 Conclusion

The Spotify music artist collaboration graph is a unique and interesting dataset. The graph was collected using
a breadth-first search through the collaboration network via API calls to Spotify’s developer service. This

search produced the largest connected core of the network, but does not contain all artists on Spotify. The full

completeness of the network is unknown but the majority of popular artists are contained within it.

Within the data, additional graphs can be extracted in the form of the genre connection graph. Two types

of connection can be defined, that of cross collaboration and that of coexistence with artists. These networks

present their own interesting properties, worthy of analysis.

A key finding was the dynamics through which the eigenvector centrality of the network changes as low popularity

nodes are removed from the graph. This critical value behaviour is worth further investigation, possibly through

simulation of random graph models.


References

[1] Wallin, N.L., Brown, S. and Merker, B., 2001. The Origins of Music. Cambridge: MIT Press.

[2] Plaugic, L., 2018. “Spotify now has 70 million subscribers”. The Verge, January 4, 2018. Retrieved January 21, 2018.

[3] Newman, M.E.J., 2010. Networks: An Introduction. Oxford University Press.

[4] Feld, S.L., 1991. “Why your friends have more friends than you do”. American Journal of Sociology, 96(6), pp.1464-1477.

[5] Hodas, N.O., Kooti, F. and Lerman, K., 2013. Friendship Paradox Redux: Your Friends Are More Interesting Than You. ICWSM, 13, pp.8-10.

[6] Borgatti, S.P., Carley, K.M. and Krackhardt, D., 2006. On the robustness of centrality measures under conditions of imperfect data. Social Networks, 28(2), pp.124-136.
