Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress
William Lyon
Graph Day - Austin, TX, January 2016
Agenda
• Brief intro to Neo4j graph database
• Modeling US Congress as a graph
• Exploring the 114th Congress
• Finding influential legislators
Neo4j – Key Features
Native Graph Storage: Ensures data consistency and performance
Native Graph Processing: Millions of hops per second, in real time
“Whiteboard Friendly” Data Modeling: Model data as it naturally occurs
High Data Integrity: Fully ACID transactions
Powerful, Expressive Query Language: Requires 10x to 100x less code than SQL
Scalability and High Availability: Vertical and horizontal scaling optimized for graphs
Built-in ETL: Seamless import from other databases
Integration: Drivers and APIs for popular languages
MATCH(A)
Property Graph Model
The Whiteboard Model Is the Physical Model
Relational Versus Graph Models
[Figure: Relational Model versus Graph Model. The relational side uses Person, Friend, and Person-Friend tables; the graph side shows ANDREAS, TOBIAS, MICA, and DELIA as nodes connected by KNOWS relationships.]
Property Graph Model Components
Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled
Relationships
• Relate nodes by type and direction
• Can have name-value properties
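[Figure: example property graph. Two PERSON nodes (name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975) and a CAR node (brand: “Volvo”, model: “V70”), connected by LOVES, LIVES WITH, DRIVES, and OWNS relationships. One relationship carries the property since: Jan 10, 2011.]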
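The components listed above can be sketched in a few lines of Python. This is only an illustration of the property graph idea, not how Neo4j stores data: the `Node` and `Relationship` classes and the `outgoing` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph object with labels and name-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed connection between two nodes; it can hold properties too."""
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


def outgoing(node, relationships, rel_type):
    """Return the end nodes of relationships of rel_type that start at node."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]


dan = Node({"Person"}, {"name": "Dan"})
ann = Node({"Person"}, {"name": "Ann"})
loves = Relationship(dan, "LOVES", ann)

loved = outgoing(dan, [loves], "LOVES")
print([n.properties["name"] for n in loved])
```

Direction matters here: asking for Ann's outgoing LOVES relationships returns nothing, because only the Dan-to-Ann direction was created.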
Cypher Query Language
Cypher: Powerful and Expressive Query Language
CREATE (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Ann"})
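[Figure: the CREATE statement annotated. Dan and Ann are nodes, each with a label (Person) and a property (name); LOVES is the relationship between them.]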
Express Complex Queries Easily with Cypher
Find all direct reports and how many people they manage, up to 3 levels down
MATCH (boss)-[:MANAGES*0..3]->(sub),
      (sub)-[:MANAGES*1..3]->(report)
WHERE boss.name = "John Doe"
RETURN sub.name AS Subordinate, count(report) AS Total
[Figure: side-by-side comparison of the Cypher query and the equivalent SQL query]
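To make the variable-length pattern concrete, here is a rough Python sketch of what the query computes. It assumes a strict hierarchy in which each person has exactly one manager, so counting reachable people matches counting paths. The `manages` data and the `reachable` helper are hypothetical; only the name "John Doe" comes from the query.

```python
# Hypothetical reporting structure: manager -> direct reports.
manages = {
    "John Doe": ["Alice", "Bob"],
    "Alice": ["Carol"],
    "Carol": ["Dave"],
    "Bob": [],
    "Dave": [],
}


def reachable(start, min_hops, max_hops):
    """People reachable from start in min_hops..max_hops MANAGES steps."""
    found = []
    frontier = [start]
    for hops in range(max_hops + 1):
        if hops >= min_hops:
            found.extend(frontier)
        # Advance one MANAGES hop from everyone in the current frontier.
        frontier = [r for m in frontier for r in manages.get(m, [])]
    return found


totals = {}
# (boss)-[:MANAGES*0..3]->(sub): zero hops means the boss is included.
for sub in reachable("John Doe", 0, 3):
    # (sub)-[:MANAGES*1..3]->(report): at least one hop below sub.
    reports = reachable(sub, 1, 3)
    if reports:  # MATCH drops subordinates with no reports
        totals[sub] = len(reports)

print(totals)
```

On this sample data, people with nobody below them (Bob and Dave) drop out of the result, just as the non-optional MATCH would drop them.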
http://www.opencypher.org/
Getting Data into Neo4j
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
Command-Line Bulk Loader
neo4j-import
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second
4.58 million things and their relationships…
Loads in 100 seconds!
Neo4j
Graph Database
• Property graph data model
• Nodes and relationships
• Native graph processing
• Cypher query language
Graphing US Congress
https://github.com/legis-graph/legis-graph
LOAD CSV WITH HEADERS FROM "file:///legislators.csv" AS line
MERGE (l:Legislator {thomasID: line.thomasID})
SET l = line
MERGE (s:State {code: line.state})<-[:REPRESENTS]-(l)
…
US Congress
Which legislators represent Texas?
MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator) RETURN l,s;
…include congressional body and party
MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
MATCH (p:Party)<-[:IS_MEMBER_OF]-(l)-[:ELECTED_TO]->(b:Body)
RETURN b,l,s,p;
How to find influential legislators?
Bill Sponsorship
Bill Cosponsorship
Degree centrality
Bill Cosponsorship
• Cosponsors are “influenced by” bill sponsors
• Add INFLUENCED_BY relationships
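One way to picture this step is as a small Python sketch that derives INFLUENCED_BY edges from sponsorship records and then reads degree centrality off the result. The `bills` records and legislator names are made up for illustration, and weighting each edge by the number of shared bills is an assumption of this example, not something the slides specify.

```python
from collections import Counter

# Hypothetical sponsorship records; real data would come from legis-graph.
bills = [
    {"id": "bill-1", "sponsor": "A", "cosponsors": ["B", "C"]},
    {"id": "bill-2", "sponsor": "A", "cosponsors": ["B"]},
    {"id": "bill-3", "sponsor": "B", "cosponsors": ["C"]},
]

# (cosponsor, sponsor) -> number of bills backing that INFLUENCED_BY edge.
influenced_by = Counter()
for bill in bills:
    for cosponsor in bill["cosponsors"]:
        if cosponsor != bill["sponsor"]:
            influenced_by[(cosponsor, bill["sponsor"])] += 1

# Weighted in-degree: how often each legislator's bills were cosponsored.
in_degree = Counter()
for (_, sponsor), weight in influenced_by.items():
    in_degree[sponsor] += weight

print(dict(influenced_by))
print(dict(in_degree))
```

The edge points from cosponsor to sponsor, so a high in-degree marks a legislator whose bills attract many cosponsors.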
Betweenness centrality
The number of times a node acts as a bridge along the shortest path between two other nodes.
https://en.wikipedia.org/wiki/Betweenness_centrality
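The definition above can be turned into a short sketch using Brandes' algorithm for unweighted graphs. This is an illustrative implementation on a made-up five-node path graph, not the code used for the results in this talk. The `betweenness` function and the `path` data are names chosen for this example.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted graph given as {node: [neighbors]}.

    Scores count ordered (source, target) pairs, so halve them for an
    undirected graph.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma = {v: 0 for v in graph}    # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# Undirected path A - B - C - D - E, stored with edges in both directions.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
undirected = {v: s / 2 for v, s in betweenness(path).items()}
print(undirected)
```

The middle node C bridges the most pairs, while the endpoints A and E bridge none.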
image credit: https://en.wikipedia.org/wiki/PageRank
PageRank: Cypher approximation
UNWIND range(1,10) AS round
MATCH (l:Legislator) WHERE rand() < 0.1
MATCH (l:Legislator)-[:INFLUENCED_BY]->(o:Legislator)
SET o.rank = coalesce(o.rank,0) + 1;
http://neo4j.com/blog/using-neo4j-hr-analytics/
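The sampling idea behind the Cypher approximation can be mirrored in plain Python. The sketch below assumes an edge list of (influenced, influencer) pairs. The `approximate_rank` function, its parameters, and the sample edges are hypothetical names for this illustration.

```python
import random


def approximate_rank(edges, rounds=10, sample_rate=0.1, seed=None):
    """Repeatedly sample source nodes and credit the nodes they point to."""
    rng = random.Random(seed)
    sources = sorted({src for src, _ in edges})
    rank = {}
    for _ in range(rounds):  # UNWIND range(1,10) AS round
        # WHERE rand() < 0.1: keep roughly sample_rate of the sources.
        sampled = {s for s in sources if rng.random() < sample_rate}
        for src, dst in edges:
            if src in sampled:
                # SET o.rank = coalesce(o.rank,0) + 1
                rank[dst] = rank.get(dst, 0) + 1
    return rank


edges = [("B", "A"), ("C", "A"), ("C", "B")]
# With sample_rate=1.0 every source is picked in every round, so the
# result is simply rounds * in-degree.
full = approximate_rank(edges, rounds=10, sample_rate=1.0)
print(full)
```

Because this only follows one hop, it behaves like a randomly sampled in-degree count rather than a true PageRank.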
Neo4j server extensions with Java
curl http://localhost:7474/service/v1/pagerank/Person/KNOWS
PageRank: Graph processing server extension
https://github.com/maxdemarzi/graph_processing
PageRank
neo4j-noderank
https://github.com/graphaware/neo4j-noderank
Two issues
• Local vs global
• Iterative algorithms and graph complexity
Local vs global
• Local: OLTP / realtime
• Global: offline / batch
For iterative algorithms like PageRank, it’s all about the complexity of the graph: lots of paths, lots of iterations.
Graph complexity
PageRank
Graph global! Iterative!
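A minimal power-iteration sketch shows why PageRank is both graph global and iterative: every round touches every node and every edge. It is a textbook formulation written for illustration, and the damping factor of 0.85, the iteration count, and the sample edges are assumptions rather than values from this talk.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its rank along its outgoing edges in equal shares.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical INFLUENCED_BY edges: (cosponsor, sponsor).
ranks = pagerank([("B", "A"), ("C", "A"), ("C", "B")])
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each pass costs time proportional to the number of nodes plus edges, which is why dense graphs and many iterations push this kind of workload toward batch processing.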
• Efficient in-memory data processing and machine learning platform
• Graph analytics with GraphX
• In-memory message passing algorithm
Apache Spark is a fast and general engine for large-scale data processing.
http://spark.apache.org/
PageRank: Spark with Neo4j - Scala
https://github.com/AnormCypher/AnormCypher
import org.anormcypher._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

val total = 100000000
val batch = total/1000000
// Read INFLUENCED_BY pairs from Neo4j in pages of 1,000,000 rows
val links = sc.range(0,batch).repartition(batch).mapPartitionsWithIndex( (i,p) => {
  val dbConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val q = "MATCH (l1:Legislator)-[:INFLUENCED_BY]->(l2:Legislator) RETURN id(l1) as from, id(l2) as to skip {skip} limit 1000000"
  p.flatMap( skip => {
    Cypher(q).on("skip"->skip*1000000).apply()(dbConn).map(row =>
      (row[Int]("from").toLong,row[Int]("to").toLong)
    )
  })
})

links.cache
links.count

// Build a GraphX graph from the node id pairs and run 5 PageRank iterations
val edges = links.map( l => Edge(l._1,l._2, None))
val g = Graph.fromEdges(edges,"none")
val v = PageRank.run(g, 5).vertices
Extract subgraph. Run PageRank using Spark GraphX.
// Write each partition's scores back to Neo4j with one batched UNWIND statement
val res = v.repartition(total/100000).mapPartitions( part => {
  val localConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val updateStmt = Cypher("UNWIND {updates} as update MATCH (p) where id(p) = update.id SET p.pagerank = update.rank")
  val updates = part.map( v => Map("id"->v._1.toLong, "rank" -> v._2.toDouble))
  val count = updateStmt.on("updates"->updates).execute()(localConn)
  Iterator(part.size)
})
Write back to graph
PageRank
Mazerunner
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
• Enables two-way ETL between Spark and Neo4j
• Run GraphX jobs from data in Neo4j
• Write results back to Neo4j
• Support for:
  • PageRank
  • Closeness Centrality
  • Betweenness Centrality
  • Triangle Counting
  • Connected Components
  • Strongly Connected Components
https://github.com/neo4j-contrib/neo4j-mazerunner
curl http://localhost:7474/service/mazerunner/analysis/pagerank/INFLUENCED_BY
Who are the influential legislators?
Influential legislators by topic
graphdatabases.com
http://graphgist.neo4j.com/
http://portal.graphgist.org/challenge/index.html
Links
• http://www.lyonwj.com/2015/09/20/legis-graph-congressional-data-using-neo4j/
• http://www.lyonwj.com/2015/10/11/congressional-pagerank/
• https://github.com/legis-graph/legis-graph
• https://github.com/neo4j-contrib/neo4j-mazerunner
• http://www.kennybastani.com/2014/11/graph-analytics-docker-spark-neo4j.html
• http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html