1
Introduction to Graphs
15-211 Fundamental Data Structures and Algorithms
Aleks Nanevski
March 16, 2004
2
Announcements
Homework Assignment #6 out today
Building a Web Search Engine
  Web Crawler (due March 25)
  Web Reader (due March 25)
  Web Search (due April 5)
Start early and ask questions
3
Graphs: an overview
Vertices (aka nodes)
4
Graphs: an overview
Vertices (aka nodes)
[Figure: six airports drawn as vertices: PIT, BOS, JFK, DTW, LAX, SFO]
5
Graphs: an overview
Vertices (aka nodes)
Edges
[Figure: vertices PIT, BOS, JFK, DTW, LAX, SFO joined by edges]
6
Graphs: an overview
Vertices (aka nodes)
Edges
Undirected
[Figure: vertices PIT, BOS, JFK, DTW, LAX, SFO joined by undirected edges]
7
Graphs: an overview
Vertices (aka nodes)
Edges
Undirected
Weights
[Figure: vertices PIT, BOS, JFK, DTW, LAX, SFO joined by undirected edges, each edge labeled with a numeric weight]
8
Examples of Graphs
[Figure: Vanguard Airlines Route Map]
Questions
  What is the fastest way to get from Pittsburgh to St. Louis?
  What is the cheapest way to get from Pittsburgh to St. Louis?
9
Web Graph
[Figure: web pages connected to one another by <href …> references]
Web pages are nodes (vertices)
HTML references are links (edges)
10
Graphs as models
Physical objects are often modeled by meshes, which are a particular kind of graph structure.
[Figure: mesh, by Jonathan Shewchuk]
11
Structure of the Internet
[Figure: diagram with the labels Europe, Japan, Australia, Backbone 1, Backbone 2, Backbone 3, Backbone 4, 5, N, Regional A, Regional B, and four NAPs. SOURCE: CISCO SYSTEMS]
12
Relationship graphs
Graphs are also used to model relationships among entities.
  Scheduling and resource constraints.
  Inheritance hierarchies.
[Figure: graph over the course numbers 15-113, 15-151, 15-211, 15-212, 15-213, 15-251, 15-312, 15-411, 15-412, 15-451, 15-462]
13
Terminology
Graph G = (V, E)
  Set V of vertices (nodes)
  Set E of edges
    Elements of E are pairs (v, w) where v, w ∈ V.
    An edge (v, v) is a self-loop. (Usually assume no self-loops.)
Weighted graph
  Elements of E are triples (v, w, x) where x is a weight, i.e., a cost associated with the edge.
14
Terminology, cont’d
Directed graph (digraph)
  The edge pairs are ordered
  Every edge has a specified direction
Undirected graph
  The edge pairs are unordered
  E is a symmetric relation: (v, w) ∈ E implies (w, v) ∈ E
  In an undirected graph, (v, w) and (w, v) are usually treated as though they are the same edge
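
The definitions above can be sketched in a few lines of Python. This is only an illustration: the vertex names reuse the airport codes, but the particular edges and weights are made up, and the helper names `symmetric_closure` and `has_self_loop` are hypothetical.

```python
# Hypothetical vertex and edge sets; the specific edges are illustrative only.
V = {"PIT", "BOS", "JFK", "DTW"}

# Directed graph: each edge is an ordered pair (v, w).
E_directed = {("PIT", "BOS"), ("BOS", "JFK"), ("JFK", "PIT")}

# Weighted graph: each edge is a triple (v, w, x), where x is the weight.
# The weights here are invented for the example.
E_weighted = {("PIT", "BOS", 5), ("BOS", "JFK", 2)}


def symmetric_closure(edges):
    """Return the undirected view of a set of pairs: (v, w) implies (w, v)."""
    return edges | {(w, v) for (v, w) in edges}


def has_self_loop(edges):
    """True if some edge has the form (v, v)."""
    return any(v == w for (v, w) in edges)


E_undirected = symmetric_closure(E_directed)
```

Storing both (v, w) and (w, v) is one way to model an undirected edge; treating the two as a single unordered pair is the other common choice.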
15
Directed Graph (digraph)
Vertices (aka nodes)
Edges
[Figure: vertices PIT, BOS, JFK, DTW, LAX, SFO joined by directed edges]
16
Undirected Graph
Vertices (aka nodes)
Edges
[Figure: vertices PIT, BOS, JFK, DTW, LAX, SFO joined by undirected edges]
17
Paths
A path is a sequence of edges from one node to another.
The length of a path is the number of edges in it.
[Figure: graph on the vertices PIT, BOS, JFK, DTW, LAX, SFO]
Question: What are the paths of length 2 in the above graph?
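
One way to answer a question like this mechanically is to pair up edges that share a middle vertex. The sketch below assumes a directed edge set; the edges are hypothetical and are not the ones drawn on the slide.

```python
def paths_of_length_2(edges):
    """Return all vertex sequences (u, v, w) with (u, v) and (v, w) in edges."""
    return sorted((u, v, w)
                  for (u, v) in edges
                  for (v2, w) in edges
                  if v == v2)


# Hypothetical directed edges, not the ones in the slide's figure.
edges = {("PIT", "JFK"), ("JFK", "BOS"), ("JFK", "LAX"), ("LAX", "SFO")}
two_step = paths_of_length_2(edges)
```

With these edges, `two_step` holds three paths: JFK to SFO via LAX, and PIT to BOS and to LAX via JFK.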
18
Paths
A simple path is a path where no vertex is repeated
  (the first and last vertices can be the same)
Question: What is an example of a simple path? A non-simple path?
[Figure: graph on the vertices PIT, BOS, JFK, DTW, LAX, SFO]
19
Connected graphs
Connected graph
  A graph is connected if for all u, v in V, there exists a path from u to v.
  A directed graph is strongly connected if for all u, v in V, there exists a directed path from u to v.
  A directed graph is weakly connected if the underlying undirected graph, obtained by ignoring edge directions, is connected.
Complete graph
  A graph in which every pair of nodes is connected directly by an edge
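
A small sketch of how these properties might be checked, assuming the graph is given as a dictionary from each vertex to a list of its out-neighbours. Weak connectivity is taken here in the usual sense (connected once edge directions are ignored). The function names and the sample graph are hypothetical.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of vertices reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(vertices, adj):
    """True if every vertex can reach every other vertex along directed edges."""
    return all(reachable(adj, v) == set(vertices) for v in vertices)


def is_weakly_connected(vertices, adj):
    """True if the graph is connected once edge directions are ignored."""
    if not vertices:
        return True
    undirected = {v: set() for v in vertices}
    for u, targets in adj.items():
        for v in targets:
            undirected[u].add(v)
            undirected[v].add(u)
    start = next(iter(vertices))
    return reachable(undirected, start) == set(vertices)


# Hypothetical three-vertex digraph: a -> b -> c, with no way back to a.
vertices = ["a", "b", "c"]
adj = {"a": ["b"], "b": ["c"], "c": []}
```

This sample graph is weakly but not strongly connected. Running one search per vertex is simple but costs O(|V| (|V| + |E|)); faster strong-connectivity tests exist.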
20
Cycles
A cycle (in a directed graph) is a path of length at least 1 that begins and ends in the same vertex.
A directed acyclic graph (DAG) is a directed graph having no cycles.
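
Whether a directed graph is a DAG can be tested with a depth-first search that looks for an edge back to a vertex still on the current search path. The sketch below is one standard way to do this, not necessarily the one used in the course. It assumes every vertex appears as a key of `adj`, and the name `has_cycle` is hypothetical.

```python
def has_cycle(adj):
    """Detect a directed cycle with depth-first search and three colors."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {v: WHITE for v in adj}

    def visit(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:            # back edge: v is on the current path
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in adj)


# Hypothetical examples.
dag = {"a": ["b", "c"], "b": ["c"], "c": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

The recursion is fine for small graphs; a very deep graph would need an explicit stack instead.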
21
So, is this a connected graph?
Cyclic or Acyclic?
Directed or Undirected?
22
Directed graph (unconnected)
Cyclic or Acyclic?
23
Tree is a Graph
A connected graph with no cycles is called a tree. This is a general definition of a tree.
[Figure: two drawings, each on the vertices A, B, C, D, E, F, G]
24
Edges and Degree
Degree
  number of edges incident to a vertex
In a directed graph
  in-degree(v): number of edges into vertex v
  out-degree(v): number of edges from vertex v
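
In-degree and out-degree can be computed in one pass over the edge set. This is a minimal sketch, with a made-up edge set and a hypothetical function name.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a set of directed edges."""
    out_degree = Counter(v for (v, w) in edges)
    in_degree = Counter(w for (v, w) in edges)
    return in_degree, out_degree


# Hypothetical directed edges.
edges = {("a", "b"), ("a", "c"), ("b", "c")}
in_degree, out_degree = degrees(edges)
```

A `Counter` returns 0 for vertices it has never seen, so `in_degree["a"]` is 0 here without any special handling.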
25
Dense Graphs
For most graphs |E| ≤ |V|^2
A dense graph is one in which most edges are present
  |E| = Θ(|V|^2)
  large number of edges (quadratic)
Also, |E| > |V| log |V| can be considered dense
26
Sparse Graphs
A sparse graph is a graph with relatively few edges
  no clear definition
  metric for sparsity: |E| < |V| log |V|
Examples of sparse graphs
  computer network
Representing Graphs
28
Graph operations
Navigation in a graph.
  (v, w) ∈ E
  R_v = {w | (v, w) ∈ E}
  D_w = {v | (v, w) ∈ E}
Enumeration of the elements of a graph.
  E
  V
Size of a graph
  |V|
  |E|
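
These operations map directly onto set operations when E is stored as a set of pairs. The sketch below uses a made-up three-vertex graph; the names `successors` and `predecessors` are hypothetical labels for the sets R_v and D_w.

```python
def successors(E, v):
    """R_v = {w | (v, w) in E}: the vertices one step away from v."""
    return {w for (u, w) in E if u == v}


def predecessors(E, w):
    """D_w = {v | (v, w) in E}: the vertices with an edge into w."""
    return {v for (v, u) in E if u == w}


# Hypothetical graph.
V = {1, 2, 3}
E = {(1, 2), (1, 3), (2, 3)}

edge_present = (1, 2) in E          # navigation: membership test
size = (len(V), len(E))             # size of the graph: |V| and |E|
```

Each call scans all of E, which is fine for illustration. A real representation would store R_v per vertex so that navigation does not cost O(|E|).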
29
Representing graphs
Adjacency matrix
  2-dimensional array
  For each edge (u,v), set A[u][v] to true; otherwise false

[Figure: graph on the vertices 1 to 7]

      1  2  3  4  5  6  7
  1         x  x
  2            x  x
  3                  x
  4         x        x  x
  5            x        x
  6
  7

Adjacency lists
  For each vertex, keep a list of adjacent vertices

  1: 3 4
  2: 4 5
  3: 6
  4: 3 6 7
  5: 4 7
  6:
  7:

Q: How to represent weights?
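
Both representations can be sketched in Python for the seven-vertex example. The lists used here (1: 3 4, 2: 4 5, 3: 6, 4: 3 6 7, 5: 4 7, 6 and 7 empty) are an interpretation of the extracted slide. The weighted variant is one possible answer to the question about weights, not the slide's own, and its weights are invented.

```python
# Adjacency lists as read from the slide's example (an interpretation of the
# extracted text): vertex -> list of adjacent vertices.
adj_list = {
    1: [3, 4],
    2: [4, 5],
    3: [6],
    4: [3, 6, 7],
    5: [4, 7],
    6: [],
    7: [],
}

n = len(adj_list)

# Adjacency matrix: A[u][v] is True exactly when (u, v) is an edge.
# Row and column 0 are unused so that vertex numbers index directly.
A = [[False] * (n + 1) for _ in range(n + 1)]
for u, neighbours in adj_list.items():
    for v in neighbours:
        A[u][v] = True

# One possible way to hold weights (an assumption, not from the slides):
# store (neighbour, weight) pairs in the lists, or numbers in the matrix
# with None meaning "no edge". The weights below are made up.
weighted_list = {1: [(3, 2.5), (4, 1.0)], 3: [(6, 4.0)]}
W = [[None] * (n + 1) for _ in range(n + 1)]
for u, pairs in weighted_list.items():
    for v, x in pairs:
        W[u][v] = x
```

Using `None` rather than 0 for a missing edge keeps a genuine zero-weight edge distinguishable from no edge at all.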
30
Choosing a representation
Size of V relative to size of E is a primary factor.
  Dense: |E|/|V| is large
  Sparse: |E|/|V| is small
  Adjacency matrix is expensive if the graph is sparse. WHY?
  Adjacency list is expensive if the graph is dense. WHY?
Dynamic changes to V.
  Adjacency matrix is expensive to copy/extend if V is extended. WHY?
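
A rough way to see the trade-off is to count storage cells. The sketch below assumes a matrix needs |V|^2 cells and adjacency lists need about |V| + |E| entries, ignoring per-entry overhead, and it takes the logarithm in the sparsity rule of thumb to be base 2. Both are assumptions, and the function names are hypothetical.

```python
import math


def matrix_cells(num_vertices):
    """Cells in a |V| x |V| adjacency matrix, whatever the number of edges."""
    return num_vertices * num_vertices


def list_entries(num_vertices, num_edges):
    """One list head per vertex plus one entry per directed edge."""
    return num_vertices + num_edges


def looks_sparse(num_vertices, num_edges):
    """Rule of thumb: sparse when |E| < |V| log |V| (base 2 assumed here)."""
    return num_edges < num_vertices * math.log2(num_vertices)


# Hypothetical sizes: 1,000 vertices with 5,000 edges versus 500,000 edges.
sparse_case = (matrix_cells(1000), list_entries(1000, 5000))
dense_case = (matrix_cells(1000), list_entries(1000, 500000))
```

In the sparse case the matrix holds a million cells for six thousand list entries. In the dense case the entry counts are comparable, and a matrix cell can be a single bit while a list entry carries a vertex number and a link, which is why lists lose their advantage there.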
Graphs: Application
Search Engines
32
Search Engines
33
What are they?
Tools for finding information on the Web
  Problem: “hidden” databases, e.g. New York Times (i.e., databases of keywords hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)
Search engine
  A machine-constructed index (usually by keyword)
So many search engines, we need search engines to find them.
  Searchenginecollosus.com
34
What are they?
Search engines: key tools for ecommerce
  Buyers and sellers must find each other
How do they work?
How much do they index?
Are they reliable?
How are hits ordered?
Can the order be changed?
35
The Process
1. Acquire the collection, i.e. all the documents [Off-line process]
2. Create an inverted index [Off-line process]
3. Match queries to documents [On-line process, the actual retrieval]
4. Present the results to user [On-line process: display, summarize, ...]
36
SE Architecture
Spider
  Crawls the web to find pages. Follows hyperlinks. Never stops
Indexer
  Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)
Retriever
  Query interface
  Database lookup to find hits
    Billions of documents
    1 TB RAM, many terabytes of disk
  Ranking
37
Did you know?
The concept of a Web spider was developed by Dr. Fuzzy Mauldin
Implemented in 1994 on the Web
Went into the creation of Lycos
Lycos propelled CMU into the top 5 most successful schools
  Commercialization proceeds
Tangible evidence
  Newell-Simon Hall
Dr. Michael L. (Fuzzy) Mauldin
38
Did you know?
A meta-search engine: send a query to many search engines, then rank and cluster the results (especially useful if you’re not sure which keywords produce optimal results).
Vivisimo was developed here at CMU
  Developed by Prof. Raul Valdes-Perez
  Developed in 2000
39
A look at
> 20,000 servers
Spiders over:
  4.28 billion pages
  880 million images
Supports 35 non-English languages
Over 200 million queries per day
40
Google’s server farm
41
Web Spiders
Start with an initial page P0. Find URLs on P0 and add them to a queue
When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
Can be specialized (e.g. only look for email addresses)
Issues
  Which page to look at next? (Special subjects, recency)
  Avoid overloading a site
  How deep within a site do you go (depth search)?
  How frequently to visit pages?
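
The queue-driven loop described above can be sketched without any network access by crawling an in-memory link graph. Everything named here is hypothetical: `links` stands in for fetching a page and extracting its URLs, `index_page` stands in for the indexing program, and the `seen` set and `max_pages` limit are additions not mentioned on the slide, included so the loop terminates and does not revisit pages.

```python
from collections import deque


def crawl(start, links, index_page, max_pages=100):
    """Breadth-first crawl of an in-memory link graph.

    links maps a page to the URLs found on it; index_page is called once
    per visited page. Both stand in for real fetching and indexing.
    """
    queue = deque([start])
    seen = {start}          # pages already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in links.get(page, []):   # find URLs on the page
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)                  # pass the page to the indexer
        visited.append(page)
    return visited


# Hypothetical four-page web.
links = {"P0": ["P1", "P2"], "P1": ["P2", "P3"], "P2": ["P0"], "P3": []}
indexed = []
order = crawl("P0", links, indexed.append)
```

A first-in, first-out queue gives breadth-first order. Swapping in a priority queue is one way to address the "which page next" issue, and a per-site delay would address overloading; neither is shown here.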
42
So, why Spider the Web?
User Perceptions
Most annoying: Engine finds nothing (too small an index, but not an issue since 1997 or so).
Somewhat annoying: Obsolete links
  => Refresh collection by deleting dead links
  => OK if index is slightly smaller
  => Done every 1-2 weeks in best engines
Mildly annoying: Failure to find new site
  => Re-spider entire web
  => Done every 2-4 weeks in best engines
43
So, why Spider the Web?, cont’d
Cost of Spidering
  Semi-parallel algorithmic decomposition
  Spider can (and does) run on hundreds of servers simultaneously
  Very high network connectivity
  Servers can migrate from spidering to query processing depending on time-of-day load
  Running a full web spider takes days even with hundreds of dedicated servers
44
Indexing
Arrangement of data (data structure) to permit fast searching
Which list is easier to search?
  sow fox pig eel yak hen ant cat dog hog
  ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?
  Permits binary search. About log2(n) probes into list
    log2(1 billion) ~ 30
  Permits interpolation search. About log2(log2(n)) probes
    log2(log2(1 billion)) ~ 5
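
A minimal sketch of binary search over the sorted animal list, counting probes, followed by the two estimates for one billion items. The function name and the probe counter are illustrative additions.

```python
import math


def binary_search(sorted_items, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
index, probes = binary_search(animals, "pig")

# The estimates quoted above, for n = 1 billion.
binary_probes = math.log2(1e9)                    # about 29.9
interpolation_probes = math.log2(math.log2(1e9))  # about 4.9
```

Each probe halves the remaining range, which is where the log2(n) bound comes from. The log2(log2(n)) figure for interpolation search is an average that assumes keys are roughly uniformly distributed.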
45
Inverted Files
A file is a list of words by position
  - First entry is the word in position 1 (first word)
  - Entry 4562 is the word in position 4562 (4562nd word)
  - Last entry is the last word
An inverted file is a list of positions by word!

[Figure: the text above shown as FILE, with POS markers 1, 10, 20, 30, 36 at the start of its five lines]

INVERTED FILE
  a (1, 4, 40)
  entry (11, 20, 31)
  file (2, 38)
  list (5, 41)
  position (9, 16, 26)
  positions (43)
  word (14, 19, 24, 29, 35, 45)
  words (7)
  4562 (21, 27)
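
The inversion can be reproduced by numbering the words of the slide's own five lines of text. This is a sketch under stated assumptions: words are lowercased, punctuation is dropped, and positions start at 1. The function name `invert` is hypothetical.

```python
import re
from collections import defaultdict


def invert(text):
    """Map each lowercased word to the list of its 1-based positions."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    inverted = defaultdict(list)
    for position, word in enumerate(words, start=1):
        inverted[word].append(position)
    return dict(inverted)


text = """A file is a list of words by position
First entry is the word in position 1 (first word)
Entry 4562 is the word in position 4562 (4562nd word)
Last entry is the last word
An inverted file is a list of positions by word!"""

inverted = invert(text)
```

Counted this way, the text has 45 words, "word" appears at positions 14, 19, 24, 29, 35 and 45, and "positions" appears once, at position 43.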
46
Inverted Files for Multiple Documents

LEXICON
  WORD        NDOCS  PTR
  jezebel     20
  jezer       3
  jezerit     1
  jeziah      1
  jeziel      1
  jezliah     1
  jezoar      1
  jezrahliah  1
  jezreel     39

WORD INDEX
  DOCID  OCCUR  POS 1  POS 2  . . .

  107    4      322 354 381 405
  232    6      15 195 248 1897 1951 2192
  677    1      481
  713    3      42 312 802

  34     6      1 118 2087 3922 3981 5002
  44     3      215 2291 3010
  56     4      5 22 134 992

  566    3      203 245 287

  67     1      132
  . . .

[Figure: PTR arrows link lexicon entries to their rows in the word index; one pointer is labeled "jezoar"]

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
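
The lexicon and word index can be sketched as a nested dictionary: word, then document id, then positions. The collection below is a made-up two-document example, and the function names are hypothetical. NDOCS and OCCUR are derived from the stored positions rather than stored separately, which is a simplification of the layout on the slide.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Return word -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def lexicon_row(index, word):
    """Return (word, NDOCS): the number of documents containing word."""
    return word, len(index.get(word, {}))


def postings(index, word):
    """Return sorted rows (DOCID, OCCUR, positions) for word."""
    return [(doc_id, len(pos), pos)
            for doc_id, pos in sorted(index.get(word, {}).items())]


# Hypothetical two-document collection.
documents = {34: "jezebel met jezebel", 44: "no one saw jezebel"}
index = build_index(documents)
```

Here `postings(index, "jezebel")` gives one row per document, in the same DOCID, OCCUR, positions shape as the word index above.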
47
Ranking (Scoring) Hits
Hits must be presented in some order
What order?
  Relevance, recency, popularity, reliability?
Some ranking methods
  Presence of keywords in title of document
  Closeness of keywords to start of document
  Frequency of keyword in document
  Link popularity (how many pages point to this one)
48
Ranking (Scoring) Hits, cont’d
Can the user control? Can the page owner control?
Can you find out what order is used?
Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index)
49
Link Popularity
How many pages link to this page?
  on the whole Web
  in our database?
http://www.linkpopularity.com
Link popularity is used for ranking
  Many measures
  Number of links in (In-links)
  Weighted number of links in (by weight of referring page)
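
Both measures can be sketched over a link graph stored as a dictionary from each page to the pages it links to. The graph and the page weights below are made up. Counting each referring page once and ignoring self-links are assumptions of this sketch, not rules from the slides.

```python
from collections import Counter


def in_link_counts(links):
    """Count, for each page, how many distinct pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each referring page once
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts


def weighted_in_links(links, weight):
    """Sum the weight of each referring page instead of counting 1 per link."""
    scores = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                scores[target] += weight.get(source, 0)
    return scores


# Hypothetical link graph and page weights.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c", "c"]}
weight = {"a": 3, "b": 1, "c": 2, "d": 1}

counts = in_link_counts(links)
scores = weighted_in_links(links, weight)
```

Page "c" has three in-links here. Under the weighted measure it scores 5, because the link from "a" counts for more than those from "b" and "d".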
50
Search Engine Sizes
  ATW  AllTheWeb
  AV   Altavista
  GG   Google
  INK  Inktomi
  TMA  Teoma
SOURCE: SEARCHENGINEWATCH.COM
Billions of textual documents indexed.
51
Search Engine Sizes, cont’d
SOURCE: SEARCHENGINEWATCH.COM
Billions of textual documents indexed (as of Sept 2, 2003)
  ATW  AllTheWeb
  AV   Altavista
  GG   Google
  INK  Inktomi
  TMA  Teoma
52
Current Status of Web Spiders
Historical Notes
  WebCrawler: first documented spider
  Lycos: first large-scale spider
53
Current Status of Web Spiders
Enhanced Spidering
  In-link counts to pages can be established during spidering.
  Hint: Hmmm… hw6?
  In-link counts are the basis for Google’s page-rank method
54
Current Status of Web Spiders
Unsolved Problems
  Most spidering re-traverses stable web graph
    => on-demand re-spidering when changes occur
  Completeness or near-completeness is still a major issue
  Cannot spider information stored in local DB
  Spidering non-textual data
55
Reading
About graphs:
  Chapter 14.
About Web search:
  http://www.cadenza.org/search_engine_terms/srchad.htm