1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16,...

55
1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004

description

3 Graphs — an overview Vertices (aka nodes)

Transcript of 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16,...

Page 1: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

1

Introduction to Graphs

15-211 Fundamental Data Structures and Algorithms

Aleks NanevskiMarch 16, 2004

Page 2: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

2

Announcements Homework Assignment #6 out today Building a Web Search Engine

Web Crawler (due March 25)Web Reader (due March 25)Web Search (due April 5)

Start early and ask questions

Page 3: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

3

Graphs — an overviewVertices (aka nodes)

Page 4: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

4

Graphs — an overview

PIT

BOS

JFK

DTW

LAX

SFO

Vertices (aka nodes)

Page 5: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

5

Graphs — an overview

PIT

BOS

JFK

DTW

LAX

SFO

Vertices (aka nodes)

Edges

Page 6: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

6

Undirected

Graphs — an overview

PIT

BOS

JFK

DTW

LAX

SFO

Vertices (aka nodes)

Edges

Page 7: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

7

Undirected

Graphs — an overview

PIT

BOS

JFK

DTW

LAX

SFO

Vertices (aka nodes)

Edges

1987

2273

3442145

2462

618

211318

190

Weights

Page 8: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

8

Examples of Graphs

QuestionsWhat is the fastest way to get from Pittsburgh

to St Louis?What is the cheapest way to get from Pittsburgh

to St Louis?

Vanguard Airlines Route Map

Page 9: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

9

Web Graph

<href …>

<href …>

<href …>

<href …>

<href …>

<href …>

<href …>

Web Pages are nodes (vertices)HTML references are links (edges)

Page 10: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

10

Graphs as models Physical objects are often modeled

by meshes, which are a particular kind of graph structure.

By Jonathan Shewchuk

Page 11: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

11

Structure of the Internet

Europe

JapanBackbone 1

Backbone 2

Backbone 3

Backbone 4, 5, N

Australia

Regional A

Regional B

NAP

NAP

NAP

NAP

SOURCE: CISCO SYSTEMS

Page 12: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

12

Relationship graphs Graphs are also used to model

relationships among entities.Scheduling and resource constraints.Inheritance hierarchies.

15-113 15-151

15-211

15-212 15-213

15-312

15-411 15-412

15-251

15-45115-462

Page 13: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

13

Terminology Graph G = (V,E)

Set V of vertices (nodes)Set E of edges

Elements of E are pair (v,w) where v,w V. An edge (v,v) is a self-loop. (Usually assume no self-

loops.)

Weighted graphElements of E are (v,w,x) where x is a weight,

i.e., a cost associated with the edge.

Page 14: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

14

Terminology, cont’d Directed graph (digraph)

The edge pairs are orderedEvery edge has a specified direction

Undirected graphThe edge pairs are unordered

E is a symmetric relation (v,w) E implies (w,v) E In an undirected graph (v,w) and (w,v) are usually

treated as though they are the same edge

Page 15: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

15

Directed Graph (digraph)

PIT

BOS

JFK

DTW

LAX

SFO

Vertices (aka nodes)

Edges

Page 16: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

16

Undirected Graph

PIT

BOS

JFK

DTW

LAX

SFO

Vertices (aka nodes)

Edges

Page 17: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

17

PathsA path is a sequence of edges from one node

to anotherA length of a path is the number of edges in

it.

PIT

BOS

JFK

DTW

LAX

SFO

Question: What are paths of length 2 in the above graph?

Page 18: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

18

PathsA simple path is a path where no

vertex is repeated first and last vertices can be the same

Question: an example of a simple path? A non-simple path?

PIT

BOS

JFK

DTW

LAX

SFO

Page 19: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

19

Connected graphsConnected graph

A graph is connected if for all (u,v) in V, there exists a path from u to v.

A directed graph is strongly connected if for all (u,v) in V, there exists a path from u to v.

A directed graph is weakly connected if for all (u,v) in V, either (u,v) is in E or (v,u) is in E.

Complete GraphA graph with all nodes connected to each other directly

Page 20: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

20

CyclesA cycle (in a directed graph) is a path that

begins and ends in the same vertex.

A directed acyclic graph (DAG) is a directed graph having no cycles.

Page 21: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

21

So, is this a connected graph?

Cyclic or Acyclic?

Directed or Undirected?

Page 22: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

22

Directed graph (unconnected)

Cyclic or Acyclic?

Page 23: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

23

Tree is a Graph

A graph with no cycles is called a tree. This is a general definition of a tree

A

CB

F

D E

GA

C

B

F

D E

G

Page 24: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

24

Edges and Degree Degree

number of edges incident to a vertexin a directed graph

in-degree(v) - number of edges into vertex v out-degree(v) - number of edges from

vertex v

Page 25: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

25

Dense Graphs For most graphs |E| |V|2

A dense graph when most edges are presentE = (|V|2)large number of edges (quadratic)Also |E| > |V| log |V| can be considered

dense

Page 26: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

26

Sparse Graphs A sparse graph is a graph with

relatively few edgesno clear definitionmetric for sparsity |E| < |V| log |V|Examples of sparse graphs

computer network

Page 27: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

Representing Graphs

Page 28: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

28

Graph operations Navigation in a graph.

(v,w)E Rv = {w | (v,w)E} Dw = {v | (v,w)E}

Enumeration of the elements of a graph. E V

Size of a graph |V| |E|

Page 29: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

29

Representing graphs Adjacency matrix

2-dimensional array For each edge (u,v), set A[u][v] to

true; otherwise false

1

4

76

3 5

2

1 2 3 4 5 6 71 x x2 x x3 x4 x x x5 x x67

3 41

4 52

63

4

3 67

5

4 76

7

Adjacency lists For each vertex, keep a list of adjacent vertices

Q: How to represent weights?

Page 30: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

30

Choosing a representation Size of V relative to size of E is a primary

factor.Dense: E/V is largeSparse: E/V is smallAdjacency matrix is expensive if the graph is

sparse. WHY?Adjacency list is expensive if the graph is

dense. WHY?

Dynamic changes to V.Adjacency matrix is expensive to copy/extend if

V is extended. WHY?

Page 31: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

Graphs : ApplicationSearch Engines

Page 32: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

32

Search Engines

Page 33: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

33

What are they? Tools for finding information on the Web

Problem: “hidden” databases, e.g. New York Times (ie, databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.)

Search engineA machine-constructed index (usually by

keyword) So many search engines, we need search

engines to find them. Searchenginecollosus.com

Page 34: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

34

What are they? Search engines: key tools for ecommerce

Buyers and sellers must find each other How do they work? How much do they index? Are they reliable? How are hits ordered? Can the order be changed?

Page 35: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

35

The Process

1. Acquire the collection, i.e. all the documents[Off-line process]

2. Create an inverted index [Off-line process]

3. Match queries to documents[On-line process, the actual retrieval]

4. Present the results to user[On-line process: display, summarize, ...]

Page 36: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

36

SE Architecture Spider

Crawls the web to find pages. Follows hyperlinks. Never stops

Indexer Produces data structures for fast searching of all

words in the pages (i.e., it updates the lexicon) Retriever

Query interface Database lookup to find hits

Billions of documents 1 TB RAM, many terabytes of disk

Ranking

Page 37: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

37

Did you know? The concept of a Web spider was

developed by Dr.Fuzzy Mouldin Implemented in 1994 on the Web Went into the creation of Lycos Lycos propelled CMU into the top 5

most successful schools Commercialization proceeds

Tangible evidence Newel-Simon Hall

Dr. Michael L. (Fuzzy) Mauldin

Page 38: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

38

Did you know? A meta-search engine: send a

query to many search engines, than rank and cluster the results. (especially useful if you’re not sure which keywords produce optimal results)

Vivisimo was developed here at CMU

Developed by Prof. Raul Valdes-Perez

Developed in 2000

Page 39: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

39

A look at > 20,000 servers Spiders over:

4.28 billion pages 880 million images

Supports 35 non-English languages Over 200 million queries per day

Page 40: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

40

Google’s server farm

Page 41: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

41

Web Spiders Start with an initial page P0. Find URLs on P0 and

add them to a queue When done with P0, pass it to an indexing program,

get a page P1 from the queue and repeat Can be specialized (e.g. only look for email

addresses) Issues

Which page to look at next? (Special subjects, recency) Avoid overloading a site How deep within a site do you go (depth search)? How frequently to visit pages?

Page 42: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

42

So, why Spider the Web?User Perceptions Most annoying: Engine finds nothing

(too small an index, but not an issue since 1997 or so).

Somewhat annoying: Obsolete linksRefresh Collection by deleting dead link

OK if index is slightly smallerDone every 1-2 weeks in best engines

Mildly annoying: Failure to find new site=> Re-spider entire web=> Done every 2-4 weeks in best engines

Page 43: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

43

So, why Spider the Web?, cont’d

Cost of Spidering Semi-parallel algorithmic decomposition Spider can (and does) run in hundreds of

severs simultaneously Very high network connectivity Servers can migrate from spidering to query

processing depending on time-of-day load Running a full web spider takes days even with

hundreds of dedicated servers

Page 44: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

44

Indexing Arrangement of data (data structure) to permit

fast searching Which list is easier to search? sow fox pig eel yak hen ant cat dog hog ant cat dog eel fox hen hog pig sow yak Sorting helps. Why?

Permits binary search. About log2n probes into list log2(1 billion) ~ 30

Permits interpolation search. About log2(log2n) probes log2 log2(1 billion) ~ 5

Page 45: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

45

Inverted Files A file is a list of words by position

- First entry is the word in position 1 (first word)- Entry 4562 is the word in position 4562 (4562nd word)- Last entry is the last word

An inverted file is a list of positions by word!

POS1

10

20

30

36

FILE

a (1, 4, 40)entry (11, 20, 31)file (2, 38)list (5, 41)position (9, 16, 26)positions (44)word (14, 19, 24, 29, 35, 45)words (7)4562 (21, 27)

INVERTED FILE

Page 46: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

46

Inverted Files for Multiple Documents

107 4 322 354 381 405232 6 15 195 248 1897 1951 2192677 1 481713 3 42 312 802

WORD NDOCS PTRjezebel 20jezer 3jezerit 1jeziah 1jeziel 1jezliah 1jezoar 1jezrahliah 1jezreel 39

jezoar

34 6 1 118 2087 3922 3981 500244 3 215 2291 301056 4 5 22 134 992

DOCID OCCUR POS 1 POS 2 . . .

566 3 203 245 287

67 1 132. . .

“jezebel” occurs6 times in document 34,3 times in document 44,4 times in document 56 . . .

LEXICON

WORD INDEX

Page 47: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

47

Ranking (Scoring) Hits Hits must be presented in some order What order?

Relevance, recency, popularity, reliability? Some ranking methods

Presence of keywords in title of document Closeness of keywords to start of document Frequency of keyword in document Link popularity (how many pages point to this one)

Page 48: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

48

Ranking (Scoring) Hits, cont’d Can the user control? Can the page owner

control? Can you find out what order is used? Spamdexing: influencing retrieval ranking by

altering a web page. (Puts “spam” in the index)

Page 49: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

49

Link Popularity How many pages link to this page?

on the whole Web in our database?

http://www.linkpopularity.com Link popularity is used for ranking

Many measuresNumber of links in (In-links)Weighted number of links in (by weight of

referring page)

Page 50: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

50

Search Engine Sizes

ATW AllTheWebAV AltavistaGG GoogleINK InktomiTMA Teoma

SOURCE: SEARCHENGINEWATCH.COM

Billions of textual documents indexed.

Page 51: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

51

Search Engine Sizes, cont’d

SOURCE: SEARCHENGINEWATCH.COM

Billions of textual documents indexed (as of Sept 2, 2003)

ATW AllTheWebAV AltavistaGG GoogleINK InktomiTMA Teoma

Page 52: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

52

Current Status of Web Spiders

Historical Notes WebCrawler: first documented

spider Lycos: first large-scale spider

Page 53: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

53

Current Status of Web Spiders

Enhanced Spidering In-link counts to pages can be

established during spidering. Hint: Hmmm… hw6? In-link counts are the basis for

Google’s page-rank method

Page 54: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

54

Current Status of Web SpidersUnsolved Problems Most spidering re-traverses stable web

graph=> on-demand re-spidering when changes occur

Completeness or near-completeness is still a major issue

Cannot Spider information stored in local-DB

Spidering non-textual data

Page 55: 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.

55

Reading About graphs:

Chapter 14.

About Web search:http://www.cadenza.org/search_engine_

terms/srchad.htm