, Fall 2006 IAT 800

IAT 800

Recursion, Web Crawling

Today’s Nonsense

Recursion – Why is my head spinning?

Web Crawling – Recursing in HTML

Shortened class again today. Boy, your TA must be a total slacker, or something.

Recursion

Recursion basically means calling a method from inside itself.

int factorial(int n) {
  if (n > 1) {
    return n * factorial(n - 1);
  } else {
    return 1;
  }
}

What the?!?

Inside Itself?!

Let’s step through what happens when we call factorial(3).

factorial(3)  (n = 3)
  n > 1, so it needs return n * factorial(n-1), which calls factorial(2)

  factorial(2)  (n = 2)
    n > 1, so it needs return n * factorial(n-1), which calls factorial(1)

    factorial(1)  (n = 1)
      n is not greater than 1, so it returns 1

  Back in factorial(2)  (n = 2)
    return n * factorial(n-1) becomes return n * 1, then return 2 * 1
    It returns 2

Back in factorial(3)  (n = 3)
  return n * factorial(n-1) becomes return n * 2, then return 3 * 2
  It returns 6

factorial(3);  gives 6

Base Case

The most important thing to include in a recursive method is the “base case”: a condition that ensures the method stops calling itself at some point.

In our example, factorial only calls itself if n is greater than 1, and n gets smaller with each call, so we know it will eventually reach a number less than or equal to 1.
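To see the base case at work outside of factorial, here is a minimal sketch. The class and method names (BaseCaseDemo, sumTo, sumToNoBaseCase) are made up for illustration and are not part of the course code. The second method drops the base case on purpose to show what goes wrong.

```java
class BaseCaseDemo {
    // Adds up n + (n-1) + ... + 1. The base case is n <= 0.
    static int sumTo(int n) {
        if (n <= 0) {
            return 0;                // base case: nothing left to add
        }
        return n + sumTo(n - 1);     // recursive case: n shrinks toward the base case
    }

    // Same idea with the base case removed: it never stops on its own.
    static int sumToNoBaseCase(int n) {
        return n + sumToNoBaseCase(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(sumTo(4));   // 4 + 3 + 2 + 1
        try {
            sumToNoBaseCase(4);
        } catch (StackOverflowError e) {
            // Java runs out of call stack instead of looping forever.
            System.out.println("No base case: the call stack overflowed");
        }
    }
}
```

Without the base case, each call waits on another call that never returns, so Java eventually gives up with a StackOverflowError.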

Web Crawling

Let’s use recursion for something more interesting.

Say we have a method, parsePage, that looks at a web page. Suppose we then want that method to follow the links on that page and parse the pages it links to.

We’d then want to call parsePage on those links from inside the parsePage method itself.

Web Crawling

You can see here the need for a base case. One way of controlling our search is by placing a limit on the depth of the links we follow.

For instance, in the visual example, we followed the links from our start page (depth 1), and then the links from those pages (depth 2).
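A depth limit like this can be sketched without touching the network. In the sketch below, the LINKS map is a made-up stand-in for the web (page name to the pages it links to), and the parsePage signature with depth and maxDepth parameters is an assumption for illustration, not the course's actual method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DepthLimitedCrawl {
    // Hypothetical stand-in for the web: page name -> pages it links to.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("start", Arrays.asList("a", "b"));
        LINKS.put("a", Arrays.asList("c", "start")); // "a" links back to "start"
        LINKS.put("b", Arrays.asList("d"));
        LINKS.put("c", Arrays.asList("e"));
        LINKS.put("d", Collections.<String>emptyList());
        LINKS.put("e", Collections.<String>emptyList());
    }

    // "Parses" a page (here: just records its name), then follows its links.
    static void parsePage(String page, int depth, int maxDepth, List<String> visited) {
        visited.add(page);
        if (depth >= maxDepth) {
            return;                  // base case: we are as deep as we allow
        }
        List<String> links = LINKS.getOrDefault(page, Collections.<String>emptyList());
        for (String link : links) {
            parsePage(link, depth + 1, maxDepth, visited);   // one level deeper
        }
    }

    // The start page counts as depth 0, its links as depth 1, and so on.
    static List<String> crawl(String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        parsePage(start, 0, maxDepth, visited);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("start", 2));
    }
}
```

Note that this sketch can visit the same page twice ("a" links back to "start"). The depth limit alone is enough to make it stop; remembering which pages were already seen would be a separate refinement.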

Recursion

Fig. 3: Remember—base cases prevent infinite cats.

http://infinitecat.com/

Parse HTML?

“Parsing” means to walk through the structure of a file (look at it word-by-word)
– Look at an HTML file

The structure of an HTML file is the tag structure
– So parsing means to walk through and interpret the tags

If you can parse HTML files, you can pull content out of web pages and do stuff with it

Procedural manipulation of web content

<font size=-1 color=>Results <b>1</b> - <b>20</b> of about <b>202</b> for <b><a href=/url?sa=X&oi=dict&q=http://www.answers.com/matrix%26r%3D67 title="Look up definition of matrix"><b>matrix</b></a><b> </b><a href=/url?sa=X&oi=dict&q=http://www.answers.com/red%26r%3D67 title="Look up definition of red"><b>red</b></a> …

<b>some text</b> == some text

Basic approach

Use two classes to parse

One class reads info from a URL – HTMLParser

The other class is used by HTMLParser to process tags – child of HTMLEditorKit.ParserCallback

HTMLParser recognizes when a tag appears (<TAG>) and calls the appropriate methods on the ParserCallback class (start-tags, end-tags, simple-tags, text, etc.)

The programmer (i.e., you) fills in the ParserCallback methods to do whatever you want when you see different kinds of tags
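The two-class split can be sketched with classes that ship with Java. The course's HTMLParser is not shown here, so this sketch swaps in the JDK's javax.swing.text.html.parser.ParserDelegator to play the "reader" role, and it parses a string rather than a URL. TagLogger and its log method are made-up names for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// The callback class: the parser calls these methods as it meets tags and text.
class TagLogger extends HTMLEditorKit.ParserCallback {
    final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
        events.add("start:" + tag);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        events.add("end:" + tag);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
        events.add("simple:" + tag);
    }

    @Override
    public void handleText(char[] data, int pos) {
        events.add("text:" + new String(data));
    }

    // The reading side: ParserDelegator stands in for the course's HTMLParser.
    static List<String> log(String html) throws IOException {
        TagLogger logger = new TagLogger();
        new ParserDelegator().parse(new StringReader(html), logger, true);
        return logger.events;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><b>some text</b><img src=\"image.gif\"></body></html>";
        for (String event : log(html)) {
            System.out.println(event);
        }
    }
}
```

The parser may also report tags it fills in on its own (for example a missing head), so the printed list can contain more entries than the tags typed in the string.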

Running the example

We’ve written HTMLParser for you

To access it, it must be in the data directory of your project

Simplest thing will be just to copy the code from the website and put the directory in your default sketchbook directory

handleSimpleTag

public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrib, int pos)
– Called for tags like IMG
– tag stores the name of the tag
– attrib stores any attributes
– pos is the position in the file

Example: <img src="image.gif" alt="text description of image" align="right" width="10">
– The tag is img
– The attributes are src, alt, align, width (with their respective values)
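As a rough sketch of what a filled-in handleSimpleTag might look like, the class below collects the src attribute of every IMG tag. ImageFinder and find are hypothetical names, and the JDK's ParserDelegator is used in place of the course's HTMLParser, reading from a string instead of a URL.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImageFinder extends HTMLEditorKit.ParserCallback {
    final List<String> sources = new ArrayList<>();

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
        if (tag == HTML.Tag.IMG) {
            // attrib holds the attributes; look up src by its HTML.Attribute key.
            Object src = attrib.getAttribute(HTML.Attribute.SRC);
            if (src != null) {
                sources.add(src.toString());
            }
        }
    }

    // Parses the given HTML text and returns the src of each image, in order.
    static List<String> find(String html) throws IOException {
        ImageFinder finder = new ImageFinder();
        new ParserDelegator().parse(new StringReader(html), finder, true);
        return finder.sources;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<img src=\"image.gif\" alt=\"text description of image\" align=\"right\" width=\"10\">"
                + "</body></html>";
        System.out.println(find(html));
    }
}
```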

handleStartTag

public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos)
– Called for tags like BODY
– tag stores the name of the tag
– attrib stores any attributes
– pos is the position in the file

Example: <body bgcolor="#FFFFFF" topmargin="0" leftmargin="0" marginheight="0" marginwidth="0">
– The tag is body
– The attributes are bgcolor, topmargin, leftmargin, marginheight, marginwidth (with their respective values)

handleEndTag

public void handleEndTag(HTML.Tag tag, int pos)
– Called for tags like </a>
– tag stores the name of the tag
– pos is the position in the file

handleText

public void handleText(char[] data, int pos)
– Handles anything that’s not a tag (the text between tags)
– data is an array of characters containing the text
– pos is the position
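One possible use of handleText is stripping the tags off a piece of HTML, so that <b>some text</b> comes out as some text. The sketch below assumes the JDK's ParserDelegator as the parser (not the course's HTMLParser); TextGrabber and textOf are made-up names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TextGrabber extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // data is a char array, so append it directly to the running text.
        text.append(data);
    }

    // Parses the given HTML and returns only the text between the tags.
    static String textOf(String html) throws IOException {
        TextGrabber grabber = new TextGrabber();
        new ParserDelegator().parse(new StringReader(html), grabber, true);
        return grabber.text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(textOf("<b>some text</b>"));
    }
}
```

Text from separate tags is simply run together here; adding a space or newline between pieces would be a choice for whoever fills in the method.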

Filling in these methods

You fill in these methods to do whatever processing you want

In the image collage example
– handleSimpleTag is looking for images
– handleStartTag is looking for the start of anchors and follows links
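The image collage code itself is not reproduced here. As a hedged sketch of the handleStartTag half, the class below only collects the href of each anchor and leaves the actual following of links (the recursive part) out. LinkFinder and find are hypothetical names, and the JDK's ParserDelegator stands in for the course's HTMLParser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkFinder extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
        if (tag == HTML.Tag.A) {
            // Anchors used only as named targets have no href, so check for null.
            Object href = attrib.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses the given HTML text and returns the href of each anchor, in order.
    static List<String> find(String html) throws IOException {
        LinkFinder finder = new LinkFinder();
        new ParserDelegator().parse(new StringReader(html), finder, true);
        return finder.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<a href=\"page1.html\">one</a> "
                + "<a href=\"http://example.com/\">two</a>"
                + "</body></html>";
        System.out.println(find(html));
    }
}
```

A crawler would hand each collected link back to its page-parsing method one level deeper, stopping at the depth limit.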