IAT 800, Fall 2006
Today’s Nonsense
Recursion – Why is my head spinning?
Web Crawling – Recursing in HTML
Shortened class again today. Boy, your TA must be a total slacker, or something.
Recursion
Recursion basically means calling a method from inside itself.
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
What the?!?
Inside Itself?!
Let’s step through what happens.
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(n = 2)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(n = 2)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(n = 1)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(n = 2)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(the n = 1 call returns 1)
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(n = 2)
int factorial(int n) {
  if (n > 1) {
    return n * 1;
  } else return 1;
}
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(n = 2)
int factorial(int n) {
  if (n > 1) {
    return 2 * 1;
  } else return 1;
}
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * factorial(n-1);
  } else return 1;
}
(the n = 2 call returns 2)
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return n * 2;
  } else return 1;
}
factorial(3);
(n = 3)
int factorial(int n) {
  if (n > 1) {
    return 3 * 2;
  } else return 1;
}
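The same walk-through can be reproduced in code. This is a minimal sketch: the factorial logic is the slide's, but the extra depth parameter, the trace list, and the class name FactorialTrace are illustrative additions used only to record each call and return.

```java
import java.util.ArrayList;
import java.util.List;

class FactorialTrace {
    // Records one line per call and one line per return (illustrative addition).
    static final List<String> trace = new ArrayList<>();

    // Same logic as the slide's factorial; depth only controls indentation.
    static int factorial(int n, int depth) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        trace.add(indent + "factorial(" + n + ") called");
        int result;
        if (n > 1) {
            result = n * factorial(n - 1, depth + 1); // recursive case
        } else {
            result = 1;                               // base case
        }
        trace.add(indent + "factorial(" + n + ") returns " + result);
        return result;
    }

    public static void main(String[] args) {
        factorial(3, 0);
        for (String line : trace) {
            System.out.println(line);
        }
    }
}
```

Each level of indentation in the printed trace corresponds to one of the nested boxes in the step-through: the calls pile up until n reaches 1, then the returns unwind in reverse order.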
Base Case
The most important thing to include in a recursive method is the “base case”: a condition that ensures the method stops calling itself at some point.
In our example, we made sure it only called the recursive function if n was greater than 1, and each time we call it, n gets smaller, so we know it will eventually get to a number less than or equal to 1.
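To see why the base case matters, here is a small sketch that puts a recursive method next to a copy with the base case removed. The names sumTo, sumToNoBaseCase, and BaseCaseDemo are made up for this illustration and are not from the course code.

```java
class BaseCaseDemo {
    // Adds 1 + 2 + ... + n recursively.
    static int sumTo(int n) {
        if (n <= 0) {
            return 0;               // base case: nothing left to add, stop here
        }
        return n + sumTo(n - 1);    // recursive case: n gets smaller every call
    }

    // The same method without a base case: it never stops on its own.
    static int sumToNoBaseCase(int n) {
        return n + sumToNoBaseCase(n - 1);
    }

    // Returns true if the version without a base case runs out of stack space.
    static boolean runsOutOfStack() {
        try {
            sumToNoBaseCase(3);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sumTo(4));
        System.out.println(runsOutOfStack());
    }
}
```

Note that the check is n <= 0 rather than n == 0, so a negative argument also reaches the base case instead of recursing forever.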
Web Crawling
Let’s use recursion for something more interesting.
Say we have some method, “parsePage”, that looks at a web page. Suppose we then want that method to follow the links on that page and parse the pages they lead to.
We’d then want to call the “parsePage” method on those links from inside the parsePage method we have.
Web Crawling
You can see here the need for a base case. One way of controlling our search is by placing a limit on the depth of the links we follow.
For instance, in the visual example, we followed the links from our start page (depth 1), and then the links from those pages (depth 2).
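A depth limit as a base case might look like the sketch below. To keep it self-contained, the "web" is a hypothetical in-memory map from page names to the pages they link to; the LINKS table, the maxDepth parameter, and the visited list are assumptions for illustration. Only the name parsePage comes from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DepthLimitedCrawl {
    // Hypothetical stand-in for the web: page name -> pages it links to.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("start", Arrays.asList("a", "b"));
        LINKS.put("a", Arrays.asList("c", "start"));
        LINKS.put("b", Arrays.asList("c"));
        LINKS.put("c", Arrays.asList("d"));
        LINKS.put("d", Arrays.asList("start"));
    }

    // "Parses" a page (here: just records it), then recurses into its links.
    static void parsePage(String page, int depth, int maxDepth, List<String> visited) {
        visited.add(page);
        if (depth >= maxDepth) {
            return; // base case: deep enough, do not follow any more links
        }
        List<String> links = LINKS.get(page);
        if (links == null) {
            links = Collections.emptyList();
        }
        for (String link : links) {
            parsePage(link, depth + 1, maxDepth, visited); // one level deeper
        }
    }

    public static void main(String[] args) {
        List<String> visited = new ArrayList<>();
        parsePage("start", 0, 2, visited); // start page is depth 0
        System.out.println(visited);
    }
}
```

The pages link back to each other in a cycle, so without the depth check this would never finish. The sketch does not remember which pages it has already seen, so a page reachable by two routes is parsed twice; the depth limit alone is what guarantees it stops.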
Recursion
Fig. 3: Remember—base cases prevent infinite cats.
http://infinitecat.com/
Parse HTML?
“Parsing” means to walk through the structure of a file (look at it word-by-word)
– Look at an HTML file
The structure of an HTML file is the tag structure
– So parsing means to walk through and interpret the tags
If you can parse HTML files, you can pull content out of web pages and do stuff with it
Procedural manipulation of web content
<font size=-1 color=>Results <b>1</b> - <b>20</b> of about <b>202</b> for <b><a href=/url?sa=X&oi=dict&q=http://www.answers.com/matrix%26r%3D67 title="Look up definition of matrix"><b>matrix</b></a><b> </b><a href=/url?sa=X&oi=dict&q=http://www.answers.com/red%26r%3D67 title="Look up definition of red"><b>red</b></a> …
<b>some text</b> == some text
Basic approach
Use two classes to parse
One class reads info from a URL – HTMLParser
The other class is used by HTMLParser to process tags – child of HTMLEditorKit.ParserCallback
HTMLParser recognizes when a tag appears (<TAG>) and calls appropriate methods on the ParserCallback class (start-tags, end-tags, simple-tags, text, etc.)
The programmer (i.e., you) fills in the ParserCallback methods to do whatever you want when you see different kinds of tags
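The two-class split can be sketched as follows. The course's HTMLParser class is not shown in these slides, so this sketch swaps in the JDK's own ParserDelegator as the class that reads the HTML, and feeds it a string instead of a URL. The names CallbackSketch and EventLogger are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CallbackSketch {
    // The class you fill in: each method is called when the parser
    // meets that kind of thing. Here they only log what they saw.
    static class EventLogger extends HTMLEditorKit.ParserCallback {
        final List<String> events = new ArrayList<>();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
            events.add("start " + tag);
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            events.add("end " + tag);
        }

        @Override
        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
            events.add("simple " + tag);
        }

        @Override
        public void handleText(char[] data, int pos) {
            events.add("text " + new String(data));
        }
    }

    // ParserDelegator (from the JDK) stands in for the course's HTMLParser.
    static List<String> parse(String html) {
        EventLogger logger = new EventLogger();
        try {
            new ParserDelegator().parse(new StringReader(html), logger, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return logger.events;
    }

    public static void main(String[] args) {
        for (String event : parse("<html><body><b>some text</b></body></html>")) {
            System.out.println(event);
        }
    }
}
```

The parser does the walking; the callback only decides what to do at each tag or piece of text. Swapping in a different callback changes the behavior without touching the parsing code.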
Running the example
We’ve written HTMLParser for you
To access it, it must be in the data directory of your project
Simplest thing will be just to copy the code from the website and put the directory in your default sketchbook directory
handleSimpleTag
public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrib, int pos)
– Called for tags like IMG
– tag stores the name of the tag
– attrib stores any attributes
– pos is the position in the file
Example: <img src="image.gif" alt="text description of image" align="right" width="10">
– The tag is img
– The attributes are src, alt, align, width (with their respective values)
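As a rough sketch of handleSimpleTag in use, the callback below pulls the src attribute out of every IMG tag. The class name ImageSources is hypothetical, and the JDK's ParserDelegator again stands in for the course's HTMLParser, which is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ImageSources {
    // Returns the src value of each <img> tag found in the HTML string.
    static List<String> find(String html) {
        final List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
                if (tag == HTML.Tag.IMG) {
                    // Attributes are looked up by key; a missing one comes back null.
                    Object src = attrib.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<html><body><img src=\"image.gif\" alt=\"text description of image\""
                + " align=\"right\" width=\"10\"></body></html>";
        System.out.println(find(html));
    }
}
```

The other attributes from the example (alt, align, width) can be read the same way with HTML.Attribute.ALT, HTML.Attribute.ALIGN, and HTML.Attribute.WIDTH.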
handleStartTag
public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos)
– Called for tags like BODY
– tag stores the name of the tag
– attrib stores any attributes
– pos is the position in the file
Example: <body bgcolor="#FFFFFF" topmargin="0" leftmargin="0" marginheight="0" marginwidth="0">
– The tag is body
– The attributes are bgcolor, topmargin, leftmargin, marginheight, marginwidth (with their respective values)
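For web crawling, the start tag that matters most is A, since its href attribute is the link to follow. The sketch below collects those links; the class name LinkFinder is made up, and the JDK's ParserDelegator stands in for the course's HTMLParser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkFinder {
    // Returns the href value of each <a> start tag found in the HTML string.
    static List<String> find(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrib.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"page1.html\">one</a> "
                + "<a href=\"page2.html\">two</a></body></html>";
        System.out.println(find(html));
    }
}
```

A crawler would call its recursive parsePage on each link returned here, one level deeper each time.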
handleEndTag
public void handleEndTag(HTML.Tag tag, int pos)
– Called for tags like </a>
– tag stores the name of the tag
– pos is the position in the file
handleText
public void handleText(char[] data, int pos)
– Handles anything that’s not a tag (the text between tags)
– data is an array of characters containing the text
– pos is the position
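Start tags, end tags, and text usually work together: the tag handlers keep track of where you are, and handleText decides what to do with the text based on that. As a sketch under the same assumptions as before (the JDK's ParserDelegator in place of the course's HTMLParser, and a made-up class name BoldText), this collects only the text that appears inside B tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class BoldText {
    static class Collector extends HTMLEditorKit.ParserCallback {
        final List<String> boldText = new ArrayList<>();
        int boldDepth = 0; // how many <b> tags are currently open

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrib, int pos) {
            if (tag == HTML.Tag.B) {
                boldDepth++;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.B && boldDepth > 0) {
                boldDepth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (boldDepth > 0) {
                boldText.add(new String(data)); // char[] -> String
            }
        }
    }

    // Returns each piece of text that sits inside a <b>...</b> pair.
    static List<String> find(String html) {
        Collector collector = new Collector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.boldText;
    }

    public static void main(String[] args) {
        String html = "<html><body>Results <b>1</b> - <b>20</b> of about <b>202</b></body></html>";
        System.out.println(find(html));
    }
}
```

Using a counter rather than a true/false flag keeps the tracking correct when B tags are nested, as they are in the search-results snippet earlier.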