Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking...

Focused Crawling in Depression Portal Search: A Feasibility Study

Thanh Tin Tang (ANU), David Hawking (CSIRO)

Nick Craswell (Microsoft), Ramesh Sankaranarayana (ANU)

2

Why Depression?

Leading cause of disability burden in Australia

One in five people suffer from a mental disorder in any one year

The Web is a good way to deliver information and treatments, but ...

A lot of depression information on the Web is of poor quality

3

Bluepages Search (BPS)

4

BluePages Search

5

Bluepages Search

Indexes approximately 200 sites, e.g.

  Whole server: suicidal.com/
  Directory: www.healingwell.com/depression/
  Individual page: www.mcmanweb.com/article-226.htm
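The three kinds of include pattern listed on this slide (whole server, directory, individual page) can be read as URL matching rules. The following is a minimal sketch, assuming that a pattern ending in "/" matches by prefix and any other pattern matches exactly. The slides do not give the actual BPS rule syntax, and the names `normalize` and `is_included` are hypothetical.

```python
from urllib.parse import urlsplit

# Include patterns taken from the examples on this slide.
INCLUDE_PATTERNS = [
    "suicidal.com/",                     # whole server
    "www.healingwell.com/depression/",   # directory
    "www.mcmanweb.com/article-226.htm",  # individual page
]

def normalize(url: str) -> str:
    """Drop the scheme and lower-case the host so URLs compare with patterns."""
    parts = urlsplit(url if "//" in url else "//" + url)
    path = parts.path or "/"
    return parts.netloc.lower() + path

def is_included(url: str, patterns=INCLUDE_PATTERNS) -> bool:
    """Assumed semantics: patterns ending in '/' match by prefix, others exactly."""
    target = normalize(url)
    for pattern in patterns:
        if pattern.endswith("/"):
            if target.startswith(pattern):
                return True
        elif target == pattern:
            return True
    return False
```

Under these assumptions, any page on suicidal.com and any page below www.healingwell.com/depression/ is accepted, while other directories on www.healingwell.com are rejected.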

Approximately 2 weeks of manual effort to create / update seed list and include patterns

Experiments showed that Google (with ‘depression’ added to the query) had better relevance but returned more bad advice

Relevance: Only 17% of relevant pages returned by Google were contained in the BPS crawl

6

Approach

BPS: higher quality but much lower coverage, and …

It is time consuming to identify and maintain the list of sites to be included

Is it worth it?

Can it be done more cheaply?

How to increase coverage but still maintain high quality?

Can we automate the process?

=> Seed list: Using an existing directory, e.g.: DMOZ, Yahoo!

Directory Crawling:

Use a general crawler with inclusion/exclusion rules

Use a focused crawler with mechanisms to predict relevant/high-quality links from source pages

7

DMOZ Depression Directory

DMOZ is “the most comprehensive human-edited directory of the web”

Depression directory contains:

  Links to a few other DMOZ pages
  Links to servers, directories, and individual pages about depression

[Figure labels: Other pages in DMOZ; Servers, directories & individual pages]

8

DMOZ Seed List

How to generate:

  Start from the depression directory
  Decide whether to include links to other pages within the DMOZ site (little manual effort)
  Automatically generate most of the seed URLs

Seed URLs are the same as the DMOZ-listed URLs, except that default page suffixes are removed.

E.g.: www.depression.com/default.asp has the pattern www.depression.com
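The suffix-stripping rule can be sketched as follows. This is only an illustration: the slides show a single example (default.asp), so the set `DEFAULT_PAGES` and the function name `seed_pattern` are assumptions, not the authors' actual list or code.

```python
from urllib.parse import urlsplit

# Assumed default page names; the slide only shows default.asp.
DEFAULT_PAGES = {"default.asp", "default.htm", "default.html",
                 "index.htm", "index.html", "index.php"}

def seed_pattern(url: str) -> str:
    """Return host + path with a trailing default page name removed.

    Trailing slashes are also dropped, so a directory URL and its
    default page map to the same pattern.
    """
    parts = urlsplit(url if "//" in url else "//" + url)
    segments = parts.path.split("/")
    if segments and segments[-1].lower() in DEFAULT_PAGES:
        segments = segments[:-1]
    path = "/".join(segments).rstrip("/")
    return parts.netloc + path

print(seed_pattern("www.depression.com/default.asp"))  # www.depression.com
```

URLs that point at an individual page, such as www.mcmanweb.com/article-226.htm, pass through unchanged.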

9

Should DMOZ be used?

Requires very little effort in boundary setting

Provides a big seed list of URLs located heterogeneously across the Web (three times bigger than BPS)

Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ, of which 186 were relevant (81%)

=> DMOZ provided a good set of relevant pages with little effort, but … can we find more relevant pages elsewhere?

10

Focused Crawler

Seeks, acquires, indexes and maintains pages on a specific set of topics

Requires small investment in hardware and network resources

Starts with a seed list of URLs relevant to the topics of interest

Follows links from seed pages to identify the most promising links to crawl

Is focused crawling a promising technique for building a depression portal?
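The behaviour described on this slide (start from seeds, follow the most promising links first) is commonly realised as a best-first crawl over a priority queue. The sketch below is one such formulation under stated assumptions, not the crawler used in the study: `fetch_links` and `score_link` are hypothetical callables supplied by the caller, and the threshold of 0.5 is arbitrary.

```python
import heapq
from typing import Callable, Iterable

def focused_crawl(seeds: Iterable[str],
                  fetch_links: Callable[[str], list[str]],
                  score_link: Callable[[str, str], float],
                  max_pages: int = 100,
                  threshold: float = 0.5) -> list[str]:
    """Best-first crawl: always visit the highest-scoring unvisited URL next."""
    # heapq is a min-heap, so scores are negated; seeds get the top score.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target in fetch_links(url):
            if target in seen:
                continue
            seen.add(target)
            score = score_link(url, target)
            if score >= threshold:  # links predicted irrelevant are dropped
                heapq.heappush(frontier, (-score, target))
    return visited

# Toy link graph standing in for the Web (illustrative data only).
toy_web = {
    "seed": ["depression-help", "car-sales"],
    "depression-help": ["depression-treatment"],
    "car-sales": ["car-loans"],
}
crawled = focused_crawl(
    ["seed"],
    fetch_links=lambda url: toy_web.get(url, []),
    score_link=lambda source, target: 1.0 if "depression" in target else 0.0,
)
```

In the toy graph the crawl never visits car-sales or anything reachable only through it, which is the resource saving a focused crawler aims for.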

11

Additional Link-accessible Relevant Information

If pages in the current crawl have no link to additional relevant content, the prospect of successful focused crawling is very low

[Figure: Illustration of one link away collection. Labels: DMOZ Crawl; One link away URLs]

12

Additional Link Experiments

Experiment: Relevance of outgoing links from a crawled collection

  An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages (quite a high figure) in a single step from the currently crawled pages.

Experiment: Linking patterns between relevant pages

  Out of 196 new relevant URLs, 158 were linked to by known relevant pages.
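A measurement of this kind can be sketched as a set computation over a link graph. The slides do not state how the 25.3% figure was normalised, so the sketch assumes the gain is expressed relative to the known relevant pages already inside the crawl. The function name and the toy data are hypothetical.

```python
def one_link_away_gain(crawl: set[str],
                       out_links: dict[str, set[str]],
                       known_relevant: set[str]) -> tuple[int, int, float]:
    """Count known relevant pages inside the crawl and one link outside it."""
    frontier = set()
    for page in crawl:
        frontier |= out_links.get(page, set())
    frontier -= crawl  # keep only URLs that are not already crawled
    inside = len(known_relevant & crawl)
    extra = len(known_relevant & frontier)
    gain = extra / inside if inside else 0.0
    return inside, extra, gain

# Toy example: pages a and b are crawled; c, d and e are one link away.
inside, extra, gain = one_link_away_gain(
    crawl={"a", "b"},
    out_links={"a": {"b", "c"}, "b": {"d", "e"}},
    known_relevant={"a", "b", "c", "e", "z"},
)
```

Here page z is relevant but unreachable in one step, so it contributes to neither count.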

13

Findings for Additional Links

Relevant pages tend to link to each other

The outgoing link set of a good collection contains quite a large number of additional relevant pages

These support the idea of focused crawling, but …

How can a crawler tell which links lead to relevant content?

14

Hypertext Classification

Traditional text classification only looks at the text in each document

Hypertext classification uses link information

We experimented with anchor text, text around the link and URL words

Here is an example

15

Features

URL: http://www.depression.com/psychotherapy.html

=> URL words: depression, com, psychotherapy

Anchor text: psychotherapy

Text around the link:

50 bytes before: section, learn

50 bytes after: talk, therapy, standard, treatment
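The three feature sources on this slide can be extracted roughly as follows. This is a simplified sketch, not the authors' extractor: it uses a regular expression instead of an HTML parser, measures the window in characters rather than bytes, removes no stop words, and the helper names are hypothetical. Dropping "www" and the file extension is an assumption made to reproduce the URL words shown on the slide.

```python
import re

ANCHOR = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)
TAG = re.compile(r"<[^>]+>")

def words(text: str) -> list[str]:
    """Lower-case alphabetic tokens; digits and punctuation act as separators."""
    return re.findall(r"[a-z]+", text.lower())

def url_words(url: str) -> list[str]:
    """Words from the URL, ignoring the scheme, 'www' and html extensions."""
    without_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    ignored = {"www", "html", "htm"}  # assumption based on the slide's example
    return [w for w in words(without_scheme) if w not in ignored]

def link_features(html: str, window: int = 50) -> list[dict]:
    """For each link, collect URL words, anchor text and surrounding text."""
    features = []
    for match in ANCHOR.finditer(html):
        before = html[max(0, match.start() - window):match.start()]
        after = html[match.end():match.end() + window]
        features.append({
            "url_words": url_words(match.group(1)),
            "anchor": words(TAG.sub(" ", match.group(2))),
            "before": words(TAG.sub(" ", before)),
            "after": words(TAG.sub(" ", after)),
        })
    return features

sample = ('In this section you can learn about '
          '<a href="http://www.depression.com/psychotherapy.html">psychotherapy</a>'
          ', a talk therapy that is a standard treatment.')
features = link_features(sample)[0]
```

Because no stop words are removed, the "before" and "after" lists here contain words such as "in" and "that" in addition to the content words shown on the slide.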

16

Input Data & Measures

Calculate tf.idf for all the features appearing in each URL

10-fold cross validation on 295 relevant and 251 irrelevant URLs

Classifiers: IBk, ZeroR, Naïve Bayes, C4.5 (J48 in Weka), Bagging, AdaBoostM1, etc.

Measures: Accuracy, precision and recall.
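The tf.idf weighting mentioned on this slide has several variants, and the slides do not say which one was used. The sketch below assumes raw term frequency multiplied by log(N/df), with each URL represented as the list of its feature words. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by tf * log(N / df); one dict per document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Two toy URLs described by their feature words (illustrative data only).
vectors = tfidf([["depression", "therapy"], ["depression", "cars"]])
```

A term that occurs in every document, such as "depression" in the toy data, receives weight zero under this variant, so it cannot help separate relevant from irrelevant URLs.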

17

Hypertext Classification - Results

Classifier               Accuracy (%)   Precision (%)   Recall (%)
ZeroR                    54.02          54.02           100
Complement Naïve Bayes   71.06          77.51           65.42
Naïve Bayes              73.07          78.03           69.83
J48                      77.83          88.15           68.13

=> Overall, J48 is the best classifier
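The ZeroR figures can be checked against the class counts given earlier (295 relevant and 251 irrelevant URLs). ZeroR always predicts the majority class, so every URL is labelled relevant. The sketch below computes the three measures from confusion-matrix counts; the function name `metrics` is hypothetical.

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision and recall in percent from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * tp / (tp + fp) if tp + fp else 0.0,
        "recall": 100 * tp / (tp + fn) if tp + fn else 0.0,
    }

# ZeroR labels all 546 URLs as relevant: 295 true positives, 251 false positives.
zero_r = metrics(tp=295, fp=251, tn=0, fn=0)
```

Accuracy and precision both come out at roughly 54% with recall at 100%, in line with the ZeroR results, which gives the baseline that the other classifiers have to beat.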

18

Hypertext Classification - Others

Bagging and boosting showed little improvement in recall

No comparable results on the topic of depression were found in the literature

A classifier looking at the content of the target pages showed similar results

=> Hypertext classification is quite effective

19

Findings

Web pages about depression are strongly interlinked

DMOZ depression category seems to provide a good seed list for a focused crawl

Predictive classification of outgoing links using link features achieves promising results

=> A cheap, high-coverage depression portal might be built and maintained using focused crawling techniques, starting with the DMOZ seed list

20

Future Work

Build a domain-specific search portal:

  URL ranking in order of degree of relevance
  Data structures to hold accumulated information for unvisited URLs

Determine how to use the focused crawler operationally:

  No include/exclude rules, but appropriate stopping conditions
  What to do if none of the outgoing links are classified as relevant?

21

Future Work

Incorporate site quality into the focused crawler, or filter for high-quality pages after crawling

Extend the techniques to other domains, such as health-related domains: are they applicable?
