Focused Crawling in Depression Portal Search: A Feasibility Study
Thanh Tin Tang (ANU), David Hawking (CSIRO),
Nick Craswell (Microsoft), Ramesh Sankaranarayana (ANU)
2
Why Depression?
Leading cause of disability burden in Australia
One in five people suffer from a mental disorder in any one year
The Web is a good way to deliver information and treatments, but ...
A lot of depression information on the Web is of poor quality
3
Bluepages Search (BPS)
5
Bluepages Search
Indexes approximately 200 sites, e.g.
  Whole server: suicidal.com/
  Directory: www.healingwell.com/depression/
  Individual page: www.mcmanweb.com/article-226.htm
Approximately 2 weeks of manual effort to create / update seed list and include patterns
Experiments showed that Google (with ‘depression’) had better relevance but more bad advice
Relevance: Only 17% of relevant pages returned by Google were contained in the BPS crawl
6
Approach
BPS: higher quality but much lower coverage, and …
It is time consuming to identify and maintain the list of sites to be included
  Is it worth it? Can it be done more cheaply?
  How to increase coverage but still maintain high quality?
  Can we automate the process?
=> Seed list: Using an existing directory, e.g.: DMOZ, Yahoo! Directory
Crawling:
  Use general crawler with inclusion/exclusion rules
  Use focused crawler with mechanisms to predict relevant/high quality links from source pages
7
DMOZ Depression Directory
DMOZ is “the most comprehensive human-edited directory of the web”
Depression directory contains:
  Links to a few other DMOZ pages
  Links to servers, directories, and individual pages about depression
[Figure: the depression directory linking to other pages in DMOZ and to servers, directories & individual pages]
8
DMOZ Seed List
How to generate:
  Start from the depression directory
  Decide whether to include links to other pages within the DMOZ site (little manual effort)
  Automatically generate most of the seed URLs
Seed URLs are the same as the listed URLs, except that default page suffixes are removed.
E.g.: www.depression.com/default.asp has the pattern www.depression.com
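The suffix-stripping rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `include_pattern` and the set `DEFAULT_PAGES` are assumptions, since the slides give only default.asp as an example of a default page.

```python
from urllib.parse import urlsplit

# Assumed list of default page names; the slides only mention default.asp.
DEFAULT_PAGES = {"default.asp", "default.htm", "default.html",
                 "index.htm", "index.html", "index.php"}

def include_pattern(url):
    """Drop the scheme and any default page suffix from a listed URL."""
    # urlsplit only recognises the host when a scheme or leading // is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    path = parts.path
    last = path.rsplit("/", 1)[-1]
    if last.lower() in DEFAULT_PAGES:
        path = path[: len(path) - len(last)]
    if path == "/":
        path = ""  # a bare server needs no trailing slash
    return parts.netloc + path

print(include_pattern("www.depression.com/default.asp"))
print(include_pattern("www.healingwell.com/depression/"))
```

Directory and individual-page URLs pass through unchanged apart from the scheme, so only default pages collapse to their parent.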
9
Should DMOZ be used?
Requires very little effort in boundary setting
Provides a big seed list of URLs distributed heterogeneously across the Web (three times bigger than BPS)
Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ, of which 186 were relevant (81%)
=> DMOZ provided a good set of relevant pages with little effort, but … can we find more relevant pages elsewhere?
10
Focused Crawler
Seeks, acquires, indexes and maintains pages on a specific set of topics
Requires small investment in hardware and network resources
Starts with a seed list of URLs relevant to the topics of interest
Follows links from seed pages to identify the most promising links to crawl
Is focused crawling a promising technique for building a depression portal?
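The behaviour described above can be sketched as a best-first crawl. The sketch below runs over an in-memory link graph rather than the live Web, and everything in it is an assumption made for illustration: the function `focused_crawl`, the `score` predictor, the `threshold` and `max_pages` parameters, and the toy URLs.

```python
import heapq

def focused_crawl(seeds, links, score, max_pages=100, threshold=0.5):
    """Best-first crawl over an in-memory link graph.

    links maps a URL to the URLs it links to; score(source, target) is a
    hypothetical predictor of how promising a link is (0.0 to 1.0).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            s = score(url, target)
            if s >= threshold:  # follow only links predicted to be promising
                seen.add(target)
                heapq.heappush(frontier, (-s, target))
    return crawled

# Toy graph and a toy predictor that looks for "depression" in the target URL.
toy_links = {
    "seed": ["a/depression-help", "b/shopping"],
    "a/depression-help": ["c/depression-treatment"],
}
toy_score = lambda source, target: 1.0 if "depression" in target else 0.0
order = focused_crawl(["seed"], toy_links, toy_score)
print(order)
```

In this toy run the shopping page is never fetched, which is the saving a focused crawler aims for.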
11
Additional Link-accessible Relevant Information
If pages in the current crawl have no link to additional relevant content, the prospect of successful focused crawling is very low
[Figure: illustration of the one-link-away collection, showing the DMOZ crawl and the URLs one link away from it]
12
Additional Link Experiments
Experiment: Relevance of outgoing links from a crawled collection
  An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages (quite high) in a single step from the currently crawled pages.
Experiment: Linking patterns between relevant pages
  Out of 196 new relevant URLs, 158 (about 81%) were linked to by known relevant pages.
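The one-step measurement can be sketched with set operations. The function `one_link_away_gain` and the toy data are hypothetical, and expressing the gain relative to the relevant pages already in the crawl is an assumption; the slides do not state the denominator behind the 25.3% figure.

```python
def one_link_away_gain(crawl, links, known_relevant):
    """Find known relevant URLs outside the crawl that are one link away.

    Returns the set of newly reachable relevant URLs and, as an assumed
    definition of "gain", their count relative to the relevant URLs that
    are already in the crawl (None if the crawl holds no relevant URL).
    """
    crawled = set(crawl)
    relevant = set(known_relevant)
    outgoing = {t for page in crawled for t in links.get(page, [])}
    new_relevant = (outgoing - crawled) & relevant
    already = crawled & relevant
    gain = len(new_relevant) / len(already) if already else None
    return new_relevant, gain

# Toy data: p3 and p5 are relevant and reachable in one step; p4 is not relevant.
toy_crawl = ["p1", "p2"]
toy_graph = {"p1": ["p2", "p3"], "p2": ["p4", "p5"]}
toy_relevant = {"p1", "p2", "p3", "p5"}
new_pages, gain = one_link_away_gain(toy_crawl, toy_graph, toy_relevant)
print(sorted(new_pages), gain)
```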
13
Findings for Additional Links
Relevant pages tend to link to each other
Outgoing link set of a good collection contains quite a large number of additional relevant pages
These support the idea of focused crawling, but …
How can a crawler tell which links lead to relevant content?
14
Hypertext Classification
Traditional text classification only looks at the text in each document
Hypertext classification uses link information
We experimented with anchor text, text around the link and URL words
Here is an example
15
Features
URL: http://www.depression.com/psychotherapy.html
=> URL words: depression, com, psychotherapy
Anchor text: psychotherapy
Text around the link:
  50 bytes before: section, learn
  50 bytes after: talk, therapy, standard, treatment
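A minimal sketch of how such features might be extracted is given below. It is not the study's implementation: the HTML snippet is invented so that it reproduces the example above, the `URL_NOISE` and `STOPWORDS` sets are assumptions, and the window is counted in characters rather than bytes (the two coincide for ASCII text).

```python
import re

# Assumed noise tokens: the example shows no "http", "www" or "html" URL words.
URL_NOISE = {"http", "https", "www", "html", "htm"}
# Assumed tiny stopword list, just large enough for the toy sentence below.
STOPWORDS = {"a", "an", "and", "is", "the", "to", "in", "this",
             "more", "about", "for", "of"}

LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def url_words(url):
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return [t for t in tokens if t and t not in URL_NOISE]

def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def link_features(html, window=50):
    """Collect URL words, anchor text and surrounding text for each link.

    The surrounding text is a raw slice of the HTML, so this sketch assumes
    there is no other markup within the window.
    """
    features = []
    for m in LINK.finditer(html):
        features.append({
            "url_words": url_words(m.group(1)),
            "anchor": content_words(m.group(2)),
            "before": content_words(html[max(0, m.start() - window):m.start()]),
            "after": content_words(html[m.end():m.end() + window]),
        })
    return features

# Invented snippet for illustration only.
html_snippet = ('In this section, learn more about '
                '<a href="http://www.depression.com/psychotherapy.html">psychotherapy</a>'
                '. Talk therapy is a standard treatment.')
feats = link_features(html_snippet)
print(feats[0])
```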
16
Input Data & Measures
Calculate tf.idf for all the features appearing in each URL
10-fold cross validation on 295 relevant and 251 irrelevant URLs
Classifiers: IBk, ZeroR, Naïve Bayes, C4.5 (J48), Bagging, AdaBoostM1, etc.
Measures: Accuracy, precision and recall.
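For readers unfamiliar with tf.idf, the weighting can be sketched as follows. The slides do not say which tf.idf variant was used; this sketch assumes raw term frequency multiplied by log(N / document frequency), and the function name `tf_idf` and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: one list of feature tokens per URL.

    Returns one {term: weight} dict per URL, using the assumed variant
    weight = term frequency * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy feature lists for two URLs.
toy_docs = [["depression", "therapy"], ["depression", "shopping"]]
toy_weights = tf_idf(toy_docs)
print(toy_weights)
```

A term that occurs for every URL ("depression" here) gets weight zero under this variant, so only the distinguishing terms carry signal.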
17
Hypertext Classification - Results
Classifier               Accuracy (%)  Precision (%)  Recall (%)
J48                      77.83         88.15          68.13
Naïve Bayes              73.07         78.03          69.83
Complement Naïve Bayes   71.06         77.51          65.42
ZeroR                    54.02         54.02          100
=> Overall, J48 is the best classifier (highest accuracy and precision)
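The ZeroR baseline can be checked by hand from the class counts given earlier (295 relevant, 251 irrelevant URLs). The sketch below uses a hypothetical helper, `measures`, and assumes the figures are computed over the pooled data; it gives about 54% accuracy and precision with 100% recall, which agrees with the ZeroR results to within rounding.

```python
def measures(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# ZeroR always predicts the majority class ("relevant"), so all 295 relevant
# URLs are true positives and all 251 irrelevant URLs are false positives.
zeror_accuracy, zeror_precision, zeror_recall = measures(tp=295, fp=251, fn=0, tn=0)
print(round(100 * zeror_accuracy, 2), round(100 * zeror_precision, 2), 100 * zeror_recall)
```

Any useful classifier has to beat this baseline on accuracy and precision, which all three learned classifiers do.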
18
Hypertext Classification - Others
Bagging and boosting showed little improvement for recall
No comparable results on the topic of depression were found in the literature
A classifier looking at the content of the target pages showed similar results
=> Hypertext classification is quite effective
19
Findings
Web pages about depression are strongly interlinked
DMOZ depression category seems to provide a good seed list for a focused crawl
Predictive classification of outgoing links using link features achieves promising results
=> A cheap, high-coverage depression portal might be built and maintained using focused crawling techniques, starting with the DMOZ seed list
20
Future Work
Build a domain-specific search portal:
  URL ranking in the order of degree of relevance
  Data structures to hold accumulated information for unvisited URLs
Determine how to use the focused crawler operationally:
  No include/exclude rules, but appropriate stopping conditions
  What to do if none of the outgoing links are classified as relevant?
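One possible shape for such a data structure is sketched below. It is purely illustrative: the class `Frontier`, the choice of the maximum link score as the accumulated evidence for a URL, and the `min_score` stopping condition are all assumptions, not decisions made in the study.

```python
import heapq

class Frontier:
    """Unvisited URLs with link scores accumulated from every source page."""

    def __init__(self):
        self.scores = {}      # url -> all link scores seen so far
        self.heap = []        # (-best score, url); may hold stale entries
        self.visited = set()

    def add(self, url, score):
        if url in self.visited:
            return
        self.scores.setdefault(url, []).append(score)
        # Assumed aggregation: rank a URL by its best score so far.
        heapq.heappush(self.heap, (-max(self.scores[url]), url))

    def pop(self, min_score=0.5):
        """Return the best unvisited URL, or None if none reaches min_score."""
        while self.heap:
            neg_score, url = self.heap[0]
            if url in self.visited:
                heapq.heappop(self.heap)  # stale duplicate of a visited URL
                continue
            if -neg_score < min_score:
                return None  # stopping condition: nothing promising is left
            heapq.heappop(self.heap)
            self.visited.add(url)
            return url
        return None

frontier = Frontier()
frontier.add("a", 0.3)
frontier.add("b", 0.6)
frontier.add("a", 0.9)  # a second, stronger link to the same URL
visit_order = [frontier.pop(), frontier.pop(), frontier.pop()]
print(visit_order)
```

Returning None doubles as a stopping condition and as the case where no outgoing link is classified as relevant; what the crawler should then do remains the open question above.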
21
Future Work
Incorporate site quality into the focused crawler, or filter for high-quality pages after crawling
Extend the techniques to other domains, such as other health-related domains: are they applicable?