Design of a Click-tracking Network for
Full-text Search Engine
Group 5: Yuan Hu, Yu Ge, Youwen Gong, Zenghui Qiu and Miao Liu
Outline
• Introduction
• Objective
• Project diagram
  – Web Crawling
  – Indexing schema
• Ranking strategies
  – PageRank Algorithms
  – Neural Network
  – Content-Based Ranking
• Software and Reference
Introduction
• Full-text Search Engine
  – search on key words
  – rank results
• What is in a Search Engine?
  – Crawling
  – Indexing
  – Ranking results of query
Project Diagram
[Diagram, in flow order: Website, Crawling, Text & urls, Database, Indexing, Query Function, ranking by Click-Tracking Network / PageRank Algorithms / Content-Based Ranking, Ranked results]
Web Crawling
Depth 1: crawling all the url links on the main page
Depth 2: crawling all the url links found in depth 1
Main page:
……
http://en.wikipedia.org/wiki/Machine_learning
http://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain
http://en.wikipedia.org/wiki/Machine_learning#Decision_tree_learning
……
# Implemented with Python urllib2 module and BeautifulSoup API
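The depth-limited crawl described above can be sketched as follows. This is a minimal illustration, not the group's code: it is written for Python 3 and swaps in the standard-library `html.parser` for the urllib2 and BeautifulSoup combination named on the slide. The names `LinkExtractor`, `extract_links`, `crawl` and `FAKE_WEB` are hypothetical, the `fetch` argument is an assumed callable that returns a page's HTML, and dropping `#fragment` parts is a choice made for this sketch only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs linked from html, without #fragments."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith("http") and url not in links:
            links.append(url)
    return links


def crawl(start_url, fetch, depth=2):
    """Breadth-first crawl. Depth 1 follows the links on the start page,
    depth 2 follows the links found at depth 1."""
    seen = {start_url}
    frontier = [start_url]
    link_pairs = []  # (from_url, to_url), the raw material for a link table
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            try:
                html = fetch(page)
            except Exception:
                continue  # skip pages that cannot be fetched
            for link in extract_links(page, html):
                link_pairs.append((page, link))
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen, link_pairs


# Tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.org/a": '<a href="/c">C</a>',
    "http://example.org/b": '<a href="/">home</a>',
    "http://example.org/c": '<a href="/d">D</a>',
}
pages, links = crawl("http://example.org/", FAKE_WEB.__getitem__, depth=2)
```

With `depth=2`, page `/c` is discovered but never fetched, so `/d` is not reached. For a real crawl, `fetch` would wrap an HTTP client.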
Schema for Basic Index
Table          Columns
Link           Row_ID, From_ID, To_ID
Url_list       Row_ID, Url
Word_location  Url_ID, Word_ID, Location
Word_list      Row_ID, Word
Link_words     Word_ID, Link_ID
# Implemented with SQLite
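As a rough illustration of the five-table schema, the sketch below creates the tables with Python's built-in `sqlite3` module and indexes one page. The table and column names follow the slide; the column types, the three indexes, and the helpers `create_index_db`, `get_or_create` and `add_page` are assumptions made for this example, as is splitting text on whitespace.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Url_list      (Row_ID INTEGER PRIMARY KEY, Url TEXT);
CREATE TABLE Word_list     (Row_ID INTEGER PRIMARY KEY, Word TEXT);
CREATE TABLE Word_location (Url_ID INTEGER, Word_ID INTEGER, Location INTEGER);
CREATE TABLE Link          (Row_ID INTEGER PRIMARY KEY, From_ID INTEGER, To_ID INTEGER);
CREATE TABLE Link_words    (Word_ID INTEGER, Link_ID INTEGER);
CREATE INDEX Word_idx      ON Word_list(Word);
CREATE INDEX Url_idx       ON Url_list(Url);
CREATE INDEX Word_url_idx  ON Word_location(Word_ID);
"""


def create_index_db(path=":memory:"):
    """Open a database and create the (assumed) index schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def get_or_create(conn, table, column, value):
    """Return the Row_ID holding value, inserting a new row if needed."""
    row = conn.execute(
        f"SELECT Row_ID FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cur.lastrowid


def add_page(conn, url, text):
    """Record every word of text together with its position on the page."""
    url_id = get_or_create(conn, "Url_list", "Url", url)
    for location, word in enumerate(text.lower().split()):
        word_id = get_or_create(conn, "Word_list", "Word", word)
        conn.execute(
            "INSERT INTO Word_location VALUES (?, ?, ?)",
            (url_id, word_id, location),
        )
    return url_id


conn = create_index_db()
page_id = add_page(conn, "http://example.org/ml",
                   "machine learning uses decision tree learning")
location_count = conn.execute("SELECT COUNT(*) FROM Word_location").fetchone()[0]
word_count = conn.execute("SELECT COUNT(*) FROM Word_list").fetchone()[0]
```

Storing one row per word occurrence is what later makes location-based and distance-based scoring possible.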
Results for Multiple-word Query
Query function output, for each combination of the query words: the shared url_id and the location of each word.
! Note that the url_ids returned are not yet ranked.
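One way to obtain such rows is a self-join of the word-location table, one alias per query word. The sketch below is an assumption about how the query function could look, not the group's code; `match_rows` is a hypothetical name and the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Word_list     (Row_ID INTEGER PRIMARY KEY, Word TEXT);
CREATE TABLE Word_location (Url_ID INTEGER, Word_ID INTEGER, Location INTEGER);
""")
conn.executemany("INSERT INTO Word_list (Row_ID, Word) VALUES (?, ?)",
                 [(1, "machine"), (2, "learning"), (3, "tree")])
conn.executemany("INSERT INTO Word_location VALUES (?, ?, ?)", [
    (10, 1, 0), (10, 2, 1), (10, 2, 5),  # url 10 contains both words
    (20, 2, 3),                          # url 20 contains only "learning"
    (30, 1, 7), (30, 2, 2),              # url 30 contains both words
])


def match_rows(conn, query):
    """Return (rows, word_ids). Each row is (url_id, loc_word0, loc_word1, ...)
    for a URL that contains every query word. Rows are not ranked."""
    word_ids = []
    for word in query.lower().split():
        row = conn.execute(
            "SELECT Row_ID FROM Word_list WHERE Word = ?", (word,)
        ).fetchone()
        if row is None:
            return [], []  # a word that was never indexed matches nothing
        word_ids.append(row[0])
    if not word_ids:
        return [], []

    fields, tables, clauses = ["w0.Url_ID"], [], []
    for i in range(len(word_ids)):
        fields.append(f"w{i}.Location")
        tables.append(f"Word_location w{i}")
        if i > 0:
            # every alias must refer to the same URL
            clauses.append(f"w{i - 1}.Url_ID = w{i}.Url_ID")
        clauses.append(f"w{i}.Word_ID = ?")
    sql = (f"SELECT {', '.join(fields)} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(clauses)}")
    return conn.execute(sql, word_ids).fetchall(), word_ids


rows, word_ids = match_rows(conn, "machine learning")
```

Each URL appears once per combination of word locations, which is why a page containing a word twice yields two rows.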
PageRank Algorithm
• Developed by Larry Page and Sergey Brin at Stanford University in 1996.
• Measures how important a page is.
• The importance of a page is calculated from all the other pages that link to it.
http://www.rasch.org/rmt/rmt232a.htm
How to Calculate PR
PR(A) = (1 - d) + d * ( PR(B)/L(B) + ... + PR(D)/L(D) + ... )
• d: damping factor, 0 < d < 1; 0.85 is used here.
• PR(B), ..., PR(D), ...: PageRank value of each webpage linking to page A.
• L(B), ..., L(D), ...: the number of links going out of page B, ..., D, ...
Example
PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/4 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.125 + 0.175 + 0.2 )
      = 0.15 + 0.85 * 0.5
      = 0.575
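The arithmetic of this example can be checked with a few lines of Python. The function name `pagerank_of` and the tuple layout are made up for this illustration.

```python
def pagerank_of(inbound, d=0.85):
    """PageRank of one page, given (pr_value, outgoing_link_count)
    for every page that links to it."""
    return (1 - d) + d * sum(pr / links for pr, links in inbound)


# Pages B, C and D from the example: PR 0.5, 0.7, 0.2 with 4, 4 and 1 links.
pr_a = pagerank_of([(0.5, 4), (0.7, 4), (0.2, 1)])
print(round(pr_a, 3))  # 0.575
```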
How to Update the PR Value
If we don’t know what the PR values should be to begin with, just assign an initial PR value to every page, then update every page's value repeatedly (20 iterations).
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
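A minimal sketch of this iterative update, under stated assumptions: the link graph is held in a plain dictionary rather than the SQLite tables, every page starts at an initial value of 1.0 (the slide does not say which value was used), and all pages are updated together from the previous iteration's values. The function name `pagerank` and the example graph are hypothetical.

```python
def pagerank(links, d=0.85, iterations=20, initial=1.0):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: initial for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(src) / L(src) over every page src that links to page.
            total = sum(pr[src] / len(targets)
                        for src, targets in links.items()
                        if page in targets)
            new_pr[page] = (1 - d) + d * total
        pr = new_pr
    return pr


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page D has no inbound links, so its value settles at 1 - d after the first iteration, while page C, which three pages link to, ends up well above B.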
Neural Network
Why?
• It can make reasonable guesses about results for queries it has never seen before.
Click-tracking
• The weights are updated based on the search results that the user clicked.
Neural Network
• Step1: Setting Up the Database
• Step2: Feeding Forward Activation
• Step3: Training with BackPropagation
How does the Neural Network work?
Solid line: strong connections
Bold text: active node
Step1: Setting Up the ANN Database
• Create a table for the hidden layer (red box)
• Create two tables for the connections (green boxes)
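The three tables could be set up along the following lines. This is a sketch under assumptions: the names `Hidden_node`, `Word_hidden` and `Hidden_url`, the `Create_key` column, and the default strengths returned for connections that do not exist yet are all invented for this example; only the From ID, To ID and Strength columns are suggested by the slides.

```python
import sqlite3

# Assumed default strength for connections that have not been stored yet.
DEFAULTS = {"Word_hidden": -0.2, "Hidden_url": 0.0}


def make_ann_tables(conn):
    """One table for hidden nodes, two for the connections on either side."""
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS Hidden_node (Row_ID INTEGER PRIMARY KEY, Create_key TEXT);
    CREATE TABLE IF NOT EXISTS Word_hidden (From_ID INTEGER, To_ID INTEGER, Strength REAL);
    CREATE TABLE IF NOT EXISTS Hidden_url  (From_ID INTEGER, To_ID INTEGER, Strength REAL);
    """)


def get_strength(conn, table, from_id, to_id):
    """Return the stored connection strength, or the table's default."""
    row = conn.execute(
        f"SELECT Strength FROM {table} WHERE From_ID = ? AND To_ID = ?",
        (from_id, to_id),
    ).fetchone()
    return DEFAULTS[table] if row is None else row[0]


def set_strength(conn, table, from_id, to_id, strength):
    """Update the connection if it exists, otherwise create it."""
    cur = conn.execute(
        f"UPDATE {table} SET Strength = ? WHERE From_ID = ? AND To_ID = ?",
        (strength, from_id, to_id),
    )
    if cur.rowcount == 0:
        conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)",
                     (from_id, to_id, strength))


ann = sqlite3.connect(":memory:")
make_ann_tables(ann)
before = get_strength(ann, "Word_hidden", 1, 1)   # nothing stored: default
set_strength(ann, "Word_hidden", 1, 1, 0.5)
after = get_strength(ann, "Word_hidden", 1, 1)
```

Keeping the weights in the database means the network only needs to load the connections relevant to the current query.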
Step2: Feeding Forward Activation
• Objective: activate the ANN.
  – Take words as inputs
  – Activate the links in the network
  – Give outputs for URLs
• Hyperbolic tangent function
X-axis: total input to the node
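A feed-forward pass with a tanh activation can be sketched as below. The weights are held in plain nested lists rather than the database tables, and the function name, the layer sizes and the weight values are all made up for illustration.

```python
import math


def feed_forward(word_inputs, w_word_hidden, w_hidden_url):
    """word_inputs: 1.0 for each query word that is present, else 0.0.
    w_word_hidden[i][j]: weight from word i to hidden node j.
    w_hidden_url[j][k]:  weight from hidden node j to URL k."""
    n_hidden = len(w_word_hidden[0])
    n_urls = len(w_hidden_url[0])

    # Each node sums its weighted inputs and squashes the total with tanh.
    hidden = [
        math.tanh(sum(word_inputs[i] * w_word_hidden[i][j]
                      for i in range(len(word_inputs))))
        for j in range(n_hidden)
    ]
    outputs = [
        math.tanh(sum(hidden[j] * w_hidden_url[j][k] for j in range(n_hidden)))
        for k in range(n_urls)
    ]
    return hidden, outputs


# Two query words, two hidden nodes, three candidate URLs (made-up weights).
w_in = [[0.8, -0.2], [0.4, 0.1]]
w_out = [[0.9, 0.1, -0.3], [0.0, 0.5, 0.2]]
hidden, url_scores = feed_forward([1.0, 1.0], w_in, w_out)
```

Because tanh maps any total input into the range -1 to 1, the URL outputs can be compared directly as relevance scores.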
Step3: Training with Backpropagation
• Train the network every time someone performs a search and chooses one of the links
• The same algorithm covered in class.
• Learning rate = 0.5
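The sketch below shows one common form of backpropagation for a network with tanh units and a learning rate of 0.5. It is an illustration under assumptions, not the group's code: the class name `ClickNet`, the initial weights of 0.1, and the target vector (1 for the clicked URL, 0 for the others) are choices made for this example.

```python
import math


def dtanh(y):
    """Slope of tanh, expressed in terms of the node's output y."""
    return 1.0 - y * y


class ClickNet:
    def __init__(self, n_words, n_hidden, n_urls, rate=0.5):
        self.rate = rate
        self.wi = [[0.1] * n_hidden for _ in range(n_words)]  # word -> hidden
        self.wo = [[0.1] * n_urls for _ in range(n_hidden)]   # hidden -> URL

    def forward(self, inputs):
        self.ai = list(inputs)
        self.ah = [math.tanh(sum(self.ai[i] * self.wi[i][j]
                                 for i in range(len(self.ai))))
                   for j in range(len(self.wi[0]))]
        self.ao = [math.tanh(sum(self.ah[j] * self.wo[j][k]
                                 for j in range(len(self.ah))))
                   for k in range(len(self.wo[0]))]
        return list(self.ao)

    def train(self, inputs, clicked):
        """One training step for a search in which URL index `clicked` was chosen."""
        outputs = self.forward(inputs)
        targets = [1.0 if k == clicked else 0.0 for k in range(len(outputs))]

        # Error terms for the output layer, then for the hidden layer.
        out_delta = [dtanh(outputs[k]) * (targets[k] - outputs[k])
                     for k in range(len(outputs))]
        hid_delta = [dtanh(self.ah[j]) *
                     sum(out_delta[k] * self.wo[j][k] for k in range(len(outputs)))
                     for j in range(len(self.ah))]

        # Weight updates, scaled by the learning rate.
        for j in range(len(self.ah)):
            for k in range(len(outputs)):
                self.wo[j][k] += self.rate * out_delta[k] * self.ah[j]
        for i in range(len(self.ai)):
            for j in range(len(self.ah)):
                self.wi[i][j] += self.rate * hid_delta[j] * self.ai[i]


net = ClickNet(n_words=3, n_hidden=2, n_urls=3)
query = [1.0, 1.0, 0.0]            # words 0 and 1 are in the query
before = net.forward(query)
for _ in range(30):
    net.train(query, clicked=1)    # the user keeps clicking URL 1
after = net.forward(query)
```

After repeated clicks on the same result, the output for that URL rises above the others for this query.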
Results For Neural Network
Step 1: From ID | To ID (Hidden node) | Strength
Step 2: relevance of URL | input URL
Step 3: Training with one query
Content-Based Ranking
• Word frequency
• Document location
• Word distance
Basic Idea: Calculate a score based only on the query and the content of the page
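The three content-based metrics can be sketched as follows. The input format is an assumption: each row is `(url_id, location_of_word_1, location_of_word_2, ...)`, one row per combination of word locations on a page. The function names, the equal weights and the `+ 1` smoothing in the normalization are choices made for this example.

```python
def normalize(scores, small_is_better=False):
    """Scale scores so that the best URL gets 1.0."""
    if small_is_better:
        best = min(scores.values())
        # + 1 avoids division by zero when a location or distance is 0
        return {url: (best + 1.0) / (s + 1.0) for url, s in scores.items()}
    best = max(scores.values()) or 1.0
    return {url: s / best for url, s in scores.items()}


def frequency_score(rows):
    """More matching rows for a URL means the words occur more often."""
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return normalize(counts)


def location_score(rows):
    """Words appearing near the top of the page score higher."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        best[row[0]] = min(total, best.get(row[0], total))
    return normalize(best, small_is_better=True)


def distance_score(rows):
    """Query words that sit close together score higher."""
    if len(rows[0]) <= 2:  # single-word query: distance is meaningless
        return {row[0]: 1.0 for row in rows}
    best = {}
    for row in rows:
        dist = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        best[row[0]] = min(dist, best.get(row[0], dist))
    return normalize(best, small_is_better=True)


def combined_score(rows, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three metrics, best URL first."""
    parts = [frequency_score(rows), location_score(rows), distance_score(rows)]
    totals = {}
    for weight, scores in zip(weights, parts):
        for url, s in scores.items():
            totals[url] = totals.get(url, 0.0) + weight * s
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


rows = [(10, 0, 1), (10, 0, 5), (30, 7, 2)]
ranking = combined_score(rows)
```

In this example URL 10 wins on all three metrics: it has more matches, its words appear earlier, and they are adjacent.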
Reference
• Programming Collective Intelligence, Toby Segaran
• SQLite Tutorial, ZetCode
• Dive into Python, Mark Pilgrim
Software
• Ubuntu 11.04
• Python 2.7.3
• SQLite