Trends in Web Search and its relevance to Digital Libraries
Min-Yen Kan
Web IR NLP Group (WING)
National University of Singapore
26 Sep 2008
Min-Yen Kan, WING@NUS
World Scientific Talk
Tips on Web Searching
• Visualize results, then come up with multiple queries
• Use multiple search engines
• Advanced Search
– inurl:, site:
– “Phrasal search”
But that’s just general search…
• Federated resources / Niche search engines
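The advanced operators listed on this slide can also be combined programmatically. The following is a minimal sketch, assuming the search engine accepts the `site:` and `inurl:` operators and double-quoted phrases; the function name `build_query` and the example values are illustrative and not from the talk.

```python
def build_query(terms, phrase=None, site=None, inurl=None):
    """Compose a search query string from plain terms and optional operators."""
    parts = list(terms)
    if phrase:
        # Double quotes request an exact phrase match.
        parts.append(f'"{phrase}"')
    if site:
        # Restrict results to a single site or domain.
        parts.append(f"site:{site}")
    if inurl:
        # Require the given text to appear in the result URL.
        parts.append(f"inurl:{inurl}")
    return " ".join(parts)


# Illustrative values only.
print(build_query(["digital", "libraries"], phrase="web search",
                  site="nus.edu.sg", inurl="publications"))
```

Generating several such variants of one information need is one way to follow the "multiple queries" tip.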
Site- and Task-specific resources
• Site Prestige: Know what others think and do
– Google PageRank (Link structure), Alexa (Traffic)
– Google Trends / Insight (Queries)
• Social Searching (Web 2.0): The voice of the reader / critic
– (Bookmarks / Tags) Del.icio.us, Citeulike.org, Bibsonomy.org
– (News) Digg / Slashdot
– (Blogs) Google Blog, Technorati
• People Search: Finding public information on a person
– Spock (web), Zabasearch (US only)
– LinkedIn, Facebook
– Must validate your sources
http://labs.digg.com/arc/
Expert Search
Find people who will advocate on your behalf
• What do they want?
• Scholar:
– Active? → Check their recent articles
– Names common? → Define area of interest
– Compare against peers
– Download vs. citation counts
• Patent search:
– Referenced by: (citation count; different from Scholar)
• Identifying webfaced advocates:
– Blog search, PageRank
http://flickr.com/photos/phauly/
How do machines do it?
• Expert search task as benchmark test
• Download web pages to analyze
• Needed to deal with spam pages
• Used PageRank to assess prestige
→ Impact
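To make "used PageRank to assess prestige" concrete, here is a minimal sketch of PageRank computed by power iteration. It is a textbook formulation, not the system described in the talk; the toy link graph, the page names, the damping factor of 0.85 and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a page with no incoming links earns little prestige.
toy_web = {
    "publisher": ["author_blog"],
    "author_blog": ["publisher", "library"],
    "library": ["publisher"],
    "spam": ["publisher"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page nobody links to ends up with the lowest score, which is the intuition behind using link structure as a prestige signal.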
Problem or opportunity?
• Revenue from print continually declining
• Students and researchers rely on the internet
• Researchers want archiving rights – freedom of academic information
Characteristics:
• Not zero-sum content
• Distribution is now largely the role of search engines
→ Necessitates new role of publisher and new revenue model
– Will classic models work? Advertising, Subscription, Transactional & Bundling
– Variants? Versioning (Varian), Moving window (JSTOR)
http://flickr.com/photos/danielbroche/
The game has fundamentally changed
Forecasting
–
Content is becoming free
– MIT / Stanford opening up textbooks
– Open access archiving
→ Long term: content will not be the primary revenue source
eBook revenue hasn’t lived up to its promise yet…
– Device gap: iPhone and nextGen devices
→ Revenue may be further down the pipe
+
Academic publishers
– Connect to libraries and federations at institution level
– Individual customers are secondary
Trusted source
– Expertise in copyediting, typesetting, project management, distribution, social networking
– Many individual web publishers rediscovering same problems
→ Consultancy model
→ Win-win partnerships with individual authors
Web Trends
• Social Content
• Wisdom of masses: Crowdsourcing
• Rich Media
• Open Source / Access
Paradigmatic change
– Classifieds → Craigslist
– POTS → Skype
– CD store → iTunes
– Publishers → ??
http://www.informationarchitects.jp/slash/iA_WebTrends_2007_2_1024_768.gif
Where is research going?
• Search API usage
• Browser as computer
• Web page structure, mining text data
• Modeling web users at tasks: Exploring / Fact-finding
• Personalization, recommending
• Social networks
• Understanding opinion
• Query and log analysis
http://flickr.com/photos/alisdair/
User centric
Server centric
Webfaced pop quiz – which is which?
Springer
American Statistical Society
World Scientific
courtesy: http://pagerank.si/
WING@NUS
Forecast: Know your strengths
Get advocates
• Make it easy for individuals to insist that their institution buy your materials
• Know who is accessing (not necessarily buying) your content
Content revenue will continue to decline
• Find an economic model that works for you
• Work as partners in content creation
Be savvy on trends
• Be visible: do “white hat” Search Engine Optimization (SEO)
• Make your abstracts indexable by others
+
Academic publishers
– Connect to libraries and federations at institution level
– Individual customers are secondary
Trusted source
– Expertise in copyediting, typesetting, project management, distribution, social networking
– Many individual web publishers rediscovering same problems
→ Consultancy model
→ Win-win partnerships with individual authors
Trends in Digital Libraries
• Expanding types of information in search
• Automated tools for DLs
• Usability in E-books and online media
• User modeling
• Personalization, annotation and relation to other user tasks
http://flickr.com/photos/pathfinderlinden
>> WING @ NUS
Scholarly Digital Libraries
• ForeCite: our scholarly DL
• Data Cleaning
• Slide and Document Alignment
• Searching in the OPAC
• Math Information Retrieval
ForeCite: Beyond the document as an item
A user-centric DL framework
• Put author / reader functionality together
• Tagging, correction, annotation and viewing
• Automatic tools: keyphrases and sentence classification
• For use on and offline, organizes local PDF files for you
• Only need your web browser
[Diagram labels: Server, Client]
Data Cleaning
• Addresses
– Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802
– LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343
• Products
– Honda Fit vs. Honda Jazz
– Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
• Idea: use web as additional context for disambiguation and clustering
• Placed 3rd in Web People Search Task (WEPS 2007)
Search results:
– “Jeffrey D. Ullman”: 384,000 pages; “Jeffrey D. Ullman” + “aho”: 174,000 pages (45%)
– “J. Ullman”: 124,000 pages; “J. Ullman” + “aho”: 41,000 pages (33%)
– “Shimon Ullman”: 27,300 pages; “Shimon Ullman” + “aho”: 66 pages (0%)
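The percentages on this slide can be reproduced from the page counts. The sketch below assumes each percentage is the share of a name's result pages that also mention the context term “aho”; the function name `cooccurrence_ratio` and the 10% cut-off are hypothetical and only illustrate how such a ratio could feed a disambiguation decision.

```python
# Page counts as reported on the slide.
name_counts = {
    "Jeffrey D. Ullman": 384_000,
    "J. Ullman": 124_000,
    "Shimon Ullman": 27_300,
}
# Page counts for the same names queried together with "aho".
joint_counts = {
    "Jeffrey D. Ullman": 174_000,
    "J. Ullman": 41_000,
    "Shimon Ullman": 66,
}


def cooccurrence_ratio(joint, total):
    """Fraction of a name's pages that also contain the context term."""
    return joint / total if total else 0.0


ratios = {
    name: cooccurrence_ratio(joint_counts[name], total)
    for name, total in name_counts.items()
}

# Hypothetical cut-off: names sharing the context strongly are grouped together.
THRESHOLD = 0.10
same_cluster = [name for name, r in ratios.items() if r >= THRESHOLD]

for name, r in ratios.items():
    print(f"{name}: {r:.0%}")
print("Likely the same person:", same_cluster)
```

Under these assumptions “Jeffrey D. Ullman” and “J. Ullman” share the “aho” context strongly, while “Shimon Ullman” does not, which matches the 45% / 33% / 0% figures on the slide.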
Slides and their relationship to documents
[Figure panels: Document in focus; Slides in focus]
Searching in Libraries
http://linc.comp.nus.edu.sg
Symbolic Information Search
How do users want to search math materials?
Our answer: Text-to-Expression Linking
– Resolve text keywords to expressions
– e.g., “Pythagorean Theorem” → “a² + b² = c²” or “x² + y² = z²”
Reduce the need for expression input
Solves the notational variation problem
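As a rough illustration of text-to-expression linking, the sketch below resolves a keyword query to stored expressions through a hand-built lookup table. The table, the function name `link_text_to_expressions` and the normalization step are assumptions made for this example; the talk does not describe how the group's system performs the resolution.

```python
# Hypothetical index from normalized keywords to known notational variants.
EXPRESSION_INDEX = {
    "pythagorean theorem": ["a^2 + b^2 = c^2", "x^2 + y^2 = z^2"],
}


def link_text_to_expressions(query, index=EXPRESSION_INDEX):
    """Return every stored expression linked to a text keyword query."""
    # Normalize case and whitespace so small typing differences still match.
    key = " ".join(query.lower().split())
    return index.get(key, [])


print(link_text_to_expressions("Pythagorean Theorem"))
```

Because one keyword maps to all notational variants, the user never has to type an expression, and differently named variables are retrieved together.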
Not quite right…
Conclusions
• Consider us your research WING!
• Trade data and problems for solutions and interns
Meanwhile:
• Use better search strategies
• Practice white hat SEO
• Identify webfaced advocates
References
• Kahin and Varian (2000) Internet Publishing and Beyond
• Towle et al. (2007) Electronic Books in the 2003-2005 Period, Pub Res Q 23:95-104
Photo Credits
• Flickr Creative Commons Search
Thanks to all of you for listening
& my fellow WING group members
Abstract
I will present trends in current academic research on web search and digital libraries, and discuss their relevance to publishers and their economic model. With respect to the web, I will cover how search engines are starting to specialize and use click-through and ad data to improve relevance ranking. With respect to digital library research, I discuss my group's research at NUS on advancing the state of the art in scholarly digital libraries. I cover advances in how we deal with data cleaning issues, and slide and equation retrieval and alignment.