CSE 8337 – Spring 2007 Project 2 – Deep Web Manoj Ravuru - Student ID 22508269

Deep (Invisible) Web

- Manoj Ravuru

Outline

Web and Search Engines

Types of Web

What is the Deep Web? How big is it? Is it important?

What makes it Deep and what is in it?

Deep Web content classification and categories

Crawling and Indexing Deep Web

Deep Web Statistics

Deep Web Quality

How to find and use the Deep Web?

Deep Web Gateways

Deep Web Issues

Summary

References

Web and Search Engines

In 1991, the Web was created by Tim Berners-Lee, a researcher at CERN, the high-energy physics laboratory in Switzerland.

Berners-Lee designed the Web to be platform-independent.

To enable this cross-platform capability, Berners-Lee created HTML (Hypertext Markup Language), a simplified version of SGML (Standard Generalized Markup Language).

The simplicity of the markup format made it easier to introduce search engines, which users can use to search for and retrieve HTML documents of interest on the Web.

The Shallow Web, also known as the Surface Web or Static Web, is the collection of Web sites indexed by automated search engines.

A search engine's Web crawler follows URL links on the Web, indexes the words on every HTML page it reaches, and stores them in huge databases that can be searched on demand.
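The crawl-and-index loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the four-page `TOY_WEB`, its URLs, and the helper names are all assumptions made up for the example, and pages are read from a dictionary instead of the network.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "Web"; URLs and page contents are invented for illustration.
TOY_WEB = {
    "http://example.org/": '<html><body>Welcome to the physics lab. '
                           '<a href="http://example.org/a">Papers</a></body></html>',
    "http://example.org/a": '<html><body>Physics papers index. '
                            '<a href="http://example.org/b">Contact</a></body></html>',
    "http://example.org/b": "<html><body>Contact the lab.</body></html>",
    "http://example.org/orphan": "<html><body>Nobody links here.</body></html>",
}


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl_and_index(seed, fetch):
    """Breadth-first crawl from `seed`; returns a word -> set-of-URLs index."""
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for link in parser.links:  # follow links not visited yet
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl_and_index("http://example.org/", TOY_WEB.get)
print(sorted(index["physics"]))  # pages reachable by links that mention "physics"
print(sorted(index["nobody"]))   # the orphan page is never reached, so this is empty
```

The page at `/orphan` has no inbound links, so the crawler never sees it. That is the simplest way content ends up outside the Surface Web.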

Types of Web

Static Web

Dynamic Web

Opaque Web

Private Web

Proprietary Web

Pay per click Web

What is the Deep Web?

Web pages that draw on vast information repositories which search engines cannot or will not index.

The term mainly refers to rich content that search engines don't have direct access to, such as databases.

Deep Web pages are dynamically created as the result of a specific search.

The Deep Web is also called the Invisible Web.

The term "invisible" in "Invisible Web" is actually a misnomer.

Deep Web information is available via the Web but isn't accessible to search engines.
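A small sketch may help show why dynamically created pages escape a link-following crawler. Everything here is assumed for illustration: the `chemicals` table, its rows, the `/search` form, and the `results_page` helper are hypothetical, and an in-memory SQLite database stands in for a real back end.

```python
import sqlite3

# Hypothetical back-end database sitting behind a search form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chemicals (name TEXT, hazard TEXT)")
conn.executemany(
    "INSERT INTO chemicals VALUES (?, ?)",
    [("benzene", "carcinogen"), ("toluene", "irritant"), ("ammonia", "corrosive")],
)

# The only static page the site exposes: a form, with no links to any record.
SEARCH_FORM = '<form action="/search"><input name="q"></form>'


def results_page(query):
    """Build the HTML for one query; the page exists only when someone asks for it."""
    rows = conn.execute(
        "SELECT name, hazard FROM chemicals WHERE name LIKE ? ORDER BY name",
        (f"%{query}%",),
    ).fetchall()
    items = "".join(f"<li>{name}: {hazard}</li>" for name, hazard in rows)
    return f"<html><body><ul>{items}</ul></body></html>"


print("<a " in SEARCH_FORM)   # the form contains no hyperlinks to follow
print(results_page("ene"))    # records appear only in response to a query
```

A crawler that only follows hyperlinks reaches the form and stops, because the records are visible only to a visitor who types a query.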

How big is the Invisible Web?

Its size cannot be determined accurately.

In a word, it's humongous.

The Deep Web is estimated to be approximately 500 times bigger than the searchable (surface) Web, and may be bigger than that.

Considering that Google alone covers around 8 billion pages, that's just mind-boggling.

If the major search engines together index only 20% of the Web, then they miss 80% of the content.

The Deep Web includes images, sounds, presentations, and many other types of media not visible to search engines.

Is the Deep Web Important?

Think of the Web as a vast library: finding what's needed requires more digging.

Search engines search only a very small portion of the Web, which makes the Invisible Web a very tempting resource. There's a lot more information out there than one might imagine.

A significant share of Deep Web content is quality content held in documents within searchable databases on the Web, which conventional (well-known and widely used) search engines can't access.

As a result, businesses, researchers, consumers, and others may not currently get the quality information they need.

Search engines themselves have problems providing relevant content, at least for somewhat complicated or obscure queries.

Why the name “Invisible”?

When spiders crawling through the Web run into a page from the Invisible Web, they don't know quite what to do with it.

A spider can record the address of a page it couldn’t access, but it can't tell what information the page contains.

The main causes are technical barriers, e.g., databases, password-protected pages, and script-based pages.

What makes it Deep?

Proprietary sites

Sites requiring a registration

Sites with scripts

Dynamic sites

Ephemeral sites

Sites blocked by local webmasters

Sites blocked by search engine policy

Sites with special formats

Searchable databases
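One item in the list above, sites blocked by local webmasters, is commonly implemented with a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would interpret such a file. The rules, the `example.org` host, and the `ExampleBot` user agent are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a webmaster might publish to keep crawlers out.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text instead of fetching over HTTP

for path in ("/index.html", "/private/report.html", "/search?q=deep+web"):
    url = "http://example.org" + path
    # can_fetch() reports whether the named crawler is allowed to request the URL
    print(path, parser.can_fetch("ExampleBot", url))
```

Pages under the disallowed paths stay reachable in a browser, but a compliant crawler skips them, so they never enter the index.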

Other factors…

Search engines exclude some pages by policy.

Spiders/crawlers do not report what they can't index.

The sheer task of actually finding all the pages on the Web.

Deep Web resource classification

Dynamic content - Dynamic pages returned in response to a submitted query

Unlinked content - Pages which are not linked to by other pages

Limited access content - Sites that require registration or otherwise limit access to their pages

Scripted content - Pages that are only accessible through links produced by JavaScript or Flash, which require special handling

Non-text content - Multimedia (image) files, Usenet archives, and documents in non-HTML file formats such as PDF and DOC

Deep web content categories

Crawling & Indexing the Deep Web

Major search engines such as Google, AltaVista, and Inktomi do index dynamic content through the following programs:

Paid partnership programs

Trusted feed services

Premium inclusion programs

Quigo's QUIBOT remotely crawls through pages from the deep Web, enabling it to index a large portion of the deep Web and making this content available to users searching on Quigo and partner portals.

Quigo's DeepWebGateway enables search engines to index deep Web content that they do not access directly. This technology also solves other problems related to deep Web crawling and indexing, such as spider traps and personalization.

Deep Web Statistics

Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.

The deep Web contains 7,500 terabytes of information compared to 19 terabytes of information in the surface Web.

The deep Web contains nearly 550 billion individual documents compared to the 2.5 billion of the surface Web.

Ninety-five percent of the deep Web is publicly accessible information that is not subject to fees or subscriptions.

More than 200,000 deep Web sites presently exist.

Sixty of the largest deep Web sites collectively contain about 750 terabytes of information, sufficient by themselves to exceed the size of the surface Web 40 times.

On average, deep Web sites receive fifty percent greater monthly traffic than surface sites and are more highly linked to than surface sites.
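As a rough cross-check of how these figures relate, the arithmetic below derives the surface-Web size implied by the 60-site statement and the resulting deep-to-surface ratio. It uses only numbers quoted on this slide; the variable names are illustrative.

```python
deep_web_tb = 7_500      # total deep Web size quoted above, in terabytes
top60_tb = 750           # the 60 largest deep Web sites, in terabytes
top60_multiple = 40      # "exceed the size of the surface Web 40 times"

# Surface-Web size implied by the 60-site statement
implied_surface_tb = top60_tb / top60_multiple

# Deep-to-surface size ratio under that implied surface size
deep_to_surface = deep_web_tb / implied_surface_tb

# Share of the deep Web held by the 60 largest sites
top60_share = top60_tb / deep_web_tb

print(implied_surface_tb)  # 18.75
print(deep_to_surface)     # 400.0
print(top60_share)         # 0.1
```

Under these figures the ratio comes out at the lower end of the quoted 400 to 550 range, and the 60 largest sites hold about a tenth of all deep Web data.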

Deep Web Statistics (contd)

Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.

Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.

More than half of the deep Web content resides in topic-specific databases.

Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.

More than 95% of deep Web information is publicly available without restriction.

International Data Corporation predicts that the number of surface Web documents will grow from the current two billion or so to 13 billion within three years, a factor increase of 6.5 times. Deep Web growth should exceed this rate, perhaps increasing about nine-fold over the same period.
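The growth projection can be restated as a yearly rate. The sketch below assumes steady compound growth over the three years, which the source does not claim, so the annual factors are only an illustration of what the quoted totals would imply.

```python
surface_now = 2e9      # "current two billion or so" surface Web documents
surface_later = 13e9   # projected surface Web documents
years = 3

# Overall growth factor for the surface Web over the period
surface_factor = surface_later / surface_now

# Equivalent yearly growth factors, assuming steady compound growth
surface_annual = surface_factor ** (1 / years)
deep_annual = 9 ** (1 / years)   # "about nine-fold over the same period"

print(surface_factor)
print(round(surface_annual, 3), round(deep_annual, 3))
```

Under that assumption the surface Web would grow by a little under a factor of two each year and the deep Web by a little over a factor of two.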

Deep Web Quality

There is about a three-fold improvement in the likelihood of obtaining quality results from the deep Web as compared to the surface Web.

Overall precision and recall would be higher due to the presence of highly relevant information for each subject area.

The degree of content overlap between deep Web sites is expected to be much lower than for surface Web sites.

Observations from working with deep Web sources and data suggest there are important information categories where duplication does exist. Prominent among these are yellow/white pages, genealogical records, and public records with commercial potential such as SEC filings. On the other hand, there are entire categories of deep Web sites whose content appears uniquely valuable. These mostly fall within the categories of topical databases, publications, and internal site indices, which account in total for about 80% of deep Web sites.

Duplication is therefore expected to be lower within the deep Web.
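Precision and recall, mentioned above, have standard definitions that a short example can make concrete. The document IDs and the two result lists below are invented purely to show the calculation; they are not measurements of any real search.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four documents are truly relevant to some query.
relevant = {"d1", "d2", "d3", "d4"}
surface_results = ["d1", "x1", "x2", "x3", "x4"]   # made-up general search results
deep_results = ["d1", "d2", "d3", "x1"]            # made-up topical database results

print(precision_recall(surface_results, relevant))  # (0.2, 0.25)
print(precision_recall(deep_results, relevant))     # (0.75, 0.75)
```

In this invented case the topical database returns fewer irrelevant documents and more of the relevant ones, which is the pattern the quality claims above describe.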

Finding the Deep Web

General web directories: www.completeplanet.com, www.thebighub.com

Deep Web search engines that send a single query to dozens of databases simultaneously: www.alltheweb.com, www.brightplanet.com

Specialized databases: www.nsdl.org, http://catalog.loc.gov

Use Google and other search engines to locate searchable databases.

Example queries for Google and Yahoo: "languages database" or "toxic chemicals database"
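The idea of sending a single query to many databases at once can be sketched as a fan-out over worker threads. This is only an illustration of the pattern, not the implementation used by any site named above: the three `DATABASES` and their titles are hypothetical stand-ins for remote sources, and a substring match stands in for each source's own search.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote searchable databases.
DATABASES = {
    "library_catalog": ["Deep Web primer", "Search engine history"],
    "article_db": ["Indexing the deep Web", "Cooking basics"],
    "public_records": ["County deed index"],
}


def search_one(name, query):
    """Query a single database; returns (source, title) pairs that match."""
    q = query.lower()
    return [(name, title) for title in DATABASES[name] if q in title.lower()]


def metasearch(query):
    """Send the same query to every database concurrently and merge the hits."""
    names = sorted(DATABASES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        per_source = pool.map(lambda n: search_one(n, query), names)
        return [hit for hits in per_source for hit in hits]


print(metasearch("deep web"))
```

A real gateway would also need timeouts, per-source query translation, and de-duplication of merged results.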

Deep Web search strategies to follow

Be aware that the Deep Web exists.

Use a general search engine for broad topic searching.

Use a searchable database for focused searches.

Register on special sites and use their archives.

Call the reference desk at a local college if in need of a proprietary Web site. Many college libraries subscribe to these services and provide free on-site searching.

Many libraries offer free remote online access to commercial and research databases for anyone with a library card.

Deep Web Gateways – Web Directories

Infomine [http://infomine.ucr.edu/] is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level.

It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.

Infomine is librarian-built. Librarians from the University of California, Wake Forest University, California State University, the University of Detroit - Mercy, and other universities and colleges have contributed to building Infomine.

Infomine Web Directory

Deep Web Gateways – Web Directories

Digital Librarian [http://www.digital-librarian.com/] is a librarian’s choice of the best of the Web.

Deep Web Gateways – Search Engines

Turbo10 is a meta search engine that provides a universal interface to Deep Web search engines.

Turbo10 is designed to help search Deeper and browse faster.

Turbo10 has developed search technology since 2001. It connects Internet searchers to Deep Web search engines.

Turbo10 Deep Web Search Engine

Deep Web Gateways – Search Engines

AlltheWeb [http://www.alltheweb.com/] describes itself as combining one of the largest and freshest indices with powerful search features intended to help anyone find information quickly.

AlltheWeb's index (provided by Yahoo!) includes billions of web pages, as well as tens of millions of PDF and MS Word® files. Yahoo! frequently scans the entire Web to keep this content fresh and to eliminate broken links.

AlltheWeb offers a variety of specialized search tools and advanced search features, and supports searching in 36 different languages. Its image, audio, and video searches include hundreds of millions of multimedia files.

AlltheWeb provides users with the controls necessary to find the most relevant content through its advanced search features.

AlltheWeb – Deep Web Search Engine

Deep Web Gateways – Specialized Databases

NSDL (National Science Digital Library - http://nsdl.org/) was established as an online library which directs users to exemplary resources for science, technology, engineering, and mathematics (STEM) education and research.

NSDL provides an organized point of access to STEM content that is aggregated from a variety of other digital libraries, NSF-funded projects, and NSDL-reviewed web sites.

NSDL also provides access to services and tools that enhance the use of this content in a variety of contexts.

NSDL – Specialized Database

Other notable Deep Web resources

Deep Query Manager (DQM) is BrightPlanet's <http://www.brightplanet.com/> search tool, designed to retrieve information from thousands of Deep Web databases and search engines at one time.

AlphaSearch <http://www.calvin.edu/library/searreso/internet/as/> is an extremely useful directory of "gateway" sites that collect and organize Web sites that focus on a particular subject.

The many databases that make up GPO Access <http://www.access.gpo.gov/>.

Telephone directory databases such as Anywho <http://www.anywho.com/>.

Deep Web Issues

Complete indexing of the Deep Web is impossible.

Deep Web content is dynamic and can change faster than the content of the static/surface Web.

There is no bright line that separates content sources on the Web. Users need to choose the database (Deep Web resource) of interest on their own.

The Deep Web phenomenon is not well known to the Internet-searching public.

The value of Deep Web content is incalculable.

Summary

World Wide Web

“Visible/Surface Web”

    Search Directories. Examples: Librarians Index to the Internet, Yahoo

    Search Engines. Examples: Google, Yahoo, AltaVista

“Invisible/Deep Web”

    Specialized, searchable databases

        Fee-based. Examples: Library catalogs, digital library archives, dictionaries, encyclopedias, article databases

        Free. Examples: Library catalogs, digital library archives, dictionaries, encyclopedias, article databases

Summary

Deep Web content is highly relevant to every information need, market, and domain.

The deep Web is the largest growing category of new information on the Internet.

Serious information seekers can no longer avoid the importance or quality of deep Web information.

Deep Web information is only a component of total information available. Searching must evolve to encompass the complete Web.

Directed query technology is the only means to integrate deep and surface Web information.

Summary

Specific vertical market services are already evolving to partially address the deep web challenges. These will likely need to be supplemented with a persistent query system customizable by the user that would set the queries, search sites, filters, and schedules for repeated queries.

Use search directories, which offer hand-picked information chosen from the surface Web, to meet popular search needs.

Use search engines for more robust surface-level searches and content-aggregation vertical "infohubs" for deep Web information to provide answers where comprehensiveness and quality are imperative.
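The persistent query system mentioned above (user-set queries, search sites, filters, and schedules for repeated queries) could be modelled roughly as follows. This is a hypothetical sketch: the `PersistentQuery` class, its fields, and the `fake_search` stand-in are all invented for illustration and do not describe an existing product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PersistentQuery:
    """One saved query: what to ask, where, how often, and what to filter out."""
    query: str
    sites: list
    interval: timedelta
    exclude_terms: list = field(default_factory=list)
    last_run: datetime = None

    def is_due(self, now):
        # Due if it has never run or the schedule interval has elapsed.
        return self.last_run is None or now - self.last_run >= self.interval

    def keep(self, title):
        # Filter: drop any result containing an excluded term.
        lowered = title.lower()
        return not any(term.lower() in lowered for term in self.exclude_terms)

    def run(self, now, search):
        """`search(site, query)` returns a list of titles; supplied by the caller."""
        self.last_run = now
        results = []
        for site in self.sites:
            results.extend(t for t in search(site, self.query) if self.keep(t))
        return results


def fake_search(site, query):
    # Hypothetical stand-in for querying one deep Web site.
    return [f"{query} report from {site}", f"{query} advertisement from {site}"]


pq = PersistentQuery("SEC filings", ["site-a", "site-b"],
                     timedelta(days=1), ["advertisement"])
t0 = datetime(2007, 4, 24, 9, 0)
first = pq.run(t0, fake_search) if pq.is_due(t0) else []
due_after_hour = pq.is_due(t0 + timedelta(hours=1))
due_after_day = pq.is_due(t0 + timedelta(days=1))
print(first, due_after_hour, due_after_day)
```

Keeping the search function as a parameter lets the same saved query be pointed at different gateways without changing the scheduling or filtering logic.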

References

1. Wikipedia, the free encyclopedia, “Deep Web,” 24 April 2007. http://en.wikipedia.org/wiki/Hidden_web

2. Wendy Boswell, “The Invisible Web,” 21 April 2007. http://websearch.about.com/od/invisibleweb/a/invisible_web.htm

3. Chris Sherman, “The Invisible Web,” 20 April 2007. http://www.freepint.co.uk/issues/080600.htm#feature

4. Joe Barker, “Invisible or Deep Web,” 9 March 2007. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

5. Michael K. Bergman, “The Deep Web: Surfacing Hidden Value,” September 24, 2001. http://www.press.umich.edu/jep/07-01/bergman.html

6. Laura Cohen, “The Deep Web,” 22 November 2006. http://www.internettutorials.net/deepweb.html

7. Marcus P. Zillman, “Deep Web Research,” April 23, 2007. http://deepwebresearch.blogspot.com/

8. Paul Bruemmer, “Indexing Deep Web Content,” March 27, 2002. http://www.searchengineguide.com/wi/2002/0327_wi2.html

9. Danny Sullivan, “Invisible Web Gets Deeper,” August 2, 2000. http://searchenginewatch.com/showPage.html?page=2162871

10. Chris Sherman, “Search for the invisible web,” September 6, 2001. http://technology.guardian.co.uk/online/story/0,3605,547140,00.html

11. Greg Linden, “Deep Web Strategy,” March 2007. http://www.semantic-web.at/10.57.1089.press.greg-linden-on-google-s-deep-web-strategy.htm

12. Alex Wright, “In search of the deep Web,” 9 March 2004. http://archive.salon.com/tech/feature/2004/03/09/deep_web/index_np.html

13. Danny Sullivan, “‘Invisible Web’ Revealed,” June 11, 1999. http://searchenginewatch.com/showPage.html?page=2167321

14. Michael Cross, “The hidden potential of the web,” April 21, 2004. http://society.guardian.co.uk/e-public/story/0,13927,1195901,00.html

Thank You !!!

Manoj Ravuru
