1

CSE 8337 – Spring 2007 Project 2 – Deep Web Manoj Ravuru - Student ID 22508269

Deep (Invisible) Web

- Manoj Ravuru

2


Outline

Web and Search Engines

Types of Web

What is Deep Web? How big is it? Is it important?

What makes it Deep and what is in it?

Deep Web content classification and categories

Crawling and Indexing Deep Web

Deep Web Statistics

3


Outline

Deep Web Quality

How to find and use Deep Web?

Deep Web Gateways

Deep Web Issues

Summary

References

4


Web and Search Engines

In 1991, the Web was created by Tim Berners-Lee, a researcher at the CERN high-energy physics laboratory in Switzerland.

Berners-Lee designed the Web to be platform-independent.

To enable this cross-platform capability, Berners-Lee created HTML (Hypertext Markup Language), a simplified version of SGML (Standard Generalized Markup Language).

The simplicity of the markup format made it easier to build search engines, which users can use to search for and retrieve HTML documents of interest on the Web.

This Shallow Web, also known as the Surface Web or Static Web, is the collection of Web sites indexed by automated search engines.

A search engine's Web crawler follows URL links on the Web, indexes the words on each HTML page it reaches, and stores them in huge databases that can be searched on demand.
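The crawl-and-index loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the three-page site in PAGES is a hypothetical stand-in for the Web, and the names LinkAndTextParser and crawl_and_index are made up for this example.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical three-page site standing in for the Web (URL -> HTML).
PAGES = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a> welcome home',
    "/a.html": '<a href="/index.html">home</a> deep web search',
    "/b.html": "surface web pages",
}


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl_and_index(start):
    """Follow links breadth-first and build a word -> set-of-URLs index."""
    index = defaultdict(set)
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(PAGES[url])
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl_and_index("/index.html")
print(sorted(index["web"]))  # ['/a.html', '/b.html']
```

A page that no other page links to would never enter the queue, which is one reason content ends up outside the index.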

5


Types of Web

Static Web

Dynamic Web

Opaque Web

Private Web

Proprietary Web

Pay per click Web

6


What is Deep Web?

Web pages that draw on vast information repositories that search engines cannot or will not index.

Mainly refers to the rich content that search engines don't have direct access to, such as databases.

Deep Web pages are dynamically created as the result of a specific search.

The Deep Web is also called the Invisible Web.

The term "invisible" in "Invisible Web" is actually a misnomer.

Deep Web information is available via the Web but isn't accessible to search engines.
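A small sketch of a dynamically created page, assuming a hypothetical in-memory chemicals table and a made-up search_page function. The HTML below exists only after a visitor submits a search term, so a crawler that merely follows links has no fixed page to fetch.

```python
import sqlite3

# Hypothetical database behind a search form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chemicals (name TEXT, hazard TEXT)")
conn.executemany(
    "INSERT INTO chemicals VALUES (?, ?)",
    [("benzene", "carcinogen"), ("toluene", "irritant")],
)


def search_page(term):
    """Build a result page on demand for one submitted search term."""
    rows = conn.execute(
        "SELECT name, hazard FROM chemicals WHERE name LIKE ?",
        (f"%{term}%",),
    ).fetchall()
    items = "".join(f"<li>{name}: {hazard}</li>" for name, hazard in rows)
    return f"<html><body><ul>{items}</ul></body></html>"


print(search_page("benz"))
```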

7


How big is the Invisible Web?

Cannot be determined accurately.

In a word, it's humongous.

The Deep Web is approximately 500 times bigger than the searchable or surface Web, and may be bigger still.

Considering that Google alone covers around 8 billion pages, that's just mind-boggling.

If major search engines together index only 20% of the Web, then they miss 80% of the content.

Deep Web includes images, sounds, presentations and many other types of media not visible to search engines.

8


Is the Deep Web Important?

Think of the Web as a vast library: finding what is needed requires more digging.

Because search engines search only a very small portion of the Web, the Invisible Web is a very tempting resource. There is far more information out there than one could imagine.

Much of the Deep Web is quality content held in documents within searchable databases on the Web, which conventional (well-known and widely used) search engines cannot access.

As a result, businesses, researchers, and consumers may not get the quality information they need.

Search engines themselves have problems providing relevant content, at least for somewhat complicated or obscure queries.

9


Why the name “Invisible”?

When spiders crawling the Web run into a page from the Invisible Web, they don't quite know what to do with it.

A spider can record the address of a page it couldn't access, but it can't tell what information the page contains.

The main causes are technical barriers, e.g. databases, password-protected pages, and script-based pages.
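The behaviour described above, where a spider keeps the address but learns nothing about the content, might look like the sketch below. The RESPONSES table and the visit function are hypothetical; a real spider would get these statuses from live HTTP requests.

```python
# Hypothetical responses a spider might receive: URL -> (HTTP status, body).
RESPONSES = {
    "/home": (200, "welcome to the catalog"),
    "/members": (401, ""),      # password-protected page
    "/search?q=": (200, ""),    # empty search form; results need a query
}


def visit(urls):
    """Index readable pages; for the rest, keep only the address."""
    indexed = {}
    address_only = []
    for url in urls:
        status, body = RESPONSES[url]
        if status == 200 and body.strip():
            indexed[url] = body.split()
        else:
            address_only.append(url)
    return indexed, address_only


indexed, address_only = visit(["/home", "/members", "/search?q="])
print(address_only)  # ['/members', '/search?q=']
```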

10


What makes it Deep?

Proprietary sites

Sites requiring a registration

Sites with scripts

Dynamic sites

Ephemeral sites

Sites blocked by local webmasters

Sites blocked by search engine policy

Sites with special formats

Searchable databases
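As one concrete case from the list above, a webmaster can block parts of a site with a robots.txt file, and a well-behaved crawler checks it before fetching. The sketch uses Python's standard urllib.robotparser; the robots.txt content, the bot name ExampleBot, and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a site's webmaster.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("ExampleBot", "http://example.com/public/page.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
print(allowed, blocked)  # True False
```

Everything under /private/ stays out of the index even though it is reachable in a browser.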

11


Other factors…

Search engines exclude some pages by policy.

Spiders/crawlers do not report what they can't index.

The sheer task of actually finding all the pages on the Web.

12


Deep Web resource classification

Dynamic content - Dynamic pages in response to a submitted query

Unlinked content - Pages which are not linked to by other pages

Limited access content - Sites that require registration or limit access to their pages

Scripted content - Pages that are only accessible through links produced by JavaScript or Flash, which require special handling

Non-text content - Multimedia (image) files, Usenet archives, and documents in non-HTML file formats such as PDF and DOC
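The five classes above can be expressed as a simple rule-based classifier. This is only a sketch: the Resource fields, the order in which the rules are checked, and the fallback label "surface content" are assumptions, not part of the classification itself.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical description of one Web resource seen by a crawler."""
    url: str
    content_type: str = "text/html"
    inbound_links: int = 1
    needs_login: bool = False
    generated_by_query: bool = False
    reached_via_script: bool = False


def classify(resource):
    """Map a resource to one of the Deep Web classes (rule order is assumed)."""
    if resource.needs_login:
        return "limited access content"
    if resource.generated_by_query:
        return "dynamic content"
    if resource.reached_via_script:
        return "scripted content"
    if resource.content_type != "text/html":
        return "non-text content"
    if resource.inbound_links == 0:
        return "unlinked content"
    return "surface content"


print(classify(Resource("/paper.pdf", content_type="application/pdf")))
```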

13


Deep web content categories

14


Crawling & Indexing the Deep Web

Major search engines such as Google, AltaVista, and Inktomi do index dynamic content through the following programs:

Paid partnership programs

Trusted feed services

Premium inclusion programs

Quigo's QUIBOT remotely crawls through pages from the deep Web, enabling it to index a large portion of the deep Web and making this content available to users searching on Quigo and partner portals.

Quigo's DeepWebGateway enables search engines to index deep Web content that they do not access directly. This technology also solves other problems related to deep Web crawling and indexing, such as spider traps and personalization.
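To make the spider-trap problem concrete: a trap is a site that generates an endless chain of links, such as a calendar with a "next" link on every page. One common guard, shown here as a sketch, is a depth limit plus a set of visited URLs. This is not Quigo's technique, which the slides do not describe; the links function and the /calendar/ URLs are hypothetical.

```python
from collections import deque


def links(url):
    """Hypothetical site: every calendar page links to the next one, forever."""
    if url.startswith("/calendar/"):
        day = int(url.rsplit("/", 1)[1])
        return [f"/calendar/{day + 1}"]
    if url == "/":
        return ["/calendar/0", "/about"]
    return []


def crawl(start, max_depth=3):
    """Breadth-first crawl that stops expanding pages at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links any deeper
        for nxt in links(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


print(sorted(crawl("/")))
```

Without the depth check the loop would never terminate on this site.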

15


Deep Web Statistics

Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.

The deep Web contains 7,500 terabytes of information compared to 19 terabytes of information in the surface Web.

The deep Web contains nearly 550 billion individual documents compared to the 2.5 billion of the surface Web.

Ninety-five percent of the deep web contains publicly accessible information that is not subject to fees or subscriptions.

More than 200,000 deep Web sites presently exist.

60 of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web 40 times.

On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites.
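A quick arithmetic check ties several of these figures together. It uses only numbers stated on this slide (7,500 terabytes, 750 terabytes for the 60 largest sites, and the factor of 40); the variable names are made up for this sketch.

```python
deep_web_tb = 7_500            # total deep Web size, in terabytes
top60_tb = 750                 # the 60 largest deep Web sites, in terabytes
top60_multiple_of_surface = 40 # those 60 sites are 40 times the surface Web

# Surface Web size implied by the 60-site comparison.
implied_surface_tb = top60_tb / top60_multiple_of_surface

# Deep-to-surface size ratio implied by that surface size.
deep_to_surface_ratio = deep_web_tb / implied_surface_tb

# Share of the deep Web held by the 60 largest sites.
top60_share_of_deep = top60_tb / deep_web_tb

print(implied_surface_tb)      # 18.75
print(deep_to_surface_ratio)   # 400.0
print(top60_share_of_deep)     # 0.1
```

The resulting ratio of 400 sits at the lower end of the "400 to 550 times" range, and the 60 largest sites account for about a tenth of the total.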

16


Deep Web Statistics (contd)

Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.

Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.

More than half of the deep Web content resides in topic-specific databases.

Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.

More than 95% of deep Web information is publicly available without restriction.

International Data Corporation predicts that the number of surface Web documents will grow from the current two billion or so to 13 billion within three years, a factor increase of 6.5 times. Deep Web growth should exceed this rate, perhaps increasing about nine-fold over the same period.

17


Deep Web Quality

The likelihood of obtaining quality results is about three times higher for the deep Web than for the surface Web.

Overall precision and recall would be higher due to the presence of highly relevant information for each subject area.

The degree of content overlap between deep Web sites is expected to be much less than for surface Web sites.

Observations from working with deep Web sources and data suggest there are important information categories where duplication does exist. Prominent among these are yellow/white pages, genealogical records, and public records with commercial potential such as SEC filings. On the other hand, there are entire categories of deep Web sites whose content appears uniquely valuable. These mostly fall within the categories of topical databases, publications, and internal site indices, which together account for about 80% of deep Web sites.

Duplication will be lower within the deep Web.
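For reference, precision and recall as used above are standard retrieval measures: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document IDs in this small worked example are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three documents actually relevant, two in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(precision)          # 0.5
print(round(recall, 3))   # 0.667
```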

18


Finding Deep Web

General web directories: www.completeplanet.com, www.thebighub.com

Deep Web search engines that send a single query to dozens of databases simultaneously: www.alltheweb.com, www.brightplanet.com

Specialized databases: www.nsdl.org, http://catalog.loc.gov

Use Google and other search engines to locate searchable databases.

Example queries for Google and Yahoo: "languages database" or "toxic chemicals database"


19


Deep Web search strategies to follow

Be aware that the Deep Web exists.

Use a general search engine for broad topic searching.

Use a searchable database for focused searches.

Register on special sites and use their archives.

Call the reference desk at a local college if in need of a proprietary Web site. Many college libraries subscribe to these services and provide free on-site searching.

Many libraries offer free remote online access to commercial and research databases for anyone with a library card.

20


Deep Web Gateways – Web Directories

Infomine [http://infomine.ucr.edu/] is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level.

It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.

Infomine is librarian-built. Librarians from the University of California, Wake Forest University, California State University, the University of Detroit - Mercy, and other universities and colleges have contributed to building Infomine.

21


Infomine Web Directory

22


Deep Web Gateways – Web Directories

Digital Librarian [http://www.digital-librarian.com/] is a librarian's choice of the best of the Web.

23


Deep Web Gateways – Search Engines

Turbo10 is a meta search engine that provides a universal interface to Deep Web search engines.

Turbo10 is designed to help users search deeper and browse faster.

Turbo10 has developed search technology since 2001. It connects Internet searchers to Deep Web search engines.
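The general idea of a meta search engine, one query fanned out to several databases at once with the results merged, can be sketched as follows. This is not Turbo10's implementation, which the slides do not describe; the in-memory DATABASES and the functions search_one and metasearch are hypothetical stand-ins for remote search forms.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote Deep Web databases.
DATABASES = {
    "library_catalog": ["Deep Web Research", "Web Crawling Basics"],
    "article_db": ["Surfacing the Deep Web", "Indexing Dynamic Pages"],
    "dictionary": ["web (noun)"],
}


def search_one(name, query):
    """Search a single database and tag each hit with its source."""
    q = query.lower()
    return [(name, title) for title in DATABASES[name] if q in title.lower()]


def metasearch(query):
    """Send the same query to every database concurrently and merge results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_one, name, query) for name in DATABASES]
        results = []
        for future in futures:
            results.extend(future.result())
    return results


print(metasearch("deep web"))
```

Results are collected in submission order, so the merged list is deterministic even though the searches run in parallel.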

24


Turbo10 Deep Web Search Engine

25


Deep Web Gateways – Search Engines

AlltheWeb [http://www.alltheweb.com/] describes itself as combining one of the largest and freshest indices with the most powerful search features, allowing anyone to find anything faster than with any other search engine.

AlltheWeb's index (provided by Yahoo!) includes billions of web pages, as well as tens of millions of PDF and MS Word® files. Yahoo! frequently scans the entire web to keep the content fresh and to eliminate broken links.

AlltheWeb offers a variety of specialized search tools and advanced search features, and supports searching in 36 different languages. Its image, audio, and video searches include hundreds of millions of multimedia files.

AlltheWeb provides the controls necessary to find the most relevant content through some of the most sophisticated advanced search features available.

26


AlltheWeb – Deep Web Search Engine

27


Deep Web Gateways – Specialized Databases

NSDL (National Science Digital Library - http://nsdl.org/) was established as an online library which directs users to exemplary resources for science, technology, engineering, and mathematics (STEM) education and research.

NSDL provides an organized point of access to STEM content that is aggregated from a variety of other digital libraries, NSF-funded projects, and NSDL-reviewed web sites.

NSDL also provides access to services and tools that enhance the use of this content in a variety of contexts.


28


NSDL – Specialized Database

29


Other notable Deep Web resources

Deep Query Manager (DQM) is BrightPlanet's <http://www.brightplanet.com/> search tool, designed to retrieve information from thousands of Deep Web databases and search engines at one time.

AlphaSearch <http://www.calvin.edu/library/searreso/internet/as/> is an extremely useful directory of "gateway" sites that collect and organize Web sites that focus on a particular subject.

The many databases that make up GPO Access <http://www.access.gpo.gov/>.

Telephone directory databases such as Anywho <http://www.anywho.com/>.


30


Deep Web Issues

Complete indexing of the Deep Web is impossible.

Deep Web content is dynamic and can change faster than the content of the static/surface Web.

There is no bright line that separates content sources on the Web. Users need to choose the database (Deep Web resource) of interest on their own.

The Deep Web phenomenon is not well known to the Internet-searching public.

The value of Deep Web content is incalculable.

31


Summary

World Wide Web

“Visible/Surface Web”

    Search Directories. Examples: Librarians Index to the Internet, Yahoo

    Search Engines. Examples: Google, Yahoo, Altavista

“Invisible/Deep Web”

    Specialized, searchable Databases (Fee-based or Free). Examples: Library Catalogs, digital library archives, Dictionaries, Encyclopedias, Article databases

32


Summary

Deep Web content is highly relevant to every information need, market, and domain.

The deep Web is the largest growing category of new information on the Internet.

Serious information seekers can no longer avoid the importance or quality of deep Web information.

Deep Web information is only a component of total information available. Searching must evolve to encompass the complete Web.

Directed query technology is the only means to integrate deep and surface Web information.

33


Summary

Specific vertical market services are already evolving to partially address the deep web challenges. These will likely need to be supplemented with a persistent query system customizable by the user that would set the queries, search sites, filters, and schedules for repeated queries.

Use search directories for hand-picked information chosen from the surface Web to meet popular search needs.

Use search engines for more robust surface-level searches and content-aggregation vertical "infohubs" for deep Web information to provide answers where comprehensiveness and quality are imperative.
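The persistent query system mentioned above bundles four user-set things: queries, search sites, filters, and schedules. One possible way to model that, purely as a sketch, is a small record type. The class name PersistentQuery, its fields, and the example values are all assumptions, not a description of any existing service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PersistentQuery:
    """Hypothetical user-defined query that is re-run on a schedule."""
    query: str
    sites: list
    filters: dict = field(default_factory=dict)
    interval: timedelta = timedelta(days=1)
    last_run: Optional[datetime] = None

    def is_due(self, now):
        """True if the query has never run or its interval has elapsed."""
        return self.last_run is None or now - self.last_run >= self.interval

    def mark_run(self, now):
        """Record that the query was executed at the given time."""
        self.last_run = now


saved = PersistentQuery(
    query="toxic chemicals",
    sites=["http://www.nsdl.org/", "http://catalog.loc.gov/"],
    filters={"language": "en"},
)
now = datetime(2007, 4, 24, 12, 0)
print(saved.is_due(now))  # True
saved.mark_run(now)
print(saved.is_due(now + timedelta(hours=6)))  # False
```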

34


References

1. Wikipedia, the free encyclopedia, “Deep Web”, 24 April 2007. http://en.wikipedia.org/wiki/Hidden_web

2. Wendy Boswell, “The Invisible Web”, 21 April 2007. http://websearch.about.com/od/invisibleweb/a/invisible_web.htm

3. Chris Sherman, “The Invisible Web”, 20 April 2007. http://www.freepint.co.uk/issues/080600.htm#feature

4. Joe Barker, “Invisible or Deep Web”, 9 March 2007. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

5. Michael K. Bergman, “The Deep Web: Surfacing Hidden Value”, 24 September 2001. http://www.press.umich.edu/jep/07-01/bergman.html

6. Laura Cohen, “The Deep Web”, 22 November 2006. http://www.internettutorials.net/deepweb.html

7. Marcus P. Zillman, “Deep Web Research”, 23 April 2007. http://deepwebresearch.blogspot.com/

8. Paul Bruemmer, “Indexing Deep Web Content”, 27 March 2002. http://www.searchengineguide.com/wi/2002/0327_wi2.html


35


References

9. Danny Sullivan, “Invisible Web Gets Deeper”, 2 August 2000. http://searchenginewatch.com/showPage.html?page=2162871

10. Chris Sherman, “Search for the invisible web”, 6 September 2001. http://technology.guardian.co.uk/online/story/0,3605,547140,00.html

11. Greg Linden, “Deep Web Strategy”, March 2007. http://www.semantic-web.at/10.57.1089.press.greg-linden-on-google-s-deep-web-strategy.htm

12. Alex Wright, “In search of the deep Web”, 9 March 2004. http://archive.salon.com/tech/feature/2004/03/09/deep_web/index_np.html

13. Danny Sullivan, “‘Invisible Web’ Revealed”, 11 June 1999. http://searchenginewatch.com/showPage.html?page=2167321

14. Michael Cross, “The hidden potential of the web”, 21 April 2004. http://society.guardian.co.uk/e-public/story/0,13927,1195901,00.html


36


Thank You !!!

Manoj Ravuru

([email protected])
