Abe Lederman, President and CTO Deep Web Technologies, Inc. ScienceEducation.gov Meeting National...
![Page 1: Abe Lederman, President and CTO Deep Web Technologies, Inc. ScienceEducation.gov Meeting National Academy of Sciences, March 18, 2009 A Look at the Technology.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649edf5503460f94befade/html5/thumbnails/1.jpg)
Abe Lederman, President and CTO
Deep Web Technologies, Inc.
ScienceEducation.gov Meeting
National Academy of Sciences, March 18, 2009
A Look at the Technology
Under the Hood
Content Integration Technologies for ScienceEducation.gov
• Crawling and Indexing (Part of Science.gov, E-Print Network)
• Federated Search (Science.gov, WorldWideScience.org)
ScienceEducation.gov needs to integrate content from a variety of websites and databases, which requires custom tools that other search engines are unable to provide.
Drawing on the Experience of the E-Print Network
Gateway to 30,000 websites and databases worldwide, containing over 5 million e-prints in basic and applied sciences.
Drawing on the Experience of the E-Print Network
• Initially developed in 2001
• Crawls and indexes 30,000 websites
• Uses sophisticated filters to ensure that only quality e-prints are included in the Network
• Contains full-text index of over 1.5 million e-prints
• Uses an Admin Tool to manage websites in the E-Print Network
What is Federated Search?
Federated Search is an application or service that allows a user to submit a
search in parallel to multiple, distributed information sources
and retrieve aggregated, ranked and de-duped results.
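The definition above can be illustrated with a minimal sketch. It is not the Science.gov or Deep Web Technologies implementation: the two source functions, the `Result` type, and the assumption that relevance scores are comparable across sources are all hypothetical and exist only to show the parallel query, aggregation, de-duplication, and ranking steps.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    url: str
    title: str
    score: float  # assumed to be comparable across sources
    source: str


# Hypothetical connectors standing in for real information sources.
def search_source_a(query: str) -> list[Result]:
    return [
        Result("http://a.example/1", "Ocean threats", 0.9, "A"),
        Result("http://shared.example/x", "Oil spills", 0.5, "A"),
    ]


def search_source_b(query: str) -> list[Result]:
    return [
        Result("http://shared.example/x", "Oil spills", 0.7, "B"),
        Result("http://b.example/2", "Sewage discharge", 0.6, "B"),
    ]


def federated_search(query, sources):
    # Submit the same query to every source in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        batches = list(pool.map(lambda search: search(query), sources))

    # Aggregate and de-dupe by URL, keeping the highest-scoring copy.
    best = {}
    for batch in batches:
        for result in batch:
            current = best.get(result.url)
            if current is None or result.score > current.score:
                best[result.url] = result

    # Rank the merged results by score, highest first.
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


results = federated_search("oil", [search_source_a, search_source_b])
for r in results:
    print(f"{r.score:.1f}  {r.source}  {r.url}")
```

De-duplicating on the exact URL is the simplest possible choice; a production system would likely also compare titles or content.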
In Other Words… One Search, Many Sources
[Diagram: a single Search connected to DOD, EPA, NASA, FDA, NIH, DOE, NSF, and Other Agencies]
Assembling the ScienceEducation.gov Search Engine - Part I
[Diagram: Education Experts, Assemble Starting URLs]
Assembling the ScienceEducation.gov Search Engine - Part II
• Starting URLs
• Crawl Websites
• Filter Bad URLs and Remove Duplicates
• Build Index
• Assign Learning Levels
• ScienceEducation.gov Index
Challenges Ahead
• Determining what sites to crawl
• Filtering undesirable URLs
• Assigning appropriate learning level to content
• Categorizing content
To Crawl or Not To Crawl?
[Diagram: groups of pages labeled "Would miss these", "Don’t crawl these pages", and "Will crawl these"]
Filtering Undesirable URLs
[Diagram: All Crawled URLs → Filter → Good URLs]
Filter list: Calendar, Contact, Feedback, Housing, ..., Registration, Survey
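A minimal sketch of such a filter follows. It assumes the filter works by matching terms against the URL path, which the slide does not state; the term list contains only the entries visible on the slide (the elided ones are not filled in), and the `example.gov` URLs are hypothetical.

```python
from urllib.parse import urlparse

# Only the terms visible on the slide; the slide elides the rest of the list.
UNDESIRABLE_TERMS = (
    "calendar", "contact", "feedback", "housing", "registration", "survey",
)


def is_undesirable(url, terms=UNDESIRABLE_TERMS):
    # Assumption: a URL is undesirable if its path contains any listed term.
    path = urlparse(url).path.lower()
    return any(term in path for term in terms)


def filter_urls(urls):
    # Keep only the URLs that pass the filter.
    return [url for url in urls if not is_undesirable(url)]


crawled = [
    "http://example.gov/lessons/volcano.html",
    "http://example.gov/contact.html",
    "http://example.gov/events/Registration/",
    "http://example.gov/survey?id=3",
]
good_urls = filter_urls(crawled)
print(good_urls)
```

Substring matching is deliberately crude: it would also drop a legitimate lesson whose path happens to contain one of the terms, which is part of why filtering is listed among the challenges.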
Removing Duplicate Web Pages
URL: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/education_threats.html
DUP: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/ocean_planet_book_threats.html
TITLE: Ocean Planet: Threats
SNIPPET: Threats to the health of the oceans Oil spills account for only about five percent of the oil entering the oceans The Coast Guard estimates that for United States waters sewage treatment plants discharge twice as much oil each year as tanker spills Each year industrial household cleaning gardening and automotive products pollute water About 65 000 chemicals are used commercially in the United States today with about 1 000 new ones added each year Only about 300 have been extensively tested for toxicity It is estimated that medical waste that washed up onto Long Island and New Jersey beaches in the summer of 1988 cost as much as 3 billion in lost revenue from tourism and recreation.
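One way to catch a pair like the one above is to fingerprint each page's text and keep only the first URL seen for each fingerprint. The slide does not say how duplicates were actually detected, so this is a sketch under that assumption; it catches only pages whose text is identical after whitespace and case normalization, and the page text used here is an abbreviated stand-in.

```python
import hashlib
import re


def fingerprint(text):
    # Collapse whitespace and lowercase so trivial differences do not matter.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(pages):
    """pages: iterable of (url, text) pairs.

    Returns (kept_urls, duplicates), where duplicates maps each
    duplicate URL to the URL that was kept in its place.
    """
    first_seen = {}
    kept = []
    duplicates = {}
    for url, text in pages:
        key = fingerprint(text)
        if key in first_seen:
            duplicates[url] = first_seen[key]
        else:
            first_seen[key] = url
            kept.append(url)
    return kept, duplicates


base = "http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/"
pages = [
    (base + "education_threats.html", "Threats to the health of the oceans"),
    (base + "ocean_planet_book_threats.html", "Threats to the  health of the oceans\n"),
]
kept, duplicates = remove_duplicates(pages)
print(kept)
print(duplicates)
```

Near-duplicates (the same article with a different header or footer) would need a fuzzier comparison than an exact hash.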
Learning Level Stratification
Categorizing Content
• Audience: Student or Teacher
• Grade Level: K-3, 4-6, 7-9, 10-12, College
• Content Type: Interactive Activities, Lesson Plans, Reference Materials, Science Fair Projects, Videos
• Subject Area: Chemistry, Computer Science, Energy, Life Sciences, Mathematics, Physics
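The four facets above could be represented as a small validated record. This is only an illustrative sketch: the field names and the `Categorization` class are hypothetical, and it assumes one value per facet, whereas a real resource might belong to several grade levels or subject areas.

```python
from dataclasses import dataclass

# Allowed values, copied from the slide.
FACETS = {
    "audience": {"Student", "Teacher"},
    "grade_level": {"K-3", "4-6", "7-9", "10-12", "College"},
    "content_type": {
        "Interactive Activities", "Lesson Plans", "Reference Materials",
        "Science Fair Projects", "Videos",
    },
    "subject_area": {
        "Chemistry", "Computer Science", "Energy", "Life Sciences",
        "Mathematics", "Physics",
    },
}


@dataclass(frozen=True)
class Categorization:
    audience: str
    grade_level: str
    content_type: str
    subject_area: str

    def __post_init__(self):
        # Reject any value that is not in the facet's allowed set.
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a valid {facet}")


item = Categorization("Teacher", "7-9", "Lesson Plans", "Physics")
print(item)
```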
A Look at the Technology
Under the Hood
Thank you!
Abe Lederman