Semantic Search engines (research.ivolasek.com/_media/other-texts/existing_solutions.pdf)

Semantic Search engines

Existing Solutions

Linked Data

How can I get my dataset into the diagram?

• There must be resolvable http:// (or https://) URIs.

• They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples).

• The dataset must contain at least 1000 triples. (Hence, your FOAF file most likely does not qualify.)

How can I get my dataset into the diagram?

• The dataset must be connected via RDF links to a dataset that is already in the diagram. This means, either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links.

• Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
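The two numeric criteria above (at least 1000 triples, at least 50 RDF links) can be checked mechanically. The following is a minimal sketch, not the procedure used by the diagram maintainers. The function names, the prefix-based notion of "belongs to a dataset", and the decision to count only outgoing links are assumptions made for illustration. Resolvability and access (crawling, dump, SPARQL endpoint) need network checks and are not covered.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

MIN_TRIPLES = 1000  # threshold stated on the slide
MIN_LINKS = 50      # threshold stated on the slide


def is_http_uri(term: str) -> bool:
    """True for http:// or https:// URIs (the only schemes the criteria allow)."""
    return term.startswith(("http://", "https://"))


def count_links(triples: Iterable[Triple], own_prefix: str,
                known_prefixes: List[str]) -> int:
    """Count triples whose subject is ours and whose object is a URI
    from a dataset that is already in the diagram (outgoing links only)."""
    links = 0
    for s, _p, o in triples:
        if (s.startswith(own_prefix) and is_http_uri(o)
                and any(o.startswith(k) for k in known_prefixes)):
            links += 1
    return links


def qualifies(triples: Iterable[Triple], own_prefix: str,
              known_prefixes: List[str]) -> bool:
    """Apply the size and link-count criteria to an in-memory dataset."""
    triples = list(triples)
    return (len(triples) >= MIN_TRIPLES
            and count_links(triples, own_prefix, known_prefixes) >= MIN_LINKS)
```

The criteria also accept links pointing the other way (another dataset using your URIs); checking that would require the other dataset's triples.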

Why Linked Data?

• Easier search for structured documents (use of URIs in RDF triples is similar to the use of URLs in classical links)

• Easier ontology matching – Central authorities providing URIs for other data sources (e.g. DBpedia)

Semantic Search Engines

Document-Centric Semantic Search Engines

Watson

• http://kmi-web05.open.ac.uk/WatsonWUI/

• Parsing: Jena

• Repository: Jena?

• Reasoning: NO

• Keyword based search, SPARQL endpoint

Watson - Schema

Swoogle

• http://swoogle.umbc.edu/

• Crawler: 3 Custom Crawlers

– Google Crawler (.rdf, .owl files)

– Focused Crawler

– Extracted URIs crawler

• Repository: Jena

• Index: Lucene

• Keyword based search

Swoogle Architecture

Data Analysis

• Classification of Semantic Web Documents

– Databases – make assertions about individuals

– Ontologies – define new terms

• Compute rank of SWDs

• Search ordering: Swoogle PR – analogous to Google PageRank (GPR)
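To make the PageRank analogy concrete, here is a minimal sketch of plain PageRank by power iteration over a graph of Semantic Web documents. It is the textbook algorithm, not Swoogle's actual ranking. The document names, the damping factor of 0.85, and the even redistribution of rank from documents without outgoing links are assumptions.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain PageRank; links maps a document to the documents it references."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling document: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: two data documents reference one ontology.
    graph = {"a.rdf": ["onto.owl"], "b.rdf": ["onto.owl"], "onto.owl": []}
    for doc, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(doc, round(score, 3))
```

In this toy graph the ontology referenced by both data documents receives the highest score, which is the intuition behind ranking widely used SWDs first.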

Entity-Centric Semantic Search Engines

Falcons

• http://iws.seu.edu.cn/services/falcons/

• Reasoning/Ontology matching: Falcon-ao

• Search ordering: TF-IDF in combination with popularity of ontologies

• Classes recommendation: Ordering according to their popularity

• Keyword search: Based on the indexed texts extracted from Virtual Documents
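The combination of TF-IDF with popularity can be sketched as follows. This is an illustration only: the exact formula Falcons uses is not given on the slide, so the multiplicative combination with a logarithmic popularity boost, the function names, and the sample data are all assumptions. Each virtual document is represented as a list of tokens.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(query: List[str], docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each virtual document (a list of tokens) against the query terms."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term]:
                # Smoothed IDF so that the weight is always positive.
                score += (tf[term] / len(tokens)) * math.log(1 + n / doc_freq[term])
        scores[doc_id] = score
    return scores


def rank(query: List[str], docs: Dict[str, List[str]],
         popularity: Dict[str, int]) -> List[str]:
    """Order matching entities by TF-IDF multiplied by a popularity boost.

    The boost log(2 + popularity) is an assumed formula, not Falcons' own.
    """
    base = tf_idf_scores(query, docs)
    combined = {d: s * math.log(2 + popularity.get(d, 0))
                for d, s in base.items() if s > 0}
    return sorted(combined, key=combined.get, reverse=True)


if __name__ == "__main__":
    docs = {
        "foaf:Person": ["person", "agent"],
        "ex:Person": ["person", "agent"],
        "ex:Car": ["car"],
    }
    popularity = {"foaf:Person": 1000, "ex:Person": 2}  # hypothetical usage counts
    print(rank(["person"], docs, popularity))
```

With identical text, the more widely used class is listed first, which matches the slide's point that popularity breaks ties between textually similar candidates.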

Falcons Screenshot

Falcon-ao

• Linguistic Matching for Ontologies

– Virtual Documents (names, labels, comments)

– Levenshtein edit distance

– Vector Space Model + cosine similarity of VDs

• Graph Matching for Ontologies

– Similarity of two entities comes from the accumulation of similarities of involved statements

– Similarity of two statements comes from the accumulation of similarities of involved entities
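The two linguistic measures named above can be sketched in a few lines. This is a minimal illustration, not Falcon-AO's implementation: the normalisation of the edit distance into a similarity and the use of raw term counts as vector weights are assumptions (a real matcher would typically weight terms).

```python
import math
from collections import Counter
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical names."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(doc_a: List[str], doc_b: List[str]) -> float:
    """Cosine of the angle between two term-count vectors (virtual documents)."""
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(name_similarity("Author", "Authors"))
    print(cosine_similarity(["person", "name", "agent"], ["person", "human", "name"]))
```

Edit distance compares entity names character by character, while the cosine measure compares the whole virtual document (names, labels, comments) of one entity with that of another.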

SWSE

• http://swse.deri.org/

• Crawler: MultiCrawler

• Repository: YARS2 – storing quadruples (subject, predicate, object, context)

• Ontology matching: URIs, IFPs

• Reasoning: Future work (Scalable Authoritative OWL Reasoner - SAOR)

• Search ordering: ReConRank (Page Rank for Linked Data)

• Keyword based search: Lucene

SWSE Architecture

• Consolidate – find synonymous identifiers

• Rank – links-based analysis, scores assignment
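The consolidation step can be illustrated with inverse functional properties (IFPs): two identifiers that share the same value for an IFP denote the same entity. The sketch below merges such identifiers with a union-find structure over quadruples. It is not SWSE's code. The function name, the prefixed-name strings, and the choice of the lexicographically smallest identifier as the canonical one are assumptions; foaf:mbox and foaf:homepage serve as example IFPs.

```python
from typing import Dict, Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)

IFPS = {"foaf:mbox", "foaf:homepage"}  # example inverse functional properties


def consolidate(quads: Iterable[Quad], ifps: Set[str] = IFPS) -> Dict[str, str]:
    """Map every subject to a canonical identifier for its equivalence class."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smallest identifier as canonical.
            parent[max(ra, rb)] = min(ra, rb)

    first_subject_for = {}
    for s, p, o, _context in quads:
        find(s)  # register the subject even if it has no IFP value
        if p in ifps:
            key = (p, o)
            if key in first_subject_for:
                union(s, first_subject_for[key])
            else:
                first_subject_for[key] = s
    return {x: find(x) for x in parent}


if __name__ == "__main__":
    quads = [
        ("ex:a", "foaf:mbox", "mailto:x@example.org", "doc1"),
        ("ex:b", "foaf:mbox", "mailto:x@example.org", "doc2"),
        ("ex:c", "foaf:name", "X", "doc3"),
    ]
    print(consolidate(quads))
```

Here ex:a and ex:b are merged because they share a mailbox across two source documents, while ex:c stays on its own. The context element is carried along but not used by this sketch.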

Sindice.com

• http://www.sindice.com

• Crawler: SindiceBot

– robots.rdf – semantic site maps

– crawling pingthesemanticweb.com

• 3 Indexes:

– URI index

– IFP index

– Keyword index
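The three indexes can be pictured as three lookup tables that all point back to source documents. The following sketch builds them in memory. It is not Sindice's implementation: the function name, the treatment of any object that is not an http(s) URI as a literal, and the sample data are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_indexes(documents: Dict[str, List[Triple]], ifps: Iterable[str]):
    """Return (uri_index, ifp_index, keyword_index), each mapping to document URLs."""
    ifps = set(ifps)
    uri_index = defaultdict(set)      # URI -> documents mentioning it
    ifp_index = defaultdict(set)      # (IFP, value) -> documents asserting it
    keyword_index = defaultdict(set)  # word -> documents with a literal containing it
    for doc_url, triples in documents.items():
        for s, p, o in triples:
            for term in (s, o):
                if term.startswith(("http://", "https://")):
                    uri_index[term].add(doc_url)
            if p in ifps:
                ifp_index[(p, o)].add(doc_url)
            elif not o.startswith(("http://", "https://")):
                # Assumption: non-URI objects are literals; index their words.
                for word in re.findall(r"\w+", o.lower()):
                    keyword_index[word].add(doc_url)
    return uri_index, ifp_index, keyword_index


if __name__ == "__main__":
    docs = {
        "http://ex.org/doc1": [
            ("http://ex.org/alice", "foaf:mbox", "mailto:alice@example.org"),
            ("http://ex.org/alice", "foaf:name", "Alice Smith"),
        ],
    }
    uri_idx, ifp_idx, kw_idx = build_indexes(docs, ["foaf:mbox"])
    print(dict(uri_idx), dict(ifp_idx), dict(kw_idx))
```

A lookup by URI, by IFP value, or by keyword then returns the documents that mention the resource, which fits Sindice's role as a document locator.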

Sindice Architecture

• Crawler:

– Apache Nutch

– Hadoop

– MapReduce

• Reasoner: OWLIM Reasoner

• Keyword based search: Solr

• http://www.sig.ma

Sindice Architecture

Basic structure

Structured data crawler

Unstructured data crawler

Documents repository

Data extractor

Indexer

Entity repository

Other apps using API

Searcher

Sorter

Basic structure

Crawler

Documents repository

(Cache)

Data extractor (Parser)

Indexer

Entity repository

Other apps using API

Searcher

Sorter

Ping

Scheduler

Basic structure

Crawler

Indexer

SeRQL

Searcher

Sorter

OWLIM

Ping

Scheduler

Flat Files?

Sesame

Crawling Problems

• Locating resources (not such a big problem nowadays)

• Re-Crawl Timing

• Live data sources

• Automatically generated data sources

Storage Problems

• Ontology matching – structural and linguistic methods are not 100 % accurate

• Reasoning

– Tradeoff quality vs. scalability

– Data sources credibility (spamming)

• Indexing – tradeoff quality vs. scalability

– Keyword search vs. SPARQL

Searching Problems

• Extent of some queries

  SELECT ?s ?o
  WHERE { ?s rdf:type ?o }

– Stop words

– Top-k results

• Results ordering

– Application of Page Rank – prone to spamming

– Resources credibility
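Two of the mitigations listed above, stop words and top-k results, can be sketched briefly. The idea is to flag predicates that are so frequent that a query on them alone (such as the rdf:type query above) would return a large part of the store, and to return only the k best-scored results instead of the full list. The function names and the 20 % frequency threshold are assumptions for illustration.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def stop_predicates(triples: Iterable[Triple], max_share: float = 0.2) -> Set[str]:
    """Predicates that account for more than max_share of all triples.

    They play the role of stop words: too common to be selective on their own.
    """
    counts = Counter(p for _s, p, _o in triples)
    total = sum(counts.values())
    return {p for p, c in counts.items() if total and c / total > max_share}


def top_k(scored: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k best (result, score) pairs without sorting everything."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    triples = [
        ("ex:a", "rdf:type", "ex:Person"),
        ("ex:b", "rdf:type", "ex:Person"),
        ("ex:c", "rdf:type", "ex:Car"),
        ("ex:a", "foaf:name", "A"),
    ]
    print(stop_predicates(triples, max_share=0.5))
    print(top_k([("a", 0.1), ("b", 0.9), ("c", 0.5)], k=2))
```

A query planner could refuse or cap queries whose only constraint is a flagged predicate, and always answer through top_k.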

Semantic web Crawler

• Slug

– Simple – starts from a given set of documents and follows extracted URIs

– Bugs

• MultiCrawler

– No downloadable version

– Description in a paper

• Apache Nutch based solution
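The simple strategy described for Slug, starting from a given set of documents and following extracted URIs, amounts to a breadth-first traversal. The sketch below shows that idea only and is not Slug's code. The function names are hypothetical, and fetching is injected as a callable so the example runs without network access; a real crawler would download and parse RDF there and honour robots rules.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def crawl(seeds: Iterable[str],
          fetch_triples: Callable[[str], List[Triple]],
          max_documents: int = 100) -> List[str]:
    """Breadth-first crawl: visit seeds, then documents behind extracted URIs."""
    queue = deque(seeds)
    visited = set()
    order = []
    while queue and len(order) < max_documents:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for s, _p, o in fetch_triples(url):
            for term in (s, o):
                if term.startswith(("http://", "https://")):
                    # Drop the fragment: "#me" lives in the same document.
                    document = term.split("#", 1)[0]
                    if document not in visited:
                        queue.append(document)
    return order


if __name__ == "__main__":
    # Hypothetical in-memory "web" standing in for real HTTP fetches.
    web = {
        "http://a/": [("http://a/#me", "foaf:knows", "http://b/#you")],
        "http://b/": [("http://b/#you", "foaf:knows", "http://c/")],
        "http://c/": [],
    }
    print(crawl(["http://a/"], lambda url: web.get(url, [])))
```

The max_documents cap is a guard against unbounded crawls, for example over automatically generated data sources.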

Java Triplestores I

• YARS2 – not developed any more (http://sw.deri.org/2004/06/yars/)

• Jena (http://jena.sourceforge.net/)

– TDB storage (access via API)

– SDB storage (SPARQL endpoint)

• Sesame (http://www.openrdf.org/)

– Sesame Server

– SeRQL

• Virtuoso (http://virtuoso.openlinksw.com)

– Unified storage engine (XML, SQL, RDF, Free Text)

– Berlin Benchmark

Java Triplestores II

• JRDF

– 2008 triplestore across Hadoop

– Currently no support for OWL

• Mulgara

– SPARQL, TQL

– Connection API