Semantic Search engines. Existing Solutions

Author: Prudence Hunt

Linked Data

How can I get my dataset into the diagram?
• There must be resolvable http:// (or https://) URIs.
• They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, NTriples).
• The dataset must contain at least 1000 triples. (Hence, your FOAF file most likely does not qualify.)

How can I get my dataset into the diagram? (continued)
• The dataset must be connected via RDF links to a dataset that is already in the diagram. This means, either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links.
• Access of the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
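The criteria above amount to a simple checklist. A minimal sketch, assuming a hypothetical `DatasetSummary` record whose field names are illustrative only; the two thresholds (1000 triples, 50 links) are the ones stated in the criteria.

```python
from dataclasses import dataclass

MIN_TRIPLES = 1000  # threshold stated in the criteria
MIN_LINKS = 50      # threshold stated in the criteria


@dataclass
class DatasetSummary:
    """Hypothetical summary record; the field names are illustrative only."""
    uris_resolve_to_rdf: bool  # http(s) URIs resolve to RDF data
    triple_count: int          # number of triples in the dataset
    links_to_diagram: int      # RDF links to or from a dataset already in the diagram
    bulk_access: bool          # crawlable, RDF dump, or SPARQL endpoint available


def qualifies(ds: DatasetSummary) -> bool:
    """Return True when every listed criterion is met."""
    return (
        ds.uris_resolve_to_rdf
        and ds.triple_count >= MIN_TRIPLES
        and ds.links_to_diagram >= MIN_LINKS
        and ds.bulk_access
    )
```

A typical FOAF file would fail on `triple_count` alone, which is the point of the remark about FOAF files.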

Why Linked Data?
• Easier search for structured documents (use of URIs in RDF triples is similar to the use of URLs in classical links)
• Easier ontology matching
  – Central authorities providing URIs for other data sources (e.g. DBpedia)

Semantic Search Engines

Document-Centric Semantic Search Engines

Watson
• http://kmi-web05.open.ac.uk/WatsonWUI/
• Parsing: Jena
• Repository: Jena?
• Reasoning: NO
• Keyword based search, SPARQL endpoint

Watson - Schema

Swoogle
• http://swoogle.umbc.edu/
• Crawler: 3 Custom Crawlers
  – Google Crawler (.rdf, .owl files)
  – Focused Crawler
  – Extracted URIs crawler
• Repository: Jena
• Index: Lucene
• Keyword based search

Swoogle Architecture

Data Analysis
• Classification of Semantic Web Documents (SWDs)
  – Databases: make assertions about individuals
  – Ontologies: define new terms
• Compute rank of SWDs
• Search ordering: Swoogle PR, by analogy to Google PageRank (GPR)
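Since the ranking is described only as analogous to PageRank, the sketch below shows plain PageRank by power iteration over a link graph between documents. It is not Swoogle's actual ranking; the damping factor, iteration count, and graph representation are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank over a dict {document: [documents it links to]}.

    Illustrative only: Swoogle's own ranking is merely analogous to this.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling document: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, in `{"a": ["c"], "b": ["c"], "c": ["a"]}` the document `c` has two incoming links and ends up ranked above `a`, which in turn ranks above `b`.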

Entity-Centric Semantic Search Engines

Falcons
• http://iws.seu.edu.cn/services/falcons/
• Reasoning/Ontology matching: Falcon-ao
• Search ordering: TF-IDF in combination with popularity of ontologies
• Classes recommendation: Ordering according to their popularity
• Keyword search: Based on the indexed texts extracted from Virtual Documents
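The search ordering above combines TF-IDF with popularity, but the slides do not say how the two are combined. The sketch below assumes one possible combination (TF-IDF multiplied by a logarithmic popularity boost); the function names, the smoothing in the IDF term, and the combination formula are all assumptions, not Falcons' actual scoring.

```python
import math
from collections import Counter


def tfidf_scores(query_terms, docs):
    """Score each virtual document (id -> list of tokens) against the query."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each term
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return scores


def rank(query_terms, docs, popularity):
    """Order matching documents by TF-IDF times an assumed popularity boost."""
    scores = tfidf_scores(query_terms, docs)
    combined = {
        doc_id: score * (1.0 + math.log(1 + popularity.get(doc_id, 0)))
        for doc_id, score in scores.items()
        if score > 0
    }
    return sorted(combined, key=combined.get, reverse=True)
```

With two documents that match a query equally well, the one with the higher popularity count is listed first.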

Falcon Screenshot

Falcon-ao
• Linguistic Matching for Ontologies
  – Virtual Documents (names, labels, comments)
  – Levenshtein edit distance
  – Vector Space Model + cosine similarity of VDs
• Graph Matching for Ontologies
  – Similarity of two entities comes from the accumulation of similarities of involved statements
  – Similarity of two statements comes from the accumulation of similarities of involved entities
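The two linguistic measures named above can be sketched as follows: Levenshtein edit distance between entity names, and cosine similarity between virtual documents treated as bags of words. This is a textbook version of both measures, assuming simple pre-tokenized input; it is not Falcon-ao's code, and how Falcon-ao weights or combines the two is not shown here.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Edit distance suits short names and labels, while the vector space comparison suits the longer text gathered into a virtual document (names, labels, comments).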

SWSE
• http://swse.deri.org/
• Crawler: MultiCrawler
• Repository: YARS2
  – storing quadruples (subject, predicate, object, context)
• Ontology matching: URIs, IFPs
• Reasoning: Future work (Scalable Authoritative OWL Reasoner - SAOR)
• Search ordering: ReConRank (Page Rank for Linked Data)
• Keyword based search: Lucene
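To illustrate what storing quadruples means, here is a minimal in-memory sketch in which the fourth element, the context, records the document a statement came from. The `Quad` and `QuadStore` names and the single subject index are hypothetical; YARS2's actual index organisation is not reproduced here.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # the source document of the statement


class QuadStore:
    """Toy in-memory quad store (illustrative, not the YARS2 design)."""

    def __init__(self):
        self._quads = set()
        self._by_subject = defaultdict(set)

    def add(self, quad):
        self._quads.add(quad)
        self._by_subject[quad.subject].add(quad)

    def match(self, subject=None, predicate=None, obj=None, context=None):
        """Return quads matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj, context)
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        else:
            candidates = self._quads
        return [q for q in candidates
                if all(p is None or p == v for p, v in zip(pattern, q))]
```

Keeping the context makes it possible to ask which document asserted a statement, which matters later for ranking and for judging source credibility.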

SWSE Architecture

• Consolidate – find synonymous identifiers
• Rank – links-based analysis, scores assignment
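The consolidation step can be sketched with inverse functional properties (IFPs), which SWSE lists under ontology matching: two identifiers that share the same value for an IFP denote the same entity. The sketch below uses a union-find structure to group such identifiers; the function name and the triple representation are assumptions, not SWSE's implementation.

```python
def consolidate(triples, ifps):
    """Map each subject to a canonical identifier.

    triples: iterable of (subject, predicate, object) strings.
    ifps: set of predicates treated as inverse functional properties.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller identifier as canonical.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # (ifp, value) -> first subject seen with that value
    for s, p, o in triples:
        find(s)  # register the subject
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                union(first_seen[key], s)
            else:
                first_seen[key] = s
    return {x: find(x) for x in parent}
```

For instance, two URIs that both carry the same `foaf:mbox` value end up with the same canonical identifier, while a URI with a different mailbox stays separate.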

Sindice.com
• http://www.sindice.com
• Crawler: SindiceBot
  – robots.rdf
  – semantic site maps
  – crawling pingthesemanticweb.com
• 3 Indexes:
  – URI index
  – IFP index
  – Keyword index
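The three indexes can be pictured as three inverted maps from a lookup key to the documents that mention it. A minimal sketch, assuming triples are given as plain strings and that anything starting with a URI scheme is a resource rather than a literal; the class name, the `is_uri` helper, and the tokenisation are hypothetical, not Sindice's implementation.

```python
import re
from collections import defaultdict


def is_uri(value):
    """Crude, illustrative test for a resource identifier."""
    return value.startswith(("http://", "https://", "mailto:"))


class DocumentIndexes:
    """Three lookup tables mapping a key to the documents that contain it."""

    def __init__(self, ifps):
        self.ifps = set(ifps)
        self.uri_index = defaultdict(set)      # URI -> documents
        self.ifp_index = defaultdict(set)      # (IFP, value) -> documents
        self.keyword_index = defaultdict(set)  # token -> documents

    def add_document(self, doc_url, triples):
        for s, p, o in triples:
            self.uri_index[s].add(doc_url)
            if is_uri(o):
                self.uri_index[o].add(doc_url)
            else:
                # Literal value: index its lower-cased word tokens.
                for token in re.findall(r"\w+", o.lower()):
                    self.keyword_index[token].add(doc_url)
            if p in self.ifps:
                self.ifp_index[(p, o)].add(doc_url)
```

A lookup then becomes a dictionary access, for example `indexes.keyword_index["alice"]` to find the documents mentioning that word.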

Sindice Architecture
• Crawler:
  – Apache Nutch
  – Hadoop
  – MapReduce
• Reasoner: OWLIM Reasoner
• Keyword based search: Solr
• http://www.sig.ma

Sindice Architecture

Basic structure
[Diagram. Components: Structured data crawler, Unstructured data crawler, Documents repository, Data extractor, Entity repository, Indexer, Sorter, Searcher, Other apps using API]

Basic structure
[Diagram. Components: Ping, Scheduler, Crawler, Documents repository (Cache), Data extractor (Parser), Entity repository, Indexer, Sorter, Searcher, Other apps using API]

Basic structure
[Diagram. Labels: Ping, Scheduler, Crawler, Flat Files?, Sesame, SERQL, OWLIM, Indexer, Sorter, Searcher]

Crawling Problems
• Locating resources (not such a big problem nowadays)
• Re-Crawl Timing
• Live data sources
• Automatically generated data sources

Storage Problems
• Ontology matching
  – structural and linguistic methods are not 100 % accurate
• Reasoning
  – Tradeoff quality vs. scalability
  – Data sources credibility (spamming)
• Indexing
  – Tradeoff quality vs. scalability
  – Keyword search vs. SPARQL

Searching Problems
• Extent of some queries, e.g. SELECT ?s ?o WHERE { ?s rdf:type ?o }
  – Stop words
  – Top-k results
• Results ordering
  – Application of Page Rank – prone to spamming
  – Resources credibility
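Two of the mitigations named above, stop words and top-k results, can be sketched in a few lines: drop overly common terms before searching, and keep only the k best-scored results instead of materialising the full answer set. The stop-word list and function names are illustrative assumptions.

```python
import heapq

# Illustrative stop-word list; a real engine would use a much larger one.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def filter_stop_words(terms):
    """Remove query terms that would match nearly every document."""
    return [t for t in terms if t.lower() not in STOP_WORDS]


def top_k(scored_results, k):
    """Return the k highest-scored (result, score) pairs, best first."""
    return heapq.nlargest(k, scored_results, key=lambda pair: pair[1])
```

`heapq.nlargest` accepts any iterable, so results can be streamed from the store without being sorted in full, which is what keeps a query like the `rdf:type` example above manageable.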

Semantic web Crawler
• Slug
  – Simple – starts from a given set of documents and follows extracted URIs
  – Bugs
• MultiCrawler
  – No downloadable version
  – Description in a paper
• Apache Nutch based solution
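The simple strategy described for Slug (start from a given set of documents and follow extracted URIs) is a breadth-first traversal. The sketch below shows that strategy only; it is not Slug's code. The `fetch_links` callback, which would download a document and return the URIs found in it, is a hypothetical parameter supplied by the caller, as is the `max_documents` limit.

```python
from collections import deque


def crawl(seeds, fetch_links, max_documents=100):
    """Breadth-first crawl from seed URLs.

    fetch_links(url) is a caller-supplied function (hypothetical here) that
    retrieves the document and returns the URIs extracted from it.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated, order preserved
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the fetch step in as a function keeps the traversal logic testable without network access, e.g. `crawl(["a"], lambda u: web.get(u, []))` over an in-memory dictionary `web`.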

Java Triplestores I
• YARS2 (http://sw.deri.org/2004/06/yars/)
  – no longer developed
• Jena (http://jena.sourceforge.net/)
  – TDB storage (access via API)
  – SDB storage (SPARQL endpoint)
• Sesame (http://www.openrdf.org/)
  – Sesame Server
  – SERQL
• Virtuoso (http://virtuoso.openlinksw.com)
  – Unified storage engine (XML, SQL, RDF, Free Text)
  – Berlin Benchmark

Java Triplestores II
• JRDF
  – 2008 triplestore across Hadoop
  – Currently no support for OWL
• Mulgara
  – SPARQL, TQL
  – Connection API