How can I get my dataset into the diagram? • There must be resolvable http:// (or https://) URIs. • They must resolve, with or without content negotiation, to RDF data in one of the popular RDF formats (RDFa, RDF/XML, Turtle, N-Triples). • The dataset must contain at least 1000 triples. (Hence, your FOAF file most likely does not qualify.)
• The dataset must be connected via RDF links to a dataset that is already in the diagram. This means that either your dataset must use URIs from the other dataset, or vice versa. We arbitrarily require at least 50 links. • Access to the entire dataset must be possible via RDF crawling, via an RDF dump, or via a SPARQL endpoint.
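The countable criteria above (at least 1000 triples, at least 50 links, http:// or https:// URIs) can be checked mechanically. The following is a minimal sketch, assuming the dataset is already loaded as a list of (subject, predicate, object) string tuples; the function names and the prefix-based notion of "belongs to a dataset" are assumptions made for illustration, and resolvability and dump/endpoint access are not checked because they need network access.

```python
from urllib.parse import urlparse

MIN_TRIPLES = 1000  # threshold stated in the criteria
MIN_LINKS = 50      # threshold stated in the criteria


def is_http_uri(term):
    """Return True if term is an http:// or https:// URI with a host part."""
    parsed = urlparse(term)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def count_links(triples, own_prefix, other_prefix):
    """Count triples that connect a URI under own_prefix with a URI
    under other_prefix, in either direction."""
    links = 0
    for s, _p, o in triples:
        if s.startswith(own_prefix) and o.startswith(other_prefix):
            links += 1
        elif s.startswith(other_prefix) and o.startswith(own_prefix):
            links += 1
    return links


def qualifies(triples, own_prefix, other_prefix):
    """Check the countable criteria only (hypothetical helper)."""
    has_http_uris = any(is_http_uri(s) for s, _p, _o in triples)
    enough_triples = len(triples) >= MIN_TRIPLES
    enough_links = count_links(triples, own_prefix, other_prefix) >= MIN_LINKS
    return has_http_uris and enough_triples and enough_links
```

Here `other_prefix` stands for the namespace of a dataset that is already in the diagram, for example `http://dbpedia.org/resource/`.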
Why Linked Data? • Easier search for structured documents (use of URIs in RDF triples is similar to the use of URLs in classical links) • Easier ontology matching – Central authorities providing URIs for other data sources (e.g. DBpedia)
Semantic Search Engines
Document-Centric Semantic Search Engines
Watson • http://kmi-web05.open.ac.uk/WatsonWUI/ • Parsing: Jena • Repository: Jena? • Reasoning: No • Keyword-based search, SPARQL endpoint
• Repository: Jena • Index: Lucene • Keyword-based search
Swoogle Architecture
Data Analysis • Classification of Semantic Web Documents (SWDs) – Databases: make assertions about individuals – Ontologies: define new terms
• Compute rank of SWDs • Search ordering: Swoogle PR, analogous to Google PageRank (GPR)
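To make the PageRank analogy concrete, here is a minimal sketch of plain PageRank by power iteration over a document link graph. It is the textbook algorithm, not Swoogle's actual ranking; the damping factor, the iteration count, and the input shape (a dict mapping each document to the documents it references) are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps a document to the list of documents it references.
    """
    # Collect every document that appears as a source or a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document receives the same base share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A document without outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With `links = {"a.rdf": ["onto.owl"], "b.rdf": ["onto.owl"], "onto.owl": []}`, the ontology referenced by both documents ends up with the highest rank.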
Entity-Centric Semantic Search Engines
Falcons • http://iws.seu.edu.cn/services/falcons/ • Reasoning/Ontology matching: Falcon-ao • Search ordering: TF-IDF in combination with popularity of ontologies • Class recommendation: classes ordered by their popularity • Keyword search: based on the indexed texts extracted from Virtual Documents
Falcon Screenshot
Falcon-ao • Linguistic Matching for Ontologies – Virtual Documents (names, labels, comments) – Levenshtein edit distance – Vector Space Model + cosine similarity of VDs
• Graph Matching for Ontologies – Similarity of two entities comes from the accumulation of similarities of involved statements – Similarity of two statements comes from the accumulation of similarities of involved entities
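The two linguistic measures named above can be sketched as follows. This is a minimal illustration, not Falcon-ao's implementation: the normalisation of the edit distance into a similarity, the whitespace tokenisation, and the use of raw term counts as vector weights are all assumptions.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical names."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two virtual documents given as plain text,
    using raw term counts as vector components."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A virtual document here is simply the concatenated names, labels and comments of an entity, for example `"Author person who wrote a publication"`.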
Crawling Problems • Locating resources (not a big problem nowadays) • Re-crawl timing • Live data sources • Automatically generated data sources
Storage Problems • Ontology matching – structural and linguistic methods are not 100% accurate • Reasoning – Tradeoff quality vs. scalability – Credibility of data sources (spamming)
• Indexing – tradeoff quality vs. scalability – Keyword search vs. SPARQL
Searching Problems • Extent (result size) of some queries, e.g. SELECT ?s ?o WHERE { ?s rdf:type ?o } – Stop words – Top-k results
• Results ordering – Application of PageRank – prone to spamming – Credibility of resources
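For the top-k point, one common approach is to keep only the k best-scored results instead of sorting the full result set. A minimal sketch, assuming results arrive as (resource, score) pairs; the function name and the pair layout are hypothetical.

```python
import heapq


def top_k(scored_results, k):
    """Return the k highest-scored (resource, score) pairs, best first.

    scored_results may be any iterable, including a generator, so the
    full result set never has to be sorted or held in a list.
    """
    return heapq.nlargest(k, scored_results, key=lambda pair: pair[1])
```

How the score itself is computed (for example a rank weighted by source credibility) is left open here.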
Semantic Web Crawlers • Slug – Simple – starts from a given set of documents and follows extracted URIs – Bugs
• MultiCrawler – No downloadable version – Description in a paper
• Apache Nutch based solution
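The behaviour described for Slug (start from a given set of documents, follow the extracted URIs) amounts to a breadth-first traversal. The sketch below illustrates that idea only and is not Slug's code; `extract_uris` is a hypothetical callback that would fetch a document and return the URIs found in it, and `max_documents` is an assumed safety limit.

```python
from collections import deque


def crawl(seeds, extract_uris, max_documents=100):
    """Breadth-first crawl starting from the seed URIs.

    extract_uris(uri) must return an iterable of URIs found in the
    document at uri. Returns the URIs in the order they were visited.
    """
    queue = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    visited = []
    while queue and len(visited) < max_documents:
        uri = queue.popleft()
        visited.append(uri)
        for linked in extract_uris(uri):
            # Enqueue each URI only once to avoid loops.
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return visited
```

Passing the extraction step in as a function keeps the traversal testable without network access; a real crawler would also need politeness delays and a re-crawl policy.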
Java Triplestores I • YARS2 – no longer developed (http://sw.deri.org/2004/06/yars/) • Jena (http://jena.sourceforge.net/) – TDB storage (access via API) – SDB storage (SPARQL endpoint)
• Sesame (http://www.openrdf.org/) – Sesame Server – SeRQL