The Edisyn Search Engine

Jan Pieter Kunst, Meertens Institute, Amsterdam


Edisyn Search Engine

• Software deliverable of the Edisyn project (2005-2011)

• Developed at the Meertens Institute

• Goal: concurrent search of various dialect-syntactical corpora with a single set of terms using a single interface

• Available at www.meertens.knaw.nl/edisyn/searchengine/


Originally planned implementation

• A physically distributed search engine

• Each group hosts and maintains its own corpus

• All corpora are on the web and have a web service interface

• The Edisyn search engine uses these web service interfaces to communicate with the individual corpora


Practical problems with this plan

• Groups without technical staff are not in a position to make their corpus available on the web

• Groups with technical staff don't have funding to make their corpus available on the web

• Groups that already have a corpus on the web don't have funding to add a web service interface to it


Actual implementation

• The corpora (except one) are all hosted at the Meertens Institute

• (But the communication between the corpora and the search engine still uses web services)

• The one corpus that is hosted remotely (the Nordic Dialect Corpus) does not have a web service interface, so in that case we had to resort to a "screen scraping" technique
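The talk does not show the scraping code itself. As a rough illustration of the "screen scraping" technique, the sketch below pulls result sentences out of a hypothetical HTML result page using only Python's standard library; the real Nordic Dialect Corpus markup will of course differ.

```python
from html.parser import HTMLParser

# Hypothetical search-result page -- the real Nordic Dialect Corpus
# HTML is different; this only illustrates the general idea.
SAMPLE_PAGE = """
<html><body>
  <div class="hit">jeg har ikke sett det</div>
  <div class="hit">han har ikke penger</div>
</body></html>
"""

class HitScraper(HTMLParser):
    """Collect the text content of every <div class="hit"> element."""
    def __init__(self):
        super().__init__()
        self.in_hit = False
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "hit") in attrs:
            self.in_hit = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_hit = False

    def handle_data(self, data):
        if self.in_hit and data.strip():
            self.hits.append(data.strip())

scraper = HitScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.hits)  # ['jeg har ikke sett det', 'han har ikke penger']
```

The fragility of this approach (any change to the page layout breaks the scraper) is exactly why a proper web service interface was the preferred design.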


Conclusion

• The search engine is not physically distributed in the way the original plan hoped for

• But the actual implementation is still based on web services, so adding a remote corpus should be no harder than adding a local corpus


Prospects for the future

• Based on past experience, chances are slim that new corpora will come with a web service interface

• In all probability, new corpora will be added by providing me with a copy of the corpus, which I will then add to the search engine

• Please note that the Edisyn project at the Meertens Institute is officially finished, so the time I can spend on processing a new corpus may be limited (depending on the workload from running projects)


Best practices

• Do not use a word processor or other "free text editing" program for transcribing speech

• Use something that saves its data in a structured format so that it can be processed by a computer program; preferably a dedicated transcription application

• I recommend the transcription program Praat: www.praat.org

• Another option is ELAN: tla.mpi.nl/tools/tla-tools/elan/

• With these programs, different speakers are saved in different tiers, and time codes into the sound file are saved automatically
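To see why the structured format pays off, here is a sketch of an import step over a small fragment in Praat's (long) TextGrid text format. The tier name and texts are invented; the point is that a predictable structure makes extracting utterances with their time codes a few lines of code.

```python
import re

# A fragment in Praat's long TextGrid text format: one tier per
# speaker, every interval carrying its time codes automatically.
SAMPLE_TEXTGRID = """
item [1]:
    class = "IntervalTier"
    name = "speaker1"
    intervals [1]:
        xmin = 0.00
        xmax = 1.25
        text = "bla bla bla"
    intervals [2]:
        xmin = 1.25
        xmax = 2.10
        text = "more bla"
"""

# Because the format is predictable, an import script stays trivial:
INTERVAL = re.compile(
    r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')

intervals = [(float(start), float(end), text)
             for start, end, text in INTERVAL.findall(SAMPLE_TEXTGRID)]
print(intervals)
# [(0.0, 1.25, 'bla bla bla'), (1.25, 2.1, 'more bla')]
```

Extracting the same information from a free-form word processor document would require guessing at each transcriber's personal conventions.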


Best practices for transcription programs

• Don't mix data and metadata. Typos that result in an invalid structure are easily made, and they break import scripts, e.g.:

  [v=54] bla bla bla [/v[

  Something like this is better:

  +-------------+
  | bla bla bla |
  +-------------+
  | v=54        |
  +-------------+

• Use the features of your transcription program as intended by its authors. A predictable structure greatly helps when writing import scripts
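The slide's `[/v[` example can be made concrete with a small sketch: a naive import script parsing inline markup fails on the mistyped closing bracket, while tier-separated data leaves nothing hand-typed to get wrong. The marker syntax and parser here are hypothetical.

```python
import re

# Inline metadata: one mistyped bracket and the import script fails.
WELL_FORMED = "[v=54] bla bla bla [/v]"
TYPO        = "[v=54] bla bla bla [/v["   # closing marker mistyped

MARKUP = re.compile(r"\[v=(\d+)\] (.*?) \[/v\]")

def parse_inline(line):
    """Parse '[v=NN] text [/v]' into (speaker_id, text)."""
    m = MARKUP.fullmatch(line)
    if m is None:
        raise ValueError("invalid structure: %r" % line)
    return int(m.group(1)), m.group(2)

print(parse_inline(WELL_FORMED))   # (54, 'bla bla bla')

try:
    parse_inline(TYPO)
except ValueError as err:
    print("import failed:", err)

# With data and metadata in separate tiers, as a transcription tool
# stores them, there is no hand-typed markup to mistype:
record = {"text": "bla bla bla", "v": 54}
```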


Part of speech tags

• If possible, use the Edisyn tag set for tagging your data. See www.meertens.knaw.nl/edisyn/searchengine/tagset

• If using the Edisyn tag set is not possible, please provide a mapping from your tag set to the Edisyn tag set so that I can construct an automatic translation

• The search engine works as follows: each corpus uses its native tag set, while search requests use the Edisyn tag set; a query is translated from Edisyn tags to native tags before it is sent to a corpus. For each tag set there is an XML file containing the mapping to the Edisyn tag set.

• A table of the tags used and their meanings should be provided (see the Edisyn search engine for examples)
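The exact schema of the Edisyn mapping files is not shown in this talk, so the sketch below uses a hypothetical XML structure and invented tag names to illustrate the translation step: load the mapping, then rewrite an Edisyn tag into a corpus's native tag before dispatching the query.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping file -- the real Edisyn XML format and these
# tag names are invented for illustration only.
MAPPING_XML = """
<mapping corpus="example">
  <tag edisyn="V(past)" native="WW(pv,verl)"/>
  <tag edisyn="N(sg)"   native="ZNW(ev)"/>
</mapping>
"""

root = ET.fromstring(MAPPING_XML)
edisyn_to_native = {t.get("edisyn"): t.get("native")
                    for t in root.iter("tag")}

def translate_query(edisyn_tag):
    """Rewrite an Edisyn tag to this corpus's native tag before the
    query is sent to the corpus."""
    return edisyn_to_native[edisyn_tag]

print(translate_query("V(past)"))  # WW(pv,verl)
```

One such mapping per corpus lets a single Edisyn-tagged query run against every native tag set concurrently.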

English glosses

• Adding English glosses to your corpus enhances its usefulness. A crude way of doing this (which we employed for the SAND corpus) is to translate all unique words out of context. The resulting "translations" are often incoherent, but better than nothing.

Geographic information

• It helps a lot if your locations come with latitude/longitude data (these can be found with Google Maps or Google Earth). Finding the correct geographic coordinates for the current corpora in Edisyn took me a lot of work, not being familiar with the regions and place names in question

General information

• A concise description of your project (methodology used, description of data collection protocols, etc.) should be provided (see the Edisyn search engine for examples)


Planned additions/updates

Time permitting (the project being officially finished), these are some of the things I would like to add to the search engine:

• Downloading search results in an easy-to-process format (CSV, JSON, XML)

• Google has shut down its translation API; the now-useless Google Translate link should be removed and replaced by another solution for automatic translation

• Other suggestions "from the field" are always welcome (actual implementation not guaranteed)
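The download feature does not exist yet; as a sketch of what it could look like, the snippet below serializes some hypothetical search hits (field names invented) to both JSON and CSV using only Python's standard library.

```python
import csv
import io
import json

# Hypothetical search results; the field names are invented.
hits = [
    {"location": "Amsterdam", "lat": 52.37, "lng": 4.89,
     "sentence": "bla bla bla"},
    {"location": "Oslo", "lat": 59.91, "lng": 10.75,
     "sentence": "more bla"},
]

# JSON export: a single call.
json_out = json.dumps(hits, indent=2)

# CSV export: header row plus one row per hit.
buf = io.StringIO()
writer = csv.DictWriter(buf,
                        fieldnames=["location", "lat", "lng", "sentence"])
writer.writeheader()
writer.writerows(hits)
csv_out = buf.getvalue()

print(csv_out)
```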

Thank you for your attention!