Open Source Software for Digital Libraries

Author: Harvey Nichols
Jon Dunn, Associate Director for Technology
John A. Walsh, Manager of Electronic Text Technologies
Indiana University Digital Library Program

IU Digital Library Brown Bag Series
Bloomington, IN
09 April 2004

Outline
• Open Source Introduction
• Categories of Open Source Software for Libraries
• Open Source Digital Library Systems
• Open Source XML Tools and Systems

What is open source software?
• In the phrase "open source," "source" refers to source code, the human-readable computer code which is the origin, or source, of the computer application. "Open" refers to the terms of access to that source code. So open source software is software for which the source code is freely available. But this is a very general and incomplete definition.
• A detailed definition of open source software is maintained by the Open Source Initiative.

Advantages and Disadvantages
Advantages
• Access to source code and ability and right to modify it
• Right to redistribute modifications to benefit wider community
• Free
• Excellent support networks
• Large and enthusiastic user base
Disadvantages
• Limited or no accountability
• Informal and unaccountable support channels

Categories of Open Source Software
• Operating Systems
  • Linux
• Programming Languages
  • Perl, PHP, Python
• Applications
  • Apache, Tomcat, emacs, grep, MySQL, sendmail, ssh

Different Open Source Licenses
• GNU GPL ("General Public License")
• GNU Lesser GPL
• BSD License
• Mozilla Public License
• IU Open Source License
• And more...

Open Source Software in the DLP
• Linux, Apache, Tomcat, PHP, Perl, DLXS, ImageMagick, ePrints, MySQL, Darwin Streaming Server, emacs, CVS, Webalizer, LibXML, LibXSLT, Saxon, and more!

Open Source Resources
• Open Source Initiative
• GNU
• SourceForge

Some categories of open source library software
• Library-oriented search engines
  • Cheshire, Pears
• Z39.50 toolkits
  • ZetaPerl (Perl), JAFER (Java), YAZ (C/C++)
• MARC parsers
  • MARC.pm (Perl), MARC4J (Java)
• Image processing
  • ImageMagick, tiffinfo/tiffdump
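To give a feel for what the MARC parsers listed above do, here is a minimal Python sketch of reading one record in the ISO 2709 exchange layout that MARC uses (24-character leader, directory of 12-character entries, then field data). It is an illustration only and not the API of MARC.pm or MARC4J; the function names and the sample record are made up, and real parsers also handle character encodings and malformed records.

```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def parse_marc_record(raw):
    """Split one ISO 2709 record into its leader and a list of fields."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])            # offset where field data begins
    directory = raw[24:base - 1]         # directory ends with a field terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length - 1]  # drop terminator
        if tag < "010":                  # control fields have no subfields
            fields.append((tag, data.decode("utf-8")))
        else:
            indicators = data[:2].decode("ascii")
            chunks = data[2:].split(SUBFIELD)[1:]
            subfields = [(c[:1].decode("ascii"), c[1:].decode("utf-8"))
                         for c in chunks]
            fields.append((tag, indicators, subfields))
    return leader, fields


def build_marc_record(fields):
    """Hypothetical helper: assemble a tiny record so the parser can be tried."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = 24 + len(directory) + 1
    total = base + len(body) + 1
    leader = f"{total:05d}nam  22{base:05d}   4500"
    return leader.encode("ascii") + directory + FIELD_END + body + RECORD_END


# Made-up sample data: a control number and a title field.
sample = build_marc_record([
    ("001", b"12345"),
    ("245", b"10" + SUBFIELD + b"aOpen source" + SUBFIELD + b"bfor libraries"),
])
leader, fields = parse_marc_record(sample)
for field in fields:
    print(field)
```

The directory is what makes the format compact to scan: a reader can jump straight to a tag of interest without walking every field.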

Some categories of open source library software
• Portals
  • MyLibrary
• OAI service providers and data providers
  • PHP OAI Data Provider
  • Lots! See www.openarchives.org
• METS tools
  • Page turners, toolkits, more: see www.loc.gov/mets/
• Digital object repositories
  • Fedora

A Good Starting Point
• oss4lib: Open Source Systems for Libraries
  • www.oss4lib.org

Complete DL Systems
• DSpace
• Eprints
• Greenstone

DSpace
• “DSpace is a groundbreaking digital institutional repository that captures, stores, indexes, preserves, and redistributes the intellectual output of a university’s research faculty in digital formats.”
• Developed jointly by MIT Libraries and Hewlett-Packard
• Licensed under BSD distribution license
• www.dspace.org

DSpace
• Supports submission of, management of, and access to digital content
  • Formats: text, images, audio, video
• Organized based on organizational needs of a large university
  • Communities and collections

DSpace Features
• Digital preservation
  • Persistent IDs, support levels for different file formats
• Access control
• Versioning
• Search and retrieval
  • Based on qualified Dublin Core metadata
• OAI-PMH data provider
  • To support metadata harvesters
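As a rough illustration of what a metadata harvester does with an OAI-PMH data provider, the Python sketch below builds a ListRecords request and pulls identifiers and Dublin Core titles out of a response. The base URL, the sample response, and the function names are assumptions made for the example; a real harvester would also fetch the URL over HTTP and follow resumption tokens.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iter(f"{{{NS['oai']}}}record"):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Made-up response, trimmed to the elements used above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("http://repository.example.org/oai"))  # hypothetical URL
print(extract_titles(SAMPLE_RESPONSE))
```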

DSpace Technology
• OS: Unix or Linux
• Written in Java
• PostgreSQL relational database
• Provides complete Web user interface, but Java APIs available

DSpace Data Model
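The hierarchy described on the preceding slides (communities containing collections, with submitted content carrying Dublin Core metadata and a persistent ID) can be pictured roughly as in the Python sketch below. The class names, field names, and sample values are hypothetical and are not DSpace's Java API; the sketch only shows how the containment might be modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    persistent_id: str                   # made-up identifier for illustration
    metadata: dict                       # e.g. Dublin Core element -> value
    files: list = field(default_factory=list)


@dataclass
class Collection:
    name: str
    items: list = field(default_factory=list)


@dataclass
class Community:
    name: str
    collections: list = field(default_factory=list)

    def all_items(self):
        """Flatten every item in every collection of this community."""
        return [item for coll in self.collections for item in coll.items]


# Made-up sample data.
report = Item("example:123/1", {"title": "Annual report"}, ["report.pdf"])
library = Community("University Libraries",
                    [Collection("Technical Reports", [report])])
print([item.metadata["title"] for item in library.all_items()])
```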

DSpace Architecture

DSpace Demonstration
• MIT DSpace
  • dspace.mit.edu

EPrints
• “free software which creates online archives”
• Developed by University of Southampton, UK
• Supports self-archiving of e-prints
• Can be configured as institutional repository or otherwise, e.g. repository focused on particular research area or discipline
• Licensed under GNU General Public License
• software.eprints.org

EPrints
• Supports submission, management of, and access to digital content
• Can support multiple archives on one server
• Moderated or unmoderated archives
• Search and retrieval
  • Based on metadata
  • Metadata can be customized for different archives and document types
• No access control
• OAI-PMH data provider

EPrints Technology
• OS: Unix or Linux
• Written in Perl
• Requirements:
  • Apache web server
  • MySQL relational database

EPrints Demonstration
• Digital Library of the Commons
  • dlc.dlib.indiana.edu

Greenstone
• “Suite of software for building and distributing digital library collections”
• Developed by University of Waikato, New Zealand
  • Developed in cooperation with UNESCO and the Human Info NGO
• Licensed under GNU General Public License
• www.greenstone.org

Greenstone Features
• Supports creation and management of collections by administrator(s)
• Web interface for search and retrieval
  • Customizable metadata
  • Supports full text search of content
• Extensive document filters
  • Word, Excel, PowerPoint, PDF, ...
  • Can extract metadata from documents
• Many ways to build a collection, including:
  • Local files
  • Retrieve web sites
  • Retrieve objects via OAI-PMH

Greenstone Features
• Focus on:
  • Ease of installation
  • Ease of use
  • Internationalization
    • Full support for English, French, Spanish, Russian, and Kazakh
    • Support for many other languages
  • Low barriers to use
    • Minimal system requirements
    • Creation of CD-ROMs

Greenstone Technology
• Runs on Windows (back to 3.1), Linux, Mac OS X, Unix
• Written in C++, Perl, and Java
• Uses MG/MG++ search engine
• Several different Web and Java/Swing user interfaces for various functions
• Web interface for user access

Greenstone Demonstration
• Examples at www.greenstone.org

Open Source XML Tools and Systems
• Utilities
  • Xalan, Xerces, libxml, libxslt, saxon
• Editors
  • emacs / nxml-mode
• Database / Search Engines
  • Apache Xindice
  • Berkeley DB XML
  • eXist
• Publishing / Web Application Frameworks
  • AxKit
  • Cocoon

XML Databases & Search Engines
• Apache Xindice
• Berkeley DB XML
• eXist

Apache Xindice
• http://xml.apache.org/xindice/
• Technology: Java
• Optimized for large numbers of small XML files. Does not work well on large files.

Berkeley DB XML
• http://www.sleepycat.com/products/xml.shtml
• Technology: C
• C++ and Java APIs

eXist
• http://exist.sourceforge.net/
• Technology: Java
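The general idea behind the XML databases on the last three slides, a collection of XML documents that can be queried by path, can be imitated in a few lines of Python. This is a toy in-memory stand-in and does not use the Xindice, Berkeley DB XML, or eXist APIs; the class name, document keys, and sample letters are made up, and the standard library's ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET


class XmlCollection:
    """Toy collection of small XML documents, queried with path expressions."""

    def __init__(self):
        self._docs = {}

    def add(self, key, xml_text):
        """Parse a document and store it under a key."""
        self._docs[key] = ET.fromstring(xml_text)

    def query(self, path):
        """Return {key: [text, ...]} for every document with matching elements."""
        results = {}
        for key, root in self._docs.items():
            hits = [el.text for el in root.findall(path)]
            if hits:
                results[key] = hits
        return results


# Made-up sample documents.
letters = XmlCollection()
letters.add("letter-1", "<letter><author>Author A</author><date>1865</date></letter>")
letters.add("letter-2", "<letter><author>Author B</author></letter>")

print(letters.query(".//author"))
print(letters.query("./date"))
```

A real XML database adds persistent storage and indexes so that such queries do not have to parse and scan every document.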

XML Publishing / Web Application Frameworks
• XML Publishing, or Web Application, Frameworks provide systems for publishing XML data in a variety of formats, such as HTML, WAP/WML, PDF, etc. Both AxKit and Cocoon use a "pipeline" paradigm to route incoming requests through different processing routines.
• Apache AxKit
• Apache Cocoon

Apache AxKit
• http://axkit.org/
• Technology: Perl
• AxKit is an XML Application Server for Apache. It provides on-the-fly conversion from XML to any format, such as HTML, WAP or text using either W3C standard techniques, or flexible custom code. AxKit also uses a built-in Perl interpreter to provide some amazingly powerful techniques for XML transformation.

Apache Cocoon
• http://cocoon.apache.org/
• Technology: Java
• "Apache Cocoon is a web development framework built around the concepts of separation of concerns and component-based web development."

Cocoon: Key Concepts
• publishing framework
• XML and XSLT
• "pipelined SAX processing"
• separation of:
  • content
  • logic
  • style
• centralized configuration
• sophisticated caching

Cocoon: Problems to Be Solved
• Separation of content, style, logic, and management functions in an XML content based web site:

Cocoon: Problems to Be Solved (cont.)
• Data mapping:

Cocoon: Basic mechanisms for processing XML documents
• Dispatching based on Matchers
• Generation of XML documents (from content, logic, relational DB, objects or any combination) through Generators
• Transformation (to another XML, objects or any combination) of XML documents through Transformers
• Aggregation of XML documents through Aggregators
• Rendering XML through Serializers
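The generator, transformer, serializer sequence listed above can be sketched in Python as a chain of plain functions. This only mirrors the division of labor and is not Cocoon code: Cocoon components are Java classes that pass SAX events, while this sketch passes whole ElementTree trees, and the `<article>` vocabulary and function names are made up.

```python
import xml.etree.ElementTree as ET


def generate(source):
    """Generator: produce an XML tree from a source (here, a string)."""
    return ET.fromstring(source)


def article_to_html(doc):
    """Transformer: map a hypothetical <article> vocabulary to HTML."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = doc.findtext("title")
    for para in doc.iter("para"):
        ET.SubElement(body, "p").text = para.text
    return html


def serialize(doc):
    """Serializer: render the final tree as text for the client."""
    return ET.tostring(doc, encoding="unicode")


def run_pipeline(source, transformers, serializer):
    """Run generate -> each transformer in order -> serializer."""
    doc = generate(source)
    for transform in transformers:
        doc = transform(doc)
    return serializer(doc)


# Made-up sample document.
SOURCE = "<article><title>Open Source</title><para>Hello.</para></article>"
print(run_pipeline(SOURCE, [article_to_html], serialize))
```

Because each stage only sees the tree handed to it, stylesheets (transformers) can change without touching content (generators) or output format (serializers), which is the separation the earlier slides describe.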

Cocoon: Basic mechanisms for processing XML documents

Cocoon: The Pipeline
Sequence of interactions:

Cocoon: The Pipeline

Generators, Transformers, & Serializers
• Generators
• Transformers
• Serializers

Cocoon: Configuration: The Sitemap ... ... ... ... ... ...
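A sitemap ties request patterns to pipelines. The Python sketch below imitates the wildcard matching and `{1}` substitution seen in this deck's technochat configuration: a `*` in a pattern captures part of the request path, and `{1}` in a source name is replaced by the first capture. It is a simplified stand-in, not Cocoon's matcher implementation; the function names and the dictionary layout are assumptions, and only single `*` wildcards are handled.

```python
import re

# (pattern, pipeline step) pairs; values taken from the technochat example.
SITEMAP = [
    ("technochat/", {"src": "technochat/index.xhtml", "stylesheet": None}),
    ("technochat/*.xml", {"src": "technochat/{1}.xml", "stylesheet": None}),
    ("technochat/*.html", {"src": "technochat/{1}.xml",
                           "stylesheet": "technochat/tei2html.xsl"}),
]


def compile_pattern(pattern):
    """Turn each single '*' into a capture group that stops at '/'."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "([^/]*)".join(literal_parts) + "$")


def substitute(template, groups):
    """Replace {1}, {2}, ... with the corresponding captured text."""
    return re.sub(r"\{(\d+)\}",
                  lambda m: groups[int(m.group(1)) - 1],
                  template)


def resolve(uri):
    """Return the first matching pipeline step with placeholders filled in."""
    for pattern, step in SITEMAP:
        match = compile_pattern(pattern).match(uri)
        if match:
            return {key: substitute(value, match.groups()) if value else None
                    for key, value in step.items()}
    return None


print(resolve("technochat/intro.html"))
print(resolve("technochat/"))
```

The first matching pattern wins, so the order of entries matters, just as the order of matchers does within a pipeline.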

Cocoon: Configuration: A Pipeline
<map:pipelines>
  <map:pipeline>
    <map:match pattern="technochat/">
      <map:generate src="technochat/index.xhtml"/>
      <map:serialize/>
    </map:match>
    <map:match pattern="technochat/*.xml">
      <map:generate src="technochat/{1}.xml"/>
      ...
    </map:match>
    <map:match pattern="technochat/*.html">
      <map:generate src="technochat/{1}.xml"/>
      <map:transform src="technochat/tei2html.xsl"/>
      <map:serialize/>
    </map:match>
    ...
      <map:generate src="technochat/{1}.xml"/>
      <map:transform src="technochat/tei2svg.xsl"/>
      <map:serialize type="svgxml"/>
    ...
      <map:transform src="technochat/tei2fo.xsl"/>
    ...
  </map:pipeline>
</map:pipelines>