Implementation of web services

EMBRACE 3rd AGM: Lyon April 23-24 2007

Implementation of web services
Peter Fischer Hallin, M.Sc.Eng., PhD student
Center for Biological Sequence Analysis - DTU
April 23, 2007

Outline
• A brief introduction to Web Services concepts
• Deliverables and work package description
• CBS Web Services - examples and principles
• Future challenges: data types, complex content, and management

A brief introduction to the terms of Web Services

Interoperability - SOAP based Web Services “... a software system designed to support interoperable Machine to Machine interaction over a network”

[Example SOAP message; the XML markup was lost and only the element values survive: 827635; Toptimate 3-Piece Set; 827635; 3-Piece luggage set. Black Polyester.; 96.50; true]

SOAP is the envelope format that binds to HTTP or HTTPS protocols.
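To make the envelope-over-HTTP idea concrete, here is a minimal sketch that builds a SOAP 1.1 envelope and the HTTP POST that would carry it. The operation name `getItem`, the element `itemId`, the namespace `urn:example:luggage` and the endpoint URL are all hypothetical; only the item number is borrowed from the example values above. Nothing is sent over the network.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:luggage"    # hypothetical service namespace
ENDPOINT = "http://example.org/soap"  # hypothetical endpoint


def build_envelope(operation: str, params: dict) -> bytes:
    """Wrap one operation call in a SOAP 1.1 Envelope/Body."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def build_request(payload: bytes, soap_action: str) -> urllib.request.Request:
    """Bind the envelope to HTTP: a POST with text/xml and a SOAPAction header."""
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": f'"{soap_action}"',
        },
    )


payload = build_envelope("getItem", {"itemId": 827635})
request = build_request(payload, SERVICE_NS + "#getItem")
print(payload.decode("utf-8"))
# The request is only constructed here; urllib.request.urlopen(request) would send it.
```

The same envelope could be posted to an HTTPS endpoint without changes; only the URL scheme differs.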

Web Services in 30sec
• Most Web Services are described by WSDL (Web Services Description Language) files.
• Concept: each service has a WSDL file. Once acquired, this file describes all aspects of the service: what the service requires, what it provides, which functions (operations) it has, and to which address (endpoint) requests should be directed.
• Definitions of input and output are declared either within the WSDL or in included files, in the format of XML Schema Definitions (XSDs).
• WSDL files can be loaded by modules in Perl, C, and Python, and operations are imported as object functions. Code can easily be written that parses a response from one service and sends it as a request to another (if the objects are compatible).
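As a rough illustration of what a client learns from a WSDL file, the sketch below reads the operation names and the endpoint address from a heavily trimmed WSDL 1.1 document using only the Python standard library. The WSDL text, the service name and the endpoint URL are invented for this example; a real client would normally rely on a SOAP toolkit rather than parse the file by hand.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}

# Hypothetical, heavily trimmed WSDL 1.1 document (messages and bindings omitted).
WSDL = """<?xml version="1.0"?>
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             name="ExampleService">
  <portType name="ExamplePortType">
    <operation name="runService"/>
    <operation name="pollQueue"/>
    <operation name="fetchResult"/>
  </portType>
  <service name="ExampleService">
    <port name="ExamplePort" binding="ExampleBinding">
      <soap:address location="http://example.org/ws/server.cgi"/>
    </port>
  </service>
</definitions>
"""


def describe(wsdl_text: str) -> dict:
    """Return the operation names and the endpoint declared in a WSDL."""
    root = ET.fromstring(wsdl_text)
    operations = [
        op.get("name")
        for op in root.findall("wsdl:portType/wsdl:operation", NS)
    ]
    address = root.find("wsdl:service/wsdl:port/soap:address", NS)
    return {
        "operations": operations,
        "endpoint": address.get("location") if address is not None else None,
    }


summary = describe(WSDL)
print(summary)
```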

normal vs. asynchronous jobs

Deliverables and work package description

WP2 objectives Tool Integration – Sequence Analysis The objective of this workpackage is to integrate the widest possible range of tools for sequence analysis that can all work together in an integrated fashion. A large existing resource base for this is EMBOSS, SoapLab and Taverna, developed in the main by partners 1a and 5, which will be able to take advantage of the APIs to databases that will be developed in WP1. Third party tools can also be integrated into the system. Workflow analysis will be developed, with the ability to provide standard workflows or user customized workflows. Dependent on the outcomes of WP3 on Technology assessment and adoption, Grid service processing will permit the best possible utilization of computational resources and the highest possible throughput.

Tool integration plan WP2
EMBOSS Sequence Analysis Suite, Protein motifs, UTOPIA, CINEMA, PatSearch, [email protected] (CNRS), Protein PTM, modHMM, Palign, SMART, ELM, Gepardi, Functional annotation - INTACAB...

“Protein post-translational modifications - CBS (Partner 8) maintains databases and tools for the identification and analysis of posttranslational modification sites.”

Predictions of post-translational modifications

Work strategy
• GOAL: To develop a robust, SOAP-compliant Web Services infrastructure for key prediction servers at CBS
• Both synchronous and asynchronous functionality (submitJob -> pollQueue -> fetchResult)
• Extensive logging and queuing system
• Careful analysis of the conceptual data content of input and output objects
• Redesign of input and output parsers to produce richer XML messages
• Services described by WSDL files (Web Services Description Language)
• Definitions of input and output objects by XML Schema Definition (XSD)
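The asynchronous pattern (submitJob -> pollQueue -> fetchResult) can be sketched from the client side as follows. The `InMemoryQueue` class is a stand-in invented for this example, and the status strings `RUNNING` and `FINISHED` are assumptions; the real CBS operations exchange SOAP messages and may use different names and states.

```python
import itertools
import time


class InMemoryQueue:
    """Stand-in for a remote service; a job finishes after a few polls."""

    def __init__(self, polls_until_done: int = 3):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def submit_job(self, sequence: str) -> str:
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"input": sequence, "polls": 0}
        return job_id

    def poll_queue(self, job_id: str) -> str:
        job = self._jobs[job_id]
        job["polls"] += 1
        return "FINISHED" if job["polls"] >= self._polls_until_done else "RUNNING"

    def fetch_result(self, job_id: str) -> dict:
        job = self._jobs[job_id]
        if job["polls"] < self._polls_until_done:
            raise RuntimeError("job not finished")
        return {"job_id": job_id, "length": len(job["input"])}


def run_async(service, sequence, interval=0.01, max_polls=100):
    """Submit once, poll until the job is finished, then fetch the result."""
    job_id = service.submit_job(sequence)
    for _ in range(max_polls):
        if service.poll_queue(job_id) == "FINISHED":
            return service.fetch_result(job_id)
        time.sleep(interval)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


result = run_async(InMemoryQueue(), "RPKPQQFFGLM")
print(result)
```

A synchronous call would collapse the three steps into one request that blocks until the result is ready, which is only practical for short jobs.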

CBS prediction servers related to protein post-translational modifications

Service        Description
LipoP          Signal peptidase I & II cleavage sites in Gram-negative bacteria
SignalP        Signal peptide and cleavage sites
DictyOGlyc     O-(alpha)-GlcNAc glycosylation sites
NetAcet        N-terminal acetylation in eukaryotic proteins
NetCorona      Coronavirus 3C-like proteinase cleavage sites in proteins
NetGlycate     Glycation of ε amino groups of lysines
NetNGlyc       N-linked glycosylation sites in human proteins
NetOGlyc       O-GalNAc (mucin type) glycosylation sites
NetPhosK       Kinase specific phosphorylation sites
NetPhos        Generic phosphorylation sites in eukaryotic proteins
NetPicoRNA     Posttranslational cleavage by picornaviral proteases
ProP           Arginine and lysine propeptide cleavage sites
YinOYang       O-(beta)-GlcNAc glycosylation and Yin-Yang sites
NetPhosYeast   Serine and threonine phosphorylation

Other relevant prediction servers:

Service        Description
TMHMM          Transmembrane helices in proteins
NetCTL         Integrated class I antigen presentation
NetChop        Proteasomal cleavages (MHC ligands)
RNAmmer        Ribosomal RNA subunits
GenomeAtlas    Microbial Genome Database / Genome Visualization

Currently implemented prediction servers cover TMHMM, NetCTL, NetChop, RNAmmer, SignalP, and EasyGene; more will come soon ...

The CBS WSDLs
• All schema definitions of the CBS services are placed in separate XSD files. In most cases, each WSDL contains references to two XSDs:
  • Service specific: object definitions used only by this service
  • Common: definitions for commonly used objects that are shared among different CBS prediction servers (e.g. queue-related messages and FASTA-like sequence content)
• Human-readable documentation is located in documentation tags within the service element of the WSDL and is indexed by the web pages.
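A small sketch of how this layout could be inspected: the code below lists the XSD files imported by a WSDL and reads the documentation text from the service element. The WSDL fragment, the namespaces and the file names `service_specific.xsd` and `common.xsd` are made up for illustration and are not the actual CBS files.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "xsd": "http://www.w3.org/2001/XMLSchema",
}

# Hypothetical WSDL fragment with one service-specific and one common schema.
WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types>
    <xsd:schema>
      <xsd:import namespace="urn:example:service" schemaLocation="service_specific.xsd"/>
      <xsd:import namespace="urn:example:common" schemaLocation="common.xsd"/>
    </xsd:schema>
  </types>
  <service name="ExampleService">
    <documentation>Predicts features in FASTA input.</documentation>
  </service>
</definitions>
"""

root = ET.fromstring(WSDL)

# Every xsd:import inside the types section points at one external schema file.
schema_files = [
    imp.get("schemaLocation")
    for imp in root.findall("wsdl:types/xsd:schema/xsd:import", NS)
]

# The human-readable text lives in the documentation child of the service element.
documentation = root.findtext("wsdl:service/wsdl:documentation", namespaces=NS)

print(schema_files)
print(documentation)
```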

Documentation of CBS EMBRACE efforts

http://www.cbs.dtu.dk/ws/cbswork.php

Results
• We have developed a stable environment for implementing Web Services on the server side.
• We have developed client-side scripts containing template workflows, available for download (Perl and Python).
• Service input and output have been validated against our XML Schema Definitions.
• We used our Web Services in exercises during a course held by CBS: ‘EMBRACE Workshop on Bioinformatics of Immunology’, January 24-26, 2007.

CBS Web Services examples and principles

Prediction of rRNA genes: Alignment from secondary structure and construction of hidden markov models

http://www.psb.ugent.be/rRNA/

Lagesen K, Hallin PF, Rødland E, Stærfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent annotation of rRNA genes in genomic sequences. Accepted for publication March 6, 2007 in Nucleic Acids Research.

Good performance - selectivity and sensitivity in the range 0.98-1.00

[Plot axes: Selectivity [0.9;1.0], Sensitivity [0.9;1.0]]

When you receive a pleasant e-mail ...

The RNAmmer program is available as a traditional HTML based prediction server at http://www.cbs.dtu.dk/services as well as through a SOAP based web service. It is also available for download through the same site.

Acknowledgments
We are grateful for funding from EMBIO at the University of Oslo, the Research Council of Norway and the Danish Center for Scientific Computing. It was also supported by a grant from the European Union through the EMBRACE Network of Excellence, contract number LSHG-CT-2004-512092. We would also like to thank our colleagues for critical reading of the manuscript.

• Point to input data, typically a FASTA file
• Submit and wait for job to finish
• Retrieve the result
• View supplemental data

http://www.cbs.dtu.dk/services/RNAmmer

Simplified version of the CBS endpoint script: server.cgi

The RNAmmer Perl module, included automatically by the auto-dispatch ...

• Including an in-house module to parse FASTA and SOAP::Lite objects ... and our queueing module (QUite A Queue)
• SOAP::Lite will auto-dispatch the operation called by the client to these Perl subs; this sub takes care of placing the UNIX command into the queueing system
• Extracting results from the queue system is standard and independent of the service (here RNAmmer)
• Also, polling the queue is a standard routine ...
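The server code shown on these slides is Perl with SOAP::Lite and is not reproduced here. As a language-neutral illustration of the auto-dispatch idea, the Python sketch below maps an operation name received from a client onto a method of a service object. The class `RNAmmerService`, the `dispatch` helper, the list used as a queue and the status string are all hypothetical and do not mirror the CBS implementation.

```python
class RNAmmerService:
    """Hypothetical service module; each public method is one operation."""

    def __init__(self, queue):
        self._queue = queue  # a plain list standing in for the queueing system

    def runService(self, fasta: str) -> dict:
        # Place the command in the queue instead of running it directly.
        self._queue.append(("rnammer", fasta))
        return {"jobid": len(self._queue)}

    def pollQueue(self, jobid: int) -> dict:
        status = "QUEUED" if 1 <= jobid <= len(self._queue) else "UNKNOWN"
        return {"jobid": jobid, "status": status}


def dispatch(service, operation: str, **arguments):
    """Look up the operation by name on the service object and call it."""
    if operation.startswith("_"):
        raise ValueError(f"operation {operation!r} is not public")
    handler = getattr(service, operation, None)
    if not callable(handler):
        raise ValueError(f"unknown operation {operation!r}")
    return handler(**arguments)


queue = []
service = RNAmmerService(queue)
submitted = dispatch(service, "runService", fasta=">seq1\nACGT\n")
status = dispatch(service, "pollQueue", jobid=submitted["jobid"])
print(submitted, status)
```

Because polling and result extraction do not depend on the tool being run, such methods could live in a shared base class, which matches the point that these routines are standard across services.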

Using soapUI it looks like this:

Future challenges: Data types, complex content, and management

Technology is ready, but...
• We are close to having the essential technology recommendations ready.
• From our point of view, some unsolved issues still require our attention, the sooner the better:
  • Level of object typing
  • Management of objects and service descriptions
  • Attachment of raw data

First dilemma: typing
• The SOAP and XSD technologies offer clear opportunities to thoroughly define incoming and outgoing data.
• On the one hand, we want a certain level of granularity: this allows for strict definitions of all object elements and attributes (enumerations, strings, ints, floats, dates, regular expressions etc.) and gives easy access to message components from your programming language.
• On the other hand, more granular messages also leave fewer compatible operations and a more complex XML Schema (note that the word ‘compatible’ is used here only in the context of XSD/WSDL).

Example: Substance P: (a neurotransmitter)

Lower extreme - untyped message
• All data is sent as large, raw strings.
• In the context of XSD, all services will then be compatible (although in the real world, they might not be!)
• A different compatibility check will need to be applied - XSD will not help you here!

Upper extreme - strictly typed message
• All aspects of the message are defined. The author of the service sits down, carefully analyzes the conceptual data content, and writes the XML schema definition to fit the operation.
• However, giving all authors a free hand will likely produce as many incompatible services as there are programmers - management must be applied.
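The two extremes can be contrasted with a small sketch built around Substance P, whose one-letter sequence is RPKPQQFFGLM. The original slide examples are not available, so the field names (`name`, `sequence`, `role`) and the layout of the raw string are assumptions, and a Python dataclass stands in for what an XSD would enforce.

```python
import re
from dataclasses import dataclass

# Lower extreme: everything travels as one raw string.
untyped_message = "Substance P RPKPQQFFGLM neurotransmitter"

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")


@dataclass(frozen=True)
class Peptide:
    """Upper extreme: every field is named, typed and checked."""

    name: str
    sequence: str
    role: str

    def __post_init__(self):
        if not AMINO_ACIDS.fullmatch(self.sequence):
            raise ValueError(
                f"not a one-letter amino acid sequence: {self.sequence!r}"
            )


typed_message = Peptide(
    name="Substance P", sequence="RPKPQQFFGLM", role="neurotransmitter"
)

# The typed message gives direct access to its components ...
print(typed_message.sequence, len(typed_message.sequence))
# ... while the untyped one needs knowledge of how the sender laid out the string.
print(untyped_message.split()[2])
```

Any service accepts the raw string, but only a reader who knows its layout can use it; the typed form is self-describing, yet a second author could just as well have named the fields differently.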

Second dilemma: standardization
• If objects are likely to be reused in many workflows, across many partners and operations, standardization is favorable. It will ease the connection of operations across the network.
• A nightmare scenario: 13 different ways to represent an alignment, a FASTA entry, or a BLAST report.

Typing vs. standardization

[Diagram labels: granularity / strictly defined objects; standardization / required management]

A logical and natural threshold for typing: data that is likely to be at the endpoint of workflows or to have limited scientific meaning in the context of bioinformatics: JPEG/PNG, PostScript/PDF, ...

The untyped data content
• Examples: 3D rendered images of protein structure, publications and documents.
• A final technology recommendation is not ready yet.
• SOAP MIME attachments likely to be recommended together with...
• In-message base64 encoded XML content

Example: Calculation of amino acid usage
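The slide content for this example is not preserved, so the following is only a plausible reading of the calculation itself: the fraction of each residue in a protein sequence. The function name `amino_acid_usage` is hypothetical, and the input is the Substance P sequence used earlier in the talk.

```python
from collections import Counter


def amino_acid_usage(sequence: str) -> dict:
    """Return the fraction of each residue, ignoring non-letter characters."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    counts = Counter(residues)
    total = len(residues)
    if total == 0:
        return {}
    return {aa: counts[aa] / total for aa in sorted(counts)}


usage = amino_acid_usage("RPKPQQFFGLM")
for aa, fraction in usage.items():
    print(f"{aa}\t{fraction:.3f}")
```

In a typed message, each residue and its fraction would become separate elements or attributes instead of one tab-separated text block.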

Example: Predicting rRNA genes in a complete genome sequence

E. coli genome properties

To be published in a textbook on comparative genomics of microbial genomes

A central XSD repository?

Operation name       Type     Service   Partner   File                  Author
runServiceRequest    input    RNAmmer   CBS       ws_rnammer_1_1e.xsd   Peter Hallin
runServiceResponse   output   RNAmmer   CBS       ws_rnammer_1_1e.xsd   Peter Hallin
runServiceRequest    input    SignalP   CBS       ws_signalp_3_1.xsd    Henrik Nielsen
runServiceResponse   output   SignalP   CBS       ws_signalp_3_1.xsd    Henrik Nielsen

Issues:
• How do we measure message compatibility?
• How do we manage such a repository?
• Meta Web Service: WS operation to describe partner Web Service?
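To make the repository idea tangible, the sketch below stores the four rows from the table as records and answers two simple questions: which schema files does a service use, and do two services share one? The record type and both helper functions are hypothetical, and sharing an XSD file is only a crude stand-in for compatibility; it does not answer the open question of how compatibility should be measured.

```python
from typing import NamedTuple


class SchemaRecord(NamedTuple):
    operation: str
    direction: str  # "input" or "output"
    service: str
    partner: str
    xsd_file: str
    author: str


REPOSITORY = [
    SchemaRecord("runServiceRequest", "input", "RNAmmer", "CBS",
                 "ws_rnammer_1_1e.xsd", "Peter Hallin"),
    SchemaRecord("runServiceResponse", "output", "RNAmmer", "CBS",
                 "ws_rnammer_1_1e.xsd", "Peter Hallin"),
    SchemaRecord("runServiceRequest", "input", "SignalP", "CBS",
                 "ws_signalp_3_1.xsd", "Henrik Nielsen"),
    SchemaRecord("runServiceResponse", "output", "SignalP", "CBS",
                 "ws_signalp_3_1.xsd", "Henrik Nielsen"),
]


def schemas_for(service: str) -> set:
    """All XSD files referenced by the operations of one service."""
    return {r.xsd_file for r in REPOSITORY if r.service == service}


def share_schema(service_a: str, service_b: str) -> bool:
    """Naive proxy: two services share at least one XSD file."""
    return bool(schemas_for(service_a) & schemas_for(service_b))


print(schemas_for("RNAmmer"))
print(share_schema("RNAmmer", "SignalP"))
```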

Web Services are here to stay
• The EMBRACE network will deliver ways for bioinformatics computers / grids to communicate using SOAP-based Web Services.
• The major challenge: to ensure that the network partners agree on definitions of object types and content.
• In XSD, we could define a string holding a genomic sequence, but it still takes a human to conclude that this is a genomic sequence and not a baking recipe!
A true challenge remains: to let computers talk to computers, without needing humans to decipher WSDL.
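The XSD shown on the slide is not preserved. As a hedged reconstruction of the idea, the sketch below embeds a made-up schema fragment that restricts a string to the letters A, C, G, T and N, reads the pattern out of it, and applies it with Python's `re` module. The type name `sequence` and the pattern are assumptions; for a simple character class like this one, the XSD and Python regular expression syntaxes coincide.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment; the type name and pattern are assumptions.
XSD = f"""<xs:schema xmlns:xs="{XS}">
  <xs:simpleType name="sequence">
    <xs:restriction base="xs:string">
      <xs:pattern value="[ACGTN]+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

schema = ET.fromstring(XSD)
pattern = schema.find(f".//{{{XS}}}pattern").get("value")


def conforms(value: str) -> bool:
    # An XSD pattern must match the whole value, hence fullmatch.
    return re.fullmatch(pattern, value) is not None


print(conforms("ACGTTGCANNAC"))     # a plausible genomic fragment
print(conforms("TAGACAT"))          # accepted too, whatever the sender meant
print(conforms("2 cups of flour"))  # rejected by the pattern
```

The schema can reject the recipe, but nothing in it tells a machine that a conforming value is genomic DNA rather than any other string over the same alphabet; that meaning still lives in the name `sequence` and in the reader's head.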

Overall goal
Once the challenges are met, we can begin to exploit the power of bioscience Web Services: we will be able to take a given genome sequence, protein sequence or structure and ask the network: “What can I do with this?”, “Who has the services that fit my data?”

Acknowledgments
Center for Biological Sequence Analysis, DTU: Kristoffer Rapacki, Francisco Roque, Hans-Henrik Stærfeldt
Bergen Center for Computational Science: Jan Christian Bryne
And ... Oliver Piilgaard Hallin, born December 2006