Functional Annotation of Gene Products

Functional Annotation of Gene Products

Patrik Georgii-Hemming

TRITA-NA-E04067

NADA Numerisk analys och datalogi KTH 100 44 Stockholm

Department of Numerical Analysis and Computer Science Royal Institute of Technology SE-100 44 Stockholm, Sweden


Master’s Thesis in Computer Science (20 credits) Single Subject Courses, Stockholm University 2004 Supervisor at Nada was Stefan Arnborg Examiner was Stefan Arnborg

Abstract

In recent years new technologies have been developed that allow biologists to measure the expression of thousands of genes at the same time. The amount of data generated in a single experiment presents a significant challenge for the analyst. This project concerns one aspect of the analysis, functional annotation, where one associates relevant knowledge from different biological databases with gene products of interest. The first step in functional annotation is to decide what information should be included. The next step is to mine different biological databases for this information and to store the information locally to enable queries of the acquired data. In this project we have developed an application to automate retrieval of information from five selected biological databases. The application stores the data in an embedded database that can be queried using SQL.

Funktionell annotering av genprodukter (Functional annotation of gene products)

Sammanfattning (Swedish summary)

In recent years, the development of new techniques has made it possible for biologists to study the expression of thousands of genes at once. The large amount of data from experiments using these techniques is a major challenge for those who must analyze the data. In this project we have studied one aspect of the process, functional annotation. Functional annotation aims to associate relevant information from different biological databases with gene products of interest. The first step in functional annotation is to decide what kind of information is of interest. The next step is to use data mining to locate this information in different biological databases and then to store it in a local database for further use. In this project we have developed a program that automatically retrieves information from five selected databases. The program stores the data in an embedded database that supports SQL queries.

Contents

1 Introduction
  1.1 Background
    1.1.1 What we are trying to achieve
    1.1.2 Different kinds of information about genes
    1.1.3 Challenges and problems
  1.2 Biological databases
    1.2.1 Data modeling and data management
    1.2.2 Data retrieval
    1.2.3 Data acquisition
    1.2.4 Databases used in this project
  1.3 Current annotation strategies
    1.3.1 Manual annotation
    1.3.2 Analysis pipelines
    1.3.3 More ambitious approaches based on link integration
    1.3.4 Datawarehousing

2 The present work
  2.1 Goals and strategy
    2.1.1 Clarifying the task and the problems
    2.1.2 How should applications for functional annotation be designed?
    2.1.3 Providing a small proof-of-concept annotation program
  2.2 Results
    2.2.1 Choosing data sources
    2.2.2 The terminology problem
    2.2.3 The validation problem
    2.2.4 The GeneAnnotator program (GEA)
  2.3 What remains to be done
    2.3.1 Designing a user-friendly interface
    2.3.2 Robustness
    2.3.3 Designing for extensibility and flexibility

3 Future directions
  3.1 Ongoing research
    3.1.1 Web services
    3.1.2 Ontologies
    3.1.3 Globally unique qualifiers
  3.2 Recommendations

References

Appendix

Chapter 1

Introduction

1.1 Background

This project was done at Center for Genomics and Bioinformatics (CGB), Karolinska Institute. It forms a small part of a collaborative effort between biologists and bioinformaticians to create an application for automatic analysis of results from high-throughput experiments.

1.1.1 What we are trying to achieve

One of the most central properties shared by all life forms is the ability to store and propagate information. The series of discoveries of how diverse life forms can be encoded in a string of chemical entities called nucleotides gave birth to the scientific discipline of molecular biology. These nucleotides come in four variants (coded A, C, G, T) and are combined linearly to form a strand of DNA¹. Defined regions of DNA in the genome, genes, contain information on how proteins should be built from amino acids. The information flow from gene to protein is divided into two processes: transcription, where the information in the gene is copied into an RNA² sequence, and translation, where the RNA strand is translated into a protein, which does the actual work in the cell.

To understand the mechanisms behind different processes in the cell it is necessary to know which proteins are present and how they interact during these processes. For technical reasons it is much easier to study RNA sequences than to study the proteins directly. The abundance of different RNA sequences during different cellular processes can serve as a marker for their corresponding proteins. In recent years several high-throughput methods have been developed that can measure the levels of several thousands or even tens of thousands of these RNAs in a single experiment.

One big challenge is how to interpret the results of these experiments. The task we are concerned with in this project is how to begin to understand what the presence of these RNAs "means". Regardless of the hypothesis that led the biologist to perform a particular experiment, there are some steps that must always be taken when analyzing data from the high-throughput methods we are interested in here. One of the first steps is to find as much information as possible about the function of the individual gene products³. Functional annotation of the gene products is the process of associating individual gene products with relevant information derived from different sources. Ideally this information should be stored in a way that makes it easily accessible to the biologists.

1. Deoxyribonucleic acid.
2. Ribonucleic acid.

1.1.2 Different kinds of information about genes

It is of course necessary to define what kind of information one is looking for. I will outline the type of information we have focused on in this work.

Databases containing literature references: Scientific articles describing the function of gene products are important sources of information. In practice it is enough to search PubMed, a service of the National Library of Medicine in the United States. PubMed includes over 14 million citations for biomedical articles going back to the 1950s [21].

Sequence databases: Sequence databases hold the most basic information about nucleotide sequences⁴ and protein sequences⁵. The data include the sequences, references to the scientists who submitted them and, if available, the name of the entity the sequence is derived from. The sequence entity can be a section of the genome, a known gene, an RNA or a protein. The database may also hold more information, but this varies from database to database. GenBank [17], EMBL [6] and DDBJ [5] are the most comprehensive sequence databases.

Ontology databases: An ontology in this context is simply a list of terms (including a definition of their meaning) and the relationships between the terms. An ontology provides a conceptualization of a domain of knowledge, facilitates communication between domain experts and makes it easier to write software that depends on domain knowledge [14]. In this project we have used the Gene Ontology (GO), which is actually three ontologies named biological process, molecular function and cellular component [3]. Each ontology is a directed acyclic graph where the terms are the nodes and the relationships are the arcs. There are two types of relationships, "is-a" and "part-of". Most gene products have been annotated with one or several Gene Ontology terms. These terms give a good indication of the function of the gene product.
3. Since we are interested in making inferences about proteins and not RNAs, I will sometimes use this vaguer term to denote the abstract entity composed of an RNA and its corresponding protein.
4. A nucleotide sequence is the "word" created by the ordering of the nucleotides in a gene or RNA.
5. A protein sequence is the ordering of the amino acids that build up the protein.
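The Gene Ontology structure just described, a directed acyclic graph with typed arcs, can be sketched in a few lines of Python. The terms and arcs below are illustrative only, not taken from the real ontology:

```python
# Minimal sketch of GO-style terms as a directed acyclic graph.
# Each term maps to its parents, with the relationship type on the arc.
GO_EDGES = {
    "mitochondrion": [("organelle", "is-a"), ("cytoplasm", "part-of")],
    "organelle": [("cellular component", "is-a")],
    "cytoplasm": [("cellular component", "part-of")],
    "cellular component": [],
}

def ancestors(term):
    """Return every term reachable by following arcs upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relation in GO_EDGES.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Every broader category a gene product annotated with the term falls under:
print(sorted(ancestors("mitochondrion")))
# -> ['cellular component', 'cytoplasm', 'organelle']
```

A query like this is what makes GO annotations useful: a product annotated with a specific term implicitly carries all ancestor terms as well.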


Databases with information about metabolic pathways: A metabolic pathway is a small part of the biochemistry of a cell, e.g. fat metabolism or energy production in the Krebs cycle. Many gene products are enzymes involved in the regulation of these processes. Several databases hold information about metabolic pathways, but most of them are very specialized and only hold information about a particular organism or cell type. In this project we decided to use KEGG (Kyoto Encyclopedia of Genes and Genomes), which is the most comprehensive metabolic pathway database [15].

Signaling pathway databases: The activity of a cell is regulated by external signals. Insulin, e.g., binds to a receptor molecule at the cell surface and starts a cascade of events where one protein (gene product) activates the next in the cell. One result of this signaling is that the cell starts taking up sugar from the bloodstream. If we can place a gene product in one or several signaling pathways we will have learned a lot about its function. Unfortunately this information is available only in pictorial form. This means that these databases can only answer queries about which signaling pathways a gene product participates in. It is, e.g., not possible to ask about the location of a gene product within a signaling pathway; this information must be deduced by the user looking at the pictures of the pathways. Ideally, the data about each signaling pathway should be held in a directed graph, since this would allow more advanced queries. We have used the Biocarta database in this project [16]. Biocarta holds information about signaling pathways in humans and mice.

Domain databases: Proteins are composed of several domains, distinct parts each having its own function. One kind of domain is common to proteins that sit in the cell membrane, another kind is common to proteins that can bind to DNA, and so forth. A domain is a kind of recurrent "theme" in proteins. If we know that a gene product has a certain domain, we can guess what function the gene product has if the function of other gene products with this domain is known. The problem with this kind of information is that it is often based on computational predictions. Pfam (Protein families), e.g., is a database that uses hidden Markov models to predict the presence of domains based on the protein sequence. Other databases hold information from crystallographic experiments, which provide direct evidence for the presence of particular domains. The quality of the data is obviously not the same. However, we have chosen to ignore this difficulty in this project, and we use the CDD (Conserved Domain Database) [18], which includes information from several other databases, e.g. Pfam.
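The remark that a signaling pathway stored as a directed graph would allow more advanced queries can be illustrated with a small sketch. The insulin cascade below is highly simplified, and the reachability query is exactly the kind of question a pictorial database cannot answer:

```python
# Sketch: a signaling pathway as a directed graph, arcs pointing from
# an activating gene product to the product it activates. Simplified
# insulin cascade for illustration only.
INSULIN_PATHWAY = {
    "insulin receptor": ["IRS-1"],
    "IRS-1": ["PI3K"],
    "PI3K": ["AKT"],
    "AKT": ["GLUT4 translocation"],
}

def is_upstream(graph, a, b):
    """True if a chain of activations leads from a to b."""
    stack = [a]
    visited = set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node in visited:
            continue
        visited.add(node)
        stack.extend(graph.get(node, []))
    return False
```

With the data in this form one can ask "does the insulin receptor act upstream of AKT?" directly, instead of tracing arrows in a picture by eye.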

1.1.3 Challenges and problems

The problem of functional annotation is to a large degree a data integration problem, since the relevant information is fragmented across many different data sources. EBI (European Bioinformatics Institute) and Infobiogen (a research institute in France) maintain a catalog of biological databases, dbcat, which currently holds links to 511 different databases [12].

One seemingly trivial problem is how to identify the entities you want to annotate. There is no standardized way to assign and maintain names of biological objects across databases. For example, searching the OMIM (Online Mendelian Inheritance in Man) database for "SLAP" returns two completely unrelated proteins, "Sarcolemmal associated protein" and "Src-like adaptor protein". A more subtle problem is the clash of concepts as you move from one database to another. An example is the definition of a gene. The definition of "gene" differs between researchers and databases, which makes it very hard or even impossible to merge data from some sources.

There are also technical challenges. The various databases use different DBMSs, and none provides a standard way of accessing the data. Some databases provide large text dumps of their contents, others offer access to the underlying DBMS, and still others provide only web pages as their primary mode of access.

1.2 Biological databases

As already mentioned, there are several hundred biological databases. Well-known examples are DDBJ [5], EMBL [6], GenBank [17], PIR [9] and SWISS-PROT [23]. It is difficult to keep track of all these databases, and dbcat was developed for this purpose. Most biological databases are also large; GenBank, e.g., contains more than 23 million gene sequence records. To make matters even more complicated, these databases are growing very rapidly. Both the actual size and the growth rate of these databases have become a serious problem, and without automated methods, such as data mining algorithms, the collected data can no longer be fully exploited.

1.2.1 Data modeling and data management

Molecular databases can be classified as follows [8]:

• Databases using a standard DBMS, i.e. relational, object or object-relational.
• Databases using the database management system ACEDB [27]. ACEDB is a DBMS originally developed for the biology database called A C. elegans Data Base.
• Databases using the OPM (Object Protocol Model) [11] together with a relational or object database management system. OPM is a data model combining standard object-oriented modeling constructs with specific constructs for modeling scientific experiments.
• Databases implemented as flat files.

Most biology databases were first implemented as a collection of flat files. Later, many of them were reimplemented using relational or object database management systems (DBMSs). Unfortunately the relational model is not ideal for biological data, which often has a semi-structured form; this has led to very complex schemas that are not intuitive. The object model fits better but is less well known.

ACEDB is a database management system originally developed to hold data on a small worm (C. elegans). ACEDB was later extended to be able to manage other such specialized databases. ACEDB resembles an object database management system. With ACEDB, data are modeled as objects organized in classes. However, ACEDB supports neither class hierarchies nor inheritance. An ACEDB object has a set of attributes that are objects or atomic values such as numbers or strings. ACEDB objects are represented as trees where the named nodes are objects or atomic values and the arcs express the attribute relationships. The advantage of ACEDB is that it accommodates irregular data items. The schema can also be extended easily by adding attributes to objects, because objects of a class need not have all attributes. With ACEDB it is therefore possible to extend a database schema without having to restructure the database; existing objects need not be changed. ACEDB has its own query language, AQL.

The Object Protocol Model (OPM) has been developed for modeling both biological data and the event sequences in scientific experiments. OPM is similar to an object model but provides specific constructs for the modeling of scientific experiments. The SQL-like query language of OPM supports nested queries with path expressions and set predicates. OPM also offers an ontology of scientific terms.

It has been argued that DBMSs are unnecessary in biology because transactions are so rare, most access is read-only, and the cost of reimplementing a database in a relational DBMS is often very high. Another reason is that biological data are often very complex and include deeply nested records, sets and lists.
Such data types are difficult to model in a relational or object DBMS. The flat-file databases generally have no explicit data model; their entries are structured either implicitly or explicitly by search indexes. Flat files are the de facto data exchange standard in biology, and many of the tools biologists use work only with flat files. Many research projects are currently investigating alternative means of data storage for bioinformatics. Different XML-based strategies seem to hold a lot of promise, and I will return to this subject in a later section.
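A rough sketch of the ACEDB-style tree model described above, using Python dictionaries. The classes and attribute names are invented; the point is that two objects of the same class may carry different attribute sets, and following a missing arc simply yields nothing:

```python
# Two objects of the (invented) class "Gene". In an ACEDB-like model
# each object is a tree: arcs are attribute names, leaves are atomic
# values, and objects of a class need not share the same attributes.
gene_a = {
    "class": "Gene",
    "name": "geneA",
    "sequence": "ATGGCC",
}

gene_b = {
    "class": "Gene",
    "name": "geneB",
    "sequence": "ATGTTT",
    # an attribute gene_a lacks -- legal in this model,
    # and adding it required no schema restructuring
    "phenotype": {"description": "slow growth", "reference": "ref123"},
}

def get_path(obj, *path):
    """Follow a chain of attribute arcs; None if any arc is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj
```

Here `get_path(gene_b, "phenotype", "description")` succeeds while the same query on `gene_a` quietly returns nothing, which is exactly how irregular data items are accommodated.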

1.2.2 Data retrieval

In general a biological database provides access via at least one of the following approaches:

• Query interface. The ability to query the database directly using SQL is actually a rarity in this field. My only explanation for this is that many biologists do not know SQL and, besides, many databases are nothing but indexed flat files.
• Indirect retrieval using web browsers. This resembles the approach taken by common search engines on the web (e.g. Google). These databases allow users to input boolean search strings to query the database.
• Database downloading (as flat files). This is also quite common, and it depends on the user having software to sift through the text file and extract interesting data.
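The third option, downloading a flat file and sifting through it with one's own software, can be sketched as follows. The parser below handles a GenBank-like layout (keyword in the first column, indented continuation lines); the record content itself is invented:

```python
# A tiny parser for a GenBank-style flat-file record. A keyword in
# column 1 starts a field; indented lines continue the previous field.
# Field names follow the real format; the values are made up.
RECORD = """\
LOCUS       NM_000001  1234 bp  mRNA
DEFINITION  hypothetical example protein
            (continued on a second line)
ORGANISM    Homo sapiens
"""

def parse_flat_record(text):
    fields = {}
    current = None
    for line in text.splitlines():
        if line[:1].strip():           # keyword starts in column 1
            keyword, _, rest = line.partition(" ")
            current = keyword
            fields[current] = rest.strip()
        elif current:                  # indented continuation line
            fields[current] += " " + line.strip()
    return fields
```

Real flat-file formats have many more field types and quoting rules, but this is the shape of the "sift through the text file" software the bullet point refers to.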

1.2.3 Data acquisition

The information is collected in different ways:

• From other databases. Many databases only summarize the data in other databases. In one way this is convenient for the biologist, who may find it easier to locate interesting information. The problem is that it can be hard to find the original source of the information.
• From the research community. Nowadays it is common that scientific journals demand that scientists submit their data to the relevant databases before publication. This ensures that the raw data are available to other research groups.
• From the scientific literature. Some databases have large staffs of curators who are experts in the field and who regularly update database records based on newly published findings.

1.2.4 Databases used in this project

In this section I will provide an overview of the databases that were used in this work. The factors involved in the choice of these particular databases are discussed in chapter 2.

GenBank

The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. The database is produced at the National Center for Biotechnology Information (NCBI) [22] as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library at the European Bioinformatics Institute (EBI) [6] and the DNA Data Bank of Japan (DDBJ) [5]. In February 2003, GenBank contained more than 23 million records. GenBank is built by direct submissions from individual research groups. Entries are found and retrieved using keyword searches. It is possible to search the database with a web browser or programmatically, using the fact that the queries are sent as HTTP GET requests. Since the query string is part of the URL it is relatively easy to automate the searches. The result is returned as text, HTML or XML. It is also possible to do bulk downloads using FTP. The formats of the query string and the result report are specified, which makes the access protocol to GenBank relatively stable. GenBank is a flat-file database and no access to the database backend is provided.

Figure 1.1 shows a shortened version of a GenBank record. The first thing to notice is that the LOCUS is given with the prefix "NM_" for this record, which means that this sequence is a reference sequence (see RefSeq below). The second thing to notice is that the data is semi-structured: it can easily be stored as a tree, which makes XML a perfect match for this data type. However, it is not obvious how to design a schema for a relational database that will hold this data in third normal form. The result is a schema that is not intuitive for the biologist, a problem I will return to later.
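Automating a GenBank search, as described above, amounts to constructing the query URL. The sketch below builds (but does not send) such a URL against the NCBI E-utilities search endpoint; the gene name and parameter choices are examples only:

```python
# Building a programmatic GenBank query. Because searches are plain
# HTTP GET requests, the whole query lives in the URL. The endpoint
# is NCBI's E-utilities search service; parameters shown are a small
# illustrative subset.
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_search_url(gene_name, database="nucleotide", retmode="xml"):
    params = {"db": database, "term": gene_name, "retmode": retmode}
    return ESEARCH + "?" + urlencode(params)

url = build_search_url("BRCA1")
# fetching `url` (e.g. with urllib.request.urlopen) would return an
# XML list of matching record identifiers
```

Because the URL format is specified and stable, a small function like this is all an annotation pipeline needs to turn a gene name into a machine-readable search.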

RefSeq

The goal of the Reference Sequence (RefSeq) database is to provide a biologically non-redundant collection of DNA, RNA and protein sequences [19]. Each RefSeq represents a single, naturally occurring molecule from a particular organism. RefSeqs are frequently based on GenBank records but are really a synthesis of information from several sources. GenBank contains records based on genomic sequences, transcribed (RNA) sequences and protein sequences. The problem is that we are interested in the functional unit, the gene product, which encompasses a genomic sequence that is transcribed to RNA and translated to a protein. In GenBank this information is scattered over several records. Furthermore, several records are derived from the same biological entity but were submitted by different research groups using different methods and reporting slightly different results. The RefSeq database removes redundant records and retains one reference sequence for the genomic sequence, one for the RNA and one for the protein. Every record in the RefSeq database also holds links to the other members of the functional unit. This makes it easier to retrieve functional information about the gene products. The RefSeq database is maintained by NCBI and is accessed as described for GenBank.

LocusLink

LocusLink organizes information from several public databases to provide a locus-centered view of genomic information [20]. A locus is a defined place in the genome that is transcribed to RNA. LocusLink is the most important source of functional information precisely because it is based on the concept of a locus and not on sequences (sequences have a many-to-one relationship to a locus). Figure 1.2 shows a shortened version of a LocusLink record. The data is again semi-structured, but with the additional problem that it is only available as an HTML page. GenBank records are always given in the same format (both the text format and the XML format are standardized), but since the LocusLink records are only meant to look similar to a user who browses the pages, there is no guarantee that the HTML code will not change.

Figure 1.1. An abbreviated GenBank record


Figure 1.2. An abbreviated LocusLink record


Figure 1.3. A Biocarta record.

Biocarta

The BioCarta database stores pictures of different signaling pathways [16]. One uses a web interface to search for a particular gene product by name or by LocusLink ID. If the gene product is found, a web page with links to the relevant records is returned. Following the links displays the pictures in the web browser. Unfortunately, a lot of information is lost because pictures, and not, e.g., graphs, are used to store the data. An example record is shown in figure 1.3.

KEGG

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a Japanese database that is similar to BioCarta but contains pictures of metabolic pathways [15]. KEGG is accessed exactly like the BioCarta database and is subject to the same restrictions. Figure 1.4 shows an example record.

GeneOntology

In this work we are interested in going from a gene product to its function. Unfortunately, the Gene Ontology database is adapted to the opposite problem: starting from a defined function and finding the gene products involved in that particular function [26]. The critical information contained in the Gene Ontology database can also be reached through links from the LocusLink database records, and we decided to use this indirect route instead of using the Gene Ontology database directly.

1.3 Current annotation strategies

To put the present work in perspective I will give an overview of the most commonly used strategies for functional annotation today.

1.3.1 Manual annotation

Today the most common approach to annotation is "database surfing", where the biologist starts the search for information by querying a few databases that he or she happens to know about. The database records frequently contain hyperlinks to information in other databases on the internet, so the biologist follows these links to get more data and stops when enough information has been collected. This approach often starts with a keyword search, which is in itself a problem due to the large number of irrelevant hits returned by the search engine. The fact that every link has to be followed and checked manually makes the whole process very laborious and error-prone. Another problem is that the process depends on the hyperlinks, which must be correct and kept up to date.

1.3.2 Analysis pipelines

Some improvements are possible by writing software that hides these problems from the biologist, who is only interested in the end result. When a group of biologists have common information needs, it is common that a piece of tailor-made software is written and then used by all researchers in the group. These applications are typically rather small and their purpose is to do automatically what would otherwise be done manually. The benefits are standardization and the possibility of making necessary changes in one place, instead of having every biologist relearn how to accomplish the goal when a database model changes or when a database disappears from the internet (which happens, e.g., when it is bought by a company). Another benefit is that it becomes possible to handle a lot of information, which can be integrated in a local database. The problem with this approach is that the applications can never grow to be truly general, because this would rapidly lead to maintenance problems that cannot be handled by a small research group.

Figure 1.4. A KEGG record


1.3.3 More ambitious approaches based on link integration

An example of a more general application based on link integration is SRS (Sequence Retrieval System), a keyword indexing and search system for biological databases [7]. SRS is more sophisticated than general web-based search tools (such as Google) because it recognizes the existence of structured fields in the source databases and allows maintainers to explicitly relate a field in one database to a differently named field in another. Biologists can go to an SRS site and perform their searches there. This system depends on having maintainers who constantly work to keep the information accurate.

1.3.4 Datawarehousing

The idea in data warehousing is to collect all relevant information in one database. The first step is to develop a data model that can accommodate all the information contained in the various source databases. It is also necessary to develop software that fetches the data from the source databases, transforms them to match the data model and then loads them into the database. Data warehousing is difficult because the database must be updated constantly. New information is continuously added to the source databases, which means that the new data must be re-imported into the data warehouse. To make matters worse, database designs do not stand still: their maintainers change the data model by adding new data types, changing fields and nomenclature, and changing the relationships among data types. This means that software to fetch, transform and load information that has been written for one version of a database will not necessarily work with a later version.

One ambitious attempt at the warehouse approach in bioinformatics was the Integrated Genome Database (IGD) project [25]. At its peak, IGD integrated more than a dozen source databases. The IGD project survived for slightly longer than a year before collapsing. The main reason for its collapse was the rapid change of the source databases. On average, each of the source databases changed its data model twice a year. This meant that the IGD data import system broke down every two weeks and the software had to be rewritten.
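The fetch-transform-load cycle, and the way it breaks when a source changes its data model, can be sketched as follows. All record layouts and field names here are invented:

```python
# Skeleton of a warehouse import cycle. The transform step hard-codes
# assumptions about the source's field names, so any source-side
# rename breaks the import -- the failure mode that plagued warehouse
# projects such as IGD.

def fetch(source_records):
    """Stand-in for downloading records from a source database."""
    return list(source_records)

def transform(record):
    """Map a source record onto the (invented) warehouse model.

    Raises KeyError as soon as the source renames a field.
    """
    return {"gene": record["gene_symbol"], "desc": record["description"]}

def load(warehouse, records):
    for r in records:
        warehouse[r["gene"]] = r["desc"]

warehouse = {}
source = [{"gene_symbol": "abc1", "description": "example entry"}]
load(warehouse, [transform(r) for r in fetch(source)])
```

If the source maintainer renames `gene_symbol` to, say, `symbol`, the transform step fails outright; multiply this by a dozen sources each changing twice a year and the two-week breakage cycle reported for IGD is unsurprising.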


Chapter 2

The present work

2.1 Goals and strategy

This project was initiated to explore some of the problems and possibilities in functional annotation. In the planning phase we decided to concentrate on the following subgoals.

2.1.1 Clarifying the task and the problems

The first goal of this project was to find a good strategy for functional annotation. Such a strategy can in itself be useful for biologists doing manual annotation, since it will help them avoid many problems. From this viewpoint the application (see 2.2.4) is only important insofar as it validates the strategy. The points to consider were:

• How to get a handle on all the information. About 500 biological databases are available on the internet. Many of them contain overlapping and sometimes conflicting information.
• How to handle the terminology problem. It is necessary to make sure that there is no doubt about what one is talking about. Which genes are we annotating?
• How to get and organize the information once it is found. Every database that one wants to get information from must be treated separately, since the databases differ in how data is accessed and in how results are delivered. As a final step it is necessary to set up a local database to store all the data.

These points will be discussed beginning with section 2.2.1.

2.1.2 How should applications for functional annotation be designed?

The second goal was to clarify how an application for functional annotation should be designed. When we planned this project we came up with the following important factors.

• It is necessary to design the application with change in mind. The demands on the application will change. The application will have dependencies on data formats, query interfaces and other things, and these dependencies will be broken. The question is how to create a design that makes changing the application as painless as possible.
• The application must be able to work as a component of a larger system. This is necessary to allow users to adapt the application to their particular needs. Therefore it is necessary to give careful thought to the input and output of the application.
• Even though the application will use a few selected databases, it must be possible to add or remove databases at a later date.
• Quality control: how to ensure that the information is valid.

2.1.3 Providing a small proof-of-concept annotation program

The final goal of this project was to provide a proof-of-concept implementation of an application that takes a number of gene names and outputs annotated genes. The application must be usable, although somewhat limited, and incorporate the features mentioned above. The major limitation is that the only interface to the result is SQL queries against the created database. Since most biologists do not know SQL, an alternative interface must be constructed for this application to be really useful.

2.2

Results

I will put emphasis on the data selection and data integration problems, since these were the major challenges of this project. The writing of the application was relatively straightforward. To make the discussion more concrete I will first give an outline of the steps in the annotation process.

1. Read the file with the names of the genes to be annotated.

2. Find the entities (in this case the gene products) to annotate. This is done by querying the GenBank or LocusLink databases to find all genes with names found in the input file.

3. Use the information from these databases to learn how to query other databases for more information. At present the application will try to find information in the KEGG, Biocarta and Gene Ontology databases, in addition to the information already found in GenBank and LocusLink.

4. The application uses an embedded database and will automatically create a relational database on the user's computer to hold all information.

5. Write all information to the local database. The embedded database, SQLite [4], comes with a client program that can be used for SQL queries of the created database.
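As an illustration of steps 4 and 5, Python's built-in sqlite3 module is enough to create and populate the embedded database file. The two tables below are a simplified stand-in for the real GEA schema, and the sample row is invented:

```python
import sqlite3

def create_annotation_db(path):
    """Create (or open) an embedded SQLite database file on the user's
    computer and make sure the annotation tables exist.  The tables
    here are a simplified stand-in for the GEA schema."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS locuslink (
            id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE IF NOT EXISTS go (
            id TEXT PRIMARY KEY, name TEXT, aspect TEXT, url TEXT);
    """)
    conn.commit()
    return conn

# ":memory:" keeps the example self-contained; the real application
# would pass a filename so the database persists on disk.
conn = create_annotation_db(":memory:")
conn.execute("INSERT INTO locuslink VALUES (7157, 'tumor protein p53')")
rows = conn.execute("SELECT name FROM locuslink").fetchall()
print(rows)  # -> [('tumor protein p53',)]
```

Because the whole database lives in one file, the user needs no server setup before running the application.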

2.2.1

Choosing data sources

It took a long time to decide which databases to use as sources of information. The major problem was to find the relevant databases: first the databases have to be found, and then it is necessary to quickly determine whether they contain relevant information. There is no easy way to do this. Basically, information was gathered from biologists at CGB, from the bioinformatics literature and from web searches. It was difficult to know when to stop searching and accept that a search like this cannot be exhaustive. Another concern was to keep the application simple by not including two databases with very similar information. Several parameters were considered in the selection of the five databases.

1. Completeness. In our case this is a question of whether the database holds information on all gene products of interest to biologists at CGB. This criterion is fulfilled by the chosen databases if the definition of completeness is restricted to mean that the databases hold all information that exists within their scope. The records in the Biocarta and KEGG databases depend on a rather deep understanding of the function of the gene products, so it is not surprising that these databases lack information about many gene products. But they are still complete in the restricted sense that there is no more information of the same kind anywhere else.

2. Quality. Even though quality is difficult to measure, we did what we could. One criterion was frequency of citation in biological research articles. We reasoned that databases that are frequently cited are regarded as trustworthy by the researchers in the field. The databases we chose are often cited in the biological literature and they are well known to most biologists.

3. Accessibility. How is information retrieved, and in what format is it returned? The most troublesome databases are designed solely for access via a web browser.
To retrieve data from these databases automatically, the application must first perform a direct GET or, even worse, a direct POST request via HTTP and then parse the returned HTML code. This makes the application very brittle, since even small changes in an HTML form or in the returned HTML code will break the application. We had hoped to choose databases where the database backend is available, but this was not possible, and accessibility was a problem with all the chosen databases.
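To make the brittleness concrete, the following fragment builds a GET URL and screen-scrapes a result page using only the Python standard library. The endpoint and the HTML layout are invented for illustration; any change to the `class="name"` attribute in the real page would silently break the scraper:

```python
from urllib.parse import urlencode
from html.parser import HTMLParser

# Hypothetical endpoint; each real database uses its own URL and parameters.
BASE_URL = "https://example.org/search"

def build_get_url(gene_name):
    # A GET request is just a URL with the query string appended.
    return BASE_URL + "?" + urlencode({"gene": gene_name})

class NameScraper(HTMLParser):
    """Pull the text of every <td class="name"> cell out of a result page.
    Screen-scraping like this breaks as soon as the page layout changes."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.names = []
    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "name") in attrs:
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.names.append(data.strip())

scraper = NameScraper()
scraper.feed('<table><tr><td class="name">TP53</td></tr></table>')
print(scraper.names)  # -> ['TP53']
```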

2.2.2

The terminology problem

There are many potential ways to handle the ambiguous terminology in this field. I chose to implement two strategies in parallel. The first strategy is to let the user

handle the problem. If the user inputs an ambiguous gene name then the output will be ambiguous. This is not as bad as it seems, because the application will first find all gene products that the gene name refers to, then create an entity for each gene product and annotate these entities. The result will be correct in the sense that the annotations are correctly associated with the relevant entities, but this approach will generate a lot of irrelevant results that the user must handle. The second strategy is to let the user supply LocusLink IDs instead of gene names. LocusLink IDs are unique, so there is no longer any ambiguity, but this approach involves more work for the user, who has to find the LocusLink IDs.
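The two strategies can be sketched together in a few lines. The lookup table is a toy stand-in for the GenBank/LocusLink queries the real application performs, and the IDs for the ambiguous name are made up:

```python
# Toy lookup table standing in for a GenBank/LocusLink query; the real
# application retrieves these matches over HTTP.  The two IDs for "p53"
# are invented to illustrate an ambiguous name.
NAME_TO_PRODUCTS = {
    "p53": [7157, 90000],
    "TP53": [7157],
}

def resolve(entry):
    """Return one annotation entity per gene product the entry matches.
    A numeric entry is treated as a LocusLink ID, which is unique, so
    no lookup is needed (strategy two); a name may expand to several
    entities (strategy one)."""
    if entry.isdigit():
        return [int(entry)]
    return NAME_TO_PRODUCTS.get(entry, [])

print(resolve("p53"))    # -> [7157, 90000]: ambiguous input, two entities
print(resolve("12345"))  # -> [12345]: a LocusLink ID needs no resolution
```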

2.2.3

The validation problem

Another problem when one builds an application that depends on data from several external sources is the validation of that data. Every database uses a different strategy to check the correctness of data before it is entered into the database. GenBank makes no guarantees; the research group that submits the data is responsible for its correctness. In most cases the research group supplies a reference to an article where the work was published, and this article can then be scrutinized by any sceptical database user.

RefSeq is another story, since its records are assemblies of data from several sources. The RefSeq maintainers therefore accept responsibility for validation of the data in RefSeq. Of course, any user could in principle backtrack the steps that were taken in the assembly process, but this quickly becomes impractical even if the user is a software application. To validate all records would mean that the whole database would have to be recreated.

LocusLink is also a curated database with data assembled from several sources. The difference from RefSeq is that it is more obvious where the information comes from, and links to the original sources are provided. Still, validating all data means recreating the database, so the best one can do is to retain the references to the original data sources and let the (human) user decide what needs to be checked.

Biocarta records are created by volunteer scientists. The pictures come with literature references, but this is not very useful if one wants to validate the information automatically. Once again validation must be left to the end user. The information in KEGG is validated by the maintainers and can be taken as is.

The conclusion is that most of the validation must be left to the end user. It is theoretically possible to do some checks, e.g. using multiple sources and checking for consistency, but then the question becomes how to resolve conflicts between different data sources.
In this project the data is taken "as-is".

2.2.4

The GeneAnnotator program (GEA)

A major design goal was to maximize flexibility for the user. One result of this is that the user can enter gene names, gene symbols or LocusLink IDs in the input file. Another result is the use of an embedded database, SQLite, to store the data [4]. By using an embedded database the user does not have to set up the database and can run the application on his or her computer directly. At the same

time it is very easy for a knowledgeable user to switch to another relational database, should that be necessary.

It was important to store the result in a database not only to allow queries but also to allow GEA to be part of an analysis pipeline. The analysis does not stop with the functional annotation; the next step may be visualization of different aspects of the result or the use of different data mining algorithms. It is difficult to predict exactly what the next steps will be, but by storing the data in a database most needs can be met. However, the need to write the application to fit a particular database schema is a real problem. If a user wants to add information from another biological database, the database schema must be changed and with it a large part of the application. The problem is that even a small addition may necessitate a large change to keep the database schema in third normal form. This problem would be reduced if an object-based database or an XML database had been used instead. A future version of GEA may switch to an XML database, but right now this is not practical because the end users know relational databases and are reluctant to change to anything else.

The application reads one entry at a time from the input file and must then first retrieve all GenBank record IDs that match the entry. This is done by building a query string that is concatenated to a certain URL, which is just a way of sending a GET request over HTTP. The GenBank server then sends back the IDs of all matching records. The next step is to send another GET request to retrieve the GenBank records. At this stage some filtering is done, and only GenBank records that are also RefSeq records are retrieved. The GenBank records are retrieved as text. The data in the parsed records are stored in classes whose attributes match the tables in the database.
To retrieve more information, the LocusLink ID of each record is extracted and used to get the relevant records from LocusLink. The LocusLink ID is also used to find the relevant Biocarta and KEGG records. In all these cases the requests are sent as GET requests via HTTP. The problem with the responses from LocusLink, Biocarta and KEGG is that the application must parse HTML code. A decision was made to extract only the data we need from the HTML code, in contrast with the GenBank records, where all information is retained. By not trying to parse the records in their entirety we hope that the code will be less brittle.

The user is given the choice to either create a new database or use an existing one before the results are written to disk. SQLite comes with a client program that can be used to query an existing database. The resulting database is a single file which can easily be imported into a more full-featured database if desired. The whole application is written in Python, an object-oriented scripting language well suited to our purpose. Instead of burdening this thesis with a lot of code, I have included pseudocode for some important parts of the application in the appendix.

Figure 2.1 shows the entity-relationship diagram for the GEA database. The locuslink entity is central, which is natural since a LocusLink record represents a gene product, which is what we want to annotate. The RefSeq database has records for genomic sequences (DNA), mRNA sequences and protein sequences, so in reality one LocusLink record can correspond to at least three RefSeq records, but a decision

was made by the biologists to only include the mRNA RefSeq records, since these records refer to the other RefSeq records (the genomic and protein sequences). This is reflected in the one-to-one relationship between the locuslink and refseq entities. The chromosome attribute of the locuslink entity gives the chromosome on which the gene is located.

For many of the entities the id is actually very informative. An example is the kegg and biocarta entities, where the id is the name of the process the gene product is a part of. Sometimes the name of a process is enough; in these cases the pictures are not important. Initially we planned to store the pictures, but it takes a long time to download all pictures and the database becomes very large. In the present version we only store the URLs of the pictures. This limitation will be discussed in section 2.3.

Generif is a table of literature references. Each record in this table contains a summary of an article that describes the function of a gene product, and the URL of the article itself. The domain table holds the names of the protein domains of gene products. The URL of a domain points to a record in the conserved domain database (CDD) that holds information about that particular domain. The lltodomain table associates a gene product with a domain and contains a literature reference that justifies the association. The go table contains the name of a gene ontology term, its aspect (one of biological process, molecular function or cellular component) and the URL of the GO term. The URL actually points to a record in a database maintained by the Gene Ontology Consortium. In the present version the user must use SQL to access the data. This will change in later versions.
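To give a flavour of the SQL interface, the following sketch builds a miniature version of the schema and runs the kind of join an end user might type into the SQLite client. Column names are simplified relative to Figure 2.1, and the sample data are invented:

```python
import sqlite3

# A miniature version of three GEA tables so the join can run
# stand-alone; the real tables are filled by the application.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locuslink (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE go (id TEXT PRIMARY KEY, name TEXT, aspect TEXT, url TEXT);
    CREATE TABLE gotoll (goid TEXT, llid INTEGER);
""")
conn.execute("INSERT INTO locuslink VALUES (7157, 'TP53')")
conn.execute("INSERT INTO go VALUES ('GO:0006915', 'apoptosis', "
             "'biological process', 'http://example.org/GO:0006915')")
conn.execute("INSERT INTO gotoll VALUES ('GO:0006915', 7157)")

# The query an end user might run: all GO terms for a given gene.
rows = conn.execute("""
    SELECT go.name, go.aspect
    FROM locuslink
    JOIN gotoll ON gotoll.llid = locuslink.id
    JOIN go ON go.id = gotoll.goid
    WHERE locuslink.name = 'TP53'
""").fetchall()
print(rows)  # -> [('apoptosis', 'biological process')]
```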

2.3

What remains to be done

The application is functional and used by some of the biologists at CGB, but more work is needed before it can be considered finished.

2.3.1

Designing a user-friendly interface

The most important improvement will be to allow automatic retrieval of the resources pointed to by the URLs in the database. It will be relatively easy to add a GUI which shows the URLs as real hyperlinks. Another improvement would be to add a simplified query interface, which could be modeled on web search engines like Google. Another possibility would be to show the data as one large table and let the user mark interesting columns with the mouse. Exactly how the user interface will look is not clear, but since the users have an interest in the application they are willing to participate in its design.

2.3.2

Robustness

At present the error handling is very primitive: errors are simply ignored. If a record from a database cannot be retrieved, the application will simply proceed to the next step. An obvious improvement would be to let the application create a log

[Entity-relationship diagram omitted. The tables in the schema are: go, generif, gotoll, locuslink, refseq, biocarta, bctoll, lltodomain, keggtoll, kegg and domain.]

Figure 2.1. The entity-relationship diagram for the database schema used in GEA.

file where all problems would be recorded. Another improvement would be to let the application retry every failed operation at least once. A different kind of problem presents itself when the user wants to update the database. It is easy to add records to an existing database by running the application again with a new input file, but at present it is not possible to change the information about a gene product already in the database. Either the user must create a new database, or he or she must first remove the old record.
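The log file and the retry could be combined in a small wrapper around the retrieval functions. The flaky fetcher below is invented purely to exercise the wrapper:

```python
import logging

# Record problems in a log file instead of silently ignoring them.
logging.basicConfig(filename="gea.log", level=logging.WARNING)

def fetch_with_retry(fetch, record_id, retries=1):
    """Try a retrieval operation, retry it on failure, and log every
    problem.  `fetch` stands in for any of the application's HTTP
    retrieval functions."""
    for attempt in range(retries + 1):
        try:
            return fetch(record_id)
        except Exception as exc:
            logging.warning("attempt %d for %s failed: %s",
                            attempt + 1, record_id, exc)
    logging.error("giving up on %s", record_id)
    return None

# A deliberately flaky fetcher: fails on the first call, then succeeds.
calls = []
def flaky_fetch(record_id):
    calls.append(record_id)
    if len(calls) == 1:
        raise IOError("connection reset")
    return "record " + record_id

result = fetch_with_retry(flaky_fetch, "NM_000546")
print(result)  # -> record NM_000546
```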

2.3.3

Designing for extensibility and flexibility

As already discussed, the use of a relational database limits the extensibility of the application. We are considering an XML database, e.g. Xindice from the Apache Software Foundation [1]. Xindice is an open-source database which can be queried with XPath. Using an XML database would make it easier to add or remove information about gene products without disrupting the whole application.
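Python's xml.etree module gives a taste of this XPath-style access. The toy record below is invented; in an XML database a new child element could be added to it without any schema change, and Xindice itself supports a richer XPath dialect than the small subset ElementTree implements:

```python
import xml.etree.ElementTree as ET

# A toy XML record for a gene product (all values invented).
DOC = """
<gene llid="7157">
  <name>tumor protein p53</name>
  <go aspect="biological process">apoptosis</go>
  <go aspect="molecular function">DNA binding</go>
</gene>"""

root = ET.fromstring(DOC)

# An XPath-style query: GO terms of one aspect only.
terms = [e.text for e in root.findall("./go[@aspect='biological process']")]
print(terms)  # -> ['apoptosis']
```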


Chapter 3

Future directions

3.1

Ongoing research

There is a lot of research in bioinformatics that tries to address the problems I have touched upon in this thesis. I will briefly mention some ideas that will probably prove very useful in the area of functional annotation.

3.1.1

Web services

Web services can be seen as a variant of link integration. In this view, the heterogeneous collection of linked data sources on the web is turned upside down and becomes a web of services that are linked by service names and definitions. GenBank, for example, is no longer a database for retrieval of sequences but is transformed into a service that transforms sequence accession numbers into GenBank flat files. The difference seems minor, but it allows users to establish a common framework that can encompass many data sources.

One example of a web service is the Distributed Annotation System (DAS) [2]. DAS provides a web service for exchanging genomic annotations, information that can be associated with a region of the genome. The DAS protocol is simple: the user asks for a genomic region and the server returns a structured document that contains all annotations that overlap the specified region. The DAS service allows data providers to exchange information about annotations and allows a limited form of data integration. DAS is unfortunately semantically weak, and a lot of ongoing work tries to overcome this limitation. The first aspect of this weakness is that the annotation fields carry no type information. Another aspect is that DAS does nothing to handle the terminology problem, even though all objects exchanged through DAS have names. Two other technologies, ontologies and globally unique identifiers, go a long way towards solving these problems.
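A rough sketch of such an interaction follows. The URL layout follows the DAS 1.x convention as I understand it, and the response is a canned, heavily simplified document standing in for a real server reply:

```python
import xml.etree.ElementTree as ET

def das_features_url(server, source, ref, start, stop):
    """Build a DAS-style 'features' request for a genomic region
    (URL layout after the DAS 1.x convention; details may differ)."""
    return "%s/das/%s/features?segment=%s:%d,%d" % (
        server, source, ref, start, stop)

# Canned, simplified response document standing in for a real reply.
RESPONSE = """
<DASGFF><GFF><SEGMENT id="chr1" start="1000" stop="2000">
  <FEATURE id="exon-1"><TYPE id="exon">exon</TYPE></FEATURE>
</SEGMENT></GFF></DASGFF>"""

url = das_features_url("http://example.org", "hg", "chr1", 1000, 2000)
root = ET.fromstring(RESPONSE)
features = [f.get("id") for f in root.iter("FEATURE")]
print(features)  # -> ['exon-1']
```

Note how the response carries names ("exon-1", "exon") but no machine-checkable types, which is exactly the semantic weakness described above.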

3.1.2

Ontologies

We have already discussed one ontology, the Gene Ontology (GO), but ontologies are a very active research area in bioinformatics and there are several other ontology projects as well [24]. Ontologies cannot by themselves lead to integration of biological databases, but they can be important facilitators. The existence of an ontology allows a data integrator to merge the information in different databases with some guarantee that a term means the same thing in all databases. The Sequence Ontology (SO), e.g., defines a set of terms and definitions that describe features on a genome, such as exon, pseudogene and transcription start site.

An important feature of biological ontologies is that terms are organized hierarchically, where more specific terms are specializations of more general ones. This makes it possible to merge specific, detailed information with more general information by first moving up in the hierarchy from the specialized terms to a common, more general term. To support the complex relationships that are common in biology, terms are allowed to have more than one parent, leading to a data structure that is a DAG (directed acyclic graph). The most common type of relationship is "is-a", but other relationships are found in certain ontologies, e.g. "part-of" in GO.
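Walking up such a DAG can be sketched in a few lines. The term names below are made up and the edges are "is-a" only; a term with two parents is what makes this a DAG rather than a tree:

```python
# A toy "is-a" DAG; each term maps to its (possibly several) parents.
PARENTS = {
    "transcription start site": ["site", "transcript feature"],
    "site": ["sequence feature"],
    "transcript feature": ["sequence feature"],
    "sequence feature": [],
}

def ancestors(term, dag):
    """All more-general terms reachable by moving up the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        for parent in dag[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("transcription start site", PARENTS)))
# -> ['sequence feature', 'site', 'transcript feature']
```

Merging detailed annotations with coarser ones then amounts to finding a common term in the ancestor sets of the two annotations.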

3.1.3

Globally unique identifiers

Since the same biological object may have several names, and different biological objects may have the same name, terminology is a major problem, as already discussed. One solution might be to have a names commission manage the definitive list of such names, as the HUGO Gene Nomenclature Committee is attempting to do with human gene symbols [10]. The problem is that names come, go, are merged and are split too rapidly for any commission to keep up with. Even if a names commission could handle this, it is unclear how the changes could be propagated to the databases that depend on them.

Another solution is to create a globally unique identifier. The Life Sciences Identifier (LSID), put forward by the Interoperable Informatics Infrastructure Consortium (I3C), combines the internet domain name of the source database with the local identifier from that database [13]. An example is the C. elegans rad-3 gene, which might get the LSID "urn:lsid:www.wormbase.org:gene/rad-3". The "urn:" identifies the resource as a Uniform Resource Name (URN) to distinguish it from a Uniform Resource Locator (URL). The combination of ontologies and globally unique identifiers increases the chances that web services can exchange data without manual intervention. A future version of DAS will use SO to describe sequence annotation types and LSIDs to identify biological objects.
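Following the rad-3 example, such an identifier can be assembled and taken apart mechanically. This sketch mirrors the form used in the text; the full LSID specification has more components (such as a revision), so treat it as illustrative only:

```python
def make_lsid(authority, namespace, object_id):
    """Build an LSID-style URN from the source database's domain name
    and its local identifier, as in the rad-3 example above."""
    return "urn:lsid:%s:%s/%s" % (authority, namespace, object_id)

def parse_lsid(lsid):
    """Split an LSID-style URN back into its parts."""
    scheme, kind, authority, rest = lsid.split(":", 3)
    assert scheme == "urn" and kind == "lsid"
    namespace, object_id = rest.split("/", 1)
    return authority, namespace, object_id

lsid = make_lsid("www.wormbase.org", "gene", "rad-3")
print(lsid)             # -> urn:lsid:www.wormbase.org:gene/rad-3
print(parse_lsid(lsid)) # -> ('www.wormbase.org', 'gene', 'rad-3')
```

The point of the round trip is that any consumer can recover both the source database and the local record without a central registry.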

3.2

Recommendations

There is at present no "best practice" when it comes to functional annotation of gene products. The field is full of ad hoc solutions based on the opinions of individual

biologists, so it is very important to design for change and to develop applications in this area in close cooperation with the end users. Any application written today will be obsolete in six months, so it is no use trying to design a "killer application".


References

URLs last visited the 20th of May, 2004.

[1] Apache Software Foundation. http://xml.apache.org/xindice
[2] Biodas. http://biodas.org
[3] The Gene Ontology Consortium. Creating the Gene Ontology resource: design and implementation. Genome Research, 2001.
[4] D. R. Hipp. SQLite. http://www.sqlite.org
[5] DNA Data Bank of Japan. http://www.ddbj.nig.ac.jp/Welcome-e.html
[6] European Bioinformatics Institute. http://www.ebi.ac.uk/embl/index.html
[7] European Bioinformatics Institute. http://srs.embl-heidelberg.de:8000/srs5
[8] F. Bry and P. Kröger. A molecular biology database digest. Distributed and Parallel Databases, 2003.
[9] Georgetown University Medical Center. http://pir.georgetown.edu/home.shtml
[10] HUGO Gene Nomenclature Committee. http://www.gene.ucl.ac.uk/nomenclature
[11] I. M. Chen and V. Markowitz. An overview of the Object Protocol Model (OPM) and the OPM data management tools. Information Systems, 1995.
[12] Infobiogen. http://www.infobiogen.fr/services/dbcat
[13] Interoperable Informatics Infrastructure Consortium. http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp
[14] J. B. L. Bard and S. Y. Rhee. Ontologies in biology: design, applications and future challenges. Nature Reviews Genetics, 2004.
[15] Kyoto University Bioinformatics Center. http://www.genome.ad.jp/kegg
[16] National Cancer Institute. http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
[17] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/Genbank
[18] National Center for Biotechnology Information. http://www.ncbi.nih.gov/Structure/cdd/cdd.shtml
[19] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/RefSeq
[20] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/LocusLink
[21] National Library of Medicine. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
[22] National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov
[23] Swiss Institute of Bioinformatics. http://us.expasy.org/sprot
[24] Open Biological Ontologies. http://obo.sourceforge.net
[25] L. Stein. Integrating biological databases. Nature Reviews Genetics, 2003.
[26] The Gene Ontology Consortium. http://www.godatabase.org/cgi-bin/go.cgi
[27] The Sanger Institute. http://www.acedb.org


Appendix

Pseudocode for parts of GEA

read command-line arguments
create or open SQLite database
for entry in inputFile:
    queryString = genbankIdURL + entry
    listOfGenBankIDs = sendGETrequest(queryString)
    listOfGenBankRecords = empty list
    for id in listOfGenBankIDs:
        queryString = genbankRecordURL + id
        listOfGenBankRecords.add(sendGETrequest(queryString))
    for record in listOfGenBankRecords:
        parse record
        store in shadow classes   # the shadow classes have attributes
                                  # that mirror the database tables
        queryString = locuslinkURL + locuslinkID
        locuslinkRecord = sendGETrequest(queryString)
        extract data from locuslinkRecord
        store in shadow classes
        queryString = biocartaURL + locuslinkID
        biocartaPathways = sendGETrequest(queryString)
        extract data from biocartaPathways
        store in shadow classes
        queryString = keggURL + locuslinkID
        keggPathways = sendGETrequest(queryString)
        extract data from keggPathways
        store in shadow classes
write data in shadow classes to database

