Inferring Domain-Domain Interactions from Protein-Protein Interactions

Minghua Deng

Shipra Mehta

Fengzhu Sun†

Ting Chen†

Department of Molecular and Computational Biology, University of Southern California. Mailing address: 1042 West 36th Place, DRB 155, Los Angeles, CA 90089-1113.

† To whom the correspondence should be addressed: [email protected] and [email protected].

Copyright 2001 ACM.

ABSTRACT

Protein-protein interactions are important events in cellular and biochemical processes within a cell. Several researchers have undertaken the task of analyzing protein-protein interactions covering all genes of an organism by using yeast two-hybrid assays. Protein-protein interactions involve physical interactions between protein domains. Therefore, understanding protein interactions at the domain level gives a global view of the protein interaction network, and possibly extends functions of proteins. In this study, we present a Maximum Likelihood approach to infer domain-domain interactions from the 5719 yeast protein-protein interactions obtained in the high-throughput two-hybrid experiments by Uetz et al., 2000 and Ito et al., 2001. The accuracies of our predictions are measured at the protein level. Our study includes the following three results: (1) using the inferred domain-domain interactions, we predict interactions between proteins and achieve 39.0% specificity and 79.7% sensitivity; (2) our predicted protein-protein interactions have a significant overlap with the MIPS (http://mips.gfs.de) protein-protein interactions obtained by methods other than the two-hybrid systems; and (3) the mean correlation coefficient of the gene expression profiles for our predicted interacting pairs is significantly higher than that for random pairs as well as that of interacting pairs in Uetz's and Ito's experimental data. Our method has shown robustness in analyzing incomplete data sets and dealing with various experimental errors. We find several novel protein-protein interactions, such as RPS0A interacting with APG17 and TAF40 interacting with SPT3, which are consistent with the functions of the proteins.

1. INTRODUCTION

With the advancement of genomic technology and genome-wide analysis of organisms, more and more organisms are being extensively studied for gene expression on a global scale. Expression profiling, or measuring the expression levels of all genes in an organism using DNA microarrays or oligonucleotide chips, is increasingly used to analyze the functions of genes or to group genes functionally based on their expression profiles [1]. After the completion of the genome sequence of Saccharomyces cerevisiae, a budding yeast, in 1996 [2], several researchers have undertaken the task of functionally analyzing the yeast genome, which comprises approximately 6280 proteins (YPD), of which roughly one-third do not have known functions [3]. Transcriptome analysis can cluster genes that show a common pattern of expression and assign a biological function to them, depending on the functions of the genes in the cluster. However, expression profiling gives only an indirect measure of a gene product's biological and cellular function. A more complete study of an organism could possibly be achieved by looking not only at the mRNA levels but also at the proteins they encode. It is well known that mRNA levels alone are not sufficient to group genes based on their expression, since not all mRNAs end up being translated.

All biological functions within a cell are carried out by proteins. Most cellular processes and biochemical events are ultimately achieved by interactions of proteins with one another. It is therefore worthwhile to look at protein expression and protein interactions simultaneously. Affinity chromatography, the two-hybrid approach, copurification, immunoprecipitation and cross-linking are some of the tools used to verify proteins that physically associate. Of these, the two-hybrid approach has been widely used to analyze the protein-protein interactions in Saccharomyces cerevisiae [4, 5, 6].
Their protein interaction profiles have made it possible to look at complexes comprising large numbers of proteins and also to functionally classify proteins of unknown function. Uetz et al. [4] used two different approaches in their two-hybrid experiments. The first was a protein array approach with 192 yeast proteins as bait (Gal4 DNA-binding domain fusions) and about 6000 yeast transformants as prey (Gal4 activation domain fusions). The second, the interaction sequence tag (IST) approach, used high-throughput screens of a pooled activation domain library encoding roughly 6000 yeast genes. All yeast proteins were cloned into DNA-binding domain vectors; of the 6144 yeast ORF (open reading frame) PCR products, 5345 ORFs were successfully cloned. Their first approach revealed 281 interactions using the less stringent HIS3 selection. The second approach revealed 692 interactions with the more stringent URA3 selection method. Ito et al. [6] used a similar method and reported 4549 interactions among 3278 proteins. Some interactions in both data sets were repeated (bait and prey exchanged). They imposed a more rigorous selection involving four reporter genes, ADE2, HIS3, URA3 and MEL1, to minimize false positives due to promoter-specific activation. All of these genes have Gal4-responsive promoters.

Computational approaches to identifying protein functions and interactions include the Rosetta stone/gene fusion method [7, 8], the phylogenetic profile method [9] and a method combining multiple sources of data [10]. The first two methods use protein sequence homologues to predict protein interactions and functional correlations. The combined method used the results of the first two methods together with other sources, such as protein pairs with correlated mRNA expression levels [11] and experimentally determined protein-protein interactions. Other computational methods to predict protein-protein interactions have been presented based on different principles, including the interaction-domain pair profile method [12, 13] and the support vector machine learning method [14].

In our study, we use the protein-protein interaction (PPI) data sets of Uetz and Ito to predict domain-domain interactions (DDI) in yeast proteins. The protein-domain information is obtained from Pfam [18], a protein domain family database. Since every protein can be characterized by distinct domains or a combination of domains, understanding domain interactions is crucial to understanding the nature and extent of biomolecular interactions. Our study predicts probable domain-domain interactions based solely on the information in protein-protein interactions.
Since proteins interact with each other through their specific domains, predicting domain-domain interactions on a global scale from the entire protein interaction data set can make it possible to predict previously unknown protein-protein interactions from their domains. Thus, domain interactions extend the functional significance of proteins and present a global view of the protein-protein interaction networks within a cell that carry out various biological and cellular functions.

The yeast two-hybrid approach is known to be inaccurate in determining protein-protein interactions, and the interaction data used in our study certainly contain many false positive and false negative errors [6, 15, 16, 17]. Taking these errors into account, we apply the Maximum Likelihood approach to estimate the probability of domain-domain interactions. We have also taken into account the multiplicity of observations in the two data sets, as evidenced by exchanged bait and prey, repeated interactions and synonyms used for gene names. To assess the accuracy of our method, we predict protein-protein interactions using the inferred domain-domain interactions, and compare them with the observed interactions. Our method has shown robustness in analyzing incomplete data sets and dealing with various experimental errors. We achieve 39.0% specificity and 79.7% sensitivity using the combined data of Uetz and Ito. By comparing our predicted protein-protein interactions with the MIPS protein-protein interactions obtained by methods other than the two-hybrid systems, we show that the prediction rate of our method is 90 times better than that of a random assignment. We also compare the gene expression profile correlation coefficients of our predictions with those of Uetz's and Ito's experimental data, and our predictions have a higher mean correlation coefficient. Finally, we check the biological significance of our novel predictions, and find several interesting interactions, such as RPS0A interacting with APG17 and TAF40 interacting with SPT3, which are consistent with the functions of the proteins. A complete description of our model and analysis results follows in the sections below.

2. MATERIALS AND METHODS

2.1 Source of Data

We obtain the protein-domain relationship for yeast proteins from Pfam [18], a protein domain family database. Pfam contains multiple sequence alignments for each domain family and uses profile hidden Markov models to find domains in new proteins. The latest version, Pfam 6.5 (http://pfam.wustl.edu/), contains alignments for 2929 protein domain families in Pfam-A and 57891 domain families in Pfam-B. The protein sequences are derived from the SWISSPROT 39 and TrEMBL 14 databases [19]. Domains in Pfam-A are well-defined because the corresponding multiple alignments and hidden Markov models are manually checked, and most of the domains have assigned functions. Pfam-B was generated automatically by programs and includes ProDom domains [20] not covered by Pfam-A. We download both Pfam-A and Pfam-B from the Pfam ftp site, including domain names and the proteins containing the domains in various organisms. We extract domains along with the S. cerevisiae gene names and obtain domain information from both Pfam-A and Pfam-B. In this process, we associate the yeast gene accession numbers with the corresponding SWISSPROT and TrEMBL accession numbers to locate all yeast genes. Proteins for which no domain information is available are classified as super-domains (Table 1), and co-existing domains are merged as super-domains as well.

2.2 Association method

A simple measure of interaction between domain $D_m$ and domain $D_n$ is the fraction of interacting protein pairs among all the protein pairs containing the domain pair $(D_m, D_n)$. Let $I_{mn}$ be the number of interacting protein pairs containing domain pair $(D_m, D_n)$, and $N_{mn}$ be the total number of protein pairs containing $(D_m, D_n)$. The association measure is given by

$$A(D_m, D_n) = \frac{I_{mn}}{N_{mn}}.$$

The method relies on the accuracy of the observed data, and in our case, observed interactions are treated as real interactions. However, this method computes domain-domain interactions locally. By locally, we mean that the association measure ignores other domain-domain interaction information between the protein pairs and thus does not make full use of all the available information. For example, suppose proteins $P_i$, $P_j$ and $P_k$ contain domains $\{D_a, D_x\}$, $\{D_y, D_b\}$ and $\{D_y, D_c\}$, respectively, and domains $D_x$ and $D_y$ do not appear in any other proteins. If we observe $P_i$ interacting with $P_j$ and $P_i$ interacting with $P_k$, then $A(D_x, D_y) = 2/2 = 1$. Obviously, this kind of local method ignores other domain-domain interactions, such as $D_a$ interacting with $D_b$ and $D_c$. In fact, it is possible that $D_x$ and $D_y$ do not interact with each other but $D_a$ interacts with both $D_b$ and $D_c$. Therefore, to infer a domain-domain interaction, other related domain-domain interactions have to be taken into account. This means that interactions of other proteins containing domains $D_a$, $D_b$ or $D_c$ are to be included, and thus more domains and proteins are involved. Iterating this idea, eventually all proteins and all domains are related and need to be taken into account. The association method also ignores experimental errors. In the following, we develop a global approach using a Maximum Likelihood Estimation method to incorporate all the proteins and the domains as well as experimental errors.
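The association measure can be illustrated with a minimal Python sketch. The function name `association_scores` and the data layout (a dictionary from protein name to a set of domain names) are assumptions made for illustration, as is the choice to enumerate unordered protein pairs including self-pairs. The toy data reproduce the three-protein example above.

```python
from collections import defaultdict
from itertools import product


def association_scores(protein_domains, observed_pairs):
    """Return A(Dm, Dn) = I_mn / N_mn for every domain pair seen in the data.

    protein_domains: dict mapping protein name -> set of domain names.
    observed_pairs: iterable of (protein, protein) pairs observed to interact.
    """
    observed = {frozenset(pair) for pair in observed_pairs}
    proteins = sorted(protein_domains)
    n_mn = defaultdict(int)  # protein pairs containing the domain pair
    i_mn = defaultdict(int)  # interacting protein pairs containing it
    for idx, pi in enumerate(proteins):
        # Assumption: unordered protein pairs, self-pairs included.
        for pj in proteins[idx:]:
            domain_pairs = {
                frozenset((dm, dn))
                for dm, dn in product(protein_domains[pi], protein_domains[pj])
            }
            interacting = frozenset((pi, pj)) in observed
            for dp in domain_pairs:
                n_mn[dp] += 1
                if interacting:
                    i_mn[dp] += 1
    return {dp: i_mn[dp] / n_mn[dp] for dp in n_mn}


# Toy data from the example in the text.
example_domains = {"Pi": {"Da", "Dx"}, "Pj": {"Dy", "Db"}, "Pk": {"Dy", "Dc"}}
example_observed = [("Pi", "Pj"), ("Pi", "Pk")]
scores = association_scores(example_domains, example_observed)
```

On this toy input both $(D_x, D_y)$ and $(D_a, D_b)$ receive the maximal score, which is the ambiguity the text describes: the local measure cannot tell which domain pair is responsible for the observed interactions.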

2.3 Maximum Likelihood Estimation

Let $D_1, \ldots, D_M$ denote the $M$ domains, and $P_1, \ldots, P_N$ denote the $N$ proteins. Let $P_{ij}$ denote the protein pair of $P_i$ and $P_j$, and $D_{ij}$ denote the domain pair of $D_i$ and $D_j$. Assuming that protein $P_1$ contains domains $\{D_1, D_2, D_3\}$ and protein $P_2$ contains domains $\{D_1, D_4\}$, we denote $P_{12} = \{D_{11}, D_{12}, D_{13}, D_{14}, D_{24}, D_{34}\}$ to be the set of domain pairs for the protein pair $P_{12}$. We treat protein-protein interactions and domain-domain interactions as random variables. Let $P_{ij} = 1$ if protein $P_i$ and protein $P_j$ interact with each other and $P_{ij} = 0$ otherwise. Similarly, let $D_{mn} = 1$ if domain $D_m$ interacts with domain $D_n$ and $D_{mn} = 0$ otherwise. We make the following assumptions throughout the paper.

Assumption I: Domain-domain interactions are independent, which means that the event that two domains interact or not does not depend on other domains.

Assumption II: Two proteins interact if and only if at least one pair of domains from the two proteins interact.

Under the above assumptions, we have

$$\Pr(P_{ij} = 1) = 1.0 - \prod_{D_{mn} \in P_{ij}} (1 - \lambda_{mn}), \tag{2.1}$$

where $\lambda_{mn} = \Pr(D_{mn} = 1)$ denotes the probability that domain $D_m$ interacts with domain $D_n$.

We consider two types of experimental errors in the two-hybrid systems: false positives, where two proteins do not interact in reality but were observed to be interacting in the experiments, and false negatives, where two proteins interact in reality but were not observed to be interacting in the experiments. The false positive rate is denoted as $fp$ and the false negative rate is denoted as $fn$. Let $O_{ij}$ be the variable for the observed interaction result for proteins $P_i$ and $P_j$: $O_{ij} = 1$ if the interaction is observed and $O_{ij} = 0$ otherwise. Then

$$fp = \Pr(O_{ij} = 1 \mid P_{ij} = 0), \qquad fn = \Pr(O_{ij} = 0 \mid P_{ij} = 1).$$

Thus, the probability for the observed protein-protein interaction is

$$\begin{aligned}
\Pr(O_{ij} = 1) &= \Pr(O_{ij} = 1, P_{ij} = 1) + \Pr(O_{ij} = 1, P_{ij} = 0) \\
&= \Pr(O_{ij} = 1 \mid P_{ij} = 1)\Pr(P_{ij} = 1) + \Pr(O_{ij} = 1 \mid P_{ij} = 0)(1 - \Pr(P_{ij} = 1)) \\
&= \Pr(P_{ij} = 1)(1 - fn) + (1 - \Pr(P_{ij} = 1))\,fp.
\end{aligned} \tag{2.2}$$
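Eq 2.1 and Eq 2.2 can be evaluated directly. The following Python sketch uses hypothetical function names and an illustrative dictionary `lam` of domain-pair probabilities; the numerical values are made up for the example and are not estimates from the data.

```python
from math import prod


def prob_protein_interaction(domain_pairs, lam):
    """Eq 2.1: Pr(P_ij = 1) = 1 - prod over domain pairs of (1 - lambda_mn)."""
    return 1.0 - prod(1.0 - lam[dp] for dp in domain_pairs)


def prob_observed(p_interact, fp, fn):
    """Eq 2.2: Pr(O_ij = 1) = Pr(P_ij = 1)(1 - fn) + (1 - Pr(P_ij = 1)) fp."""
    return p_interact * (1.0 - fn) + (1.0 - p_interact) * fp


# Illustrative values only: two domain pairs shared by one protein pair.
lam = {("D1", "D4"): 0.5, ("D2", "D4"): 0.2}
p_int = prob_protein_interaction(lam.keys(), lam)   # 1 - 0.5 * 0.8
p_obs = prob_observed(p_int, fp=1.0e-5, fn=0.85)
```

With a large false negative rate such as $fn = 0.85$, even a protein pair that is likely to interact has a small probability of being observed, which is why the error rates have to enter the likelihood explicitly.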

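The likelihood function, i.e., the probability of the observed whole-proteome interaction data, is

$$L = \prod \left(\Pr(O_{ij} = 1)\right)^{o_{ij}} \left(1 - \Pr(O_{ij} = 1)\right)^{1 - o_{ij}}, \tag{2.3}$$

with the product taken over all protein pairs, where

$$o_{ij} = \begin{cases} 1 & \text{if the interaction of } P_i \text{ and } P_j \text{ is observed}, \\ 0 & \text{otherwise}. \end{cases}$$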
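In practice the likelihood of Eq 2.3 is a product of millions of small factors, so it is usually handled on the log scale. The sketch below is one way to do that; the function name and the dictionary layout are assumptions, and it presumes every $\Pr(O_{ij} = 1)$ lies strictly between 0 and 1, which holds whenever $0 < fp < 1$ and $0 < fn < 1$.

```python
from math import log


def log_likelihood(pr_obs, observed):
    """Log of Eq 2.3.

    pr_obs: dict mapping protein pair -> Pr(O_ij = 1), each strictly in (0, 1).
    observed: set of protein pairs whose interaction was observed (o_ij = 1).
    """
    total = 0.0
    for pair, p in pr_obs.items():
        if pair in observed:
            total += log(p)          # contributes Pr(O_ij = 1)
        else:
            total += log(1.0 - p)    # contributes 1 - Pr(O_ij = 1)
    return total


# Illustrative values only.
toy_pr_obs = {("P1", "P2"): 0.09, ("P1", "P3"): 0.5}
toy_ll = log_likelihood(toy_pr_obs, observed={("P1", "P2")})
```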

The likelihood $L$ is a function of $\theta = (\lambda_{mn}, fp, fn)$. In the following, we fix $fp$ and $fn$. We estimate $\theta$ using maximum likelihood estimation (MLE). Because of the high dimensionality of $\theta$, it is difficult to maximize $L$ directly. We develop an Expectation-Maximization (EM) algorithm to solve the problem.

The idea of EM algorithms for a general problem is as follows. In order to obtain the MLE of the parameters, we supplement the observed data with data that are not observable; together they are called the complete data. In an EM algorithm, we distinguish the observed data $Y$ from the complete data $Z$. We can obtain the MLE of the unknown parameters $\theta$ based on the complete data $Z$. We should also be able to calculate the expectation of $Z$ given the observed data. There are two steps in an EM algorithm: the expectation (E) step and the maximization (M) step. In the E-step, we calculate the expectation of the complete data $Z$ given the observed data $Y$, $\hat{Z} = E(Z \mid Y, \theta^{(t-1)})$. In the M-step, we obtain the MLE of $\theta$, $\theta^{(t)}$, based on $\hat{Z}$. Thus we obtain a recursive formula to estimate the parameters $\theta$.

Next, we adapt the EM algorithm to our problem. The observed data are the experimentally observed interactions $O = \{O_{ij} = o_{ij},\ i \le j\}$. The complete data include all the domain-domain interactions for each protein-protein pair. Let $A_m$ be the set of proteins containing domain $D_m$. Let $N_{mn}$ be the total number of protein pairs between $A_m$ and $A_n$. In order to estimate $\lambda_{mn}$, the probability that domain $D_m$ interacts with domain $D_n$, we need information on the interaction status for protein pairs between $A_m$ and $A_n$. Define the complete data as $(O, D)$, where $O$ is given above and $D = \{D_{mn}^{(ij)},\ P_i \in A_m,\ P_j \in A_n,\ \forall m, n\}$. Here $D_{mn}^{(ij)} = 1$ if domain $D_m$ and domain $D_n$ interact in the protein pair $P_i$ and $P_j$, and $D_{mn}^{(ij)} = 0$ otherwise. We derive the EM algorithm as follows. The E-step is:

$$\begin{aligned}
E\left(D_{mn}^{(ij)} \mid O_{kl} = o_{kl},\ \forall k, l,\ \theta^{(t-1)}\right)
&= E\left(D_{mn}^{(ij)} \mid O_{ij} = o_{ij},\ \theta^{(t-1)}\right) \\
&= \frac{\Pr\left(D_{mn}^{(ij)} = 1,\ O_{ij} = o_{ij} \mid \theta^{(t-1)}\right)}{\Pr\left(O_{ij} = o_{ij} \mid \theta^{(t-1)}\right)} \\
&= \frac{\Pr\left(D_{mn}^{(ij)} = 1 \mid \theta^{(t-1)}\right) \Pr\left(O_{ij} = o_{ij} \mid D_{mn}^{(ij)} = 1,\ \theta^{(t-1)}\right)}{\Pr\left(O_{ij} = o_{ij} \mid \theta^{(t-1)}\right)} \\
&= \frac{\lambda_{mn}^{(t-1)} (1 - fn)^{o_{ij}}\, fn^{1 - o_{ij}}}{\Pr\left(O_{ij} = o_{ij} \mid \theta^{(t-1)}\right)},
\end{aligned}$$

where the denominator can be calculated using Eq 2.2. The MLE of $\lambda_{mn}$ is the fraction of $\{D_{mn}^{(ij)},\ P_i \in A_m,\ P_j \in A_n\}$ such that $D_{mn}^{(ij)} = 1$. We thus obtain a recursive formula for the M-step:

$$\begin{aligned}
\lambda_{mn}^{(t)} &= \frac{1}{N_{mn}} \sum_{i \in A_m,\ j \in A_n} E\left(D_{mn}^{(ij)} \mid O_{kl} = o_{kl},\ \forall k, l,\ \theta^{(t-1)}\right) \\
&= \frac{\lambda_{mn}^{(t-1)}}{N_{mn}} \sum_{i \in A_m,\ j \in A_n} \frac{(1 - fn)^{o_{ij}}\, fn^{1 - o_{ij}}}{\Pr\left(O_{ij} = o_{ij} \mid \theta^{(t-1)}\right)}.
\end{aligned} \tag{2.4}$$

            Proteins   Pfam domains   Super-domains   PPI
Uetz        1337       1330           313             1445
Ito         3277       2776           909             4475
Combined    3729       3124           1007            5719
Overlap     855        964            215             201

Table 1: The number of proteins, domains and protein-protein interactions (PPI) in the Uetz, the Ito, the combined Uetz and Ito, and the overlap data sets. A super-domain is either an ORF without domain information or a merge of multiple domains.

The EM algorithm is described in the following:

1. Initialization: Choose initial values for $\{\lambda_{mn},\ \forall m, n\}$, and compute $\Pr(P_{ij} = 1)$ by Eq 2.1 and $\Pr(O_{ij} = 1)$ by Eq 2.2;
2. Update the parameters $\{\lambda_{mn},\ \forall m, n\}$ by Eq 2.4 and compute the likelihood function by Eq 2.3;
3. Go to step 2, and repeat until the value of the likelihood function is unchanged (within a certain error).
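The three steps above can be sketched in Python as follows. This is an illustrative toy implementation under stated assumptions, not the authors' code: the function name, the uniform initial value `init`, the tolerance `tol`, the iteration cap `max_iter` and the brute-force enumeration of all unordered protein pairs (self-pairs included) are choices made here for clarity, and they would not scale to the full proteome without further engineering.

```python
from math import log, prod


def em_domain_interactions(protein_domains, observed, fp, fn,
                           init=0.1, tol=1e-8, max_iter=1000):
    """Estimate lambda_mn with the update of Eq 2.4, for fixed fp and fn."""
    if not (0.0 < fp < 1.0 and 0.0 < fn < 1.0):
        raise ValueError("fp and fn must lie strictly between 0 and 1")
    proteins = sorted(protein_domains)
    obs = {tuple(sorted(pair)) for pair in observed}
    # Domain pairs carried by each unordered protein pair (i <= j).
    pair_domains = {}
    for idx, pi in enumerate(proteins):
        for pj in proteins[idx:]:
            pair_domains[(pi, pj)] = {
                frozenset((dm, dn))
                for dm in protein_domains[pi] for dn in protein_domains[pj]
            }
    # members[mn]: protein pairs containing domain pair mn (its length is N_mn).
    members = {}
    for pp, dps in pair_domains.items():
        for dp in dps:
            members.setdefault(dp, []).append(pp)
    lam = {dp: init for dp in members}

    def pr_observed(pp):
        p_int = 1.0 - prod(1.0 - lam[dp] for dp in pair_domains[pp])  # Eq 2.1
        return p_int * (1.0 - fn) + (1.0 - p_int) * fp                # Eq 2.2

    def loglik(p1):                                                   # log of Eq 2.3
        return sum((log(p) if pp in obs else log(1.0 - p))
                   for pp, p in p1.items())

    p1 = {pp: pr_observed(pp) for pp in pair_domains}
    history = [loglik(p1)]
    for _ in range(max_iter):
        new_lam = {}
        for dp, pps in members.items():
            # Sum of (1-fn)^o fn^(1-o) / Pr(O_ij = o_ij) over the N_mn pairs.
            s = sum((1.0 - fn) / p1[pp] if pp in obs else fn / (1.0 - p1[pp])
                    for pp in pps)
            new_lam[dp] = min(1.0, lam[dp] * s / len(pps))            # Eq 2.4
        lam = new_lam
        p1 = {pp: pr_observed(pp) for pp in pair_domains}
        history.append(loglik(p1))
        if abs(history[-1] - history[-2]) < tol:
            break
    return lam, history


# Toy data: three single-domain proteins, one observed interaction.
toy_lam, toy_history = em_domain_interactions(
    {"P1": {"A"}, "P2": {"B"}, "P3": {"C"}}, [("P1", "P2")], fp=1.0e-5, fn=0.85)
```

On the toy input the only domain pair supported by an observed interaction, (A, B), is driven toward 1, while the unsupported pairs decay toward 0. The decay is slow because with $fn = 0.85$ a missing observation is only weak evidence against an interaction.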

3. RESULTS

The two sources of protein-protein interactions are listed in Table 1. The domains are extracted from the Pfam database, including both Pfam-A and Pfam-B. Any protein without any domain information is treated as one super-domain, and two or more domains that always co-exist in proteins are merged into one super-domain. We apply both the Association method and the MLE method to estimate domain-domain interactions. However, it is difficult to estimate the accuracy of our predictions at the domain level because very few domain-domain interactions are known. So we use the inferred domain-domain interactions to predict protein-protein interactions and assess the prediction accuracy at the protein level. The accuracy of the protein-protein interaction predictions is assessed by specificity and sensitivity. The specificity, denoted as $SP$, is defined as the percentage of matched interactions between the predicted set and the observed set over the total number of predicted interactions. The sensitivity, denoted as $SN$, is defined as the percentage of matched interactions over the total number of observed interactions.
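The two accuracy measures can be computed from the predicted and observed sets of protein pairs. The Python sketch below uses a hypothetical function name and toy pairs. Note that $SP$ as defined here is the fraction of predictions that are matched, which is the quantity often called precision elsewhere.

```python
def specificity_sensitivity(predicted, observed):
    """Return (SP, SN) in percent for unordered protein pairs."""
    predicted = {frozenset(pair) for pair in predicted}
    observed = {frozenset(pair) for pair in observed}
    matched = len(predicted & observed)
    sp = 100.0 * matched / len(predicted) if predicted else 0.0
    sn = 100.0 * matched / len(observed) if observed else 0.0
    return sp, sn


# Toy pairs; ("B", "A") matches ("A", "B") because pairs are unordered.
toy_sp, toy_sn = specificity_sensitivity(
    predicted=[("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")],
    observed=[("B", "A"), ("C", "D"), ("D", "E")],
)
```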

3.1 Results of Association Method

Two proteins are predicted to be interacting if there exist two domains, one from each protein, whose association value is greater than a pre-defined threshold. The specificity and the sensitivity curves are shown in Figure 1. We achieve 47.4% specificity and 46.7% sensitivity by setting the threshold to 0.5 using the combined data sets.

Figure 1: Specificity and sensitivity of prediction based on the association method, using all 6359 proteins. (Plot: prediction rate, %, versus probability cutoff; curves: specificity and sensitivity.)

3.2 Results of the MLE Method

We apply the EM algorithm recursively to derive domain-domain interaction probabilities from the combined Uetz and Ito data with fixed $fp$ and $fn$. It was estimated in [15] that each protein interacts with about $t = 5$ to $50$ proteins, which gives a total number of $t \times N/2$ real interaction pairs. For $t = 5$, this number is 15898. Therefore,

$$\begin{aligned}
fn &= \Pr(O_{ij} = 0 \mid P_{ij} = 1) \\
&= 1.0 - \frac{\Pr(O_{ij} = 1, P_{ij} = 1)}{\Pr(P_{ij} = 1)} \\
&\ge 1.0 - \frac{\Pr(O_{ij} = 1)}{\Pr(P_{ij} = 1)} \\
&\approx 1.0 - \frac{\text{number of observed interaction pairs}}{\text{number of real interaction pairs}} \\
&\ge 1.0 - \frac{5719}{15898} \approx 0.64.
\end{aligned}$$

Similarly, we can estimate the range of $fp$. There are in total $N(N+1)/2 = 20221620$ protein pairs, of which about $tN/2$ are potentially interacting pairs. Therefore,

$$\begin{aligned}
fp &= \Pr(O_{ij} = 1 \mid P_{ij} = 0) \\
&= \frac{\Pr(O_{ij} = 1, P_{ij} = 0)}{\Pr(P_{ij} = 0)} \\
&\le \frac{\Pr(O_{ij} = 1)}{\Pr(P_{ij} = 0)} \\
&= \frac{\text{number of observed interaction pairs}}{\text{total protein pairs} - \text{number of real interaction pairs}} \\
&= \frac{5719}{N(N+1)/2 - tN/2} \le \frac{5719}{N(N+1)/2 - 50N/2} \approx 2.85\mathrm{E}{-4}.
\end{aligned}$$
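The arithmetic behind these two bounds is easy to check. The short Python sketch below uses the numbers stated in the text ($N = 6359$ proteins, 5719 observed interactions, $t$ between 5 and 50); the function name is hypothetical.

```python
def error_rate_bounds(n_proteins, n_observed, t_low, t_high):
    """Return (total pairs, lower bound on fn, upper bound on fp)."""
    total_pairs = n_proteins * (n_proteins + 1) // 2
    real_low = t_low * n_proteins / 2    # fewest real interaction pairs
    real_high = t_high * n_proteins / 2  # most real interaction pairs
    fn_lower = 1.0 - n_observed / real_low
    fp_upper = n_observed / (total_pairs - real_high)
    return total_pairs, fn_lower, fp_upper


total_pairs, fn_lower, fp_upper = error_rate_bounds(6359, 5719, t_low=5, t_high=50)
print(total_pairs, round(fn_lower, 2), f"{fp_upper:.2e}")
```

The bounds are loose by construction: the false negative bound uses the smallest plausible number of real interactions, and the false positive bound treats every observed interaction as if it could be false.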

In the following we let $fp \approx 1.0\mathrm{E}{-5}$. Two proteins are predicted to interact if their interaction probability is greater than a certain threshold. The results for the combined data with $fp = 1.0\mathrm{E}{-5}$ and $fn = 0.85$ are shown in Figure 2. We achieve $SP = 39.0\%$ and $SN = 79.7\%$ by setting the threshold to 0.90.

Figure 2: Specificity and sensitivity of protein-protein interaction prediction based on the domain-domain interactions estimated by the maximum likelihood method with $fp = 1.0\mathrm{E}{-5}$ and $fn = 0.85$. (Plot: prediction rate, %, versus probability cutoff; curves: specificity and sensitivity.)

Figure 3 shows the relationship between sensitivity and specificity for both the association method and the MLE method. We can see that the MLE approach outperforms the association method. For a given sensitivity, the specificity of the MLE approach is always higher than that of the association method.

Figure 3: Comparison of prediction rates between the association method and the maximum likelihood method. (Plot: sensitivity, %, versus specificity, %; curves: association method, and MLE with fixed $fp = 1.0\mathrm{E}{-5}$ and $fn = 0.85$.)

Figure 4 shows that the specificity and the sensitivity are quite similar for various combinations of $fp$ and $fn$ values. This indicates that the MLE method is robust with respect to experimental errors, and is capable of predicting the core interactions in the data.

Figure 4: Comparison of specificity and sensitivity of predictions by the maximum likelihood approach for different values of $fp$ and $fn$. (Plot: sensitivity, %, versus specificity, %; curves: $fp = 1.0\mathrm{E}{-5}$ with $fn = 0.80$, $0.85$, $0.90$ and $0.95$.)

3.3 Validations of MLE Predictions

The statistical significance of our predictions can be measured by comparing our results with the protein-protein interactions in the MIPS database and the gene expression profiles.

Comparing with MIPS

We use MIPS physical interaction pairs [3] to test our predictions. There are 2570 PPIs in the MIPS physical interaction table. By excluding those interactions that overlap with Uetz's and Ito's original experimental protein-protein interactions, we obtain a test data set with 1414 MIPS interactions. These interactions are not in the training set, and we want to measure whether our method can predict them.

Our method gives the probability of interaction for each protein pair. The larger the probability, the more likely the interaction is real. Table 2 shows the matching numbers between the 1414 interactions and our predicted interactions with probability greater than some threshold. If our approach is reasonable, a real interaction should be more likely than a random pair to fall in the high probability categories. To measure this excess, we calculate the ratio of the fraction of predicted protein pairs that are in the test data set to the fraction of all protein pairs that are in the test data set. We denote this quantity by Fold:

$$\mathrm{Fold} = \frac{k_0 / n}{K / N},$$

where $N$ is the total number of protein pairs, $n$ is the number of protein pairs with interaction probability greater than some threshold, $K = 1414$, and $k_0$ is the number of matching protein pairs between the 1414 interactions in the test data set and the $n$ predicted interactions. Table 2 shows that the fold number increases as the threshold increases. This is consistent with our prediction. The 1414 protein pairs in the test data set are about 90 times more likely than random pairs to have interaction probability greater than 0.975. To test statistically whether the 1414 protein pairs in the test data set are more likely to have interaction probability greater than the threshold, we use the standard Z-score.
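The fold enrichment, together with the usual normal-approximation Z-score for a binomial count ($k_0$ matches among $n$ predictions, each matching with probability $p = K/N$ under the null hypothesis), can be sketched in Python as below. The function name is hypothetical. The example call rests on an assumption about how the table was computed: the Fold and Z score reported in Table 2 for threshold 0.975 appear to be reproduced when the training pairs are excluded from both $n$ and $N$, and that reading is used here.

```python
from math import sqrt


def fold_and_z(k0, n, K, N):
    """Return (Fold, Z) for k0 test-set matches among n predicted pairs."""
    p = K / N                                   # chance that a random pair is in the test set
    fold = (k0 / n) / p
    z = (k0 - n * p) / sqrt(n * p * (1.0 - p))  # normal approximation to a binomial count
    return fold, z


# Threshold 0.975 row of Table 2, with training pairs removed from n and N
# (an assumption about how the table values were obtained).
fold_975, z_975 = fold_and_z(k0=40, n=10824 - 4461, K=1414, N=20221620 - 5719)
```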

The Z-score is

$$Z = \frac{k_0 - np}{\sqrt{np(1 - p)}},$$

where $p = K/N$. $Z$ has an approximately standard normal distribution under the null hypothesis. Both Z-scores and P-values are given in Table 2.

It should be noted that setting the threshold to 0.975 gives 10824 − 4461 = 5363 novel protein-protein interactions. The matches between these interactions and the 1414 MIPS1 interactions are merely 40. However, the small number of overlaps is probably due to the large size of the whole yeast protein interactome. Suppose both the 5363 novel predictions and the 1414 MIPS interactions are real, and they are random subsets of the whole protein interactome. Then the size of the interactome can be estimated by

$$\frac{5363 \times 1414}{40} = 189582,$$

which is equivalent to 50 interactions per protein.

Threshold   # Prediction   # Train   # MIPS   # MIPS1*   Fold    Z score   P value
All         20221620       5719      2570     1414       1
0.00        136463         5719      1265     109        11.92   33.02     2.9e-291
0.20        26908          5238      1093     53         34.97   41.82
0.40        19360          5018      1035     48         47.85   46.92
0.60        14725          4775      971      47         67.53   55.51
0.80        12734          4647      932      43         76.02   56.42
0.975       10824          4461      886      40         89.88   59.29

Table 2: Number of protein pairs that match between our predictions (Prediction) and the training set (Train), the MIPS data (MIPS), and the MIPS data excluding the training data (MIPS1), respectively, and the corresponding statistics (Fold, Z score and P value). *MIPS1: MIPS PPIs excluding Uetz's and Ito's PPIs (training set).

Comparing with Gene Expression Profiles

Recently, it was noted that genes with similar expression profiles are likely to encode interacting proteins [21, 22]. As in [21], we plot the distributions of the pairwise correlation coefficient in Figure 5 for all ORF pairs, our predicted protein pairs with probability $\ge 0.975$, Ito's and Uetz's original data, and the MIPS interaction data. To test whether the expression correlation coefficient for gene pairs in our prediction and in the experimental data sets is significantly higher than that for all the ORF pairs, we calculate the T-score and the P-value for the null hypothesis of no difference between the sample (MIPS and our prediction) mean and the mean of all ORF pairs. The results are listed in Table 3. The T-scores are calculated as the standard two-sample T-test statistics, i.e., by the following formula:

$$T = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}\ \sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}},$$

where $\mu$ is the mean of the samples, and

$$S = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2}$$

is the standard deviation. From Figure 5 and Table 3, we see that the mean correlation coefficient for protein pairs with interaction probability greater than a certain threshold increases with the threshold, although the difference is not big. This result is consistent with our prediction. What is interesting is that our T-score for the protein pairs with probability $\ge 0.975$, 3.8, is greater than the T-score for Uetz's and Ito's data, 2.3.

3.4 Biological significance of novel prediction

0.1

0.08

0.06

0.04

0.02

0 −1

−0.8

−0.6

−0.4

−0.2 0 0.2 0.4 Expression correlation coefficient

0.6

0.8

1

Figure 5: Distribution of the pairwise correlation coeÆcients of gene expression pro les for interaction proteins in MIPS data, Uetz and Ito's data, our prediction with threshold 0:975 and for all ORF pairs. Novel protein-protein interactions are predicted from our probabilistic model. The top 17 predictions with probability greater than 0.95 are listed in Table 4. We observe four interactions where one of the interactors has unknown function. ORF YOL083W and ORF YNL078W are shown to interact with transcription factor, TFB1(TFIIH subunit) and MRK1, a serine/threonine kinase, respectively. It is possible that the two ORFs have some role to play in the transcription machinery associated with POL II or DNA repair (TFIIH is also involved in DNA repair) and the kinase pathway of MRK1 respectively. Some of our predictions such as CTT1-PEX14 and TAF40SPT3 interactions are signi cant. PEX14 facilitates docking interactions at the peroxisomal membrane receptors and catalase T is an oxidative enzyme that degrades hydrogen peroxide. Since peroxisomes release enzymes that reduce oxygen stress in the cell, there is a logical interaction between the two proteins. An interesting nding was the SPT3 interaction with TAF40. SPT3 is a component of the nucleosomal HAT (histone acetyl transferase) complex and is TBP associated. TAF40 is also TBP associated and is a transcription factor in Pol II transcription. Thus, at some point between histone acetylation, which facilitates the transcription machinery to bind to DNA and the recruitment of transcription factors to DNA, we nd an interaction between

pairs # pairs All ORFs 3036880 0.20 6392 0.40 4433 0.60 3318 0.80 2756 0.975 2266 Uetz+Ito 1307 MIPS 1106

mean 0.0428 0.0514 0.0510 0.0598 0.0626 0.0628 0.0586 0.1109

std 0.2473 0.2550 0.2538 0.2579 0.2622 0.2637 0.2587 0.2767

T-score 0.0000 2.7984 2.2232 3.9644 4.2196 3.8482 2.3213 9.1619

p-value 5.000e-01 2.575e-03 1.311e-02 3.715e-05 1.238e-05 6.002e-05 1.015e-02 2.706e-20

 05 3.84% 4.79% 4.96% 5.42% 5.88% 5.87% 5.20% 8.23%

R

>

:

Table 3: Summary statistics of the distributions of correlation coeÆcient between the expression pro les of interacting proteins from di erent dataset and our predictions with di erent probability threshold. *R: correlation coeÆcient of gene expression for given gene pairs.

Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II Interactor I Interactor II

protein MRK1 YNL078W CTT1 PEX14 LAP4 YHR113W SPS18 YIP1 TFB1 YOL083W DPM1 SPS18 SNZ1 SNZ1 APG17 RPS0A APG17 RPS0B SNO3 SNZ3 SNO2 SNZ1 SIW14 YCR095C SIW14 SIW14 PRS5 PRS3 PRS5 PRS5 PRS5 PRS1 TAF40 SPT3

Function Ser/Thr Kinase unknown Catalase T/cytosolic Interacts with peroxisome membrane receptors (docking interactions)/PMP* Lysosomal/vacuolar aminopeptidase similar to Lap4p Transcription factor, sporulation speci c/nuclear Vesicular transport, fusion events/G*, IMP* RNA Pol II transcription, subunit of TFIIH Unknown Protein modi cation/ER*,IMP* Transcription factor, sporulation speci c/nuclear biosynthetic enzyme, role in cell stress biosynthetic enzyme, role in cell stress authophagy, Vesicular transport Ribosomal protein, RNA-binding protein/Cytoplasmic authophagy, Vesicular transport Ribosomal protein, RNA-binding protein/Cytoplasmic putative vitamin biosynthetic enzyme, role in cell stress similar function as SNZ1 similar function as SNO3 biosynthetic enzyme, role in cell stress Ser/Thr phosphatase, cell cycle control unknown Ser/Thr phosphatase, cell cycle control as above amino acid and nucleotide metabolism/cytoplasmic similar function as PRS5 as above as above as above similar function as PRS5 RNA Pol II transcription, TFIID component, TBP associated component of nucleosomal HAT complex, TBP associated

Table 4: Some novel predictions. *HAT: histone acetyl transferase, *ER: Endoplasmic reticulum, *G: Golgi, *PMP: Peripheral membrane protein, *IMP: Integral membrane protein. The functional annotations are obtained from YPD.

the two processes. Some predictions may indicate previously unknown protein functions. For example, SPS18, a sporulation-specific transcription factor, is predicted to interact with YIP1 and DPM1. Both YIP1 and DPM1 are integral membrane proteins localized at the ER and Golgi. Their functions in vesicular transport and protein modification suggest that the products of the sporulation-specific genes activated by SPS18 may be recruited to membranes to form the spore wall and therefore interact with YIP1 and DPM1; SPS18 may be involved in this process.

The rest of our top novel predictions involve interactions between members of the same gene family, such as PRS5, PRS3, and PRS1, or between members of two separate gene families, such as the SNO and SNZ gene families. According to the literature, PRS1 and PRS3 are known to interact strongly with each other, and PRS5 interacts with PRS2 or PRS4. Here we predict interactions between PRS5 and PRS3, between PRS5 and PRS1, and of PRS5 with itself. PRS is a phosphoribosyl pyrophosphate synthetase involved in amino acid and nucleotide metabolism. Each of the SNZ genes has a SNO gene upstream, and members of these two gene families are highly conserved and co-regulated. Genes of both families are involved in the cellular response to nutrient stress; hence, interactions between the two families are biologically plausible.

We also observe interactions of two ribosomal proteins, RPS0A and RPS0B, with APG17, a protein involved in vesicular transport and autophagy. However, since pairwise interactions do not give a complete picture of the functional role of a protein, we examined all predicted interactions of APG17 and RPS0A separately, in Table 5 and Table 6, respectively. We observe RPS0A interactions with APG17, BBP1, YDL100C, and ILV1 with probability greater than 0.5. BBP1 is a spindle pole body protein and is known from the literature to bind Bfr1p, which is involved in vesicular transport of secretory proteins and is localized on the polyribosome-mRNP complex.
APG17 is a component of the APG complex of proteins involved in targeting proteins to vacuoles/lysosomes under starvation conditions. We predict binding of BBP1 and BFR1 (from the literature) to RPS0A, and also binding of APG17 to RPS0A. Thus, the binding of two different vesicular transport proteins to ribosomal proteins may or may not be part of one complex, depending on the cellular environment.

We predict several interactions for APG17, listed in Table 5. Nine ORFs of unknown function are predicted to interact with APG17. It is possible that these ORFs are involved in the APG protein-dependent vesicular transport system. We predict that APG17 interacts with SEC9, another protein functioning in vesicular transport, and with SPO20, which is involved in sporulation; both are localized at the plasma membrane. We also predict that APG17 interacts with four other proteins, LAT1, DOG1, DOG2, and KGD2, all of which are involved in carbohydrate metabolism and energy generation. Since APG17 targets proteins to vacuoles and lysosomes during nutrient stress conditions, it is possible that it targets some other proteins involved in energy generation (under starvation conditions) to mitochondria and some proteins involved in spore wall formation to the plasma membrane. Thus, experimental verification of some of our significant predictions may shed light on cellular processes and explain the roles of proteins that may be plausible links between distinct pathways.

4. DISCUSSION

We apply a probabilistic model to derive domain-domain interactions from protein-protein interactions observed in two-hybrid systems. We predict protein-protein interactions from the derived domain-domain interactions, and assess the accuracy of our model at the protein level in three ways: (1) comparing the prediction results with the original experimental data, (2) comparing the prediction results with the MIPS protein-protein interactions derived by methods other than the two-hybrid systems, and (3) comparing the mean gene expression correlation coefficient for the predicted interacting protein pairs with that for random protein pairs. Our probabilistic model and the Maximum Likelihood Estimation method are robust in handling experimental errors. The structure of our probabilistic model allows us to incorporate various kinds of protein-protein interaction data, even from different organisms, to infer domain-domain interactions. As more protein-protein interactions are experimentally determined, the prediction accuracy of our method is expected to improve substantially.

Statistics show that the prediction rate of our method is 90 times better than that of a random method in predicting the protein-protein interactions in MIPS that were obtained by methods other than two-hybrid systems. Although the statistics are significant, the prediction ratio, 40/(10824 − 4461) = 0.75%, does not seem to be practically useful. However, this is likely because the whole protein interactome is huge. We have shown that if there are 50 interactions per protein and all our predictions are real, 40 is what we would expect for the number of overlaps. It is also known that every experimental method is biased toward certain kinds of proteins and interactions. For example, the original experimental results of Uetz and of Ito have a very small number of overlaps with the interactions from other methods.
Thus it is likely that some of our novel predictions are real but, because of this bias toward particular proteins, cannot be verified by other methods. The mean correlation coefficient for our predicted protein pairs is higher than that of random pairs and also higher than that of the experimental data of Uetz and Ito. This result is a surprise. A possible explanation is that our probability score closely reflects the likelihood that an interaction is real. All three comparisons support our probabilistic model and indicate that the interaction probability we have derived is a good estimate.

On the other hand, the basic assumptions of our model ignore the following biological factors. First, our model assumes that domain-domain interactions are independent. In fact, whether two domains interact may depend on other domains in the same protein or on environmental conditions. Although we have identified domains that co-exist in proteins and merged them into one super-domain, there certainly exist many domains whose functions depend on other domains in the same protein. Secondly, the idea of using domain-domain interactions to predict protein-protein interactions assumes that some sub-units with special structure are essential to protein-protein interactions. These sub-units may be different from Pfam domains obtained through multiple alignments. Furthermore, compared with functionally annotated Pfam-A domains, Pfam-B domains are shorter and less well characterized, so the roles that they play in protein interactions may not be the same; in our model, however, we treat them on the same level as Pfam-A domains. It is known that protein-protein interactions have

Protein | Function (cellular role or biochemical)
APG17 | autophagy, vesicular transport
RPS0A | Ribosomal protein, RNA-binding protein
RPS0B | Ribosomal protein, RNA-binding protein
YBR197C | unknown
YPL077C | unknown
YAP7 | Transcription factor (Pol II), leucine zipper family
CIN5 | Transcription factor (Pol II), leucine zipper family
YMR031C | unknown
YKL050C | unknown
YBR270C | unknown
YJL058C | unknown
YMR124W | unknown
PLO1 | unknown
PLO2 | unknown
SPO20 | sporulation
LAT1 | Carbohydrate metabolism, energy generation
SEC9 | Vesicular transport (vesicle docking and secretion)
DOG1 | Carbohydrate metabolism, hydrolase
DOG2 | Carbohydrate metabolism, hydrolase
KGD2 | Carbohydrate metabolism, energy generation, oxidoreductase
Localization (in source order): Cytoplasmic, Cytoplasmic, unknown, unknown, Nuclear, unknown, unknown, unknown, unknown, unknown, unknown, unknown, Plasma membrane, Mitochondrial, Plasma membrane, Mitochondrial

Table 5: Novel predictions for APG17 interactions with high probability. The functional annotations are obtained from YPD.

Protein | Function (cellular role or biochemical)
RPS0A | Ribosomal protein, RNA-binding protein
APG17 | autophagy, vesicular transport
BBP1 | protein in spindle pole body, mitosis
YDL100C | similarity with E. coli ArsA ATPase in small molecule transport
ILV1 | Amino acid metabolism, biosynthesis of amino acid, lyase
Localization (in source order): Cytoplasmic, Nuclear, Cytoplasmic, Mitochondrial

Table 6: Novel predictions for RPS0A interactions with high probability. The functional annotations are obtained from YPD.

time and space constraints. Two proteins that contain potentially interacting domains may not interact with each other because they may be expressed at different times during the cell cycle, or may be located in different cell compartments. Protein-protein interactions depend not only on structure but also on other environmental conditions. Even two proteins with the same domain structure may behave differently in their interactions with other proteins.

It is believed that the experimental protein-protein interaction data are just a small fraction of the whole protein interaction network. The incompleteness of current data makes it difficult to derive domain interaction information. The comparison of the two data sets shows very small overlap. This may be because the whole interactome is much bigger than these two experimental data sets, so that they share only a small part of it. On the other hand, it is known that the experimental data contain many errors. The exact error rate has to be assessed by using other techniques.
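The independence assumption and the limitation noted above can be illustrated with a minimal sketch. It assumes that two proteins interact if at least one pair of their domains interacts and that domain pairs interact independently, so that the protein-level probability is 1 − ∏(1 − λ_mn) over all domain pairs (m, n). The function name, the dictionary `lam`, and the domain labels and probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product


def interaction_prob(domains_a, domains_b, lam):
    """P(two proteins interact) = 1 - prod(1 - lam[m, n]) over all domain pairs.

    Assumes independent domain-domain interactions; pairs missing from
    `lam` are treated as non-interacting (probability 0).
    """
    p_none = 1.0
    for m, n in product(domains_a, domains_b):
        key = tuple(sorted((m, n)))  # domain pairs are unordered
        p_none *= 1.0 - lam.get(key, 0.0)
    return 1.0 - p_none


# Hypothetical domain-domain interaction probabilities (not from our data).
lam = {("D1", "D2"): 0.6, ("D2", "D3"): 0.5}

# Two different proteins with the same domain composition {D1, D3},
# each paired with a protein carrying the single domain D2.
p1 = interaction_prob(("D1", "D3"), ("D2",), lam)
p2 = interaction_prob(("D3", "D1"), ("D2",), lam)
print(p1, p2)
```

Because the score depends only on domain composition, any two proteins with the same domains receive the same predicted probability, regardless of when or where they are expressed. This is the limitation described above.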

5. REFERENCES

[1] Lockhart, D.J., Winzeler, E.A. Nature (London) 405: 827-836, 2000.
[2] Goffeau, A. et al. Life with 6000 genes. Science 274: 546-567, 1996.
[3] Mewes, H.W., Albermann, K., Heumann, K., Liebl, S., Pfeiffer, F. MIPS: a database for protein sequences, homology data and yeast genome information. Nucleic Acids Res. 25: 28-30, 1997.
[4] Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623-627, 2000.
[5] Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA 97: 1143-1147, 2000.
[6] Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98: 4569-4574, 2001.
[7] Enright, A.J., Iliopoulos, I., Kyrpides, N.C. and Ouzounis, C.A. Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86-90, 1999.
[8] Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O. and Eisenberg, D. Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751-753, 1999.
[9] Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. and Yeates, T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96: 4285-4288, 1999.
[10] Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O. and Eisenberg, D. A combined algorithm for genome-wide prediction of protein function. Nature 402: 83-86, 1999.
[11] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95: 14863-14868, 1998.
[12] Wojcik, J. and Schachter, V. Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics 17 Suppl. 1: S296-S305, 2001.
[13] Rain, J.C. et al. The protein-protein interaction map of Helicobacter pylori. Nature 409: 211-215, 2001.
[14] Bock, J.R., Gough, D.A. Predicting protein-protein interactions from primary structure. Bioinformatics 17: 455-460, 2001.
[15] Hazbun, T.R. and Fields, S. Networking proteins in yeast. Proc. Natl. Acad. Sci. USA 98: 4277-4278, 2001.
[16] Legrain, P., Selig, L. Genome-wide protein interaction maps using two-hybrid systems. FEBS Letters 480: 32-36, 2000.
[17] Ito, T., Chiba, T. and Yoshida, M. Exploring the protein interactome using comprehensive two-hybrid projects. Trends in Biotechnology 19: S23-S27, 2001.
[18] Bateman, A., Birney, E., Durbin, R., Eddy, S., Howe, K. and Sonnhammer, E. The Pfam Protein Families Database. Nucleic Acids Res. 28: 263-266, 2000.
[19] Bairoch, A. and Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48, 2000.
[20] Corpet, F., Gouzy, J. and Kahn, D. Nucleic Acids Res. 27: 263-267, 1999. Updated article: Nucleic Acids Res. 28: 267-269, 2000.
[21] Grigoriev, A. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 29: 3513-3519, 2001.
[22] Ge, H., Liu, Z., Church, G.M. and Vidal, M. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics 29: 482-486, 2001.