Creation and analysis of a human protein expression atlas

Maxime Menu

Verhandeling ingediend tot het verkrijgen van de graad van Master in de Biomedische Wetenschappen

Promotor: Prof. Dr. Lennart Martens Begeleider: Kenneth Verheggen Vakgroep: Biochemie GE07

Academiejaar 2015-2016


Preface

Writing a master's thesis and conducting the necessary research around it is supposed to be mostly the result of the student's endeavor. In science, however, research is conducted by working together with other researchers. This is why I want to thank all the people who helped me and gave me counsel while I carried out the research and wrote this thesis. The first person I would like to thank is my promotor, Prof. Dr. Lennart Martens. I thank him for the opportunity to learn about bioinformatics at his research group and for the many words of advice given to me: not only in the field of bioinformatics, but also general writing tips, support for my possible future career choices, and all kinds of trivia. I would like to thank my mentor Kenneth Verheggen for his help with the writing of the thesis and for his counsel. I would also like to thank the members of the CompOmics group for their contribution to a great working atmosphere. I particularly thank Davy Maddelein, who was like a second mentor and who helped and counseled me on a plethora of subjects; Niels Hulstaert for his help with databases and with the virtual machine I used to conduct my experiments; Dr. Sven Degroeve and Ana Silvia C. Silva for their counsel and help with machine learning algorithms; Surya Gupta for the counseling on pathways and protein interactions; and Adriaan Sticker for the counseling on data science in general. Finally, I would like to thank the members of the jury, Dr. Pieter-Jan Volders and Dr. Maarten Dhaenens, for reading this thesis and being present at my presentation.


Table of contents

Preface .......................................................................... i
Table of contents ............................................................... ii
Abstract ......................................................................... 1
Samenvatting ..................................................................... 2
1 Introduction ................................................................... 3
1.1 Mass spectrometry proteomics ................................................. 3
1.1.1 Mass spectrometry .......................................................... 3
1.1.2 Types of mass spectrometry proteomics ...................................... 4
1.1.3 Identification of peptides from mass spectrometric spectra ................. 5
1.1.4 Quantification of peptides ................................................. 5
1.1.5 Protein inference .......................................................... 6
1.2 Availability of public proteomics data ....................................... 6
1.3 Public proteomics data reprocessing .......................................... 7
1.3.1 ReSpin pipeline ............................................................ 7
1.4 Distributed computing ........................................................ 8
1.5 Relational databases ......................................................... 8
1.6 Machine learning ............................................................. 9
1.6.1 Supervised learning ....................................................... 10
1.6.2 Unsupervised learning ..................................................... 11
1.6.3 Model evaluation .......................................................... 11
1.6.4 Model validation .......................................................... 12
1.6.5 Dimensionality reduction .................................................. 12
1.6.6 Feature selection ......................................................... 13
1.7 Expression atlases .......................................................... 14
2 Materials and Methods ......................................................... 15
2.1 Data retrieval .............................................................. 15
2.1.1 UniProt ................................................................... 15
2.1.2 Gene ontology ............................................................. 16
2.1.3 Reactome .................................................................. 16
2.2 Relational database ......................................................... 16
2.3 Data parsing ................................................................ 16
2.3.1 Human draft proteomes ..................................................... 17
2.3.2 Antibody protein atlas dataset ............................................ 18
2.3.3 UniProt data .............................................................. 19
2.3.4 Gene ontology data ........................................................ 19
2.3.5 Reactome data ............................................................. 19
2.4 Atlas creation .............................................................. 20
2.4.1 Query data and normalization of quantitative values ....................... 20
2.4.2 Linking of proteins to tissues ............................................ 20
2.5 Analyses on the atlas ....................................................... 21
2.5.1 Preliminary analyses ...................................................... 21
2.5.2 Clustering ................................................................ 22
2.5.3 t-SNE ..................................................................... 22
2.5.4 Preparation of the antibody atlas ......................................... 23
2.5.5 Comparison of antibody and mass spectrometry atlases ...................... 23
2.5.6 Tissue prediction models .................................................. 24
3 Results ....................................................................... 26
3.1 Database .................................................................... 26
3.2 Proteomics expression atlas ................................................. 27
3.3 Preliminary analyses ........................................................ 28
3.4 Clustering .................................................................. 31
3.5 t-SNE ....................................................................... 33
3.6 Comparison of antibody and mass spectrometry atlases ........................ 33
3.7 Tissue prediction models .................................................... 36
3.7.1 Tissue predictor .......................................................... 36
3.7.2 Cell type predictor ....................................................... 37
4 Discussion .................................................................... 39
4.1 Re-using the pipeline on additional data in the future ...................... 39
4.2 Quality of mass spectrometry datasets ....................................... 39
4.3 Clustering and visualization ................................................ 40
4.4 Comparison of the MS-based and antibody-based atlases ....................... 40
4.5 Tissue and cell type prediction models ...................................... 40
4.6 Future studies .............................................................. 42
4.6.1 Inclusion of PRIDE data ................................................... 42
4.6.2 Optimization of the prediction models ..................................... 42
4.6.3 Pathways and protein functional shifts .................................... 42
4.6.4 External datasets ......................................................... 43
5 Conclusions ................................................................... 44
6 References .................................................................... 45
7 Supplementary materials ....................................................... a

Abstract

Mass spectrometry based proteomics generates increasingly large amounts of data that are collected and made available for reprocessing by public repositories such as PRIDE. One of the possible uses for such large-scale data is the creation of a protein-based expression atlas. In order to create such an atlas from human data, the two draft human proteomes by Kim et al. and Wilhelm et al. were used. Metadata could be obtained for these data sets, and the results from homogeneous reprocessing with stringent identification criteria were already available. Based on the available metadata and rough protein abundance estimation, a protein expression atlas was built. This human proteomics atlas was then compared to the existing antibody-based Human Protein Atlas (HPA) using correlation coefficients. Interestingly, poor correlation was obtained between the two atlases, showing that more work will need to be done before these two views of the tissue-mapped proteome can be reconciled. Moreover, due to the absence of some of the metadata in the Wilhelm et al. data set, prediction models were constructed with random forests to predict tissue and cell type annotations based on the identified proteins. These predictors performed very well, and were typically capable of unambiguously identifying source tissues. In conclusion, it is clear that there is great promise in the construction of such a human proteomics expression atlas. However, future studies including much more proteomics data from across the entire PRIDE database will be important to further improve the coverage of the tissue-based proteome, possibly yielding higher correlation with the antibody atlas and allowing for even higher-resolution predictions.


Samenvatting

Background: Mass-spectrometry-based proteomics generates very large amounts of data. These data are collected in public repositories, which allows them to be reprocessed. In this way, possible new applications can be discovered, such as building a protein expression atlas. Methods: The two human proteome datasets by Kim et al. and Wilhelm et al. were used to build such an atlas. Metadata could be collected for these datasets, and the results of reprocessing with stringent identification criteria were already available. Based on the available metadata and rough estimates of protein abundances, a protein expression atlas was built. This human proteome atlas was then compared, by means of correlation coefficients, to the already existing antibody-based protein expression atlas. To compensate for the partial absence of metadata in the Wilhelm et al. dataset, prediction models were built with random forests to predict the tissue and the cell type, respectively, from the identified proteins. Results: A weak correlation was observed between the two atlases. This shows that these two versions of the tissue-mapped proteome cannot easily be reconciled. The prediction models performed well and could unambiguously identify tissues. Conclusions: Building a protein expression atlas holds great potential. Additional studies, incorporating more proteomics data from the PRIDE database, will be important to improve the coverage of the tissue-based proteome, possibly yield a higher correlation with the antibody atlas, and improve the resolution of the predictions.


1 Introduction

Mass spectrometry based proteomics generates increasingly large amounts of data that are collected and made available for reprocessing by public repositories such as the PRoteomics IDEntifications (PRIDE) database1. One of the possible uses for such large-scale data is the creation of a protein-based expression atlas. This introduction provides a short primer on mass spectrometry based proteomics in Section 1.1, the availability of public proteomics data in Section 1.2, and the approaches for public proteomics data reprocessing in Section 1.3. Reprocessing vast amounts of public proteomics data is very challenging. First of all, the large amount of public proteomics data easily exceeds the storage and processing capacities of a single computer; data of this size requires more advanced, distributed computational architectures for processing, which are described in Section 1.4. Second, storage of and rapid access to the relevant (meta)data is a key logistical challenge, which is solved through the use of relational databases, described in Section 1.5. Third, the analysis of these data goes beyond simple statistical modeling, especially when predictive models are desired; machine learning algorithms, however, are perfectly suited to this task, and these are therefore discussed in Section 1.6. Finally, Section 1.7 provides a broad overview of the approaches that were used in this thesis to create a proteomics expression atlas, compare it to the existing antibody-based atlas2, and build predictive models from the proteomics atlas.

1.1 Mass spectrometry proteomics

1.1.1 Mass spectrometry

Mass spectrometers are instruments used to determine a mass profile of molecules. A prerequisite for detection by these instruments is that the compounds have to be charged (ionized). Indeed, proteins or peptides need to be ionized by protonation or deprotonation of the side chains of their amino acids: protonation usually occurs on basic amino acids, deprotonation usually occurs on acidic amino acids. Knowing this ionization strategy is important, as mass spectrometers can work in two modes: positive ion mode, in which positively charged ions are detected, and negative ion mode, in which negatively charged ions are detected. Mass spectrometers possess three major components: an ion source (ionizer), a mass analyzer, and a detector. The ionizer converts a sample of the compounds into ions. The ions are passed into the mass analyzer, which separates the ionized masses based on their mass-to-charge (m/z) ratios. The ions are then piped sequentially to the detector, which measures signal intensities in order to determine the (relative or absolute) abundances of the ionized compounds. Typically, some kind of electron multiplier is used as detector. Electron multipliers are vacuum tubes that multiply charges through secondary emission; this results in large cascades of electrons collected by an anode, and thus in an amplified signal. The whole process of ionizing, sorting ions, and detecting the formed precursor ions is also called the MS1 stage. These precursor ions can be selected and further fragmented and ionized in an MS2 stage; the detection of these fragment ions happens analogously to the MS1 level. Precursor ions of interest can usually be selected for MS2 through the use of quadrupoles, which can filter on m/z ratio. Quadrupoles apply a combination of alternating current (of a given amplitude and frequency) and direct current that only lets those ions pass whose m/z ratio enables them to oscillate stably in the quadrupole; other ions fail to pass the quadrupole and are lost to the detector.3

1.1.2 Types of mass spectrometry proteomics

There are three types of mass spectrometry based proteomics: bottom-up, top-down, and middle-down. Top-down proteomics consists of analyzing ionized proteins directly with MS/MS. Top-down proteomics, however, does not handle complex samples as well as bottom-up proteomics does, suffers from dynamic range challenges, and is hampered by the lack of intact protein fragmentation methods in mass spectrometry.
Bottom-up proteomics consists of digesting proteomes (usually a tryptic digest) into peptides, fractionating those peptides with separation techniques such as liquid chromatography, and finally detecting those peptides with tandem mass spectrometry (MS/MS or MS2). Middle-down proteomics can be considered a specialized form of bottom-up proteomics that produces larger peptides in the protein digestion phase. Of the three, bottom-up proteomics is the most frequently used method.4,5
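As a small aside not drawn from the thesis itself, the relation between a peptide's mass, its charge, and the m/z value at which it appears in positive ion mode can be sketched as follows, assuming one proton is added per charge:

```python
# Sketch: the m/z value at which a peptide is expected to appear in positive
# ion mode, assuming protonation adds one proton (~1.007276 Da) per charge.

PROTON_MASS = 1.007276  # monoisotopic mass of a proton, in Da

def peptide_mz(monoisotopic_mass: float, charge: int) -> float:
    """Return the m/z ratio of a peptide carrying `charge` protons."""
    return (monoisotopic_mass + charge * PROTON_MASS) / charge

# A hypothetical peptide of 1500 Da observed at charge states 1+ and 2+:
print(round(peptide_mz(1500.0, 1), 3))  # 1501.007
print(round(peptide_mz(1500.0, 2), 3))  # 751.007
```

Note how the same peptide appears at very different m/z values depending on its charge state, which is why charge determination is part of spectrum processing.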

1.1.3 Identification of peptides from mass spectrometric spectra

The raw output of a mass spectrometry experiment does not directly provide a list of identified peptides, but rather a list of peaks of varying intensity at certain m/z values. These represent the abundances of ions detected at the given m/z ratios. The collected peaks are gathered in spectra and form a signature of the detected molecule. In the case of peptides, the strategy of identifying the spectra with peptides and their parent proteins is referred to as "peptide-mass fingerprinting"6. This identification process consists of identifying amino acids by examining the mass differences between peaks in a spectrum, which allows a possible peptide sequence to be reconstructed one amino acid at a time. A complication arises, however, because two pairs of amino acids have (practically) identical molecular weights and therefore display the same m/z ratio: leucine and isoleucine, and lysine and glutamine3. Multiple algorithms that identify MS/MS spectra have been released in the past; this type of algorithm is generally referred to as a sequence database search engine.7

1.1.4 Quantification of peptides

The measured peaks in a spectrum give a general idea of the abundance of their generating ions. These abundances, however, are mainly suited to display relative differences between samples. A method to determine absolute quantifications of peptides is to label proteins, usually with heavier isotopes of atoms incorporated in amino acids or through the use of tags with different isotopes8, such as iTRAQ9, TMT10 and SILAC11. These labeled proteomics experiments are usually carried out on enriched samples of proteomes or sub-proteomes. In circumstances where massive amounts of various proteins need to be analyzed quickly, however, such as in clinical facilities, labeled experiments can prove too expensive and time-consuming.
Therefore, there are still vast amounts of label-free proteomics experiments.12,13 Comparing relative abundances between different ions in a single run, or abundances of a single ion between different runs, poses many challenges. The most commonly used methods for the relative quantification of unlabeled proteomics experiments are based on spectral counts and on ion intensities. Spectral counting estimates abundances based on the number of peptide-to-spectrum matches (PSMs)13,14. The ion-intensity-based approach compares

chromatographic peak areas of corresponding peptides during the chromatographic separation of the peptide samples.13

1.1.5 Protein inference

Protein inference refers to the process of inferring parent proteins by combining the peptides that were identified in a mass spectrometry experiment with the protein sequences included in the search database. This process can be quite challenging due to so-called degenerate peptides, which map to several different proteins, and one-hit wonders, proteins for which only one peptide was identified. This is particularly the case for membrane-bound proteins, which are inherently harder to detect because only part of the sequence is detectable without protocols designed specifically for this purpose; examples of such methodologies are illustrated in the studies by Morgner et al.15 and Arnott et al.16 Another important term in the field of protein inference is proteotypic peptides, of which there are two types. The first refers to peptides that map to only one parent protein, based on the uniqueness of their sequence; the second refers to peptides that are only ever detected for one parent protein in mass spectrometry experiments. The protein inference problem can be circumvented for proteins with at least two or three proteotypic peptides.17
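The distinction between degenerate and (sequence-based) proteotypic peptides can be sketched with a toy example; the accessions and sequences below are purely illustrative and not taken from any real database:

```python
# Sketch of sequence-based protein inference: a peptide that maps to exactly
# one protein in the search database is proteotypic (in the first sense above),
# while a peptide mapping to several proteins is degenerate.

from collections import defaultdict

# Hypothetical search database of protein accessions and sequences.
database = {
    "PROT_A": "MKLVPEPTIDERSEQVENCEK",
    "PROT_B": "MKLVPEPTIDERSHAREDK",
    "PROT_C": "GGSHAREDKLLQ",
}

def map_peptides(peptides, db):
    """Map each identified peptide to all proteins whose sequence contains it."""
    mapping = defaultdict(list)
    for pep in peptides:
        for accession, sequence in db.items():
            if pep in sequence:
                mapping[pep].append(accession)
    return dict(mapping)

identified = ["PEPTIDER", "SEQVENCEK", "SHAREDK"]
mapping = map_peptides(identified, database)
proteotypic = [pep for pep, proteins in mapping.items() if len(proteins) == 1]

print(mapping)      # PEPTIDER and SHAREDK are degenerate (two parents each)
print(proteotypic)  # ['SEQVENCEK']
```

Here only PROT_A gains a proteotypic peptide; PROT_B and PROT_C are supported solely by degenerate peptides and would remain ambiguous without further evidence.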

1.2 Availability of public proteomics data

Technological advances in mass spectrometry instrumentation and novel algorithms have led to a substantial increase in data volumes. A lot of these data have been made publicly available in online repositories, such as the PRoteomics IDEntifications (PRIDE) database.1 However, there are deposits from the past decade that lack complete annotation and thus require either manual curation or processing by annotation tools to be completed. This is mostly true for older, legacy projects, as the instatement of a submission system with mandatory parameters greatly improved the metadata annotation.18 PRIDE deposits can be split into two major types: partial and complete projects. Complete projects contain the original mass spectrometer output as raw files, peptide/protein identification files, and the spectra that were used for the sequence database searches. Partial projects, on the other hand, contain only raw files, generally lack any information on peptide

identification, and are intended as a temporary submission until the publication process of the data has been completed. Co-operation between different public proteomics databases has been undertaken in order to gather as much data as possible, most notably in ProteomeXchange19. ProteomeXchange is an integrated framework for the submission and dissemination of mass spectrometry proteomics data, allowing for better regulation and quality control of mass spectrometry experiments across different repositories. PRIDE1 and PeptideAtlas20 are the two main repositories in the ProteomeXchange framework, and this network has continued to welcome new repositories over the last years, such as MassIVE (https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp).

1.3 Public proteomics data reprocessing

Despite the public availability of these proteomics data, their (orthogonal) reuse remains very limited. This is in contrast with the field of transcriptomics, where the reuse of public data has taken a large stride forward; see for instance21. This lack of reuse of proteomics data is peculiar, because the added value of such reuse has already been proven.22,23 Because a proteomics experiment tends to be analyzed with a specific purpose in mind, adaptation of the original settings can lead to novel discoveries. The settings that can be adapted include parameters for the search engines, such as the search database, the modifications considered, the allowed number of missed cleavages, the fragment and/or precursor ion mass tolerance, or even the enzyme used.

1.3.1 ReSpin pipeline

The ReSpin pipeline was designed to re-analyze public proteomics data. This pipeline consists of three tools in sequence: pride-asap24, SearchGUI25 and PeptideShaker26. pride-asap provides a uniform annotation of identified spectra stored in the PRIDE database and was originally designed to provide a solid basis on which to build an a posteriori quality control framework. In ReSpin, however, pride-asap is used to infer search parameters and original experiment settings, such as the applied enzymatic sample processing, the modifications included in the search space, and the used mass tolerances. SearchGUI functions as a single point of access to launch multiple search engines that would otherwise each have their own specific settings and

parameters. ReSpin currently uses three of the seven available search engines: MyriMatch27, X!Tandem28 and MSGF+29. PeptideShaker is a post-processing tool that collates the results from the individual search engines into a global set of identified peptide-to-spectrum matches (PSMs), trying to maximally exploit the strengths of the search algorithms used. PeptideShaker also provides in-depth statistics to assess the quality of the results in detail.

1.4 Distributed computing

The large amount of public proteomics data easily exceeds the storage and processing capacities of a single computer. The reprocessing of these data therefore requires substantial processing power, typically obtained from cluster or grid computing30. Setting up such grid-based solutions can be tedious and time-consuming. Solutions have been proposed to reduce the complexity of managing computationally intensive bioinformatics tasks, such as the Galaxy31 platform and Pladipus32. Pladipus has been used to reprocess the public proteomics data used in this thesis because it incorporates SearchGUI and PeptideShaker out of the box, with the option to set up additional pre- and post-processing steps, including parameter inference by pride-asap.32 A recent evolution in large-scale computing is the MapReduce paradigm, which has become very popular through its implementations in Hadoop33 and Spark34. This paradigm generally consists of splitting datasets into multiple smaller datasets, spreading these across the available compute nodes, and then performing calculations locally (the Map phase). The results are then collated into a single result set in the Reduce phase.35
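The two MapReduce phases can be illustrated with a small, purely local sketch; the run contents are made up, and a real deployment would execute the map calls on separate Hadoop or Spark nodes rather than in a local loop:

```python
# Minimal illustration of the MapReduce pattern: counting how often each
# peptide was identified across several runs. The Map phase processes each
# run independently (and could therefore run on separate compute nodes);
# the Reduce phase collates the partial results into a single result set.

from collections import Counter
from functools import reduce

runs = [
    ["PEPTIDEA", "PEPTIDEB", "PEPTIDEA"],  # identifications from run 1
    ["PEPTIDEB", "PEPTIDEC"],              # identifications from run 2
]

def map_phase(run):
    """Map: turn one run into partial per-peptide counts."""
    return Counter(run)

def reduce_phase(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right

partial = [map_phase(run) for run in runs]       # independent, parallelizable
totals = reduce(reduce_phase, partial, Counter())

print(dict(sorted(totals.items())))  # {'PEPTIDEA': 2, 'PEPTIDEB': 2, 'PEPTIDEC': 1}
```

Because the map calls share no state, they can be distributed freely; only the reduce step needs to see all partial results.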

1.5 Relational databases

In order to provide a solid basis for the creation of an expression atlas, the underlying data and metadata have to be clearly structured and stored in an efficient and persistent way. Relational databases provide a solution to this problem. Relational databases are the most widely used type of database and implement the relational model. This model approaches data management through a structure and language consistent with first-order predicate logic. In a relational database, all data is represented in terms of tuples, grouped into relations. Related entries are linked together with a "key". The

relational model provides a declarative method for specifying data and queries. Most relational databases use the SQL data definition and query language. A table in an SQL database schema corresponds to a predicate variable; the contents of a table to a relation; key constraints, other constraints, and SQL queries correspond to predicates. There are three types of relations between tables in a relational database: one-to-one, one-to-many, and many-to-many. A one-to-one relation occurs when one element A can only be linked to one other element B; for instance, a proteotypic peptide is linked to only one protein. A one-to-many relation is used when one element A can be linked to multiple elements B, but every element B links to only one element A; for instance, a protein and the collection of its proteotypic peptides. A many-to-many relation joins multiple elements A with multiple elements B; for instance, proteins and their degenerate peptides. There are multiple guidelines for the efficient design of relational databases, known as the orders of normalization (normal forms). These orders are hierarchically constructed, and every higher order assumes the correct application of the order directly beneath it. These guidelines generally ensure that every table contains only the minimally needed information, and avoid duplicates where possible. Note that some infractions of these guidelines may be acceptable, depending on the structure of the database, to ensure better practicality or to optimize queries.36 The database developed in this work has been normalized according to these guidelines.
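These relation types can be sketched in SQL using Python's built-in SQLite driver; the table and column names below are illustrative only and do not reflect the actual schema built in this thesis:

```python
# Sketch of one-to-many and many-to-many relations with foreign keys and a
# junction table, using an in-memory SQLite database.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE protein (
        id INTEGER PRIMARY KEY,
        accession TEXT NOT NULL UNIQUE
    );
    -- one-to-many: each proteotypic peptide references exactly one protein
    CREATE TABLE proteotypic_peptide (
        id INTEGER PRIMARY KEY,
        sequence TEXT NOT NULL,
        protein_id INTEGER NOT NULL REFERENCES protein(id)
    );
    -- many-to-many: degenerate peptides are linked through a junction table
    CREATE TABLE peptide (
        id INTEGER PRIMARY KEY,
        sequence TEXT NOT NULL
    );
    CREATE TABLE protein_peptide (
        protein_id INTEGER REFERENCES protein(id),
        peptide_id INTEGER REFERENCES peptide(id),
        PRIMARY KEY (protein_id, peptide_id)
    );
""")
con.execute("INSERT INTO protein (accession) VALUES ('P12345'), ('P67890')")
con.execute("INSERT INTO peptide (sequence) VALUES ('SHAREDK')")
con.execute("INSERT INTO protein_peptide VALUES (1, 1), (2, 1)")

# One degenerate peptide now maps to two proteins via the junction table:
rows = con.execute("""
    SELECT p.accession
    FROM protein p
    JOIN protein_peptide pp ON pp.protein_id = p.id
    ORDER BY p.accession
""").fetchall()
print(rows)  # [('P12345',), ('P67890',)]
```

The junction table is what makes the many-to-many relation possible: neither the protein table nor the peptide table has to duplicate rows to express shared peptides, in line with the normalization guidelines above.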

1.6 Machine learning

Aside from adequate means of storage and sufficient processing power, finding proper statistical techniques to correctly process large amounts of data can be quite challenging. Simple statistical modeling techniques become more difficult to apply when dealing with high-dimensional data and hidden structures in the data. In the case of classification, for example, statistical models usually provide explicit frontiers in order to classify data; machine learning, on the other hand, can also capture patterns beyond those frontiers. Machine learning is a field at the intersection of data analytics and computer science. It comprises algorithms that can learn from, and make predictions on, input data.37


1.6.1 Supervised learning

There are two types of machine learning algorithms: supervised and unsupervised. Supervised learning algorithms learn from labeled data. The general workflow consists of training prediction models to predict labels for a test set based on features (variables/attributes of the data). Examples of supervised learning methods are decision trees and their ensemble counterpart, random forests.38 Decision trees assign entropy values to data labels and split the data based on changes in entropy. Entropy is computed as follows:

H(X) = -Σi P(xi) logb P(xi)

Entropy (H) is the negative of the sum, over all labels, of the probability of each label (P(xi)) multiplied by the logarithm of that probability (logb P(xi)). Decision trees recursively partition the data into optimal partitions based on a single feature. In order to find these features, a criterion called information gain is calculated: the difference between the current entropy and the entropy after partitioning on a given feature. At each node in the tree, the decision tree selects the feature with the highest information gain at that node for partitioning. The tree stops growing when there are no more features available for partitioning, when only one class (data points with the same label) is left in the node, or when a preset depth limit is reached. Decision trees have several advantages, such as ease of interpretation, the ability to handle both continuous and discrete features, insensitivity to transformations of the features, automatic feature selection, robustness, and scalability. Decision trees are, however, not as accurate as other algorithms, tend to be unstable, and can possess high variance.39 Decision trees can be improved by building multiple trees and averaging these. This approach, called bootstrap aggregating (bagging), reduces variance. Such ensembles of bootstrap-aggregated decision trees are the basis of random forests.40
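The entropy and information-gain computations described above can be sketched in a few lines of Python; the tissue labels are hypothetical, and a base-2 logarithm is assumed:

```python
# Sketch of the split criterion used by decision trees: entropy of a label
# distribution, and the information gain of a candidate partition.

from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(P(x_i) * log2(P(x_i))) over the label distribution."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy before the split minus the weighted entropy of the partitions."""
    total = len(labels)
    after = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(labels) - after

labels = ["liver", "liver", "brain", "brain"]
print(entropy(labels))  # 1.0 (maximally mixed two-class node)

# A feature that splits the node into two pure partitions gains one full bit:
print(information_gain(labels, [["liver", "liver"], ["brain", "brain"]]))  # 1.0
```

A feature that produced the partition [["liver", "brain"], ["liver", "brain"]] would instead yield an information gain of 0.0, so the tree would never choose it.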


1.6.2 Unsupervised learning

Unsupervised learning algorithms learn from unlabeled data, which adds an additional challenge because the algorithms need to look for hidden structures in the data. One popular class of unsupervised learning algorithms is that of clustering algorithms.

Clustering algorithms try to group data points into clusters such that data points in the same cluster are more similar to each other than to data points in other clusters. Examples of unsupervised learning algorithms are k-means++41 and hierarchical agglomerative clustering.42 K-means++ assigns each data point in a dataset to one of K clusters. These clusters are specified by centroids, the average locations of the points in each cluster. The centroids are randomly initialized while being dispersed as much as possible from each other, in order to avoid initializing multiple collocated centroids. The algorithm then updates the position of each centroid to the average of the data points assigned to it. It continues to iteratively reassign data points to the closest centroids and update the centroid positions until a preset level of convergence of the centroid positions is reached. While very flexible, this technique does require the number of clusters K to be specified in advance.41 Hierarchical agglomerative clustering produces clustered visualizations of datasets without prior knowledge of the number of clusters K. The algorithm represents each data point as a singleton cluster (a cluster with one data point) and iteratively merges the two closest clusters until a single cluster remains. The result is a so-called cluster tree or dendrogram.42

1.6.3 Model evaluation

Predictions fall into four types: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). These four types can be visualized with a confusion matrix (Table 1). A common metric for the evaluation of prediction models is accuracy, defined as the sum of true positives and true negatives divided by the total number of predictions. Accuracy, however, is sensitive to the ratio between labels.
If the majority of the data belongs to the same class, then predicting that class for most data points can lead to a high accuracy even though the model may perform poorly on

all other classes. Two other metrics for the evaluation of prediction models are the true positive rate (TPR) and the false positive rate (FPR). The TPR is the ratio of true positives for a class over the total number of data points belonging to that class, while the FPR is the ratio of false positives for a class over the total number of data points not belonging to that class.

Table 1: example of a confusion matrix

                        Condition A    Not A
Predicted as A          TP             FP
Predicted as not A      FN             TN
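From the counts in such a confusion matrix, accuracy, TPR and FPR follow directly. A small illustration with made-up counts for a deliberately imbalanced class:

```python
def accuracy(tp, tn, fp, fn):
    # (TP + TN) over all predictions
    return (tp + tn) / (tp + tn + fp + fn)

def tpr(tp, fn):
    # true positives over all data points that truly belong to the class
    return tp / (tp + fn)

def fpr(fp, tn):
    # false positives over all data points that do not belong to the class
    return fp / (fp + tn)

# Hypothetical counts: 95 negatives, 5 positives, and a model that almost
# always predicts the negative class.
tp, fn, fp, tn = 1, 4, 0, 95
acc = accuracy(tp, tn, fp, fn)   # looks excellent despite the poor model
sensitivity = tpr(tp, fn)        # reveals that only 1 of 5 positives is found
```

This is exactly the label-imbalance pitfall described above: the accuracy is 0.96 while the TPR for the minority class is only 0.2.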

Random forests have an additional metric known as the out-of-bag (oob) error. The oob error is calculated by evaluating every tree on the data that was not used in its construction, and averaging the resulting prediction errors.40,43

1.6.4 Model validation

Models need to perform well on unseen external data. In order to estimate the performance of prediction models on external data, the available labeled data can be split into different subsets. Data sets are split into a training/validation set and a test set. The training/validation set is split further into n subsets. Of these n subsets, a single subset is retained as the validation set for testing a model, and the remaining n-1 subsets are used to train the model. This procedure is repeated for every subset, and the performance on the real test set can be estimated as the average of the accuracies over all subsets. This procedure is known as cross-validation (cv). The accuracy of the cv-trained model can then be taken as a fair estimate of the performance of the model on unseen external data.43

1.6.5 Dimensionality reduction

High-dimensional datasets (datasets with many features) are difficult to visualize in two or three dimensions. There is therefore a need for techniques that reduce the number of dimensions while keeping the distances between data points in two or three dimensions as representative of the original high-dimensional distances as possible. Examples of dimensionality reduction techniques are principal component analysis (PCA)44 and t-Distributed Stochastic Neighbor Embedding (t-SNE).45

In PCA, data sets are decomposed through a linear transformation into their principal components, or eigenvectors. The eigenvectors are orthogonal and point in the directions of largest variance, thereby retaining as much information from the dataset as possible. Every eigenvector has an eigenvalue, which represents the variance observed along that eigenvector.44 In t-SNE, conditional probabilities are calculated that are proportional to the similarities (based on Euclidean distances) between data points, using a Gaussian distribution. This leads to high probabilities of picking similar elements and low probabilities of picking dissimilar elements. Probabilities for the low-dimensional counterparts of the data points are then calculated using a Student t-distribution. The Kullback-Leibler divergence between the two distributions is subsequently minimized, leading to low-dimensional representations whose inter-point distances remain representative of those in the high-dimensional space.45

1.6.6 Feature selection

Feature selection consists of selecting the optimal subset of features to build models with. This is in contrast to feature extraction techniques such as PCA, which create new features from the original features. There are three types of feature selection implementations: filter, wrapper and embedded methods. Filter methods calculate the relevance of each individual feature independently from the other features, thereby neglecting potential interactions between features. Rather than providing optimal subsets, filter methods usually provide feature rankings. Wrapper methods use predictive models to find the optimal feature subsets: candidate subsets of features are evaluated in a cross-validated fashion in order to determine the optimal subset. Wrapper methods are usually computationally intensive, but tend to provide the best subsets.
Embedded methods perform feature selection as part of the model construction process. Decision trees and random forests implicitly calculate feature importance during the building process by calculating the information gain for every feature at every node of the tree. Random forests have the additional advantage of averaging these importances over many trees, selecting subsets that perform well on the data set in general, rather than only on the subset on which a single tree was trained.46
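The cross-validation scheme of section 1.6.4 can be made concrete with a short sketch; a trivial nearest-centroid rule stands in for a real classifier such as a random forest, and the toy data are invented for illustration:

```python
from statistics import mean

def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose training-set mean (1-D feature) is closest to x."""
    centroids = {}
    for label in set(train_y):
        pts = [train_x[i] for i in range(len(train_x)) if train_y[i] == label]
        centroids[label] = sum(pts) / len(pts)
    return min(centroids, key=lambda lab: abs(centroids[lab] - x))

def cross_val_accuracy(xs, ys, k=3):
    """Train on k-1 folds, validate on the held-out fold, average the accuracies."""
    accs = []
    for fold in kfold_indices(len(xs), k):
        train = [i for i in range(len(xs)) if i not in fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        hits = [nearest_centroid_predict(tx, ty, xs[i]) == ys[i] for i in fold]
        accs.append(sum(hits) / len(hits))
    return mean(accs)

# Two well-separated toy classes: low values are 'a', high values are 'b'.
xs = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
ys = ["a", "a", "a", "b", "b", "b"]
cv_acc = cross_val_accuracy(xs, ys, k=3)
```

The averaged fold accuracy is the cv estimate of the model's performance on unseen data.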

1.7 Expression atlases

This thesis aims to create an expression-based atlas that maps protein expression levels to tissues by re-using results of public proteomics data. The advantages of proteomics data over transcriptomics data for this purpose are considerable. First, proteomics data provides more direct information on protein quantity than mRNA expression, as the mRNA expression levels for a given gene do not necessarily correlate with the related protein quantity.47 Second, proteomics data provides more direct insight into protein-protein interactions than transcriptomics data. While co-expressed genes have an increased probability of physically interacting at the protein level, gene expression levels again do not necessarily correlate with protein quantities; protein-level quantification can thus deliver better predictors for protein-protein interactions.48 Recently, two large-scale data sets have provided the community with reasonably well annotated, tissue-wide, human proteomics data48,49, but a complex set of tools is needed to interpret the results and use these to create a protein expression atlas. This includes an efficient means to gather these data, to match experimentally obtained spectra to peptides24, to optimally filter out false positives19,25 and finally to infer the correct proteins of origin from the identified peptide sequences26. Moreover, at around the same time, an antibody-based atlas was published as well2, but no study has yet compared the proteomics atlas to the antibody atlas. This thesis will thus use reprocessed public proteomics data to construct a human tissue-based protein expression atlas. This atlas will be analyzed with clustering techniques to discover possible hidden patterns in the data, and will subsequently be compared to the antibody atlas as presented in Uhlén et al., with the aim of discovering possible complementarity between both views on tissue-based proteomics.
In order to aid future automatic (re)annotation of projects with incomplete or missing tissue and cell type metadata, a tissue and cell type predictor will be built with random forests using the proteomics atlas.

2 Materials and Methods

Reprocessed data from the PRIDE database1,50 was used to create the atlas in this work. However, in order to map proteins to tissues, the projects in PRIDE need to be adequately annotated. Unfortunately, this is not always the case, as metadata annotation is often incomplete or, worse, erroneous. To remedy this, tissue information was gathered manually through curation of the original publications and supplementary materials linked to the data in PRIDE. The workflow that leads to a protein-based tissue atlas consists of multiple steps: (i) data retrieval, (ii) data formatting, (iii) data persistence and storage, and (iv) data inference. At the end of the workflow, a compendium or atlas is constructed from the resulting database. The large size of the data calls for a relational database. The processing and handling of the data was programmed in several scripts, using Java and Python.

2.1 Data retrieval

In order to understand the data optimally, various types of metadata are needed aside from the MS-based protein expression and quantification data. Protein expression data was gathered by reprocessing the draft human proteome data sets published by Wilhelm et al.51 (ProteomeXchange accession PXD000865) and Kim et al.52 (ProteomeXchange accession PXD000561) through the ReSpin pipeline. Additional information such as gene ontology, pathways, genetic disorders, and drugs can also be included in further analysis. In order to accommodate this additional knowledge in the database and atlas, specific (online) repositories were queried. By using BioMart53 where possible, large quantities of metadata could be retrieved and combined across multiple resources.

2.1.1 UniProt

Additional protein-level information, including the description and chromosome location of the corresponding genes, was gathered from the Ensembl54 BioMart with the “Homo sapiens genes” data set selected. The filter was set to limit the results to genes with external references to UniProt/SwissProt55 accessions. The attributes “description” and “chromosome name” were selected from the “gene-ensembl” features, and the attribute “UniProt/SwissProt Accession” was selected from the “external-external references” features. Protein sequence lengths were gathered from the UniProt database as a text file export from the December 2015 release. The result file was saved as a csv file.

2.1.2 Gene ontology

The gene ontology56 annotation for UniProtKB/Swiss-Prot proteins was gathered with the Ensembl BioMart. The “Homo sapiens genes” dataset was selected and the filter was set to limit the results to genes with external references to UniProt/SwissProt55 accessions. The attributes “UniProt Gene Name”, “UniProt/TrEMBL Accession” and “UniProt/SwissProt Accession” were selected from the “external-external references” features, and the attributes “GO Term Accession”, “GO Term Name” and “GO domain” were selected from the “external-GO“ features. The result file was saved as a csv file.

2.1.3 Reactome

Pathway information and the proteins concerned were gathered from the Reactome database57 December 2015 release as the “UniProt to all pathways” mapping file. This file was subsequently filtered on Homo sapiens using Microsoft Excel.
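The species filtering done here in Excel could equally be performed programmatically. A minimal sketch with the Python standard library; the column layout of the sample is an assumption for illustration, not the actual Reactome file format:

```python
import csv
import io

# A fabricated three-line sample in an assumed layout:
# uniprot_accession <tab> pathway_id <tab> pathway_name <tab> species
sample = (
    "P12345\tR-HSA-1\tExample pathway\tHomo sapiens\n"
    "Q99999\tR-MMU-2\tExample pathway\tMus musculus\n"
    "P67890\tR-HSA-3\tAnother pathway\tHomo sapiens\n"
)

def human_rows(handle):
    """Keep only rows whose species column equals 'Homo sapiens'."""
    reader = csv.reader(handle, delimiter="\t")
    return [row for row in reader if row[-1] == "Homo sapiens"]

rows = human_rows(io.StringIO(sample))
```

In practice the same filter would be applied to a file handle opened on the downloaded mapping file instead of the in-memory sample.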

2.2 Relational database

The output from the ReSpin pipeline was stored in a relational database prior to analysis. The database itself was created with MySQL Workbench version 6.08 (https://downloads.mysql.com/archives/workbench/) and was updated iteratively to cope with the large variety of data formats added, including the antibody atlas, gene ontology, and the pathway information.

2.3 Data parsing

The different data and external metadata files have different formats, but all need to be stored in the database. Every format requires a custom parser to (i) read the data, (ii) format the data to be compatible with the structure of the database, and (iii) store the data in the database.
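This read-format-store pattern can be sketched as follows. The parsers in this work were written in Java against MySQL; here sqlite3 stands in for the database and the table layout and column names are illustrative only:

```python
import csv
import io
import sqlite3

def parse_and_store(handle, conn):
    """(i) read csv rows, (ii) reshape them to the table structure, (iii) insert them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS peptide (sequence TEXT, spectral_count INTEGER)"
    )
    reader = csv.DictReader(handle)                                      # (i) read
    rows = [(r["sequence"], int(r["spectral_count"])) for r in reader]   # (ii) format
    conn.executemany("INSERT INTO peptide VALUES (?, ?)", rows)          # (iii) store
    conn.commit()
    return len(rows)

# In-memory stand-ins for a data file and the relational database.
data = io.StringIO("sequence,spectral_count\nPEPTIDEK,12\nSAMPLER,3\n")
conn = sqlite3.connect(":memory:")
stored = parse_and_store(data, conn)
```

Each custom parser follows this template, differing only in how step (ii) maps its particular input format onto the database tables.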


The source code for each individual parser can be found at https://github.ugent.be/mmenu/creation_analysis_protein_expression_atlas/tree/master/Java%20code/expressionatlas/src/main/java.

2.3.1 Human draft proteomes

The Kim et al.52 and Wilhelm et al.51 draft human proteome data sets were processed through the ReSpin pipeline. In the Kim et al. data set, the tissue attributes for an assay can be derived from the filename, along with the developmental stage of the tissue (adult or fetal). The Wilhelm et al. data set provided this information in the supplementary materials of the publication. The csv files produced by ReSpin after reprocessing contain identification information, including detected peptides, protein inference, spectrum information, level of confidence, and validation state (Table 2). A custom csv parser was written for these ReSpin output files to extract the information and store it in the relational database. The parser only retains peptides that are statistically validated by PeptideShaker at 1% FDR on the peptide-to-spectrum match level, in order to keep the quality of the identified peptides as high as possible. The source code is made available on the GitHub repository as PandeyParser for the Kim et al. dataset and KusterParser for the Wilhelm et al. dataset.


Table 2: Description of columns in ReSpin output csv files as per documentation exported through the PeptideShaker user interface.

Header                     Description
index                      Automatically generated index number
Protein(s)                 Protein(s) to which the peptide matches
sequence                   The identified sequence of amino acids
Variable Modifications     The variable modifications
Fixed Modifications        The fixed modifications
Spectrum File              Reference to the spectrum file
Spectrum Scan Number       The spectrum scan number
RT                         Retention time as provided in the spectrum file
m/z                        Measured m/z
Measured Charge            The charge as given in the spectrum file
Identification Charge      Charge measured in MS1
Theoretical Mass           Theoretical mass of the assigned sequence and modifications
Isotope Number             The isotope number targeted by the instrument
Precursor m/z Error        The precursor m/z error in ppm (experimental - theoretical)
Decoy                      Indicates whether the peptide is a decoy (1: yes, 0: no)
Localization Confidence    The confidence in variable PTM localization
Probabilistic PTM score    The probabilistic score (e.g. A-score58 or PhosphoRS59) used for variable PTM localization
D-score                    D-score60 for variable PTM localization
Confidence                 Confidence score for the peptide (0-100)
Validation                 Indicates the validation level of the protein group
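The validation filter applied by the parser can be illustrated on rows in this layout; the column names follow Table 2, but the specific values of the Validation column used below are an assumption for illustration:

```python
import csv
import io

def validated_peptides(handle):
    """Keep only target (non-decoy) rows that were marked as validated."""
    reader = csv.DictReader(handle)
    return [
        row["sequence"]
        for row in reader
        if row["Validation"] != "Not Validated" and row["Decoy"] == "0"
    ]

# A fabricated three-row extract in the Table 2 layout (most columns omitted).
sample = io.StringIO(
    "sequence,Decoy,Validation\n"
    "PEPTIDEK,0,Validated\n"
    "DECOYSEQ,1,Validated\n"
    "LOWSCORE,0,Not Validated\n"
)
kept = validated_peptides(sample)
```

Only the first peptide survives: the second is a decoy and the third failed validation.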

2.3.2 Antibody protein atlas dataset

The antibody-based protein atlas is a map of the tissue-based proteome. It combines microarray-based immunohistochemistry protein profiling and quantitative transcriptomics on the tissue and organ level. This atlas can be freely accessed as part of the Human Protein Atlas portal61. The antibody atlas data2 is stored as a very large XML file of several gigabytes in size, making manual inspection nearly impossible. At the time of retrieval of this XML file, the Human Protein Atlas was at version 14 (http://v14.proteinatlas.org/about/download). In order to programmatically extract the information from this file, SAX parser, a specialist Java library for the handling of XML files (https://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/SAXParser.html), was used.
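The same event-driven SAX approach exists in Python's standard library. A minimal sketch on a fabricated two-element fragment; the real Human Protein Atlas XML schema is far richer than this:

```python
import xml.sax

class ProteinHandler(xml.sax.ContentHandler):
    """Collect protein accessions without loading the whole document into memory."""

    def __init__(self):
        super().__init__()
        self.accessions = []

    def startElement(self, name, attrs):
        # SAX fires one callback per element as the file streams past;
        # only 'protein' elements are of interest in this sketch.
        if name == "protein":
            self.accessions.append(attrs["accession"])

handler = ProteinHandler()
xml.sax.parseString(
    b'<atlas><protein accession="P12345"/><protein accession="Q67890"/></atlas>',
    handler,
)
```

Because SAX streams the document instead of building a tree, memory use stays constant even for a multi-gigabyte file, which is precisely why it suits this dataset.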

Inconsistencies in the tissue annotation provided an additional challenge. For example, the tissue identifier is in some cases truncated, leading to duplicate tissue entries in the database. To remedy this, a manual inspection of the table was required to remap the relations between tissues and proteins to the first actual occurrence of each tissue, after which the duplicate tissue entries were collapsed into a single entry. The source code for this parser is made available as AntibodyAtlasParser.

2.3.3 UniProt data

The BioMart result file from section 2.1.1 was parsed with a custom csv parser in Java. Protein descriptions and chromosomes were queried, and the corresponding entries in the relational database were updated with the gathered metadata. The format of the UniProt text file consists of predefined prefixes preceding specific information for every UniProt entry, such as accession number, species of origin, ID, and sequence length. The parser for this file used regular expressions to find all human proteins and gather their UniProt accession numbers and sequence lengths. The proteins in the relational database were then updated with their sequence lengths. The source code for this parser is made available as UniProtParser.

2.3.4 Gene ontology data

The BioMart result file from section 2.1.2 was parsed with a custom csv parser in Java. Unique gene ontology entries were stored in a separate table from the protein-to-gene ontology relations. The source code for this parser is made available as GOParser.

2.3.5 Reactome data

The Reactome database file was parsed with a custom csv parser in Java. Pathways with a unique accession number were stored in a separate table from the protein-to-pathway relations. The source code for this parser is made available as ReactomeParser.


2.4 Atlas creation

Atlas creation was based on the data gathered in the relational database and largely consisted of two steps: data retrieval and quantitative data normalization, followed by the linking of identified proteins to tissues. These two steps were carried out in iPython notebooks and are described in detail below. All source code can be found at https://github.ugent.be/mmenu/creation_analysis_protein_expression_atlas/tree/master/ipython%20notebooks.

2.4.1 Query data and normalization of quantitative values

The first step consisted of retrieving from the database the number of peptides found per assay along with their spectral counts, and the relations between peptides and their parent protein(s). This information is stored in the “peptide_to_assay” table and the “peptide_to_protein” table, respectively. Following data retrieval from the database, two reasonably stringent filtering steps were applied to the compiled data: (i) retain only proteotypic peptides, and (ii) retain only proteins identified by at least three proteotypic peptides. Proteotypic peptides are here defined as peptides that are only detected for one parent protein in mass spectrometry experiments.17 Peptides were quantified using the normalized spectral abundance factor (NSAF)62 to ensure that the different assays were comparable with one another. NSAF also implicitly corrects for differences in protein sequence length, which is important because proteins with longer sequences tend to yield a higher number of detectable peptides. The iPython notebook for this analysis is made available as ‘Atlas Creation 1.ipynb’.

2.4.2 Linking of proteins to tissues

The identified proteins and their quantification values were linked with tissue annotations using the data stored in the “tissue_to_assay” table. The actual atlas was then created by making a pivot table of the proteins with their NSAF values for every tissue. For proteins with several values for the same tissue, the average NSAF value was taken. Missing values were replaced by 0. A minimum-to-maximum normalization was performed on the pivot table for every tissue to correct for differences in absolute expression values between tissues. A heat map representation was then made of the normalized pivot table. This heat map plots the normalized protein expression values for the different tissues, offering an overview of the expression patterns in every tissue for every protein. The iPython notebook for this analysis is made available as ‘Atlas Creation 2.ipynb’.
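The NSAF quantification and the subsequent min-max normalization can be sketched as follows; the spectral counts and sequence lengths below are invented. NSAF divides each protein's spectral count by its length and normalizes by the sum of these ratios over all proteins in the assay:

```python
def nsaf(spectral_counts, lengths):
    """NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j), computed per assay."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

def min_max(values):
    """Scale one tissue column of the pivot table to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical assay: three proteins with spectral counts and sequence lengths.
counts = {"P1": 10, "P2": 5, "P3": 20}
lengths = {"P1": 100, "P2": 50, "P3": 400}
values = nsaf(counts, lengths)            # NSAF values sum to 1 within the assay
scaled = min_max(list(values.values()))   # per-tissue normalization for the heat map
```

Note how P3, despite having the highest raw spectral count, receives the lowest NSAF value because its long sequence is corrected for.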

2.5 Analyses on the atlas

Analyses of the atlas were carried out for the purpose of (i) determining the individual contributions of the Wilhelm et al. and Kim et al. data sets to the atlas, (ii) clustering the protein atlas based on the expression values in tissues to find core and tissue-specific proteins, and visualizing these clusters, (iii) comparing the cross-section of the proteomics atlas and the antibody-based atlas in order to reconcile two different views of tissue-mapped proteomes, and (iv) building predictors for tissues and cell types to annotate data sets lacking metadata. These analyses were carried out in iPython notebooks. All source code can be found at https://github.ugent.be/mmenu/creation_analysis_protein_expression_atlas/tree/master/ipython%20notebooks.

2.5.1 Preliminary analyses

The number of assays retained from each data set was calculated with SQL queries in order to assess the contribution of each individual data set to the atlas. In order to estimate the percentage of peptides lost by only using proteotypic peptides and proteins with at least three proteotypic peptides, the total number of peptides and the number of proteotypic peptides were queried. The number of proteotypic peptides kept after filtering for proteins with at least three proteotypic peptides was calculated in Python. The total number of proteins mapped to every tissue was calculated in Python to investigate the individual contributions of the Kim et al. and Wilhelm et al. data sets to the atlas.

The iPython notebook for this analysis is made available as ‘Atlas Analysis 1 - preliminary analysis PCA & MDS.ipynb’.

2.5.2 Clustering

In order to find proteins with similar expression values in a particular tissue, a clustering analysis was performed. A dendrogram was constructed to find an approximation of the number and sizes of the possible clusters of proteins in the atlas. A better approximation of the number of clusters was then determined by calculating inertia and silhouette scores for different numbers of clusters. Inertia is a measure of the overall density of the clusters obtained after a clustering analysis: the lower the inertia, the higher the average density of the clusters. Silhouette scores are a general measure of the quality of a clustering analysis, combining the average density of the clusters with the distances between elements of the two nearest clusters. The higher the silhouette score, the denser the clusters and the better they are separated from each other. The best number of clusters derived from the inertia and silhouette scores was then used as input for the k-means++ clustering analysis. A PCA was then carried out for a first look at the density of the proteins. The iPython notebook for this analysis is made available as ‘Atlas Analysis 1 - preliminary analysis PCA & MDS.ipynb’.

2.5.3 t-SNE

A PCA plot does not show representative distances between data points, but rather the projections onto the two components that account for the most variance in the data. Therefore, a dimensionality reduction was also performed using t-SNE in order to get a better overview of the different clusters of proteins in the atlas, as t-SNE provides a better compression of high-dimensional data into two- or three-dimensional representations. This analysis was performed iteratively in order to determine optimal hyperparameters such as the perplexity and the number of clusters. The iPython notebook for this analysis is made available as ‘Atlas Analysis 2 - TSNE & cluster analysis.ipynb’.
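The inertia-based scan over candidate cluster numbers can be sketched with a miniature, pure-Python k-means; the notebooks presumably used a library implementation, and the 1-D toy data and deterministic farthest-point seeding here are illustrative only:

```python
def kmeans(points, k, iters=20):
    """Plain k-means on 1-D points with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:  # k-means++-style spreading of the initial seeds
        centroids.append(max(points, key=lambda p: min(abs(p - c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
        # move each centroid to the mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def inertia(centroids, clusters):
    """Sum of squared distances of every point to its assigned centroid."""
    return sum((p - c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)

# Two obvious groups of values; inertia should drop sharply from k=1 to k=2,
# which is the signal used to pick the number of clusters.
points = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
inertias = [inertia(*kmeans(points, k)) for k in (1, 2)]
```

In the real analysis this scan is combined with silhouette scores, since inertia alone always decreases as k grows and only its "elbow" is informative.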


2.5.4 Preparation of the antibody atlas

The antibody atlas data was queried from the “protein_to_tissue” table, which contains the expression levels of the proteins in each “cell_type” (ordinal values, from no expression ‘0’ to high ‘3’). To be able to compare the antibody atlas data to the MS-derived data, organ labels were used instead of tissues or cell types. This implies the existence of multiple expression values for a protein in the same organ. In contrast to the continuous NSAF values, for which averages were calculated, the antibody atlas expression levels are ordinal values; therefore the median expression values were calculated (Table 3).

Table 3: example of NSAF to antibody atlas expression level mappings

uniprot_id    organ_x     NSAF        organ_y     expression_level
Q9NQ75        Placenta    0.037242    Placenta    2.0
P20936        Placenta    0.052490    Placenta    3.0
P23919        Placenta    0.119735    Placenta    3.0
Q00535        Placenta    0.075765    Placenta    1.0
Q00534        Placenta    0.068618    Placenta    2.0

The uniprot_id column displays the SwissProt accession numbers of proteins detected with both mass spectrometry and antibodies in the same tissues. The organ_x and organ_y columns display the names of the tissues for the mass spectrometry and antibody datasets, respectively. The NSAF column displays the NSAF values for the proteins in the different tissues. The expression_level column displays the binned expression levels in the antibody atlas dataset in the different tissues.
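Comparing continuous NSAF values with these ordinal expression levels calls for a rank-based measure such as Spearman correlation. A minimal implementation (Pearson correlation on average ranks, with ties handled), shown on the five placenta values of Table 3:

```python
def ranks(values):
    """1-based average ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 0-based positions, shifted to 1-based
        for pos in range(i, j + 1):
            rank[order[pos]] = avg
        i = j + 1
    return rank

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

nsaf_values = [0.037242, 0.052490, 0.119735, 0.075765, 0.068618]  # MS side (Table 3)
levels = [2.0, 3.0, 3.0, 1.0, 2.0]                                # antibody side
rho = spearman(nsaf_values, levels)
```

Because only the ordering of the values matters, the ordinal antibody bins and the continuous NSAF values can be correlated directly.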

The iPython notebook for this analysis is made available as ‘Atlas Analysis 3 - Preparation HPA data.ipynb’.

2.5.5 Comparison of antibody and mass spectrometry atlases

The intersection of organs between the two datasets was determined in order to have the same features for the proteins. Then the intersection of proteins for every organ was determined in order to obtain a single dataset with no missing values. The Spearman correlations between the NSAF values and the ordinal antibody expression levels were calculated across organs and across proteins. This procedure was also applied to a subset of the data in which the top 10% of NSAF values were excluded, in order to remove strong influences from highly abundant proteins. To verify whether the two separate MS-based proteome studies show a bias with regard to the correlation, the dataset was split into the respective projects. Pearson correlations were then calculated between the Wilhelm et al. and Kim et al. data sets, and Spearman correlations between each of the individual MS data sets and the antibody atlas data set. This could indicate a bias in the contribution of the MS data sets to the cross-section analysis. The iPython notebooks for this analysis are made available as ‘Atlas Analysis 4 - comparison HPA data and MS data.ipynb’, ‘Kuster vs Pandey 1.ipynb’ and ‘Kuster vs Pandey 2.ipynb’.

2.5.6 Tissue prediction models

The Wilhelm et al.51 data set lacks tissue and cell type annotations for a significant portion of its data. This is not an isolated case, as public proteomics data often lack metadata, and the annotation of the sampled tissue is no exception18. Therefore, a tissue predictor was built to automatically annotate data, or provide suggestions of the possible tissues used in an experiment, for future reference. The data used to build the classifier first needed preprocessing. The NSAF-normalized protein data (as described in section 2.4.1) was converted into a pivot table with tissue-annotated assays as rows and proteins as columns, with NSAF values as entries (0 if missing). The dataset was then split into a training/validation set to build the classifier and a test set to test the classifier. In order to significantly reduce the number of features (proteins), a feature selection analysis was carried out. The feature ranks were determined with a random forest classifier. The optimal subset of features was then determined by iteratively calculating the accuracy of a training set on a validation set for multiple subsets, keeping the subset with the highest accuracy as the final subset. The classifier was then tested on the test set, and metrics such as the accuracy and out-of-bag error were calculated. A confusion matrix was made for the predictions on the test set in order to visualize the false positive and false negative rates for every tissue.
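The prediction step can be illustrated end to end with a toy pivot table; a 1-nearest-neighbour rule stands in here for the random forest used in the thesis, and the assays, proteins and NSAF values are all invented:

```python
def euclidean(a, b):
    """Distance between two assay rows of the NSAF pivot table."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict_tissue(train, label_of, assay):
    """Predict the tissue of an assay from its most similar training assay."""
    nearest = min(train, key=lambda name: euclidean(train[name], assay))
    return label_of[nearest]

# Toy pivot table: rows are assays, columns are three proteins (NSAF, 0 if missing).
train = {
    "assay1": [0.6, 0.0, 0.1],
    "assay2": [0.5, 0.1, 0.0],
    "assay3": [0.0, 0.7, 0.2],
}
label_of = {"assay1": "liver", "assay2": "liver", "assay3": "brain"}

# Held-out test assays with known labels, and confusion counts per (true, predicted).
test = {
    "assay4": ([0.55, 0.02, 0.08], "liver"),
    "assay5": ([0.05, 0.60, 0.10], "brain"),
}
confusion = {}
for profile, truth in test.values():
    pred = predict_tissue(train, label_of, profile)
    confusion[(truth, pred)] = confusion.get((truth, pred), 0) + 1
```

The confusion counts collected this way are exactly what the per-tissue confusion matrix of the test-set evaluation visualizes.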
Since the cell type annotation is sometimes missing as well, and in order to make more specific predictions for tissues like lymphoid tissue, which comprises different cell types such as B cells, T cells and NK cells, a more specific predictor for cell types was constructed as well. The same procedure as described above for the tissue classifier was applied.


The iPython notebooks for this analysis are made available as ‘Atlas Analysis 6 – Tissue predictor.ipynb’ and ‘Atlas Analysis 6 – Cell type predictor.ipynb’.


3 Results

3.1 Database

The database currently includes 22 tables (Figure 1). For the creation of, and analyses on, the atlas, the pipeline only relies on nine tables. The other tables are intended to support possible future analyses. The tables used for this study are “project”, “assay”, “peptide_to_assay”, “peptide”, “peptide_to_protein”, “protein”, “tissue_to_assay”, “tissue” and “tissue_to_protein”. The “project” table stores the PRIDE project ids and the related assay ids. The “assay” table stores metadata for the assays, such as the PRIDE assay ids, a description of the assays and the experiment types. In the case of the Wilhelm et al. and Kim et al. datasets, which do not have PRIDE assay ids as they are both partial projects at the time of writing, auto-incremented integers appended to the dataset name are used. The “peptide” table stores the unique peptide sequences detected in the different assays and assigns an auto-incremented integer as peptide id. The start positions of the peptides in the protein sequences are stored as well. The “protein” table stores the UniProt/SwissProt accession numbers, the sequence lengths of the proteins, the chromosome on which the corresponding gene is located, and a description of the proteins. The “tissue” table stores tissue and cell type information: the “cell_type” column stores the cell types used in the experiments, if available, and stores the same name as the used tissue if this information is missing; the “tissue_name” column stores the different types of tissues in the experiments, if available. Every unique combination of tissue and cell type is assigned an auto-incremented id. The primary purpose of the organ column is to make different datasets more comparable by mapping tissues to organs. The “peptide_to_assay” table stores the relations between assays and peptides by storing peptide ids with the assay ids of the assays in which they were detected, along with their spectral count values. The “peptide_to_protein” table stores the mappings between proteins and peptides and serves as the basis for protein inference. The “tissue_to_assay” table stores the tissue annotations of the assays. Finally, the “tissue_to_protein” table stores the data from the antibody-based protein atlas: it maps detected proteins to the tissue ids stored in the tissue table and provides the binned expression values for the proteins in every tissue.


Figure 1: Schema of the relational database. Tables with the same header colour refer to related information. The dotted lines between tables represent one-to-many relationships. The line between the assay and project tables represents a one-to-one relationship. The tables needed for the construction of the atlas (in the blue polygon) were: peptide, peptide_to_protein, proteins, tissue_to_protein, tissue, tissue_to_assay, assay, project and peptide_to_assay.

3.2 Proteomics expression atlas

The atlas contains 12 263 proteins, spread over 28 organs. The heat map in Figure 2 displays the normalized expression values for every protein in every organ. Every organ displays some unique expression patterns, but interestingly, recurring patterns appear as well. A few organs display lower overall expression levels, which could indicate more specialized proteomes in those organs, or could be due to an insufficient sample size and a lack of data.

Figure 2: Heatmap representation of the proteomics expression atlas. The y axis displays proteins, the x axis displays organs. The coloring displays the min to max normalized expression for a given protein in a given organ.
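The min-to-max normalization mentioned in the caption can be sketched with pandas on a made-up pivot table (the protein rows and organ columns below are assumptions):

```python
import pandas as pd

# Toy pivot table: rows are proteins, columns are organs (values assumed).
atlas = pd.DataFrame({"liver": [10.0, 0.0], "brain": [2.0, 4.0]},
                     index=["P1", "P2"])

# Min-max normalise each protein (row) to [0, 1] so expression patterns
# stay comparable across proteins with very different absolute abundances.
lo = atlas.min(axis=1)
hi = atlas.max(axis=1)
normalised = atlas.sub(lo, axis=0).div(hi - lo, axis=0)
print(normalised.loc["P1", "liver"])  # 1.0
```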

3.3 Preliminary analyses

The assay counts (Table 4) reveal that 98 percent of the assays from the Kim et al. data set, and 25 percent of the assays from the Wilhelm et al. data set, were used for the creation of the atlas.


Table 4: Assay counts in the relational database.

Data set                  Total assays   Stored in database   With tissue annotation   Retained
Pandey (Kim et al.)               2093                 2051                     2051        98%
Kuster (Wilhelm et al.)           1315                  998                      323        25%
Total                             3408                 3049                     2374        70%

The first column lists the number of assays provided in the original data sets. The second column lists how many assays contained at least one validated peptide. The third column lists how many of those assays carried tissue metadata and were therefore usable for the creation of the proteomics atlas. The last column gives the percentage of retained assays.

The peptide counts (Table 5) reveal that 92.83 percent of the identified peptides are proteotypic, and that 99.17 percent of these proteotypic peptides can be mapped to proteins with at least three proteotypic peptides.

Table 5: Proteotypic peptides in the relational database.

Number of unique peptides in database                        316 539
Number of proteotypic peptides in database                   293 831
Percentage of proteotypic peptides in database                92.83%
Number of proteotypic peptides used for protein inference    291 403
Percentage of proteotypic peptides used for atlas             99.17%

The first row displays the total number of peptides stored in the database. The second row displays the number of proteotypic peptides. The third row gives the percentage of proteotypic peptides in the database. The fourth row displays the number of proteotypic peptides mapped to proteins with at least three proteotypic peptides. The last row gives the percentage of proteotypic peptides usable for the creation of the atlas.
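The two selection steps behind these counts, keeping only proteotypic peptides and then only proteins carrying at least three of them, can be sketched with pandas (the column names and the toy mapping are assumptions):

```python
import pandas as pd

# Assumed peptide-to-protein mapping; a peptide is proteotypic when it
# maps to exactly one protein.
mapping = pd.DataFrame({
    "peptide": ["A", "A", "B", "C", "D", "E", "F"],
    "protein": ["P1", "P2", "P1", "P1", "P1", "P2", "P2"],
})
n_proteins = mapping.groupby("peptide")["protein"].nunique()
proteotypic = mapping[mapping["peptide"].map(n_proteins) == 1]

# Keep only proteins supported by at least three proteotypic peptides.
counts = proteotypic.groupby("protein")["peptide"].nunique()
kept = counts[counts >= 3].index.tolist()
print(kept)  # ['P1']
```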

The protein counts per tissue (Table 6) reveal that the data sets have twelve tissues in common, and that each has eight unique tissues. For the shared tissues, the majority of proteins are derived from the Kim et al. dataset. The pharynx has the lowest number of proteins mapped to it.


Table 6: Protein counts per tissue.

Tissue             Wilhelm et al.   Kim et al.    Total   Wilhelm %    Kim %
Placenta                     3428         4193     4936       69.45    84.95
Testis                       4258         8345     8638       49.29    96.61
Stomach                      4622            0     4622      100.00     0.00
Pharynx                       578            0      578      100.00     0.00
Ovary                        3423         8155     8339       41.05    97.79
Spleen                       4025            0     4025      100.00     0.00
Lymphatic tissue             3945         8634     9161       43.06    94.25
Adrenal gland                4421         5604     6408       68.99    87.45
Thyroid gland                3226            0     3226      100.00     0.00
Salivary gland               4734            0     4734      100.00     0.00
Oesophagus                   4863         3385     5590       86.99    60.55
Liver                        3587         7193     7472       48.01    96.27
Seminal vesicle              3524            0     3524      100.00     0.00
Uterus                       4007            0     4007      100.00     0.00
Lung                         3061         4345     5101       60.01    85.18
Gall bladder                 2433         4982     5507       44.18    90.47
Prostate                     4168         6255     6721       62.01    93.07
Oral cavity                  2892            0     2892      100.00     0.00
Kidney                       3330         4238     4814       69.17    88.03
Pancreas                     3833         6545     7046       54.40    92.89
Heart                           0         7043     7043        0.00   100.00
Colon                           0         6348     6348        0.00   100.00
Eye                             0         7193     7193        0.00   100.00
Urinary bladder                 0         5253     5253        0.00   100.00
Brain                           0         7208     7208        0.00   100.00
Blood                           0         3946     3946        0.00   100.00
Spinal cord                     0         5232     5232        0.00   100.00
Gut                             0         5316     5316        0.00   100.00
Total                        9480        12159    12253       77.37    99.23

The number of unique proteins detected per tissue. Each row represents a tissue. The Wilhelm et al. and Kim et al. columns display the number of unique proteins in these respective data sets. The total column displays the union of unique proteins in both datasets. The Wilhelm et al. and Kim et al. percentage columns display the percentage of proteins in the total data set that were found in the respective data sets.


3.4 Clustering

The dendrogram in Figure 3 displays the number of proteins in the different clusters. There is one major cluster containing most of the proteins, while the other clusters are smaller, some containing only a single protein. This indicates that the number of clusters for subsequent analyses should not exceed 30: a higher number would lead to even more singletons (clusters containing only one protein) and thus to overfitting of the clustering analysis.

Figure 3: Dendrogram of the protein expression atlas. This dendrogram shows the top 30 hierarchies/clusters. The numbers on the x axis between parentheses represent the sizes of the clusters. Numbers without parentheses represent singletons with the number being the nth row in the pivot table, thus referring to the protein on that row.
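A hierarchical clustering cut at 30 flat clusters, as shown in the dendrogram, can be sketched with SciPy; the random matrix below is a stand-in for the real protein-by-organ atlas:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed stand-in for the protein-by-organ expression matrix.
rng = np.random.default_rng(0)
data = rng.random((100, 28))

# Ward-linkage hierarchical clustering, then cut the tree into at most
# 30 flat clusters, mirroring the top 30 hierarchies in the dendrogram.
tree = linkage(data, method="ward")
labels = fcluster(tree, t=30, criterion="maxclust")
print(len(set(labels)))
```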

The silhouette score plot in Figure 4a displays local minima at 20 and 26 clusters. These numbers are supported by the inertia plot in Figure 4b, which also displays a change in slope at 20 and at 26 clusters. The subsequent PCA analysis was therefore carried out with k-means++ clustering with k = 20.
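The silhouette and inertia scan over candidate values of k can be sketched with scikit-learn; the random data and the small range of k values here are stand-ins, not the actual atlas matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumed stand-in for the protein-by-organ expression matrix.
rng = np.random.default_rng(0)
data = rng.random((200, 28))

# Scan candidate cluster counts; in the thesis the silhouette and inertia
# curves pointed at k = 20 and k = 26.
results = {}
for k in (2, 5, 10):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                   random_state=0).fit(data)
    results[k] = (silhouette_score(data, model.labels_), model.inertia_)

# Inertia always decreases as k grows; the elbow marks diminishing returns.
assert results[10][1] < results[2][1]
```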


Figure 4: 4a plots the silhouette score calculated for the number of clusters ‘k’ used as displayed on the x axis. 4b plots the inertia calculated for the number of clusters ‘k’ used as displayed on the x axis.

The PCA plot in Figure 5 reveals that proteins in the same cluster also tend to aggregate when only the two principal components are considered. The clusters are, however, densely packed, making it hard to distinguish the boundaries between areas of the clusters.

Figure 5: PCA plot of the atlas. Every colour represents a cluster label as calculated by the k-means++ clustering with k = 20.
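The combination of k-means++ labelling and a two-component PCA projection can be sketched as follows (again with random stand-in data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumed stand-in for the protein-by-organ expression matrix.
rng = np.random.default_rng(0)
data = rng.random((200, 28))

# Cluster first, then project onto the first two principal components;
# each point in the scatter plot is a protein coloured by its label.
labels = KMeans(n_clusters=20, init="k-means++", n_init=10,
                random_state=0).fit_predict(data)
coords = PCA(n_components=2).fit_transform(data)
print(coords.shape)  # (200, 2)
```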


3.5 t-SNE

The t-SNE plot (Figure 6) displays the different clusters more clearly than the PCA plot. Proteins of the same cluster tend to aggregate on this plot as well, but t-SNE spreads the data out more, making the boundaries between clusters easier to see.

Figure 6: t-SNE plot of the atlas. Every colour represents a cluster label as calculated in the previous k-means++ clustering analysis. Note that distances between points on a t-SNE plot are not directly proportional to distances in the high-dimensional space: t-SNE preserves local neighbourhood structure rather than global distances.
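A corresponding t-SNE embedding can be sketched with scikit-learn; the random input and the perplexity value are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumed stand-in for the protein-by-organ expression matrix.
rng = np.random.default_rng(0)
data = rng.random((250, 28))

# Perplexity roughly sets the effective neighbourhood size and must stay
# well below the number of samples.
embedding = TSNE(n_components=2, perplexity=30.0,
                 random_state=0).fit_transform(data)
print(embedding.shape)  # (250, 2)
```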

3.6 Comparison of the antibody and mass spectrometry atlases

The comparison between the antibody and mass spectrometry atlases is displayed as histograms of Spearman correlations (Figure 7). The distribution over organs has a mean of 0.03 (SD = 0.09); a two-sided t-test fails to reject the null hypothesis that this mean equals zero (p = 0.10). The distribution over proteins has a mean of 0.30 (SD = 0.56), although additional peaks at the negative and positive maximum correlation values indicate a multimodal distribution. The protein-level correlation is therefore better, although still not high.
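The per-organ Spearman correlations and the one-sample t-test on their mean can be sketched with SciPy; the random matrices stand in for the matched MS-based and antibody-based expression tables:

```python
import numpy as np
from scipy import stats

# Assumed stand-ins: expression of the same proteins (rows) across the
# same organs (columns) in the MS-based and antibody-based atlases.
rng = np.random.default_rng(0)
ms = rng.random((50, 10))
ab = rng.random((50, 10))

# One Spearman correlation per organ (column), then a two-sided
# one-sample t-test of whether the mean correlation differs from zero.
per_organ = [stats.spearmanr(ms[:, j], ab[:, j])[0] for j in range(10)]
t_stat, p_value = stats.ttest_1samp(per_organ, popmean=0.0)
print(len(per_organ), 0.0 <= p_value <= 1.0)
```

Transposing the two matrices and correlating rows instead gives the per-protein distribution shown in panel b.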



Figure 7: a) histogram of the Spearman correlations between organs in the MS and in the antibody data set. The x axis displays bins of correlation scores while the y axis displays the number of organs in that correlation score bin. b) histogram of the Spearman correlations between proteins in the MS and in the antibody data set. The x axis displays bins of correlation scores while the y axis displays the number of proteins within a given correlation score bin.

The comparison between the Kim et al. and Wilhelm et al. data sets is displayed as histograms of Spearman correlations (Figure 8). The correlations between organs have a mean of 0.45 (SD = 0.07). The correlations between proteins have a mean of 0.07 (SD = 0.57), although here too additional peaks at the negative and positive maximum correlation values indicate a multimodal distribution.


Figure 8: a) histogram of the Spearman correlations between organs in the Kim et al. and Wilhelm et al. data sets. The x axis displays bins of correlation scores while the y axis displays the number of organs within that correlation score bin. b) histogram of the Spearman correlations between proteins in the Kim et al. and Wilhelm et al. data sets. The x axis displays bins of correlation scores while the y axis displays the number of proteins within a given correlation score bin.


The comparison between the Wilhelm et al. data set and the antibody atlas, and the comparison between the Kim et al. data set and the antibody atlas, are displayed as histograms of Spearman correlations (Figure 9). The correlations between organs for the Wilhelm et al. and antibody data sets have a mean of -4.3e-05 (SD = 0.18). The correlations between proteins for the Wilhelm et al. and antibody data sets have a mean of 0.04 (SD = 0.58), although additional peaks at the negative and positive maximum correlation values indicate a multimodal distribution. The correlations between organs for the Kim et al. and antibody data sets have a mean of 0.04 (SD = 0.04). The correlations between proteins for the Kim et al. and antibody data sets have a mean of 0.06 (SD = 0.53), again with additional peaks at the negative and positive maximum correlation values indicating a multimodal distribution.



Figure 9: a) histogram of the Spearman correlations between organs in the Wilhelm et al. and antibody datasets. The x axis displays bins of correlation scores while the y axis displays the number of organs within a given correlation score bin. b) histogram of the Spearman correlations between proteins in the Wilhelm et al. and antibody datasets. The x axis displays bins of correlation scores while the y axis displays the number of proteins within a given correlation score bin. c) histogram of the Spearman correlations between organs in the Kim et al. and antibody datasets. The x axis displays bins of correlation scores while the y axis displays the number of organs within a given correlation score bin. d) histogram of the Spearman correlations between proteins in the Kim et al. and antibody datasets. The x axis displays bins of correlation scores while the y axis displays the number of proteins within a given correlation score bin.

3.7 Tissue prediction models

3.7.1 Tissue predictor

Out of the original 12 253 proteins, only 410 were kept as the final feature subset for the classifier. The out-of-bag score for the tissue predictor was 0.937 and the accuracy on the test set was 0.933. The confusion matrix (Figure 10) displays the efficacy of the predictor for every tissue individually; the only outlier on the matrix is the nasopharynx.


Figure 10: confusion matrix for the tissue predictor. Squares on the diagonal line represent true positive predictions for those tissues. Horizontally shifted squares represent false negatives for the tissue label and false positives for the predicted tissue.
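The evaluation setup described above, a random forest with an out-of-bag estimate plus a confusion matrix on a held-out test set, can be sketched with scikit-learn; the synthetic NSAF-like matrix and the injected per-class marker features are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Assumed stand-in for the assay-by-protein NSAF matrix: one synthetic
# marker feature per class makes the problem learnable.
rng = np.random.default_rng(0)
X = rng.random((300, 40))
y = rng.integers(0, 3, size=300)
X[np.arange(300), y] += 2.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# oob_score=True reports accuracy on the bootstrap out-of-bag samples,
# a built-in validation estimate alongside the held-out test set.
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
print(clf.oob_score_ > 0.9, cm.shape)  # True (3, 3)
```

Off-diagonal entries of `cm` correspond to the horizontally shifted squares described in the caption.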

3.7.2 Cell type predictor

Out of the original 12 253 proteins, only 710 were kept as the final feature subset for the classifier. The out-of-bag score for the cell type predictor was 0.908 and the accuracy on the test set was 0.924. The confusion matrix (Figure 11) displays the efficacy of the predictor for every cell type individually; the only outlier on the matrix is the nasopharynx.


Figure 11: confusion matrix for the cell type predictor. Squares on the diagonal line represent true positive predictions for those cell types. Horizontally shifted squares represent false negatives for the cell type label and false positives for the predicted cell type.


4 Discussion

4.1 Re-using the pipeline on additional data in the future

The original purpose of this study was to build an atlas from the whole of PRIDE and to compare this atlas to the Kim et al. and Wilhelm et al. data sets, both of which were created for the sole purpose of charting a comprehensive human proteome. However, due to the large amount of computational power required, and issues with load and optimization of the tools in the reprocessing pipeline, the complete PRIDE data could not be retrieved in time and the reanalysis is still ongoing. Nevertheless, the parsers and the database were constructed with this original goal in mind, and the addition of any future data from PRIDE is therefore semi-automatic. The analysis scripts are founded on the structure of the database, decoupling the analyses from the actual data origins. Rerunning these analyses with new data will therefore not require any changes to the analysis pipeline.

4.2 Quality of mass spectrometry datasets

The Wilhelm et al. data set resulted in 998 result files containing at least one validated peptide. Of those 998 usable assays, only 323 came with a tissue annotation. This is in contrast to the Kim et al. data set, which had all of its assays annotated, most of them containing at least one validated peptide. The atlas is therefore composed mostly of data originating from the Kim et al. data set. A major consequence is the low amount of data for tissues that were exclusively studied in the Wilhelm et al. data set, such as the nasopharynx. The percentage of proteotypic peptides usable for the creation of the atlas from the reprocessed Wilhelm et al. and Kim et al. data sets was 99.17%, showing that despite the stringent criterion for protein inference, namely keeping only proteins with at least three proteotypic peptides, most of the peptides were still available. The protein counts for every tissue in each data set revealed that each of the data sets included twenty different tissues. Only twelve of these were shared between the two data sets, and each data set therefore contributed eight unique tissues. For the shared tissues, the Kim et al. data set generally contributed more proteins, which can be explained by the loss of most assays in the Wilhelm et al. data set.

4.3 Clustering and visualization

The clustering analysis shows that about 9 420 of the 12 253 proteins cluster together. This cluster, which contains roughly three quarters of all proteins, could be considered the core proteome, i.e., the proteins expressed in most (if not all) of the human tissues. Although the t-SNE plot shows a grouping of proteins belonging to the same clusters, it does not separate different clusters over large distances (even at high perplexities). This suggests that although the data can be clustered, the differences between clusters are very small, making it difficult to clearly separate them from one another. This could be a feature of proteome data in general or a result of the low quality of the data used for the atlas.

4.4 Comparison of the MS-based and antibody-based atlases

The comparison between the mass spectrometry based and antibody based atlases for the same proteins in the same tissues reveals a general lack of correlation. Moreover, the comparisons between each of the mass spectrometry data sets and the antibody atlas show that both data sets correlate poorly with the antibody atlas, excluding the possibility of bias in just one of the MS data sets. The comparison between the Wilhelm et al. and Kim et al. data sets reveals a weak correlation between tissues, but a protein-level distribution similar to the one observed against the antibody atlas. This could suggest that these techniques cannot be readily compared quantitatively, or that the quality of one or more of these data sets is quite poor.

4.5 Tissue and cell type prediction models

The tissue predictor performed quite well using only a small subset of proteins. This suggests that tissues can be separated from one another purely on the basis of differences in their protein expression profiles. The small number of features used by the classifier also suggests that even small differences in expressed proteins can characterize different tissues. The high importance of such tissue-typic proteins could stem from their unique expression in a single tissue, from their implication in a key function of a single tissue, or from their specific expression values in different tissues. Further analyses could help answer this interesting question.

Although the classifier has a good accuracy and OOB score, these metrics do not reveal detailed information on the rate of false positives and false negatives for every tissue. The confusion matrix plot remedies this and in doing so reveals additional information. The main outlier is the nasopharynx, which has no true positive predictions at all. This could be explained by the lack of nasopharynx data due to the incomplete annotation of the Wilhelm et al. data set. Looking at the other false positives and false negatives, misclassified assays are usually derived from quite similar tissues, such as colon and rectum (parts of the same organ with only minor histological differences), or lymphoid tissue and spleen (both components of the immune system consisting of the same cell types). This indicates that the signal found in the tissue-typic proteins is likely to be real, and of some biological significance.

The cell type classifier performed comparably, although it required more proteins to differentiate cell types than the tissue classifier needed to distinguish tissues. This was to be expected, due to the higher number of possible classes, which in turn leads to a higher chance of false positives or false negatives. Furthermore, differences between cell types may well be even more subtle than differences between tissues, which may in turn require more protein identifications. It would be interesting to look at the additional proteins used in the classification of cell types as compared to tissues, because these proteins are potentially biologically relevant.

Due to lack of time, an optimization of the random forest hyperparameters (e.g. the number of trees) could not be performed. Selecting the optimal number of trees could further improve the performance on the test sets. More data (especially for tissues or cell types such as the nasopharynx) could also be a major enhancement for the classifiers, so including additional data from PRIDE in the analysis would be very useful. Indeed, the classifiers were trained on data from only two large-scale projects; adding data from PRIDE could increase the number of tissues covered as well as the amount of data per tissue, which could in turn increase the overall performance of the classifiers.
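The hyperparameter optimization mentioned here can be sketched as a cross-validated grid search over the number of trees; the toy data and the parameter grid below are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumed toy data with a single informative feature.
rng = np.random.default_rng(1)
X = rng.random((150, 20))
y = rng.integers(0, 2, size=150)
X[:, 0] += y

# Cross-validated search over the number of trees, the hyperparameter
# singled out above; other forest parameters could join the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [50, 100, 200]},
                      cv=3).fit(X, y)
print(search.best_params_)
```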


4.6 Future studies

4.6.1 Inclusion of PRIDE data

A possible addition to this study would be to achieve the original goal of also using all human data from the PRIDE database in the construction of the proteomics atlas. The database and the parsers to feed results into it have already been created, and because the analyses are based on the data formats of the database, the only missing ingredient is the ReSpin output from PRIDE itself. With the PRIDE data in the database, it would also become possible to compare the atlases derived from PRIDE data with those derived from the Wilhelm et al. and/or the Kim et al. data sets.

4.6.2 Optimization of the prediction models

The prediction models could be further optimized in several ways. Adding data from PRIDE would allow for better training of the models, cross-validation and more robust testing on truly external data. A database updated with this new data could also improve the coverage of tissues with few assays, such as the nasopharynx, and expand the pool of tissues. Increased volumes of data would also make it possible to carry out a hyperparameter optimization of the number of trees in the forest. Another option is to adopt more computationally intensive, but also more accurate, machine learning methods.

4.6.3 Pathways and protein functional shifts

Reactome data was downloaded, parsed and stored in the database, but due to lack of time no analyses could be carried out on pathways in the atlas. Aside from defining tissue-specific pathways, the pathway information could be used to study shifts in protein function. Proteins involved in several pathways carry out different functions depending on which pathway is active in a cell. By defining pathways specific to certain tissues, the proteins involved in those pathways could be studied to find shifts in their functions, which in turn could reduce the complexity of molecular biological research.

4.6.4 External datasets

Other than Reactome, gene ontology data and OMIM data have been stored in the relational database. All these data could be used for analyses such as investigating differences in the distributions of gene ontology tags across the different atlas clusters. The OMIM data could be used to carry out a protein-based diseasome study with the additional tissue information. Data from DrugBank could be linked to pathways and protein-protein interactions to study possible drug side effects.


5 Conclusions

The Wilhelm et al. and Kim et al. data sets were successfully reused to create a proteomics expression atlas. The reprocessing of the data and the filtering through peptide- and protein-level selection criteria revealed that the Wilhelm et al. data set has a lower overall quality than the Kim et al. data set. The clustering analysis revealed that even though proteins can be clustered based on their tissue expression values, these clusters remain close to each other, making them difficult to distinguish. The intersection of tissues and proteins in the mass spectrometry and antibody atlases was not comparable. Moreover, the individual mass spectrometry data sets also did not correlate strongly with the antibody atlas. This could imply low data quality in one or more of the data sets, or a general lack of correlation between both techniques, hinting at currently unknown biases in the experimental methods. The tissue and cell type of origin can be predicted for an assay based on observed protein NSAF values; typically, several hundred proteins provide sufficient resolution for this. Finally, this work is a preliminary study that clearly has potential for extensive follow-up research. By including more data, the developed predictors can be improved. Other applications will also come within reach, such as setting up an atlas that contains and links together valuable information on reaction pathways, genetic diseases and drug-related interactions.


6 References

1. Jones, P. et al. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 34, D659–63 (2006).
2. Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).
3. Domon, B. & Aebersold, R. Mass spectrometry and protein analysis. Science 312, 212–7 (2006).
4. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
5. Zhang, Y., Fonslow, B. R., Shan, B., Baek, M.-C. & Yates, J. R. Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 113, 2343–94 (2013).
6. Pappin, D. J., Hojrup, P. & Bleasby, A. J. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3, 327–32 (1993).
7. Shteynberg, D., Nesvizhskii, A. I., Moritz, R. L. & Deutsch, E. W. Combining results of multiple search engines in proteomics. Mol. Cell. Proteomics 12, 2383–93 (2013).
8. Bantscheff, M., Schirle, M., Sweetman, G., Rick, J. & Kuster, B. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 389, 1017–31 (2007).
9. Wiese, S., Reidegeld, K. A., Meyer, H. E. & Warscheid, B. Protein labeling by iTRAQ: A new tool for quantitative mass spectrometry in proteome research. Proteomics 7, 340–350 (2007).
10. Thompson, A. et al. Tandem Mass Tags: A Novel Quantification Strategy for Comparative Analysis of Complex Protein Mixtures by MS/MS. Anal. Chem. 75, 1895–1904 (2003).
11. Ong, S.-E. Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics. Mol. Cell. Proteomics 1, 376–386 (2002).
12. Brönstrup, M. Absolute quantification strategies in proteomics based on mass spectrometry. Expert Rev. Proteomics 1, 503–12 (2004).
13. Griffin, N. M. et al. Label-free, normalized quantification of complex mass spectrometry data for proteomic analysis. Nat. Biotechnol. 28, 83–89 (2010).
14. Lundgren, D. H., Hwang, S.-I., Wu, L. & Han, D. K. Role of spectral counting in quantitative proteomics. Expert Rev. Proteomics 7, 39–53 (2010).
15. Morgner, N., Kleinschroth, T., Barth, H.-D., Ludwig, B. & Brutschy, B. A novel approach to analyze membrane proteins by laser mass spectrometry: from protein subunits to the integral complex. J. Am. Soc. Mass Spectrom. 18, 1429–38 (2007).
16. Arnott, D. et al. Selective detection of membrane proteins without antibodies: a mass spectrometric version of the Western blot. Mol. Cell. Proteomics 1, 148–56 (2002).
17. Huang, T., Wang, J., Yu, W. & He, Z. Protein inference: a review. Brief. Bioinform. 13, 586–614 (2012).
18. Verheggen, K. & Martens, L. Ten years of public proteomics data: How things have evolved, and where the next ten years should lead us. EuPA Open Proteomics 8, 28–35 (2015).
19. Vizcaíno, J. A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–6 (2014).
20. Deutsch, E. W., Lam, H. & Aebersold, R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 9, 429–34 (2008).
21. Kapushesky, M. et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res. 38, D690–8 (2010).
22. Matic, I., Ahel, I. & Hay, R. T. Reanalysis of phosphoproteomics data uncovers ADP-ribosylation sites. Nat. Methods 9, 771–772 (2012).
23. Hahne, H. & Kuster, B. Discovery of O-GlcNAc-6-phosphate modified proteins in large-scale phosphoproteomics data. Mol. Cell. Proteomics 11, 1063–9 (2012).
24. Hulstaert, N. et al. Pride-asap: automatic fragment ion annotation of identified PRIDE spectra. J. Proteomics 95, 89–92 (2013).
25. Vaudel, M., Barsnes, H., Berven, F. S., Sickmann, A. & Martens, L. SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 11, 996–9 (2011).
26. Vaudel, M. et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol. 33, 22–24 (2015).
27. Tabb, D. L., Fernando, C. G. & Chambers, M. C. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J. Proteome Res. 6, 654–61 (2007).
28. Fenyo, D. & Beavis, R. C. A Method for Assessing the Statistical Significance of Mass Spectrometry-Based Protein Identifications Using General Scoring Schemes. doi:10.1021/ac0258709
29. Risk, B. A., Edwards, N. J. & Giddings, M. C. A peptide-spectrum scoring system based on ion alignment, intensity, and pair probabilities. J. Proteome Res. 12, 4240–7 (2013).
30. Verheggen, K., Barsnes, H. & Martens, L. Distributed computing and data storage in proteomics: many hands make light work, and a stronger memory. Proteomics 14, 367–77 (2014).
31. Giardine, B. et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 15, 1451–5 (2005).
32. Verheggen, K. et al. Pladipus Enables Universal Distributed Computing in Proteomics Bioinformatics. J. Proteome Res. 15, 707–12 (2016).
33. Shvachko, K., Kuang, H., Radia, S. & Chansler, R. The Hadoop Distributed File System.
34. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: Cluster Computing with Working Sets.
35. Dean, J. & Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters.
36. Codd, E. F. A Relational Model of Data for Large Shared Data Banks.
37. Wang, H., Ma, C. & Zhou, L. A Brief Review of Machine Learning and Its Application. in 2009 International Conference on Information Engineering and Computer Science 1–4 (IEEE, 2009). doi:10.1109/ICIECS.2009.5362936
38. Kotsiantis, S. B. Supervised Machine Learning: A Review of Classification Techniques. Informatica 31, 249–268 (2007).
39. Niuniu, X. & Yuxun, L. Review of Decision Trees.
40. Denil, M., Matheson, D. & De Freitas, N. Narrowing the Gap: Random Forests In Theory and In Practice.
41. Kanungo, T. et al. An Efficient k-Means Clustering Algorithm: Analysis and Implementation.
42. Murtagh, F. & Contreras, P. Algorithms for hierarchical clustering: an overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2, 86–97 (2012).
43. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. (1995).
44. Van Der Maaten, L., Postma, E. & Van Den Herik, J. Dimensionality Reduction: A Comparative Review. Tilburg centre for Creative Computing (2009).
45. Van Der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
46. Saeys, Y., Inza, I. & Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–17 (2007).
47. Vogel, C. & Marcotte, E. M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 13, 227–32 (2012).
48. Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
49. Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
50. Gonnelli, G., Hulstaert, N., Degroeve, S. & Martens, L. Towards a human proteomics atlas. Anal. Bioanal. Chem. 404, 1069–77 (2012).
51. Wilhelm, M. Mass-spectrometry-based draft of the human proteome. Nature 509, (2014).
52. Kim, M.-S. A draft map of the human proteome. Nature 509, (2014).
53. Smedley, D. et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589–98 (2015).
54. Birney, E. et al. An overview of Ensembl. Genome Res. 14, 925–8 (2004).
55. Wu, C. H. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34, D187–91 (2006).
56. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–9 (2000).
57. Joshi-Tope, G. et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 33, D428–32 (2005).
58. Beausoleil, S. A., Villén, J., Gerber, S. A., Rush, J. & Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 24, 1285–92 (2006).
59. Taus, T. et al. Universal and confident phosphorylation site localization using phosphoRS. J. Proteome Res. 10, 5354–62 (2011).
60. Vaudel, M. et al. D-score: a search engine independent MD-score. Proteomics 13, 1036–41 (2013).
61. Uhlén, M. et al. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol. Cell. Proteomics 4, 1920–32 (2005).
62. Zybailov, B. et al. Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J. Proteome Res. 5, 2339–47 (2006).

7 Supplementary materials

The database schema, Java code and iPython notebooks can be found at https://github.ugent.be/mmenu/creation_analysis_protein_expression_atlas.git