Key words

1 Introduction

High-resolution mass spectrometry-based proteomics, aided by computational sciences, is continuously pushing the boundaries of systems biology. Obtaining highly accurate quantitative proteomes on a genome-wide scale is becoming feasible within realistic measurement times [1]. Similar to the clinical goals of genomics and transcriptomics to provide a deeper understanding of a certain disease that goes beyond the standard clinical parameters of cancer diagnosis, proteomics offers a comprehensive view of the molecular players in a cell at a particular moment and in a specific state [1]. The maturation of the technology together with the development of suitable methods for quantification of human tissue proteomes [2,3,4] has opened new doors for employing proteomics in medical applications and is shaping the growing field of clinical proteomics [5, 6]. Following these advances, proteomic approaches have been used to address multiple clinical questions in the context of various cancer types. The major area of application is the profiling of cancer-relevant tissues—including the proteomes of colorectal cancer [7, 8] and prostate cancer [9], as well as the subtyping of lymphoma [10] and breast cancer [11, 12] patients. Although proteomics has become an extremely powerful approach for studying biomedical questions, offering unique advantages compared to other omics techniques, the functional interpretation of the vast amounts of data of a typical proteomics experiment often poses analytical challenges to the biological domain experts.

The aim of data analysis is to translate large amounts of proteomic data that cover numerous samples, conditions and time points into structured, domain-specific knowledge that can guide clinical decisions (Fig. 1). Prior to any statistical analysis, data cleansing is usually performed which includes normalization, to ensure that different samples are comparable, and missing value handling to enable the use of methods that require all data points to be present. A plethora of imputation methods developed for microarray data [13] can be applied to proteomics as well [14]. Among these, methods with the underlying assumption that missing values result from protein expression that lies under the detection limit of modern mass spectrometers are frequently used. A typical task of clinical proteomics studies is to identify proteins that show differential expression between healthy and diseased states or between different subtypes of a disease. Although commonly established statistical methods, which achieve this task exist, distinguishing between expression differences due to technical variability, genetic heterogeneity, or even intra-sample variability and true disease-related changes require deep knowledge of statistical tools and good understanding of the underlying problems in the analysis of omics data.

Fig. 1
figure 1

Outline of a typical analysis workflow in Perseus. The workflow shows the process of converting data into information and knowledge. Statistical analysis can be used to guide the identification of biologically relevant hits and drive hypotheses generation. Various external databases, annotation sources, and multiple omics types can be loaded and matched within the software and together with powerful enrichment techniques allow for smooth data integration

For instance, testing thousands of proteins for differential expression is hampered by the multiple hypothesis-testing problem, which results in an increased probability of calling a protein a significant hit when there is no actual difference in expression. This necessitates the use of correction methods to increase the confidence of the identified hits. The choice of the appropriate correction method depends on the balance between wrongly accepted hits (error type I) and wrongly rejected hits (error type II) that an experimentalist is willing to accept. For instance, permutation-based FDR [15] has a reduced error type II rate compared to the Benjamini-Hochberg correction [16]. Once the initial list of quantified proteins is narrowed down to only the significantly changing hits the question of their functional relevance arises. Enrichment analysis of protein annotations is the preferred method for deriving functional implications of sets of proteins and is applicable to both categorical (Fisher’s exact test [17]) and expression/numerical data (1D enrichment test [18]). The outcome of such an analysis often offers a comprehensive view of the biological roles of the selected proteins through highlighting key pathways and cellular processes in which they are involved.

In this chapter, we provide a step-by-step workflow of bioinformatic analysis of proteomics data of luminal-type breast cancer progression. Commonly used analytical practices are described including data cleansing and preprocessing, exploratory analysis, statistical methods and guidelines, as well as functional enrichment techniques. All the steps are implemented as processes in Perseus [19], a comprehensive software for functional analysis of omics data.

2 Materials

2.1 Software Download and Installation

Written in C#, Perseus achieves optimal performance when run on Windows operating systems. The latest versions require 64 bit system and .NET Framework 4.5 to be installed on the same computer. To use the software on MacOS set up BootCamp and optionally in addition Parallels. Registration and acceptance of the Software License Agreement are required prior to downloading Perseus from the official website: http://www.coxdocs.org/doku.php?id=perseus:start . Once the download has finished, decompress the folder, locate the Perseus.exe file, and double-click it to start the program.

2.2 Data Files

In the subsequent analysis, we used a subset of the data measured by Pozniak et al. [20]. The authors provide a genome-wide proteomic analysis of progression of breast cancer in patients by studying major differences at the proteome level between healthy, primary tumor, and metastatic tissues. The data were measured as ratios between an optimized heavy-labeled mix of cell lines representing different breast cancer stages and the patient proteome [2]. This constitutes an accurate relative quantification approach used especially in the analysis of clinical samples. Peptide and protein identification and quantification was performed using the MaxQuant suite for the analysis of raw mass spectrometry data [21] at peptide spectrum match and protein false discovery rate of 1%. The subset used in this protocol contains proteome profiles of 22 healthy, 21 lymph node negative, and 25 lymph node metastatic tissue samples and spans over 10,000 protein groups and can be found in the proteinGroups.txt file provided as supplementary material to the Pozniak et al. study (see Note 1 ).

3 Methods

The Methods section contains several modules covering the most frequently performed steps in the analysis of proteomics data. Often, a proteomics study benefits from a global overview of the data, which usually includes the total number of identified and quantified proteins, dynamic range, coverage of specific pathways, and groups of proteins. A good practice in data analysis is to start with exploratory statistics in order to check for biases in the data, undesirable outliers, and experiments with poor quality data and to make sure that all requirements for performing the subsequent statistical tests are met. Once the data are filtered and normalized appropriately, statistical and bioinformatic analyses are performed in order to identify proteins that are likely to be functionally-important. When the list of such proteins is small enough and direct links to the question of interest can be inferred using prior knowledge, follow-up experiments can be performed after this step to confirm the results of the statistical analysis. However, one of the advantages of mass spectrometry-based proteomics is the ability to unravel new discoveries in an unbiased way, for instance, through functional analysis. This analysis is often based on enrichment tests, which can highlight guiding biological processes and mechanisms.

3.1 Loading the Data

  1. 1.

    Go to the “Load” section in Perseus and click the “Generic matrix upload” button.

  2. 2.

    In the pop-up window, navigate to the file to be loaded (see Note 2 ).

  3. 3.

    Select all the expression columns and transfer them to the Main columns window (see Note 3 ). Select all additional numerical data that may be needed in the analysis and transfer them to the Numerical columns window. Make sure that the columns containing identifiers (e.g., protein IDs) are selected as Text columns. Click ok.

3.2 Summary Statistics

Get familiar with the Software and its five main sections: Load, Processing, Analysis, Multi-processing, and Export (see Fig. 2).

  1. 1.

    In the workflow panel, change the name of the data matrix from matrix 1 to InitialData by right-clicking the node and changing the Alternative name box. Close the pop-up window. Explore the right-most panel of Perseus, which contains useful information such as number of main columns and number of rows.

  2. 2.

    Go to “Processing ➔ Filter rows ➔ Filter rows based on categorical column” to exclude proteins identified by site, matching to the reverse database or contaminants (see Note 4 ).

  3. 3.

    Transform the data to a logarithmic scale by going to “Processing ➔ Basic ➔ Transform” and specifying the transformation function (e.g., log2(x)).

  4. 4.

    In the “Processing” section, select the “Basic” menu and click on the “Summary statistics (columns)” button. Select all expression columns by transferring them to the right-hand side. Click ok and explore the new matrix.

Fig. 2
figure 2

Interfaces of Perseus and the augmented data matrix format. (a) Perseus extends over five interfaces, each of which includes various analysis and transformation functionalities and visualization possibilities. (b) Experimental design is specified as annotation (e.g., treatment vs. control groups) or numerical rows (e.g., variable concentration). Multiple annotation rows can be specified that allow biological and technical replicates to be analyzed together. (c) The data is organized in a matrix format where typically all samples are displayed as columns and all proteins as rows. (d) Additional protein information can be added in the form of Numerical, Categorical, or Text annotations

3.3 Filtering

  1. 1.

    Use the workflow window to select the InitialData matrix data by clicking on it (see Note 5 ).

  2. 2.

    In the “Processing” section, go to the “Filter rows” menu and select “Filter rows based on valid values.” Change the Min. valids parameter to Percentage and keep the default value of 70% for the Min. percentage of values parameter. Click ok. Check how many protein groups were retained after the filtering (see Note 6 ).

3.4 Exploratory Analysis

  1. 1.

    To visually inspect the data, go to “Analysis ➔ Visualization ➔ Histograms.” Select all the samples of interest by transferring them to the right-hand side. Click ok.

  2. 2.

    Explore the visualization options in the Histogram panel by testing the functionality of each of the buttons (e.g., Properties, Fit width, Fit height).

  3. 3.

    Click on the pdf button to export the plot (see Note 7 ).

  4. 4.

    Switch the view to the “Data” tab.

  5. 5.

    Go to “Analysis ➔ Visualization ➔ Multi scatter plot.” Select the desired samples by transferring them to the right-hand side. Click ok (see Fig. 3).

  6. 6.

    Adjust the plot using the Fit width and Fit height options and resizing the plot window.

  7. 7.

    In the drop-down menu “Display in plots” in the plot window, select Pearson correlation.

  8. 8.

    Select a scatter plot by clicking on it. The selected plot will be shown in an enlarged view.

  9. 9.

    Select a number of proteins from the “Point” table on the right of the multi scatter plot and examine their position in all pairwise sample comparisons.

  10. 10.

    Switch back to the “Data” tab to continue with the analysis.

  11. 11.

    “Go to Processing ➔ Basic ➔ Column correlation.” Make sure that the Type is set to Pearson correlation. The output table contains all pairwise correlations between the selected columns.

  12. 12.

    To visualize the sample correlations, go to “Analysis ➔ Clustering/PCA ➔ Hierarchical clustering.” Use the Change color gradient to set a continuous gradient similar to the one in Fig. 3a.

  13. 13.

    Export the plot by clicking on the pdf button.

  14. 14.

    Navigate back to the previous data matrix by clicking on it in the workflow panel.

  15. 15.

    Principal component analysis requires all values to be valid. To remove all protein groups with missing values, repeat Subheading 3.3, step 2 setting the percentage parameter to 100 (see Note 8 ).

  16. 16.

    Go to “Analysis ➔ Clustering/PCA ➔ Principal component analysis” and click ok. Explore the sample separation (dot plot in the upper panel) and the corresponding loadings (dot plot in the lower panel).

  17. 17.

    In the table on the right of the PCA plot, select a set of samples (e.g., all samples that belong to one experimental condition) and change their color by clicking on the Symbol color button and selecting the desired color.

  18. 18.

    Check the contribution of other components by substituting Component 1 and 2 with other components from the drop-down menu. Find the components that show sample separation according to the experimental conditions (see Fig. 3c).

  19. 19.

    Explore the proteins driving this separation. In the loadings plot beneath the PCA, change the selection Mode to rectangular selection. Hold the left mouse key down and draw a rectangle around the dots in the upper right corner and then release the mouse. The selected proteins are highlighted in the table to the right and their labels are displayed in the plot.

Fig. 3
figure 3

Exploratory analysis outputs in Perseus. (a) Hierarchical clustering of all the samples based on the correlation coefficients between them reveals higher similarity between primary and metastatic tumors versus healthy tissue samples. (b) Multi-scatter plot of averaged profiles among the three main groups clearly represents the disease progression by highlighting strong similarities between subsequent stages, e.g., healthy tissue samples are more similar to primary tumors than to metastasis (correlation coefficient 0.76 vs 0.69), whereas primary tumors are most similar to metastasis (R = 0.94). The category Cell division is highlighted in bright green in all pairwise comparison plots. (c) Principal component analysis (PCA) attributes the largest variance to the difference between healthy (blue dots) and cancer tissues (pink and red dots) (Component 1, 21.1%) and shows that primary and metastatic tumors (pink and red dots respectively) are difficult to distinguish

3.5 Normalization

  1. 1.

    Navigate back to the data matrix before filtering for 100% valid values (Subheading 3.3, step 2).

  2. 2.

    Go to “Processing ➔ Normalization ➔ Z-score.” Change the Matrix access parameter to Columns and select the Use median option. In the new data table, plot histograms for the same subset of samples as in Subheading 3.4, step 1 (see Note 9 ).

3.6 Experimental Design

  1. 1.

    Go to “Processing ➔ Annot. rows ➔ Categorical annotation rows.” Use the Create action option to manually specify the experimental condition to which a sample belongs (i.e., indicate control versus stimulus, or different stages of a disease). All the samples belonging to one condition should have the same annotation. A new row will be added under the column names in the newly generated data matrix (see Note 10 ).

3.7 Loading Annotations

  1. 1.

    Go to the drop-down menu indicated with a white arrow at the top left corner of Perseus and select “Annotation download.”

  2. 2.

    Click on the link in the pop-up window. Select the appropriate annotation file (e.g., “PerseusAnnotaion ➔ FrequentlyUsed ➔ mainAnnot.homo_sapiens.txt.gz,” if the organism of interest is homo sapiens).

  3. 3.

    Download the file to the Perseus/conf/annotations folder.

  4. 4.

    Go to “Processing ➔ Annot. columns ➔ Add annotation.” Select the file from the previous step as a Source.

  5. 5.

    Set the UniProt column parameter to the column that contains UniProt identifiers. These identifiers will be used for overlaying the annotation data with the expression matrix (e.g., Protein IDs).

  6. 6.

    Select several categories of interest to be overlaid with the main matrix and move them to the right-hand side. Click ok.

3.8 Differential Expression Analysis

  1. 1.

    Go to “Processing ➔ Tests.” From the menu select the appropriate test. For the data set used in this chapter, the Multiple-sample tests option should be chosen, as there are more than two conditions that are compared. The default parameters do not have to be changed (see Note 11 ).

  2. 2.

    Specify the categorical row that contains information about the experimental conditions of the samples that will be used in the differential analysis in the Grouping parameter.

  3. 3.

    Keep the default value of 0 for the S0 parameter, to use the standard t-test statistic. Change the parameter to use the modified test statistic approach described by Tusher et al. [15].

  4. 4.

    Select the multiple hypothesis testing correction method to be used by specifying the Use for truncation parameter (see Note 12 , Fig. 4a).

  5. 5.

    Specify if a suffix should be added to the output columns produced by Perseus. This option is relevant when multiple tests are conducted, e.g., with different parameter settings, as it helps to distinguish between them in the output table.

  6. 6.

    Inspect the output table. It contains three new columns: ANOVA significant, −Log ANOVA p-value, and ANOVA q-value (see Note 13 ).

  7. 7.

    Go to “Processing ➔ Filter rows ➔ Filter rows based on categorical column.” Set the Column parameter to ANOVA Significant and the Mode parameter to Keep matching rows to retain all differentially expressed proteins.

  8. 8.

    Go to “Processing ➔ Tests ➔ Post-hoc tests.” Set the Grouping parameter to the same grouping that was used for the ANOVA test (see Subheading 3.6, step 1) and the FDR to the desired threshold. Tukey’s honestly significant difference (THSD) is computed for all proteins and all pairwise comparisons and the significant hits within the corresponding pairs are marked (see Note 14 , Fig. 4b).

Fig. 4
figure 4

Differential expression and multiple hypothesis testing. (a) Multiple hypothesis testing correction using a permutation-based false discovery rate approach is shown. Labels are randomly swapped between the three groups (blue, yellow, and red). The Randomization is repeated r times. ANOVA p-values are computed both on the measured and the permutated data and local FDR values (q-values) are computed as the fraction of accepted hits from the permuted data over accepted hits from the measured data normalized by the total number of randomizations r. All hits with a q-value smaller than a threshold are considered significant. (b) To determine the exact pairwise differences of protein expression Tukey’s Honest Significant Difference (THSD) test is used on the ANOVA significant hits. If the mean difference between two groups is greater than or equal to the corresponding THSD, the difference is considered significant between the compared groups. q: constant depending on the number of treatments and the degrees of freedom that can be found in a Studentized range q table; MSE: mean squared error; n1, n2, number of data points in each group

3.9 Clustering and Profile Plots

  1. 1.

    Go to “Analysis ➔ Clustering/PCA ➔ Hierarchical clustering.” Keep the default parameters and click ok.

  2. 2.

    Inspect the resulting heatmap and the relationship between the groups and the proteins.

  3. 3.

    Click on the Change color gradient button in the button ribbon above the heatmap to examine the color scale usage (red means high and green low expression) and to modify them.

  4. 4.

    Click on several node junctions in the protein tree that represent potentially interesting clusters of proteins (i.e., upregulation in a certain experimental condition). The selected clusters are highlighted and appear in the “Row clusters” table displayed to the right of the heatmap (see Note 15 ).

  5. 5.

    Inspect the different profile plots as you navigate through the different clusters in the table. Change the color by modifying the Color scale and export the profile plots by clicking on the Export image button (see Fig. 5).

  6. 6.

    From the ribbon menu in the heat map view, click on the Export row clustering button to add the cluster information to a new data matrix.

Fig. 5
figure 5

Enrichment analysis highlighting important pathways and processes. (a) Hierarchical clustering of proteins found to have differential expression between pairs of disease states. High and low expression are shown in red and blue respectively. Various clusters of protein groups are highlighted in the dendrogram. (b) Profile plots of three selected clusters showing distinct behavior with respect to the three disease states are shown: 1 strongly increased expression in tumor tissues; 2 moderate increase in tumor tissue; and 3 decreased expression in tumor samples. (c) Functional analysis of protein annotation terms resulted in multiple categories that were enriched in the three selected clusters. The enriched terms and the corresponding enrichment factor and p-value are shown

3.10 Functional Analysis

  1. 1.

    Go to “Multi-proc. ➔ Matching rows by name.” Both Base and Other matrices point to the last matrix.

  2. 2.

    Click on Base matrix and then in the workflow window select the data matrix that was generated before filtering for ANOVA significant hits (Subheading 3.9, step 6).

  3. 3.

    In the pop-up window set Matching column in matrix 1 and 2 to a common identifier (e.g., Protein IDs).

  4. 4.

    In the categorical columns section, transfer the category Cluster to the right hand-side. Click ok (see Note 16 ).

  5. 5.

    Go to “Processing ➔ Annot. columns ➔ Fisher exact test.” Change the Column parameter to Cluster and click ok. The resulting table contains information about all annotation categories that were found to be significantly enriched or depleted using a Fisher’s exact test and multiple hypotheses correction (see Note 17 ).

    In summary, this chapter provides a complete protocol for fundamental analysis of proteomic data, starting from data upload and transformation and ending with identification of proteins, characteristic of the specific disease progression stage, and the underlying processes in which they are involved. The described analytical methods and visualization tools are integrated in Perseus, a freely available platform for analysis of omics data, which provides a comprehensive portfolio of analysis tools with a user-friendly interface [19]. A special emphasis is placed on employing statistically sound methods in the analysis of large data, avoiding wrong interpretation and extracting maximum information. More advanced computational techniques such as supervised learning are also supported and are often instrumental for the analysis of complex data where genetic and intra-tumor variability may pose challenges. Moreover, Perseus is being continuously developed to integrate analysis of various data types, including posttranslation modifications, sequence information, as well as to allow deeper functional interpretation through network and pathways analysis.

4 Notes

  1. 1.

    Proteins with shared peptides that cannot be distinguished based on the peptides identified in a bottom-up proteomics approach are often reported together as a protein group [21].

  2. 2.

    The input file format of Perseus is a tab-delimited txt file that contains a header row with the names of all columns. The type of data is specified during file loading. Make sure that the “Regional and Language Options” are set to English to avoid errors while reading numerical data such as decimal separators being wrongly interpreted.

  3. 3.

    Different expression and meta data can be imported in Perseus and used for subsequent analysis. Common expression data are in one of the following formats: normalized intensities (e.g., LFQ intensity as described in [4], iBAQ as described in [22]) or ratios between heavy standard and light/non-labeled sample. Other data types that can be analyzed with Perseus are shown in Fig. 2.

  4. 4.

    Reverse, identified by site and contaminant proteins have to be marked in a categorical column before these filters can be applied. These are automatically set when MaxQuant output tables are used for analysis in Perseus. Additional filtering options can be used to remove proteins based on a quantitative measure such as a minimum number of quantified peptides or a maximum q-value.

  5. 5.

    Different activities have different output results including a data matrix with the same expression values and additional columns containing the results of the analysis or a new data matrix containing only the output of an analysis activity. An activity is always performed on the data matrix and specific tab for that matrix that is active at the moment.

  6. 6.

    Depending on the nature of missing values, different filtering strategies may be employed and are supported in Perseus. For example, if large differences between groups are expected with proteins having very low expression level in one of the groups, filtering based on a minimum number of valid values in at least one group would be a more suitable approach than filtering for a minimum number of valid values in the complete matrix.

  7. 7.

    All the plots can be exported in figure-ready formats such as pdf, tiff, or png.

  8. 8.

    Very stringent filtering is usually not recommended, as a large amount of the data will be lost. Instead milder filtering combined with imputation may be more appropriate.

  9. 9.

    Data normalization is not always necessary. Different types of normalization can be applied on the data to correct for systematic shifts or skewness and to make samples comparable.

  10. 10.

    Regular expressions can be used to derive the experimental design from the sample names (“Action ➔ Create from experiment name”). Additionally, a template txt file can be written out, edited in an external editor program, and read in to indicate the experimental design.

  11. 11.

    Analysis of differentially expressed proteins depends on the number of compared conditions, the underlying distribution properties, and the availability of biological replicates. For example, data sets with one condition with replicates should be analyzed with One-sample tests, with two conditions—with Two-sample tests, and with more than two conditions—Multiple-sample tests. Paired samples test and tests abolishing the requirement for equal variance are also available.

  12. 12.

    The method with largest power Permutation-based FDR is recommended and at least 250 repetitions are suggested. In case of technical replicates, these have to be specified as a separate grouping (see Subheading 3.7, step 1) and selected with the “Preserve grouping in randomizations” option. Failure to specify technical replicates will result in wrong FDR calculation. The more conservative Benjamini-Hochberg correction can also be used when a lower number of Type I errors at the cost of lower sensitivity are desired.

  13. 13.

    The “Significant” column contains a “+” if a protein met the selected significance threshold (usually q-value). Additionally, p-values (probability of type I error) and the corresponding q-values (corrected p-value) are provided in the output table.

  14. 14.

    Tukey’s honestly significant difference (THSD) is a post-hoc test that when performed on ANOVA significant hits determines in exactly which pairwise group comparisons a given protein is differentially expressed.

  15. 15.

    Clusters can be defined by clicking on the respective nodes in the protein tree or based on the precise distance measure used to compute the tree. To use the latter option, click on the “Define row clusters” button and specify the desired number of clusters, which will then be automatically defined.

  16. 16.

    The matching step is necessary in order to define the correct background against which enrichment will be computed. Too small (only significant hits) or too large (the complete proteome, even if not detected with MS analysis) introduces bias in the enrichment results.

  17. 17.

    The enrichment output table contains information about the values used to compute the contingency table for the Fisher’s exact test (e.g., category and intersection size), the enrichment factor, and the statistical significance of the hit indicated by a p-value and the associated false discovery rate.