Implementation of a statistical pipeline in the Glycoselect database

BRIEF DESCRIPTION OF THE PROJECT

Implementation of a statistical pipeline in the Glycoselect database for the identification of glycoprotein biomarker signatures.

BACKGROUND AND AIM OF THE PROJECT

The client is developing a new high-throughput glycoproteomics technology to uncover potential glycosylation changes in the complex mix of proteins present in biological fluids such as serum. The input data consist of protein identifications determined by tandem mass spectrometry, together with the proteins' binding affinities to a panel of lectins, which indicate glycan structure. A statistical pipeline needed to be deployed in the existing Glycoselect database to identify biomarker signatures. One challenge was to propose an appropriate methodology for data containing many zero values, for which many classical statistical approaches (t-tests, non-parametric tests) are not suitable.

WHAT WERE THE OUTCOMES?

The statistical methodology developed by Lê Cao et al. (2011)*, sparse PLS Discriminant Analysis, was proposed for this project and produced very satisfying results. The pipeline was developed using this methodology and implemented in the R statistical programming language.
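
To illustrate the idea behind sparse PLS Discriminant Analysis, the following is a minimal sketch of a single component, written in Python using only the standard library. It is not the R script produced for this project. The function names, the `keep` parameter (number of features retained), the simple initialisation of the class weights, and the synthetic data generator are all assumptions made for illustration. The sketch alternates between soft-thresholding the feature loadings and updating the class weights, so that only a small set of features carries non-zero loadings.

```python
import math
import random


def soft_threshold(value, lam):
    """Shrink value towards zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def normalise(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def centre_columns(rows):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def splsda_first_component(X, labels, keep, n_iter=100, tol=1e-10):
    """Hypothetical helper: one sparse PLS-DA component keeping `keep` features."""
    p = len(X[0])
    if not 1 <= keep <= p:
        raise ValueError("keep must be between 1 and the number of features")
    classes = sorted(set(labels))
    k = len(classes)
    # Class membership coded as a dummy (indicator) matrix, then centred.
    Y = [[1.0 if lab == c else 0.0 for c in classes] for lab in labels]
    Xc, Yc = centre_columns(X), centre_columns(Y)
    # Cross-product matrix M = Xc^T Yc (features x classes).
    M = [[sum(xr[j] * yr[c] for xr, yr in zip(Xc, Yc)) for c in range(k)]
         for j in range(p)]
    u = [0.0] * p
    # Assumption: start from the first class indicator rather than an SVD.
    v = [1.0] + [0.0] * (k - 1)
    for _ in range(n_iter):
        a = [sum(M[j][c] * v[c] for c in range(k)) for j in range(p)]
        # Penalty set to the (keep+1)-th largest |a|, so `keep` loadings survive
        # (fewer if there are ties).
        ranked = sorted((abs(x) for x in a), reverse=True)
        lam = ranked[keep] if keep < p else 0.0
        u_new = normalise([soft_threshold(x, lam) for x in a])
        v = normalise([sum(M[j][c] * u_new[j] for j in range(p))
                       for c in range(k)])
        change = max(abs(x - y) for x, y in zip(u_new, u))
        u = u_new
        if change < tol:
            break
    # Sample scores: projection of the centred data onto the sparse loadings.
    scores = [sum(x * w for x, w in zip(row, u)) for row in Xc]
    return u, scores


def make_demo_data(seed=0, n_per_class=20, p=10, shift=6.0):
    """Synthetic, zero-rich data; features 0 and 3 differ between the classes."""
    rng = random.Random(seed)
    X, labels = [], []
    for label in ("control", "disease"):
        for _ in range(n_per_class):
            # Roughly half of the entries are exact zeros.
            row = [rng.gauss(0, 1) if rng.random() < 0.5 else 0.0
                   for _ in range(p)]
            if label == "disease":
                row[0] += shift
                row[3] -= shift
            X.append(row)
            labels.append(label)
    return X, labels


if __name__ == "__main__":
    X, labels = make_demo_data()
    loadings, scores = splsda_first_component(X, labels, keep=2)
    print([j for j, w in enumerate(loadings) if w != 0.0])
```

In this sketch the features with non-zero loadings form the candidate signature, and the sample scores can be plotted to check how well the classes separate. A full analysis would extract further components and tune `keep` by cross-validation, which this example does not attempt.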

An outlier detection step was also proposed to pre-process the data and remove potential outliers. The resulting process clearly identified the outliers in the data, enabling the researchers to remove them before selecting the biomarker panel.

The R script was handed over to the Glycoselect developer for integration into the Glycoselect analytical pipeline. The outlier detection methodology, including visualisation of the analyses, was also provided to the client.

*Lê Cao, K.-A., Boitard, S. and Besse, P. (2011). Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12:253.