Efficient denoising algorithms for large experimental datasets and their applications in Fourier transform ion cyclotron resonance mass spectrometry.

L. Chiron*, M.A. van Agthoven†, B. Kieffer*, C. Rolando† and M-A. Delsuc*

* Institut de Génétique et de Biologie Moléculaire et Cellulaire, INSERM, U596; CNRS, UMR7104; Université de Strasbourg, 1 rue Laurent Fries, 67404 Illkirch-Graffenstaden, France
† Miniaturisation pour la Synthèse, l'Analyse & la Protéomique (MSAP), USR CNRS 3290, and Protéomique, Modifications Post-traductionnelles et Glycobiologie, IFR 147, Université de Lille 1 Sciences et Technologies, 59655 Villeneuve d’Ascq Cedex, France

contact : madelsuc at unistra.fr


rQRd and urQRd

rQRd and urQRd are denoising methods intended to improve harmonic signals (i.e. signals composed of sums of possibly damped sinusoids). rQRd is based on simple algebra, while urQRd exploits the special structure of the underlying matrices to perform the same operation much faster and with a much smaller memory footprint. This site presents information and code for implementing both methods.
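To make the notion of a harmonic signal concrete, here is a minimal sketch that builds a synthetic series as a sum of damped complex sinusoids with added white noise. The function name, its parameters and the numerical values are illustrative assumptions and are not taken from the distributed code; the snippet is written for Python 3 with a recent NumPy rather than the Python 2.7 setup listed further down this page.

```python
import numpy as np

def damped_sinusoids(n_points, freqs, amps, dampings, noise_std=0.0, seed=0):
    """Sum of damped complex sinusoids plus white Gaussian noise.

    freqs are in cycles per sample, dampings in 1/sample (0 means undamped).
    All names and units here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    signal = np.zeros(n_points, dtype=complex)
    for f, a, d in zip(freqs, amps, dampings):
        signal += a * np.exp((2j * np.pi * f - d) * t)
    noise = noise_std * (rng.standard_normal(n_points)
                         + 1j * rng.standard_normal(n_points))
    return signal + noise

# Two components, one damped and one undamped, with moderate noise.
x = damped_sinusoids(1000, freqs=[0.05, 0.12], amps=[1.0, 0.5],
                     dampings=[0.002, 0.0], noise_std=0.3)
```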

Short Presentation

Every measurement is corrupted by random fluctuations in the sample and the apparatus. Current efficient denoising algorithms require the analysis of large matrices and become intractable even for moderately large datasets.
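A rough calculation shows why matrix-based denoising quickly becomes intractable. The sketch below estimates the memory needed merely to store the matrix. It assumes a near-square Hankel matrix built from the series and 16 bytes per complex value; both are illustrative assumptions, since the abstract only states that the matrix size scales as the square of the data length. The function name is hypothetical, and the code is written for Python 3.

```python
def hankel_storage_bytes(data_length, bytes_per_value=16):
    """Bytes needed to store a near-square Hankel matrix of the series.

    Assumption: the matrix has data_length // 2 rows and the matching
    number of columns, with one complex128 value (16 bytes) per entry.
    """
    n_rows = data_length // 2
    n_cols = data_length - n_rows + 1
    return n_rows * n_cols * bytes_per_value

for length in (10_000, 1_000_000):
    gib = hankel_storage_bytes(length) / 2**30
    print(f"{length:>9} points -> {gib:,.1f} GiB")
```

Under these assumptions, a series of one million points already requires about 4 TB for the matrix alone.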

Any series can be considered as an operator which modifies any input vector. By projecting this operator on a series of random vectors and thus reducing the dimension of the data, it is possible, using simple algebra, to reduce noise in a robust manner. Furthermore, the structure of the underlying matrices allows a very fast and memory efficient implementation.
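The idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the distributed rQRd code. It assumes that the operator associated with the series is the Hankel matrix H[i, j] = x[i + j], that the random vectors are Gaussian, and that the denoised series is recovered by averaging the anti-diagonals of the projected matrix. All function names and the `rank` parameter are hypothetical, and the code is written for Python 3.

```python
import numpy as np

def hankel_from_series(x, n_rows):
    """Dense Hankel matrix H[i, j] = x[i + j] with n_rows rows."""
    n_cols = len(x) - n_rows + 1
    idx = np.arange(n_rows)[:, None] + np.arange(n_cols)[None, :]
    return x[idx]

def series_from_hankel(h):
    """Average each anti-diagonal of h to get back a series."""
    n_rows, n_cols = h.shape
    idx = np.arange(n_rows)[:, None] + np.arange(n_cols)[None, :]
    sums = np.zeros(n_rows + n_cols - 1, dtype=h.dtype)
    np.add.at(sums, idx, h)                      # accumulate per anti-diagonal
    counts = np.bincount(idx.ravel(), minlength=n_rows + n_cols - 1)
    return sums / counts

def random_qr_denoise(x, rank, seed=0):
    """Sketch of random-projection + QR denoising (assumed form, see text)."""
    x = np.asarray(x, dtype=complex)
    rng = np.random.default_rng(seed)
    h = hankel_from_series(x, n_rows=len(x) // 2)
    omega = rng.standard_normal((h.shape[1], rank))  # random vectors
    y = h @ omega                                    # project the operator
    q, _ = np.linalg.qr(y)                           # orthonormal basis of the projection
    h_low = q @ (q.conj().T @ h)                     # reduced-rank approximation of H
    return series_from_hankel(h_low)

# Small demonstration on a noisy two-component signal.
t = np.arange(400)
clean = np.exp(2j * np.pi * 0.05 * t) + 0.5 * np.exp(2j * np.pi * 0.12 * t)
noise_rng = np.random.default_rng(42)
noisy = clean + 0.3 * (noise_rng.standard_normal(400)
                       + 1j * noise_rng.standard_normal(400))
denoised = random_qr_denoise(noisy, rank=10)
```

In this sketch, `rank` is the number of random vectors and must be at least the number of sinusoids one hopes to keep. Because the full matrix is formed explicitly, this dense version is only practical for short series.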

Counterintuitively, randomness is used here to reduce noise. This procedure, called urQRd, allows denoising to be applied to data of virtually unlimited size.
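One standard way to exploit matrix structure so that the large matrix is never stored is to compute matrix-vector products with FFTs. The sketch below does this for a Hankel matrix H[i, j] = x[i + j]. It is an assumption about the kind of structure involved, offered as an illustration and not as the authors' implementation; the function name is hypothetical and the code is written for Python 3.

```python
import numpy as np

def hankel_matvec_fft(x, v):
    """Compute H @ v, where H[i, j] = x[i + j], without forming H.

    The product is a correlation of x with v, evaluated as a linear
    convolution of x with the reversed v through zero-padded FFTs.
    """
    x = np.asarray(x, dtype=complex)
    v = np.asarray(v, dtype=complex)
    n_cols = len(v)
    n_rows = len(x) - n_cols + 1
    size = len(x) + n_cols - 1            # full linear-convolution length
    conv = np.fft.ifft(np.fft.fft(x, size) * np.fft.fft(v[::-1], size))
    return conv[n_cols - 1:n_cols - 1 + n_rows]
```

With this approach, the cost of one product grows roughly as n log n in the series length instead of as the number of matrix entries, and only the series itself is kept in memory.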

Abstract

Modern scientific research produces datasets of increasing size and complexity that require dedicated numerical methods to be processed. In many cases, the analysis of spectroscopic data involves the denoising of raw data before any further processing. Current efficient denoising algorithms require the singular value decomposition of a matrix with a size that scales up as the square of the data length, preventing their use on very large datasets.

Taking advantage of recent progress on random projection and probabilistic algorithms, we developed a simple and efficient method for the denoising of very large datasets. Based on the QR decomposition of a matrix randomly sampled from the data, this approach allows a gain of nearly three orders of magnitude in processing time compared with classical singular value decomposition denoising. This procedure, called urQRd (uncoiled random QR denoising), strongly reduces the computer memory footprint and allows the denoising algorithm to be applied to virtually unlimited data size.

The efficiency of these numerical tools is demonstrated on experimental data from high-resolution broadband Fourier transform ion cyclotron resonance mass spectrometry, which has applications in proteomics and metabolomics. We show that robust denoising is achieved in 2D spectra whose interpretation is severely impaired by scintillation noise. These denoising procedures can be adapted to many other data analysis domains where the size and/or the processing time are crucial.

Paper

PNAS link

Preprint


Program distribution

We provide the reader with the code used to obtain all the results and figures of the paper. Testing code is given as IPython notebook files for reproducing our results and testing the algorithms. The distributed code can be used in two ways:


Python Code

The code (Python .py files), the IPython notebook files (.ipynb) and the 1D dataset can be found in urQRd.zip. The archive contains the three algorithms (rQRd, urQRd and Cadzow), the 1D FTICR dataset (FT-ICR-1D) and four IPython notebook files for testing and exploring the code. The notebooks are, in order:

  1. tests of the algorithms on simple examples (quite fast: around ten seconds per example)
  2. the code to reproduce the figures of the paper obtained from synthetic data
  3. application of urQRd to the 1D FTICR dataset
  4. code to read the processed 2D dataset (noisy and denoised)
urQRd.zip (4.1 MB)

The versions of Python, IPython and the required dependencies are Python 2.7.3, IPython 0.12.1, NumPy 1.6.1, SciPy 0.10.1 and Matplotlib 1.1.0. If you prefer a package containing all these dependencies, with slightly more recent versions, we recommend downloading and installing Canopy (EPD), which works on all the usual platforms.
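Before running the notebooks, it can be useful to compare the installed versions with those listed above. The helper below is a hypothetical convenience, not part of urQRd.zip; it simply reports the version string of each package, or notes that the package could not be imported.

```python
import importlib
import sys

def report_versions(modules=("IPython", "numpy", "scipy", "matplotlib")):
    """Return a dict mapping each name to its version, or None if unavailable."""
    versions = {"python": sys.version.split()[0]}
    for name in modules:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            versions[name] = None   # missing or broken installation
    return versions

for name, version in report_versions().items():
    print(name, version or "not installed")
```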

To experiment with an IPython notebook file, open a terminal in the urQRd directory (where the notebook files are located) and type the following command:

ipython notebook

This opens an interface in which to choose the experiment to perform. To execute the code in a notebook cell, place the cursor in the cell and press Shift-Enter.

If you use Canopy, open the IPython notebook file directly with the Editor tool.

Find here a tutorial on ipython

Find here static pages presenting the results you should obtain when testing:

Remarks:


Virtual environment

All the previous installation steps can be replaced by loading the provided Virtual Disk Image (VDI) (link below) in the VirtualBox® environment. The VDI file contains a complete Linux environment (Ubuntu) with everything (Python and its dependencies) needed to run the Python and IPython files described above.

To install VirtualBox, go here (version 4.2.12 is required); for instructions on installing the VDI file in VirtualBox, go there. The installation parameters are:

VirtualBox VDI (1.6 GB)


FTICR2D Dataset

To reproduce and visualize our results on the 2D dataset, we provide the corresponding .csv files compressed in gzip format (the processed data are in single precision). The acquisition files are included alongside these results. The link below points to a zip file gathering all of these files (TGplasmahumainNT.zip).

FT-ICR-2D (zip, 2.36 GB)


The unzipped folder must be placed in the urQRd/Datasets directory, which is created when urQRd.zip is unzipped.
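The gzip-compressed .csv files can also be read outside the notebooks using only the Python standard library. The sketch below streams one file row by row, which avoids loading a large file into memory at once. It assumes comma-separated numeric values, which is a guess about the file layout; the function name and the file path in the comment are hypothetical.

```python
import csv
import gzip

def iter_gzipped_csv(path, delimiter=","):
    """Yield each non-empty row of a gzip-compressed CSV as a list of floats."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if row:
                yield [float(value) for value in row]

# Hypothetical usage (replace the placeholder with a real file name):
# for row in iter_gzipped_csv("urQRd/Datasets/<file>.csv.gz"):
#     process(row)
```

If the files use a different delimiter, pass it through the `delimiter` argument.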


Additional points

To contact us: madelsuc at unistra.fr or lionel.chiron at nmrtec.com

This material is made available in keeping with the concept of Reproducible research.

When using this work, please cite: Chiron, L., van Agthoven, M. A., Kieffer, B., Rolando, C. & Delsuc, M.-A. Efficient denoising algorithms for large experimental datasets and their applications in Fourier transform ion cyclotron resonance mass spectrometry. Proc Natl Acad Sci USA (2014). doi:10.1073/pnas.1306700111

License and warranty

Covered code is provided under this license on an "as is" basis, without warranty of any kind, either expressed or implied, including, without limitation, warranties that the covered code is free of defects. The entire risk as to the quality and performance of the covered code is with you. Should any covered code prove defective in any respect, you (not the initial developer or any other contributor) assume the cost of any necessary servicing, repair or correction.

Downloading code and datasets from this page signifies acceptance of this License Agreement. The code distributed here is covered by the CeCILL license.

Patent

Commercial use of the urQRd and rQRd algorithms is covered by a patent.

Acknowledgments

The authors gratefully acknowledge Fabrice Bray (Université Lille 1, Sciences et Technologies) for providing the 2D FT-ICR-MS dataset, and the Agence Nationale de la Recherche for funding this project (grants 2010 FT-ICR 2D and FRISBI). This work was partly funded by the MASTODONS project by CNRS (grant MesureHD). M.v.A. thanks the Région Nord-Pas-de-Calais for postdoctoral funding. The FT-ICR mass spectrometer and the proteomics platform used for this study are funded by the European Community (FEDER), the Région Nord-Pas-de-Calais (France), the IBISA network, the CNRS, and Université Lille 1, Sciences et Technologies, and this funding is gratefully acknowledged.
