Breath Biopsy Conference Themes Part 1: Statistical Approaches to Biomarker Discovery

Published on 06 Dec 18

Now that the dust has settled on the first ever Breath Biopsy Conference, we thought it would be a good time to focus on some of the themes in breath research that were apparent across the day’s presentations and posters. The first in our series of blog posts looks at statistical approaches to biomarker discovery. As all of our conference attendees will attest, breath is a complex matrix containing over 1,000 volatile organic compounds (VOCs). This gives us a rich data set to work with, but it also means that sophisticated data analysis techniques are needed if we are to pick out the subtle signals of disease from among this wealth of data.

In the day’s opening presentation, Dr Agnieszka Smolinska guided us through how a variety of machine learning algorithms, including discriminant analysis, partial least squares discriminant analysis (PLS-DA), Random Forest, gradient boosting and analysis of variance principal component analysis (ANOVA-PCA) could be used to identify compounds of interest, which could then be proposed as biomarker candidates for different diseases. She went on to discuss some examples of this approach in action, including paediatric asthma, Crohn’s disease/IBD, and the analysis of infant milk formula. Immediately following on from that, Dr Stephen Fowler’s presentation on “Future Clinical Applications of Breath Analysis: Asthma” discussed how ANOVA-based approaches could potentially be used to differentiate asthma from COPD and from healthy controls, and how unbiased clustering approaches were used in the U-BIOPRED study to identify asthma phenotypes. In the afternoon session, we heard from Professor Paul Thomas, whose guide to “A Breathomics Workflow” included a discussion of how multivariate statistical analysis techniques would fit into the workflow, and allow us to go from “information to knowledge”.

Statistical analysis techniques for biomarker discovery also featured prominently in a number of the posters presented during the lunchtime session. Covington et al used a variety of techniques to see whether Alzheimer’s sufferers could be distinguished from those suffering mild cognitive impairment and from controls using breath analysis, with the best results coming from the use of Random Forest. Purcaro et al also used Random Forest, in this case to determine whether breath could be used to identify influenza-infected ferrets. Skarysz et al presented a poster on how neural networks could be deployed to identify compounds of interest directly from raw GC-MS data (targeted analysis). Janssens et al used least absolute shrinkage and selection operator (LASSO) regression to identify compounds that could be used to discriminate between individuals with lung cancer and those with COPD. Finally, Miceli et al applied machine-learning-based data analysis to high-resolution mass spectrometry data to look for volatile signatures that could be of use in non-invasive cancer detection.
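To give a feel for why LASSO is attractive for this kind of compound selection, here is a minimal sketch of the idea in plain Python. It is not the method used in any of the posters: the data are synthetic, the "compounds" are random numbers, the response is continuous rather than a disease label (a classification study would more likely use an L1-penalised logistic model), and the function names and the penalty value `lam` are our own illustrative choices. The point is only that the L1 penalty drives the coefficients of uninformative features to exactly zero.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes roughly centred features and response, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef while coef is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual, adding back its own contribution.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * coef[j]
            new_value = soft_threshold(rho, lam) / col_sq[j]
            delta = new_value - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                coef[j] = new_value
    return coef


# Synthetic example: 100 samples, 10 hypothetical "compounds".
# Only compounds 0 and 3 actually influence the response.
rng = random.Random(0)
n_samples, n_features = 100, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0.0, 0.5) for row in X]

coef = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, c in enumerate(coef) if c != 0.0]
print("Selected feature indices:", selected)
```

In a real study the penalty strength would normally be chosen by cross-validation rather than fixed by hand, and the selected compounds would still need to be confirmed in independent samples.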

Dr Rianne Fijten raised a note of caution in her presentation, pointing out that most studies used univariate rather than multivariate statistics, and very few studies (~2%) used any kind of external validation. This is a concern that we take very seriously at Owlstone Medical, with validation forming a key part of our Biomarker Discovery Process.
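A small simulation helps to show why external validation matters so much when there are far more compounds than samples. The sketch below is purely illustrative and uses assumptions of our own: the data are random noise with no real difference between the two groups, the cohort sizes are arbitrary, and the feature selection and nearest-centroid classifier are deliberately simple stand-ins rather than any method presented at the conference. Features are picked by a univariate comparison on a small "discovery" cohort, and the resulting model is then scored both on that same cohort and on an independent one.

```python
import random


def make_noise_cohort(rng, n_samples, n_features):
    """Generate pure-noise 'breath profiles' with alternating 0/1 group labels."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def class_means(X, y, features):
    """Mean value of each listed feature within each group."""
    means = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return means


def select_features(X, y, k):
    """Univariate selection: keep the k features with the largest group mean gap."""
    all_features = list(range(len(X[0])))
    means = class_means(X, y, all_features)
    gaps = [abs(means[1][j] - means[0][j]) for j in all_features]
    return sorted(all_features, key=lambda j: gaps[j], reverse=True)[:k]


def accuracy(X, y, features, means):
    """Nearest-centroid classification accuracy using the selected features."""
    correct = 0
    for x, label in zip(X, y):
        dist = {
            lab: sum((x[j] - m) ** 2 for j, m in zip(features, means[lab]))
            for lab in (0, 1)
        }
        predicted = 0 if dist[0] <= dist[1] else 1
        correct += predicted == label
    return correct / len(y)


rng = random.Random(42)
X_disc, y_disc = make_noise_cohort(rng, n_samples=20, n_features=1000)
X_ext, y_ext = make_noise_cohort(rng, n_samples=400, n_features=1000)

# Select features and fit the centroids on the discovery cohort only.
features = select_features(X_disc, y_disc, k=10)
means = class_means(X_disc, y_disc, features)

internal_acc = accuracy(X_disc, y_disc, features, means)
external_acc = accuracy(X_ext, y_ext, features, means)
print(f"Accuracy on discovery cohort: {internal_acc:.2f}")
print(f"Accuracy on external cohort:  {external_acc:.2f}")
```

Because the data contain no real signal, the apparent performance on the discovery cohort comes entirely from selecting features and testing on the same samples, while the score on the independent cohort should sit close to chance. Only the external figure gives an honest estimate.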

We’ve been working hard to solve the problem of identifying candidate biomarkers for diseases, and we use state-of-the-art machine learning algorithms similar to the ones reported here. If you would like to find out more about how our discovery services can help you to find VOC markers of disease, then why not have a look at our 2.0 Breath Biopsy Discovery VOC Kits or download our example services report?
