Mass Spectrometry and Missing Data
By Daniel Kon, The University of Adelaide
My AMSI VRS project sat at the intersection of mathematics and genetics.
The study of genetics in the age of computers involves dealing with vast quantities of data. Typically, there are hundreds or even thousands of genes expressed in any cell of an organism, with many interactions between them. Genes are transcribed and translated into proteins that promote or inhibit the expression of other genes; those genes influence still more genes, and the chains of influence can grow very large. Making sense of what is going on inside a cell is a very complicated task, even with tools such as the lab-on-a-chip, which can analyse and display the set of proteins expressed at various concentrations in a given cell. Clustering analysis is one of the techniques that can be used to see the big picture: it groups organisms or cells by similarity, based on the expression levels of the relevant proteins in each subject.
As well as analysing proteins, there is also work to be done in sequencing an organism’s genome. The genome consists of millions, and sometimes billions, of base pairs of DNA. Sequencing it involves collecting many shorter strings of base pairs and matching them where they overlap, while identifying and discarding those strings that contain errors. The techniques of graph theory are applicable to this problem.
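To make the link to graph theory concrete, here is a minimal sketch of an overlap graph: each short string (a "read") is a node, and a directed edge joins one read to another when the end of the first matches the start of the second. The function names, the toy reads and the minimum overlap length of 3 are all made up for illustration, and real assemblers must also cope with sequencing errors, which this sketch ignores.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    Assumes min_len >= 1.
    """
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Directed overlap graph as a dict mapping (read_a, read_b) -> overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k:
                    edges[(a, b)] = k
    return edges


# Hypothetical toy reads, chosen so that consecutive reads share 4 bases.
reads = ["ATGCGT", "GCGTAC", "GTACCA"]
graph = overlap_graph(reads)
```

Following the edges of `graph` from `ATGCGT` through `GCGTAC` to `GTACCA` recovers the order in which the reads would be stitched together.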
The project I did over the summer was about missing data in a mass spectrometry study of gastric cancer. In the study, different groups of mice with distinct genomes expressed differing levels of proteins as they developed gastric cancer or gut inflammation (or both, or neither, depending on the group the mouse belonged to). Expression was detected via mass spectrometry but, due to the vagaries of the mass spectrometry process, some expression levels were unknown. The aim of the project was to develop a mathematical model to predict which mouse blood serum samples would have missing data, bridging the wet-lab work and the mathematical statistics. Developing the model is one of the preliminary steps towards knowing how to impute (i.e., estimate) the missing values.
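As a rough illustration of what "a model to predict missing data" can look like, the sketch below fits a one-covariate logistic regression, in which the probability that a value is missing depends on a single number measured for each sample. This is not the model developed in the project: the choice of logistic regression, the covariate and the toy data are all assumptions made purely for illustration.

```python
import math


def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood. Returns the pair (b0, b1)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p          # gradient with respect to the intercept
            g1 += (yi - p) * xi   # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def prob_missing(b0, b1, x):
    """Predicted probability that a value with covariate x is missing."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical toy data: one covariate per sample (for example, a centred
# measure of signal strength) and an indicator of whether the value was missing.
covariate = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
missing = [1, 1, 1, 0, 1, 0, 0, 0, 0]

b0, b1 = fit_logistic(covariate, missing)
```

In this toy data the missing values cluster at low covariate values, so the fitted slope `b1` is negative and `prob_missing` decreases as the covariate increases.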
The work will lead into my Masters, and I look forward to taking it further with the help of my supervisors over the coming two years, making my own contributions to the methods of imputing missing data in biostatistics studies.
Daniel Kon was one of the recipients of a 2015/16 AMSI Vacation Research Scholarship.