Morgan Hunter is a 4th-year student at La Trobe University. She plans to finish her undergraduate degree in 2015 with majors in Statistics, Mathematics and Japanese, and will go on to do a Masters in Statistics in 2016. Outside of her studies, Morgan is an enthusiastic netballer and a long-time member of her local brass band.
During her time at La Trobe, Morgan has been fortunate enough to receive three scholarships for her studies in mathematics and statistics, and she has made the Dean’s Honours list every year of her studies. At university, Morgan has developed a passion for both statistics and foreign languages, studying both Japanese and Spanish. While she enjoys the mathematics behind theoretical statistics, she much prefers its applications, and she hopes to go on to work in this field. This interest in applied statistics is the reason she chose to complete her AMSI Vacation Research Scholarship in bioinformatics, and more specifically in proteomics and techniques for identifying and removing contaminants. In the future, she hopes to combine her love of statistics and languages by working overseas in bioinformatics.
Linear mixed models with Gaussian mixture to identify contaminants in proteomics data: a prelude to intra-experiment normalisation.
Proteomics is the large-scale study of proteins, particularly their structures and functions. To interrogate the relative expression levels of thousands of proteins simultaneously, researchers can choose from a variety of quantification techniques. One of the most established techniques is stable isotope labelling by amino acids in cell culture (SILAC). SILAC is a mass spectrometry-based technique that detects differences in protein abundance among samples using non-radioactive isotope labelling (Mann, 2006). A typical SILAC experiment has a heavy-light (H-L) label configuration, where the control sample is labelled with the light version of the isotope (e.g., carbon-12) and the treated sample is labelled with the heavy version (e.g., carbon-13). Software such as MaxQuant (Cox and Mann, 2008) can be used to identify the pairs of spectral peaks belonging to the same protein, and the relative expression is calculated as the ratio of the intensities of the two peaks. Although SILAC is more robust to experimental bias than label-free quantification, bias due to the labelling effect (Ting et al., 2009) and the presence of contaminants are inherent. In our lab, we observe that protein expression levels from two biological replicates with the same labelling configuration (e.g., heavy-light) show high correlations (0.8-0.9), whereas biological replicates with different labelling configurations (one labelled heavy-light, the other light-heavy) show very poor correlations, typically in the 0.2-0.4 range. Performing intra-experiment normalization should make biological replicates more comparable (Gagnon-Bartsch and Speed, 2011). However, the presence of unwanted contaminants can potentially mislead the normalization procedure and result in sub-optimal performance of the normalization methods.
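The ratio calculation and replicate comparison described above can be illustrated with a minimal sketch. The function names, the `treated_is_heavy` flag used to align label-swapped replicates, and the intensity values are all hypothetical and are not taken from MaxQuant or from the project's data.

```python
import numpy as np

def log2_treated_over_control(heavy, light, treated_is_heavy=True):
    """Log2 ratio of treated over control peak intensities, per protein.

    Assumption: in a heavy-light experiment the treated sample carries the
    heavy label; in a label-swapped (light-heavy) experiment it carries the
    light label, so the ratio is inverted to keep the same direction.
    """
    heavy = np.asarray(heavy, dtype=float)
    light = np.asarray(light, dtype=float)
    ratio = heavy / light if treated_is_heavy else light / heavy
    return np.log2(ratio)

def replicate_correlation(log_ratios_a, log_ratios_b):
    """Pearson correlation between the log ratios of two replicates."""
    return float(np.corrcoef(log_ratios_a, log_ratios_b)[0, 1])

# Hypothetical peak intensities for four proteins.
hl = log2_treated_over_control([200.0, 50.0, 100.0, 400.0],
                               [100.0, 100.0, 100.0, 100.0])
lh = log2_treated_over_control([100.0, 100.0, 100.0, 100.0],
                               [210.0, 45.0, 120.0, 380.0],
                               treated_is_heavy=False)
corr = replicate_correlation(hl, lh)
```

In this toy data the two replicates agree closely by construction; the poor correlations reported for label-swapped replicates arise in real measurements, not in this sketch.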
Contaminants are typically proteins from parts of the cell that are not of interest in the current experiment. Despite the best efforts of the researcher, these parts sometimes cannot be separated from the parts of the cell that are of interest. We propose to identify these contaminants using data from negative control experiments. In a negative control experiment, neither the heavy-labelled nor the light-labelled sample is treated. We therefore expect proteins from parts of the cell that are consistently present to show an average log intensity ratio close to zero. Contaminants, in contrast, are proteins from parts of the cell that are inconsistently present, so we expect them to show an average log intensity ratio different from zero and also greater variability.
A linear mixed model with a k-component Gaussian mixture in the random effects will be used to model the log intensity ratios. Let Y_i denote the vector of log intensity ratios for the ith protein across all negative control experiments. The EM algorithm can be used to estimate the parameters of this model (Verbeke and Lesaffre, 1996), and, given the estimated parameters, each protein can be assigned to its most likely component by computing its posterior probability of belonging to each component. Proteins belonging to the component whose mean is closest to zero will be retained for downstream analysis, including normalization, while proteins belonging to the other components will be considered contaminants. To measure the performance of this procedure in identifying common contaminants in proteomics data, we will compare the list of proteins we identify with the cRAP database of common contaminants (http://www.thegpm.org/crap/).
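The assignment step can be sketched in a deliberately simplified form. The code below is not the linear mixed model of Verbeke and Lesaffre (1996). As an assumption made purely for illustration, it fits a one-dimensional Gaussian mixture with component-specific variances to a single summary value per protein (for example, its mean log ratio across negative controls). It then keeps the component whose mean is closest to zero. All function names and the example data are hypothetical.

```python
import numpy as np

def posteriors(x, weights, means, variances):
    """Posterior probability of each component for each observation."""
    log_dens = (-0.5 * np.log(2.0 * np.pi * variances)
                - 0.5 * (x[:, None] - means) ** 2 / variances)
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def fit_gaussian_mixture(x, k=2, n_iter=200):
    """EM for a 1-D Gaussian mixture (simplified stand-in for the full model)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: means spread evenly over the data range.
    means = np.linspace(x.min(), x.max(), k)
    variances = np.full(k, x.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        post = posteriors(x, weights, means, variances)          # E-step
        nk = post.sum(axis=0) + 1e-12                            # M-step
        weights = nk / nk.sum()
        means = (post * x[:, None]).sum(axis=0) / nk
        variances = (post * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

def flag_contaminants(x, k=2):
    """True for proteins assigned to any component other than the one
    whose mean is closest to zero."""
    x = np.asarray(x, dtype=float)
    weights, means, variances = fit_gaussian_mixture(x, k)
    labels = posteriors(x, weights, means, variances).argmax(axis=1)
    keep = int(np.argmin(np.abs(means)))
    return labels != keep

# Hypothetical per-protein mean log ratios: 40 near zero, 10 shifted.
example = np.concatenate([np.linspace(-0.2, 0.2, 40), np.linspace(2.0, 4.0, 10)])
flags = flag_contaminants(example)
```

The flagged proteins could then be compared with a reference list such as cRAP using a simple set intersection on protein identifiers.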
To measure the improvement in the performance of the normalization methods after excluding these contaminants, we will perform NSAF (Zybailov et al., 2006) and dNSAF (Zhang et al., 2010) normalization before and after removing the contaminants, and compare the performance of the normalization methods using the average intra-experiment correlation as the key measure. A better-performing normalization method should yield higher intra-experiment correlations.
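As a rough illustration of this evaluation, the sketch below computes NSAF under its usual definition: each protein's spectral count is divided by its length, then the result is divided by the sum of those values over all proteins in the sample. It also computes the mean pairwise Pearson correlation across replicates. The function names and the input layout (a proteins-by-replicates matrix) are assumptions, and dNSAF, which redistributes spectral counts from shared peptides, is not implemented here.

```python
import numpy as np

def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor for one sample.

    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j)
    """
    saf = np.asarray(spectral_counts, dtype=float) / np.asarray(lengths, dtype=float)
    return saf / saf.sum()

def mean_pairwise_correlation(matrix):
    """Average Pearson correlation over all pairs of replicates.

    matrix: 2-D array with proteins in rows and replicates in columns
    (an assumed layout).
    """
    corr = np.corrcoef(np.asarray(matrix, dtype=float), rowvar=False)
    upper = np.triu_indices_from(corr, k=1)  # each replicate pair once
    return float(corr[upper].mean())

# Hypothetical spectral counts and protein lengths for three proteins.
values = nsaf([10, 20, 30], [100, 100, 300])
```

Under these assumptions, the comparison would apply `mean_pairwise_correlation` to the normalized data once with all proteins and once with the flagged contaminants removed.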