Proteogenomic Analysis of Mycobacterium tuberculosis By High Resolution Mass Spectrometry

2011 · 129 citations · Journal Article · hybrid Open Access

Authors

Dhanashree Kelkar · Institute of Bioinformatics

Dhirendra Kumar · Institute of Genomics and Integrative Biology

Praveen Kumar · Institute of Bioinformatics

Lavanya Balakrishnan · Institute of Bioinformatics

Babylakshmi Muthusamy · Pondicherry University

Amit Kumar Yadav · Institute of Genomics and Integrative Biology

Priyanka Shrivastava

Abstract

The genome sequencing of H37Rv strain of Mycobacterium tuberculosis was completed in 1998 followed by the whole genome sequencing of a clinical isolate, CDC1551 in 2002. Since then, the genomic sequences of a number of other strains have become available making it one of the better studied pathogenic bacterial species at the genomic level. However, annotation of its genome remains challenging because of high GC content and dissimilarity to other model prokaryotes. To this end, we carried out an in-depth proteogenomic analysis of the M. tuberculosis H37Rv strain using Fourier transform mass spectrometry with high resolution at both MS and tandem MS levels. In all, we identified 3176 proteins from Mycobacterium tuberculosis representing ∼80% of its total predicted gene count. In addition to protein database search, we carried out a genome database search, which led to identification of ∼250 novel peptides. Based on these novel genome search-specific peptides, we discovered 41 novel protein coding genes in the H37Rv genome. Using peptide evidence and alternative gene prediction tools, we also corrected 79 gene models. Finally, mass spectrometric data from N terminus-derived peptides confirmed 727 existing annotations for translational start sites while correcting those for 33 proteins. We report creation of a high confidence set of protein coding regions in Mycobacterium tuberculosis genome obtained by high resolution tandem mass-spectrometry at both precursor and fragment detection steps for the first time. This proteogenomic approach should be generally applicable to other organisms whose genomes have already been sequenced for obtaining a more accurate catalogue of protein-coding genes.
Mycobacterium tuberculosis continues to be a significant health burden, especially in the developing countries. Emergence of drug-resistant strains and a higher incidence of tuberculosis in people with HIV/AIDS have further worsened the situation. In the past, researchers have used proteomics for investigating the biology of this pathogen (1. Jungblut P.R. Schaible U.E. Mollenkopf H.J. Zimny-Arndt U. Raupach B. Mattow J. Halada P. Lamer S. Hagens K.
Kaufmann S.H. Comparative proteome analysis of Mycobacterium tuberculosis and Mycobacterium bovis BCG strains: towards functional genomics of microbial pathogens. Mol. Microbiol. 1999; 33: 1103-1117; 2. Jungblut P.R. Müller E.C. Mattow J. Kaufmann S.H. Proteomics reveals open reading frames in Mycobacterium tuberculosis H37Rv not predicted by genomics. Infect. Immun. 2001; 69: 5905-5907; 3. Gu S. Chen J. Dobos K.M. Bradbury E.M. Belisle J.T. Chen X. Comprehensive proteomic profiling of the membrane constituents of Mycobacterium tuberculosis strain. Mol. Cell. Proteomics. 2003; 2: 1284-1296; 4. Mawuenyega K.G. Forst C.V. Dobos K.M. Belisle J.T. Chen J. Bradbury E.M. Bradbury A.R. Chen X. Mycobacterium tuberculosis functional network analysis by global subcellular protein profiling. Mol. Biol. Cell. 2005; 16: 396-404; 5. Mattow J. Siejak F. Hagens K. Becher D. Albrecht D. Krah A. Schmidt F. Jungblut P.R. Kaufmann S.H. Schaible U.E. Proteins unique to intraphagosomally grown Mycobacterium tuberculosis. Proteomics. 2006; 6: 2485-2494; 6. Målen H. Berven F.S. Fladmark K.E. Wiker H.G. Comprehensive analysis of exported proteins from Mycobacterium tuberculosis H37Rv. Proteomics. 2007; 7: 1702-1718; 7. Målen H. Pathak S. Softeland T. de Souza G.A. Wiker H.G. Definition of novel cell envelope associated proteins in Triton X-114 extracts of Mycobacterium tuberculosis H37Rv. BMC Microbiol. 2010; 10: 132).

There are a number of published studies pertaining to annotation of the Mycobacterium tuberculosis genome. The whole genome sequence of M. tuberculosis first became available for H37Rv strain in 1998, which was followed by that of CDC1551 and several other strains (8. Cole S.T.
Brosch R. Parkhill J. Garnier T. Churcher C. Harris D. Gordon S.V. Eiglmeier K. Gas S. Barry 3rd, C.E. Tekaia F. Badcock K. Basham D. Brown D. Chillingworth T. Connor R. Davies R. Devlin K. Feltwell T. Gentles S. Hamlin N. Holroyd S. Hornsby T. Jagels K. Krogh A. McLean J. Moule S. Murphy L. Oliver K. Osborne J. Quail M.A. Rajandream M.A. Rogers J. Rutter S. Seeger K. Skelton J. Squares R. Squares S. Sulston J.E. Taylor K. Whitehead S. Barrell B.G. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998; 393: 537-544; 9. Fleischmann R.D. Alland D. Eisen J.A. Carpenter L. White O. Peterson J. DeBoy R. Dodson R. Gwinn M. Haft D. Hickey E. Kolonay J.F. Nelson W.C. Umayam L.A. Ermolaeva M. Salzberg S.L. Delcher A. Utterback T. Weidman J. Khouri H. Gill J. Mikula A. Bishai W. Jacobs Jr., W.R. Venter J.C. Fraser C.M. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J. Bacteriol. 2002; 184: 5479-5490). Accurate annotation of protein coding genes from any genome is a continuously evolving process. This is highly evident in the case of M. tuberculosis. Cole and colleagues reported the presence of 3924 open reading frames (ORFs) in H37Rv genome (8).
In a re-annotation effort by the same authors, the gene number was revised to 3995 (10. Camus J.C. Pryor M.J. Médigue C. Cole S.T. Re-annotation of the genome sequence of Mycobacterium tuberculosis H37Rv. Microbiology. 2002; 148: 2967-2973). As of March 2011, the TubercuList database contains 4012 annotated protein coding genes in the M. tuberculosis genome (11. Tuberculist Database [http://tuberculist.epfl.ch]). de Souza et al. have carried out a comparison of two different gene annotation sets for M. tuberculosis H37Rv strain (Sanger and TIGR annotations) and reported that ∼50% of the genes have different translation start sites (12. de Souza G.A. Målen H. Softeland T. Saelensminde G. Prasad S. Jonassen I. Wiker H.G. High accuracy mass spectrometry analysis as a tool to verify and improve gene annotation using Mycobacterium tuberculosis as an example. BMC Genomics. 2008; 9: 316). In the same study, using proteomic data for 449 culture filtrate proteins, the authors were able to correct annotations of 24 genes. Finally, the possible existence of many CDSs that are not yet annotated in the H37Rv genome has also been suggested (13. Lew J.M. Kapopoulou A. Jones L.M. Cole S.T. TubercuList - 10 years after. Tuberculosis. 2011; 91: 1-7).

Direct evidence of the translational potential of a genomic region can be obtained from peptide data from mass spectrometry-based proteomics (14. Pandey A. Lewitter F. Nucleotide sequence databases: a gold mine for biologists. Trends Biochem. Sci. 1999; 24: 276-280; 15. Mann M. Pandey A. Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases. Trends Biochem. Sci. 2001; 26: 54-61; 16. Castellana N. Bafna V.
Proteogenomics to discover the full coding content of genomes: a computational perspective. J. Proteomics. 2010; 73: 2124-2135). Other information such as N-terminal acetylation of peptides can be used for translational start site assignment. That the annotations in the H37Rv genome are still not final is indicated by a recent analysis by de Souza et al., in which a clustered database of annotated CDSs and flanking regions from five M. tuberculosis strains and three M. bovis strains was used to search mass spectrometric data and identify proteins missing from the H37Rv genome. These investigators found peptide evidence for 24 genes incorrectly annotated in the H37Rv genome (17. de Souza G.A. Arntzen M.Ø. Fortuin S. Schürch A.C. Målen H. McEvoy C.R. van Soolingen D. Thiede B. Warren R.M. Wiker H.G. Proteogenomic analysis of polymorphisms and gene annotation divergences in prokaryotes using a clustered mass spectrometry-friendly database. Mol. Cell. Proteomics. 2011; 10 (M110.002527)).

The abbreviations used are: MS/MS, tandem MS; PSM, peptide spectrum match; FDR, false discovery rate; TSS, translational start site; GSSPs, genome search specific peptides; ORF, open reading frame; SCX, strong cation exchange; CDS, coding sequence.

In the present study, we have carried out an in-depth proteomic analysis of M. tuberculosis using high resolution Fourier transform mass spectrometry. Cell lysates and culture filtrates were fractionated using various methodologies followed by tandem MS (MS/MS) analysis on an LTQ-Orbitrap Velos ETD mass spectrometer. The mass spectrometry-derived data were analyzed using a six-frame translation of genome sequences in addition to searches of a protein database of M. tuberculosis H37Rv. We used two gene prediction programs (FgeneSB and GeneMark) to obtain alternative gene models (18. Besemer J. Lomsadze A. Borodovsky M.
GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001; 29). In addition to gene prediction, we also used a proteomic approach to examine alternative gene models. In this study, 3176 proteins were identified, representing ∼80% of the total proteome of M. tuberculosis. A total of ∼250 peptides that did not match existing annotations were identified. On the basis of these high confidence peptides, we were able to identify 41 novel genes in the H37Rv genome and correct 79 gene models. We were also able to identify alternative translational start sites for 33 proteins, in addition to confirming the translational start sites of 727 proteins, using N terminus-derived peptides.

M. tuberculosis H37Rv strain was grown in with from were used to of were grown at in a for the of the were by using Cell was carried out by in presence of was used as for and was used as for of culture filtrate proteins, the M. tuberculosis H37Rv strain was grown in at for was as it not have protein The were by membrane was using To obtain proteome cell lysates were fractionated using three different was carried out by of protein was a and using the was and to as R. Pandey A. with in cell culture for of protein and 2005). was carried out using 10 followed by using was carried out using at for were from the and using as H. Pandey A. proteomics using with in cell 2008).
was carried out using cation and was carried out for of cell protein and was carried out at for using at the of were using the of the peptides was used for on A and the other was used for by using were by method the of to of culture filtrate protein was fractionated by and were and was as for cell We carried out total of and culture filtrate of the mass spectrometry were carried out on an LTQ-Orbitrap Velos ETD mass with an The peptides from were analyzed using to tandem mass spectrometry. The of a and an 10 with an at The mass spectrometry analysis on the LTQ-Orbitrap Velos was carried out in a with resolution at of to to precursor were for by was used for analysis of and was used for were in with resolution at In the case of culture filtrate were in the mass mass was which high mass accuracy by using from for were for for was set to and the was at data were to using used precursor mass was to was from to number of in a was to be to was set as with of and were The protein database used for searches was from for M. tuberculosis H37Rv strain The genome sequence for H37Rv strain was from site Using a six-frame database was sequences from to As Mycobacterium is to and as and was it was as in addition to and (8Cole S.T. Brosch R. Parkhill J. Garnier T. Churcher C. Harris D. Gordon S.V. Eiglmeier K. Gas S. Barry 3rd, C.E. Tekaia F. Badcock K. Basham D. Brown D. Chillingworth T. Connor R. Davies R. Devlin K. Feltwell T. Gentles S. Hamlin N. Holroyd S. Hornsby T. Jagels K. Krogh A. McLean J. Moule S. Murphy L. Oliver K. Osborne J. Quail M.A. Rajandream M.A. Rogers J. Rutter S. Seeger K. Skelton J. Squares R. Squares S. Sulston J.E. Taylor K. Whitehead S. Barrell B.G. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.Nature. 1998; 393: 537-544Crossref PubMed Scopus (6522) Google Scholar). 
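The six-frame translation used to build the genome search database can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `min_length` cutoff, and the choice to split each frame at stop codons are assumptions made for illustration, and the parameters actually used in the study are not recoverable from the text above. The codon assignments are those of the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(genome):
    """Translate both strands in all three reading frames."""
    genome = genome.upper()
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def stop_to_stop_segments(protein, min_length=7):
    """Split a translated frame at stop codons; min_length is a hypothetical cutoff."""
    return [seg for seg in protein.split("*") if len(seg) >= min_length]


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"], frames["-1"])
```

Each stop-to-stop segment would then become one entry in the searchable sequence database, alongside the annotated protein sequences.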
The peptides obtained were to the genome translation as and were to the that were used for the number of sequences in protein was in genome translation database sequences were different search and were used to the used were as as a to one for search genome was mass of mass of culture filtrate data mass was as it was in mass was of were of acetylation of peptide N and of were on for discovery using database A sequence database was in addition to database and was at spectrum as of in database at number of in and at the A was to search from data different of searches were with the same data set and protein database searches using and peptide sequences to a spectrum in search were that were different sequences in different searches were from further The protein identification was by proteins on peptides. Proteins that have at one unique peptide were from the a in which protein be that proteins it was by one in the final with an and protein were reported in Proteins with unique peptide evidence were reported in a identified with which were used for protein N analysis were for is a for peptide and protein identification from mass spectrometry data D. D. A with Based for Res. 2011; 10: PubMed Scopus Google Scholar). to different and to different These the peptide spectrum from an that high from mass from precursor This to database This with a approach to correct peptide while is and is for high resolution mass spectrometry data because the the mass of the fragment from mass spectrometric data obtained in high resolution at both MS and were used for proteogenomic This was because we to novel on peptide obtained the were for genome annotation of the peptides were found out using the to in the H37Rv genome were not for further proteogenomic search specific peptides were identified by those peptides which to proteins from genome database search search specific peptides were as to annotated and to annotated genes. 
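The idea of genome search specific peptides (GSSPs), peptides found in the genome database search that do not match any annotated protein, can be sketched as a simple filter. This is an illustrative sketch only: the function names and the example sequences are hypothetical, and treating isoleucine and leucine as equivalent (they have the same mass) is an assumption rather than something stated in the text.

```python
def normalize(seq):
    """Upper-case a sequence and merge I into L (assumption: isobaric residues are equivalent)."""
    return seq.upper().replace("I", "L")


def genome_search_specific_peptides(genome_hits, annotated_proteins):
    """Return peptides from the genome search that occur in no annotated protein."""
    proteome = [normalize(p) for p in annotated_proteins]
    gssps = []
    for peptide in genome_hits:
        key = normalize(peptide)
        if not any(key in protein for protein in proteome):
            gssps.append(peptide)
    return gssps


# Hypothetical sequences, for illustration only.
annotated = ["MKTAYIAKQR", "MSLLTEVETYVLSR"]
hits = ["TAYLAK", "GWQPLR", "VETYVL"]
print(genome_search_specific_peptides(hits, annotated))
```

Peptides that survive this filter are the candidates for novel genes or corrected gene models; their genomic coordinates still have to be compared with existing annotations.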
gene models were using two different gene prediction for prokaryotes and for peptides that not with the gene model that they were to and and those that to region genes and gene model obtained using peptide evidence and gene prediction were for Mycobacterium using protein used for novel in gene were from for the sequence identification was carried out from both cell lysates and culture filtrates of M. tuberculosis H37Rv. Cell lysates were fractionated by three different followed by cation and the two at the peptide level. filtrate proteins were fractionated by In all, were carried were out of which were to peptide sequences using three different search were for first that a The total number of unique peptide sequences obtained was The number of peptides identified from different from from cation and from and of peptides and proteins identified from the various for analysis of cell proteins. The complete of peptides identified in with and is in The complete set of mass spectrometry data and from this has been available the for the data are at the of the The data has also been to where the data with peptide can be using the number H. R. a for for proteomics 2008; 9: PubMed Scopus Google Scholar). The peptide and protein identification data have also been to database has protein sequences for H37Rv TubercuList database 4012 protein coding genes in the H37Rv genome March (11Tuberculist Database [http://tuberculist.epfl.ch]Google Scholar). We have identified a total of 3176 proteins with at one unique peptide ∼80% of the total proteome of the M. tuberculosis. 
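The coverage figure follows directly from the two counts given in the text, the 3176 identified proteins and the 4012 protein coding genes annotated in TubercuList as of March 2011:

```latex
\[
\frac{3176}{4012} \approx 0.792
\]
```

That is, roughly 79% of the annotated protein coding genes, consistent with the reported ∼80%.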
Proteins identified on peptide evidence are As an of the high of many of the proteins, the identification of peptides to of the gene where we identified unique peptides on the of Other proteins identified in this analysis are with peptides with peptides with unique peptides with unique peptides and with unique peptides we identified a peptide to gene which is annotated as a (11Tuberculist Database [http://tuberculist.epfl.ch]Google Scholar). Mycobacterium an of of the genes in are because of of sequence to any of the genes from model prokaryotes (8Cole S.T. Brosch R. Parkhill J. Garnier T. Churcher C. Harris D. Gordon S.V. Eiglmeier K. Gas S. Barry 3rd, C.E. Tekaia F. Badcock K. Basham D. Brown D. Chillingworth T. Connor R. Davies R. Devlin K. Feltwell T. Gentles S. Hamlin N. Holroyd S. Hornsby T. Jagels K. Krogh A. McLean J. Moule S. Murphy L. Oliver K. Osborne J. Quail M.A. Rajandream M.A. Rogers J. Rutter S. Seeger K. Skelton J. Squares R. Squares S. Sulston J.E. Taylor K. Whitehead S. Barrell B.G. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.Nature. 1998; 393: 537-544Crossref PubMed Scopus (6522) Google Scholar). In the TubercuList such proteins are which are as proteins as these are organisms M. M. (13Lew J.M. Kapopoulou A. Jones L.M. Cole S.T. TubercuList - 10 years after.Tuberculosis. 2011; 91: 1-7Crossref PubMed Scopus (313) Google Scholar). We identified proteins that are annotated as proteins of which have not been to be by any proteomic peptides which to annotated proteins the from the genome database search a of the of these novel genes and gene were by proteogenomic a of proteogenomic annotation which is in in the of a total unique peptides that were identified in this study, peptides to regions of genome where gene was present not with the gene model they were gene prediction and as as the tool from were used to in the region to which these were (18Besemer J. Lomsadze A. Borodovsky M. 
GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.Nucleic Acids Res. 2001; 29: PubMed Scopus Google Scholar, and Scholar). Based on we were able to the presence of 41 novel protein coding genes. we also the of these predicted genes species using protein of the novel were already annotated in other strains of M. tuberculosis. is to that the of these novel proteins was of and it is that of these were from genome annotation of the strain to As TubercuList contains an 24 genes as with the data We identified of these proteins, which are in and identification of a novel in the H37Rv genome with the the novel genes found in this with of the peptide evidence and genome of the novel of a novel gene on peptides to an peptides to an region genes and prediction and the presence of this In a protein to this novel gene has been annotated in M. tuberculosis CDC1551 genome. sequence of a novel region is indicated by A spectrum for identification of genome search specific peptide is of using peptide indicated is the obtained for the peptide sequence by search in other in of novel in a from identification of novel we corrected the gene using genome search specific peptide Using this we corrected 79 gene models of were N-terminal and one was of two genes TubercuList has annotated corrected with for of these genes. 
data the of genes where in the is suggested in also contains information and peptide an of of a gene model by of the N The of the gene which for are to We found peptides and to the and predicted a gene which the peptide identified by In case of gene a peptide was found to the N of the This finding suggested that start site of the protein was further of the annotated start Using a genomic a region in strains was by one protein in H37Rv two CDSs, and were reported in the genome as a However, a nucleotide has been reported at in the H37Rv genome, which the to a the presence of a protein of two coding regions (11Tuberculist Database [http://tuberculist.epfl.ch]Google Scholar). In of N-terminal peptides and were found to of a gene to However, these peptides are in and is in We also identified unique peptides to in A that proteins in other strains and CDC1551 are by which both of the novel peptides that were et al. have reported an at in the H37Rv genome, which is a sequencing in the H37Rv genome K. Chen X. Dobos K.M. S. Jacobs Jr., W.R. V. T. E. C. J.C. genome sequences of H37Rv strains of Mycobacterium tuberculosis from Bacteriol. 2010; PubMed Scopus Google Scholar). that is a sequencing and that the of the gene should be corrected to be to of gene was also carried out using protein N-terminal peptides as in the of the genes in H37Rv genome as start as start and as a start As in the while database from genome we peptides in which and were of N-terminal peptides can be indicated by acetylation at the N of the We used these to correct translational start We peptides in search of the genome database for the of such peptides. We for peptides, protein N-terminal peptides, peptides with and peptides with N was to identify protein N-terminal peptides. N-terminal acetylation of is to in at in et al. have that of proteins that are at the N-terminal in M. tuberculosis is more that found in E. Mattow J. Jungblut P.R. 
of translational starts using peptide mass and tandem mass spectrometry the proteome of Mycobacterium 2007; PubMed Scopus Google Scholar). We found protein N-terminal peptides out of peptides were at the N peptides and peptides we were able to the of 727 proteins and of 33 other proteins We also identified N-terminal peptides which confirmed the translational start site of We found of N-terminal peptides to the annotated translational start site of the the other we corrected of genes on a protein N-terminal peptide of the annotated such is in where a peptide with N was of the annotated of gene which can be used as evidence for the of the gene in the case of proteins and we found peptide evidence translation at two different M. tuberculosis H37Rv genome sequence has been available for more 10 years and has been by a number of it was to identify many novel regions with protein coding more was the that of these novel were already annotated in other strains of M. tuberculosis were missing from the genome sequence of M. tuberculosis. We have the of the proteogenomic approach to annotate a genome and the annotation of a studied genome. this approach of using mass spectrometry-based proteomic data to identify protein coding regions in the genome can to be an method with computational for sequenced genomes in the data from of M. data from of M. data from of M. data from M. culture We the of for to the of We for with

Topics & Keywords

Genomics and Phylogenetic Studies RNA and protein synthesis mechanisms Tuberculosis Research and Epidemiology

UN Sustainable Development Goals

Good health and well-being

Publication Details

Published in: Molecular & Cellular Proteomics

Volume 10, Issue 12, pp. M111.011445-M111.011445

DOI: 10.1074/mcp.m111.011627

Field-Weighted Citation Impact: 5.45