Over the past decade the mitochondrial (mt) genome has become the most widely used genomic resource available for systematic entomology. While the availability of other types of ‘–omics’ data – in particular transcriptomes – is increasing rapidly, mt genomes are still vastly cheaper to sequence and are far less demanding of high quality templates. Furthermore, almost all other ‘–omics’ approaches also sequence the mt genome, and so it can form a bridge between legacy and contemporary datasets. Mitochondrial genomes have now been sequenced for all insect orders, and in many instances for representatives of each major lineage within orders (suborders, series or superfamilies depending on the group). They have also been applied to systematic questions at all taxonomic scales from resolving interordinal relationships (e.g. Cameron et al., 2009; Wan et al., 2012; Wang et al., 2012), through many intraordinal (e.g. Dowton et al., 2009; Timmermans et al., 2010; Zhao et al., 2013a) and family-level studies (e.g. Nelson et al., 2012; Zhao et al., 2013b) to population/biogeographic studies (e.g. Ma et al., 2012). Methodological issues around the use of mt genomes in insect phylogenetic analyses and the empirical results found to date have recently been reviewed by Cameron (2014), yet the technical aspects of sequencing and annotating mt genomes were not covered. Most papers which generate new mt genomes report their methods in a simplified form which can be difficult to replicate without specific knowledge of the field. Published studies utilize a sufficiently wide range of approaches, usually without justification for the one chosen, that confusion about commonly used jargon such as ‘long PCR’ and ‘primer walking’ could be a serious barrier to entry. Furthermore, sequenced mt genomes have been annotated (gene locations defined) to wildly varying standards, and improving data quality through consistent annotation procedures will benefit all downstream users of these datasets. 
The mt genome of most animals is an extremely conserved and constrained molecule. It is descended from the genome of the alpha-proteobacterial symbiont that became the mitochondrion in the ancestor of all eukaryotes, and retains many bacterial-type features. Like most bacterial genomes it is usually a circular molecule, the only exceptions being noninsects such as cnidarians (Burger et al., 2003). It has undergone massive reductive evolution with many genes either moved to the nuclear genome or their function replaced by nuclear encoded orthologues. The gene set of bilaterian animals (i.e. all metazoans excluding cnidarians, ctenophores, poriferans and placozoans) is fixed at just 37 genes: 13 protein-coding genes (PCGs) which form part of the electron transport chain, plus 2 ribosomal RNA (rRNA) and 22 transfer RNA (tRNA) genes which are responsible for translating the mt PCGs (Osigus et al., 2013). Very few bilaterian animals have fewer than 37 genes, and the small number with more than 37 have duplicate copies of one or more of the core gene set. In addition to its genic content, the mt genome also includes one or more noncoding regions, such as the control region (CR), that function as binding sites for proteins involved in genome replication and transcription. In most animals mt genes are transcribed on both strands; the strand with the most genes is termed the ‘majority’ strand and the other the ‘minority’ strand. Other terms used include the H (heavy) and L (light) strands, a reference to differences in G + T content between the two strands that arise due to their asymmetric replication (Reyes et al., 1998). In most insects the majority strand corresponds to the H strand and the minority to the L; however, as each naming convention has an independent basis, one cannot assume that they are interchangeable. 
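The expected bilaterian gene complement described above lends itself to a simple check on a list of annotated gene names. The following is a minimal sketch, not a published tool: the function name `check_gene_complement`, the input format (a plain list of gene-name strings) and the particular gene-name spellings (e.g. `trnL1`, `rrnL`) are assumptions made for illustration only.

```python
from collections import Counter

# Expected bilaterian mt gene complement: 13 PCGs + 2 rRNAs + 22 tRNAs = 37.
# The name spellings below are an assumed convention, chosen for illustration.
PCGS = ["atp6", "atp8", "cob", "cox1", "cox2", "cox3",
        "nad1", "nad2", "nad3", "nad4", "nad4l", "nad5", "nad6"]
RRNAS = ["rrnL", "rrnS"]
# One tRNA for each of 18 amino acids, plus two each for leucine and serine.
TRNAS = ["trn" + aa for aa in "ACDEFGHIKMNPQRTVWY"] + [
    "trnL1", "trnL2", "trnS1", "trnS2"]
EXPECTED = PCGS + RRNAS + TRNAS


def check_gene_complement(annotated):
    """Compare a list of annotated gene names against the 37 expected genes."""
    counts = Counter(annotated)
    return {
        "missing": sorted(g for g in EXPECTED if counts[g] == 0),
        "duplicated": sorted(g for g in EXPECTED if counts[g] > 1),
        "unexpected": sorted(g for g in counts if g not in EXPECTED),
    }


if __name__ == "__main__":
    # Hypothetical annotation lacking trnI and carrying a second copy of cox1.
    genes = [g for g in EXPECTED if g != "trnI"] + ["cox1"]
    print(check_gene_complement(genes))
```

A report of missing or duplicated genes from such a check is only a prompt to re-examine the annotation; it cannot by itself distinguish a genuine loss or duplication from an annotation or sequencing error.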
The arrangement of genes (both gene order and transcription direction) within the mt genome varies widely across bilaterians, but sufficient conservation between different groups has allowed the recognition of conserved gene blocks (Bernt et al., 2013a), as well as ancestral genome arrangements for the Ecdysozoa (Braband et al., 2010), Pancrustacea and Insecta (Boore et al., 1998). While there are many insects that have mt genome arrangements derived relative to this ancestral insect genome (Fig. 1), the majority of insect species share this arrangement (see Cameron, 2014, for a full discussion of genome rearrangements found in insects). Naming conventions for mt genomes were established by Boore (2006), yet a variety of alternative names are used; for example, nad1, nd1 and NADH1 all describe the same gene. Methods for sequencing mt genomes have improved vastly over the last decade and these improvements are largely responsible for the rapid increase in the numbers of available genomes over this time (Boore et al., 2005). The first mt genomes were sequenced following direct isolation of mtDNA, either by differential centrifugation using caesium chloride to separate mtDNA from nuclear DNA, or by centrifugation of tissue lysate using sucrose to separate whole mitochondria from other cell components (Clary & Wolstenholme, 1985; Crozier & Crozier, 1993). Purified mtDNA was then digested using restriction enzymes, cloned and the clone library sequenced. Mt genomes for only eight insect species were sequenced using these methods between 1985 (Drosophila yakuba Burla: Diptera: Drosophilidae) and 2000 (Cochliomyia hominivorax Coquerel: Diptera: Calliphoridae), highlighting the technical demands of this approach. The remaining 98% of insect mt genomes have been sequenced by one of the four methods outlined below: long PCR plus primer walking; long PCR plus next-generation sequencing (NGS); RNA sequencing (RNAseq) plus gap filling; and direct shotgun sequencing (Figs 2, 3). 
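Because the same gene can appear under several names in different annotations, comparing genomes across studies usually begins by mapping synonyms onto a single convention. The sketch below handles only the NADH dehydrogenase subunit names given as an example above (nad1, nd1, NADH1 and their numbered equivalents); the function name `normalise_gene_name` and the choice of the `nad` prefix as the target spelling are assumptions for illustration, and a practical synonym table would need to cover the other genes as well.

```python
import re

# Matches the nad / nd / nadh spellings of NADH dehydrogenase subunit genes,
# e.g. "nad1", "ND1", "NADH1", "ND4L". Other genes are deliberately not handled.
_NAD_PATTERN = re.compile(r"(?:nad|nd|nadh)(\d)(l?)")


def normalise_gene_name(name):
    """Map nad/nd/NADH-style names to 'nadN' (or 'nad4l'); leave others as given."""
    cleaned = name.strip()
    match = _NAD_PATTERN.fullmatch(cleaned.lower())
    if match:
        return "nad" + match.group(1) + match.group(2)
    return cleaned


if __name__ == "__main__":
    for label in ["nad1", "nd1", "NADH1", "ND4L", "cox1"]:
        print(label, "->", normalise_gene_name(label))
```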
The introduction of PCR revolutionised mt genomics as it has virtually every other area of molecular biology. Of most relevance to mt genomics is the application of long PCR (sometimes termed long-range PCR), the targeting of amplicons that span multiple genes. It was first applied to insect mt genomes by Roehrdanz (1995) to assess population-level variability in mtDNA via restriction fragment length polymorphisms (RFLP), and Triatoma dimidiata Latreille (Hemiptera: Reduviidae) was the first mt genome to be sequenced using this method (Dotson & Beard, 2001). Long PCR has been used in virtually every insect mt genome sequenced since. From a technical perspective, long PCR does not differ greatly from regular PCR. Primers are used to delimit the target amplicon, and the same unmodified oligonucleotide primers can be used as in other PCRs. While it is common to design species-specific primers for long PCR, it is not necessary, and primer sets conserved at various taxonomic scales – for example, all animals (Simon et al., 2006), arthropods (Yamauchi et al., 2004), Dictyoptera (Cameron et al., 2012), Coleoptera (H. Song, personal communication) – have been identified. Long PCRs can also be run on standard PCR machines. Amplification conditions should be changed to reflect the longer amplicons, typically by increasing the extension and run-out steps; most commercial enzyme mixes include formulae for calculating required extension times for a range of expected amplicon lengths. Annealing temperatures are defined by primer base composition, but it is useful to reduce the extension temperature by 4°C from manufacturer recommendations due to the high A + T nucleotide bias of insect mt genomes. Many commercial polymerases are suitable for long PCR, but formulations which include error-checking enzymes such as Pfu or have ultra-low error rates are preferred due to the possibility of errors accumulating over long target regions. 
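The two adjustments to cycling conditions described above (a longer extension step and an extension temperature lowered by 4°C) can be sketched as a small calculation. This is an illustrative sketch only: the function names are hypothetical, the linear seconds-per-kilobase scaling is an assumption, and the extension rate and recommended temperature must be taken from the manufacturer's documentation for the enzyme in use rather than from the example values here.

```python
import math


def extension_seconds(amplicon_bp, seconds_per_kb):
    """Extension time assuming linear scaling with amplicon length.

    seconds_per_kb has no default on purpose: it is enzyme-specific and
    should come from the manufacturer's formula.
    """
    if amplicon_bp <= 0 or seconds_per_kb <= 0:
        raise ValueError("amplicon length and extension rate must be positive")
    return math.ceil(amplicon_bp / 1000 * seconds_per_kb)


def reduced_extension_temp(recommended_c, reduction_c=4.0):
    """Lower the recommended extension temperature (4 degrees C by default)."""
    return recommended_c - reduction_c


if __name__ == "__main__":
    # Purely illustrative inputs: an 8 kb amplicon, a rate of 60 s per kb and
    # a recommended extension temperature of 72 degrees C.
    print(extension_seconds(8000, 60))
    print(reduced_extension_temp(72.0))
```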
The advantages of long PCR over direct isolation are enormous: far less tissue is insects can be and the to the mt genome in as as two PCR is many times than mtDNA to the circular of mt genomes long PCRs in gene can be used to the genome, it is with to one a gene regions that to by PCRs can be and through by long PCRs. The of the include a for high quality to in genome and While long for DNA the target that high quality is in tissue can still DNA in is almost sufficient and mt genomes can be from in or mtDNA isolation as is usually in most studies target mtDNA such as and such as the or which have high of PCR not be for extremely small insects in DNA templates. of long PCR is usually to sequence at the primer sites or to genome due to rearrangements or (e.g. in Cameron et al., DNA or more DNA sequence types in a can to PCR the differ in the will be and Long PCR also by which are nuclear copies of mitochondrial genes et al., 2001). are and they are from mt genome copies by the of and rates across all however, are also of of mtDNA the nuclear genome and the of an should not be as that a particular amplicon is PCRs of mt genes are also to or of et al., While long PCR has been as a there are of in multiple insect the by this is almost and genes, in a Cameron, of DNA to for mtDNA – either via & or et al., – to long PCR has been used to but the of these methods across a range of insect has not been of amplicons has most been via sequencing with primer methods are the In primer the of each amplicon are sequenced using the the sequence is then used to design primers downstream of the set of primers is used to sequence a the of – design new primers – sequence is the amplicon has been primers are required for a insect mt with other of sequencing of the genome in both is necessary to sequencing include sequencing one species by primer and then the primer set on species (e.g. Cameron & Nelson et al., 2012). 
The of primer is to the target species that due to sequence variability at primer The are that it is and Mitochondrial genomes can only be sequenced as as the number of amplicons, and the of each on times for sequencing and primer The of primer design are also typically at the of the for is a use sequencing primer sets have been for taxonomic groups (e.g. et al., but have yet to be the sequencing of the by primer is due to sequence (i.e. and to design useful (e.g. A or and (e.g. Cameron et al., 2012). this a number of the insect mt genomes available on have not been these mt genomes have been sequenced through the regions but the is The to the of primer has to application of methods to mt used by et al. for the amplicons for the for primer with sequence has that the method is of nucleotide than yet more to errors sequencing regions et al., of most however, more than primer and so has on approaches to such that multiple mt genomes can be sequenced from a from can be with – termed et al., – which from a to be to of a sequence Timmermans et al. however, have that mt genomes can be without the for using of mt genes to to species The taxonomic of this are Timmermans et al. sequenced of to species that were from different studies have on a series Timmermans & or et al., that multiple representatives at the and at taxonomic scales run the of but the of has yet to be in this of most of to mt genomics is their on long PCR. by typically include all of the mt and genes at high et al., are typically not well and the mt genome typically the of the and for regions (e.g. et al., Wang et al., 2013). the between the mt and the which are by the of by (see are by are usually and are greatly by to date has a mt genome from this be a of sequencing with transcriptomes being sequenced the of RNA species is to While the mt gene typically used in phylogenetic analyses of mt it is to use these as to sequencing of the genome et al., Wang et al., 2013). 
primers on each mt fragment the between to be by PCRs and sequenced by While this still PCR and as such is to PCR it of DNA and usually less than the number of species-specific primers as a full primer approach. the involved in a high it is not more than primer but is a of from datasets. direct shotgun sequencing of genomic DNA the of mt genomes without or at The first insect mt genome to be sequenced from shotgun sequencing was the which was from as part of the nuclear genome sequencing et al., 2009; et al., The genome of species – each with genes (Cameron et al., et al., – at sequencing (e.g. et al., target amplicons to protein-coding genes that in were on different genome sequencing however, use (e.g. genome 2010), just for DNA and largely mt genomic such as et al., 2012), target genome to be at and with are as or to their number within the mt genomes can in this be from the The methods used are to the of in mt genomes as a of nuclear genome studies have on mt genomes from nuclear as is used to target mt whole genomic DNA are and sequenced using of the standard has been to mt genomes from using either a relative as a reference genome, or using mt genes from the target species as for et al., 2013). of this from insects (e.g. et al., in et al., in are on but studies from other have the (e.g. et al., 2012; et al., The use of by to application on (e.g. or for which long PCR is et al. 
were to sequence the mt genome of the on and a tissue just 2 2 in – than many insects – that a of mt genomic data could be within of the studies to however, have sequenced mt genomes from multiple species a run et al., this more than either the primer or long PCR plus are four approaches to sequencing insect mt genomes at the have their advantages and in terms of and to difficult (see which should be to the design of mt genome sequencing however, these methods are sufficient to sequence virtually insect mt of sequencing of mt genomes are then necessary for all downstream to the of genes and plus their transcription strand or the of and of other such as the of transcription and mt genome annotation have been which use to protein-coding genes, analyses to and annotated for et al., was the first however, its of mt genomes is now extremely of date – new mt genomes have been and just insect species are et al., used methods and a however, the is longer at the time of (Bernt et al., 2013b) is the most annotation yet but its of protein-coding genes are wildly the of not the annotation methods have not been widely and the majority of insect mt genomes sequenced to date have been The to by with will for these and to annotation issues specific to an of the mt genome annotation is and in Mitochondrial genes are transcribed genes on a then by an at the sites of this is to as the et al., the first in mt genome annotation genes, usually via such as & and & the of by with the to form the by between base is by the sequence at of the on however, that from the for in almost all animals and multiple in groups such as & and et al., & 2012). have recently been (e.g. et al., in which for but are typically annotated by genomes of sequence at locations with the mt genomes of is usually sufficient to not by regions not to other genes can be using RNA such as to that can be with the of other species to regions. 
a small number of insect such as have one or more genes from the mt The of a particular from an annotation is usually due to either annotation error or to sequence a of the genome, for genes the the most of mt genomes. it is common to copies the expected 22 genes. of the methods which well a particular of DNA the for a in there are multiple copies of a the one with the is to be the of the gene. with the gene from species also usually will which of is the gene. copies of a that are to within 2 they are encoded on the are almost copies that are found in the however, the high of sequence between these the and from species that they are (Cameron et al., of protein-coding genes can be by between can be by most using such as or et al., that and is relative to the of and both the and should be for the PCGs regions are the first downstream of its is typically to form the of each gene. however, variability in In addition to the and and are also used across a range of insect & The also the annotation of a T or a are a common of mt protein-coding genes. are to by et al., & The annotation of is a in that it either a or other and its annotation across insects has been wildly In the first insect mt genome to be separate the from the first which a 13 than orthologues. & a for in yakuba that as an due to either ribosomal or a which could as a It should be that for this was it was a to a than was found in other Furthermore, the is not well conserved across across for example, and are all found within different species the gene is the most conserved mt gene at the and across orders to the of conserved sites as for different in et al., in (Cameron & and or at a conserved in Coleoptera et al., only a number of species (e.g. 
& 2009; et al., et al., have the the same and that the are from studies also that not with the as has been for insect species et al., for within of can be on the of to conserved sites downstream of the is justification for about for that with or are longer or than Most of the remaining in protein-coding gene not by In the ancestral insect mt genome there are four gene in genes for which the is not defined by and of and usually not by a with the A of the first the first base of the and almost by with a instances however, been of gene which due to base within the of the first gene et al., et al., RNA at the of each gene have been to function as sites between gene & Wolstenholme, in such instances the et al., et al., The of the however, between different insect groups (see et al., the The enzymes responsible for are to be to base et al., et al., 2004), that at is due to and as yet The extension of the to include at is by studies which that at of these gene are and in & in et al., studies are required from a range of insect so that can reflect In the the at the and of these genes are conserved at taxonomic scales (e.g. within and as with consistent of gene in instances are high of length the ribosomal RNA genes are the most difficult mt genes to 3). In the ancestral insect mt genome, is between two and and this gene has been annotated to every base between these two genes. While sequencing has this for & other insects have been variability in this from in the et al., to in the et al., variability can be for by regions within the gene – for example, genes of the and Latreille differ in by high at both the and (Cameron et al., are due to either within the gene (e.g. et al., or between and (e.g. et al., of have been (e.g. et al., et al., Cameron & the of the molecule, is conserved across the has conserved but includes a conserved and a length have not improved annotation of regions for this gene. 
has been as the of the gene is not by gene but by the In to however, the of the of has a high of conservation part of two that are between and of this conserved (e.g. et al., has in more consistent annotation of yet for mt genome still reflect approaches to this gene. for of has recently been (e.g. et al., in which could in more consistent of not just gene but also such as and within each The of the mt genome have also not been annotated The of replication is typically in the noncoding and is between and in the insect than specific within this is typically annotated as the or the + T & a series of conserved within the insect on the mt genomes available at the While & has to be of groups such as few of the are conserved across is in to the mt genomes of other groups such as with conserved et al., The of replication has been to a long that is found in most its within the varies et al., 2005). The of replication has not been for insect other than it also in the and is with a et al., 2005). 
The only other that has been is the binding of a transcription which is in a noncoding between nad1 and in the insect mt has a conserved that is conserved across insects (Cameron & in species such as a results in a longer nad1 which the binding et al., to of the genes relative to the protein-coding genes et al., and the binding is in mt genomes nad1 is longer downstream of the for et al., and (Cameron et al., The of transcription of which four are typically et al., 2009; have yet to be for a annotation as then PCGs and noncoding there is a for quality by the have in a the quality questions are outlined in these are all about the mt genome annotation to – the expected number and of genes, their transcription and While it is to a in the of mt genome annotation it is due to the high of on this within from the expected number of genes to be to the possibility of or sequencing outlined are only by annotation and their to be not in extension or of PCGs are far more to be due to sequencing than and are by the sequencing by of their The sequencing of both genome or with not is to in the on sequence errors are virtually to is and there are insect species mt genome from one or more of the quality however, these quality to on mt genome which have the of being than it also to the of mt genomes using in phylogenetic or doesn't consistent between of the or or more genic regions mt genomes and analyses to is being (e.g. genes Furthermore, the includes only error gene in are can be by use of the other such as and are not and errors example, in a of mt genomes Cameron, across species – genome – were that in of the gene was in While many of these – for example, annotated to be long or – they still in genes for phylogenetic Other however, are and gene with other species – for example, the gene of was annotated to be due to an in the of the gene et al., 2012). 
errors are due to errors in being mt genome The first mt genome, et al., many errors that have been in the annotation of other due to between and as well as and were annotated in the first mt genome et al., 2006), and these have been in other mt for et al., and et al., 2012). mt genome such as et al., have many such errors in however, these are not the for mt genome for – users of mt genome data should the of data in their It is also that each new genome of is in insect mt genomes and is an to Of the were in species mt genomes were by the Cameron & et al., 2010; Zhao et al., and with data from other species the most annotation has is about the gene which can be at a time should form a part of all analyses that use mt genome with differences from as part of mt genomes are a useful data for a wide variety of phylogenetic and genomic Methods for whole mt genome data have over the last decade and depending the time and of different sequencing methods be most Mt genomes can be sequenced and for almost all insect groups and approaches can be applied to groups that or to Mt genome annotation and in it is still that in this be in only to methods are and the A of mt genomes are transcribed and the is to A to mt genome conserved across insects or across orders are most to gene in the of has been by is for the of in mt genomes and there is for to such for sequenced mt genomes that studies have their legacy there has been a wide variety in annotation between different but of has also over studies that use mt genomes on should be as part of or analyses to gene are being to the and with have on insect mt genomes over the past in particular Dowton and has been by the and the