Abstract

Results from genome-wide association studies (GWAS) can be used to infer causal relationships between phenotypes, using a strategy known as 2-sample Mendelian randomization (2SMR) and bypassing the need for individual-level data. However, 2SMR methods are evolving rapidly and GWAS results are often insufficiently curated, undermining efficient implementation of the approach. We therefore developed MR-Base (http://www.mrbase.org): a platform that integrates a curated database of complete GWAS results (no restrictions according to statistical significance) with an application programming interface, web app and R packages that automate 2SMR. The software includes several sensitivity analyses for assessing the impact of horizontal pleiotropy and other violations of assumptions. The database currently comprises 11 billion single nucleotide polymorphism-trait associations from 1673 GWAS and is updated on a regular basis. Integrating data with software ensures more rigorous application of hypothesis-driven analyses and allows millions of potential causal relationships to be efficiently evaluated in phenome-wide association studies. https://doi.org/10.7554/eLife.34408.001

eLife digest

Our health is affected by many exposures and risk factors, including aspects of our lifestyles, our environments, and our biology. It can, however, be hard to work out the causes of health outcomes because ill-health can influence risk factors and risk factors tend to influence each other. To work out whether particular interventions influence health outcomes, scientists will ideally conduct a so-called randomized controlled trial, where some randomly-chosen participants are given an intervention that modifies the risk factor and others are not.
But this type of experiment can be expensive or impractical to conduct. Alternatively, scientists can use genetics to mimic a randomized controlled trial. This technique, known as Mendelian randomization, is possible for two reasons. First, because it is essentially random whether a person has one version of a gene or another. Second, because our genes influence different risk factors. For example, people with one version of a gene might be more likely to drink alcohol than people with another version. Researchers can compare people with different versions of the gene to infer what effect alcohol drinking has on their health. Every day, new studies investigate the role of genetic variants in human health, which scientists can draw on for research using Mendelian randomization. But until now, complete results from these studies have not been organized in one place. At the same time, statistical methods for Mendelian randomization are continually being developed and improved. To take advantage of these advances, Hemani, Zheng, Elsworth et al. produced a computer programme and online platform called "MR-Base", combining up-to-date genetic data with the latest statistical methods. MR-Base automates the process of Mendelian randomization, making research much faster: analyses that previously could have taken months can now be done in minutes. It also makes studies more reliable, reducing the risk of human error and ensuring scientists use the latest methods. MR-Base contains over 11 billion associations between people's genes and health-related outcomes. This will allow researchers to investigate many potential causes of poor health. As new statistical methods and new findings from genetic studies are added to MR-Base, its value to researchers will grow. https://doi.org/10.7554/eLife.34408.002

Introduction

Inferring causal relationships between phenotypes is a major challenge and has important implications for understanding the aetiology of disease processes.
The potential for phenome-wide causal inference has increased markedly over the past 10 years due to two major advances. The first is the continuing success of large scale genome-wide association studies (GWAS) in identifying robust genetic associations (Visscher et al., 2017). The second is the development of statistical methods for causal inference that exploit the principles of Mendelian randomization (MR) using GWAS summary data (Davey Smith and Ebrahim, 2003; Davey Smith and Hemani, 2014; Zhu et al., 2016; Pierce and Burgess, 2013). Genetic data for MR can, however, be difficult to access, while MR methods are evolving rapidly and can be difficult to implement for non-specialists. To address the need for more systematic curation and application of complete GWAS summary data and MR methods, we have developed MR-Base (http://www.mrbase.org): a platform that integrates a database of thousands of GWAS summary datasets with a web interface and R packages for automated causal inference through MR. Following an extended introduction on the uses and sources of GWAS summary data, and the principles and assumptions behind MR, we describe how to implement MR analyses using MR-Base, how to interpret results and provide a thorough overview of potential limitations. In an applied example, we demonstrate the functionality of MR-Base through an MR study of low density lipoprotein (LDL) cholesterol and coronary heart disease (CHD). We also demonstrate how the integration achieved by MR-Base supports a wide range of applications, including phenome-wide association studies (PheWAS) to identify potential sources of horizontal pleiotropy, and for performing hypothesis-free MR to gain insight into impacts of interventions. These applications demonstrate how integrating data and analytical tools enable novel insights that would previously have been technically and practically challenging to achieve. 
GWAS summary data

GWAS summary data, the non-disclosive results from testing the association of hundreds of thousands to millions of genetic variants with a phenotype, have been routinely collected and curated for several years (Welter et al., 2014; Li et al., 2016; Beck et al., 2014) and are a valuable resource for dissecting the causal architecture of complex traits (Pasaniuc and Price, 2017). Accessible GWAS summary data are, however, often restricted to 'top hits', that is, statistically significant results, or tend to be hosted informally in different locations under a wide variety of formats. For other studies, summary data may only be available 'on request' from authors. Complete summary data are currently publicly accessible for thousands of phenotypes but to ensure reliability and efficiency for systematic downstream applications they must be harvested, checked for errors, harmonised and curated into standardised formats. GWAS summary data are useful for a wide variety of applications, including MR, PheWAS (Millard et al., 2015; Denny et al., 2010), summary-based transcriptome-wide (Gusev et al., 2016) and methylome-wide (Richardson et al., 2017; Hannon et al., 2017a) association studies and linkage disequilibrium (LD) score regression (Bulik-Sullivan et al., 2015; Zheng et al., 2017b).

Mendelian randomization

MR (Davey Smith and Ebrahim, 2003; Davey Smith and Hemani, 2014) uses genetic variation to mimic the design of randomised controlled trials (RCT) (although for interpretive caveats see Holmes et al., 2017). Let us suppose we have a single nucleotide polymorphism (SNP) that is known to influence some phenotype (the exposure). Due to Mendel's laws of inheritance and the fixed nature of germline genotypes, the alleles an individual receives at this SNP are expected to be random with respect to potential confounders and causally upstream of the exposure.
In this 'natural experiment', the SNP is considered to be an instrumental variable (IV), and observing an individual's genotype at this SNP is akin to randomly assigning an individual to a treatment or control group in an RCT (Figure 1a). To infer the causal influence of the exposure, one calculates the ratio of the SNP effect on the outcome to the SNP effect on the exposure. If there are many independent IVs available for a particular exposure, as is often the case, causal inference can be strengthened (Johnson, 2012). Here, we consider each SNP to mimic an independent RCT and we can adapt tools developed for meta-analysis (Bowden et al., 2017a) to combine the results obtained from each of the SNPs, giving an overall causal estimate that is better powered.

Figure 1. Principles and assumptions behind Mendelian randomization. (A) Diagram illustrating the analogy between Mendelian randomization (MR) and a randomised controlled trial. (B) A directed acyclic graph representing the MR framework. Instrumental variable (IV) assumption 1: the instruments must be associated with the exposure; IV assumption 2: the instruments must influence the outcome only through the exposure; IV assumption 3: the instruments must not associate with measured or unmeasured confounders. (C-F) Scatter plots demonstrating the relationship between the instrumental single nucleotide polymorphism (SNP) effects on the exposure against their corresponding effects on the outcome. The slope of the regression is the estimate of the causal effect of the exposure on the outcome. (C) If there is no violation of the IV2 assumption (no horizontal pleiotropy), or the horizontal pleiotropy is balanced, an unbiased causal estimate can be obtained by inverse-variance weighted (IVW) linear regression, where the contribution of each instrumental SNP to the overall effect is weighted by the inverse of the variance of the SNP-outcome effect.
Fixed and random effects IVW approaches are available (the slopes from both approaches are identical but the variance of the slope is inflated in the random effects model in the presence of heterogeneity between SNPs). (D) If there is a tendency for the horizontal pleiotropic effect to be in a particular direction, then constraining the regression line to pass through the origin will incur bias (grey line). Egger regression relaxes this constraint by allowing the intercept to take a value other than zero, returning an unbiased effect estimate if the instrument-exposure and pleiotropic effects are uncorrelated, also known as the InSIDE (Instrument Strength Independent of Direct Effect) assumption (Bowden et al., 2015). Pleiotropic effect here refers to the effect of the instrument on the outcome that is not mediated by the exposure. (E) If the majority of the instruments are valid (black points), with some invalid instruments (red points), the median-based approach will provide an unbiased estimate in the presence of unbalanced horizontal pleiotropy (black line), whereas IVW linear regression will provide a biased estimate (grey line). In addition, the median-based estimator does not require the InSIDE assumption of the Egger approach. (F) If a group of SNPs influences the outcome through a particular pathway other than the exposure (i.e. the SNPs are horizontally pleiotropic) then that group of SNPs will return consistently biased estimates. Clustering SNPs based on their estimates (grey lines) is possible with the mode-based estimator. The cluster with the largest weight (black line) is selected as the final causal estimate. The causal estimate from the mode-based estimator is unbiased if the SNPs contributing to the largest cluster are valid instruments. https://doi.org/10.7554/eLife.34408.003

Crucially, MR can be performed using results from GWAS, in a strategy known as 2-sample MR (2SMR) (Pierce and Burgess, 2013).
Here, the SNP-exposure effects and the SNP-outcome effects are obtained from separate studies. With these summary data alone, it is possible to estimate the causal influence of the exposure on the outcome. This has the tremendous advantage that causal inference can be made between two traits even if they aren't measured in the same set of samples, enabling us to harness the statistical power of pre-existing large GWAS analyses. Due to the flexibility afforded by the 2SMR strategy, MR can be applied to 1000s of potential exposure-outcome associations, where 'exposure' can be very broadly defined, from gene expression and proteins to more complex traits, such as body mass index and smoking. While MR avoids certain problems of conventional observational studies (Davey Smith and Ebrahim, 2001), it introduces its own set of new problems. MR is predicated on exploiting 'vertical' pleiotropy, where a SNP influences two traits because one trait causes the other (Davey Smith and Hemani, 2014). It is crucial to be aware of the assumptions and limitations that arise due to this model (Haycock et al., 2016). The main assumptions (Figure 1b) are: the instrument associates with the exposure (IV assumption 1); the instrument does not influence the outcome through some pathway other than the exposure (IV assumption 2); and the instrument does not associate with confounders (IV assumption 3). The IV1 assumption is easily satisfied in MR by restricting the instruments to genetic variants that are discovered using genome-wide levels of statistical significance and replicated in independent studies. The other two assumptions are impossible to prove, and, when violated, can lead to bias in MR analyses. Violations of the IV2 assumption can be introduced by 'horizontal' pleiotropy where the SNP influences the outcome through some pathway other than the exposure. Such effects can manifest in various different patterns (Figure 1c–f). 
When multiple independent instruments are available it is possible to perform sensitivity analyses that attempt to distinguish between horizontal and vertical pleiotropy and return causal estimates adjusted for the former (Bowden et al., 2016a; Bowden et al., 2015; Hartwig et al., 2017b). To improve reliability of causal inference, MR results should be presented alongside sensitivity analyses that make allowance for various potential patterns of horizontal pleiotropy. Further details on the design and interpretation of Mendelian randomization studies can be found in several existing reviews (Davey Smith and Hemani, 2014; Haycock et al., 2016; Swerdlow et al., 2016; Holmes et al., 2017; Zheng et al., 2017a). A glossary of terms can be found in Supplementary file 1F.

Model

In this section we describe how to use MR-Base to conduct MR analyses (Figure 2). The data required to perform the analysis can be described as a 'summary set' (Hemani et al., 2017a), where the genetic effects for a set of instruments are available for both the exposure and the outcome. To create a summary set we select appropriate instruments, obtain the effect estimates for those instruments for the exposure and the outcome, and harmonise the effects so that they reflect the same allele. We can then perform MR analyses using the summary set. These steps are supported by the database of GWAS results and R packages ('TwoSampleMR' and 'MRInstruments') curated by MR-Base and the following R packages curated by other researchers: 'MendelianRandomization' (Yavorska and Burgess, 2017), 'RadialMR' (Bowden et al., 2017b), 'MR-PRESSO' (Verbanck et al., 2018) and 'mr.raps' (Zhao et al., 2018). The statistical methods and R packages accessible through MR-Base are updated on a regular basis.

Figure 2. The practical steps for performing 2-sample Mendelian randomization (2SMR), as described in the Model section of the paper.
The database of genome-wide association study results and R packages ('TwoSampleMR' and 'MRInstruments') curated by MR-Base support the data extraction, harmonisation and analysis steps required for 2SMR. Additional R packages for MR from other researchers are also accessible, including MendelianRandomization (Yavorska and Burgess, 2017), RadialMR (Bowden et al., 2017b), MR-PRESSO (Verbanck et al., 2018) and mr.raps (Zhao et al., 2018). The available methods are updated on a regular basis. https://doi.org/10.7554/eLife.34408.004

Obtaining instruments

Instruments are characterised as SNPs that reliably associate with the exposure, meaning they should be obtained from well-conducted GWAS, typically involving their detection in a discovery sample at a GWAS threshold of statistical significance (e.g. p < 5 × 10⁻⁸) followed by replication in an independent sample. The minimum data requirements for each SNP are effect sizes (βx), standard errors (σx) and effect alleles. Also useful are sample size, non-effect allele and effect allele frequency.

Figure 3. The data available through MR-Base and the possible exposure-outcome analyses that can be performed. Exposure traits can be very broadly defined and may include molecular traits like gene expression, DNA-methylation, metabolites and proteins, as well as more complex traits, including cholesterol, body mass index, smoking and education. Further details on the traits with complete summary data can be found in Supplementary file 1A. The numbers reflect MR-Base in December 2017 and are updated on a regular basis. https://doi.org/10.7554/eLife.34408.005

Sources

There are several data sources that can be used in MR-Base (Figure 3) to define exposure and outcome traits (the number of traits is updated on a regular basis):

The MR-Base database comprises complete GWAS summary data for hundreds of traits (Figure 3 and Supplementary file 1A).
By 'complete' we mean all SNPs reported in a GWAS analysis, with no exclusions on the basis of a p-value threshold for association with the target trait of interest. The user can extract the top hits from this data source using their own criteria (e.g. strength of p-value). Alternatively, potential instruments can be obtained from the MRInstruments package, which includes independent SNP-trait associations from the database with p-value < 5e-8.

Quantitative trait loci (QTL) studies performed on DNA methylation (Gaunt et al., 2016), gene expression (GTEx Consortium, 2015), protein (Deming et al., 2016) and metabolite (Shin et al., 2014; Kettunen et al., 2016) levels generate hundreds to thousands of independent associations for thousands of traits. The MRInstruments R package contains hundreds of thousands of 'omic QTLs for ease of use within MR-Base.

The NHGRI-EBI GWAS catalog (Welter et al., 2014) comprises 21,324 SNPs associated with 1628 complex traits and diseases. This list of potential instruments has been harmonised and formatted for ease of use in MR-Base through the MRInstruments R package.

User provided data can also be used for analysis.

Independence

It is important to ensure that instruments selected for an exposure are independent, unless measures are taken in the MR analysis to account for any correlation structures that arise through linkage disequilibrium. An efficient way to ensure that instruments are independent is to use clumping against a reference dataset of similar ancestry to the samples in which the GWAS was conducted. A function has been implemented in MR-Base to automate the selection of independent instruments.

Obtaining SNP effects on the outcome

In order to generate the summary set, the effects of each of the instruments on the outcome need to be obtained. This typically requires access to the complete set of GWAS results because it is unlikely that the instrumental SNPs for the exposure will be among the top hits of the outcome GWAS. As with the exposure data, the outcome data must include at a minimum the SNP effects (βy), their standard errors (σy) and effect alleles.
If a particular SNP is not present in the outcome dataset then it is possible to use SNPs that are LD proxies instead. Here, it is important to ensure that for any proxy the effect allele is the one in phase with the effect allele of the target SNP. Tools for this are provided by MR-Base.

Sources

There are two main sources that can be used (Figure […]):

The MR-Base database comprises complete GWAS summary data for hundreds of traits (Supplementary file 1A). Automated lookups for SNPs against traits can be performed. If a SNP is absent then MR-Base offers options for finding LD proxies using data from […] ([…] et al., 2015), and returns the corresponding data for the proxy (Figure 2).

User provided complete GWAS summary data can be used with the R package.

Harmonising exposure and outcome SNP effects

To generate a summary set, for each SNP we need its effect and standard error on the exposure and the outcome corresponding to the same effect alleles ([…] et al., 2016). This is impossible to generate if the effect alleles for the SNP effects in the exposure and outcome datasets are unknown. MR-Base uses knowledge of the effect alleles and, where necessary, the effect allele frequencies to harmonise the exposure and outcome datasets. The following scenarios are handled:

Wrong effect alleles

A SNP with alleles […] for the exposure and […] for the outcome is harmonised by flipping the sign of the SNP-outcome effect.
Strand issues

SNPs that are reported as […] for the exposure summary data and […] for the outcome summary data indicate a strand issue; for example, one study has reported the effect on the forward strand and the other on the reverse strand. In this case, the outcome alleles are flipped to match those of the exposure and effect alleles are then aligned.

Palindromic SNPs

SNPs with A/T or G/C alleles are known as palindromic SNPs, because their alleles are represented by the same pair of letters on the forward and reverse strands, which can introduce ambiguity into the identity of the effect allele in the exposure and outcome GWAS. If reference strands are unknown, effect allele frequency can be used to resolve the ambiguity. For example, consider a SNP with alleles A and T, with a frequency of […] for allele A in the exposure study and […] in the outcome study. In addition, both studies have coded allele A as the effect allele and both are of […]. The fact that allele A is the minor allele in the exposure study and the major allele in the outcome study implies that the two studies have used different reference strands. To ensure that the effect sizes for the SNP reflect the same allele it is therefore necessary to flip the sign of the effect in the exposure or outcome study (the default in MR-Base is to flip the sign of the effect in the outcome study). Allele frequency may, however, be an unreliable indicator of strand when it is close to 0.5. This process has been described in more detail previously ([…] et al., 2016).

Incompatible alleles

If a SNP has […] alleles for the exposure and […] alleles for the outcome, there is no combination of flipping that can harmonise these SNPs, and either there are strand issues or there is an error in the data. In this case the SNP is excluded from the analysis.

MR analysis

The summary set can now be analysed using a range of methods (described in Supplementary file […], but new methods are added on a regular basis). The simplest way to combine these data is to use a Wald ratio, where the causal effect is estimated as βy/βx and the standard error of the estimate is σy/|βx|. If there are multiple independent instruments for the exposure (which is typically the case for complex traits with well-powered GWAS), then our analysis can improve in two major ways. First, the variance explained in the exposure, and therefore statistical power, will increase. Second, we can evaluate the sensitivity of the estimate to bias from violations of the IV2 assumption by assuming different patterns of horizontal pleiotropy. The following analyses are performed by MR-Base.
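The harmonisation rules and the ratio estimate described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the MR-Base or TwoSampleMR implementation: `SnpEffect`, `harmonise` and `wald_ratio` are hypothetical names, alleles are assumed to be single bases, palindromic SNPs are simply dropped rather than resolved with allele frequencies, and the standard error uses the common first-order approximation (outcome standard error divided by the absolute exposure effect).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Complementary bases, used to move alleles to the other strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


@dataclass
class SnpEffect:
    """Hypothetical container for one SNP-trait association."""
    effect_allele: str
    other_allele: str
    beta: float
    se: float


def is_palindromic(a1: str, a2: str) -> bool:
    """A/T and G/C SNPs look identical on both strands."""
    return COMPLEMENT[a1] == a2


def harmonise(exposure: SnpEffect, outcome: SnpEffect) -> Optional[SnpEffect]:
    """Express the outcome effect for the exposure effect allele.

    Returns None when alleles alone cannot resolve the SNP
    (palindromic or incompatible alleles).
    """
    exp = (exposure.effect_allele, exposure.other_allele)
    if is_palindromic(*exp):
        return None  # would need allele frequencies; not handled here
    out = (outcome.effect_allele, outcome.other_allele)
    other_strand = (COMPLEMENT[out[0]], COMPLEMENT[out[1]])
    for alleles in (out, other_strand):
        if alleles == exp:  # same effect allele: keep the sign
            return SnpEffect(exp[0], exp[1], outcome.beta, outcome.se)
        if alleles == exp[::-1]:  # effect allele swapped: flip the sign
            return SnpEffect(exp[0], exp[1], -outcome.beta, outcome.se)
    return None  # incompatible alleles: exclude the SNP


def wald_ratio(exposure: SnpEffect, outcome: SnpEffect) -> Tuple[float, float]:
    """Ratio estimate and first-order standard error for one harmonised SNP."""
    estimate = outcome.beta / exposure.beta
    std_error = outcome.se / abs(exposure.beta)
    return estimate, std_error


if __name__ == "__main__":
    exposure = SnpEffect("G", "T", beta=0.2, se=0.02)
    outcome = SnpEffect("T", "G", beta=-0.05, se=0.01)  # effect allele swapped
    aligned = harmonise(exposure, outcome)
    if aligned is not None:
        print(wald_ratio(exposure, aligned))  # approximately (0.25, 0.05)
```

Returning `None` for SNPs that cannot be resolved keeps the exclusion explicit, so the caller decides whether to drop the SNP or attempt a frequency-based resolution.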
Inverse variance weighted MR

The simplest way to obtain an MR estimate using multiple SNPs is to perform an inverse variance weighted (IVW) meta-analysis of each Wald ratio (Johnson, 2012), treating each SNP as a valid instrument. Fixed effects IVW assumes that each SNP provides the same estimate; in other words, none of the SNPs exhibit horizontal pleiotropy or other violations of assumptions. Random effects IVW relaxes this assumption, allowing each SNP to have different mean effects, for example due to horizontal pleiotropy (Bowden et al., 2017a). This will return an unbiased estimate if the horizontal pleiotropy is balanced, that is, the deviation from the mean estimate is independent from all other effects. Equivalently, this is implemented as a weighted regression of the SNP-outcome effects on the SNP-exposure effects, with the regression constrained to pass through the origin, and with weights derived from the inverse of the variance of the outcome effects. MR-Base uses a random effects IVW model by default, unless there is underdispersion in the causal estimates between SNPs, in which case a fixed effects model is used. The estimates from the random and fixed effects IVW approaches are the same but the variance for the random effects model is inflated to take into account heterogeneity between SNPs.

Maximum likelihood

An alternative strategy to the IVW approach is to estimate the causal effect by direct maximisation of the likelihood given the SNP-exposure and SNP-outcome effects and assuming a linear relationship between the exposure and outcome (Pierce and Burgess, 2013). Similar to the fixed effects IVW approach, the method assumes that the effect of the exposure on the outcome due to each SNP is the same, that is, there is no heterogeneity or horizontal pleiotropy.
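As a concrete illustration of the IVW regression described above, the following sketch fits a weighted regression through the origin in plain Python. The function name `ivw` and its arguments are hypothetical, and the random effects variant shown here is a multiplicative one (the fixed effects standard error is scaled by the residual dispersion only when that dispersion exceeds 1), which is an assumption about the form of the model rather than a statement of how MR-Base computes it.

```python
import math
from typing import Sequence, Tuple


def ivw(beta_x: Sequence[float], beta_y: Sequence[float],
        se_y: Sequence[float], random_effects: bool = True) -> Tuple[float, float]:
    """IVW estimate as a weighted regression of beta_y on beta_x through the origin."""
    weights = [1.0 / s ** 2 for s in se_y]  # inverse variance of outcome effects
    sxx = sum(w * bx * bx for w, bx in zip(weights, beta_x))
    sxy = sum(w * bx * by for w, bx, by in zip(weights, beta_x, beta_y))
    estimate = sxy / sxx
    se_fixed = math.sqrt(1.0 / sxx)
    if not random_effects or len(beta_x) < 2:
        return estimate, se_fixed
    # Residual dispersion: weighted squared residuals per degree of freedom.
    q = sum(w * (by - estimate * bx) ** 2
            for w, bx, by in zip(weights, beta_x, beta_y))
    dispersion = q / (len(beta_x) - 1)
    # Inflate the variance only when there is overdispersion.
    return estimate, se_fixed * math.sqrt(max(1.0, dispersion))


if __name__ == "__main__":
    bx = [0.1, 0.2, 0.3]
    by = [0.08, 0.06, 0.20]
    se = [0.01, 0.01, 0.01]
    print(ivw(bx, by, se, random_effects=False))
    print(ivw(bx, by, se, random_effects=True))
```

Both calls return the same point estimate; only the standard error differs, which mirrors the statement that the fixed and random effects estimates coincide while the random effects variance is inflated.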
An unbiased estimate will be returned in the absence of horizontal pleiotropy or when horizontal pleiotropy is balanced, although the variance of the effect estimate will be underestimated in the latter case. An advantage of the method is that it may provide more reliable results in the presence of measurement error in the SNP-exposure effects.

MR Egger analysis

MR Egger relaxes the IV2 assumption of no horizontal pleiotropy (Bowden et al., 2015; Bowden et al., […]). It adapts the IVW analysis by allowing a non-zero intercept, allowing the net horizontal pleiotropic effect across all SNPs to be unbalanced, or directional. The method returns an unbiased causal effect even if the IV2 assumption is violated for all SNPs but assumes that the horizontal pleiotropic effects are not correlated with the SNP-exposure effects (this is known as the InSIDE assumption). Horizontal pleiotropy here refers to the effects of the SNPs on the outcome not mediated by the exposure.

Median-based estimator

An alternative approach is to take the median effect of all available SNPs (Bowden et al., 2016a; […] et al., 2014). This has the advantage that only half the SNPs need to be valid instruments (i.e. exhibiting no horizontal pleiotropy, no association with confounders, robust association with the exposure) for the causal effect estimate to be unbiased. The weighted median estimate allows stronger SNPs to contribute more towards the estimate, and can be obtained by weighting the contribution of each SNP by the inverse variance of its association with the outcome.

Mode-based methods

The mode-based estimator clusters the SNPs into groups based on similarity of causal effects, and returns the causal effect estimate based on the cluster that has the largest number of SNPs ([…] et al., 2017b). The mode-based estimator returns an unbiased causal effect if the SNPs within the largest cluster are valid instruments. Clustering is performed using a kernel density function that requires a bandwidth parameter. The weighted mode introduces an extra element, similar to IVW and the weighted median, weighting each SNP's contribution to the clustering by the inverse variance of its outcome effect.

Diagnostics and sensitivity analyses

It is recommended that the methods described are applied to all MR analyses and presented in publications to demonstrate sensitivity to different patterns of assumption violations. MR-Base also implements the following sensitivity analyses and diagnostics.

Heterogeneity tests

Heterogeneity in causal effects among instruments is an indicator of potential violations of IV assumptions (Bowden et al., 2017a).
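One such violation, directional horizontal pleiotropy, is what the MR Egger intercept described above is designed to absorb. The sketch below is a minimal illustration, assuming a weighted least squares fit with weights equal to the inverse variance of the SNP-outcome effects and SNPs oriented so that their exposure effects are positive; the function name `mr_egger` is hypothetical, standard errors are omitted, and this is not the TwoSampleMR implementation.

```python
from typing import Sequence, Tuple


def mr_egger(beta_x: Sequence[float], beta_y: Sequence[float],
             se_y: Sequence[float]) -> Tuple[float, float]:
    """Weighted regression of beta_y on beta_x with a free intercept.

    Returns (intercept, slope). The intercept reflects the average
    directional pleiotropic effect; the slope is the causal estimate.
    """
    if len(beta_x) < 3:
        raise ValueError("MR Egger needs at least three SNPs")
    # Orient every SNP so that its exposure effect is positive.
    bx = [abs(x) for x in beta_x]
    by = [y if x >= 0 else -y for x, y in zip(beta_x, beta_y)]
    w = [1.0 / s ** 2 for s in se_y]
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, bx)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, by)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, bx))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, bx, by))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Simulated effects: true slope 0.5 plus a constant pleiotropic shift of 0.02.
    bx = [0.1, 0.2, 0.3, 0.4]
    by = [0.07, 0.12, 0.17, 0.22]
    se = [0.01, 0.01, 0.01, 0.01]
    print(mr_egger(bx, by, se))  # intercept near 0.02, slope near 0.5
```

In this toy example a regression forced through the origin would overestimate the slope, whereas the free intercept recovers the simulated value because the pleiotropic shift is unrelated to the exposure effects, which is the InSIDE condition.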
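Heterogeneity can be calculated for the IVW and Egger estimates, and this can be used to navigate between models of horizontal pleiotropy (Bowden et al., 2017a).

Leave-one-out analysis

To assess whether the MR estimate is driven or biased by a single SNP that might have a large horizontal pleiotropic effect, we can re-estimate the effect by sequentially dropping one SNP at a time. SNPs that, when dropped, lead to a large change in the estimate can be informative about the sensitivity of the estimate to outliers.

Funnel plots

A tool used in meta-analysis is the funnel plot, in which the estimate for a particular SNP is plotted against its precision ([…] et al., […]). Asymmetry in the funnel plot may be indicative of violations of the IV2 assumption through horizontal pleiotropy.

Additional MR analysis methods

In addition to the above, MR-Base also supports access to the following statistical methods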