Impact of influential data on screening epigenome-wide data

2026 · 0 citations · Journal Article · Gold Open Access

Authors

Samia Sultana · University of Memphis

Hongmei Zhang · University of Memphis

Yu Jiang · University of Memphis

Mohammad Nahian Ferdous Abrar · University of Memphis

Hasan Arshad · David Hide Asthma and Allergy Research Centre

Lu Xie · University of Memphis

Meredith Ray

Abstract

ttScreening (TT) is an effective high-dimensional screening algorithm to identify important cytosine-phosphate-guanine dinucleotide (CpG) sites associated with DNA methylation. Via simulations, we aimed to examine the impact of influential outliers on TT’s performance. We simulated K = 2,000 and 10,000 CpG sites across n = 100 and 200 subjects, linearly associated with a continuous outcome, $x_1$, and other latent variables, with the error term following a normal or Cauchy distribution. Among the K CpGs, 10 were associated with $x_1$ (informative CpGs) while the remaining sites were not associated with $x_1$ (non-informative CpGs). We artificially created 1 to 5 influential points in one informative and one non-informative CpG site and compared TT’s accuracy to Bonferroni and false discovery rate (FDR)-based approaches. TT performed as well as or better than the FDR and Bonferroni-based approaches across all degrees of influentiality. When focusing on non-informative CpG detection, regardless of sample size, all approaches had high accuracy (above 85%, overall) at their optimal cutoff for a single influential point. Among the CpG sites with a higher number of influential points (five points) and a normal error term, TT required a cutoff of at least 70% to achieve accuracy above 0%, whereas FDR and Bonferroni both had an accuracy of 0% for n = 100 and 200. Increasing the TT cutoff to 80% raised accuracy to 20% and 24% for n = 100 and 200, respectively, and a 90% cutoff raised it further to 97% and 99%, respectively (among K = 2,000). We observed the same patterns for 10,000 CpGs and for informative CpG detection. When Cauchy error terms were applied, the same patterns held, but with a higher magnitude of accuracy for all approaches, and thus TT required lower cutoffs to achieve 100% accuracy. In summary, in the presence of influential data, we recommend a more conservative cutoff of 70–90% compared to the default cutoff of 50–70% suggested by Ray et al. (in Biomed Res Int 2016(1):2615348, 2016). TT, Bonferroni, and FDR are capable approaches for type I error protection when screening high-dimensional data. However, in the presence of influential data, TT is likely to be the most robust approach.
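
The simulation design described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the authors' code or the ttScreening package: the reduced sizes (K = 200, 20 iterations), the effect size, the way influential points are injected, the Fisher z approximation used for p-values, and the function names `corr_pvalue`, `bonferroni`, `benjamini_hochberg`, and `train_test_screen`. The last function is only a simplified train-and-test selection rule in the spirit of TT.

```python
import math
import random
from statistics import NormalDist


def corr_pvalue(x, y):
    """Two-sided p-value for a Pearson correlation (Fisher z approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * NormalDist().cdf(-abs(z))


def bonferroni(pvals, alpha=0.05):
    """Indices with p <= alpha / (number of tests)."""
    return {i for i, p in enumerate(pvals) if p <= alpha / len(pvals)}


def benjamini_hochberg(pvals, alpha=0.05):
    """Indices selected by the Benjamini-Hochberg step-up FDR procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_star = rank  # largest rank passing its threshold
    return set(order[:k_star])


def train_test_screen(x, cpgs, rng, iterations=20, cutoff=0.9, alpha=0.05):
    """Simplified train-and-test rule (an assumption, not the package itself):
    a CpG counts as a hit in an iteration if it is significant in both the
    training and testing split; it is kept if its hit rate reaches `cutoff`."""
    n = len(x)
    n_train = (2 * n) // 3
    counts = [0] * len(cpgs)
    for _ in range(iterations):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        x_tr = [x[i] for i in train]
        x_te = [x[i] for i in test]
        for j, y in enumerate(cpgs):
            if corr_pvalue(x_tr, [y[i] for i in train]) > alpha:
                continue
            if corr_pvalue(x_te, [y[i] for i in test]) <= alpha:
                counts[j] += 1
    return {j for j, c in enumerate(counts) if c / iterations >= cutoff}


# Toy data: the first 10 CpGs depend on x1, the rest are pure noise.
rng = random.Random(2024)
n, K, n_informative = 100, 200, 10
x1 = [rng.gauss(0, 1) for _ in range(n)]
cpgs = [
    [(1.0 if j < n_informative else 0.0) * xi + rng.gauss(0, 1) for xi in x1]
    for j in range(K)
]

# Inject five influential points into one non-informative CpG (hypothetical
# construction): overwrite its values at the subjects with the most extreme x1.
target = K - 1
extreme = sorted(range(n), key=lambda i: abs(x1[i]), reverse=True)[:5]
for i in extreme:
    cpgs[target][i] = 8.0 * x1[i]

pvals = [corr_pvalue(x1, y) for y in cpgs]
bonf_hits = bonferroni(pvals)
bh_hits = benjamini_hochberg(pvals)
tt_hits = train_test_screen(x1, cpgs, rng, cutoff=0.9)

for name, hits in [("Bonferroni", bonf_hits), ("FDR (BH)", bh_hits), ("TT-like", tt_hits)]:
    print(f"{name}: {len(hits)} selected, contaminated CpG selected: {target in hits}")
```

The train-and-test rule depends on how the few influential subjects happen to fall across random splits, which is why its selection cutoff matters; the single full-sample tests used for Bonferroni and FDR see all of them at once. Varying `cutoff` between 0.5 and 0.9 in this sketch is one way to explore that sensitivity, though the toy results should not be read as reproducing the paper's figures.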

Topics & Keywords

Epigenetics and DNA Methylation · Genomics and Rare Diseases · Machine Learning in Bioinformatics

Publication Details

Published in: BMC Bioinformatics

DOI: 10.1186/s12859-026-06369-4

Field-Weighted Citation Impact: 0.00
