Empirical Bayes single nucleotide variant-calling for next-generation sequencing data

Sci Rep. 2024 Jan 18;14(1):1550. doi: 10.1038/s41598-024-51958-z.

Abstract

One of the fundamental computational problems in cancer genomics is the identification of single nucleotide variants (SNVs) from DNA sequencing data. Many statistical models and software implementations for SNV calling have been developed, yet they still disagree widely on real datasets. Based on an empirical Bayesian approach, we introduce a local false discovery rate (LFDR) estimator for germline SNV calling. Our approach learns model parameters without prior information and simultaneously accounts for information across all sites in the genomic regions of interest. We also propose another LFDR-based algorithm that reliably prioritizes a given list of mutations called by any other variant-calling algorithm. We use a suite of gold-standard cell line data to compare our LFDR approach against a collection of widely used, state-of-the-art programs. We find that our LFDR approach approximately matches or exceeds the performance of all of these programs, despite some very large differences among them. Furthermore, when prioritizing other algorithms' calls by our LFDR score, we find that by adjusting the tradeoff between type I and type II errors we can select subsets of variant calls with minimal loss of sensitivity but dramatic increases in precision.
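
To make the idea of an LFDR score concrete, the following is a minimal sketch of a generic two-group empirical Bayes model, not the estimator developed in the paper. It assumes, purely for illustration, that alternate-allele read counts at a non-variant site follow a binomial distribution with a fixed sequencing error rate, and that counts at a heterozygous germline site follow a binomial with frequency 0.5. The null proportion `pi0` is learned from all sites jointly by EM, and the LFDR of a site is the posterior probability that it is not a variant. The function names, the parameters `error_rate` and `alt_freq`, and the example read counts are all hypothetical.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of x successes in n trials with success rate p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def lfdr_two_group(counts, error_rate=0.01, alt_freq=0.5, n_iter=200):
    """Illustrative two-group LFDR (assumed model, not the paper's estimator).

    counts: list of (alt_reads, depth) pairs, one per site.
    Returns (pi0, lfdr), where pi0 is the estimated proportion of
    non-variant sites and lfdr[i] is P(site i is not a variant | data).
    """
    # Likelihood of each site under the null (error only) and the
    # alternative (heterozygous variant); both are assumptions.
    f0 = [binom_pmf(x, n, error_rate) for x, n in counts]
    f1 = [binom_pmf(x, n, alt_freq) for x, n in counts]

    def posterior_null(pi0):
        return [pi0 * a / (pi0 * a + (1 - pi0) * b) for a, b in zip(f0, f1)]

    pi0 = 0.5  # starting value; EM updates it from the data
    for _ in range(n_iter):
        lfdr = posterior_null(pi0)      # E-step: posterior null probability
        pi0 = sum(lfdr) / len(lfdr)     # M-step: update the null proportion
    return pi0, posterior_null(pi0)


# Hypothetical (alt_reads, depth) pairs for five sites.
sites = [(0, 30), (1, 40), (15, 30), (20, 38), (0, 25)]
pi0, lfdr = lfdr_two_group(sites)

# Prioritize sites: smallest LFDR first, then keep those under a cutoff.
ranked = sorted(range(len(sites)), key=lambda i: lfdr[i])
called = [i for i in ranked if lfdr[i] < 0.05]
```

Lowering the cutoff on the LFDR keeps fewer calls with higher expected precision, while raising it recovers more calls at the cost of more false positives, which is the kind of tradeoff the abstract describes when re-ranking calls from other variant callers.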

MeSH terms

  • Algorithms
  • Bayes Theorem
  • High-Throughput Nucleotide Sequencing
  • Nucleotides* / genetics
  • Polymorphism, Single Nucleotide*
  • Software

Substances

  • Nucleotides