
ddqc is a method, proposed in a paper published in Genome Biology, for performing quality control (QC) on scRNA-seq data in a way that accounts for biological variability between cell types. Please refer to the paper and the two GitHub repositories.

In this post, I’m going to use R and ddqc_R.

Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of biology by enabling the transcriptomic profiling of individual cells. However, the high-throughput nature of scRNA-seq experiments also introduces technical noise and low-quality cells that can confound downstream analysis. Effective quality control (QC) is therefore a critical first step in any scRNA-seq data analysis pipeline.

Limitations of the conventional QC step

Conventional QC approaches rely on fixed thresholds applied to QC metrics such as the percentage of mitochondrial reads or the number of genes detected per cell. Cells exceeding these thresholds are considered low-quality and filtered out. While this approach is simple and widely used, it fails to account for the biological heterogeneity between cell types.

For example, this might be a typical QC step with Seurat in R:

library(Seurat)

# Assuming 'seurat.obj' is our Seurat object and percent.mt has already been computed
seurat.obj <- subset(seurat.obj,
                     subset = nFeature_RNA > 200 &
                              nFeature_RNA < 2500 &
                              percent.mt < 5)

Here, we’re filtering out cells with fewer than 200 or more than 2500 genes detected, and those with more than 5% mitochondrial reads. While these thresholds might work well for some datasets, they could inadvertently remove biologically relevant cell types in others.

Recent research has shown that QC metrics can vary substantially between cell types in biologically meaningful ways. For example, certain cell types like neurons or hepatocytes naturally have higher mitochondrial content, while others like sperm have inherently low complexity. Applying fixed QC thresholds risks removing these biologically important cell populations.

Another approach is to determine thresholds statistically. For example, the [pipeComp](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02136-7) method published in Genome Biology in 2020 removes cells that fall outside 3 to 5 MADs (Median Absolute Deviations) from the median.
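To make the MAD idea concrete, here is a minimal base R sketch of a dataset-wide MAD cutoff. It is not pipeComp's code: the toy `percent_mt` values, the helper name `is_high_outlier`, and the choice of 3 MADs are all assumptions made for illustration.

```r
# Toy data (hypothetical): percent mitochondrial reads for 10 cells
percent_mt <- c(2.1, 3.0, 2.7, 3.4, 2.9, 3.1, 2.5, 2.8, 3.3, 25.0)

# Flag values lying more than n_mads MADs above the median.
# Note: stats::mad() multiplies by 1.4826 by default, which makes the MAD
# comparable to a standard deviation for normally distributed data.
is_high_outlier <- function(x, n_mads = 3) {
  x > median(x) + n_mads * mad(x)
}

outlier <- is_high_outlier(percent_mt, n_mads = 3)

# Cells that survive the dataset-wide cutoff
percent_mt[!outlier]
```

With these toy values, only the cell at 25% is flagged. The cutoff is still computed once for the whole dataset, so every cell type is judged against the same median.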

However, biological context differs widely across species, tissues, and sample types, so any single threshold applied uniformly to all cells can still remove valid cell populations.

ddqc: Data-Driven Quality Control

To address this challenge, researchers at the Klarman Cell Observatory, Broad Institute developed ddqc (data-driven quality control), a method that accounts for biological variability between cell types. The key idea behind ddqc is to perform QC adaptively within each cell type, rather than applying global thresholds.

ddqc performs the following steps:

  1. Clustering: ddqc first clusters the cells to identify transcriptionally similar groups approximating cell types or states.
  2. Adaptive QC: Within each cluster, ddqc identifies outlier cells based on the median absolute deviation (MAD) of the QC metrics for that cluster. Cells exceeding a threshold defined in terms of the MAD are considered low-quality and filtered out.
  3. Iteration: This process is repeated for each QC metric of interest, such as the percentage of mitochondrial reads, the number of genes detected, and the number of UMIs.
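The steps above can be sketched in base R on toy data. This is only an illustration of per-cluster MAD filtering, not the ddqc_R implementation: the cluster labels stand in for the clustering step, and the `cells` table, the multiplier `n_mads = 2`, and the column names are assumptions.

```r
# Toy per-cell table (hypothetical): cluster labels stand in for step 1
cells <- data.frame(
  cluster    = rep(c("A", "B"), each = 6),
  percent_mt = c(2, 3, 2.5, 3.5, 3, 12,     # cluster A: low-mito cell type
                 18, 20, 19, 21, 22, 45)    # cluster B: naturally high-mito cell type
)

n_mads <- 2  # assumed MAD multiplier for this sketch

# Step 2: per-cluster upper cutoff = median + n_mads * MAD within that cluster.
# ave() applies the function within each cluster and recycles the result
# back to one value per cell.
cells$cutoff <- ave(cells$percent_mt, cells$cluster,
                    FUN = function(x) median(x) + n_mads * mad(x))
cells$keep <- cells$percent_mt <= cells$cutoff

# For comparison: a fixed global threshold of 5% mitochondrial reads
cells$keep_fixed <- cells$percent_mt < 5

# Step 3 would repeat the same logic for other metrics (genes, UMIs)
table(cluster = cells$cluster, kept_adaptive = cells$keep)
table(cluster = cells$cluster, kept_fixed = cells$keep_fixed)
```

In this toy example the adaptive cutoff drops only the 12% cell in cluster A and the 45% cell in cluster B, while the fixed 5% threshold removes all of cluster B.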

By adapting QC to the characteristics of each cell type, ddqc is able to retain more cells overall while specifically preserving biologically important populations that may be lost by fixed thresholds.

Why utilize ddqc?

The benefits of ddqc are substantial:

  • Improved Data Quality: By accounting for biological heterogeneity, ddqc provides a more principled and effective approach to removing low-quality cells.
  • Increased Statistical Power: Retaining more cells increases the statistical power for downstream analysis.
  • Enhanced Biological Discovery: Preserving rare or unusual cell types enables the discovery of new biology that might otherwise be missed.

Using ddqc in one’s analysis

This is my QC step for scRNA-seq:

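# pacman provides p_load(); Seurat provides as.SingleCellExperiment() and subset()
library(pacman)
p_load(Seurat, scDblFinder)

# Doublet removal with scDblFinder
sce <- as.SingleCellExperiment(seurat.obj)
sce <- scDblFinder(sce)
seurat.obj$multiplet_class <- colData(sce)$scDblFinder.class

# Keep only cells classified as singlets
seurat.obj <- subset(seurat.obj, multiplet_class == 'singlet')

# Defining thresholds using ddqc
# Install ddqc_R from GitHub (only needed once), then load it
devtools::install_github("ayshwaryas/ddqc_R")
p_load(ddqcR)

seurat.obj <- initialQC(seurat.obj)   # initial QC pass on the Seurat object
df.qc <- ddqc.metrics(seurat.obj)     # per-cell QC results from ddqc
df.qc                                 # inspect the QC table

# Filter the Seurat object according to the ddqc results
seurat.obj <- filterData(seurat.obj, df.qc)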

Before utilizing ddqc, I used scDblFinder for doublet detection and removal.

Simple and clean, isn’t it?

I love how ddqc covers everything from problem definition to development and user-centered deployment.
