The dnaEPICO User's Guide • dnaEPICO

Abstract

A comprehensive guide to use the dnaEPICO package. The package supports preprocessing and analysis of Illumina DNA methylation array data (EPICv2, EPIC and 450K), including QC, filtering, SVA, cell-type deconvolution, phenotype integration, modelling, and export-ready files for the DNAmAge Clock Foundation platform.

Introduction

The dnaEPICO package provides a structured workflow for preprocessing and analysing Illumina DNA methylation array data. It works with EPICv2, EPIC, and 450K arrays. The package supports data import, quality control, filtering, normalisation, surrogate variable analysis, cell-type deconvolution, phenotype integration, statistical modelling, and export files for DNAmAge Clock Foundation.

The package is designed to integrate with established Bioconductor tools such as minfi, ENmix, and wateRmelon.

dnaEPICO also prepares standardised DNA methylation files for direct export to the DNAmAge Clock Foundation platform.

Citing `dnaEPICO`

The dnaEPICO package uses methods from several external packages. Because no single manuscript describes all components, the guidance below explains how to cite dnaEPICO depending on the functions you use.

This citation guidance is adapted from the vignettes and user guides of minfi, ENmix, and wateRmelon.

If you use preprocessingMinfiEwasWater(), please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen 2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Murat et al. 2020; Maksimovic, Gordon, and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and Tost 2012).
If you use svaEnmix(), please cite (Aryee et al. 2014; Xu, Niu, and Taylor 2021).
If you use make f3 MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen 2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon, and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and Tost 2012; Murat et al. 2020).
If you use make f4 MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen 2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon, and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and Tost 2012; Murat et al. 2020; Marschner 2011).
If you use make f3lme MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen 2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon, and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and Tost 2012; Murat et al. 2020; Bates et al. 2015).
If you use make all MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen 2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon, and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and Tost 2012; Murat et al. 2020; Marschner 2011; Bates et al. 2015).

Installation

The dnaEPICO package depends on minfi, ENmix, and wateRmelon, among others. Install dependencies with BiocManager:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
}

BiocManager::install("dnaEPICO")

BiocManager::valid()

library("dnaEPICO")

Alternatively, install directly from GitHub:

install.packages('devtools')
devtools::install_github('paulYRP/dnaEPICO')

Overview

dnaEPICO is a DNA methylation preprocessing and modelling pipeline for Illumina arrays. It wraps standard analysis steps into a consistent structure that:

reads IDATs and phenotype metadata,
performs quality control, filtering, and normalisation,
computes surrogate variables (SVA),
merges phenotype data with methylation matrices,
fits GLM (cross-sectional) and GLMM/LME (longitudinal) models,
writes logs, figures, RData objects, and a PDF report.

Each step writes its outputs to a dedicated folder. This design lets you run steps independently or connect them through the Makefile pipeline.

Local/Cluster use

Local use is recommended for:

testing the pipeline on small data,
preparing clean merged data objects,
generating QC plots and logs,
validating input formats and parameters.

Cluster/HPC use is recommended for:

genome-wide modelling across many CpGs,
parallel chunk processing,
large memory and multi-core work,
reproducible batch runs with Make targets (f3, f4, f3lme, all).

Data expectations and folder conventions

dnaEPICO expects a project layout like this:

├── data/                                   # Processed data outputs (per model)
│   ├── preprocessingMinfiEwasWater/
│   │   ├── idats/                          # Folder containing raw IDAT files
│   │   ├── pheno.csv                       # Phenotype file with sample metadata
│   │   └── 12864_2024_10027_MOESM8_ESM.csv # Cross-reactivity comparison reference
│   ├── Makefile                            # Arguments

Phenotype file: a CSV with sample identifiers and covariates.
IDAT folder: contains Illumina *_Red.idat and *_Grn.idat pairs.

Cross-hybridisation probe data for EPICv2 can be downloaded from 12864_2024_10027_MOESM8_ESM.csv.

You can extract the example Makefile with:

extractMake(
  destDir = getwd(),
  overwrite = FALSE
  )

dnaEPICO produces:

├── data/                                   # Processed data outputs (per model)
│   ├── preprocessingMinfiEwasWater/
│   │   ├── idats/                          # Folder containing raw IDAT files
│   │   ├── pheno.csv                       # Phenotype file with sample metadata
│   │   └── 12864_2024_10027_MOESM8_ESM.csv # Cross-reactivity comparison reference
│   ├── model1/                             # Outputs specific to the first model
│   │   ├── preprocessingMinfiEwasWater/    # Preprocessed phenotype + QC from Minfi
│   │   ├── svaEnmix/                       # Surrogate variable analysis 
│   │   ├── preprocessingPheno/             # Cleaned and merged phenotype files 
│   │   ├── methylationGLM_T1/              # CpG-level GLM results (single timepoint)
│   │   ├── methylationGLMM_T1T2/           # Longitudinal LME results (T1 vs T2)
│   │   └── ...                             # Additional subfolders by model
│   ├── model2/
│   └── model3/
├── rData/                                  # Processed R objects (MSet, Beta, CN matrices, etc.)
├── logs/                                   # Logging output from script runs
├── figures/                                # QC plots and visualization output
├── results/                                # Output tables and summary statistics
├── reports/                                # Rendered reports (PDF)

Each step writes:

logs/: logs
rData/: intermediate and final R objects
figures/: plots organised by step
data/: exported tables and annotated results
reports/: final report files

This approach improves reproducibility, debugging, and HPC resuming.

Local use

Local use focuses on the first three steps that are usually run interactively:

preprocessingMinfiEwasWater(): import, QC, normalise, filter, and estimate cell composition
svaEnmix(): surrogate variable analysis using ENmix control probes
preprocessingPheno(): merge phenotype metadata with methylation matrices

Step 1: `preprocessingMinfiEwasWater()`

preprocessingMinfiEwasWater(
    phenoFile = "data/preprocessingMinfiEwasWater/pheno.csv",
    idatFolder = "data/preprocessingMinfiEwasWater/idats",
    outputLogs = "logs",
    nSamples = NA,
    SampleID = "Sample_Name",
    arrayType = "IlluminaHumanMethylationEPICv2",
    annotationVersion = "20a1.hg38",
    scriptLabel = "preprocessingMinfiEwasWater",
    baseDataFolder = "rData",
    sepType = "",
    tiffWidth = 2000,
    tiffHeight = 1000,
    tiffRes = 150,
    qcCutoff = 10.5,
    detPtype = "m+u",
    detPThreshold = 0.05,
    normMethods = "adjustedfunnorm",
    sexColumn = "Sex",
    pvalThreshold = 0.01,
    chrToRemove = "chrX,chrY",
    snpsToRemove = "SBE,CpG",
    mafThreshold = 0.1,
    crossReactivePath =
      "data/preprocessingMinfiEwasWater/12864_2024_10027_MOESM8_ESM.csv",
    plotGroupVar = "Sex",
    lcRef = "salivaEPIC",
    phenoOrder = "Sample_Name;Timepoint;Sex;PredSex;Basename;Sentrix_ID;Sentrix_Position",
    lcPhenoDir = "data/preprocessingMinfiEwasWater"
)

What it does:

This wrapper runs inst/scripts/preprocessingMinfiEwasWater.R. It:

Reads the phenotype CSV (phenoFile) and IDAT files (idatFolder).
Loads an RGChannelSet and performs initial QC.
Builds MSet, RatioSet, and GenomicRatioSet objects.
Computes detection p-values and filters failed probes and samples.
Applies normalisation using normMethods.
Filters probes based on detection p-values, sex chromosomes, SNPs, and cross-reactive probes.
Exports final Beta, M, and CN matrices to rData/.
Writes QC plots to figures/ and logs to logs/.
Estimates cell composition with lcRef and writes phenoLC.csv to lcPhenoDir.

Parameter guide:

phenoFile: path to sample metadata CSV.
idatFolder: folder with raw IDAT pairs.
nSamples: subset size for local testing (use NA to process all).
SampleID: sample identifier column in the phenotype file.
arrayType, annotationVersion: ensure the correct annotation and manifest.
qcCutoff, detPtype, detPThreshold: QC thresholds for probe/sample filtering.
normMethods: chosen normalisation method.
pvalThreshold, chrToRemove, snpsToRemove, mafThreshold: probe filtering rules.
crossReactivePath: path to cross-reactive probe list.
phenoOrder: order of columns in the final phenotype table.
lcRef, lcPhenoDir: cell type reference and output directory for cell fraction results.

Step 2: `svaEnmix()`

svaEnmix(
    phenoFile = "data/preprocessingMinfiEwasWater/phenoLC.csv",
    rgsetData = "rData/preprocessingMinfiEwasWater/objects/RGSet.RData",
    sepType = "",
    outputLogs = "logs",
    nSamples = NA,
    SampleID = "Sample_Name",
    arrayType = "IlluminaHumanMethylationEPICv2",
    annotationVersion = "20a1.hg38",
    SentrixIDColumn = "Sentrix_ID",
    SentrixPositionColumn = "Sentrix_Position",
    ctrlSvaPercVar = 0.90,
    ctrlSvaFlag = 1,
    scriptLabel = "svaEnmix",
    tiffWidth = 2000,
    tiffHeight = 1000,
    tiffRes = 150
)

What it does:

This wrapper runs inst/scripts/svaEnmix.R. It:

Loads the RGSet object from rgsetData and phenotype data from phenoFile.
Uses ENmix control probes to derive control-based surrogate variables.
Selects the number of surrogate components needed to explain ctrlSvaPercVar of variance.
Writes summary outputs, plots, and updated objects.

Parameter guide:

rgsetData: path to the saved RGSet object from Step 1.
phenoFile: phenotype file that includes cell fractions.
ctrlSvaPercVar: proportion of variance explained by control-derived SVA components.
SentrixIDColumn, SentrixPositionColumn: columns that identify array positions.

Step 3: `preprocessingPheno()`

preprocessingPheno(
    phenoFile = "data/preprocessingMinfiEwasWater/phenoLC.csv",
    sepType = "",
    betaPath =
      "rData/preprocessingMinfiEwasWater/metrics/beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData",
    mPath =
      "rData/preprocessingMinfiEwasWater/metrics/m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData",
    cnPath =
      "rData/preprocessingMinfiEwasWater/metrics/cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData",
    SampleID = "Sample_Name",
    timeVar = "Timepoint",
    timepoints = "1,2",
    combineTimepoints = "1,2",
    outputPheno = "data/preprocessingPheno",
    outputRData = "rData/preprocessingPheno/metrics",
    outputRDataMerge = "rData/preprocessingPheno/mergeData",
    sexColumn = "Sex",
    outputLogs = "logs",
    outputDir = "data/preprocessingPheno"
)

What it does:

This wrapper runs inst/scripts/preprocessingPheno.R. It:

Loads phenotype data and methylation matrices (Beta, M, CN).
Aligns samples using SampleID.
Creates timepoint-specific subsets based on timepoints.
Combines timepoints for longitudinal analysis using combineTimepoints.
Writes cleaned phenotype tables to outputPheno, metric objects to outputRData, and merged objects to outputRDataMerge.

Parameter guide:

phenoFile: the phenotype file generated in Step 1.
betaPath, mPath, cnPath: paths to saved methylation matrices.
SampleID: sample identifier column used to match data.
timeVar: column used to define timepoints.
timepoints: comma-separated list of timepoints to export separately.
combineTimepoints: list of timepoints to combine for longitudinal analysis.
outputPheno: directory for cleaned phenotype tables.
outputRData: directory for timepoint-specific methylation objects.
outputRDataMerge: directory for merged phenotype–methylation objects.
sexColumn: column used for biological sex.
outputLogs: directory for logs.
outputDir: base directory for exported files.

This step produces analysis-ready phenotype–methylation objects for GLM and LME modelling and creates files compatible with the DNAmAge Clock Foundation platform.

Cluster / HPC use

For full datasets, model fitting across hundreds of thousands of CpGs usually requires:

large memory,
parallel chunking,
long runtimes.

dnaEPICO supports this through a Makefile-driven pipeline that can run the whole workflow or selected targets.

Step 4: `methylationGLM_T1()`

This step is typically executed on HPC systems because it is computationally intensive.

methylationGLM_T1(
    inputPheno = "rData/preprocessingPheno/mergeData/phenoBetaT1.RData",
    outputLogs = "logs",
    outputRData = "rData/methylationGLM_T1/models",
    outputPlots = "figures/methylationGLM_T1",
    phenotypes = "Pheno1,Pheno2,Pheno3,Pheno4",
    covariates = "Sex,Age,Ethnicity,TraumaDefinition,Leukocytes,Epithelial.cells",
    factorVars = "Sex,Ethnicity,TraumaDefinition",
    cpgPrefix = "cg",
    cpgLimit = NA,
    nCores = 32,
    plotWidth = 2000,
    plotHeight = 1000,
    plotDPI = 150,
    interactionTerm = NULL,
    libPath = NULL,
    glmLibs = "glm2",
    prsMap = "Pheno1:PRS_1,Pheno2:PRS_2,Pheno3:PRS_3,Pheno4:PRS_4",
    summaryPval = NA,
    summaryResidualSD = TRUE,
    saveSignificantCpGs = FALSE,
    significantCpGDir = "preliminaryResults/cpgs/methylationGLM_T1",
    significantCpGPval = 0.05,
    saveTxtSummaries = TRUE,
    chunkSize = 10000,
    summaryTxtDir = "preliminaryResults/summary/methylationGLM_T1/glm",
    fdrThreshold = 0.05,
    padjmethod = "fdr",
    annotationPackage = "IlluminaHumanMethylationEPICv2anno.20a1.hg38",
    annotationCols = "Name,chr,pos,UCSC_RefGene_Group,UCSC_RefGene_Name,
                      Relation_to_Island,GencodeV41_Group",
    annotatedGLMOut = "data/methylationGLM_T1"
)

What it does:

This wrapper runs inst/scripts/methylationGLM_T1.R. It:

Loads a merged phenotype–beta object from inputPheno.
Processes CpGs in chunks and fits a generalised linear model for each CpG.
Fits one model per phenotype and adjusts for covariates.
Handles interaction terms and PRS mappings when supplied.
Applies multiple-testing correction.
Saves fitted model objects, plots, logs, summaries, and annotated results.

Parameter guide:

inputPheno: merged phenotype–beta .RData input.
outputLogs: directory for log files.
outputRData: directory for model objects.
outputPlots: directory for diagnostic and summary plots.
phenotypes: comma-separated phenotype variables.
covariates: covariates included in each model.
factorVars: categorical covariates.
cpgPrefix: CpG identifier prefix.
cpgLimit: number of CpGs to process; NA processes all.
nCores: CPU cores for parallel processing.
plotWidth, plotHeight, plotDPI: plot size and resolution.
interactionTerm: optional interaction term.
libPath: optional HPC library path.
glmLibs: GLM package to use.
prsMap: phenotype-to-PRS mapping.
summaryPval: threshold for summary outputs.
summaryResidualSD: include residual SD summaries.
saveSignificantCpGs: save CpGs passing the significance threshold.
significantCpGDir: directory for significant CpG tables.
significantCpGPval: significance threshold for CpGs.
saveTxtSummaries: save text summaries.
chunkSize: CpGs per chunk.
summaryTxtDir: directory for text-based summaries.
fdrThreshold: FDR threshold.
padjmethod: adjustment method.
annotationPackage: annotation package for CpGs.
annotationCols: annotation fields to include.
annotatedGLMOut: directory for annotated results.

GLM structure used in dnaEPICO (`glm2`):

For each CpG site, dnaEPICO fits a model of the form:

$\text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Phenotype} + \sum_{k=1}^{K} \beta_{k+1} \cdot \text{Covariate}_k + \varepsilon$

where $\text{Methylation}_{CpG_i}$ is the methylation value at CpG site $i$ , $\beta_0$ is the intercept, $\text{Phenotype}$ is the variable of interest, $\text{Covariate}_k$ represents the $k$ -th covariate in the model, $\beta_1$ and $\beta_{k+1}$ are the corresponding regression coefficients, and $\varepsilon$ is the residual error term.

For example, a GLM without PRS is:

$\begin{aligned} \text{Methylation}_{CpG_i} = \;& \beta_0 + \beta_1 \cdot \text{Phenotype} + \beta_2 \cdot \text{Age} + \beta_3 \cdot \text{Sex} + \beta_4 \cdot \text{Ethnicity} \\ &+ \beta_5 \cdot \text{TraumaDefinition} + \beta_6 \cdot \text{Leukocytes} + \beta_7 \cdot \text{Epithelial.cells} + \varepsilon \end{aligned}$

GLM with PRS:

When prsMap is supplied, the model is extended only for the matching phenotype.

From the example above:

prsMap = “Pheno1:PRS_1,Pheno2:PRS_2,Pheno3:PRS_3,Pheno4:PRS_4”

This defines one-to-one mappings between phenotype and PRS. For example:

Pheno1 uses PRS_1

$\text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Pheno1} + \sum_{k=1}^{K} \beta_{k+1} \cdot \text{Covariate}_k + \beta_{\text{PRS}} \cdot \text{PRS_1} + \varepsilon$

Pheno2 uses PRS_2

$\text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Pheno2} + \sum_{k=1}^{K} \beta_{k+1} \cdot \text{Covariate}_k + \beta_{\text{PRS}} \cdot \text{PRS_2} + \varepsilon$

Interaction term

If an interaction is specified (for example phenotype × timepoint):

$\text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot (\text{Phenotype} * \text{Timepoint}) + \sum_{k=1}^{K} \beta_{k+1} \cdot \text{Covariate}_k + \varepsilon$

Step 5: `methylationGLMM_T1T2()`

This step is typically executed on HPC systems because it is also computationally intensive.

methylationGLMM_T1T2(
    inputPheno = "rData/preprocessingPheno/mergeData/phenoBetaT1T2.RData",
    outputLogs = "logs/",
    outputRData = "rData/methylationGLMM_T1T2/models",
    outputPlots = "figures/methylationGLMM_T1T2",
    personVar = "person",
    timeVar = "Timepoint",
    phenotypes = "Pheno1,Pheno2,Pheno3,Pheno4",
    covariates = "Sex,Age,Ethnicity,TraumaDefinition,Leukocytes,Epithelial.cells",
    factorVars = "Sex,Ethnicity,TraumaDefinition,Timepoint",
    lmeLibs = "lme4,lmerTest",
    prsMap = "Pheno1:PRS_1,Pheno2:PRS_2,Pheno3:PRS_3,Pheno4:PRS_4",
    libPath = NULL,
    cpgPrefix = "cg",
    cpgLimit = NA,
    nCores = 32,
    summaryPval = NA,
    plotWidth = 2000,
    plotHeight = 1000,
    plotDPI = 150,
    interactionTerm = NULL,
    saveSignificantInteractions = TRUE,
    significantInteractionDir = "preliminaryResults/cpgs/methylationGLMM_T1T2",
    significantInteractionPval = 0.05,
    saveTxtSummaries = TRUE,
    chunkSize = 10000,
    summaryTxtDir = "preliminaryResults/summary/methylationGLMM_T1T2/lmer",
    fdrThreshold = 0.05,
    padjmethod = "fdr",
    annotationPackage = "IlluminaHumanMethylationEPICv2anno.20a1.hg38",
    annotationCols = "Name,chr,pos,UCSC_RefGene_Group,UCSC_RefGene_Name,
                      Relation_to_Island,GencodeV41_Group",
    annotatedLMEOut = "data/methylationGLMM_T1T2"
)

What it does:

This wrapper runs inst/scripts/methylationGLMM_T1T2.R. It:

Loads a merged longitudinal phenotype–methylation object from inputPheno.
Processes CpGs in chunks and fits a linear mixed-effects model.
Fits one model per phenotype and uses a subject-level random effect.
Includes fixed effects for phenotype, time, covariates, and optional interactions.
Adds phenotype-specific PRS terms based on prsMap.
Applies multiple-testing correction.
Saves model objects, figures, logs, summaries, significant interaction results, and annotated tables.

Parameter guide:

inputPheno: merged longitudinal phenotype–methylation .RData input.
outputLogs: directory for logs.
outputRData: directory for mixed-effects model objects.
outputPlots: directory for plots.
personVar: participant identifier for the random effect.
timeVar: time variable for longitudinal analysis.
phenotypes: comma-separated phenotype variables.
covariates: fixed-effect covariates.
factorVars: categorical variables including timeVar.
lmeLibs: mixed-model libraries.
prsMap: phenotype-to-PRS mappings.
libPath: optional library path.
cpgPrefix: CpG identifier prefix.
cpgLimit: maximum CpGs; NA uses all.
nCores: parallel cores.
summaryPval: summary threshold; NA disables filtering.
plotWidth, plotHeight, plotDPI: plot dimensions and resolution.
interactionTerm: optional interaction term.
saveSignificantInteractions: save significant interactions.
significantInteractionDir: directory for interaction results.
significantInteractionPval: significance threshold for interactions.
saveTxtSummaries: save text summaries.
chunkSize: CpGs per chunk.
summaryTxtDir: directory for text summaries.
fdrThreshold: FDR threshold.
padjmethod: adjustment method.
annotationPackage: annotation package.
annotationCols: annotation fields.
annotatedLMEOut: directory for annotated results.

LMEM structure used in dnaEPICO (`lme4`):

For longitudinal analyses, the model includes random effects and optional interactions:

$\begin{aligned} \text{Methylation}_{CpG_i} = \;& \beta_0 + \beta_1 \cdot (\text{Phenotype} * \text{Timepoint}) + \beta_2 \cdot \text{covariate} + \beta_3 \cdot \text{PRS} + (1 | person)+ \varepsilon \end{aligned}$

where PRS is included only if it is defined for the phenotype.

The model follows the same general form as methylationGLM_T1().

$\begin{aligned} \text{Methylation}_{CpG_i} =\;& \beta_0 + \beta_1 \cdot \text{Phenotype} + \beta_2 \cdot \text{Timepoint} + \beta_3 \cdot (\text{Phenotype} \times \text{Timepoint}) \\ &+ \beta_4 \cdot \text{Age} + \beta_5 \cdot \text{PRS} + b_{\text{person}} + \varepsilon \end{aligned}$

where $b_{person}$ is the subject-specific random intercept.

Makefile pipeline concepts

Makefile.model.pipeline: user configuration (models, variables, paths)
Makefile.rules.pipeline: pipeline targets and rules

Extract the example Makefile into the current directory:

## Copies the example Makefile into the current directory
extractMake(destDir = getwd(), overwrite = FALSE)

The example template Makefile.model.pipeline is copied and renamed to Makefile.

Make targets: f3, f4, f3lme and all

Common grouped targets:

f3: Steps 1–3 (preprocessing + SVA + merge)
f4: Steps 1–4 (adds GLM)
f3lme: Steps 1–3 and 5 (adds LME4)
all: Steps 1–5

Commands

Show status of pipeline outputs
- make status MODEL=model1
Run steps 1–3 for one model
- make f3 MODEL=model1
Run steps 1–4 for one model
- make f4 MODEL=model1
Run steps 1–3 and 5 for one model
- make f3lme MODEL=model1
Run the full pipeline
- make all MODEL=model1
Run multiple models
- make models (run the entire pipeline for all models in parallel.)
- make f3_models (run Steps 1–3 in parallel for all models.)
- make f4_models (run Steps 1–4 in parallel for all models.)
- make f3lme_models (run Steps 1–3 + LME in parallel for all models.)

Makefile Modelling steps (HPC)

The Makefile controls how dnaEPICO runs models at scale. It defines:

which phenotype files to use,
which statistical models to run,
GLM (glm2) and GLMM (lme4) parameter settings,
how results are organised and reproduced across models.

Instead of hard-coding parameters in R scripts, dnaEPICO uses a declarative Makefile layer. This makes it easy to rerun the same workflow with different configurations.

# ===============================================
# USER CONFIGURATION
# ===============================================
MAKEFLAGS += --output-sync

# ===============================================
# MODELS SELECTION PARALLEL RUN
# ===============================================
MODEL ?= model1 # Single model
MODELS = modelA modelB modelC # Multiple models for parallel run

# ===============================================
# PER-MODEL OVERRIDES
# ===============================================
ifeq ($(MODEL), modelA)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoA.csv


else ifeq ($(MODEL), modelB)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoB.csv


else ifeq ($(MODEL), modelC)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoC.csv

else # default (model1)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/pheno.csv

endif

# ===============================================
# DIRECTORIES
# ===============================================
LOGS_DIR = logs
DATA_DIR = data
RDATA_DIR = rData
RESULTS_DIR = results
FIGURES_DIR = figures
STEP1 = preprocessingMinfiEwasWater
STEP2 = svaEnmix
STEP3 = preprocessingPheno
STEP4 = methylationGLM_T1
STEP5 = methylationGLMM_T1T2
STEP6 = reports
METRICS_DIR = metrics
IDAT_DIR = idats
OBJ_DIR = objects
MERGE_DIR = mergeData
MODEL_DIR = models
CPG_DIR = cpgs
SUMMARY_DIR = summary
ENMIX_DIR = enmix
SVA_DIR = sva
QC_DIR = qc
R_DIR = ~/R/x86_64-pc-linux-gnu-library/4.4 # NULL is default R library path

# ===============================================
# GLOBAL PARAMETERS
# ===============================================
# STEP 1-5 PARAMETERS
PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoEMD.csv
SEP_TYPE = ""
SAMPLE_ID = Sample_Name
N_SAMPLES = 30
ARRAY_TYPE = IlluminaHumanMethylationEPICv2
ANNOTATION_VERSION = 20a1.hg38
TIFF_WIDTH = 2000
TIFF_HEIGHT = 1000
TIFF_RES = 150
SEX_COLUMN = Sex
SENTRIX_ID_COLUMN = Sentrix_ID
SENTRIX_POSITION_COLUMN = Sentrix_Position
BASENAME_COLUMN = Basename
TIME_VAR = Timepoint
PHENO_ORDER = $(SAMPLE_ID);$(TIME_VAR);$(SEX_COLUMN);PredSex;$(BASENAME_COLUMN);$(SENTRIX_ID_COLUMN);$(SENTRIX_POSITION_COLUMN)

# STEP 4-5 PARAMETERS
PHENOTYPES = TreatmentGroup
COVARIATES = Sex
FACTOR_VARS = Sex,TreatmentGroup
CPG_PREFIX = cg
CPG_LIMIT = NA
PRS_MAP = NULL
SUMMARY_PVAL = NA
N_CORES = 64
SAVE_TXT_SUMMARIES = TRUE
CHUNK_SIZE = 10000
FDR_THRESHOLD = 0.05

Model selection

MODEL ?= model1
MODELS = modelA modelB modelC
MODEL defines a single pipeline configuration
MODELS defines multiple independent configurations

The same pipeline can therefore be run in two modes:

Single-model execution
- make f4 MODEL=modelA
Parallel multi-model execution
- make f4_models

In parallel mode, each model runs as a fully independent pipeline, with its own inputs, logs, figures, and outputs. No files are shared between models except the raw inputs (IDATs), unless explicitly specified.

Per-model overrides

ifeq ($(MODEL), modelA)
  PHENO_FILE = data/preprocessingMinfiEwasWater/phenoA.csv
...
else
  PHENO_FILE = data/preprocessingMinfiEwasWater/pheno.csv
endif

This block dynamically replaces variables based on the selected model.

The Makefile is evaluated at runtime
When MODEL=modelA, PHENO_FILE is replaced before any step runs
The rest of the pipeline remains unchanged

This allows users to:

compare cohorts or subgroup definitions,
test alternative phenotype encodings,
run sensitivity analyses, without duplicating code.

Directory and step conventions

STEP1 = preprocessingMinfiEwasWater
STEP2 = svaEnmix
STEP3 = preprocessingPheno
STEP4 = methylationGLM_T1
STEP5 = methylationGLMM_T1T2

Each step writes outputs into step-specific directories. When running in parallel, directory paths are automatically namespaced by model, ensuring isolation between runs.

This design allows:

restarting individual steps,
debugging failed models independently,
mixing local and HPC execution safely.

Global parameters

PHENO_FILE = …
SAMPLE_ID = …
TIME_VAR = …
…

These parameters are shared across preprocessing and modelling.

Specific parameters

PHENOTYPES = TreatmentGroup
COVARIATES = Sex
FACTOR_VARS = Sex,TreatmentGroup
N_CORES = 64
CHUNK_SIZE = 10000
…

These parameters control how each model is fit. In parallel execution:

N_CORES controls intra-model parallelism (within R)
MODELS controls inter-model parallelism (within Make)

Argument normalisation

ifeq ($(PRS_MAP),NULL)
  PRS_MAP_ARG = NULL
else
  PRS_MAP_ARG = '$(PRS_MAP)'
endif
if ...

This section ensures that optional arguments are passed correctly to R scripts. It prevents accidental interpretation of the string "NULL" as a real value and guarantees consistent behaviour across local and Scluster environments.

Reporting

dnamReport() generates a PDF report by rendering DNAm.Rmd through a runner script (DNAm.R) shipped in inst/scripts.

Note: PDF report generation requires a working LaTeX installation.TinyTeX can be installed in R using: install.packages("tinytex") and tinytex::install_tinytex(). For troubleshooting LaTeX-related issues, see the TinyTeX documentation.

dnamReport(
  output = "DNAm_Report.pdf",
  outputDir = "reports",
  qcDir = "figures/preprocessingMinfiEwasWater/enMix",
  preprocessingDir = "figures/preprocessingMinfiEwasWater/qc",
  postprocessingDir = "figures/preprocessingMinfiEwasWater/metrics",
  svaDir = "figures/svaEnmix/sva",
  glmDir = "figures/methylationGLM_T1",
  glmmDir = "figures/methylationGLMM_T1T2",
  reportTitle = "DNA methylation",
  author = "School of Biomedical Sciences",
  date = format(Sys.Date(), "%B %d, %Y")
)

Troubleshooting

If a step “fails”, check logs in logs/
If Make targets are missing, confirm:
- your working directory contains a valid Makefile

Example Analysis

The following examples are brief demonstrations on how to perform DNA methylation analysis using dnaEPICO functions and arguments.

This example has the following dependencies:

library(dnaEPICO)

library(minfiData)
library(minfi)

Example 1: Local Test

The aim of this example is to show how to run the complete dnaEPICO pipeline using the example dataset provided by the minfiData package.

This example is designed to:

help new users understand the pipeline structure,
demonstrate how data and results are organised on disk,
and provide a safe, reproducible way to test the pipeline

To avoid creating permanent files on the user’s computer, all steps are run inside a temporary directory.

When analysing real data, users can simply replace the temporary directory with their own project folder.

Overview of example 1

In this example, the pipeline performs the following steps:

Stage raw input data (IDAT files and phenotype data)
Perform DNA methylation preprocessing and quality control
Estimate surrogate variables (technical variation)
Prepare phenotype-aligned methylation matrices
Generate figures and a summary PDF report

Step 1: Create a temporary working directory

We first create a temporary directory that will hold all input files and outputs for this example.

# Temporary directory 
tmpDir  <- file.path(tempdir(), "dnaEPICO_Test")
dir.create(tmpDir, showWarnings = FALSE, recursive = TRUE)

preprocDir <- file.path(tmpDir, "data", "preprocessingMinfiEwasWater")
dir.create(preprocDir, showWarnings = FALSE, recursive = TRUE)

# Register automatic cleanup when the vignette exits
on.exit(unlink(tmpDir, recursive = TRUE, force = TRUE), add = TRUE)

tmpDir
#> [1] "/tmp/Rtmpl8RIvi/dnaEPICO_Test"

Important:

This directory is created only for this example
It will be deleted automatically at the end
Users analysing real data should use a permanent directory instead

Step 2: Load example IDAT files from minfiData

The minfiData package contains a small set of Illumina 450K IDAT files from blood samples.

# Locate minfiData extdata directory
baseDir_minfi <- system.file("extdata", package = "minfiData")

# Find all IDAT files
idatFiles <- list.files(
  baseDir_minfi,
  pattern = "\\.idat$",
  recursive = TRUE,
  full.names = TRUE
)

length(idatFiles)
#> [1] 12

We now copy these IDAT files into the directory structure expected by SdnaEPICO.

# Create IDAT directory
idatDir <- file.path(preprocDir, "idats")
dir.create(idatDir, showWarnings = FALSE, recursive = TRUE)

# Copy IDATs into idatDir
file.copy(idatFiles, idatDir, overwrite = TRUE)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

length(list.files(idatDir, pattern = "\\.idat$"))
#> [1] 12

Step 3: Copy the cross-reactive probe annotation file

The pipeline requires a file listing cross-reactive probes. This file is distributed with dnaEPICO and is copied into the preprocessing folder. It is a reduced version of the original file, which can be accessed from the following link:

12864_2024_10027_MOESM8_ESM.csv

# Locate dnaEPICO extdata directory
baseDir_dnaEPICO <- system.file("extdata", package = "dnaEPICO")

# Source cross-reactive probe file
crossReactiveSrc <- file.path(baseDir_dnaEPICO, "12864_2024_10027_MOESM8_ESM.csv")

# Destination path inside tmpDir/data/preprocessingMinfiEwasWater/
crossReactivePath <- file.path(preprocDir, "12864_2024_10027_MOESM8_ESM.csv")

# Copy into the target folder
file.copy(crossReactiveSrc, crossReactivePath, overwrite = TRUE)
#> [1] TRUE
file.exists(crossReactivePath)
#> [1] TRUE

Step 4: Create a phenotype file

Next, we create a phenotype file using the sample sheet provided by minfiData.

# Read SampleSheet from minfiData
targets <- read.csv(
  file.path(baseDir_minfi, "SampleSheet.csv"),
  stringsAsFactors = FALSE,
  skip = 7
)

# Create Basename column 
targets$Basename <- paste0(
  targets$Sentrix_ID,
  "_",
  targets$Sentrix_Position
)

# Create Timepoint column
targets$Timepoint <- 1

# Write phenotype file
samplePath <- file.path(preprocDir, "sample_minfi.csv")
write.csv(targets, samplePath, row.names = FALSE)

file.exists(samplePath)
#> [1] TRUE

This phenotype file will be used throughout the pipeline.

Step 5: Preprocessing and quality control

We now run the first pipeline stage, which performs:

reading of IDAT files,
quality control,
normalisation,
probe filtering,
and generation of QC figures

set.seed(123)
preprocessingMinfiEwasWater(
  phenoFile = samplePath,
  idatFolder = file.path(preprocDir, "idats"),
  nSamples = NA,
  SampleID = "Sample_Name",
  phenoOrder = "Sample_Name;sex;Basename;Sentrix_ID;Sentrix_Position",
  arrayType = "IlluminaHumanMethylation450k",
  annotationVersion = "ilmn12.hg19",
  sexColumn = "sex",
  plotGroupVar = "sex",
  tiffWidth = 2000,
  tiffHeight = 1000,
  tiffRes = 150,
  qcCutoff = 10.5,
  detPtype = "m+u",
  detPThreshold = 0.05,
  normMethods = "adjustedfunnorm",
  pvalThreshold = 0.01,
  chrToRemove = "chrX,chrY",
  snpsToRemove = "SBE,CpG",
  mafThreshold = 0.1,
  outputLogs = file.path(tmpDir, "logs"),
  baseDataFolder = file.path(tmpDir, "rData"),
  figureBaseDir = file.path(tmpDir, "figures"),
  lcRef = "FlowSorted.Blood.450k",
  crossReactivePath = file.path(preprocDir, "12864_2024_10027_MOESM8_ESM.csv"),
  lcPhenoDir = preprocDir
)
#> ==== Starting preprocessingMinfiEwasWater ====
#> Start Time:                2026-04-02 05:15:57 
#> Log file path:             /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs/log_preprocessingMinfiEwasWater.txt 
#> 
#> Phenotype file:            /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv 
#> Separator type:   
#> IDAT folder:               /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/idats 
#> nSamples limit:            all 
#> SampleID column:           Sample_Name 
#> Array type:                IlluminaHumanMethylation450k 
#> Annotation version:        ilmn12.hg19 
#> Base RData folder:         /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData 
#> Base Figure folder:         /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures 
#> TIFF size (w x h @ dpi):   2000  x  1000  @  150 
#> QC cutoff (median):        10.5 
#> Detection P-value type:    m+u 
#> 
#> Detection p-value threshold: 0.05 
#> Normalization methods:     adjustedfunnorm 
#> Sex column:                sex 
#> Plot grouping variable:    sex 
#> 
#> Probe filtering:
#>   P-value threshold:       0.01 
#>   Chromosomes to remove:   chrX,chrY 
#>   SNP positions filter:    SBE,CpG 
#>   MAF threshold:           0.1 
#>   Cross-reactive file:     /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/12864_2024_10027_MOESM8_ESM.csv 
#> 
#> Cell composition (estimateLC):
#>   Reference:               FlowSorted.Blood.450k 
#>   Leading pheno order:     Sample_Name;sex;Basename;Sentrix_ID;Sentrix_Position 
#> =======================================================================
#> Using all 6 samples.
#> Phenotype file loaded with 6 samples and 13 columns.
#> Preview of targets:
#>   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID
#> 1    GroupA_3          H5           NA       GroupA      NA
#> 2    GroupA_2          D5           NA       GroupA      NA
#> 3    GroupB_3          C6           NA       GroupB      NA
#> 4    GroupB_1          F7           NA       GroupB      NA
#> 5    GroupA_1          G7           NA       GroupA      NA
#> 6    GroupB_2          H7           NA       GroupB      NA
#> =======================================================================
#> RGSet loaded with 6 samples.
#> =======================================================================
#> Plotting  STAINING .jpg
#> Plotting  EXTENSION .jpg
#> Plotting  HYBRIDIZATION .jpg
#> Plotting  TARGET_REMOVAL .jpg
#> Plotting  BISULFITE_CONVERSION_I .jpg
#> Plotting  BISULFITE_CONVERSION_II .jpg
#> Plotting  SPECIFICITY_I .jpg
#> Plotting  SPECIFICITY_II .jpg
#> Plotting  NON-POLYMORPHIC .jpg
#> Plotting  NEGATIVE .jpg
#> Plotting  NORM_A .jpg
#> Plotting  NORM_C .jpg
#> Plotting  NORM_G .jpg
#> Plotting  NORM_T .jpg
#> Plotting  NORM_ACGT .jpg
#> Generated ENmix control JPGs in: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/enMix 
#> =======================================================================
#> Applied annotation:  IlluminaHumanMethylation450k, ilmn12.hg19 
#> Manifest used:
#> IlluminaMethylationManifest object
#> Annotation
#>   array: IlluminaHumanMethylation450k
#> Number of type I probes: 135476 
#> Number of type II probes: 350036 
#> Number of control probes: 850 
#> Number of SNP type I probes: 25 
#> Number of SNP type II probes: 40 
#> =======================================================================
#> RGSet saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/objects/RGSet.RData 
#> =======================================================================
#> Running preprocessRaw() to generate MSet...
#> MSet created with 6 samples and 485512 probes.
#> =======================================================================
#> MSet saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/objects/MSet.RData 
#> =======================================================================
#> Preview of methylated intensities:
#>            GroupA_3 GroupA_2 GroupB_3
#> cg00050873    22041      588    20505
#> cg00212031      679      569      439
#> cg00213748     1620      421      707
#> cg00214611      449      614      343
#> cg00455876     5921      398     3257
#> cg01707559     1238      646      637
#> =======================================================================
#> Preview of unmethylated intensities:
#>            GroupA_3 GroupA_2 GroupB_3
#> cg00050873     1945      433     1012
#> cg00212031     6567      300     2689
#> cg00213748      384      461      295
#> cg00214611     4869      183     1655
#> cg00455876     1655      792     1060
#> cg01707559    12227     1009     7414
#> =======================================================================
#> Converting MSet to RatioSet and GSet...
#> RatioSet created.
#> class: RatioSet 
#> dim: 485512 6 
#> metadata(0):
#> assays(3): Beta M CN
#> rownames(485512): cg00050873 cg00212031 ... ch.22.47579720R
#>   ch.22.48274842R
#> rowData names(0):
#> colnames(6): GroupA_3 GroupA_2 ... GroupA_1 GroupB_2
#> colData names(14): Sample_Name Sample_Well ... Timepoint filenames
#> Annotation
#>   array: IlluminaHumanMethylation450k
#>   annotation: ilmn12.hg19
#> Preprocessing
#>   Method: Raw (no normalization or bg correction)
#>   minfi version: 1.52.1
#>   Manifest version: 0.4.0
#> =======================================================================
#> GSet created.
#> class: GenomicRatioSet 
#> dim: 485512 6 
#> metadata(0):
#> assays(3): Beta M CN
#> rownames(485512): cg13869341 cg14008030 ... cg08265308 cg14273923
#> rowData names(0):
#> colnames(6): GroupA_3 GroupA_2 ... GroupA_1 GroupB_2
#> colData names(14): Sample_Name Sample_Well ... Timepoint filenames
#> Annotation
#>   array: IlluminaHumanMethylation450k
#>   annotation: ilmn12.hg19
#> Preprocessing
#>   Method: Raw (no normalization or bg correction)
#>   minfi version: 1.52.1
#>   Manifest version: 0.4.0
#> =======================================================================
#> =======================================================================
#> Extracting methylation raw level metrics from GSet, these metrics are not saved...
#>     
#> Preview of beta values:
#>              GroupA_3   GroupA_2   GroupB_3   GroupB_1   GroupA_1
#> cg13869341 0.83863092 0.93969915 0.90540879 0.83618626 0.82767608
#> cg14008030 0.64813991 0.62519431 0.71063615 0.70437389 0.68204321
#> cg12045430 0.06774677 0.07835538 0.06291283 0.17674976 0.07216013
#> cg20826792 0.13240759 0.13743747 0.10998991 0.18505709 0.13602546
#> cg00381604 0.03526373 0.03036437 0.03364362 0.05298938 0.03786708
#> cg20253340 0.56370048 0.63295793 0.47331874 0.51451954 0.58907301
#> =======================================================================
#> Preview of M-values:
#>              GroupA_3   GroupA_2   GroupB_3    GroupB_1   GroupA_1
#> cg13869341  2.3776718  3.9619486  3.2587913  2.35176802  2.2639433
#> cg14008030  0.8813034  0.7381618  1.2962264  1.25256776  1.1010324
#> cg12045430 -3.7824979 -3.5561063 -3.8967571 -2.21962279 -3.6846021
#> cg20826792 -2.7120316 -2.6498537 -3.0164505 -2.13872859 -2.6671121
#> cg00381604 -4.7738777 -4.9969913 -4.8441506 -4.15960551 -4.6672202
#> cg20253340  0.3696099  0.7861642 -0.1541181  0.08381261  0.5195643
#> =======================================================================
#> Preview of copy number values:
#>            GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 15.45930 15.49151 15.66336 15.42105 15.59497
#> cg14008030 14.30499 15.08423 14.85866 14.75692 15.09280
#> cg12045430 14.13755 14.45436 14.17290 14.40561 14.38418
#> cg20826792 14.31995 14.39794 14.27467 14.36543 14.44553
#> cg00381604 14.02289 14.15782 13.87652 14.00799 14.14498
#> cg20253340 13.70217 14.06483 13.69360 13.88722 14.20526
#> =======================================================================
#> Running QC plotting from MSet object...
#> QC plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/quality_control(MSet).tiff 
#> =======================================================================
#> Calculating detection p-values...
#> Detection p-values calculated using type:  m+u 
#> Preview of detection p-values:
#>                 GroupA_3     GroupA_2     GroupB_3     GroupB_1    GroupA_1
#> cg00050873  0.000000e+00 0.4489762614 0.000000e+00 9.463873e-01 0.854614396
#> cg00212031  0.000000e+00 0.7236576115 3.963479e-36 8.595495e-01 0.632034035
#> cg00213748  1.474819e-24 0.7026344783 3.658137e-01 3.018175e-01 0.915179880
#> cg00214611 1.727636e-260 0.8252957388 7.893160e-10 5.017703e-01 0.781146377
#> cg00455876  0.000000e+00 0.1759300809 3.160650e-83 4.191290e-01 0.509724924
#> cg01707559  0.000000e+00 0.0008455845 0.000000e+00 4.362571e-11 0.001268238
#> Detection RData p-values saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/qc/detP_RGSet.RData
#> Detection plot p-values saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/detection_pvalues(RGSet).tiff 
#> =======================================================================
#> Calculate the mean detection p-values across all samples...
#> Number of failed samples: 0 
#> Samples before filtering: 6 
#> Samples after filtering: 6 
#> RGSet saved after removing the failed samples to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/objects/RGSet.RData 
#> =======================================================================
#> Generating density plot of Beta values...
#> Density plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/densityBeta(MSet).tiff 
#> =======================================================================
#> Predicting sex based on Beta values...
#> Predicted Sex plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/sexPrediction(GSet).tiff 
#> =======================================================================
#> Clinical sex values...
#> Sex column integrity check:
#>   Sex column used:  sex 
#>   NA values:        0 
#>   Unknown labels:   None
#> Clinical Sex plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/sexClinical(GSet).tiff 
#> =======================================================================
#> Mistmaches found[1] Sample_Name  Sample_Well  Sample_Plate
#> <0 rows> (or 0-length row.names)
#> =======================================================================
#> Running normalization methods using Minfi and WateRmelon:  adjustedfunnorm 
#> Applying normalization: adjustedfunnorm
#> No methods found in package 'RSQLite' for request: 'dbListFields' when loading 'lumi'
#> [adjustedFunnorm] Background and dye bias correction with noob
#> [adjustedFunnorm] Mapping to genome
#> [adjustedFunnorm] Quantile extraction
#> [adjustedFunnorm] Normalization
#> Saved normalized object:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/normObjects/norm_adjustedfunnorm_RGSet.RData
#> Plot Raw vs Normalisation data saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/sexComparison_RawNorm(MSetF).tiff 
#> =======================================================================
#> Filtering probes with detection p-values:  0.01 ...
#> Probes retained:  466668 / 485512 
#> Filtered object saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removProbes_MSetF_Flt.RData 
#> =======================================================================
#> Removing probes on chromosomes:  chrX, chrY 
#> Remaining probes after removing selected chromosomes:
#> 
#>  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 chr20 
#> 45338 23371 27833 23541 11955 14417 14661 21474 26554  5538 24983 33341 10158 
#> chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9 
#>  4136  8368 23930 19029 23181 35088 29275 20091  9575 
#> Sex chromosome-filtered object saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removChrXY_MSetF_Flt_Rxy.RData 
#> =======================================================================
#> Removing probes with SNPs at:  SBE, CpG  with MAF >= 0.1 
#> Remaining probes after SNP filtering:  451304 
#> SNP-filtered object saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removSNPs_MAF0.1_MSetF_Flt_Rxy_Ds.RData 
#> =======================================================================
#> Loading cross-reactive probe list from:
#>  /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/12864_2024_10027_MOESM8_ESM.csv 
#> Probes retained after cross-reactive filter:  451304 
#> Cross-reactive-filtered object saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removCrossReactive_MSetF_Flt_Rxy_Ds_Rc.RData 
#> =======================================================================
#> Extracting final DNAm matrices (M, Beta, CN)...
#> M-values saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData 
#>              GroupA_3   GroupA_2   GroupB_3   GroupB_1   GroupA_1
#> cg13869341  2.6393989  4.5098963  3.5921764  2.4425984  2.3233487
#> cg14008030  1.1068115  0.7804484  1.3156115  1.0308431  0.9513732
#> cg12045430 -4.9216557 -4.4422917 -4.7704070 -2.7293860 -4.8731142
#> cg20826792 -3.2152921 -3.0354748 -3.3089689 -2.6377700 -3.1260062
#> cg00381604 -5.7736577 -5.9343344 -5.7260241 -5.5323099 -5.9329137
#> cg20253340  0.4886808  0.9086010 -0.2618946 -0.1448504  0.4240629
#> Beta-values saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData 
#>              GroupA_3   GroupA_2   GroupB_3   GroupB_1   GroupA_1
#> cg13869341 0.86170119 0.95795345 0.92343122 0.84462880 0.83347005
#> cg14008030 0.68291127 0.63203702 0.71339133 0.67140047 0.65913495
#> cg12045430 0.03194010 0.04397503 0.03534566 0.13103183 0.03299699
#> cg20826792 0.09720524 0.10870567 0.09165419 0.13843342 0.10277336
#> cg00381604 0.01795101 0.01608951 0.01854242 0.02115069 0.01610511
#> cg20253340 0.58388143 0.65244281 0.45474134 0.47492040 0.57295995
#> CN-values saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData 
#>            GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 15.30437 15.49182 15.82079 15.60943 15.75944
#> cg14008030 14.18634 15.04115 14.84927 14.73982 15.09002
#> cg12045430 13.91558 14.39012 14.25258 14.51851 14.47114
#> cg20826792 14.10642 14.32523 14.35196 14.47400 14.52957
#> cg00381604 13.82183 14.11709 13.96986 14.12212 14.25033
#> cg20253340 13.52945 13.94953 13.64490 13.83988 14.15564
#> =======================================================================
#> Plot examineMDS_PostFilteringCrossRect saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/metrics/examineMDS_PostFilteringCrossRect(MSetF_Flt_Rxy_Ds_Rc).tiff 
#> =======================================================================
#> Plotting final density plots for grouping variable:  sex
#> Density plots saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/metrics/densityBeta&M(MSetF_Flt_Rxy_Ds_Rc).tiff 
#> =======================================================================
#> Cell composition reference selected: FlowSorted.Blood.450k 
#> Using ENmix Houseman-based cell composition
#> Saved phenoLC: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/phenoLC.csv 
#> =======================================================================
#> Session info:
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] FlowSorted.Blood.450k_1.44.0                      
#>  [2] minfiData_0.52.0                                  
#>  [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [4] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [5] minfi_1.52.1                                      
#>  [6] bumphunter_1.48.0                                 
#>  [7] locfit_1.5-9.12                                   
#>  [8] iterators_1.0.14                                  
#>  [9] foreach_1.5.2                                     
#> [10] Biostrings_2.74.1                                 
#> [11] XVector_0.46.0                                    
#> [12] SummarizedExperiment_1.36.0                       
#> [13] Biobase_2.66.0                                    
#> [14] MatrixGenerics_1.18.1                             
#> [15] matrixStats_1.5.0                                 
#> [16] GenomicRanges_1.58.0                              
#> [17] GenomeInfoDb_1.42.3                               
#> [18] IRanges_2.40.1                                    
#> [19] S4Vectors_0.44.0                                  
#> [20] BiocGenerics_0.52.0                               
#> [21] dnaEPICO_0.99.10                                  
#> [22] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             BiocIO_1.16.0            
#>   [3] bitops_1.0-9              filelock_1.0.3           
#>   [5] tibble_3.3.1              methylumi_2.52.0         
#>   [7] preprocessCore_1.68.0     XML_3.99-0.23            
#>   [9] lifecycle_1.0.5           doParallel_1.0.17        
#>  [11] lattice_0.22-9            MASS_7.3-65              
#>  [13] base64_2.0.2              scrime_1.3.7             
#>  [15] magrittr_2.0.4            limma_3.62.2             
#>  [17] sass_0.4.10               rmarkdown_2.31           
#>  [19] jquerylib_0.1.4           yaml_2.3.12              
#>  [21] otel_0.2.0                doRNG_1.8.6.3            
#>  [23] askpass_1.2.1             DBI_1.3.0                
#>  [25] RColorBrewer_1.1-3        lumi_2.58.0              
#>  [27] abind_1.4-8               zlibbioc_1.52.0          
#>  [29] quadprog_1.5-8            purrr_1.2.1              
#>  [31] RCurl_1.98-1.18           rappdirs_0.3.4           
#>  [33] GenomeInfoDbData_1.2.13   irlba_2.3.7              
#>  [35] rentrez_1.2.4             genefilter_1.88.0        
#>  [37] annotate_1.84.0           pkgdown_2.2.0            
#>  [39] DelayedMatrixStats_1.28.1 codetools_0.2-20         
#>  [41] DelayedArray_0.32.0       xml2_1.5.2               
#>  [43] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [45] beanplot_1.3.1            BiocFileCache_2.14.0     
#>  [47] dynamicTreeCut_1.63-1     illuminaio_0.48.0        
#>  [49] GenomicAlignments_1.42.0  jsonlite_2.0.0           
#>  [51] multtest_2.62.0           survival_3.8-6           
#>  [53] systemfonts_1.3.2         tools_4.4.2              
#>  [55] ragg_1.5.2                Rcpp_1.1.1               
#>  [57] glue_1.8.0                SparseArray_1.6.2        
#>  [59] mgcv_1.9-4                xfun_0.57                
#>  [61] dplyr_1.2.0               HDF5Array_1.34.0         
#>  [63] withr_3.0.2               BiocManager_1.30.27      
#>  [65] fastmap_1.2.0             rhdf5filters_1.18.1      
#>  [67] openssl_2.3.5             caTools_1.18.3           
#>  [69] digest_0.6.39             R6_2.6.1                 
#>  [71] colorspace_2.1-2          textshaping_1.0.5        
#>  [73] RPMM_1.25                 gtools_3.9.5             
#>  [75] RSQLite_2.4.6             tidyr_1.3.2              
#>  [77] generics_0.1.4            wateRmelon_2.12.0        
#>  [79] data.table_1.18.2.1       rtracklayer_1.66.0       
#>  [81] httr_1.4.8                htmlwidgets_1.6.4        
#>  [83] S4Arrays_1.6.0            pkgconfig_2.0.3          
#>  [85] blob_1.3.0                siggenes_1.80.0          
#>  [87] impute_1.80.0             htmltools_0.5.9          
#>  [89] bookdown_0.46             geneplotter_1.84.0       
#>  [91] ROC_1.82.0                png_0.1-9                
#>  [93] knitr_1.51                tzdb_0.5.0               
#>  [95] rjson_0.2.23              nlme_3.1-169             
#>  [97] curl_7.0.0                cachem_1.1.0             
#>  [99] rhdf5_2.50.2              BiocVersion_3.20.0       
#> [101] KernSmooth_2.23-26        AnnotationDbi_1.68.0     
#> [103] restfulr_0.0.16           desc_1.4.3               
#> [105] GEOquery_2.74.0           pillar_1.11.1            
#> [107] grid_4.4.2                reshape_0.8.10           
#> [109] vctrs_0.7.2               gplots_3.3.0             
#> [111] ENmix_1.42.2              dbplyr_2.5.2             
#> [113] xtable_1.8-8              cluster_2.1.8.2          
#> [115] evaluate_1.0.5            readr_2.2.0              
#> [117] GenomicFeatures_1.58.0    cli_3.6.5                
#> [119] compiler_4.4.2            Rsamtools_2.22.0         
#> [121] rlang_1.1.7               crayon_1.5.3             
#> [123] rngtools_1.5.2            nor1mix_1.3-3            
#> [125] mclust_6.1.2              affy_1.84.0              
#> [127] plyr_1.8.9                fs_2.0.1                 
#> [129] BiocParallel_1.40.2       nleqslv_3.3.6            
#> [131] Matrix_1.7-5              ExperimentHub_2.14.0     
#> [133] hms_1.1.4                 sparseMatrixStats_1.18.0 
#> [135] bit64_4.6.0-1             Rhdf5lib_1.28.0          
#> [137] KEGGREST_1.46.0           statmod_1.5.1            
#> [139] AnnotationHub_3.14.0      memoise_2.0.1            
#> [141] affyio_1.76.0             bslib_0.10.0             
#> [143] bit_4.6.0

Step 6: Surrogate variable analysis (SVA)

Next, we estimate surrogate variables to capture technical variation such as chip and position effects.

rgsetPath <- file.path(
  tmpDir,
  "rData",
  "preprocessingMinfiEwasWater",
  "objects",
  "RGSet.RData"
)

file.exists(rgsetPath)
#> [1] TRUE

svaEnmix(
  phenoFile = samplePath,
  rgsetData = rgsetPath,
  sepType = "",
  outputLogs = file.path(tmpDir, "logs"),
  nSamples = NA,
  SampleID = "Sample_Name",
  arrayType = "IlluminaHumanMethylation450k",
  annotationVersion = "ilmn12.hg19",
  SentrixIDColumn = "Sentrix_ID",
  SentrixPositionColumn = "Sentrix_Position",
  ctrlSvaPercVar = 0.90,
  ctrlSvaFlag = 1,
  scriptLabel = "svaEnmix",
  tiffWidth = 2000,
  tiffHeight = 1000,
  tiffRes = 150,
  dataBaseDir = file.path(tmpDir, "data") ,
  figureBaseDir = file.path(tmpDir, "figures"),
  rBaseDir = file.path(tmpDir, "rData")
  
)
#> ==== Starting SVA Estimation with Enmix ====
#> Start time:  2026-04-02 05:17:41 
#> 
#> Log file path:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs/log_svaEnmix.txt 
#> 
#> Pheno file:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv 
#> Separator type:   
#> Log directory:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs 
#> Sample limit:  All 
#> SampleID column:  Sample_Name 
#> Sentrix ID column:  Sentrix_ID 
#> Sentrix Position column:  Sentrix_Position 
#> Script label:  svaEnmix 
#> ctrlSva percvar:  0.9 
#> ctrlSva flag:  1 
#> TIFF dimensions (WxH):  2000 x 1000  at 150 dpi
#> =======================================================================
#> Using all 6 samples.
#> Phenotype file loaded with 6 samples and 13 columns.
#> Preview of targets:
#>   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID Sentrix_ID
#> 1    GroupA_3          H5           NA       GroupA      NA 5723646052
#> 2    GroupA_2          D5           NA       GroupA      NA 5723646052
#> 3    GroupB_3          C6           NA       GroupB      NA 5723646052
#> 4    GroupB_1          F7           NA       GroupB      NA 5723646053
#> 5    GroupA_1          G7           NA       GroupA      NA 5723646053
#> 6    GroupB_2          H7           NA       GroupB      NA 5723646053
#> =======================================================================
#> RGSet loaded with 6 samples.
#> =======================================================================
#> 3  surrogate variables explain  91.17398 % of 
#>     data variation
#> Surrogate variables matrix (first few rows):
#>                 PC1       PC2         PC3
#> GroupA_3 -25330.022 -3189.849   3715.1929
#> GroupA_2   6350.671 -6443.776  -1051.9582
#> GroupB_3   8827.110 -4444.123  15604.2358
#> GroupB_1  -3840.017 -1215.872 -10610.1476
#> GroupA_1  10355.156 -9071.276  -7419.7718
#> GroupB_2   3637.102 24364.897   -237.5511
#> SVA Matrix RData saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/svaEnmix/svaMatrix.RData 
#> SVA Matrix CSV saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/svaEnmix/svaMatrix.csv 
#> Saved phenoLC + SVA: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv
#> SVA Sentrix plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/svaEnmix/sva_SentrixID.tiff
#> SVA Position plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/svaEnmix/sva_SentrixPosition.tiff 
#> =======================================================================
#> Number of surrogate variables (K): 3 
#> SentrixID class: factor 
#> SentrixID unique levels: 2 
#> sentrixID
#> 5723646052 5723646053 
#>          3          3 
#> SentrixPosition class: factor 
#> SentrixPosition unique levels: 5 
#> sentrixPos
#> R02C02 R04C01 R04C02 R05C02 R06C02 
#>      1      1      1      2      1 
#> First row of SVA matrix:
#>        PC1        PC2        PC3 
#> -25330.022  -3189.849   3715.193 
#> Sample names in SVA matrix: GroupA_3, GroupA_2, GroupB_3, GroupB_1, GroupA_1 
#> Sample names in pData(RGSet): GroupA_3, GroupA_2, GroupB_3, GroupB_1, GroupA_1
#> Warning: attempting model selection on an essentially perfect fit is nonsense
#> Warning in max(dttmp$`Pr(F)`, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning: attempting model selection on an essentially perfect fit is nonsense
#> Warning in max(dttmp$`Pr(F)`, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning: attempting model selection on an essentially perfect fit is nonsense
#> Warning in max(dttmp$`Pr(F)`, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in anova.lm(lmsvaFull[[i]]): ANOVA F-tests on an essentially perfect
#> fit are unreliable
#> Warning in anova.lm(lmsvaRed[[i]]): ANOVA F-tests on an essentially perfect fit
#> are unreliable
#> Warning in anova.lm(lmsvaFull[[i]]): ANOVA F-tests on an essentially perfect
#> fit are unreliable
#> Warning in anova.lm(lmsvaRed[[i]]): ANOVA F-tests on an essentially perfect fit
#> are unreliable
#> Warning in anova.lm(lmsvaFull[[i]]): ANOVA F-tests on an essentially perfect
#> fit are unreliable
#> Warning in anova.lm(lmsvaRed[[i]]): ANOVA F-tests on an essentially perfect fit
#> are unreliable
#> =======================================================================
#> SVA Sentrix/Position plot saved to:  /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/svaEnmix/sva_SentrixIDPosition.tiff 
#> =======================================================================
#> Session info:
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] FlowSorted.Blood.450k_1.44.0                      
#>  [2] minfiData_0.52.0                                  
#>  [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [4] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [5] minfi_1.52.1                                      
#>  [6] bumphunter_1.48.0                                 
#>  [7] locfit_1.5-9.12                                   
#>  [8] iterators_1.0.14                                  
#>  [9] foreach_1.5.2                                     
#> [10] Biostrings_2.74.1                                 
#> [11] XVector_0.46.0                                    
#> [12] SummarizedExperiment_1.36.0                       
#> [13] Biobase_2.66.0                                    
#> [14] MatrixGenerics_1.18.1                             
#> [15] matrixStats_1.5.0                                 
#> [16] GenomicRanges_1.58.0                              
#> [17] GenomeInfoDb_1.42.3                               
#> [18] IRanges_2.40.1                                    
#> [19] S4Vectors_0.44.0                                  
#> [20] BiocGenerics_0.52.0                               
#> [21] dnaEPICO_0.99.10                                  
#> [22] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             BiocIO_1.16.0            
#>   [3] bitops_1.0-9              filelock_1.0.3           
#>   [5] tibble_3.3.1              methylumi_2.52.0         
#>   [7] preprocessCore_1.68.0     XML_3.99-0.23            
#>   [9] lifecycle_1.0.5           doParallel_1.0.17        
#>  [11] lattice_0.22-9            MASS_7.3-65              
#>  [13] base64_2.0.2              scrime_1.3.7             
#>  [15] magrittr_2.0.4            limma_3.62.2             
#>  [17] sass_0.4.10               rmarkdown_2.31           
#>  [19] jquerylib_0.1.4           yaml_2.3.12              
#>  [21] otel_0.2.0                doRNG_1.8.6.3            
#>  [23] askpass_1.2.1             DBI_1.3.0                
#>  [25] RColorBrewer_1.1-3        lumi_2.58.0              
#>  [27] abind_1.4-8               zlibbioc_1.52.0          
#>  [29] quadprog_1.5-8            purrr_1.2.1              
#>  [31] RCurl_1.98-1.18           rappdirs_0.3.4           
#>  [33] GenomeInfoDbData_1.2.13   irlba_2.3.7              
#>  [35] rentrez_1.2.4             genefilter_1.88.0        
#>  [37] annotate_1.84.0           pkgdown_2.2.0            
#>  [39] DelayedMatrixStats_1.28.1 codetools_0.2-20         
#>  [41] DelayedArray_0.32.0       xml2_1.5.2               
#>  [43] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [45] beanplot_1.3.1            BiocFileCache_2.14.0     
#>  [47] dynamicTreeCut_1.63-1     illuminaio_0.48.0        
#>  [49] GenomicAlignments_1.42.0  jsonlite_2.0.0           
#>  [51] multtest_2.62.0           survival_3.8-6           
#>  [53] systemfonts_1.3.2         tools_4.4.2              
#>  [55] ragg_1.5.2                Rcpp_1.1.1               
#>  [57] glue_1.8.0                SparseArray_1.6.2        
#>  [59] mgcv_1.9-4                xfun_0.57                
#>  [61] dplyr_1.2.0               HDF5Array_1.34.0         
#>  [63] withr_3.0.2               BiocManager_1.30.27      
#>  [65] fastmap_1.2.0             rhdf5filters_1.18.1      
#>  [67] openssl_2.3.5             caTools_1.18.3           
#>  [69] digest_0.6.39             R6_2.6.1                 
#>  [71] colorspace_2.1-2          textshaping_1.0.5        
#>  [73] RPMM_1.25                 gtools_3.9.5             
#>  [75] RSQLite_2.4.6             tidyr_1.3.2              
#>  [77] generics_0.1.4            wateRmelon_2.12.0        
#>  [79] data.table_1.18.2.1       rtracklayer_1.66.0       
#>  [81] httr_1.4.8                htmlwidgets_1.6.4        
#>  [83] S4Arrays_1.6.0            pkgconfig_2.0.3          
#>  [85] blob_1.3.0                siggenes_1.80.0          
#>  [87] impute_1.80.0             htmltools_0.5.9          
#>  [89] bookdown_0.46             geneplotter_1.84.0       
#>  [91] ROC_1.82.0                png_0.1-9                
#>  [93] knitr_1.51                tzdb_0.5.0               
#>  [95] rjson_0.2.23              nlme_3.1-169             
#>  [97] curl_7.0.0                cachem_1.1.0             
#>  [99] rhdf5_2.50.2              BiocVersion_3.20.0       
#> [101] KernSmooth_2.23-26        AnnotationDbi_1.68.0     
#> [103] restfulr_0.0.16           desc_1.4.3               
#> [105] GEOquery_2.74.0           pillar_1.11.1            
#> [107] grid_4.4.2                reshape_0.8.10           
#> [109] vctrs_0.7.2               gplots_3.3.0             
#> [111] ENmix_1.42.2              dbplyr_2.5.2             
#> [113] xtable_1.8-8              cluster_2.1.8.2          
#> [115] evaluate_1.0.5            readr_2.2.0              
#> [117] GenomicFeatures_1.58.0    cli_3.6.5                
#> [119] compiler_4.4.2            Rsamtools_2.22.0         
#> [121] rlang_1.1.7               crayon_1.5.3             
#> [123] rngtools_1.5.2            nor1mix_1.3-3            
#> [125] mclust_6.1.2              affy_1.84.0              
#> [127] plyr_1.8.9                fs_2.0.1                 
#> [129] BiocParallel_1.40.2       nleqslv_3.3.6            
#> [131] Matrix_1.7-5              ExperimentHub_2.14.0     
#> [133] hms_1.1.4                 sparseMatrixStats_1.18.0 
#> [135] bit64_4.6.0-1             Rhdf5lib_1.28.0          
#> [137] KEGGREST_1.46.0           statmod_1.5.1            
#> [139] AnnotationHub_3.14.0      memoise_2.0.1            
#> [141] affyio_1.76.0             bslib_0.10.0             
#> [143] bit_4.6.0

Step 7: Phenotype preprocessing

This step aligns phenotype data with the processed methylation matrices.

# Metric paths produced by preprocessingMinfiEwasWater
betaPath <- file.path(
  tmpDir,
  "rData",
  "preprocessingMinfiEwasWater",
  "metrics",
  "beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
)

mPath <- file.path(
  tmpDir,
  "rData",
  "preprocessingMinfiEwasWater",
  "metrics",
  "m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
)

cnPath <- file.path(
  tmpDir,
  "rData",
  "preprocessingMinfiEwasWater",
  "metrics",
  "cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
)

# Quick sanity check
file.exists(betaPath)
#> [1] TRUE
file.exists(mPath)
#> [1] TRUE
file.exists(cnPath)
#> [1] TRUE

preprocessingPheno(
  phenoFile = samplePath,
  sepType = "",
  betaPath = betaPath,
  mPath = mPath,
  cnPath = cnPath,
  SampleID = "Sample_Name",
  timeVar = "Timepoint",
  timepoints = "1",
  combineTimepoints = "1",
  sexColumn = "sex",
  outputPheno = file.path(tmpDir, "data", "preprocessingPheno"),
  outputRData = file.path(tmpDir, "rData", "preprocessingPheno", "metrics"),
  outputRDataMerge = file.path(tmpDir, "rData", "preprocessingPheno", "mergeData"),
  outputLogs = file.path(tmpDir, "logs"),
  outputDir = file.path(tmpDir, "data", "preprocessingPheno")
)
#> ==== Starting Phenotype Preprocessing ====
#> Start Time:                2026-04-02 05:17:42 
#> Log file path:             /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs/log_preprocessingPheno.txt 
#> 
#> Phenotype file:            /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv 
#> Beta path:                 /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData 
#> M-values path:             /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData 
#> CN path:                   /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData 
#> 
#> Identifier column:         Sample_Name 
#> Timepoint column:         Timepoint 
#> Timepoints (if present):   1 
#> Combine timepoints:        1 
#> 
#> Sex column:                sex 
#> Output phenotype dir:      /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno 
#> RData metrics dir:         /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingPheno/metrics 
#> RData merge dir:           /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingPheno/mergeData 
#> 
#> =======================================================================
#> Beta dimensions:  451304 6 
#> M dimensions:  451304 6 
#> CN dimensions:  451304 6 
#> Phenotype file loaded with 6 samples and 16 columns.
#> Preview of phenoLC:
#>   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID
#> 1    GroupA_1          G7           NA       GroupA      NA
#> 2    GroupA_2          D5           NA       GroupA      NA
#> 3    GroupA_3          H5           NA       GroupA      NA
#> 4    GroupB_1          F7           NA       GroupB      NA
#> 5    GroupB_2          H7           NA       GroupB      NA
#> 6    GroupB_3          C6           NA       GroupB      NA
#> =======================================================================
#> Subsetting to timepoints: 1 
#> Available values in Timepoint column:
#> 
#> 1 
#> 6 
#> Combining timepoints: 1 
#> Saved combined phenotype file for T1T2 at: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno 
#> =======================================================================
#> Processing merge for timepoint: 1 
#>   - pheno rows: 6 
#>   - beta cols: 6 
#> Saved merged object for T1
#> Combined data saved for timepoints: 1 
#> Merged data saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingPheno/mergeData 
#> =======================================================================
#>      ProbeID   GroupA_3   GroupA_2   GroupB_3   GroupB_1
#> 1 cg13869341 0.86170119 0.95795345 0.92343122 0.84462880
#> 2 cg14008030 0.68291127 0.63203702 0.71339133 0.67140047
#> 3 cg12045430 0.03194010 0.04397503 0.03534566 0.13103183
#> 4 cg20826792 0.09720524 0.10870567 0.09165419 0.13843342
#> 5 cg00381604 0.01795101 0.01608951 0.01854242 0.02115069
#> Beta CSV file for ClockFundation saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno/beta.csv 
#> Beta ZIP file for ClockFundation saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno/beta.zip 
#> Re-encoding Sex: 0 = Female, 1 = Male
#> Sample file for ClockFundation saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno/phenoCF.csv 
#> =======================================================================
#> 
#> Session Info:
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] FlowSorted.Blood.450k_1.44.0                      
#>  [2] minfiData_0.52.0                                  
#>  [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [4] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [5] minfi_1.52.1                                      
#>  [6] bumphunter_1.48.0                                 
#>  [7] locfit_1.5-9.12                                   
#>  [8] iterators_1.0.14                                  
#>  [9] foreach_1.5.2                                     
#> [10] Biostrings_2.74.1                                 
#> [11] XVector_0.46.0                                    
#> [12] SummarizedExperiment_1.36.0                       
#> [13] Biobase_2.66.0                                    
#> [14] MatrixGenerics_1.18.1                             
#> [15] matrixStats_1.5.0                                 
#> [16] GenomicRanges_1.58.0                              
#> [17] GenomeInfoDb_1.42.3                               
#> [18] IRanges_2.40.1                                    
#> [19] S4Vectors_0.44.0                                  
#> [20] BiocGenerics_0.52.0                               
#> [21] dnaEPICO_0.99.10                                  
#> [22] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             BiocIO_1.16.0            
#>   [3] bitops_1.0-9              filelock_1.0.3           
#>   [5] tibble_3.3.1              methylumi_2.52.0         
#>   [7] preprocessCore_1.68.0     XML_3.99-0.23            
#>   [9] lifecycle_1.0.5           doParallel_1.0.17        
#>  [11] lattice_0.22-9            MASS_7.3-65              
#>  [13] base64_2.0.2              scrime_1.3.7             
#>  [15] magrittr_2.0.4            limma_3.62.2             
#>  [17] sass_0.4.10               rmarkdown_2.31           
#>  [19] jquerylib_0.1.4           yaml_2.3.12              
#>  [21] otel_0.2.0                doRNG_1.8.6.3            
#>  [23] askpass_1.2.1             DBI_1.3.0                
#>  [25] RColorBrewer_1.1-3        lumi_2.58.0              
#>  [27] abind_1.4-8               zlibbioc_1.52.0          
#>  [29] quadprog_1.5-8            purrr_1.2.1              
#>  [31] RCurl_1.98-1.18           rappdirs_0.3.4           
#>  [33] GenomeInfoDbData_1.2.13   irlba_2.3.7              
#>  [35] rentrez_1.2.4             genefilter_1.88.0        
#>  [37] annotate_1.84.0           pkgdown_2.2.0            
#>  [39] DelayedMatrixStats_1.28.1 codetools_0.2-20         
#>  [41] DelayedArray_0.32.0       xml2_1.5.2               
#>  [43] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [45] beanplot_1.3.1            BiocFileCache_2.14.0     
#>  [47] dynamicTreeCut_1.63-1     illuminaio_0.48.0        
#>  [49] GenomicAlignments_1.42.0  jsonlite_2.0.0           
#>  [51] multtest_2.62.0           survival_3.8-6           
#>  [53] systemfonts_1.3.2         tools_4.4.2              
#>  [55] ragg_1.5.2                Rcpp_1.1.1               
#>  [57] glue_1.8.0                SparseArray_1.6.2        
#>  [59] mgcv_1.9-4                xfun_0.57                
#>  [61] dplyr_1.2.0               HDF5Array_1.34.0         
#>  [63] withr_3.0.2               BiocManager_1.30.27      
#>  [65] fastmap_1.2.0             rhdf5filters_1.18.1      
#>  [67] openssl_2.3.5             caTools_1.18.3           
#>  [69] digest_0.6.39             R6_2.6.1                 
#>  [71] colorspace_2.1-2          textshaping_1.0.5        
#>  [73] RPMM_1.25                 gtools_3.9.5             
#>  [75] RSQLite_2.4.6             tidyr_1.3.2              
#>  [77] generics_0.1.4            wateRmelon_2.12.0        
#>  [79] data.table_1.18.2.1       rtracklayer_1.66.0       
#>  [81] httr_1.4.8                htmlwidgets_1.6.4        
#>  [83] S4Arrays_1.6.0            pkgconfig_2.0.3          
#>  [85] blob_1.3.0                siggenes_1.80.0          
#>  [87] impute_1.80.0             htmltools_0.5.9          
#>  [89] bookdown_0.46             geneplotter_1.84.0       
#>  [91] ROC_1.82.0                png_0.1-9                
#>  [93] knitr_1.51                tzdb_0.5.0               
#>  [95] rjson_0.2.23              nlme_3.1-169             
#>  [97] curl_7.0.0                cachem_1.1.0             
#>  [99] rhdf5_2.50.2              BiocVersion_3.20.0       
#> [101] KernSmooth_2.23-26        AnnotationDbi_1.68.0     
#> [103] restfulr_0.0.16           desc_1.4.3               
#> [105] GEOquery_2.74.0           pillar_1.11.1            
#> [107] grid_4.4.2                reshape_0.8.10           
#> [109] vctrs_0.7.2               gplots_3.3.0             
#> [111] ENmix_1.42.2              dbplyr_2.5.2             
#> [113] xtable_1.8-8              cluster_2.1.8.2          
#> [115] evaluate_1.0.5            readr_2.2.0              
#> [117] GenomicFeatures_1.58.0    cli_3.6.5                
#> [119] compiler_4.4.2            Rsamtools_2.22.0         
#> [121] rlang_1.1.7               crayon_1.5.3             
#> [123] rngtools_1.5.2            nor1mix_1.3-3            
#> [125] mclust_6.1.2              affy_1.84.0              
#> [127] plyr_1.8.9                fs_2.0.1                 
#> [129] BiocParallel_1.40.2       nleqslv_3.3.6            
#> [131] Matrix_1.7-5              ExperimentHub_2.14.0     
#> [133] hms_1.1.4                 sparseMatrixStats_1.18.0 
#> [135] bit64_4.6.0-1             Rhdf5lib_1.28.0          
#> [137] KEGGREST_1.46.0           statmod_1.5.1            
#> [139] AnnotationHub_3.14.0      memoise_2.0.1            
#> [141] affyio_1.76.0             bslib_0.10.0             
#> [143] bit_4.6.0

Step 8: Generate a PDF report

Finally, we generate a summary report that compiles figures and results from all previous steps.

dnamReport(
  output = "DNAm_Report.pdf",
  outputDir = file.path(tmpDir, "reports"),

  qcDir = file.path(
    tmpDir, "figures", "preprocessingMinfiEwasWater", "enMix"
  ),
  preprocessingDir = file.path(
    tmpDir, "figures", "preprocessingMinfiEwasWater", "qc"
  ),
  postprocessingDir = file.path(
    tmpDir, "figures", "preprocessingMinfiEwasWater", "metrics"
  ),
  svaDir = file.path(
    tmpDir, "figures", "svaEnmix"
  ),
  glmDir = file.path(
    tmpDir, "figures", "methylationGLM_T1"
  ),
  glmmDir = file.path(
    tmpDir, "figures", "methylationGLMM_T1T2"
  ),
  figDir = file.path(
          tmpDir, "reports", "figures"
        ),
  reportTitle = "DNA methylation",
  author = "School of Biomedical Sciences",
  date = format(Sys.Date(), "%B %d, %Y")
)
#> processing file: DNAm.Rmd
#> output file: DNAm.knit.md
#> /usr/bin/pandoc +RTS -K512m -RTS DNAm.knit.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmpl8RIvi/dnaEPICO_Test/reports/DNAm_Report.tex --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --table-of-contents --toc-depth 2 --number-sections --highlight-style tango --pdf-engine pdflatex --variable graphics --extract-media /tmp/Rtmpl8RIvi/dnaEPICO_Test/reports/DNAm_Report_files --include-in-header /tmp/Rtmpl8RIvi/rmarkdown-straaec555e518c.html
#> 
#> Output created: /tmp/Rtmpl8RIvi/dnaEPICO_Test/reports/DNAm_Report.pdf

Example 2: Cluster Test

For large studies or cluster environments, the same workflow can be executed via a Makefile.

# Extract Makefile
extractMake(destDir = tmpDir)

oldWd <- getwd()
setwd(tmpDir)
on.exit(setwd(oldWd), add = TRUE)

system(paste(
  "make f3",
  "MODEL=model1",
  "PHENO_FILE=data/preprocessingMinfiEwasWater/sample_minfi.csv",
  "ARRAY_TYPE=IlluminaHumanMethylation450k",
  "ANNOTATION_VERSION=ilmn12.hg19",
  "SEX_COLUMN=sex",
  "PLOT_GROUP_VAR=sex",
  "LC_REF=FlowSorted.Blood.450k",
  "TIMEPOINTS=1",
  "COMBINE_TIMEPOINTS=1",
  "R_DIR=NULL",
  "N_SAMPLES=NA",        
  "N_CORES=1"         
))

Step 1: Selecting a single test model

MODEL ?= model1

For this example, the pipeline is run using one explicit model (model1) rather than multiple parallel models. This mirrors the single-dataset workflow demonstrated earlier in R.

Step 2: Using the example phenotype file

PHENO_FILE = (DATA_DIR)/(STEP1)/sample_minfi.csv

The phenotype file generated from minfiData (sample_minfi.csv) is used for consistency between:

interactive R execution
Makefile-driven execution

Step 3: Removing custom R library paths

R_DIR = NULL

Setting R_DIR = NULL tells the pipeline to use the default R library paths, which is appropriate for local testing. You need to use a directory if you are running this in HPC environments.

Step 4: Matching columns

GLOBAL PARAMETERS:

SAMPLE_ID = Sample_Name
ARRAY_TYPE = IlluminaHumanMethylation450k
ANNOTATION_VERSION = ilmn12.hg19
SEX_COLUMN = sex
SENTRIX_ID_COLUMN = Sentrix_ID
SENTRIX_POSITION_COLUMN = Sentrix_Position

These parameters ensure the Makefile uses the:

correct sample identifier,
correct array platform and annotation,
and correct sex column and chip metadata

Step 5: Plot grouping and cell composition reference

PLOT_GROUP_VAR = sex
LC_REF = FlowSorted.Blood.450k

Plots are grouped by sex, matching earlier QC figures. FlowSorted.Blood.450k is selected as the reference panel because minfiData represents blood samples.

Step 6: Timepoint

TIMEPOINTS = 1
COMBINE_TIMEPOINTS = 1

The example dataset contains a single timepoint, so longitudinal logic is disabled.

Basics

Date the vignette was generated.

#> [1] "2026-04-02 05:18:05 UTC"

Wallclock time spent generating the vignette.

#> Time difference of 2.299 mins

R session information.

#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] FlowSorted.Blood.450k_1.44.0                      
#>  [2] minfiData_0.52.0                                  
#>  [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [4] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [5] minfi_1.52.1                                      
#>  [6] bumphunter_1.48.0                                 
#>  [7] locfit_1.5-9.12                                   
#>  [8] iterators_1.0.14                                  
#>  [9] foreach_1.5.2                                     
#> [10] Biostrings_2.74.1                                 
#> [11] XVector_0.46.0                                    
#> [12] SummarizedExperiment_1.36.0                       
#> [13] Biobase_2.66.0                                    
#> [14] MatrixGenerics_1.18.1                             
#> [15] matrixStats_1.5.0                                 
#> [16] GenomicRanges_1.58.0                              
#> [17] GenomeInfoDb_1.42.3                               
#> [18] IRanges_2.40.1                                    
#> [19] S4Vectors_0.44.0                                  
#> [20] BiocGenerics_0.52.0                               
#> [21] dnaEPICO_0.99.10                                  
#> [22] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             BiocIO_1.16.0            
#>   [3] bitops_1.0-9              filelock_1.0.3           
#>   [5] tibble_3.3.1              methylumi_2.52.0         
#>   [7] preprocessCore_1.68.0     XML_3.99-0.23            
#>   [9] lifecycle_1.0.5           doParallel_1.0.17        
#>  [11] lattice_0.22-9            MASS_7.3-65              
#>  [13] base64_2.0.2              scrime_1.3.7             
#>  [15] magrittr_2.0.4            limma_3.62.2             
#>  [17] sass_0.4.10               rmarkdown_2.31           
#>  [19] jquerylib_0.1.4           yaml_2.3.12              
#>  [21] otel_0.2.0                doRNG_1.8.6.3            
#>  [23] askpass_1.2.1             DBI_1.3.0                
#>  [25] RColorBrewer_1.1-3        lumi_2.58.0              
#>  [27] abind_1.4-8               zlibbioc_1.52.0          
#>  [29] quadprog_1.5-8            purrr_1.2.1              
#>  [31] RCurl_1.98-1.18           rappdirs_0.3.4           
#>  [33] GenomeInfoDbData_1.2.13   irlba_2.3.7              
#>  [35] rentrez_1.2.4             genefilter_1.88.0        
#>  [37] annotate_1.84.0           pkgdown_2.2.0            
#>  [39] DelayedMatrixStats_1.28.1 codetools_0.2-20         
#>  [41] DelayedArray_0.32.0       xml2_1.5.2               
#>  [43] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [45] beanplot_1.3.1            BiocFileCache_2.14.0     
#>  [47] dynamicTreeCut_1.63-1     illuminaio_0.48.0        
#>  [49] GenomicAlignments_1.42.0  jsonlite_2.0.0           
#>  [51] multtest_2.62.0           survival_3.8-6           
#>  [53] systemfonts_1.3.2         tools_4.4.2              
#>  [55] ragg_1.5.2                Rcpp_1.1.1               
#>  [57] glue_1.8.0                SparseArray_1.6.2        
#>  [59] mgcv_1.9-4                xfun_0.57                
#>  [61] dplyr_1.2.0               HDF5Array_1.34.0         
#>  [63] withr_3.0.2               BiocManager_1.30.27      
#>  [65] fastmap_1.2.0             rhdf5filters_1.18.1      
#>  [67] openssl_2.3.5             caTools_1.18.3           
#>  [69] digest_0.6.39             R6_2.6.1                 
#>  [71] colorspace_2.1-2          textshaping_1.0.5        
#>  [73] RPMM_1.25                 gtools_3.9.5             
#>  [75] RSQLite_2.4.6             tidyr_1.3.2              
#>  [77] generics_0.1.4            wateRmelon_2.12.0        
#>  [79] data.table_1.18.2.1       rtracklayer_1.66.0       
#>  [81] httr_1.4.8                htmlwidgets_1.6.4        
#>  [83] S4Arrays_1.6.0            pkgconfig_2.0.3          
#>  [85] blob_1.3.0                siggenes_1.80.0          
#>  [87] impute_1.80.0             htmltools_0.5.9          
#>  [89] bookdown_0.46             geneplotter_1.84.0       
#>  [91] ROC_1.82.0                png_0.1-9                
#>  [93] knitr_1.51                tzdb_0.5.0               
#>  [95] rjson_0.2.23              nlme_3.1-169             
#>  [97] curl_7.0.0                cachem_1.1.0             
#>  [99] rhdf5_2.50.2              BiocVersion_3.20.0       
#> [101] KernSmooth_2.23-26        AnnotationDbi_1.68.0     
#> [103] restfulr_0.0.16           desc_1.4.3               
#> [105] GEOquery_2.74.0           pillar_1.11.1            
#> [107] grid_4.4.2                reshape_0.8.10           
#> [109] vctrs_0.7.2               gplots_3.3.0             
#> [111] ENmix_1.42.2              dbplyr_2.5.2             
#> [113] xtable_1.8-8              cluster_2.1.8.2          
#> [115] evaluate_1.0.5            tinytex_0.59             
#> [117] readr_2.2.0               GenomicFeatures_1.58.0   
#> [119] cli_3.6.5                 compiler_4.4.2           
#> [121] Rsamtools_2.22.0          rlang_1.1.7              
#> [123] crayon_1.5.3              rngtools_1.5.2           
#> [125] nor1mix_1.3-3             mclust_6.1.2             
#> [127] affy_1.84.0               plyr_1.8.9               
#> [129] fs_2.0.1                  BiocParallel_1.40.2      
#> [131] nleqslv_3.3.6             tiff_0.1-12              
#> [133] Matrix_1.7-5              ExperimentHub_2.14.0     
#> [135] hms_1.1.4                 sparseMatrixStats_1.18.0 
#> [137] bit64_4.6.0-1             Rhdf5lib_1.28.0          
#> [139] KEGGREST_1.46.0           statmod_1.5.1            
#> [141] AnnotationHub_3.14.0      memoise_2.0.1            
#> [143] affyio_1.76.0             bslib_0.10.0             
#> [145] bit_4.6.0

Acknowledgement

This package was developed using biocthis.

This vignette was generated using BiocStyle, with knitr and rmarkdown running behind the scenes.

Required knowledge

dnaEPICO is built on core Bioconductor infrastructure for high-dimensional genomic data, with a focus on Illumina DNA methylation arrays.

Preprocessing and quality control are performed using established Bioconductor tools, including minfi, ENmix, and wateRmelon. Downstream statistical modelling relies on base R and CRAN frameworks, including generalised linear models and linear mixed-effects models. Users are expected to have basic familiarity with R, command-line execution, and Illumina IDAT file structures. Downstream statistical modelling relies on base R and CRAN frameworks, including generalised linear models and linear mixed-effects models. Users are expected to have basic familiarity with R, command-line execution, and Illumina IDAT file structures.

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the dnaEPICO tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

References

Aryee, Martin J, Andrew E Jaffe, Hector Corrada Bravo, Christine Ladd-Acosta, Andrew P Feinberg, Kasper D Hansen, and Rafael A Irizarry. 2014. “Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.” Bioinformatics 30 (10): 1363–69. https://doi.org/10.1093/bioinformatics/btu049.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Fortin, Jean-Philippe, Aurélie Labbe, Mathieu Lemire, Brent W Zanke, Thomas J Hudson, Elana J Frtig, Celia MT Greenwood, and Kasper D Hansen. 2014. “Functional normalization of 450k methylation array data improves replication in large cancer studies.” Genome Biology 15 (11): 503. https://doi.org/10.1186/s13059-014-0503-2.

Fortin, Jean-Philippe, Timothy Triche Jr., and Kasper D Hansen. 2017. “Preprocessing, Normalization and Integration of the Illumina HumanMethylationEPIC Array with Minfi.” Bioinformatics 33 (4): 558–60. https://doi.org/10.1093/bioinformatics/btw691.

Maksimovic, Jovana, Lavinia Gordon, and Alicia Oshlack. 2012. “SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips.” Genome Biology 13 (6): R44. https://doi.org/10.1186/gb-2012-13-6-r44.

Marschner, Ian. 2011. “Glm2: Fitting Generalized Linear Models with Convergence Problems.” The R Journal 3: 12–15. https://doi.org/10.32614/RJ-2011-012.

Murat, Kubra, Björn Grüning, Pawel W Poterlowicz, Gareth Westgate, Desmond J Tobin, and Krzysztof Poterlowicz. 2020. “Ewastools: Infinium Human Methylation BeadChip pipeline for population epigenetics integrated into Galaxy.” GigaScience 9 (5): giaa049. https://doi.org/10.1093/gigascience/giaa049.

Pidsley, Ruth, Ching Ching Y Wong, Matteo Volta, Katie Lunnon, Jonathan Mill, and Leonard C Schalkwyk. 2013. “A data-driven approach to preprocessing Illumina 450K methylation array data.” BMC Genomics 14: 293. https://doi.org/10.1186/1471-2164-14-293.

Touleimat, Nizar, and Jörg Tost. 2012. “Complete Pipeline for Infinium() Human Methylation 450K BeadChip Data Processing Using Subset Quantile Normalization for Accurate DNA Methylation Estimation.” Epigenomics 4 (3): 325–41. https://doi.org/10.2217/epi.12.21.

Triche, Timothy J, Daniel J Weisenberger, David Van Den Berg, Peter W Laird, and Kimberly D Siegmund. 2013. “Low-level processing of Illumina Infinium DNA Methylation BeadArrays.” Nucleic Acids Research 41 (7): e90. https://doi.org/10.1093/nar/gkt090.

Xu, Zongli, Li Niu, and Jack A Taylor. 2021. “The ENmix DNA methylation analysis pipeline for Illumina BeadChip and comparisons with seven other preprocessing pipelines.” Clinical Epigenetics 13 (1): 216. https://doi.org/10.1186/s13148-021-01207-1.

The dnaEPICO User’s Guide

Paul Ruiz

Divya Mehta

2026-04-02

Introduction

Citing dnaEPICO

Installation

Overview

Local/Cluster use

Data expectations and folder conventions

Local use

Step 1: preprocessingMinfiEwasWater()

What it does:

Parameter guide:

Step 2: svaEnmix()

What it does:

Parameter guide:

Step 3: preprocessingPheno()

What it does:

Parameter guide:

Cluster / HPC use

Step 4: methylationGLM_T1()

What it does:

Parameter guide:

GLM structure used in dnaEPICO (glm2):

GLM with PRS:

Interaction term

Step 5: methylationGLMM_T1T2()

What it does:

Parameter guide:

LMEM structure used in dnaEPICO (lme4):

Makefile pipeline concepts

Make targets: f3, f4, f3lme and all

Commands

Makefile Modelling steps (HPC)

Model selection

Per-model overrides

Directory and step conventions

Global parameters

Specific parameters

Argument normalisation

Reporting

Troubleshooting

Example Analysis

Example 1: Local Test

Overview of example 1

Step 1: Create a temporary working directory

Step 2: Load example IDAT files from minfiData

Step 3: Copy the cross-reactive probe annotation file

Step 4: Create a phenotype file

Step 5: Preprocessing and quality control

Step 6: Surrogate variable analysis (SVA)

Step 7: Phenotype preprocessing

Step 8: Generate a PDF report

Example 2: Cluster Test

Step 1: Selecting a single test model

Step 2: Using the example phenotype file

Step 3: Removing custom R library paths

Step 4: Matching columns

Step 5: Plot grouping and cell composition reference

Step 6: Timepoint

Basics

Acknowledgement

Required knowledge

Asking for help

References

Citing `dnaEPICO`

Step 1: `preprocessingMinfiEwasWater()`

Step 2: `svaEnmix()`

Step 3: `preprocessingPheno()`

Step 4: `methylationGLM_T1()`

GLM structure used in dnaEPICO (`glm2`):

Step 5: `methylationGLMM_T1T2()`

LMEM structure used in dnaEPICO (`lme4`):