vignettes/dnaEPICO.Rmd
dnaEPICO.RmdAbstract
A comprehensive guide to use the dnaEPICO package. The package supports preprocessing and analysis of Illumina DNA methylation array data (EPICv2, EPIC and 450K), including QC, filtering, SVA, cell-type deconvolution, phenotype integration, modelling, and export-ready files for the DNAmAge Clock Foundation platform.
The dnaEPICO package provides a structured workflow for preprocessing and analysing Illumina DNA methylation array data. It works with EPICv2, EPIC, and 450K arrays. The package supports data import, quality control, filtering, normalisation, surrogate variable analysis, cell-type deconvolution, phenotype integration, statistical modelling, and export files for DNAmAge Clock Foundation.
The package is designed to integrate with established Bioconductor
tools such as minfi, ENmix, and
wateRmelon.
dnaEPICO also prepares standardised DNA methylation files for direct export to the DNAmAge Clock Foundation platform.
dnaEPICO
The dnaEPICO package uses methods from several external packages. Because no single manuscript describes all components, the guidance below explains how to cite dnaEPICO depending on the functions you use.
This citation guidance is adapted from the vignettes and user guides of minfi, ENmix, and wateRmelon.
preprocessingMinfiEwasWater(), please cite
(Aryee et al. 2014; Fortin, Triche Jr., and
Hansen 2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Murat et al.
2020; Maksimovic, Gordon, and Oshlack 2012; Fortin et al. 2014; Triche
et al. 2013; Touleimat and Tost 2012).svaEnmix(), please cite (Aryee et al. 2014; Xu, Niu, and Taylor
2021).make f3 MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen
2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon,
and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and
Tost 2012; Murat et al. 2020).make f4 MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen
2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon,
and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and
Tost 2012; Murat et al. 2020; Marschner 2011).make f3lme MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen
2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon,
and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and
Tost 2012; Murat et al. 2020; Bates et al. 2015).make all MODEL=model1, please cite (Aryee et al. 2014; Fortin, Triche Jr., and Hansen
2017; Xu, Niu, and Taylor 2021; Pidsley et al. 2013; Maksimovic, Gordon,
and Oshlack 2012; Fortin et al. 2014; Triche et al. 2013; Touleimat and
Tost 2012; Murat et al. 2020; Marschner 2011; Bates et al.
2015).The dnaEPICO package depends on minfi, ENmix, and wateRmelon, among others. Install dependencies with BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("dnaEPICO")
BiocManager::valid()Alternatively, install directly from GitHub:
install.packages('devtools')
devtools::install_github('paulYRP/dnaEPICO')dnaEPICO is a DNA methylation preprocessing and modelling pipeline for Illumina arrays. It wraps standard analysis steps into a consistent structure that:
Each step writes its outputs to a dedicated folder. This design lets you run steps independently or connect them through the Makefile pipeline.
Local use is recommended for:
Cluster/HPC use is recommended for:
f3,
f4, f3lme, all).dnaEPICO expects a project layout like this:
├── data/ # Processed data outputs (per model)
│ ├── preprocessingMinfiEwasWater/
│ │ ├── idats/ # Folder containing raw IDAT files
│ │ ├── pheno.csv # Phenotype file with sample metadata
│ │ └── 12864_2024_10027_MOESM8_ESM.csv # Cross-reactivity comparison reference
│ ├── Makefile # Arguments
CSV with sample
identifiers and covariates.*_Red.idat and *_Grn.idat pairs.Cross-hybridisation probe data for EPICv2 can be downloaded from 12864_2024_10027_MOESM8_ESM.csv.
You can extract the example Makefile with:
extractMake(
destDir = getwd(),
overwrite = FALSE
)dnaEPICO produces:
├── data/ # Processed data outputs (per model)
│ ├── preprocessingMinfiEwasWater/
│ │ ├── idats/ # Folder containing raw IDAT files
│ │ ├── pheno.csv # Phenotype file with sample metadata
│ │ └── 12864_2024_10027_MOESM8_ESM.csv # Cross-reactivity comparison reference
│ ├── model1/ # Outputs specific to the first model
│ │ ├── preprocessingMinfiEwasWater/ # Preprocessed phenotype + QC from Minfi
│ │ ├── svaEnmix/ # Surrogate variable analysis
│ │ ├── preprocessingPheno/ # Cleaned and merged phenotype files
│ │ ├── methylationGLM_T1/ # CpG-level GLM results (single timepoint)
│ │ ├── methylationGLMM_T1T2/ # Longitudinal LME results (T1 vs T2)
│ │ └── ... # Additional subfolders by model
│ ├── model2/
│ └── model3/
├── rData/ # Processed R objects (MSet, Beta, CN matrices, etc.)
├── logs/ # Logging output from script runs
├── figures/ # QC plots and visualization output
├── results/ # Output tables and summary statistics
├── reports/ # Rendered reports (PDF)
Each step writes:
logs/: logsrData/: intermediate and final R objectsfigures/: plots organised by stepdata/: exported tables and annotated resultsreports/: final report filesThis approach improves reproducibility, debugging, and HPC resuming.
Local use focuses on the first three steps that are usually run interactively:
preprocessingMinfiEwasWater(): import, QC, normalise,
filter, and estimate cell compositionsvaEnmix(): surrogate variable analysis using ENmix
control probespreprocessingPheno(): merge phenotype metadata with
methylation matricespreprocessingMinfiEwasWater()
preprocessingMinfiEwasWater(
phenoFile = "data/preprocessingMinfiEwasWater/pheno.csv",
idatFolder = "data/preprocessingMinfiEwasWater/idats",
outputLogs = "logs",
nSamples = NA,
SampleID = "Sample_Name",
arrayType = "IlluminaHumanMethylationEPICv2",
annotationVersion = "20a1.hg38",
scriptLabel = "preprocessingMinfiEwasWater",
baseDataFolder = "rData",
sepType = "",
tiffWidth = 2000,
tiffHeight = 1000,
tiffRes = 150,
qcCutoff = 10.5,
detPtype = "m+u",
detPThreshold = 0.05,
normMethods = "adjustedfunnorm",
sexColumn = "Sex",
pvalThreshold = 0.01,
chrToRemove = "chrX,chrY",
snpsToRemove = "SBE,CpG",
mafThreshold = 0.1,
crossReactivePath =
"data/preprocessingMinfiEwasWater/12864_2024_10027_MOESM8_ESM.csv",
plotGroupVar = "Sex",
lcRef = "salivaEPIC",
phenoOrder = "Sample_Name;Timepoint;Sex;PredSex;Basename;Sentrix_ID;Sentrix_Position",
lcPhenoDir = "data/preprocessingMinfiEwasWater"
)This wrapper runs
inst/scripts/preprocessingMinfiEwasWater.R. It:
phenoFile) and IDAT files
(idatFolder).RGChannelSet and performs initial QC.MSet, RatioSet, and
GenomicRatioSet objects.normMethods.rData/.figures/ and logs to
logs/.lcRef and writes
phenoLC.csv to lcPhenoDir.phenoFile: path to sample metadata CSV.idatFolder: folder with raw IDAT pairs.nSamples: subset size for local testing (use
NA to process all).SampleID: sample identifier column in the phenotype
file.arrayType, annotationVersion: ensure the
correct annotation and manifest.qcCutoff, detPtype,
detPThreshold: QC thresholds for probe/sample
filtering.normMethods: chosen normalisation method.pvalThreshold, chrToRemove,
snpsToRemove, mafThreshold: probe filtering
rules.crossReactivePath: path to cross-reactive probe
list.phenoOrder: order of columns in the final phenotype
table.lcRef, lcPhenoDir: cell type reference and
output directory for cell fraction results.svaEnmix()
svaEnmix(
phenoFile = "data/preprocessingMinfiEwasWater/phenoLC.csv",
rgsetData = "rData/preprocessingMinfiEwasWater/objects/RGSet.RData",
sepType = "",
outputLogs = "logs",
nSamples = NA,
SampleID = "Sample_Name",
arrayType = "IlluminaHumanMethylationEPICv2",
annotationVersion = "20a1.hg38",
SentrixIDColumn = "Sentrix_ID",
SentrixPositionColumn = "Sentrix_Position",
ctrlSvaPercVar = 0.90,
ctrlSvaFlag = 1,
scriptLabel = "svaEnmix",
tiffWidth = 2000,
tiffHeight = 1000,
tiffRes = 150
)This wrapper runs inst/scripts/svaEnmix.R. It:
RGSet object from rgsetData and
phenotype data from phenoFile.ctrlSvaPercVar of variance.preprocessingPheno()
preprocessingPheno(
phenoFile = "data/preprocessingMinfiEwasWater/phenoLC.csv",
sepType = "",
betaPath =
"rData/preprocessingMinfiEwasWater/metrics/beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData",
mPath =
"rData/preprocessingMinfiEwasWater/metrics/m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData",
cnPath =
"rData/preprocessingMinfiEwasWater/metrics/cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData",
SampleID = "Sample_Name",
timeVar = "Timepoint",
timepoints = "1,2",
combineTimepoints = "1,2",
outputPheno = "data/preprocessingPheno",
outputRData = "rData/preprocessingPheno/metrics",
outputRDataMerge = "rData/preprocessingPheno/mergeData",
sexColumn = "Sex",
outputLogs = "logs",
outputDir = "data/preprocessingPheno"
)This wrapper runs inst/scripts/preprocessingPheno.R.
It:
SampleID.timepoints.combineTimepoints.outputPheno, metric
objects to outputRData, and merged objects to
outputRDataMerge.phenoFile: the phenotype file generated in Step 1.betaPath, mPath, cnPath:
paths to saved methylation matrices.SampleID: sample identifier column used to match
data.timeVar: column used to define timepoints.timepoints: comma-separated list of timepoints to
export separately.combineTimepoints: list of timepoints to combine for
longitudinal analysis.outputPheno: directory for cleaned phenotype
tables.outputRData: directory for timepoint-specific
methylation objects.outputRDataMerge: directory for merged
phenotype–methylation objects.sexColumn: column used for biological sex.outputLogs: directory for logs.outputDir: base directory for exported files.This step produces analysis-ready phenotype–methylation objects for GLM and LME modelling and creates files compatible with the DNAmAge Clock Foundation platform.
For full datasets, model fitting across hundreds of thousands of CpGs usually requires:
dnaEPICO supports this through a Makefile-driven
pipeline that can run the whole workflow or selected targets.
methylationGLM_T1()
This step is typically executed on HPC systems because it is computationally intensive.
methylationGLM_T1(
inputPheno = "rData/preprocessingPheno/mergeData/phenoBetaT1.RData",
outputLogs = "logs",
outputRData = "rData/methylationGLM_T1/models",
outputPlots = "figures/methylationGLM_T1",
phenotypes = "Pheno1,Pheno2,Pheno3,Pheno4",
covariates = "Sex,Age,Ethnicity,TraumaDefinition,Leukocytes,Epithelial.cells",
factorVars = "Sex,Ethnicity,TraumaDefinition",
cpgPrefix = "cg",
cpgLimit = NA,
nCores = 32,
plotWidth = 2000,
plotHeight = 1000,
plotDPI = 150,
interactionTerm = NULL,
libPath = NULL,
glmLibs = "glm2",
prsMap = "Pheno1:PRS_1,Pheno2:PRS_2,Pheno3:PRS_3,Pheno4:PRS_4",
summaryPval = NA,
summaryResidualSD = TRUE,
saveSignificantCpGs = FALSE,
significantCpGDir = "preliminaryResults/cpgs/methylationGLM_T1",
significantCpGPval = 0.05,
saveTxtSummaries = TRUE,
chunkSize = 10000,
summaryTxtDir = "preliminaryResults/summary/methylationGLM_T1/glm",
fdrThreshold = 0.05,
padjmethod = "fdr",
annotationPackage = "IlluminaHumanMethylationEPICv2anno.20a1.hg38",
annotationCols = "Name,chr,pos,UCSC_RefGene_Group,UCSC_RefGene_Name,
Relation_to_Island,GencodeV41_Group",
annotatedGLMOut = "data/methylationGLM_T1"
) This wrapper runs inst/scripts/methylationGLM_T1.R.
It:
inputPheno.inputPheno: merged phenotype–beta .RData
input.outputLogs: directory for log files.outputRData: directory for model objects.outputPlots: directory for diagnostic and summary
plots.phenotypes: comma-separated phenotype variables.covariates: covariates included in each model.factorVars: categorical covariates.cpgPrefix: CpG identifier prefix.cpgLimit: number of CpGs to process; NA
processes all.nCores: CPU cores for parallel processing.plotWidth, plotHeight,
plotDPI: plot size and resolution.interactionTerm: optional interaction term.libPath: optional HPC library path.glmLibs: GLM package to use.prsMap: phenotype-to-PRS mapping.summaryPval: threshold for summary outputs.summaryResidualSD: include residual SD summaries.saveSignificantCpGs: save CpGs passing the significance
threshold.significantCpGDir: directory for significant CpG
tables.significantCpGPval: significance threshold for
CpGs.saveTxtSummaries: save text summaries.chunkSize: CpGs per chunk.summaryTxtDir: directory for text-based summaries.fdrThreshold: FDR threshold.padjmethod: adjustment method.annotationPackage: annotation package for CpGs.annotationCols: annotation fields to include.annotatedGLMOut: directory for annotated results.glm2):
For each CpG site, dnaEPICO fits a model of the form:
where is the methylation value at CpG site , is the intercept, is the variable of interest, represents the -th covariate in the model, and are the corresponding regression coefficients, and is the residual error term.
For example, a GLM without PRS is:
methylationGLMM_T1T2()
This step is typically executed on HPC systems because it is also computationally intensive.
methylationGLMM_T1T2(
inputPheno = "rData/preprocessingPheno/mergeData/phenoBetaT1T2.RData",
outputLogs = "logs/",
outputRData = "rData/methylationGLMM_T1T2/models",
outputPlots = "figures/methylationGLMM_T1T2",
personVar = "person",
timeVar = "Timepoint",
phenotypes = "Pheno1,Pheno2,Pheno3,Pheno4",
covariates = "Sex,Age,Ethnicity,TraumaDefinition,Leukocytes,Epithelial.cells",
factorVars = "Sex,Ethnicity,TraumaDefinition,Timepoint",
lmeLibs = "lme4,lmerTest",
prsMap = "Pheno1:PRS_1,Pheno2:PRS_2,Pheno3:PRS_3,Pheno4:PRS_4",
libPath = NULL,
cpgPrefix = "cg",
cpgLimit = NA,
nCores = 32,
summaryPval = NA,
plotWidth = 2000,
plotHeight = 1000,
plotDPI = 150,
interactionTerm = NULL,
saveSignificantInteractions = TRUE,
significantInteractionDir = "preliminaryResults/cpgs/methylationGLMM_T1T2",
significantInteractionPval = 0.05,
saveTxtSummaries = TRUE,
chunkSize = 10000,
summaryTxtDir = "preliminaryResults/summary/methylationGLMM_T1T2/lmer",
fdrThreshold = 0.05,
padjmethod = "fdr",
annotationPackage = "IlluminaHumanMethylationEPICv2anno.20a1.hg38",
annotationCols = "Name,chr,pos,UCSC_RefGene_Group,UCSC_RefGene_Name,
Relation_to_Island,GencodeV41_Group",
annotatedLMEOut = "data/methylationGLMM_T1T2"
)This wrapper runs inst/scripts/methylationGLMM_T1T2.R.
It:
inputPheno.prsMap.inputPheno: merged longitudinal phenotype–methylation
.RData input.outputLogs: directory for logs.outputRData: directory for mixed-effects model
objects.outputPlots: directory for plots.personVar: participant identifier for the random
effect.timeVar: time variable for longitudinal analysis.phenotypes: comma-separated phenotype variables.covariates: fixed-effect covariates.factorVars: categorical variables including
timeVar.lmeLibs: mixed-model libraries.prsMap: phenotype-to-PRS mappings.libPath: optional library path.cpgPrefix: CpG identifier prefix.cpgLimit: maximum CpGs; NA uses all.nCores: parallel cores.summaryPval: summary threshold; NA
disables filtering.plotWidth, plotHeight,
plotDPI: plot dimensions and resolution.interactionTerm: optional interaction term.saveSignificantInteractions: save significant
interactions.significantInteractionDir: directory for interaction
results.significantInteractionPval: significance threshold for
interactions.saveTxtSummaries: save text summaries.chunkSize: CpGs per chunk.summaryTxtDir: directory for text summaries.fdrThreshold: FDR threshold.padjmethod: adjustment method.annotationPackage: annotation package.annotationCols: annotation fields.annotatedLMEOut: directory for annotated results.lme4):
For longitudinal analyses, the model includes random effects and optional interactions:
where PRS is included only if it is defined for the
phenotype.
The model follows the same general form as
methylationGLM_T1().
where is the subject-specific random intercept.
Makefile.model.pipeline: user configuration (models,
variables, paths)Makefile.rules.pipeline: pipeline targets and
rulesExtract the example Makefile into the current directory:
## Copies the example Makefile into the current directory
extractMake(destDir = getwd(), overwrite = FALSE)The example template Makefile.model.pipeline is copied
and renamed to Makefile.
Common grouped targets:
f3: Steps 1–3 (preprocessing + SVA + merge)f4: Steps 1–4 (adds GLM)f3lme: Steps 1–3 and 5 (adds LME4)all: Steps 1–5The Makefile controls how dnaEPICO runs models at scale. It defines:
glm2) and GLMM (lme4) parameter
settings,Instead of hard-coding parameters in R scripts, dnaEPICO uses a declarative Makefile layer. This makes it easy to rerun the same workflow with different configurations.
# ===============================================
# USER CONFIGURATION
# ===============================================
MAKEFLAGS += --output-sync
# ===============================================
# MODELS SELECTION PARALLEL RUN
# ===============================================
MODEL ?= model1 # Single model
MODELS = modelA modelB modelC # Multiple models for parallel run
# ===============================================
# PER-MODEL OVERRIDES
# ===============================================
ifeq ($(MODEL), modelA)
PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoA.csv
else ifeq ($(MODEL), modelB)
PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoB.csv
else ifeq ($(MODEL), modelC)
PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoC.csv
else # default (model1)
PHENO_FILE = $(DATA_DIR)/$(STEP1)/pheno.csv
endif
# ===============================================
# DIRECTORIES
# ===============================================
LOGS_DIR = logs
DATA_DIR = data
RDATA_DIR = rData
RESULTS_DIR = results
FIGURES_DIR = figures
STEP1 = preprocessingMinfiEwasWater
STEP2 = svaEnmix
STEP3 = preprocessingPheno
STEP4 = methylationGLM_T1
STEP5 = methylationGLMM_T1T2
STEP6 = reports
METRICS_DIR = metrics
IDAT_DIR = idats
OBJ_DIR = objects
MERGE_DIR = mergeData
MODEL_DIR = models
CPG_DIR = cpgs
SUMMARY_DIR = summary
ENMIX_DIR = enmix
SVA_DIR = sva
QC_DIR = qc
R_DIR = ~/R/x86_64-pc-linux-gnu-library/4.4 # NULL is default R library path
# ===============================================
# GLOBAL PARAMETERS
# ===============================================
# STEP 1-5 PARAMETERS
PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoEMD.csv
SEP_TYPE = ""
SAMPLE_ID = Sample_Name
N_SAMPLES = 30
ARRAY_TYPE = IlluminaHumanMethylationEPICv2
ANNOTATION_VERSION = 20a1.hg38
TIFF_WIDTH = 2000
TIFF_HEIGHT = 1000
TIFF_RES = 150
SEX_COLUMN = Sex
SENTRIX_ID_COLUMN = Sentrix_ID
SENTRIX_POSITION_COLUMN = Sentrix_Position
BASENAME_COLUMN = Basename
TIME_VAR = Timepoint
PHENO_ORDER = $(SAMPLE_ID);$(TIME_VAR);$(SEX_COLUMN);PredSex;$(BASENAME_COLUMN);$(SENTRIX_ID_COLUMN);$(SENTRIX_POSITION_COLUMN)
# STEP 4-5 PARAMETERS
PHENOTYPES = TreatmentGroup
COVARIATES = Sex
FACTOR_VARS = Sex,TreatmentGroup
CPG_PREFIX = cg
CPG_LIMIT = NA
PRS_MAP = NULL
SUMMARY_PVAL = NA
N_CORES = 64
SAVE_TXT_SUMMARIES = TRUE
CHUNK_SIZE = 10000
FDR_THRESHOLD = 0.05
MODEL defines a single pipeline
configuration
MODELS defines multiple independent
configurations
The same pipeline can therefore be run in two modes:
In parallel mode, each model runs as a fully independent pipeline, with its own inputs, logs, figures, and outputs. No files are shared between models except the raw inputs (IDATs), unless explicitly specified.
ifeq ($(MODEL), modelA)
PHENO_FILE = data/preprocessingMinfiEwasWater/phenoA.csv
...
else
PHENO_FILE = data/preprocessingMinfiEwasWater/pheno.csv
endif
This block dynamically replaces variables based on the selected model.
MODEL=modelA, PHENO_FILE is replaced
before any step runsThis allows users to:
Each step writes outputs into step-specific directories. When running in parallel, directory paths are automatically namespaced by model, ensuring isolation between runs.
This design allows:
These parameters are shared across preprocessing and modelling.
These parameters control how each model is fit. In parallel execution:
N_CORES controls intra-model
parallelism (within R)MODELS controls inter-model
parallelism (within Make)ifeq ($(PRS_MAP),NULL)
PRS_MAP_ARG = NULL
else
PRS_MAP_ARG = '$(PRS_MAP)'
endif
if ...
This section ensures that optional arguments are passed correctly to
R scripts. It prevents accidental interpretation of the string
"NULL" as a real value and guarantees consistent behaviour
across local and Scluster environments.
dnamReport() generates a PDF report by rendering
DNAm.Rmd through a runner script (DNAm.R)
shipped in inst/scripts.
Note: PDF report generation requires a working LaTeX installation.TinyTeX can be installed in R using: install.packages("tinytex") and tinytex::install_tinytex().
For troubleshooting LaTeX-related issues, see the TinyTeX documentation.
dnamReport(
output = "DNAm_Report.pdf",
outputDir = "reports",
qcDir = "figures/preprocessingMinfiEwasWater/enMix",
preprocessingDir = "figures/preprocessingMinfiEwasWater/qc",
postprocessingDir = "figures/preprocessingMinfiEwasWater/metrics",
svaDir = "figures/svaEnmix/sva",
glmDir = "figures/methylationGLM_T1",
glmmDir = "figures/methylationGLMM_T1T2",
reportTitle = "DNA methylation",
author = "School of Biomedical Sciences",
date = format(Sys.Date(), "%B %d, %Y")
)logs/
Makefile
The following examples are brief demonstrations on how to perform DNA
methylation analysis using dnaEPICO functions and
arguments.
This example has the following dependencies:
The aim of this example is to show how to run the complete
dnaEPICO pipeline using the example dataset
provided by the minfiData package.
This example is designed to:
To avoid creating permanent files on the user’s computer, all steps are run inside a temporary directory.
When analysing real data, users can simply replace the temporary directory with their own project folder.
In this example, the pipeline performs the following steps:
We first create a temporary directory that will hold all input files and outputs for this example.
# Temporary directory
tmpDir <- file.path(tempdir(), "dnaEPICO_Test")
dir.create(tmpDir, showWarnings = FALSE, recursive = TRUE)
preprocDir <- file.path(tmpDir, "data", "preprocessingMinfiEwasWater")
dir.create(preprocDir, showWarnings = FALSE, recursive = TRUE)
# Register automatic cleanup when the vignette exits
on.exit(unlink(tmpDir, recursive = TRUE, force = TRUE), add = TRUE)
tmpDir
#> [1] "/tmp/Rtmpl8RIvi/dnaEPICO_Test"Important:
The minfiData package contains a small set of Illumina
450K IDAT files from blood samples.
# Locate minfiData extdata directory
baseDir_minfi <- system.file("extdata", package = "minfiData")
# Find all IDAT files
idatFiles <- list.files(
baseDir_minfi,
pattern = "\\.idat$",
recursive = TRUE,
full.names = TRUE
)
length(idatFiles)
#> [1] 12We now copy these IDAT files into the directory structure expected by
SdnaEPICO.
# Create IDAT directory
idatDir <- file.path(preprocDir, "idats")
dir.create(idatDir, showWarnings = FALSE, recursive = TRUE)
# Copy IDATs into idatDir
file.copy(idatFiles, idatDir, overwrite = TRUE)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
length(list.files(idatDir, pattern = "\\.idat$"))
#> [1] 12The pipeline requires a file listing cross-reactive probes. This file
is distributed with dnaEPICO and is copied into the
preprocessing folder. It is a reduced version of the original file,
which can be accessed from the following link:
# Locate dnaEPICO extdata directory
baseDir_dnaEPICO <- system.file("extdata", package = "dnaEPICO")
# Source cross-reactive probe file
crossReactiveSrc <- file.path(baseDir_dnaEPICO, "12864_2024_10027_MOESM8_ESM.csv")
# Destination path inside tmpDir/data/preprocessingMinfiEwasWater/
crossReactivePath <- file.path(preprocDir, "12864_2024_10027_MOESM8_ESM.csv")
# Copy into the target folder
file.copy(crossReactiveSrc, crossReactivePath, overwrite = TRUE)
#> [1] TRUE
file.exists(crossReactivePath)
#> [1] TRUENext, we create a phenotype file using the sample sheet provided by
minfiData.
# Read SampleSheet from minfiData
targets <- read.csv(
file.path(baseDir_minfi, "SampleSheet.csv"),
stringsAsFactors = FALSE,
skip = 7
)
# Create Basename column
targets$Basename <- paste0(
targets$Sentrix_ID,
"_",
targets$Sentrix_Position
)
# Create Timepoint column
targets$Timepoint <- 1
# Write phenotype file
samplePath <- file.path(preprocDir, "sample_minfi.csv")
write.csv(targets, samplePath, row.names = FALSE)
file.exists(samplePath)
#> [1] TRUEThis phenotype file will be used throughout the pipeline.
We now run the first pipeline stage, which performs:
set.seed(123)
preprocessingMinfiEwasWater(
phenoFile = samplePath,
idatFolder = file.path(preprocDir, "idats"),
nSamples = NA,
SampleID = "Sample_Name",
phenoOrder = "Sample_Name;sex;Basename;Sentrix_ID;Sentrix_Position",
arrayType = "IlluminaHumanMethylation450k",
annotationVersion = "ilmn12.hg19",
sexColumn = "sex",
plotGroupVar = "sex",
tiffWidth = 2000,
tiffHeight = 1000,
tiffRes = 150,
qcCutoff = 10.5,
detPtype = "m+u",
detPThreshold = 0.05,
normMethods = "adjustedfunnorm",
pvalThreshold = 0.01,
chrToRemove = "chrX,chrY",
snpsToRemove = "SBE,CpG",
mafThreshold = 0.1,
outputLogs = file.path(tmpDir, "logs"),
baseDataFolder = file.path(tmpDir, "rData"),
figureBaseDir = file.path(tmpDir, "figures"),
lcRef = "FlowSorted.Blood.450k",
crossReactivePath = file.path(preprocDir, "12864_2024_10027_MOESM8_ESM.csv"),
lcPhenoDir = preprocDir
)
#> ==== Starting preprocessingMinfiEwasWater ====
#> Start Time: 2026-04-02 05:15:57
#> Log file path: /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs/log_preprocessingMinfiEwasWater.txt
#>
#> Phenotype file: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv
#> Separator type:
#> IDAT folder: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/idats
#> nSamples limit: all
#> SampleID column: Sample_Name
#> Array type: IlluminaHumanMethylation450k
#> Annotation version: ilmn12.hg19
#> Base RData folder: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData
#> Base Figure folder: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures
#> TIFF size (w x h @ dpi): 2000 x 1000 @ 150
#> QC cutoff (median): 10.5
#> Detection P-value type: m+u
#>
#> Detection p-value threshold: 0.05
#> Normalization methods: adjustedfunnorm
#> Sex column: sex
#> Plot grouping variable: sex
#>
#> Probe filtering:
#> P-value threshold: 0.01
#> Chromosomes to remove: chrX,chrY
#> SNP positions filter: SBE,CpG
#> MAF threshold: 0.1
#> Cross-reactive file: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/12864_2024_10027_MOESM8_ESM.csv
#>
#> Cell composition (estimateLC):
#> Reference: FlowSorted.Blood.450k
#> Leading pheno order: Sample_Name;sex;Basename;Sentrix_ID;Sentrix_Position
#> =======================================================================
#> Using all 6 samples.
#> Phenotype file loaded with 6 samples and 13 columns.
#> Preview of targets:
#> Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID
#> 1 GroupA_3 H5 NA GroupA NA
#> 2 GroupA_2 D5 NA GroupA NA
#> 3 GroupB_3 C6 NA GroupB NA
#> 4 GroupB_1 F7 NA GroupB NA
#> 5 GroupA_1 G7 NA GroupA NA
#> 6 GroupB_2 H7 NA GroupB NA
#> =======================================================================
#> RGSet loaded with 6 samples.
#> =======================================================================
#> Plotting STAINING .jpg
#> Plotting EXTENSION .jpg
#> Plotting HYBRIDIZATION .jpg
#> Plotting TARGET_REMOVAL .jpg
#> Plotting BISULFITE_CONVERSION_I .jpg
#> Plotting BISULFITE_CONVERSION_II .jpg
#> Plotting SPECIFICITY_I .jpg
#> Plotting SPECIFICITY_II .jpg
#> Plotting NON-POLYMORPHIC .jpg
#> Plotting NEGATIVE .jpg
#> Plotting NORM_A .jpg
#> Plotting NORM_C .jpg
#> Plotting NORM_G .jpg
#> Plotting NORM_T .jpg
#> Plotting NORM_ACGT .jpg
#> Generated ENmix control JPGs in: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/enMix
#> =======================================================================
#> Applied annotation: IlluminaHumanMethylation450k, ilmn12.hg19
#> Manifest used:
#> IlluminaMethylationManifest object
#> Annotation
#> array: IlluminaHumanMethylation450k
#> Number of type I probes: 135476
#> Number of type II probes: 350036
#> Number of control probes: 850
#> Number of SNP type I probes: 25
#> Number of SNP type II probes: 40
#> =======================================================================
#> RGSet saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/objects/RGSet.RData
#> =======================================================================
#> Running preprocessRaw() to generate MSet...
#> MSet created with 6 samples and 485512 probes.
#> =======================================================================
#> MSet saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/objects/MSet.RData
#> =======================================================================
#> Preview of methylated intensities:
#> GroupA_3 GroupA_2 GroupB_3
#> cg00050873 22041 588 20505
#> cg00212031 679 569 439
#> cg00213748 1620 421 707
#> cg00214611 449 614 343
#> cg00455876 5921 398 3257
#> cg01707559 1238 646 637
#> =======================================================================
#> Preview of unmethylated intensities:
#> GroupA_3 GroupA_2 GroupB_3
#> cg00050873 1945 433 1012
#> cg00212031 6567 300 2689
#> cg00213748 384 461 295
#> cg00214611 4869 183 1655
#> cg00455876 1655 792 1060
#> cg01707559 12227 1009 7414
#> =======================================================================
#> Converting MSet to RatioSet and GSet...
#> RatioSet created.
#> class: RatioSet
#> dim: 485512 6
#> metadata(0):
#> assays(3): Beta M CN
#> rownames(485512): cg00050873 cg00212031 ... ch.22.47579720R
#> ch.22.48274842R
#> rowData names(0):
#> colnames(6): GroupA_3 GroupA_2 ... GroupA_1 GroupB_2
#> colData names(14): Sample_Name Sample_Well ... Timepoint filenames
#> Annotation
#> array: IlluminaHumanMethylation450k
#> annotation: ilmn12.hg19
#> Preprocessing
#> Method: Raw (no normalization or bg correction)
#> minfi version: 1.52.1
#> Manifest version: 0.4.0
#> =======================================================================
#> GSet created.
#> class: GenomicRatioSet
#> dim: 485512 6
#> metadata(0):
#> assays(3): Beta M CN
#> rownames(485512): cg13869341 cg14008030 ... cg08265308 cg14273923
#> rowData names(0):
#> colnames(6): GroupA_3 GroupA_2 ... GroupA_1 GroupB_2
#> colData names(14): Sample_Name Sample_Well ... Timepoint filenames
#> Annotation
#> array: IlluminaHumanMethylation450k
#> annotation: ilmn12.hg19
#> Preprocessing
#> Method: Raw (no normalization or bg correction)
#> minfi version: 1.52.1
#> Manifest version: 0.4.0
#> =======================================================================
#> =======================================================================
#> Extracting methylation raw level metrics from GSet, these metrics are not saved...
#>
#> Preview of beta values:
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 0.83863092 0.93969915 0.90540879 0.83618626 0.82767608
#> cg14008030 0.64813991 0.62519431 0.71063615 0.70437389 0.68204321
#> cg12045430 0.06774677 0.07835538 0.06291283 0.17674976 0.07216013
#> cg20826792 0.13240759 0.13743747 0.10998991 0.18505709 0.13602546
#> cg00381604 0.03526373 0.03036437 0.03364362 0.05298938 0.03786708
#> cg20253340 0.56370048 0.63295793 0.47331874 0.51451954 0.58907301
#> =======================================================================
#> Preview of M-values:
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 2.3776718 3.9619486 3.2587913 2.35176802 2.2639433
#> cg14008030 0.8813034 0.7381618 1.2962264 1.25256776 1.1010324
#> cg12045430 -3.7824979 -3.5561063 -3.8967571 -2.21962279 -3.6846021
#> cg20826792 -2.7120316 -2.6498537 -3.0164505 -2.13872859 -2.6671121
#> cg00381604 -4.7738777 -4.9969913 -4.8441506 -4.15960551 -4.6672202
#> cg20253340 0.3696099 0.7861642 -0.1541181 0.08381261 0.5195643
#> =======================================================================
#> Preview of copy number values:
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 15.45930 15.49151 15.66336 15.42105 15.59497
#> cg14008030 14.30499 15.08423 14.85866 14.75692 15.09280
#> cg12045430 14.13755 14.45436 14.17290 14.40561 14.38418
#> cg20826792 14.31995 14.39794 14.27467 14.36543 14.44553
#> cg00381604 14.02289 14.15782 13.87652 14.00799 14.14498
#> cg20253340 13.70217 14.06483 13.69360 13.88722 14.20526
#> =======================================================================
#> Running QC plotting from MSet object...
#> QC plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/quality_control(MSet).tiff
#> =======================================================================
#> Calculating detection p-values...
#> Detection p-values calculated using type: m+u
#> Preview of detection p-values:
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg00050873 0.000000e+00 0.4489762614 0.000000e+00 9.463873e-01 0.854614396
#> cg00212031 0.000000e+00 0.7236576115 3.963479e-36 8.595495e-01 0.632034035
#> cg00213748 1.474819e-24 0.7026344783 3.658137e-01 3.018175e-01 0.915179880
#> cg00214611 1.727636e-260 0.8252957388 7.893160e-10 5.017703e-01 0.781146377
#> cg00455876 0.000000e+00 0.1759300809 3.160650e-83 4.191290e-01 0.509724924
#> cg01707559 0.000000e+00 0.0008455845 0.000000e+00 4.362571e-11 0.001268238
#> Detection RData p-values saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/qc/detP_RGSet.RData
#> Detection plot p-values saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/detection_pvalues(RGSet).tiff
#> =======================================================================
#> Calculate the mean detection p-values across all samples...
#> Number of failed samples: 0
#> Samples before filtering: 6
#> Samples after filtering: 6
#> RGSet saved after removing the failed samples to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/objects/RGSet.RData
#> =======================================================================
#> Generating density plot of Beta values...
#> Density plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/densityBeta(MSet).tiff
#> =======================================================================
#> Predicting sex based on Beta values...
#> Predicted Sex plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/sexPrediction(GSet).tiff
#> =======================================================================
#> Clinical sex values...
#> Sex column integrity check:
#> Sex column used: sex
#> NA values: 0
#> Unknown labels: None
#> Clinical Sex plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/sexClinical(GSet).tiff
#> =======================================================================
#> Mistmaches found[1] Sample_Name Sample_Well Sample_Plate
#> <0 rows> (or 0-length row.names)
#> =======================================================================
#> Running normalization methods using Minfi and WateRmelon: adjustedfunnorm
#> Applying normalization: adjustedfunnorm
#> No methods found in package 'RSQLite' for request: 'dbListFields' when loading 'lumi'
#> [adjustedFunnorm] Background and dye bias correction with noob
#> [adjustedFunnorm] Mapping to genome
#> [adjustedFunnorm] Quantile extraction
#> [adjustedFunnorm] Normalization
#> Saved normalized object: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/normObjects/norm_adjustedfunnorm_RGSet.RData
#> Plot Raw vs Normalisation data saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/qc/sexComparison_RawNorm(MSetF).tiff
#> =======================================================================
#> Filtering probes with detection p-values: 0.01 ...
#> Probes retained: 466668 / 485512
#> Filtered object saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removProbes_MSetF_Flt.RData
#> =======================================================================
#> Removing probes on chromosomes: chrX, chrY
#> Remaining probes after removing selected chromosomes:
#>
#> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20
#> 45338 23371 27833 23541 11955 14417 14661 21474 26554 5538 24983 33341 10158
#> chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9
#> 4136 8368 23930 19029 23181 35088 29275 20091 9575
#> Sex chromosome-filtered object saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removChrXY_MSetF_Flt_Rxy.RData
#> =======================================================================
#> Removing probes with SNPs at: SBE, CpG with MAF >= 0.1
#> Remaining probes after SNP filtering: 451304
#> SNP-filtered object saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removSNPs_MAF0.1_MSetF_Flt_Rxy_Ds.RData
#> =======================================================================
#> Loading cross-reactive probe list from:
#> /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/12864_2024_10027_MOESM8_ESM.csv
#> Probes retained after cross-reactive filter: 451304
#> Cross-reactive-filtered object saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/filterObjects/removCrossReactive_MSetF_Flt_Rxy_Ds_Rc.RData
#> =======================================================================
#> Extracting final DNAm matrices (M, Beta, CN)...
#> M-values saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 2.6393989 4.5098963 3.5921764 2.4425984 2.3233487
#> cg14008030 1.1068115 0.7804484 1.3156115 1.0308431 0.9513732
#> cg12045430 -4.9216557 -4.4422917 -4.7704070 -2.7293860 -4.8731142
#> cg20826792 -3.2152921 -3.0354748 -3.3089689 -2.6377700 -3.1260062
#> cg00381604 -5.7736577 -5.9343344 -5.7260241 -5.5323099 -5.9329137
#> cg20253340 0.4886808 0.9086010 -0.2618946 -0.1448504 0.4240629
#> Beta-values saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 0.86170119 0.95795345 0.92343122 0.84462880 0.83347005
#> cg14008030 0.68291127 0.63203702 0.71339133 0.67140047 0.65913495
#> cg12045430 0.03194010 0.04397503 0.03534566 0.13103183 0.03299699
#> cg20826792 0.09720524 0.10870567 0.09165419 0.13843342 0.10277336
#> cg00381604 0.01795101 0.01608951 0.01854242 0.02115069 0.01610511
#> cg20253340 0.58388143 0.65244281 0.45474134 0.47492040 0.57295995
#> CN-values saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData
#> GroupA_3 GroupA_2 GroupB_3 GroupB_1 GroupA_1
#> cg13869341 15.30437 15.49182 15.82079 15.60943 15.75944
#> cg14008030 14.18634 15.04115 14.84927 14.73982 15.09002
#> cg12045430 13.91558 14.39012 14.25258 14.51851 14.47114
#> cg20826792 14.10642 14.32523 14.35196 14.47400 14.52957
#> cg00381604 13.82183 14.11709 13.96986 14.12212 14.25033
#> cg20253340 13.52945 13.94953 13.64490 13.83988 14.15564
#> =======================================================================
#> Plot examineMDS_PostFilteringCrossRect saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/metrics/examineMDS_PostFilteringCrossRect(MSetF_Flt_Rxy_Ds_Rc).tiff
#> =======================================================================
#> Plotting final density plots for grouping variable: sex
#> Density plots saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/preprocessingMinfiEwasWater/metrics/densityBeta&M(MSetF_Flt_Rxy_Ds_Rc).tiff
#> =======================================================================
#> Cell composition reference selected: FlowSorted.Blood.450k
#> Using ENmix Houseman-based cell composition
#> Saved phenoLC: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/phenoLC.csv
#> =======================================================================
#> Session info:
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] FlowSorted.Blood.450k_1.44.0
#> [2] minfiData_0.52.0
#> [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#> [4] IlluminaHumanMethylation450kmanifest_0.4.0
#> [5] minfi_1.52.1
#> [6] bumphunter_1.48.0
#> [7] locfit_1.5-9.12
#> [8] iterators_1.0.14
#> [9] foreach_1.5.2
#> [10] Biostrings_2.74.1
#> [11] XVector_0.46.0
#> [12] SummarizedExperiment_1.36.0
#> [13] Biobase_2.66.0
#> [14] MatrixGenerics_1.18.1
#> [15] matrixStats_1.5.0
#> [16] GenomicRanges_1.58.0
#> [17] GenomeInfoDb_1.42.3
#> [18] IRanges_2.40.1
#> [19] S4Vectors_0.44.0
#> [20] BiocGenerics_0.52.0
#> [21] dnaEPICO_0.99.10
#> [22] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 BiocIO_1.16.0
#> [3] bitops_1.0-9 filelock_1.0.3
#> [5] tibble_3.3.1 methylumi_2.52.0
#> [7] preprocessCore_1.68.0 XML_3.99-0.23
#> [9] lifecycle_1.0.5 doParallel_1.0.17
#> [11] lattice_0.22-9 MASS_7.3-65
#> [13] base64_2.0.2 scrime_1.3.7
#> [15] magrittr_2.0.4 limma_3.62.2
#> [17] sass_0.4.10 rmarkdown_2.31
#> [19] jquerylib_0.1.4 yaml_2.3.12
#> [21] otel_0.2.0 doRNG_1.8.6.3
#> [23] askpass_1.2.1 DBI_1.3.0
#> [25] RColorBrewer_1.1-3 lumi_2.58.0
#> [27] abind_1.4-8 zlibbioc_1.52.0
#> [29] quadprog_1.5-8 purrr_1.2.1
#> [31] RCurl_1.98-1.18 rappdirs_0.3.4
#> [33] GenomeInfoDbData_1.2.13 irlba_2.3.7
#> [35] rentrez_1.2.4 genefilter_1.88.0
#> [37] annotate_1.84.0 pkgdown_2.2.0
#> [39] DelayedMatrixStats_1.28.1 codetools_0.2-20
#> [41] DelayedArray_0.32.0 xml2_1.5.2
#> [43] tidyselect_1.2.1 UCSC.utils_1.2.0
#> [45] beanplot_1.3.1 BiocFileCache_2.14.0
#> [47] dynamicTreeCut_1.63-1 illuminaio_0.48.0
#> [49] GenomicAlignments_1.42.0 jsonlite_2.0.0
#> [51] multtest_2.62.0 survival_3.8-6
#> [53] systemfonts_1.3.2 tools_4.4.2
#> [55] ragg_1.5.2 Rcpp_1.1.1
#> [57] glue_1.8.0 SparseArray_1.6.2
#> [59] mgcv_1.9-4 xfun_0.57
#> [61] dplyr_1.2.0 HDF5Array_1.34.0
#> [63] withr_3.0.2 BiocManager_1.30.27
#> [65] fastmap_1.2.0 rhdf5filters_1.18.1
#> [67] openssl_2.3.5 caTools_1.18.3
#> [69] digest_0.6.39 R6_2.6.1
#> [71] colorspace_2.1-2 textshaping_1.0.5
#> [73] RPMM_1.25 gtools_3.9.5
#> [75] RSQLite_2.4.6 tidyr_1.3.2
#> [77] generics_0.1.4 wateRmelon_2.12.0
#> [79] data.table_1.18.2.1 rtracklayer_1.66.0
#> [81] httr_1.4.8 htmlwidgets_1.6.4
#> [83] S4Arrays_1.6.0 pkgconfig_2.0.3
#> [85] blob_1.3.0 siggenes_1.80.0
#> [87] impute_1.80.0 htmltools_0.5.9
#> [89] bookdown_0.46 geneplotter_1.84.0
#> [91] ROC_1.82.0 png_0.1-9
#> [93] knitr_1.51 tzdb_0.5.0
#> [95] rjson_0.2.23 nlme_3.1-169
#> [97] curl_7.0.0 cachem_1.1.0
#> [99] rhdf5_2.50.2 BiocVersion_3.20.0
#> [101] KernSmooth_2.23-26 AnnotationDbi_1.68.0
#> [103] restfulr_0.0.16 desc_1.4.3
#> [105] GEOquery_2.74.0 pillar_1.11.1
#> [107] grid_4.4.2 reshape_0.8.10
#> [109] vctrs_0.7.2 gplots_3.3.0
#> [111] ENmix_1.42.2 dbplyr_2.5.2
#> [113] xtable_1.8-8 cluster_2.1.8.2
#> [115] evaluate_1.0.5 readr_2.2.0
#> [117] GenomicFeatures_1.58.0 cli_3.6.5
#> [119] compiler_4.4.2 Rsamtools_2.22.0
#> [121] rlang_1.1.7 crayon_1.5.3
#> [123] rngtools_1.5.2 nor1mix_1.3-3
#> [125] mclust_6.1.2 affy_1.84.0
#> [127] plyr_1.8.9 fs_2.0.1
#> [129] BiocParallel_1.40.2 nleqslv_3.3.6
#> [131] Matrix_1.7-5 ExperimentHub_2.14.0
#> [133] hms_1.1.4 sparseMatrixStats_1.18.0
#> [135] bit64_4.6.0-1 Rhdf5lib_1.28.0
#> [137] KEGGREST_1.46.0 statmod_1.5.1
#> [139] AnnotationHub_3.14.0 memoise_2.0.1
#> [141] affyio_1.76.0 bslib_0.10.0
#> [143] bit_4.6.0Next, we estimate surrogate variables to capture technical variation such as chip and position effects.
rgsetPath <- file.path(
tmpDir,
"rData",
"preprocessingMinfiEwasWater",
"objects",
"RGSet.RData"
)
file.exists(rgsetPath)
#> [1] TRUE
svaEnmix(
phenoFile = samplePath,
rgsetData = rgsetPath,
sepType = "",
outputLogs = file.path(tmpDir, "logs"),
nSamples = NA,
SampleID = "Sample_Name",
arrayType = "IlluminaHumanMethylation450k",
annotationVersion = "ilmn12.hg19",
SentrixIDColumn = "Sentrix_ID",
SentrixPositionColumn = "Sentrix_Position",
ctrlSvaPercVar = 0.90,
ctrlSvaFlag = 1,
scriptLabel = "svaEnmix",
tiffWidth = 2000,
tiffHeight = 1000,
tiffRes = 150,
dataBaseDir = file.path(tmpDir, "data") ,
figureBaseDir = file.path(tmpDir, "figures"),
rBaseDir = file.path(tmpDir, "rData")
)
#> ==== Starting SVA Estimation with Enmix ====
#> Start time: 2026-04-02 05:17:41
#>
#> Log file path: /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs/log_svaEnmix.txt
#>
#> Pheno file: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv
#> Separator type:
#> Log directory: /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs
#> Sample limit: All
#> SampleID column: Sample_Name
#> Sentrix ID column: Sentrix_ID
#> Sentrix Position column: Sentrix_Position
#> Script label: svaEnmix
#> ctrlSva percvar: 0.9
#> ctrlSva flag: 1
#> TIFF dimensions (WxH): 2000 x 1000 at 150 dpi
#> =======================================================================
#> Using all 6 samples.
#> Phenotype file loaded with 6 samples and 13 columns.
#> Preview of targets:
#> Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID Sentrix_ID
#> 1 GroupA_3 H5 NA GroupA NA 5723646052
#> 2 GroupA_2 D5 NA GroupA NA 5723646052
#> 3 GroupB_3 C6 NA GroupB NA 5723646052
#> 4 GroupB_1 F7 NA GroupB NA 5723646053
#> 5 GroupA_1 G7 NA GroupA NA 5723646053
#> 6 GroupB_2 H7 NA GroupB NA 5723646053
#> =======================================================================
#> RGSet loaded with 6 samples.
#> =======================================================================
#> 3 surrogate variables explain 91.17398 % of
#> data variation
#> Surrogate variables matrix (first few rows):
#> PC1 PC2 PC3
#> GroupA_3 -25330.022 -3189.849 3715.1929
#> GroupA_2 6350.671 -6443.776 -1051.9582
#> GroupB_3 8827.110 -4444.123 15604.2358
#> GroupB_1 -3840.017 -1215.872 -10610.1476
#> GroupA_1 10355.156 -9071.276 -7419.7718
#> GroupB_2 3637.102 24364.897 -237.5511
#> SVA Matrix RData saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/svaEnmix/svaMatrix.RData
#> SVA Matrix CSV saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/svaEnmix/svaMatrix.csv
#> Saved phenoLC + SVA: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv
#> SVA Sentrix plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/svaEnmix/sva_SentrixID.tiff
#> SVA Position plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/svaEnmix/sva_SentrixPosition.tiff
#> =======================================================================
#> Number of surrogate variables (K): 3
#> SentrixID class: factor
#> SentrixID unique levels: 2
#> sentrixID
#> 5723646052 5723646053
#> 3 3
#> SentrixPosition class: factor
#> SentrixPosition unique levels: 5
#> sentrixPos
#> R02C02 R04C01 R04C02 R05C02 R06C02
#> 1 1 1 2 1
#> First row of SVA matrix:
#> PC1 PC2 PC3
#> -25330.022 -3189.849 3715.193
#> Sample names in SVA matrix: GroupA_3, GroupA_2, GroupB_3, GroupB_1, GroupA_1
#> Sample names in pData(RGSet): GroupA_3, GroupA_2, GroupB_3, GroupB_1, GroupA_1
#> Warning: attempting model selection on an essentially perfect fit is nonsense
#> Warning in max(dttmp$`Pr(F)`, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning: attempting model selection on an essentially perfect fit is nonsense
#> Warning in max(dttmp$`Pr(F)`, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning: attempting model selection on an essentially perfect fit is nonsense
#> Warning in max(dttmp$`Pr(F)`, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in anova.lm(lmsvaFull[[i]]): ANOVA F-tests on an essentially perfect
#> fit are unreliable
#> Warning in anova.lm(lmsvaRed[[i]]): ANOVA F-tests on an essentially perfect fit
#> are unreliable
#> Warning in anova.lm(lmsvaFull[[i]]): ANOVA F-tests on an essentially perfect
#> fit are unreliable
#> Warning in anova.lm(lmsvaRed[[i]]): ANOVA F-tests on an essentially perfect fit
#> are unreliable
#> Warning in anova.lm(lmsvaFull[[i]]): ANOVA F-tests on an essentially perfect
#> fit are unreliable
#> Warning in anova.lm(lmsvaRed[[i]]): ANOVA F-tests on an essentially perfect fit
#> are unreliable
#> =======================================================================
#> SVA Sentrix/Position plot saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/figures/svaEnmix/sva_SentrixIDPosition.tiff
#> =======================================================================
#> Session info:
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] FlowSorted.Blood.450k_1.44.0
#> [2] minfiData_0.52.0
#> [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#> [4] IlluminaHumanMethylation450kmanifest_0.4.0
#> [5] minfi_1.52.1
#> [6] bumphunter_1.48.0
#> [7] locfit_1.5-9.12
#> [8] iterators_1.0.14
#> [9] foreach_1.5.2
#> [10] Biostrings_2.74.1
#> [11] XVector_0.46.0
#> [12] SummarizedExperiment_1.36.0
#> [13] Biobase_2.66.0
#> [14] MatrixGenerics_1.18.1
#> [15] matrixStats_1.5.0
#> [16] GenomicRanges_1.58.0
#> [17] GenomeInfoDb_1.42.3
#> [18] IRanges_2.40.1
#> [19] S4Vectors_0.44.0
#> [20] BiocGenerics_0.52.0
#> [21] dnaEPICO_0.99.10
#> [22] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 BiocIO_1.16.0
#> [3] bitops_1.0-9 filelock_1.0.3
#> [5] tibble_3.3.1 methylumi_2.52.0
#> [7] preprocessCore_1.68.0 XML_3.99-0.23
#> [9] lifecycle_1.0.5 doParallel_1.0.17
#> [11] lattice_0.22-9 MASS_7.3-65
#> [13] base64_2.0.2 scrime_1.3.7
#> [15] magrittr_2.0.4 limma_3.62.2
#> [17] sass_0.4.10 rmarkdown_2.31
#> [19] jquerylib_0.1.4 yaml_2.3.12
#> [21] otel_0.2.0 doRNG_1.8.6.3
#> [23] askpass_1.2.1 DBI_1.3.0
#> [25] RColorBrewer_1.1-3 lumi_2.58.0
#> [27] abind_1.4-8 zlibbioc_1.52.0
#> [29] quadprog_1.5-8 purrr_1.2.1
#> [31] RCurl_1.98-1.18 rappdirs_0.3.4
#> [33] GenomeInfoDbData_1.2.13 irlba_2.3.7
#> [35] rentrez_1.2.4 genefilter_1.88.0
#> [37] annotate_1.84.0 pkgdown_2.2.0
#> [39] DelayedMatrixStats_1.28.1 codetools_0.2-20
#> [41] DelayedArray_0.32.0 xml2_1.5.2
#> [43] tidyselect_1.2.1 UCSC.utils_1.2.0
#> [45] beanplot_1.3.1 BiocFileCache_2.14.0
#> [47] dynamicTreeCut_1.63-1 illuminaio_0.48.0
#> [49] GenomicAlignments_1.42.0 jsonlite_2.0.0
#> [51] multtest_2.62.0 survival_3.8-6
#> [53] systemfonts_1.3.2 tools_4.4.2
#> [55] ragg_1.5.2 Rcpp_1.1.1
#> [57] glue_1.8.0 SparseArray_1.6.2
#> [59] mgcv_1.9-4 xfun_0.57
#> [61] dplyr_1.2.0 HDF5Array_1.34.0
#> [63] withr_3.0.2 BiocManager_1.30.27
#> [65] fastmap_1.2.0 rhdf5filters_1.18.1
#> [67] openssl_2.3.5 caTools_1.18.3
#> [69] digest_0.6.39 R6_2.6.1
#> [71] colorspace_2.1-2 textshaping_1.0.5
#> [73] RPMM_1.25 gtools_3.9.5
#> [75] RSQLite_2.4.6 tidyr_1.3.2
#> [77] generics_0.1.4 wateRmelon_2.12.0
#> [79] data.table_1.18.2.1 rtracklayer_1.66.0
#> [81] httr_1.4.8 htmlwidgets_1.6.4
#> [83] S4Arrays_1.6.0 pkgconfig_2.0.3
#> [85] blob_1.3.0 siggenes_1.80.0
#> [87] impute_1.80.0 htmltools_0.5.9
#> [89] bookdown_0.46 geneplotter_1.84.0
#> [91] ROC_1.82.0 png_0.1-9
#> [93] knitr_1.51 tzdb_0.5.0
#> [95] rjson_0.2.23 nlme_3.1-169
#> [97] curl_7.0.0 cachem_1.1.0
#> [99] rhdf5_2.50.2 BiocVersion_3.20.0
#> [101] KernSmooth_2.23-26 AnnotationDbi_1.68.0
#> [103] restfulr_0.0.16 desc_1.4.3
#> [105] GEOquery_2.74.0 pillar_1.11.1
#> [107] grid_4.4.2 reshape_0.8.10
#> [109] vctrs_0.7.2 gplots_3.3.0
#> [111] ENmix_1.42.2 dbplyr_2.5.2
#> [113] xtable_1.8-8 cluster_2.1.8.2
#> [115] evaluate_1.0.5 readr_2.2.0
#> [117] GenomicFeatures_1.58.0 cli_3.6.5
#> [119] compiler_4.4.2 Rsamtools_2.22.0
#> [121] rlang_1.1.7 crayon_1.5.3
#> [123] rngtools_1.5.2 nor1mix_1.3-3
#> [125] mclust_6.1.2 affy_1.84.0
#> [127] plyr_1.8.9 fs_2.0.1
#> [129] BiocParallel_1.40.2 nleqslv_3.3.6
#> [131] Matrix_1.7-5 ExperimentHub_2.14.0
#> [133] hms_1.1.4 sparseMatrixStats_1.18.0
#> [135] bit64_4.6.0-1 Rhdf5lib_1.28.0
#> [137] KEGGREST_1.46.0 statmod_1.5.1
#> [139] AnnotationHub_3.14.0 memoise_2.0.1
#> [141] affyio_1.76.0 bslib_0.10.0
#> [143] bit_4.6.0This step aligns phenotype data with the processed methylation matrices.
# Metric paths produced by preprocessingMinfiEwasWater
betaPath <- file.path(
tmpDir,
"rData",
"preprocessingMinfiEwasWater",
"metrics",
"beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
)
mPath <- file.path(
tmpDir,
"rData",
"preprocessingMinfiEwasWater",
"metrics",
"m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
)
cnPath <- file.path(
tmpDir,
"rData",
"preprocessingMinfiEwasWater",
"metrics",
"cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
)
# Quick sanity check
file.exists(betaPath)
#> [1] TRUE
file.exists(mPath)
#> [1] TRUE
file.exists(cnPath)
#> [1] TRUE
preprocessingPheno(
phenoFile = samplePath,
sepType = "",
betaPath = betaPath,
mPath = mPath,
cnPath = cnPath,
SampleID = "Sample_Name",
timeVar = "Timepoint",
timepoints = "1",
combineTimepoints = "1",
sexColumn = "sex",
outputPheno = file.path(tmpDir, "data", "preprocessingPheno"),
outputRData = file.path(tmpDir, "rData", "preprocessingPheno", "metrics"),
outputRDataMerge = file.path(tmpDir, "rData", "preprocessingPheno", "mergeData"),
outputLogs = file.path(tmpDir, "logs"),
outputDir = file.path(tmpDir, "data", "preprocessingPheno")
)
#> ==== Starting Phenotype Preprocessing ====
#> Start Time: 2026-04-02 05:17:42
#> Log file path: /tmp/Rtmpl8RIvi/dnaEPICO_Test/logs/log_preprocessingPheno.txt
#>
#> Phenotype file: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingMinfiEwasWater/sample_minfi.csv
#> Beta path: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData
#> M-values path: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/m_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData
#> CN path: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingMinfiEwasWater/metrics/cn_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData
#>
#> Identifier column: Sample_Name
#> Timepoint column: Timepoint
#> Timepoints (if present): 1
#> Combine timepoints: 1
#>
#> Sex column: sex
#> Output phenotype dir: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno
#> RData metrics dir: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingPheno/metrics
#> RData merge dir: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingPheno/mergeData
#>
#> =======================================================================
#> Beta dimensions: 451304 6
#> M dimensions: 451304 6
#> CN dimensions: 451304 6
#> Phenotype file loaded with 6 samples and 16 columns.
#> Preview of phenoLC:
#> Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID
#> 1 GroupA_1 G7 NA GroupA NA
#> 2 GroupA_2 D5 NA GroupA NA
#> 3 GroupA_3 H5 NA GroupA NA
#> 4 GroupB_1 F7 NA GroupB NA
#> 5 GroupB_2 H7 NA GroupB NA
#> 6 GroupB_3 C6 NA GroupB NA
#> =======================================================================
#> Subsetting to timepoints: 1
#> Available values in Timepoint column:
#>
#> 1
#> 6
#> Combining timepoints: 1
#> Saved combined phenotype file for T1T2 at: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno
#> =======================================================================
#> Processing merge for timepoint: 1
#> - pheno rows: 6
#> - beta cols: 6
#> Saved merged object for T1
#> Combined data saved for timepoints: 1
#> Merged data saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/rData/preprocessingPheno/mergeData
#> =======================================================================
#> ProbeID GroupA_3 GroupA_2 GroupB_3 GroupB_1
#> 1 cg13869341 0.86170119 0.95795345 0.92343122 0.84462880
#> 2 cg14008030 0.68291127 0.63203702 0.71339133 0.67140047
#> 3 cg12045430 0.03194010 0.04397503 0.03534566 0.13103183
#> 4 cg20826792 0.09720524 0.10870567 0.09165419 0.13843342
#> 5 cg00381604 0.01795101 0.01608951 0.01854242 0.02115069
#> Beta CSV file for ClockFundation saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno/beta.csv
#> Beta ZIP file for ClockFundation saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno/beta.zip
#> Re-encoding Sex: 0 = Female, 1 = Male
#> Sample file for ClockFundation saved to: /tmp/Rtmpl8RIvi/dnaEPICO_Test/data/preprocessingPheno/phenoCF.csv
#> =======================================================================
#>
#> Session Info:
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] FlowSorted.Blood.450k_1.44.0
#> [2] minfiData_0.52.0
#> [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#> [4] IlluminaHumanMethylation450kmanifest_0.4.0
#> [5] minfi_1.52.1
#> [6] bumphunter_1.48.0
#> [7] locfit_1.5-9.12
#> [8] iterators_1.0.14
#> [9] foreach_1.5.2
#> [10] Biostrings_2.74.1
#> [11] XVector_0.46.0
#> [12] SummarizedExperiment_1.36.0
#> [13] Biobase_2.66.0
#> [14] MatrixGenerics_1.18.1
#> [15] matrixStats_1.5.0
#> [16] GenomicRanges_1.58.0
#> [17] GenomeInfoDb_1.42.3
#> [18] IRanges_2.40.1
#> [19] S4Vectors_0.44.0
#> [20] BiocGenerics_0.52.0
#> [21] dnaEPICO_0.99.10
#> [22] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 BiocIO_1.16.0
#> [3] bitops_1.0-9 filelock_1.0.3
#> [5] tibble_3.3.1 methylumi_2.52.0
#> [7] preprocessCore_1.68.0 XML_3.99-0.23
#> [9] lifecycle_1.0.5 doParallel_1.0.17
#> [11] lattice_0.22-9 MASS_7.3-65
#> [13] base64_2.0.2 scrime_1.3.7
#> [15] magrittr_2.0.4 limma_3.62.2
#> [17] sass_0.4.10 rmarkdown_2.31
#> [19] jquerylib_0.1.4 yaml_2.3.12
#> [21] otel_0.2.0 doRNG_1.8.6.3
#> [23] askpass_1.2.1 DBI_1.3.0
#> [25] RColorBrewer_1.1-3 lumi_2.58.0
#> [27] abind_1.4-8 zlibbioc_1.52.0
#> [29] quadprog_1.5-8 purrr_1.2.1
#> [31] RCurl_1.98-1.18 rappdirs_0.3.4
#> [33] GenomeInfoDbData_1.2.13 irlba_2.3.7
#> [35] rentrez_1.2.4 genefilter_1.88.0
#> [37] annotate_1.84.0 pkgdown_2.2.0
#> [39] DelayedMatrixStats_1.28.1 codetools_0.2-20
#> [41] DelayedArray_0.32.0 xml2_1.5.2
#> [43] tidyselect_1.2.1 UCSC.utils_1.2.0
#> [45] beanplot_1.3.1 BiocFileCache_2.14.0
#> [47] dynamicTreeCut_1.63-1 illuminaio_0.48.0
#> [49] GenomicAlignments_1.42.0 jsonlite_2.0.0
#> [51] multtest_2.62.0 survival_3.8-6
#> [53] systemfonts_1.3.2 tools_4.4.2
#> [55] ragg_1.5.2 Rcpp_1.1.1
#> [57] glue_1.8.0 SparseArray_1.6.2
#> [59] mgcv_1.9-4 xfun_0.57
#> [61] dplyr_1.2.0 HDF5Array_1.34.0
#> [63] withr_3.0.2 BiocManager_1.30.27
#> [65] fastmap_1.2.0 rhdf5filters_1.18.1
#> [67] openssl_2.3.5 caTools_1.18.3
#> [69] digest_0.6.39 R6_2.6.1
#> [71] colorspace_2.1-2 textshaping_1.0.5
#> [73] RPMM_1.25 gtools_3.9.5
#> [75] RSQLite_2.4.6 tidyr_1.3.2
#> [77] generics_0.1.4 wateRmelon_2.12.0
#> [79] data.table_1.18.2.1 rtracklayer_1.66.0
#> [81] httr_1.4.8 htmlwidgets_1.6.4
#> [83] S4Arrays_1.6.0 pkgconfig_2.0.3
#> [85] blob_1.3.0 siggenes_1.80.0
#> [87] impute_1.80.0 htmltools_0.5.9
#> [89] bookdown_0.46 geneplotter_1.84.0
#> [91] ROC_1.82.0 png_0.1-9
#> [93] knitr_1.51 tzdb_0.5.0
#> [95] rjson_0.2.23 nlme_3.1-169
#> [97] curl_7.0.0 cachem_1.1.0
#> [99] rhdf5_2.50.2 BiocVersion_3.20.0
#> [101] KernSmooth_2.23-26 AnnotationDbi_1.68.0
#> [103] restfulr_0.0.16 desc_1.4.3
#> [105] GEOquery_2.74.0 pillar_1.11.1
#> [107] grid_4.4.2 reshape_0.8.10
#> [109] vctrs_0.7.2 gplots_3.3.0
#> [111] ENmix_1.42.2 dbplyr_2.5.2
#> [113] xtable_1.8-8 cluster_2.1.8.2
#> [115] evaluate_1.0.5 readr_2.2.0
#> [117] GenomicFeatures_1.58.0 cli_3.6.5
#> [119] compiler_4.4.2 Rsamtools_2.22.0
#> [121] rlang_1.1.7 crayon_1.5.3
#> [123] rngtools_1.5.2 nor1mix_1.3-3
#> [125] mclust_6.1.2 affy_1.84.0
#> [127] plyr_1.8.9 fs_2.0.1
#> [129] BiocParallel_1.40.2 nleqslv_3.3.6
#> [131] Matrix_1.7-5 ExperimentHub_2.14.0
#> [133] hms_1.1.4 sparseMatrixStats_1.18.0
#> [135] bit64_4.6.0-1 Rhdf5lib_1.28.0
#> [137] KEGGREST_1.46.0 statmod_1.5.1
#> [139] AnnotationHub_3.14.0 memoise_2.0.1
#> [141] affyio_1.76.0 bslib_0.10.0
#> [143] bit_4.6.0Finally, we generate a summary report that compiles figures and results from all previous steps.
dnamReport(
output = "DNAm_Report.pdf",
outputDir = file.path(tmpDir, "reports"),
qcDir = file.path(
tmpDir, "figures", "preprocessingMinfiEwasWater", "enMix"
),
preprocessingDir = file.path(
tmpDir, "figures", "preprocessingMinfiEwasWater", "qc"
),
postprocessingDir = file.path(
tmpDir, "figures", "preprocessingMinfiEwasWater", "metrics"
),
svaDir = file.path(
tmpDir, "figures", "svaEnmix"
),
glmDir = file.path(
tmpDir, "figures", "methylationGLM_T1"
),
glmmDir = file.path(
tmpDir, "figures", "methylationGLMM_T1T2"
),
figDir = file.path(
tmpDir, "reports", "figures"
),
reportTitle = "DNA methylation",
author = "School of Biomedical Sciences",
date = format(Sys.Date(), "%B %d, %Y")
)
#> processing file: DNAm.Rmd
#> output file: DNAm.knit.md
#> /usr/bin/pandoc +RTS -K512m -RTS DNAm.knit.md --to latex --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmpl8RIvi/dnaEPICO_Test/reports/DNAm_Report.tex --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --table-of-contents --toc-depth 2 --number-sections --highlight-style tango --pdf-engine pdflatex --variable graphics --extract-media /tmp/Rtmpl8RIvi/dnaEPICO_Test/reports/DNAm_Report_files --include-in-header /tmp/Rtmpl8RIvi/rmarkdown-straaec555e518c.html
#>
#> Output created: /tmp/Rtmpl8RIvi/dnaEPICO_Test/reports/DNAm_Report.pdfFor large studies or cluster environments, the same workflow can be executed via a Makefile.
# Extract Makefile
extractMake(destDir = tmpDir)
oldWd <- getwd()
setwd(tmpDir)
on.exit(setwd(oldWd), add = TRUE)
system(paste(
"make f3",
"MODEL=model1",
"PHENO_FILE=data/preprocessingMinfiEwasWater/sample_minfi.csv",
"ARRAY_TYPE=IlluminaHumanMethylation450k",
"ANNOTATION_VERSION=ilmn12.hg19",
"SEX_COLUMN=sex",
"PLOT_GROUP_VAR=sex",
"LC_REF=FlowSorted.Blood.450k",
"TIMEPOINTS=1",
"COMBINE_TIMEPOINTS=1",
"R_DIR=NULL",
"N_SAMPLES=NA",
"N_CORES=1"
))For this example, the pipeline is run using one explicit
model (model1) rather than multiple parallel
models. This mirrors the single-dataset workflow demonstrated earlier in
R.
The phenotype file generated from minfiData
(sample_minfi.csv) is used for consistency between:
Setting R_DIR = NULL tells the pipeline to use the
default R library paths, which is appropriate for local
testing. You need to use a directory if you are running this in HPC
environments.
GLOBAL PARAMETERS:
These parameters ensure the Makefile uses the:
Date the vignette was generated.
#> [1] "2026-04-02 05:18:05 UTC"
Wallclock time spent generating the vignette.
#> Time difference of 2.299 mins
R session information.
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] FlowSorted.Blood.450k_1.44.0
#> [2] minfiData_0.52.0
#> [3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#> [4] IlluminaHumanMethylation450kmanifest_0.4.0
#> [5] minfi_1.52.1
#> [6] bumphunter_1.48.0
#> [7] locfit_1.5-9.12
#> [8] iterators_1.0.14
#> [9] foreach_1.5.2
#> [10] Biostrings_2.74.1
#> [11] XVector_0.46.0
#> [12] SummarizedExperiment_1.36.0
#> [13] Biobase_2.66.0
#> [14] MatrixGenerics_1.18.1
#> [15] matrixStats_1.5.0
#> [16] GenomicRanges_1.58.0
#> [17] GenomeInfoDb_1.42.3
#> [18] IRanges_2.40.1
#> [19] S4Vectors_0.44.0
#> [20] BiocGenerics_0.52.0
#> [21] dnaEPICO_0.99.10
#> [22] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 BiocIO_1.16.0
#> [3] bitops_1.0-9 filelock_1.0.3
#> [5] tibble_3.3.1 methylumi_2.52.0
#> [7] preprocessCore_1.68.0 XML_3.99-0.23
#> [9] lifecycle_1.0.5 doParallel_1.0.17
#> [11] lattice_0.22-9 MASS_7.3-65
#> [13] base64_2.0.2 scrime_1.3.7
#> [15] magrittr_2.0.4 limma_3.62.2
#> [17] sass_0.4.10 rmarkdown_2.31
#> [19] jquerylib_0.1.4 yaml_2.3.12
#> [21] otel_0.2.0 doRNG_1.8.6.3
#> [23] askpass_1.2.1 DBI_1.3.0
#> [25] RColorBrewer_1.1-3 lumi_2.58.0
#> [27] abind_1.4-8 zlibbioc_1.52.0
#> [29] quadprog_1.5-8 purrr_1.2.1
#> [31] RCurl_1.98-1.18 rappdirs_0.3.4
#> [33] GenomeInfoDbData_1.2.13 irlba_2.3.7
#> [35] rentrez_1.2.4 genefilter_1.88.0
#> [37] annotate_1.84.0 pkgdown_2.2.0
#> [39] DelayedMatrixStats_1.28.1 codetools_0.2-20
#> [41] DelayedArray_0.32.0 xml2_1.5.2
#> [43] tidyselect_1.2.1 UCSC.utils_1.2.0
#> [45] beanplot_1.3.1 BiocFileCache_2.14.0
#> [47] dynamicTreeCut_1.63-1 illuminaio_0.48.0
#> [49] GenomicAlignments_1.42.0 jsonlite_2.0.0
#> [51] multtest_2.62.0 survival_3.8-6
#> [53] systemfonts_1.3.2 tools_4.4.2
#> [55] ragg_1.5.2 Rcpp_1.1.1
#> [57] glue_1.8.0 SparseArray_1.6.2
#> [59] mgcv_1.9-4 xfun_0.57
#> [61] dplyr_1.2.0 HDF5Array_1.34.0
#> [63] withr_3.0.2 BiocManager_1.30.27
#> [65] fastmap_1.2.0 rhdf5filters_1.18.1
#> [67] openssl_2.3.5 caTools_1.18.3
#> [69] digest_0.6.39 R6_2.6.1
#> [71] colorspace_2.1-2 textshaping_1.0.5
#> [73] RPMM_1.25 gtools_3.9.5
#> [75] RSQLite_2.4.6 tidyr_1.3.2
#> [77] generics_0.1.4 wateRmelon_2.12.0
#> [79] data.table_1.18.2.1 rtracklayer_1.66.0
#> [81] httr_1.4.8 htmlwidgets_1.6.4
#> [83] S4Arrays_1.6.0 pkgconfig_2.0.3
#> [85] blob_1.3.0 siggenes_1.80.0
#> [87] impute_1.80.0 htmltools_0.5.9
#> [89] bookdown_0.46 geneplotter_1.84.0
#> [91] ROC_1.82.0 png_0.1-9
#> [93] knitr_1.51 tzdb_0.5.0
#> [95] rjson_0.2.23 nlme_3.1-169
#> [97] curl_7.0.0 cachem_1.1.0
#> [99] rhdf5_2.50.2 BiocVersion_3.20.0
#> [101] KernSmooth_2.23-26 AnnotationDbi_1.68.0
#> [103] restfulr_0.0.16 desc_1.4.3
#> [105] GEOquery_2.74.0 pillar_1.11.1
#> [107] grid_4.4.2 reshape_0.8.10
#> [109] vctrs_0.7.2 gplots_3.3.0
#> [111] ENmix_1.42.2 dbplyr_2.5.2
#> [113] xtable_1.8-8 cluster_2.1.8.2
#> [115] evaluate_1.0.5 tinytex_0.59
#> [117] readr_2.2.0 GenomicFeatures_1.58.0
#> [119] cli_3.6.5 compiler_4.4.2
#> [121] Rsamtools_2.22.0 rlang_1.1.7
#> [123] crayon_1.5.3 rngtools_1.5.2
#> [125] nor1mix_1.3-3 mclust_6.1.2
#> [127] affy_1.84.0 plyr_1.8.9
#> [129] fs_2.0.1 BiocParallel_1.40.2
#> [131] nleqslv_3.3.6 tiff_0.1-12
#> [133] Matrix_1.7-5 ExperimentHub_2.14.0
#> [135] hms_1.1.4 sparseMatrixStats_1.18.0
#> [137] bit64_4.6.0-1 Rhdf5lib_1.28.0
#> [139] KEGGREST_1.46.0 statmod_1.5.1
#> [141] AnnotationHub_3.14.0 memoise_2.0.1
#> [143] affyio_1.76.0 bslib_0.10.0
#> [145] bit_4.6.0
This package was developed using biocthis.
This vignette was generated using BiocStyle, with knitr and rmarkdown running behind the scenes.
dnaEPICO is built on core Bioconductor infrastructure for high-dimensional genomic data, with a focus on Illumina DNA methylation arrays.
Preprocessing and quality control are performed using established Bioconductor tools, including minfi, ENmix, and wateRmelon. Downstream statistical modelling relies on base R and CRAN frameworks, including generalised linear models and linear mixed-effects models. Users are expected to have basic familiarity with R, command-line execution, and Illumina IDAT file structures. Downstream statistical modelling relies on base R and CRAN frameworks, including generalised linear models and linear mixed-effects models. Users are expected to have basic familiarity with R, command-line execution, and Illumina IDAT file structures.
If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.
As package developers, we try to explain clearly how to use our
packages and in which order to use the functions. But R and
Bioconductor have a steep learning curve so it is critical
to learn where to ask for help. The blog post quoted above mentions some
but we would like to highlight the Bioconductor support site
as the main resource for getting help: remember to use the
dnaEPICO tag and check the older
posts. Other alternatives are available such as creating GitHub
issues and tweeting. However, please note that if you want to receive
help you should adhere to the posting
guidelines. It is particularly critical that you provide a small
reproducible example and your session information so package developers
can track down the source of the error.