Pipeline Use of dnaEPICO

Introduction

This vignette describes the file-based route in dnaEPICO. In this mode, the same wrapper functions still return structured result objects, but they also write files to disk when saveOutputs = TRUE. This route is appropriate when you want a reproducible folder structure, Makefile-driven execution, or a pipeline that can be resumed on a local workstation or an HPC system.

Citation

The dnaEPICO package uses methods from several external packages. Because no single manuscript describes all components, the guidance below explains how to cite dnaEPICO depending on the functions you use.

This citation guidance is adapted from the vignettes and user guides of minfi, ENmix, and wateRmelon.

If you use make f3 MODEL=model1, please cite Aryee et al. (2014); Fortin, Triche Jr., and Hansen (2017); Xu, Niu, and Taylor (2021); Pidsley et al. (2013); Maksimovic, Gordon, and Oshlack (2012); Fortin et al. (2014); Triche et al. (2013); Touleimat and Tost (2012); Murat et al. (2020).
If you use make f4 MODEL=model1, please cite Aryee et al. (2014); Fortin, Triche Jr., and Hansen (2017); Xu, Niu, and Taylor (2021); Pidsley et al. (2013); Maksimovic, Gordon, and Oshlack (2012); Fortin et al. (2014); Triche et al. (2013); Touleimat and Tost (2012); Murat et al. (2020); Marschner (2011).
If you use make f3lme MODEL=model1, please cite Aryee et al. (2014); Fortin, Triche Jr., and Hansen (2017); Xu, Niu, and Taylor (2021); Pidsley et al. (2013); Maksimovic, Gordon, and Oshlack (2012); Fortin et al. (2014); Triche et al. (2013); Touleimat and Tost (2012); Murat et al. (2020); Bates et al. (2015).
If you use make all MODEL=model1, please cite Aryee et al. (2014); Fortin, Triche Jr., and Hansen (2017); Xu, Niu, and Taylor (2021); Pidsley et al. (2013); Maksimovic, Gordon, and Oshlack (2012); Fortin et al. (2014); Triche et al. (2013); Touleimat and Tost (2012); Murat et al. (2020); Marschner (2011); Bates et al. (2015).

Required knowledge

dnaEPICO is built on core Bioconductor infrastructure for high-dimensional genomic data, with a focus on Illumina DNA methylation arrays. To read this vignette, we assume that you are already familiar with the general DNA methylation workflow. If not, we recommend first reading this tutorial, which provides a practical introduction to the main concepts and analysis steps.

Preprocessing and quality control are performed using established Bioconductor tools, including minfi, ENmix, and wateRmelon. Downstream statistical modelling relies on base R and CRAN frameworks, including generalised linear models and linear mixed-effects models. Users are expected to have basic familiarity with R, command-line execution, and Illumina IDAT file structures.

If you are asking yourself the question “Where do I start using Bioconductor?”, you might be interested in this blog post.

Installation

The dnaEPICO package depends on minfi, ENmix, and wateRmelon, among others. Install dependencies with BiocManager:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("dnaEPICO")

BiocManager::valid()

library("dnaEPICO")

Extract the example Makefile

The package includes a template Makefile that can be copied into a project directory and adapted to local paths and model definitions.

makefilePath <- extractMake(destDir = tempdir(), overwrite = TRUE)
basename(makefilePath)
#> [1] "Makefile"

After extraction, the file should be edited so that the project-specific paths, model labels, and cluster settings match the target analysis environment.

Arguments:

destDir = tempdir() writes the template to a temporary directory so the example does not modify the working tree.
overwrite = TRUE allows the template to be replaced if it already exists in that temporary location.

Write preprocessingMinfiEwasWater outputs to disk

The file-based route usually starts with preprocessingMinfiEwasWater(). This step writes the filtered RGSet, normalized metric matrices, quality-control figures, and the phenoLC.csv file that will be consumed by svaEnmix() and preprocessingPheno().

Create example input files

The next chunk shows how to recreate the temporary files used in this example. It writes a small phenotype table to a temporary directory, copies a small set of IDAT files, and records the cross-reactive probe reference used during probe filtering.

preprocessing_inputs <- dnaEPICO:::exampleMinfiIdatInputsDnaEpico(n = 6L)

names(preprocessing_inputs)
#> [1] "tempDir"           "idatFolder"        "phenoFile"        
#> [4] "targets"           "arrayType"         "annotationVersion"
#> [7] "crossReactivePath"
preprocessing_inputs$tempDir
#> [1] "/tmp/Rtmp9wkOQD/dnaEPICO-idat-example-32e672b243fd"
basename(preprocessing_inputs$phenoFile)
#> [1] "pheno.csv"
head(basename(list.files(preprocessing_inputs$idatFolder, full.names = TRUE)), 4)
#> [1] "5723646052_R02C02_Grn.idat" "5723646052_R02C02_Red.idat"
#> [3] "5723646052_R04C01_Grn.idat" "5723646052_R04C01_Red.idat"
basename(preprocessing_inputs$crossReactivePath)
#> [1] "12864_2024_10027_MOESM8_ESM.csv"

preprocessing_inputs is a small list returned by the internal example helper. names(preprocessing_inputs) shows its main elements: tempDir is the temporary working directory, idatFolder contains the copied example IDAT files, phenoFile is the phenotype table used by the wrapper, targets is the same phenotype information already loaded into memory, arrayType and annotationVersion describe the array platform, and crossReactivePath points to the cross-reactive probe reference used during probe filtering.

preprocessPipelineResult <- preprocessingMinfiEwasWater(
  phenoFile = preprocessing_inputs$phenoFile,
  idatFolder = preprocessing_inputs$idatFolder,
  outputLogs = file.path(preprocessing_inputs$tempDir, "logs"),
  nSamples = 6,
  SampleID = "Sample_Name",
  arrayType = preprocessing_inputs$arrayType,
  annotationVersion = preprocessing_inputs$annotationVersion,
  scriptLabel = "preprocessingMinfiEwasWater",
  baseDataFolder = file.path(preprocessing_inputs$tempDir, "rData"),
  figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"),
  detPThreshold = 1,
  normMethods = "quantile",
  sexColumn = "Sex",
  pvalThreshold = 1,
  chrToRemove = "",
  snpsToRemove = "SBE",
  mafThreshold = 1,
  crossReactivePath = preprocessing_inputs$crossReactivePath,
  plotGroupVar = "Sex",
  lcRef = "saliva",
  phenoOrder = "Sample_Name;Sex;Basename;Sentrix_ID;Sentrix_Position",
  lcPhenoDir = file.path(
    preprocessing_inputs$tempDir,
    "data",
    "preprocessingMinfiEwasWater"
  ),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)
#> Plotting  STAINING .jpg
#> Plotting  EXTENSION .jpg
#> Plotting  HYBRIDIZATION .jpg
#> Plotting  TARGET_REMOVAL .jpg
#> Plotting  BISULFITE_CONVERSION_I .jpg
#> Plotting  BISULFITE_CONVERSION_II .jpg
#> Plotting  SPECIFICITY_I .jpg
#> Plotting  SPECIFICITY_II .jpg
#> Plotting  NON-POLYMORPHIC .jpg
#> Plotting  NEGATIVE .jpg
#> Plotting  NORM_A .jpg
#> Plotting  NORM_C .jpg
#> Plotting  NORM_G .jpg
#> Plotting  NORM_T .jpg
#> Plotting  NORM_ACGT .jpg

preprocess_paths <- c(
  phenoLC = file.path(
    preprocessing_inputs$tempDir,
    "data",
    "preprocessingMinfiEwasWater",
    "phenoLC.csv"
  ),
  rgset = file.path(
    preprocessing_inputs$tempDir,
    "rData",
    "preprocessingMinfiEwasWater",
    "objects",
    "RGSet.RData"
  ),
  beta = file.path(
    preprocessing_inputs$tempDir,
    "rData",
    "preprocessingMinfiEwasWater",
    "metrics",
    "beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
  ),
  qc = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "qc",
    "quality_control(MSet).tiff"
  )
)

class(preprocessPipelineResult)
#> [1] "dnaEPICO_preprocessingMinfiEwasWater"
basename(preprocess_paths)
#> [1] "phenoLC.csv"                           
#> [2] "RGSet.RData"                           
#> [3] "beta_NomFilt_MSetF_Flt_Rxy_Ds_Rc.RData"
#> [4] "quality_control(MSet).tiff"
file.exists(preprocess_paths)
#> [1] TRUE TRUE TRUE TRUE

The returned object still has class dnaEPICO_preprocessingMinfiEwasWater, but the printed vectors in this chunk focus on the written outputs. In preprocess_paths, phenoLC is the phenotype table with estimated cell proportions, rgset is the filtered RGSet, beta is the saved beta matrix used later by phenotype preparation, and qc is an example quality-control figure. basename(preprocess_paths) shows the file names, while file.exists(preprocess_paths) confirms that each one was written successfully.

Arguments:

phenoFile and idatFolder define the phenotype table and the IDAT folder used to start the pipeline.
SampleID = "Sample_Name" is the key used to align phenotype rows with the methylation objects.
arrayType and annotationVersion define the array manifest and annotation used by minfi.
baseDataFolder, figureBaseDir, and lcPhenoDir define where serialized objects, figures, and phenoLC.csv are written.
detPThreshold, normMethods, pvalThreshold, chrToRemove, snpsToRemove, mafThreshold, and crossReactivePath define the main preprocessing and filtering choices.
plotGroupVar, lcRef, and phenoOrder control QC grouping, cell-type estimation, and the structure of the exported phenotype table.
saveOutputs = TRUE is what turns this wrapper into the first file-producing step of the pipeline.

Write SVA outputs to disk

The next example shows svaEnmix() in file-writing mode. The function still returns a structured result object, but the savedFiles element now records the main output files written to disk.

Reuse the temporary files written in Step 1

In the file-based route, svaEnmix() consumes the phenoLC.csv and RGSet.RData files written by preprocessingMinfiEwasWater(). The next chunk shows how those temporary file paths are reconstructed locally.

sva_targets_file <- file.path(
  preprocessing_inputs$tempDir,
  "data",
  "preprocessingMinfiEwasWater",
  "phenoLC.csv"
)
sva_rgset_file <- file.path(
  preprocessing_inputs$tempDir,
  "rData",
  "preprocessingMinfiEwasWater",
  "objects",
  "RGSet.RData"
)

basename(c(sva_targets_file, sva_rgset_file))
#> [1] "phenoLC.csv" "RGSet.RData"
file.exists(c(sva_targets_file, sva_rgset_file))
#> [1] TRUE TRUE

Unlike the previous helper objects, sva_targets_file and sva_rgset_file are plain file paths reconstructed from preprocessing_inputs$tempDir. They point to the concrete outputs written by Step 1: phenoLC.csv, which is the phenotype table augmented with cell-composition estimates, and RGSet.RData, which is the filtered methylation object saved to disk.

svaPipelineResult <- svaEnmix(
  phenoFile = sva_targets_file,
  rgsetData = sva_rgset_file,
  outputLogs = file.path(preprocessing_inputs$tempDir, "logs"),
  SampleID = "Sample_Name",
  arrayType = "IlluminaHumanMethylation450k",
  annotationVersion = "ilmn12.hg19",
  SentrixIDColumn = "Sentrix_ID",
  SentrixPositionColumn = "Sentrix_Position",
  ctrlSvaPercVar = 0.90,
  ctrlSvaFlag = 1,
  scriptLabel = "svaEnmix",
  dataBaseDir = file.path(preprocessing_inputs$tempDir, "data"),
  rBaseDir = file.path(preprocessing_inputs$tempDir, "rData"),
  figureBaseDir = file.path(preprocessing_inputs$tempDir, "figures"),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)
#> 3  surrogate variables explain  91.17398 % of 
#>     data variation

class(svaPipelineResult$savedFiles)
#> [1] "dnaEPICO_svaEnmix_paths"
names(svaPipelineResult$savedFiles)
#> [1] "svaRData"     "svaCSV"       "phenoWithSva" "dataDir"      "rDir"
basename(unlist(svaPipelineResult$savedFiles, use.names = FALSE))
#> [1] "svaMatrix.RData" "svaMatrix.csv"   "phenoLC.csv"     "svaEnmix"       
#> [5] "svaEnmix"

The returned savedFiles object records the main paths written by this step. names(svaPipelineResult$savedFiles) identifies the output groups: svaRData is the serialized surrogate-variable matrix, svaCSV is the CSV version of that matrix, phenoWithSva is the updated phenotype table with appended SVA columns, and dataDir / rDir are the parent directories used for the saved outputs. The basename(...) call shows the file names generated from those paths.

Arguments:

phenoFile and rgsetData identify the phenotype table and saved RGSet produced by the earlier preprocessing step.
SampleID, SentrixIDColumn, and SentrixPositionColumn specify how samples and array positions are identified in the phenotype table.
ctrlSvaPercVar = 0.90 keeps enough control-derived surrogate variables to explain 90% of the control-probe variance.
ctrlSvaFlag = 1 enables the ENmix control-based SVA workflow.
dataBaseDir and rBaseDir define where file outputs are written when saveOutputs = TRUE.
display = FALSE and verbose = FALSE keep the console output quiet, while logs = TRUE writes the log files used by the report.

Write preprocessingPheno outputs to disk

preprocessingPheno() is the step that usually prepares the largest number of pipeline-ready files because it writes timepoint-specific tables, combined longitudinal tables, and the Clock Foundation export inputs.

Create example input files

The next chunk shows how to recreate the temporary phenotype and matrix files used in this example. The helper writes a phenotype table plus aligned beta, M-value, and copy-number objects to a temporary directory.

pheno_inputs <- dnaEPICO:::examplePreprocessingPhenoStateDnaEpico()

names(pheno_inputs)
#>  [1] "tempDir"           "pheno"             "phenoPath"        
#>  [4] "betaPath"          "mPath"             "cnPath"           
#>  [7] "metricsData"       "timepointData"     "combinedData"     
#> [10] "clockFoundation"   "preprocessingData"
pheno_inputs$tempDir
#> [1] "/tmp/Rtmp9wkOQD/dnaEPICO-preprocessingPheno-example-32e6298064f9"
basename(c(
  pheno_inputs$phenoPath,
  pheno_inputs$betaPath,
  pheno_inputs$mPath,
  pheno_inputs$cnPath
))
#> [1] "phenoLC.csv" "beta.RData"  "m.RData"     "cn.RData"

pheno_inputs is a list that bundles the temporary files and the aligned in-memory objects used in this stage. names(pheno_inputs) shows that it contains the temporary directory, the phenotype table, the saved beta, M-value, and copy-number files, and precomputed helper objects such as metricsData, timepointData, combinedData, and clockFoundation.

phenoPipelineResult <- preprocessingPheno(
  phenoFile = pheno_inputs$phenoPath,
  betaPath = pheno_inputs$betaPath,
  mPath = pheno_inputs$mPath,
  cnPath = pheno_inputs$cnPath,
  SampleID = "Sample_Name",
  timeVar = "Timepoint",
  timepoints = "1,2",
  combineTimepoints = "1,2",
  outputPheno = file.path(pheno_inputs$tempDir, "data", "preprocessingPheno"),
  outputRData = file.path(
    pheno_inputs$tempDir,
    "rData",
    "preprocessingPheno",
    "metrics"
  ),
  outputRDataMerge = file.path(
    pheno_inputs$tempDir,
    "rData",
    "preprocessingPheno",
    "mergeData"
  ),
  sexColumn = "Sex",
  outputLogs = file.path(pheno_inputs$tempDir, "logs"),
  outputDir = file.path(pheno_inputs$tempDir, "clockFoundation"),
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

names(phenoPipelineResult$savedFiles)
#> [1] "timepoints"        "combinedPheno"     "combinedPhenoBeta"
#> [4] "betaCSV"           "betaZIP"           "phenoCF"
names(phenoPipelineResult$savedFiles$timepoints)
#> [1] "1" "2"
basename(unlist(phenoPipelineResult$savedFiles$timepoints[["1"]]))
#> [1] "phenoT1.csv"       "betaT1.RData"      "mT1.RData"        
#> [4] "phenoBetaT1.RData"
basename(unlist(phenoPipelineResult$savedFiles[c(
  "combinedPheno",
  "combinedPhenoBeta",
  "betaCSV",
  "phenoCF"
)]))
#> [1] "phenoT1T2.csv"       "phenoBetaT1T2.RData" "beta.csv"           
#> [4] "phenoCF.csv"

These outputs are the files most commonly consumed by downstream modeling functions. names(phenoPipelineResult$savedFiles) shows the high-level output groups returned by the wrapper. names(phenoPipelineResult$savedFiles$timepoints) lists the available timepoint-specific subsets. The basename(...) calls then show examples of the concrete files written for one timepoint and for the combined outputs. In this structure, combinedPhenoBeta is the merged phenotype-plus-beta object that becomes the direct input to the GLM and GLMM wrappers, while betaCSV, betaZIP, and phenoCF are the Clock Foundation exports.

Arguments:

phenoFile, betaPath, mPath, and cnPath point to the phenotype table and the aligned methylation matrices from the previous workflow stages.
SampleID = "Sample_Name" defines the sample key used during alignment.
timeVar = "Timepoint", timepoints = "1,2", and combineTimepoints = "1,2" determine the timepoint-specific outputs and the combined longitudinal object.
outputPheno, outputRData, outputRDataMerge, and outputDir define the directories used for phenotype exports, metric objects, merged objects, and Clock Foundation inputs.
saveOutputs = TRUE is what makes this step useful in a file-based pipeline, because later modeling steps consume these written files.

Write GLM model outputs to disk

The modeling wrappers also separate in-memory results from file outputs. The next example runs a small GLM analysis and prints the names of the files written to disk.

Create example input files

The next chunk shows how to recreate the temporary merged phenotype-plus-beta input file used by the GLM example.

glm_inputs <- dnaEPICO:::exampleMethylationGLMStateDnaEpico()

names(glm_inputs)
#> [1] "tempDir"        "inputPath"      "preparedData"   "modelResults"  
#> [5] "modelSummaries" "annotationData"
glm_inputs$tempDir
#> [1] "/tmp/Rtmp9wkOQD/dnaEPICO-glm-example-32e62b292260"
basename(glm_inputs$inputPath)
#> [1] "phenoBT1.RData"
file.exists(glm_inputs$inputPath)
#> [1] TRUE

glm_inputs is a list returned by the GLM example helper. names(glm_inputs) shows that it contains the temporary directory, the saved merged phenotype-plus-beta input file (inputPath), and the corresponding in-memory objects produced during helper construction: preparedData, modelResults, and modelSummaries.

glmPipelineResult <- methylationGLM_T1(
  inputPheno = glm_inputs$inputPath,
  phenotypes = "status",
  covariates = "sex,age",
  factorVars = "status,sex",
  cpgLimit = 2,
  nCores = 1,
  outputLogs = file.path(glm_inputs$tempDir, "logs"),
  outputRData = file.path(glm_inputs$tempDir, "rData", "models"),
  outputPlots = file.path(glm_inputs$tempDir, "figures"),
  significantCpGDir = file.path(glm_inputs$tempDir, "significant"),
  summaryTxtDir = file.path(glm_inputs$tempDir, "summaries"),
  summaryPval = 1,
  significantCpGPval = 1,
  annotationPackage = "IlluminaHumanMethylation450kanno.ilmn12.hg19",
  annotationCols = "Name,chr,pos",
  annotatedGLMOut = file.path(glm_inputs$tempDir, "annotated"),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

class(glmPipelineResult$savedFiles)
#> [1] "dnaEPICO_methylationGLM_T1_paths"
names(glmPipelineResult$savedFiles)
#> [1] "modelFiles"          "summaryFiles"        "summaryTxtFiles"    
#> [4] "significantCpGFiles" "annotatedGLM"

The returned savedFiles object records the groups of outputs written by the GLM step. names(glmPipelineResult$savedFiles) typically includes the model files, summary files, diagnostic plot files, significant-CpG exports, summary text files, and the annotated results table. This is the object to inspect when you want to confirm which modeling outputs were written and how they are grouped.

Arguments:

inputPheno points to the merged phenotype-plus-beta object generated by preprocessingPheno().
phenotypes lists the outcome variables that will be modeled one at a time.
covariates lists the adjustment variables included in every model.
factorVars identifies the categorical variables that should be treated as factors before model fitting.
cpgLimit = 2 keeps the example fast by fitting only two CpG models.
nCores = 1 keeps the example deterministic and lightweight; production HPC runs typically use higher values.
annotationPackage and annotationCols control which array annotation is merged into the final summary tables.
outputRData, outputPlots, significantCpGDir, summaryTxtDir, and annotatedGLMOut define where model objects, plots, summaries, and annotated results are written.
saveOutputs = TRUE enables the file-based pipeline behaviour expected by Makefile and HPC use.

Write LME model outputs to disk

The longitudinal modeling step follows the same pattern as the GLM wrapper, but the saved outputs now correspond to mixed-effects models, longitudinal summary tables, and interaction-specific result files.

Create example input files

The next chunk shows how to recreate the temporary combined longitudinal phenotype-plus-beta input file used by the LME example.

glmm_inputs <- dnaEPICO:::exampleMethylationGLMMStateDnaEpico()

names(glmm_inputs)
#> [1] "tempDir"        "inputPath"      "preparedData"   "modelResults"  
#> [5] "modelSummaries" "annotationData"
glmm_inputs$tempDir
#> [1] "/tmp/Rtmp9wkOQD/dnaEPICO-glmm-example-32e66576c4be"
basename(glmm_inputs$inputPath)
#> [1] "phenoBT1T2.RData"
file.exists(glmm_inputs$inputPath)
#> [1] TRUE

glmm_inputs plays the same role for the longitudinal example. The helper returns a list with the temporary directory, the saved combined longitudinal input file (inputPath), and the in-memory objects used to build that example: preparedData, modelResults, and modelSummaries.

glmmPipelineResult <- methylationGLMM_T1T2(
  inputPheno = glmm_inputs$inputPath,
  outputLogs = file.path(glmm_inputs$tempDir, "logs"),
  outputRData = file.path(glmm_inputs$tempDir, "rData", "models"),
  outputPlots = file.path(glmm_inputs$tempDir, "figures"),
  personVar = "person",
  timeVar = "Timepoint",
  phenotypes = "score",
  covariates = "sex",
  factorVars = "sex,Timepoint",
  cpgLimit = 2,
  nCores = 1,
  summaryPval = 1,
  saveSignificantInteractions = TRUE,
  significantInteractionDir = file.path(
    glmm_inputs$tempDir,
    "results",
    "cpgs",
    "methylationGLMM_T1T2"
  ),
  significantInteractionPval = 1,
  saveTxtSummaries = TRUE,
  summaryTxtDir = file.path(
    glmm_inputs$tempDir,
    "results",
    "summary",
    "methylationGLMM_T1T2"
  ),
  annotationPackage = "IlluminaHumanMethylation450kanno.ilmn12.hg19",
  annotationCols = "Name,chr,pos",
  annotatedLMEOut = file.path(glmm_inputs$tempDir, "annotated"),
  display = FALSE,
  verbose = FALSE,
  logs = TRUE,
  saveOutputs = TRUE
)

class(glmmPipelineResult$savedFiles)
#> [1] "dnaEPICO_methylationGLMM_T1T2_paths"
names(glmmPipelineResult$savedFiles)
#> [1] "modelFiles"                  "summaryFiles"               
#> [3] "summaryTxtFiles"             "significantInteractionFiles"
#> [5] "annotatedLME"

The returned savedFiles object records the groups of outputs written by the LME step. names(glmmPipelineResult$savedFiles) typically includes the model files, summary files, diagnostic plot files, summary text files, significant interaction exports, and the annotated longitudinal results table. This is the object to inspect when you want to confirm which longitudinal modeling outputs were written and how they are grouped.

Arguments:

inputPheno points to the combined longitudinal phenotype-plus-beta object.
personVar defines the participant identifier used for the random effect.
timeVar defines the repeated-measures time variable.
phenotypes, covariates, and factorVars define the fixed-effect portion of the mixed model.
cpgLimit = 2 keeps the example short by fitting only two CpG models.
nCores = 1 keeps the example lightweight; production HPC runs usually use higher parallel settings.
saveSignificantInteractions, significantInteractionDir, and significantInteractionPval control the export of interaction-specific results.
summaryTxtDir defines where plain-text model summaries are written.
annotationPackage, annotationCols, and annotatedLMEOut define the annotation resources and output location for the final longitudinal summary table.
saveOutputs = TRUE enables the file-based longitudinal workflow expected by Makefile and HPC use.

Generate the report from the example outputs

After the file-based examples above have been run, dnamReport() can assemble the phenotype table, ENmix control figures, quality-control figures, batch effect figures, metrics figures, model annotation tables, and logs into one website report. The example below uses the concrete output paths created by the previous chunks.

The preprocessing and SVA examples share preprocessing_inputs$tempDir, while the GLM and LME examples write model outputs into their own temporary directories. Because dnamReport() accepts one path per tab, those outputs can be passed directly.

When running this section from an R session, dnamReport() still needs the Quarto command line interface because the function renders the .qmd files into the final website. You can check whether R can find Quarto with:

Sys.which("quarto")
system2("quarto", "--version")

If Sys.which("quarto") returns an empty string, install the Quarto command line interface from the Quarto get-started page before rendering the report. The R package quarto provides R helper functions for an existing Quarto installation; it does not replace the Quarto command line interface used by dnamReport().

If Quarto is installed but not on PATH, set the executable path before calling dnamReport():

Sys.setenv(QUARTO_BIN = "/full/path/to/quarto")

report_log_dir <- file.path(preprocessing_inputs$tempDir, "logs", "report")
dir.create(report_log_dir, recursive = TRUE, showWarnings = FALSE)

report_logs <- c(
  file.path(
    preprocessing_inputs$tempDir,
    "logs",
    "log_preprocessingMinfiEwasWater.txt"
  ),
  file.path(pheno_inputs$tempDir, "logs", "log_preprocessingPheno.txt"),
  file.path(preprocessing_inputs$tempDir, "logs", "log_svaEnmix.txt"),
  file.path(glm_inputs$tempDir, "logs", "log_methylationGLM_T1.txt"),
  file.path(glmm_inputs$tempDir, "logs", "log_methylationGLMM_T1T2.txt")
)

existing_logs <- report_logs[file.exists(report_logs)]
if (length(existing_logs)) {
  file.copy(existing_logs, report_log_dir, overwrite = TRUE)
}

reportResult <- dnamReport(
  outputDir = file.path(preprocessing_inputs$tempDir, "reports", "example"),
  phenoTab = file.path(
    preprocessing_inputs$tempDir,
    "data",
    "preprocessingMinfiEwasWater",
    "phenoLC.csv"
  ),
  enmixTab = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "enmix"
  ),
  qcTab = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "qc"
  ),
  svaTab = file.path(preprocessing_inputs$tempDir, "figures", "svaEnmix"),
  metricTab = file.path(
    preprocessing_inputs$tempDir,
    "figures",
    "preprocessingMinfiEwasWater",
    "metrics"
  ),
  glmTab = glmPipelineResult$savedFiles$annotatedGLM,
  lmerTab = glmmPipelineResult$savedFiles$annotatedLME,
  logTab = report_log_dir,
  detPPath = file.path(
    preprocessing_inputs$tempDir,
    "rData",
    "preprocessingMinfiEwasWater",
    "qc",
    "detP_RGSet.RData"
  ),
  detPThreshold = 0.01,
  verbose = FALSE,
  logs = TRUE
)

reportResult$outputFile

GLM model structure and PRS terms

For each CpG, methylationGLM_T1() fits a regression model of the form:

$\text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Phenotype} + \sum_{k = 1}^{K} \beta_{k + 1} \cdot \text{Covariate}_k + \varepsilon$

where $\text{Methylation}_{CpG_i}$ is the methylation value at CpG $i$ , $\beta_0$ is the intercept, the phenotype term is the variable of interest, and the remaining fixed effects represent the listed covariates.

If no polygenic risk score is included, a model can be read schematically as:

$\begin{aligned} \text{Methylation}_{CpG_i} =\;& \beta_0 + \beta_1 \cdot \text{Phenotype} + \beta_2 \cdot \text{Age} + \beta_3 \cdot \text{Sex} + \beta_4 \cdot \text{Ethnicity} \\ &+ \varepsilon \end{aligned}$

When prsMap is supplied, a phenotype-specific PRS term is appended only for the matching phenotype. For example, the mapping "Pheno1:PRS_1,Pheno2:PRS_2" means that:

models for Pheno1 include PRS_1
models for Pheno2 include PRS_2
phenotypes without a mapping are fit without a PRS term

For a phenotype mapped to a PRS, the model becomes:

$\text{Methylation}_{CpG_i} = \beta_0 + \beta_1 \cdot \text{Phenotype} + \sum_{k = 1}^{K} \beta_{k + 1} \cdot \text{Covariate}_k + \beta_{\text{PRS}} \cdot \text{PRS} + \varepsilon$

If interactionTerm is supplied, the same framework is extended by adding the requested interaction to the fixed-effect portion of the model.

Longitudinal mixed-effects structure

The longitudinal wrapper methylationGLMM_T1T2() follows the same file-based pattern, but uses a mixed-effects model with a participant-level random intercept. In schematic form:

$\begin{aligned} \text{Methylation}_{CpG_i} =\;& \beta_0 + \beta_1 \cdot \text{Phenotype} + \beta_2 \cdot \text{Timepoint} + \beta_3 \cdot (\text{Phenotype} \times \text{Timepoint}) \\ &+ \sum_{k = 1}^{K} \beta_{k + 3} \cdot \text{Covariate}_k + \beta_{\text{PRS}} \cdot \text{PRS} + b_{\text{person}} + \varepsilon \end{aligned}$

where $b_{\text{person}}$ is the subject-specific random intercept. As in the GLM workflow, the PRS term is included only when the current phenotype is matched in prsMap.

Common arguments in this longitudinal stage are:

inputPheno for the combined longitudinal phenotype-plus-beta object
personVar for the participant identifier used in the random effect
timeVar for the repeated-measures variable
phenotypes, covariates, and factorVars for the fixed-effect structure
prsMap for phenotype-specific PRS terms
cpgLimit and nCores for computational control
annotationPackage and annotationCols for annotated summaries

Typical Makefile use

In a project directory, the Makefile route is usually simpler than calling each wrapper by hand. A minimal project layout looks like this:

my_project/
  Makefile
  metadata/
    pheno_model1.csv
  data/
    preprocessingMinfiEwasWater/
      idats/
        <IDAT files>

The exported template should then be edited so that at least the model name, phenotype path, and directory variables match the project. A minimal example is:

MODEL = model1
PHENO_FILE = metadata/pheno_model1.csv
LOGS_DIR := $(abspath logs)
DATA_DIR = data
RDATA_DIR = rData
RESULTS_DIR = results
FIGURES_DIR = figures

With that structure in place, a typical sequence of commands looks like this:

make step1 MODEL=model1
make step2 MODEL=model1
make step3 MODEL=model1
make f3 MODEL=model1
make f4 MODEL=model1
make f3lme MODEL=model1
make all MODEL=model1
make status MODEL=model1
make clean MODEL=model1

These targets correspond to the grouped outputs defined in the exported rules:

make step1 MODEL=model1 runs preprocessing only.
make step2 MODEL=model1 runs the SVA step using the files written by Step 1.
make step3 MODEL=model1 runs phenotype preparation using the files written by Step 1.
make f3 MODEL=model1 runs preprocessing, SVA, phenotype preparation, and the report target defined by the current rules.
make f4 MODEL=model1 runs the same route, adds the GLM step, and refreshes the report target.
make f3lme MODEL=model1 runs the same route, adds the longitudinal LME step, and refreshes the report target.
make all MODEL=model1 runs the full pipeline, including both modeling steps and the report.
make status MODEL=model1 checks which expected output files are already present.
make clean MODEL=model1 removes the files generated for the selected model while keeping the shared raw input area.

If several models are declared in MODELS, the grouped multi-model targets can be used instead:

make f3_models
make f4_models
make f3lme_models
make models

These grouped targets loop over every model listed in MODELS:

make f3_models runs the f3 route for each declared model.
make f4_models runs the f4 route for each declared model.
make f3lme_models runs the f3lme route for each declared model.
make models runs the full all pipeline for each declared model.

Two maintenance targets are also useful during routine work:

make status MODEL=model1
make clean MODEL=model1
make clean MODEL=all

make status MODEL=model1 reports which expected outputs are already present for one model.
make clean MODEL=model1 removes the generated outputs for one model.
make clean MODEL=all removes the generated outputs for every model listed in MODELS.

The exact target names and variables depend on the Makefile settings extracted for the project. In practice, saveOutputs = TRUE is what makes the individual steps compatible with this route because later steps consume the files written by earlier ones.

Report generation

The exported Makefile contains a report rule whose output is $(STEP6)/$(MODEL)/docs/index.html. With the default directory settings, this becomes reports/<model>/docs/index.html. Users usually do not need to call that file target directly: the grouped targets above depend on it, so Make refreshes the report automatically after the required pipeline outputs are available.

For example, the f3 route builds the report from the Data, ENmix QC, Quality Control, Batch Effect, Metrics, Report, and Logs inputs. The f4, f3lme, and all routes additionally make the GLM and LMER tables available when the corresponding model outputs exist.

The report target reads the same types of outputs generated in the examples above, organised under the Makefile project layout: the phenotype table in data/, quality-control and metric figures in figures/, the detection P-value object in rData/, model annotation tables in data/, and workflow logs in logs/.

The generated site is opened from the report target path:

$(STEP6)/$(MODEL)/docs/index.html

Internally, the Makefile calls dnamReport() with one argument per tab, using the project paths defined by the Makefile variables. The report target requires the Quarto command line interface in the same environment where make is run. If Quarto is not already available, install the Quarto command line interface from the Quarto get-started page, or use the installation method provided by the local computing environment. Before launching the report target, check:

quarto --version

If Quarto is installed outside PATH, export its executable path before running make:

export QUARTO_BIN=/full/path/to/quarto

Full exported Makefile

The full Makefile exported by extractMake() is shown below. This is the template that should be adapted for a project before running the pipeline in the target analysis environment.

# ===============================================
# USER CONFIGURATION
# ===============================================
MAKEFLAGS += --output-sync

# ===============================================
# MODELS SELECTION PARALLEL RUN
# ===============================================
MODEL ?= model1 # Single model
MODELS = modelA modelB modelC # Multiple models for parallel run

# ===============================================
# PER-MODEL OVERRIDES
# ===============================================
ifeq ($(MODEL), modelA)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoA.csv

else ifeq ($(MODEL), modelB)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoB.csv

else ifeq ($(MODEL), modelC)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoC.csv

else # default (model1)
  PHENO_FILE = $(DATA_DIR)/$(STEP1)/pheno.csv

endif

# ===============================================
# DIRECTORIES
# ===============================================
LOGS_DIR := $(abspath logs)
DATA_DIR = data
RDATA_DIR = rData
RESULTS_DIR = results
FIGURES_DIR = figures
STEP1 = preprocessingMinfiEwasWater
STEP2 = svaEnmix
STEP3 = preprocessingPheno
STEP4 = methylationGLM_T1
STEP5 = methylationGLMM_T1T2
STEP6 = reports
METRICS_DIR = metrics
IDAT_DIR = idats
OBJ_DIR = objects
MERGE_DIR = mergeData
MODEL_DIR = models
CPG_DIR = cpgs
SUMMARY_DIR = summary
ENMIX_DIR = enmix
SVA_DIR = sva
QC_DIR = qc
R_DIR = ~/R/x86_64-pc-linux-gnu-library/4.4 # NULL is default R library path

# ===============================================
# GLOBAL PARAMETERS
# ===============================================
# STEP 1-5 PARAMETERS
PHENO_FILE = $(DATA_DIR)/$(STEP1)/phenoEMD.csv
SEP_TYPE = ""
SAMPLE_ID = Sample_Name
N_SAMPLES = 30
ARRAY_TYPE = IlluminaHumanMethylationEPICv2
ANNOTATION_VERSION = 20a1.hg38
TIFF_WIDTH = 2000
TIFF_HEIGHT = 1000
TIFF_RES = 150
SEX_COLUMN = Sex
SENTRIX_ID_COLUMN = Sentrix_ID
SENTRIX_POSITION_COLUMN = Sentrix_Position
BASENAME_COLUMN = Basename
TIME_VAR = Timepoint
PHENO_ORDER = $(SAMPLE_ID);$(TIME_VAR);$(SEX_COLUMN);PredSex;$(BASENAME_COLUMN);$(SENTRIX_ID_COLUMN);$(SENTRIX_POSITION_COLUMN)

# STEP 4-5 PARAMETERS
COVARIATES = Sex
FACTOR_VARS = Sex,TreatmentGroup
CPG_PREFIX = cg
CPG_LIMIT = NA
PRS_MAP = NULL
SUMMARY_PVAL = NA
N_CORES = 64
SAVE_TXT_SUMMARIES = TRUE
CHUNK_SIZE = 10000
FDR_THRESHOLD = 0.05
PADJ_METHOD = fdr
ANNOTATION_PACKAGE = IlluminaHumanMethylationEPICv2anno.20a1.hg38
ANNOTATION_COLS = Name,chr,pos,UCSC_RefGene_Group,UCSC_RefGene_Name,Relation_to_Island,GencodeV41_Group

# ===============================================
# STEP 1 PARAMETERS
# ===============================================
QC_CUTOFF = 10.5
DET_PTYPE = m+u
DET_PTHRESHOLD = 0.05
PVAL_THRESHOLD = 0.01
CHR_TO_REMOVE = chrX,chrY
SNPS_TO_REMOVE = SBE,CpG
CROSS_REACTIVE_FILE = $(DATA_DIR)/$(STEP1)/12864_2024_10027_MOESM8_ESM.csv
MAF_THRESHOLD = 0.1
PLOT_GROUP_VAR = TreatvControl_Time1_vs_Time2
LC_REF = salivaEPIC

# ===============================================
# STEP 2 PARAMETERS
# ===============================================
CTRL_SVA_PERC_VAR = 0.90
CTRL_SVA_FLAG = 1

# ===============================================
# STEP 3 PARAMETERS
# ===============================================
TIMEPOINTS = 1,2
COMBINE_TIMEPOINTS = 1,2

# ===============================================
# STEP 4 PARAMETERS
# ===============================================
PHENOTYPES_GLM = TreatmentGroup
GLM_LIBS = glm2
INTERACTION_GLM = NULL
SUMMARY_RESIDUAL_SD = TRUE
SAVE_SIGNIFICANT_CPGS = TRUE
SIGNIFICANT_CPG_PVAL = 0.1

# ==============================================
# STEP 5 PARAMETERS
# ==============================================
PHENOTYPES_LME = TreatmentGroup
PERSON_VAR = person
LME_LIBS = lme4,lmerTest
INTERACTION_LME = TreatvControl_Time1_vs_Time2
SAVE_SIGNIFICANT_INTERACTIONS = TRUE
SIGNIFICANT_INTERACTION_PVAL = 0.1

# ===============================================
# ARGUMENT NORMALISATION
# ===============================================

ifeq ($(PRS_MAP),NULL)
  PRS_MAP_ARG = NULL
else
  PRS_MAP_ARG = '$(PRS_MAP)'
endif

ifeq ($(R_DIR),NULL)
  R_DIR_ARG = NULL
else
  R_DIR_ARG = '$(R_DIR)'
endif

ifeq ($(INTERACTION_GLM),NULL)
  INTERACTION_GLM_ARG = NULL
else
  INTERACTION_GLM_ARG = '$(INTERACTION_GLM)'
endif

ifeq ($(INTERACTION_LME),NULL)
  INTERACTION_LME_ARG = NULL
else
  INTERACTION_LME_ARG = '$(INTERACTION_LME)'
endif

# ===============================================
# INCLUDE PIPELINE FROM dnaEPICO
# ===============================================
DNAPIPE_MK := $(shell Rscript -e "cat(system.file('extdata','make','Makefile.rules.pipeline',package='dnaEPICO'))")
include $(DNAPIPE_MK)

Makefile section guide

The exported Makefile is organised into sections so that project-specific choices can be edited without changing the pipeline rules themselves.

User configuration

MAKEFLAGS += --output-sync keeps parallel Make output grouped by recipe, which makes logs easier to read.

Models selection and parallel run

MODEL defines a single pipeline configuration to run.
MODELS defines a set of configurations that can be launched in parallel by the grouped Make targets.

In single-model use, commands such as make f4 MODEL=model1 run one workflow. In multi-model use, commands such as make f4_models evaluate the same template for each name listed in MODELS.

Per-model overrides

PHENO_FILE is reassigned according to the selected MODEL.

This block is useful when different cohorts, case definitions, or phenotype encodings should reuse the same analysis pipeline without duplicating the full Makefile.

Directories

LOGS_DIR, DATA_DIR, RDATA_DIR, RESULTS_DIR, and FIGURES_DIR define the top-level folder structure.
STEP1 to STEP6 define the canonical names of the major pipeline stages.
METRICS_DIR, IDAT_DIR, OBJ_DIR, MERGE_DIR, MODEL_DIR, CPG_DIR, SUMMARY_DIR, ENMIX_DIR, SVA_DIR, and QC_DIR define the subdirectory names used by the rules file.
R_DIR optionally points to a custom R library path on shared systems. When set to NULL, the default library path is used.

These variables control where files are written and how paths are composed across all stages of the pipeline.

Global parameters

These parameters are shared across multiple workflow steps.

Step 1 to Step 5 shared parameters

PHENO_FILE: input phenotype table used to start the workflow.
SEP_TYPE: separator used when reading text files.
SAMPLE_ID: phenotype column used to align samples.
N_SAMPLES: optional sample subset for testing; NA uses all samples.
ARRAY_TYPE and ANNOTATION_VERSION: array platform and annotation build.
TIFF_WIDTH, TIFF_HEIGHT, and TIFF_RES: default plot dimensions and resolution.
SEX_COLUMN: phenotype column used for sex-based checks and prediction.
SENTRIX_ID_COLUMN and SENTRIX_POSITION_COLUMN: chip-position metadata used by the SVA stage.
BASENAME_COLUMN: column holding the array basename.
TIME_VAR: phenotype column used for repeated-measures or longitudinal workflows.
PHENO_ORDER: desired order of key phenotype columns in exported tables.

Step 4 and Step 5 shared modeling parameters

COVARIATES: comma-separated adjustment variables included in each model.
FACTOR_VARS: variables that should be treated as categorical before model fitting.
CPG_PREFIX: prefix used to identify CpG columns.
CPG_LIMIT: optional cap on the number of CpGs processed; NA uses all.
PRS_MAP: mapping from phenotype names to PRS variables.
SUMMARY_PVAL: optional p-value cutoff for summary tables.
N_CORES: number of cores used within each R job.
SAVE_TXT_SUMMARIES: whether plain-text summaries should be written.
CHUNK_SIZE: number of CpGs processed per chunk.
FDR_THRESHOLD: false-discovery-rate threshold used for filtering.
PADJ_METHOD: multiple-testing correction method.
ANNOTATION_PACKAGE: array annotation package used for annotated summaries.
ANNOTATION_COLS: annotation fields to merge into output tables.

Step 1 parameters

QC_CUTOFF: sample-quality threshold applied during preprocessing.
DET_PTYPE: detection p-value method.
DET_PTHRESHOLD: detection p-value cutoff.
PVAL_THRESHOLD: probe-level p-value filter applied after preprocessing.
CHR_TO_REMOVE: chromosomes to exclude from downstream matrices.
SNPS_TO_REMOVE: SNP-related probe categories to remove.
CROSS_REACTIVE_FILE: cross-reactive probe reference file.
MAF_THRESHOLD: minor-allele-frequency threshold used during SNP filtering.
PLOT_GROUP_VAR: phenotype variable used for grouped QC plots.
LC_REF: reference panel used for cell-type estimation.

These values control the main preprocessing and filtering choices in preprocessingMinfiEwasWater().

Step 2 parameters

CTRL_SVA_PERC_VAR: target proportion of control-probe variance explained by the retained surrogate variables.
CTRL_SVA_FLAG: enables the control-based ENmix SVA workflow.

These values determine how svaEnmix() estimates and retains surrogate variables.

Step 3 parameters

TIMEPOINTS: comma-separated timepoints exported separately.
COMBINE_TIMEPOINTS: comma-separated timepoints combined into the longitudinal object.

These values determine how preprocessingPheno() splits and recombines the phenotype-methylation data.

Step 4 parameters

PHENOTYPES_GLM: comma-separated phenotype variables that will be modelled one at a time by methylationGLM_T1().
GLM_LIBS: GLM implementation used by methylationGLM_T1().
INTERACTION_GLM: optional interaction term added to the fixed-effect part of the GLM.
SUMMARY_RESIDUAL_SD: whether residual standard deviations are reported in summaries.
SAVE_SIGNIFICANT_CPGS: whether tables of significant CpGs are exported.
SIGNIFICANT_CPG_PVAL: p-value threshold used for those significant-CpG exports.

Step 5 parameters

PHENOTYPES_LME: comma-separated phenotype variables that will be modelled one at a time by methylationGLMM_T1T2().
PERSON_VAR: participant identifier used as the random effect grouping factor.
LME_LIBS: mixed-effects libraries used by methylationGLMM_T1T2().
INTERACTION_LME: optional interaction term used in the longitudinal model.
SAVE_SIGNIFICANT_INTERACTIONS: whether significant interaction tables are exported.
SIGNIFICANT_INTERACTION_PVAL: p-value threshold for those exported interaction results.

Argument normalisation

PRS_MAP_ARG, R_DIR_ARG, INTERACTION_GLM_ARG, and INTERACTION_LME_ARG convert Makefile string values into arguments that can be passed safely to R.

This section is especially important for values such as NULL, because it prevents the literal string "NULL" from being interpreted as a real model value inside the wrapper functions.

Include pipeline from dnaEPICO

DNAPIPE_MK resolves the packaged Makefile.rules.pipeline file.
include $(DNAPIPE_MK) imports the actual pipeline rules after the user configuration has been defined.

This separation is what allows the template Makefile to stay compact while the package keeps the rule implementation in one maintained location.

Summary

The file-based route is the correct choice when you need:

a reproducible folder structure,
files that can be reused by later workflow stages,
Makefile execution, or
repeated analyses across one or more models.

The interactive route and the file-based route are therefore complementary: local use is best for understanding the returned objects, while pipeline use is best for reproducible multi-step execution.

Basics

Date the vignette was generated.

#> [1] "2026-05-19 00:11:50 UTC"

Wallclock time spent generating the vignette.

#> Time difference of 1.493 mins

R session information.

#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [2] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [3] minfi_1.52.1                                      
#>  [4] bumphunter_1.48.0                                 
#>  [5] locfit_1.5-9.12                                   
#>  [6] iterators_1.0.14                                  
#>  [7] foreach_1.5.2                                     
#>  [8] Biostrings_2.74.1                                 
#>  [9] XVector_0.46.0                                    
#> [10] SummarizedExperiment_1.36.0                       
#> [11] Biobase_2.66.0                                    
#> [12] MatrixGenerics_1.18.1                             
#> [13] matrixStats_1.5.0                                 
#> [14] GenomicRanges_1.58.0                              
#> [15] GenomeInfoDb_1.42.3                               
#> [16] IRanges_2.40.1                                    
#> [17] S4Vectors_0.44.0                                  
#> [18] BiocGenerics_0.52.0                               
#> [19] dnaEPICO_0.99.18                                  
#> [20] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2             BiocIO_1.16.0            
#>   [3] filelock_1.0.3            bitops_1.0-9             
#>   [5] tibble_3.3.1              preprocessCore_1.68.0    
#>   [7] XML_3.99-0.23             lifecycle_1.0.5          
#>   [9] Rdpack_2.6.6              doParallel_1.0.17        
#>  [11] lattice_0.22-9            MASS_7.3-65              
#>  [13] base64_2.0.2              scrime_1.3.7             
#>  [15] magrittr_2.0.5            limma_3.62.2             
#>  [17] sass_0.4.10               rmarkdown_2.31           
#>  [19] jquerylib_0.1.4           yaml_2.3.12              
#>  [21] otel_0.2.0                doRNG_1.8.6.3            
#>  [23] askpass_1.2.1             DBI_1.3.0                
#>  [25] minqa_1.2.8               RColorBrewer_1.1-3       
#>  [27] abind_1.4-8               zlibbioc_1.52.0          
#>  [29] quadprog_1.5-8            purrr_1.2.2              
#>  [31] RCurl_1.98-1.18           rappdirs_0.3.4           
#>  [33] GenomeInfoDbData_1.2.13   ggrepel_0.9.8            
#>  [35] irlba_2.3.7               rentrez_1.2.4            
#>  [37] genefilter_1.88.0         annotate_1.84.0          
#>  [39] pkgdown_2.2.0             DelayedMatrixStats_1.28.1
#>  [41] codetools_0.2-20          DelayedArray_0.32.0      
#>  [43] xml2_1.5.2                tidyselect_1.2.1         
#>  [45] glm2_1.2.1                UCSC.utils_1.2.0         
#>  [47] farver_2.1.2              lme4_2.0-1               
#>  [49] beanplot_1.3.1            BiocFileCache_2.14.0     
#>  [51] dynamicTreeCut_1.63-1     illuminaio_0.48.0        
#>  [53] GenomicAlignments_1.42.0  jsonlite_2.0.0           
#>  [55] multtest_2.62.0           survival_3.8-6           
#>  [57] systemfonts_1.3.2         tools_4.4.2              
#>  [59] ragg_1.5.2                Rcpp_1.1.1-1.1           
#>  [61] glue_1.8.1                SparseArray_1.6.2        
#>  [63] xfun_0.57                 dplyr_1.2.1              
#>  [65] HDF5Array_1.34.0          withr_3.0.2              
#>  [67] numDeriv_2016.8-1.1       BiocManager_1.30.27      
#>  [69] fastmap_1.2.0             boot_1.3-32              
#>  [71] rhdf5filters_1.18.1       openssl_2.4.1            
#>  [73] caTools_1.18.3            digest_0.6.39            
#>  [75] R6_2.6.1                  textshaping_1.0.5        
#>  [77] RPMM_1.25                 gtools_3.9.5             
#>  [79] RSQLite_3.52.0            minfiData_0.52.0         
#>  [81] tidyr_1.3.2               generics_0.1.4           
#>  [83] data.table_1.18.4         rtracklayer_1.66.0       
#>  [85] httr_1.4.8                htmlwidgets_1.6.4        
#>  [87] S4Arrays_1.6.0            pkgconfig_2.0.3          
#>  [89] gtable_0.3.6              blob_1.3.0               
#>  [91] S7_0.2.2                  siggenes_1.80.0          
#>  [93] impute_1.80.0             htmltools_0.5.9          
#>  [95] bookdown_0.46             geneplotter_1.84.0       
#>  [97] scales_1.4.0              png_0.1-9                
#>  [99] reformulas_0.4.4          knitr_1.51               
#> [101] tzdb_0.5.0                rjson_0.2.23             
#> [103] nlme_3.1-169              curl_7.1.0               
#> [105] nloptr_2.2.1              cachem_1.1.0             
#> [107] rhdf5_2.50.2              KernSmooth_2.23-26       
#> [109] BiocVersion_3.20.0        AnnotationDbi_1.68.0     
#> [111] restfulr_0.0.16           desc_1.4.3               
#> [113] GEOquery_2.74.0           pillar_1.11.1            
#> [115] grid_4.4.2                reshape_0.8.10           
#> [117] vctrs_0.7.3               gplots_3.3.0             
#> [119] ENmix_1.42.2              dbplyr_2.5.2             
#> [121] xtable_1.8-8              cluster_2.1.8.2          
#> [123] evaluate_1.0.5            readr_2.2.0              
#> [125] GenomicFeatures_1.58.0    cli_3.6.6                
#> [127] compiler_4.4.2            Rsamtools_2.22.0         
#> [129] rlang_1.2.0               crayon_1.5.3             
#> [131] rngtools_1.5.2            labeling_0.4.3           
#> [133] nor1mix_1.3-3             mclust_6.1.2             
#> [135] plyr_1.8.9                fs_2.1.0                 
#> [137] BiocParallel_1.40.2       lmerTest_3.2-1           
#> [139] Matrix_1.7-5              ExperimentHub_2.14.0     
#> [141] hms_1.1.4                 sparseMatrixStats_1.18.0 
#> [143] bit64_4.8.0               ggplot2_4.0.3            
#> [145] Rhdf5lib_1.28.0           KEGGREST_1.46.0          
#> [147] statmod_1.5.2             AnnotationHub_3.14.0     
#> [149] rbibutils_2.4.1           memoise_2.0.1            
#> [151] bslib_0.11.0              bit_4.6.0

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve, so it is critical to learn where to ask for help. We would like to highlight the Bioconductor support site as the main resource for getting help. Please remember to use the dnaEPICO tag and check the older posts. If you want to receive help, please provide a small reproducible example and your session information so the source of the problem can be tracked efficiently.

References

Aryee, Martin J, Andrew E Jaffe, Hector Corrada Bravo, Christine Ladd-Acosta, Andrew P Feinberg, Kasper D Hansen, and Rafael A Irizarry. 2014. “Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays.” Bioinformatics 30 (10): 1363–69. https://doi.org/10.1093/bioinformatics/btu049.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Fortin, Jean-Philippe, Aurélie Labbe, Mathieu Lemire, Brent W Zanke, Thomas J Hudson, Elana J Frtig, Celia MT Greenwood, and Kasper D Hansen. 2014. “Functional normalization of 450k methylation array data improves replication in large cancer studies.” Genome Biology 15 (11): 503. https://doi.org/10.1186/s13059-014-0503-2.

Fortin, Jean-Philippe, Timothy Triche Jr., and Kasper D Hansen. 2017. “Preprocessing, Normalization and Integration of the Illumina HumanMethylationEPIC Array with Minfi.” Bioinformatics 33 (4): 558–60. https://doi.org/10.1093/bioinformatics/btw691.

Maksimovic, Jovana, Lavinia Gordon, and Alicia Oshlack. 2012. “SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips.” Genome Biology 13 (6): R44. https://doi.org/10.1186/gb-2012-13-6-r44.

Marschner, Ian. 2011. “Glm2: Fitting Generalized Linear Models with Convergence Problems.” The R Journal 3: 12–15. https://doi.org/10.32614/RJ-2011-012.

Murat, Kubra, Björn Grüning, Pawel W Poterlowicz, Gareth Westgate, Desmond J Tobin, and Krzysztof Poterlowicz. 2020. “Ewastools: Infinium Human Methylation BeadChip pipeline for population epigenetics integrated into Galaxy.” GigaScience 9 (5): giaa049. https://doi.org/10.1093/gigascience/giaa049.

Pidsley, Ruth, Ching Ching Y Wong, Matteo Volta, Katie Lunnon, Jonathan Mill, and Leonard C Schalkwyk. 2013. “A data-driven approach to preprocessing Illumina 450K methylation array data.” BMC Genomics 14: 293. https://doi.org/10.1186/1471-2164-14-293.

Touleimat, Nizar, and Jörg Tost. 2012. “Complete Pipeline for Infinium() Human Methylation 450K BeadChip Data Processing Using Subset Quantile Normalization for Accurate DNA Methylation Estimation.” Epigenomics 4 (3): 325–41. https://doi.org/10.2217/epi.12.21.

Triche, Timothy J, Daniel J Weisenberger, David Van Den Berg, Peter W Laird, and Kimberly D Siegmund. 2013. “Low-level processing of Illumina Infinium DNA Methylation BeadArrays.” Nucleic Acids Research 41 (7): e90. https://doi.org/10.1093/nar/gkt090.

Xu, Zongli, Li Niu, and Jack A Taylor. 2021. “The ENmix DNA methylation analysis pipeline for Illumina BeadChip and comparisons with seven other preprocessing pipelines.” Clinical Epigenetics 13 (1): 216. https://doi.org/10.1186/s13148-021-01207-1.

Paul Ruiz

Divya Mehta

2026-05-19

Introduction

Citation

Required knowledge

Installation

Extract the example Makefile

Write preprocessingMinfiEwasWater outputs to disk

Create example input files

Write SVA outputs to disk

Reuse the temporary files written in Step 1

Write preprocessingPheno outputs to disk

Create example input files

Write GLM model outputs to disk

Create example input files

Write LME model outputs to disk

Create example input files

Generate the report from the example outputs

GLM model structure and PRS terms

Longitudinal mixed-effects structure

Typical Makefile use

Report generation

Full exported Makefile

Makefile section guide

User configuration

Models selection and parallel run

Per-model overrides

Directories

Global parameters

Step 1 to Step 5 shared parameters

Step 4 and Step 5 shared modeling parameters

Step 1 parameters

Step 2 parameters

Step 3 parameters

Step 4 parameters

Step 5 parameters

Argument normalisation

Include pipeline from dnaEPICO

Summary

Basics

Asking for help

References