Subsetted Differential Gene Expression (DGE) Analysis

Last updated: 2025-11-11

Checks: 7 0

Knit directory: dge-analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20230618)

The command set.seed(20230618) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 17c660d

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 17c660d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .github/.DS_Store
    Ignored:    dge-analysis/.DS_Store
    Ignored:    dge-analysis/.Rhistory
    Ignored:    dge-analysis/.Rproj.user/
    Ignored:    dge-analysis/analysis/.DS_Store
    Ignored:    dge-analysis/data/.DS_Store
    Ignored:    dge-analysis/docs/.DS_Store
    Ignored:    dge-analysis/docs/figure/.DS_Store
    Ignored:    dge-analysis/docs/site_libs/.DS_Store
    Ignored:    dge-analysis/output/.DS_Store
    Ignored:    dge-analysis/output/all-samples-analysis/all-condition-analysis.RData
    Ignored:    dge-analysis/output/publication-analysis/publication-analysis.RData
    Ignored:    dge-analysis/output/spta1-analysis/batch-correction-limma/
    Ignored:    dge-analysis/output/subgroup-hpp-analysis/HPP-subgroup-analysis.RData
    Ignored:    dge-analysis/renv/.DS_Store
    Ignored:    dge-analysis/renv/library/
    Ignored:    dge-analysis/renv/staging/
    Ignored:    phenotypic-analysis/.DS_Store
    Ignored:    phenotypic-analysis/notebooks/.DS_Store

Unstaged changes:
    Modified:   .gitignore
    Deleted:    dge-analysis/docs/custom.css
    Modified:   dge-analysis/output/all-samples-analysis/analysis-volcano-plot.png
    Modified:   dge-analysis/output/all-samples-analysis/batch-correction-limma/plot-counts/madd-genes-of-interest/CYTH2-plot-counts.png
    Modified:   dge-analysis/output/all-samples-analysis/batch-correction-limma/plot-counts/padj-05/faceted_significant_degs_plot_counts.png
    Modified:   dge-analysis/output/all-samples-analysis/counts_vst.csv
    Modified:   dge-analysis/output/all-samples-analysis/counts_vst_limma.csv
    Modified:   dge-analysis/output/all-samples-analysis/madd_genes_of_interest.csv
    Modified:   dge-analysis/output/all-samples-analysis/mito_genes_of_interest.csv
    Modified:   dge-analysis/output/all-samples-analysis/other_genes_of_interest.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_aff_vs_unaff.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_aff_vs_unaff_df_genename.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_aff_vs_unaff_significant_mygene.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_aff_vs_unaff_significant_samples.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_spta1_vs_unaff_df_genename_05.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_spta1_vs_unaff_df_genename_padj05_lfc1.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_spta1_vs_unaff_df_genename_padj1.csv
    Modified:   dge-analysis/output/all-samples-analysis/res_spta1_vs_unaff_df_genename_padj1_lfc1.csv
    Modified:   dge-analysis/output/all-samples-analysis/slc4a1_genes_of_interest.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/counts_vst.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/counts_vst_limma.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_hpp_vs_unaff.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_hpp_vs_unaff_df_genename_05.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_hpp_vs_unaff_df_genename_padj05_lfc1.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_hpp_vs_unaff_df_genename_padj1.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_hpp_vs_unaff_df_genename_padj1_lfc1.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_hpp_vs_unaff_siggenes.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/res_subgroup_vs_unaff_df_genename.csv
    Modified:   dge-analysis/output/subgroup-hpp-analysis/volcano-plot-hpp.png

Staged changes:
    Modified:   dge-analysis/analysis/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (dge-analysis/analysis/subsetted-condition-analysis.Rmd) and HTML (docs/subsetted-condition-analysis.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	9f4f1fd	GitHub	2025-11-06	Reorganize repository. (#4)
html	9f4f1fd	GitHub	2025-11-06	Reorganize repository. (#4)
html	a6247eb	GitHub	2025-07-23	Prepare for PhysGen submission. (#2)
html	b56e0ee	GitHub	2025-02-03	Prep repository for publication. (#1)
html	721ce08	sdhutchins	2024-07-15	Build site.
html	fefc055	sdhutchins	2024-07-12	Build site.
html	d5cad3c	sdhutchins	2024-07-12	Build site.
html	27d94ad	sdhutchins	2024-07-12	Build site.
html	c32ee82	sdhutchins	2024-07-12	Build site.
html	614a5fc	sdhutchins	2024-06-13	Build site.
html	841b69b	sdhutchins	2024-06-12	Build site.
html	e92fe76	sdhutchins	2024-06-11	Build site.
html	281ec58	sdhutchins	2024-06-11	Build site.
html	f5411e0	sdhutchins	2024-05-20	Build site.
html	6b0693d	sdhutchins	2024-05-17	Build site.
html	df552ab	sdhutchins	2024-05-17	Build site.
html	573ba28	sdhutchins	2024-05-16	Build site.
html	be532a2	sdhutchins	2024-05-10	Build site.
html	8a91bc0	sdhutchins	2024-04-29	Build site.
html	ac28b95	sdhutchins	2024-04-22	Build site.
html	bd1bae0	sdhutchins	2024-04-12	Build site.
html	c835cb7	sdhutchins	2024-04-12	Build site.
html	18ba804	sdhutchins	2024-04-12	Build site.
html	f2c768c	sdhutchins	2024-04-11	Build site.
html	1e45c53	sdhutchins	2024-04-09	Build site.
html	1804028	sdhutchins	2024-04-08	Update metadata.
html	0c1afe6	sdhutchins	2024-03-13	Build site.

Load libraries

Ensure you have followed instructions in the readme on activating the renv environment.

library(tidyverse) # Available via CRAN
library(DESeq2) # Available via Bioconductor
library(RColorBrewer) # Available via CRAN
library(pheatmap) # Available via CRAN
library(genefilter) # Available via Bioconductor
library(limma) # Available via Bioconductor
library(gprofiler2) # Available via CRAN
library(biomaRt) # Available via Bioconductor
library(plotly) # Available via CRAN
library(ggpubr) # Available via CRAN
library(rmarkdown) # Available via CRAN
library(clusterProfiler) # Available via Bioconductor
library(org.Hs.eg.db) # Available via Bioconductor
library(ggrepel) # Available via CRAN
library(ReactomePA) # Available via Bioconductor
library(mygene) # Available via Bioconductor
library(DOSE) # Available via Bioconductor
library(enrichR) # Available via Bioconductor
library(STRINGdb) # Available via Bioconductor
library(EnhancedVolcano)
library(ComplexHeatmap)

Data Import

We will be importing counts data from the star-salmon pipeline and our metadata for the project which is hosted on Box. This also ensures data is properly ordered by sample id.

# Import counts and metadata
sample_metadata <- read_csv("data/Metadata_2024_11_20.csv")
row.names(sample_metadata) <- sample_metadata$ID
counts <- read_tsv("data/star-salmon/salmon_merged_gene_counts_length_scaled.tsv")

# Import variants of interest
genes_of_interest <- read_csv("data/Prioritized_Genes_From_WGS_2025_06_18.csv")
genes_of_interest <- unique(genes_of_interest$Genes)
genes_of_interest <- genes_of_interest[!is.na(genes_of_interest)]

# Use first column (gene_id) for row names
counts <- data.frame(counts, row.names = 1, check.names = FALSE)
counts$Ensembl_ID <- row.names(counts)
drop <- c("Ensembl_ID", "gene_name")
gene_info <- counts[, drop]
counts <- counts[, !(names(counts) %in% drop)] # remove both columns

# What gene (based on variants samples have) is this analysis based on
analysis_name <- "subsetted"
outpath_var <- paste0(analysis_name, "-analysis")

# Check that data is ordered properly
sample_metadata <- check_order(sample_metadata = sample_metadata, counts = counts)

genes_biomart <- retrieve_gene_info(values = gene_info$Ensembl_ID, filters = "ensembl_gene_id_version")

DESeq2 Analysis

sample_metadata$Family <- factor(sample_metadata$Family)
sample_metadata$Affected <- factor(sample_metadata$Affected)
sample_metadata$Batch <- factor(sample_metadata$Batch)
sample_metadata$Sex <- factor(sample_metadata$Sex)
sample_metadata$Ancestry <- factor(sample_metadata$Ancestry)
sample_metadata$Category <- factor(sample_metadata$Category)
sample_metadata$Subgroup <- factor(sample_metadata$Subgroup)

# Account for Batch and Affected Status
dds <- DESeqDataSetFromMatrix(
  countData = round(counts), colData = sample_metadata,
  design = ~ Batch + Affected
)

# Pre-filtering: Keep only rows that have at least 10 reads total
keep <- rowSums(counts(dds)) >= 50
dds <- dds[keep, ]

# Remove samples from families with multiple samples
# View Excluded_Samples.csv to learn why these were excluded
remove_multiple_fam <- c("8", "12", "19", "21", "23", "27")

# Save excluded/removed samples
# Code is commented out since this is for 1 time usage
# excluded_samples_df <- filter(sample_metadata,
#                        ID %in% remove_multiple_fam)
# write.csv(excluded_samples_df, "data/Excluded_Samples.csv")

# Create a logical vector to index the columns you want to keep
kept_fam_members <- !(colnames(dds) %in% remove_multiple_fam)
dds <- dds[, kept_fam_members]

# Run DESeq function
dds <- DESeq(dds)

# Normalize gene counts for differences in seq. depth/global differences
counts_norm <- counts(dds, normalized = TRUE)

Data transformation

Perform count data transformation by variance stabilizing transformation (vst) on normalized counts.

vsd <- vst(dds, blind = FALSE)

Batch correction with limma

This code performs batch effect correction on variance-stabilized transformed (VST) counts using the Limma package. The corrected counts are then saved and incorporated into the VST object for downstream analysis.

# Extract VST counts from the DESeq2 object
counts_vst <- assay(vsd)

# Save the VST counts to a CSV file
write.csv(counts_vst, file = "output/publication-analysis/counts_vst_subsetted.csv")

# Create a model matrix with Batch and Affected as covariates
mm <- model.matrix(~ Batch + Affected, colData(vsd))

# Remove batch effects using the Limma package
counts_vst_limma <- limma::removeBatchEffect(counts_vst,
  batch = vsd$Batch,
  design = mm
)

Coefficients not estimable: batch1 batch2

# Save the batch-effect corrected counts to a CSV file
write.csv(counts_vst_limma, file = "output/publication-analysis/counts_vst_limma_subsetted.csv")

# Replace the VST counts in the DESeq2 object with the batch-effect corrected counts
vsd_limma <- vsd
assay(vsd_limma) <- counts_vst_limma

Data Visualization

Sample distances heatmap

This code calculates sample-to-sample distances from the batch-effect corrected VST data, organizes the distances into a matrix, and visualizes the results as a heatmap for quality control.

# Compute sample-to-sample distances using the corrected VST data
sample_dists_all <- dist(t(assay(vsd_limma)))

# Convert the distance object to a matrix for visualization
sample_dist_matrix_all <- as.matrix(sample_dists_all)

# Set row and column names for the distance matrix using metadata
rownames(sample_dist_matrix_all) <- paste(vsd_limma$Batch, vsd_limma$Affected, sep = " | ")
colnames(sample_dist_matrix_all) <- paste(vsd_limma$ID, vsd_limma$Affected, sep = " | ")

# Define a color palette for the heatmap
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)

# Plot the heatmap of sample distances with hierarchical clustering
pheatmap(sample_dist_matrix_all,
  clustering_distance_rows = sample_dists_all,
  clustering_distance_cols = sample_dists_all,
  col = colors
)

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23
27d94ad	sdhutchins	2024-07-12
6b0693d	sdhutchins	2024-05-17
1e45c53	sdhutchins	2024-04-09
0c1afe6	sdhutchins	2024-03-13

Principal Components Analysis & Visualization

This code prepares data for PCA analysis, computes principal components using batch-effect corrected VST data, and generates a PCA plot to visualize sample clustering based on specified metadata variables.

# Prepare PCA data from the corrected VST data
# intgroup specifies the metadata variables to include in the PCA
pca_results <- prepare_pca_data(vsd_data = vsd_limma, intgroup = c("Batch", "Affected", "Sex"))

# Generate a PCA plot with metadata-based coloring, filling, and shaping
pca_plot <- generate_pca_plot(
  pca_data = pca_results$pca_data,
  percent_var = pca_results$percent_var,
  colour = "Affected", # Color by affected status
  fill = "Sex", # Fill by sex
  shape = "Batch" # Shape by batch
)

# Display the PCA plot
print(pca_plot)

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23
b56e0ee	GitHub	2025-02-03
1e45c53	sdhutchins	2024-04-09
0c1afe6	sdhutchins	2024-03-13

The below PCA plot highlights overall categories, affected status, and batch information for each sample.

# Prepare PCA data
pca_results <- prepare_pca_data(
  vsd_data = vsd_limma,
  intgroup = c("Batch", "Affected", "Category")
)

# Generate PCA plot
pca_plot <- generate_pca_plot(
  pca_data = pca_results$pca_data,
  percent_var = pca_results$percent_var,
  colour = "Category", fill = "Affected", shape = "Batch"
)

# Display the plot
pca_plot

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23

Heatmaps of Genes with Highest Variance

# Specify annotation colors by columns
# Use RColorBrewer::brewer.pal(n=10, name="Set1")
category_colors <- c(
  ENE        = "#3C8031",  # Energy Production and Metabolism (green)
  STR        = "#1F57C3",  # Muscle Structure and Stability (blue)
  RBC        = "#C9302C",  # RBC Membrane Integrity (red)
  SOL        = "#9B2FAE",  # Solute Transport and Ion Homeostasis (purple)
  IMM        = "#FFB000",  # IMM (assigned amber-gold)
  Unaffected = "#808080"   # Neutral gray
)

ann_colors <- list(
  Batch = c(B1 = "#c38b1f", B2 = "#c31f57", B3 = "#1fc38b"),
  Affected = c(Affected = "#800000", Unaffected = "#010080"),
  Category = category_colors
)

This is a heatmap of the top 50 genes with the highest variance across samples

# Parameters for the function
annotation_columns <- c("Batch", "Affected", "Category")
annotation_colors <- ann_colors # Predefined annotation colors

# Generate the heatmap for the top 50 variable genes
generate_top_variable_genes_heatmap(
  vsd_data = vsd_limma,
  gene_info = gene_info,
  annotation_columns = annotation_columns,
  annotation_colors = annotation_colors,
  top_n = 50
)

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23
b56e0ee	GitHub	2025-02-03
27d94ad	sdhutchins	2024-07-12
614a5fc	sdhutchins	2024-06-13
281ec58	sdhutchins	2024-06-11
1e45c53	sdhutchins	2024-04-09
0c1afe6	sdhutchins	2024-03-13

Comparison/Contrast of Affected_Affected_vs_Unaffected

This code performs differential expression analysis between affected and unaffected samples, processes and filters the results, and summarizes the findings based on adjusted p-value and log2 fold change thresholds.

# Perform differential expression analysis for affected vs. unaffected samples
res_aff_vs_unaff_sub <- results(dds, contrast = c("Affected", "Affected", "Unaffected"))

# Process and save the results to a CSV file
res_aff_vs_unaff_df_sub <- process_and_save_results(
  res_aff_vs_unaff_sub,
  "output/publication-analysis/res_aff_vs_unaff_sub.csv"
)

# Arrange the results by adjusted p-value
res_aff_vs_unaff_df_sub <- arrange(res_aff_vs_unaff_df_sub, padj)

# Filter results with adjusted p-value < 0.05
res_aff_vs_unaff_df_05_sub <- subset(res_aff_vs_unaff_df_sub, padj < 0.05)

# Filter results with adjusted p-value < 0.1
res_aff_vs_unaff_df_1_sub <- subset(res_aff_vs_unaff_df_sub, padj < 0.1)

# Further filter results with |log2 fold change| >= 1 and adjusted p-value < 0.1
res_aff_vs_unaff_df_padj1_lfc1_sub <- subset(res_aff_vs_unaff_df_1_sub, log2FoldChange >= 1 | log2FoldChange <= -1)

# Further filter results with |log2 fold change| >= 1 and adjusted p-value < 0.05
res_aff_vs_unaff_df_padj05_lfc1_sub <- subset(res_aff_vs_unaff_df_05_sub, log2FoldChange >= 1 | log2FoldChange <= -1)

# Summarize the differential expression results
summary(res_aff_vs_unaff_sub)


out of 24628 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 22, 0.089%
LFC < 0 (down)     : 28, 0.11%
outliers [1]       : 164, 0.67%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Heatmap of Significant Genes

This code generates a heatmap for significantly differentially expressed genes (DEGs) with an adjusted p-value < 0.1 and |log2 fold change| ≥ 1. It uses metadata annotations for grouping and color-coding.

# Parameters for the function
annotation_columns <- c("Affected", "Category")
annotation_colors <- ann_colors # Predefined annotation colors
output_file <- "output/publication-analysis/Supplemental Figure 4. DEGs (Padj < 0.1).png"

# Generate the heatmap for significant genes
generate_significant_genes_heatmap(
  vsd_data = vsd_limma,
  res_data = res_aff_vs_unaff_df_padj1_lfc1_sub, # Significant genes
  gene_info = gene_info,
  annotation_columns = annotation_columns,
  annotation_colors = annotation_colors,
  output_file = output_file
)

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23

This code generates a heatmap for significantly differentially expressed genes (DEGs) with an adjusted p-value < 0.05 and |log2 fold change| ≥ 1. It uses metadata annotations for grouping and color-coding.

# Parameters for the function
annotation_columns <- c("Affected", "Category")
annotation_colors <- ann_colors # Predefined annotation colors
output_file <- "output/publication-analysis/Supplemental Figure 4. Significant DEGs (Padj < 0.05).png"

# Generate the heatmap for significant genes
generate_significant_genes_heatmap(
  vsd_data = vsd_limma,
  res_data = res_aff_vs_unaff_df_padj05_lfc1_sub, # Significant genes
  gene_info = gene_info,
  annotation_columns = annotation_columns,
  annotation_colors = annotation_colors,
  output_file = output_file
)

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23

This code integrates gene annotations into the differential expression results using Ensembl IDs and reorders the data by adjusted p-values for downstream analysis.

# Extract relevant columns from the gene information data
gb_df <- genes_biomart[, c(1, ncol(genes_biomart))]

# Add Ensembl IDs to the DEG results
res_aff_vs_unaff_df_sub_genename <- res_aff_vs_unaff_df_sub
res_aff_vs_unaff_df_sub_genename$Ensembl_ID <- row.names(res_aff_vs_unaff_df_sub)

# Merge DEG results with gene information based on Ensembl IDs
res_aff_vs_unaff_df_sub_genename <- merge(
  x = res_aff_vs_unaff_df_sub_genename,
  y = gene_info,
  by.x = "Ensembl_ID",
  by.y = "Ensembl_ID",
  all.x = TRUE
)

# Reorganize columns to place gene annotations first
res_aff_vs_unaff_df_sub_genename <- res_aff_vs_unaff_df_sub_genename[, c(
  dim(res_aff_vs_unaff_df_sub_genename)[2],
  1:(dim(res_aff_vs_unaff_df_sub_genename)[2] - 1)
)]

# Order the DEG results by adjusted p-value
res_aff_vs_unaff_df_sub_genename <- res_aff_vs_unaff_df_sub_genename[
  order(res_aff_vs_unaff_df_sub_genename[, "padj"]),
]

# Save the updated results with gene annotations to a CSV file
write.csv(
  res_aff_vs_unaff_df_sub_genename,
  file = "output/publication-analysis/res_aff_vs_unaff_df_sub_genename.csv"
)

Volcano Plot

This code generates a volcano plot to visualize the distribution of differentially expressed genes (DEGs) and highlights selected genes and applying thresholds for significance and fold change.

# Parameters for the function
res_data <- res_aff_vs_unaff_df_sub_genename
gene_labels <- "gene_name"
x_col <- "log2FoldChange"
y_col <- "padj"
select_genes <- c("SLC4A10", "SNTG2", "KCNQ5", "DEFA1", "CD248", "POLR3G", "ZNF683")
xlab_text <- bquote(~ Log[2] ~ "fold change")
ylab_text <- bquote(~ Log[10] ~ "padj")
xlim_range <- c(-25, 25)
ylim_range <- c(0, 7)
output_file <- "output/publication-analysis/subsetted-analysis-volcano-plot.png"

# Generate and save the volcano plot
generate_volcano_plot(
  res_data = res_data, gene_labels = gene_labels, x_col = x_col,
  y_col = y_col, select_genes = select_genes, xlab_text = xlab_text,
  ylab_text = ylab_text, p_cutoff = 0.05, fc_cutoff = 1.0,
  xlim_range = xlim_range, ylim_range = ylim_range,
  output_file = output_file
)

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23
b56e0ee	GitHub	2025-02-03
27d94ad	sdhutchins	2024-07-12

This code subsets differentially expressed genes (DEGs) by significance thresholds, orders them by adjusted p-values, and saves the results to separate CSV files for further analysis and reporting.

res_aff_vs_unaff_df_genename_05 <- subset(res_aff_vs_unaff_df_sub_genename, padj < 0.05)
res_aff_vs_unaff_df_genename_05 <- res_aff_vs_unaff_df_genename_05[order(res_aff_vs_unaff_df_genename_05$padj), ]
write.csv(res_aff_vs_unaff_df_genename_05, file = "output/publication-analysis/res_aff_vs_unaff_df_sub_genename_05.csv")

res_aff_vs_unaff_df_genename_padj05_lfc1 <- subset(res_aff_vs_unaff_df_genename_05, log2FoldChange >= 1 | log2FoldChange <= -1)
res_aff_vs_unaff_df_genename_padj05_lfc1 <- res_aff_vs_unaff_df_genename_padj05_lfc1[order(res_aff_vs_unaff_df_genename_padj05_lfc1$padj), ]
write.csv(res_aff_vs_unaff_df_genename_padj05_lfc1, file = "output/publication-analysis/Supplemental Table 5. Significant Differentially Expressed Genes.csv", row.names = FALSE)

res_aff_vs_unaff_df_genename_1 <- subset(res_aff_vs_unaff_df_sub_genename, padj < 0.1)
res_aff_vs_unaff_df_genename_1 <- res_aff_vs_unaff_df_genename_1[order(res_aff_vs_unaff_df_genename_1$padj), ]
write.csv(res_aff_vs_unaff_df_genename_1, file = "output/publication-analysis/res_aff_vs_unaff_df_sub_genename_padj1.csv")

res_aff_vs_unaff_df_genename_padj1_lfc1 <- subset(res_aff_vs_unaff_df_genename_1, log2FoldChange >= 1 | log2FoldChange <= -1)
res_aff_vs_unaff_df_genename_padj1_lfc1 <- res_aff_vs_unaff_df_genename_padj1_lfc1[order(res_aff_vs_unaff_df_genename_padj1_lfc1$padj), ]
write.csv(res_aff_vs_unaff_df_genename_padj1_lfc1, file = "output/publication-analysis/res_aff_vs_unaff_df_sub_genename_padj1_lfc1.csv")

This code queries gene symbols from significant DEGs (padj < 0.05) against the mygene.info database to retrieve detailed annotations, ensuring unique results for each gene.

# Extract gene symbols from significant DEGs (Padj < 0.05)
genes <- res_aff_vs_unaff_df_genename_05$gene_name

# Query gene information from mygene.info
# Retrieve gene symbol, name, and summary fields for human species
my_gene_data <- queryMany(
  genes,
  scopes = "symbol",
  fields = c("symbol", "name", "summary"),
  species = "human"
)

Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.

# Convert the results to a data frame and ensure unique entries for each query
my_gene_data_unique <- as.data.frame(my_gene_data) %>%
  dplyr::distinct(query, .keep_all = TRUE)

MA Plot and Significant Gene Analysis

This code generates an MA plot to visualize the differential expression results, identifies significant genes, combines these with detailed gene annotations, and saves the results to CSV files.

filtered_gene_names <- res_aff_vs_unaff_df_sub_genename$gene_name[!grepl("^ENS", res_aff_vs_unaff_df_sub_genename$gene_name)]
# Select specific genes to show
# set top = 0, then specify genes using label.select argument
# Generate the MA plot
maplot <- generate_ma_plot(
  data = res_aff_vs_unaff_df_sub,
  genenames = as.vector(res_aff_vs_unaff_df_sub_genename$gene_name),
  main_title = "Affected vs Unaffected MA Plot",
  top_genes = 30
)
maplot

Version	Author	Date
9f4f1fd	GitHub	2025-11-06
a6247eb	GitHub	2025-07-23
b56e0ee	GitHub	2025-02-03
281ec58	sdhutchins	2024-06-11
6b0693d	sdhutchins	2024-05-17
1e45c53	sdhutchins	2024-04-09
0c1afe6	sdhutchins	2024-03-13

significant_data <- maplot$data %>%
  filter(grepl("Up|Down", sig)) %>%
  mutate(direction = ifelse(grepl("Up", sig), "Up", "Down")) %>%
  dplyr::select(-sig) # This removes the 'sig' column

# Combine DataFrames based on matching 'query' in my_gene_data_unique to 'gene' in significant_data
combined_data <- inner_join(my_gene_data_unique, significant_data, by = c("query" = "name"))

combined_data <- combined_data %>%
  dplyr::select(-notfound, -X_id, -X_score) %>%
  rename(gene = query)

paged_table(as.data.frame(significant_data), options = list(rows.print = 30))

# Save significant genes
write.csv(significant_data, file = "output/publication-analysis/res_aff_vs_unaff_significant_subsetted_samples.csv", row.names = FALSE)

# Save significant genes
write.csv(combined_data, file = "output/publication-analysis/res_aff_vs_unaff_significant_subsetted_mygene.csv", row.names = FALSE)

Expression of Candidate Genes

Below is a table of expression of the genes identified during our WGS analysis.

# Subset gene_info using genes of interest
subset_gene_info <- gene_info[gene_info$gene_name %in% genes_of_interest, ]

filtered_by_interest <- filter(res_aff_vs_unaff_df_sub_genename, Ensembl_ID %in% subset_gene_info$Ensembl_ID)

# Filtering the dataframe by row names
filtered_counts <- counts[row.names(counts) %in% subset_gene_info$Ensembl_ID, ]

# Match and update row names
matching_indices <- match(rownames(filtered_counts), subset_gene_info$Ensembl_ID)

# Update row names based on matching_column values from metadata
rownames(filtered_counts) <- subset_gene_info$gene_name[matching_indices]

missing_genes <- setdiff(subset_gene_info$Ensembl_ID, filtered_by_interest$Ensembl_ID)

corresponding_gene_names <- subset_gene_info$gene_name[subset_gene_info$Ensembl_ID %in% missing_genes]

# Add this to get the counts and assign gene names as rownames
missing_gene_counts <- counts[rownames(counts) %in% missing_genes, ]

missing_gene_map <- subset_gene_info[match(rownames(missing_gene_counts), subset_gene_info$Ensembl_ID), "gene_name"]

rownames(missing_gene_counts) <- missing_gene_map

# Save Missing Genes As CSV
write.csv(missing_gene_counts,
          file = "output/publication-analysis/missing_genes_interest_counts.csv")

paged_table(filtered_by_interest, options = list(rows.print = 15))

paged_table(filtered_counts, options = list(rows.print = 15))

STRINGdb Network Visualization

This code visualizes a protein-protein interaction network for significant DEGs using STRINGdb. It includes mapping DEGs, adding differential expression color codes, and generating a network plot.

library(igraph)

library(STRINGdb)
library(igraph)

# STRINGdb setup
string_db <- STRINGdb$new(version = "11.5", species = 9606,
                          score_threshold = 100, network_type = "full",
                          input_directory = "")

# Combine DEGs and genes of interest
all_genes <- unique(c(significant_data$name, genes_of_interest))

# Map to STRING IDs
mapped_all <- string_db$map(data.frame(name = all_genes), "name", removeUnmappedRows = TRUE)

Warning:  we couldn't map to STRING 10% of your identifiers

# Label type
mapped_all$type <- ifelse(mapped_all$name %in% significant_data$name, "DEG", "variant")

# Merge logFC from DEGs
mapped_all$lfc <- significant_data$lfc[match(mapped_all$name, significant_data$name)]

# Manually rescale lfc to [0,1]
lfc_vals <- mapped_all$lfc
lfc_min <- min(lfc_vals, na.rm = TRUE)
lfc_max <- max(lfc_vals, na.rm = TRUE)

lfc_scaled <- if (lfc_max == lfc_min) {
  rep(0.5, length(lfc_vals))
} else {
  (lfc_vals - lfc_min) / (lfc_max - lfc_min)
}

# Create color palette and assign
color_fun <- colorRampPalette(c("blue", "white", "firebrick"))
deg_colors <- color_fun(100)[as.numeric(cut(lfc_scaled, breaks = 100))]
mapped_all$color <- ifelse(mapped_all$type == "variant", "moccasin", deg_colors)

# Build graph from all mapped STRING IDs
full_graph <- string_db$get_graph()
graph_all <- induced_subgraph(full_graph, vids = mapped_all$STRING_id)
graph_all <- simplify(graph_all, remove.loops = TRUE, remove.multiple = TRUE)

# Keep all DEGs
deg_ids <- mapped_all$STRING_id[mapped_all$type == "DEG"]

# From variant-only genes, keep top 20 by degree
variant_ids <- mapped_all$STRING_id[mapped_all$type == "variant"]
variant_ids_in_graph <- intersect(variant_ids, V(graph_all)$name)
variant_degrees <- degree(graph_all, v = variant_ids_in_graph)
top_variant_ids <- names(sort(variant_degrees, decreasing = TRUE))[1:10]

# Final node set = all DEGs + top variant genes
final_ids <- unique(c(deg_ids, top_variant_ids))
graph <- induced_subgraph(graph_all, vids = final_ids)

# Reapply aesthetics
label_map <- setNames(mapped_all$name, mapped_all$STRING_id)
color_map <- setNames(mapped_all$color, mapped_all$STRING_id)

V(graph)$label <- label_map[V(graph)$name]
V(graph)$color <- color_map[V(graph)$name]
V(graph)$frame.color <- "white"
V(graph)$label.color <- "black"
V(graph)$size <- 8

# Layout
layout <- layout_with_fr(graph, niter = 2000, area = vcount(graph)^3)

# Save plot
png("output/publication-analysis/mecfs_deg_variant_network.png", width = 1600, height = 1000, res = 300)
plot(graph,
     vertex.label.cex = 0.75,
     edge.color = "black",
     edge.width = 1,
     layout = layout)
dev.off()

quartz_off_screen 
                2

Genes of Interest Plot Counts

# List of gene names to remove or keep
spta1_genes <- c(
  "AHSP", "HEMGN", "SLC4A1", "KEL", "SPTA1", "RHAG",
  "GYPE", "GYPA", "GYPB", "EPB42", "SPTB", "SPTBN5"
)

# Define paths
output_path <- "output/publication-analysis/batch-correction-limma/plot-counts/genes-of-interest"

filename <- file.path(output_path, "genes-of-interest-heatmap.png")

# Filter genes
filtered_without_spta1_genes <- filter_genes(filtered_by_interest, spta1_genes, include = FALSE)

spta1_genes_only <- filter_genes(filtered_by_interest, filtered_without_spta1_genes, include = FALSE)

# Generate plots and results using the modularized functions
results <- process_genes_of_interest(
  filtered_data = filtered_by_interest,
  dds = dds,
  vsd = vsd_limma,
  genes_to_filter = spta1_genes,
  output_path = output_path
)

# Extract the faceted plot and gene results
faceted_plot <- results$faceted_plot
gene_results_df <- results$gene_results_df

# Save the gene results to a CSV
file_name <- file.path(output_path, "genes_of_interest_plot_counts_signals.csv")
write.csv(gene_results_df, file_name, row.names = FALSE)

# Prepare the heatmap data
mat3 <- assay(vsd_limma)[filtered_without_spta1_genes$Ensembl_ID, ]
rownames(mat3) <- gene_info$gene_name[match(filtered_without_spta1_genes$Ensembl_ID, gene_info$Ensembl_ID)]
df_sub <- as.data.frame(colData(vsd_limma)[, c("Affected", "Category")])

# Create the heatmap
create_heatmap(
  mat = mat3,
  gene_info = gene_info,
  vsd_coldata = colData(vsd_limma),
  output_path = filename,
  ann_colors = ann_colors
)

quartz_off_screen 
                2

SPTA1 Genes of Interest

# Define the output path
output_path <- "output/publication-analysis/batch-correction-limma/plot-counts/spta1-genes-of-interest"

# Generate plots and results using the modularized function
results <- process_genes_of_interest(
  filtered_data = filtered_by_interest,
  dds = dds,
  vsd = vsd_limma,
  genes_to_filter = filtered_without_spta1_genes$gene_name,
  output_path = output_path
)

# Extract the faceted plot and gene results
faceted_plot <- results$faceted_plot
gene_results_df <- results$gene_results_df

# Save the gene results to a CSV
file_name <- file.path(output_path, "genes_of_interest_plot_counts_signals.csv")
write.csv(gene_results_df, file_name, row.names = FALSE)

# Prepare data for the heatmap
mat_spta1 <- assay(vsd_limma)[spta1_genes_only$Ensembl_ID, ]
rownames(mat_spta1) <- gene_info$gene_name[match(spta1_genes_only$Ensembl_ID, gene_info$Ensembl_ID)]
annotation_data <- as.data.frame(colData(vsd_limma)[, c("Affected", "Category")])

# Generate a heatmap for SPTA1 genes
create_heatmap(
  mat = mat_spta1,
  gene_info = gene_info,
  vsd_coldata = colData(vsd_limma),
  output_path = file.path(output_path, "spta1_genes_of_interest_heatmap.png"),
  ann_colors = ann_colors
)

quartz_off_screen 
                2

Significant Genes Plot Counts

# Define the output path
output_path <- "output/publication-analysis/batch-correction-limma/plot-counts/padj-05"

# Generate plots and results using the modularized function
results <- process_genes_of_interest(
  dds = dds,
  vsd = vsd_limma,
  output_path = output_path,
  preprocessed_data = res_aff_vs_unaff_df_genename_padj05_lfc1
)

# Extract the faceted plot and gene results
faceted_plot <- results$faceted_plot
gene_results_df <- results$gene_results_df

# Save the gene results to a CSV
file_name <- file.path(output_path, "significant_genes_plot_counts_signals.csv")
write.csv(gene_results_df, file_name, row.names = FALSE)

# Prepare data for the heatmap
mat_significant <- assay(vsd_limma)[res_aff_vs_unaff_df_genename_padj05_lfc1$Ensembl_ID, ]
rownames(mat_significant) <- res_aff_vs_unaff_df_genename_padj05_lfc1$gene_name[match(
  res_aff_vs_unaff_df_genename_padj05_lfc1$Ensembl_ID,
  res_aff_vs_unaff_df_genename_padj05_lfc1$Ensembl_ID
)]
annotation_data <- as.data.frame(colData(vsd_limma)[, c("Affected", "Category")])

# Generate a heatmap for significant genes
create_heatmap(
  mat = mat_significant,
  gene_info = gene_info,
  vsd_coldata = colData(vsd_limma),
  output_path = file.path(output_path, "significant_genes_heatmap.png"),
  ann_colors = ann_colors
)

quartz_off_screen 
                2

Save data

save.image(file = "output/publication-analysis/publication-analysis.RData")

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] grid      stats4    stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] igraph_2.1.4                ComplexHeatmap_2.24.1      
 [3] EnhancedVolcano_1.26.0      STRINGdb_2.20.0            
 [5] enrichR_3.4                 DOSE_4.2.0                 
 [7] mygene_1.44.0               txdbmaker_1.4.2            
 [9] GenomicFeatures_1.60.0      ReactomePA_1.52.0          
[11] ggrepel_0.9.6               org.Hs.eg.db_3.21.0        
[13] AnnotationDbi_1.70.0        clusterProfiler_4.16.0     
[15] rmarkdown_2.29              ggpubr_0.6.1               
[17] plotly_4.11.0               biomaRt_2.64.0             
[19] gprofiler2_0.2.3            limma_3.64.1               
[21] genefilter_1.90.0           pheatmap_1.0.13            
[23] RColorBrewer_1.1-3          DESeq2_1.48.1              
[25] SummarizedExperiment_1.38.1 Biobase_2.68.0             
[27] MatrixGenerics_1.20.0       matrixStats_1.5.0          
[29] GenomicRanges_1.60.0        GenomeInfoDb_1.44.0        
[31] IRanges_2.42.0              S4Vectors_0.46.0           
[33] BiocGenerics_0.54.0         generics_0.1.4             
[35] lubridate_1.9.4             forcats_1.0.0              
[37] stringr_1.5.1               dplyr_1.1.4                
[39] purrr_1.1.0                 readr_2.1.5                
[41] tidyr_1.3.1                 tibble_3.3.0               
[43] ggplot2_3.5.2               tidyverse_2.0.0            
[45] workflowr_1.7.1            

loaded via a namespace (and not attached):
  [1] fs_1.6.6                 bitops_1.0-9             enrichplot_1.28.4       
  [4] doParallel_1.0.17        httr_1.4.7               tools_4.5.1             
  [7] backports_1.5.0          R6_2.6.1                 lazyeval_0.2.2          
 [10] GetoptLong_1.0.5         withr_3.0.2              graphite_1.54.0         
 [13] prettyunits_1.2.0        gridExtra_2.3            textshaping_1.0.1       
 [16] cli_3.6.5                labeling_0.4.3           sass_0.4.10             
 [19] systemfonts_1.2.3        Rsamtools_2.24.0         yulab.utils_0.2.0       
 [22] gson_0.1.0               foreign_0.8-90           R.utils_2.13.0          
 [25] WriteXLS_6.8.0           plotrix_3.8-4            rstudioapi_0.17.1       
 [28] RSQLite_2.4.2            shape_1.4.6.1            gridGraphics_0.5-1      
 [31] BiocIO_1.18.0            vroom_1.6.5              gtools_3.9.5            
 [34] car_3.1-3                GO.db_3.21.0             Matrix_1.7-3            
 [37] abind_1.4-8              R.methodsS3_1.8.2        lifecycle_1.0.4         
 [40] whisker_0.4.1            yaml_2.3.10              carData_3.0-5           
 [43] gplots_3.2.0             qvalue_2.40.0            SparseArray_1.8.0       
 [46] BiocFileCache_2.16.0     blob_1.2.4               promises_1.3.3          
 [49] crayon_1.5.3             ggtangle_0.0.7           lattice_0.22-7          
 [52] cowplot_1.2.0            annotate_1.86.1          KEGGREST_1.48.1         
 [55] pillar_1.11.0            knitr_1.50               fgsea_1.34.2            
 [58] rjson_0.2.23             codetools_0.2-20         fastmatch_1.1-6         
 [61] glue_1.8.0               getPass_0.2-4            ggfun_0.2.0             
 [64] data.table_1.17.8        vctrs_0.6.5              png_0.1-8               
 [67] treeio_1.32.0            gtable_0.3.6             gsubfn_0.7              
 [70] cachem_1.1.0             xfun_0.52                S4Arrays_1.8.1          
 [73] tidygraph_1.3.1          survival_3.8-3           iterators_1.0.14        
 [76] statmod_1.5.0            nlme_3.1-168             ggtree_3.16.3           
 [79] bit64_4.6.0-1            progress_1.2.3           filelock_1.0.3          
 [82] rprojroot_2.1.0          bslib_0.9.0              KernSmooth_2.23-26      
 [85] rpart_4.1.24             colorspace_2.1-1         DBI_1.2.3               
 [88] Hmisc_5.2-3              nnet_7.3-20              tidyselect_1.2.1        
 [91] processx_3.8.6           chron_2.3-62             bit_4.6.0               
 [94] compiler_4.5.1           curl_6.4.0               git2r_0.36.2            
 [97] httr2_1.2.0              graph_1.86.0             htmlTable_2.4.3         
[100] xml2_1.3.8               DelayedArray_0.34.1      rtracklayer_1.68.0      
[103] caTools_1.18.3           checkmate_2.3.2          scales_1.4.0            
[106] callr_3.7.6              rappdirs_0.3.3           digest_0.6.37           
[109] XVector_0.48.0           htmltools_0.5.8.1        pkgconfig_2.0.3         
[112] base64enc_0.1-3          dbplyr_2.5.0             fastmap_1.2.0           
[115] GlobalOptions_0.1.2      rlang_1.1.6              htmlwidgets_1.6.4       
[118] UCSC.utils_1.4.0         farver_2.1.2             jquerylib_0.1.4         
[121] jsonlite_2.0.0           BiocParallel_1.42.1      GOSemSim_2.34.0         
[124] R.oo_1.27.1              RCurl_1.98-1.17          magrittr_2.0.3          
[127] Formula_1.2-5            GenomeInfoDbData_1.2.14  ggplotify_0.1.2         
[130] patchwork_1.3.1          Rcpp_1.1.0               proto_1.0.0             
[133] ape_5.8-1                viridis_0.6.5            sqldf_0.4-11            
[136] stringi_1.8.7            ggraph_2.2.1             MASS_7.3-65             
[139] plyr_1.8.9               parallel_4.5.1           Biostrings_2.76.0       
[142] graphlayouts_1.2.2       splines_4.5.1            hash_2.2.6.3            
[145] circlize_0.4.16          hms_1.1.3                locfit_1.5-9.12         
[148] ps_1.9.1                 ggsignif_0.6.4           reshape2_1.4.4          
[151] XML_3.99-0.18            evaluate_1.0.4           renv_1.1.4              
[154] BiocManager_1.30.26      foreach_1.5.2            tzdb_0.5.0              
[157] tweenr_2.0.3             httpuv_1.6.16            polyclip_1.10-7         
[160] clue_0.3-66              ggforce_0.5.0            broom_1.0.8             
[163] xtable_1.8-4             restfulr_0.0.16          reactome.db_1.92.0      
[166] tidytree_0.4.6           rstatix_0.7.2            later_1.4.2             
[169] ragg_1.4.0               viridisLite_0.4.2        aplot_0.2.8             
[172] memoise_2.0.1            GenomicAlignments_1.44.0 cluster_2.1.8.1         
[175] timechange_0.3.0