PAG_scRNAseq_analysis_Part7.Rmd

---
title: "Topographic, single-cell gene expression profiling of Periaqueductal Gray neurons"
subtitle: "Part VII: differential expression analysis"
author:
  - name: "Oriol Pavon Arocas, Sarah F. Olesen, and Tiago Branco"
    affiliation: "Sainsbury Wellcome Centre for Neural Circuits and Behaviour, University College London, UK"
    email: "oriol.pavon.16@ucl.ac.uk"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_notebook:
    highlight: pygments
    number_sections: FALSE
    theme: lumen
    toc: TRUE
    toc_float: TRUE

#output: rmdformats::readthedown:
  #highlight: pygments
---

***

This is a pipeline to analyse single-cell RNA sequencing data from Periaqueductal Gray neurons (1) isolated from acute midbrain slices of transgenic mice using visually guided aspiration via patch pipettes and (2) processed using SMART-seq2 (Picelli et al., Nature Protocols 2014).

This pipeline has been generated after attending the [EMBL-EBI RNA-Sequence Analysis Course](https://www.ebi.ac.uk/training/events/2019/rna-sequence-analysis) and [attending](https://training.csx.cam.ac.uk/bioinformatics/event/2823386) and following the online course on [Analysis of single cell RNA-seq data](https://github.com/hemberg-lab/scRNA.seq.course) by the [Hemberg Lab](https://www.sanger.ac.uk/science/groups/hemberg-group). Many other resources have been used, including the [Orchestrating Single-Cell Analysis with Bioconductor book](https://bioconductor.org/books/release/OSCA/) by Robert Amezquita, Aaron Lun, Stephanie Hicks, and Raphael Gottardo, the [simpleSingleCell workflow in Bioconductor](https://bioconductor.org/packages/3.9/workflows/html/simpleSingleCell.html) maintained by Aaron Lun, the [rnaseqGene workflow](https://bioconductor.org/packages/3.10/workflows/html/rnaseqGene.html) maintained by Michael Love, the [RNAseq123 workflow](https://bioconductor.org/packages/3.10/workflows/html/RNAseq123.html) maintained by Matthew Ritchie, and the [EGSEA123 workflow](https://bioconductor.org/packages/3.10/workflows/html/EGSEA123.html) maintained by Matthew Ritchie.

Other key resources include [Bioconductor](http://www.bioconductor.org/) (Huber et al., Nature Methods 2015), [scRNA-tools](https://www.scrna-tools.org/), `scater` (McCarthy et al., Bioinformatics 2017), `scran` (Lun et al., F1000Res 2016), `SC3` (Kiselev et al., Nature Methods 2017), `Seurat` (Butler et al., Nature Biotechnology 2018), `clusterExperiment` (Risso et al., PLOS Computational Biology 2018), `limma` (Ritchie et al., Nucleic Acids Research 2015), `DESeq2` (Love et al., Genome Biology 2014), `edgeR` (Robinson et al., Bioinformatics 2010), `MAST` (Finak, McDavid, Yajima et al., Genome Biology 2015), `iSEE` (Rue-Albrecht & Marini et al., F1000Research 2018), `t-SNE` (van der Maaten & Hinton, Journal of Machine Learning Research 2008), `UMAP` (McInnes et al., arXiv 2018), and the [Mathematical Statistics and Machine Learning for Life Sciences](https://towardsdatascience.com/tagged/stats-ml-life-sciences) column by Nikolay Oskolkov.

***

# STEP 7 | Differential Expression Analysis
One of the most common types of analyses when working with bulk RNA-seq data is to identify differentially expressed genes. By comparing the genes that change between two conditions, e.g. mutant and wild-type or stimulated and unstimulated, it is possible to characterize the molecular mechanisms underlying the change. Several methods, e.g. `DESeq2` and `edgeR`, have been developed for bulk RNA-seq. 

In scRNA-seq we usually do not have a defined set of experimental conditions. Instead, we can identify the cell groups by using an unsupervised clustering approach. Once the groups have been identified one can find differentially expressed genes either by comparing the differences in variance between the groups (like the Kruskal-Wallis test implemented in `SC3`), or by comparing gene expression between clusters in a pairwise manner. In our case, in addition to clusters we have experimental conditions we can test, namely the cell type or PAG subdivision from where the cells were collected, which makes it possible for us to follow either approach.

```{r}
# Set the directory where your data and scripts are:
setwd("D:/Dropbox (UCL - SWC)/Project_transcriptomics/analysis/PAG_scRNAseq_analysis")

# Set the path to save figures from this Part:
path_for_figures <- "D:/Dropbox (UCL - SWC)/Project_transcriptomics/figures/R_figures_Part7_DE/"
date <- Sys.Date()
date <- gsub("-", "_", date)

# Load packages:
library(tidyverse)
library(SingleCellExperiment)
library(scater)
library(scran)
library(ggplot2)
library(pheatmap)
library(DESeq2)
library(edgeR)
library(limma)
library(monocle)
library(MAST)
library(ROCR)
library(patchwork)
library(extrafont)
```

We continue using the `PAG_sceset_qc_norm_filt_corr_clust` after normalisation, filtering, batch correction, and clustering. We should thus have a `corrected` slot in `assays`:
```{r}
# If starting from stored results, load saved filtered dataset from previous Step:
options(stringsAsFactors = FALSE)

PAG_sceset_qc_norm_filt_corr_clust <- readRDS("PAG_sceset_qc_norm_filt_corr_clust.rds") # Contains filtered cells and genes, log-normalised, filtered, corrected, and clustered data
assayNames(PAG_sceset_qc_norm_filt_corr_clust)
reducedDimNames(PAG_sceset_qc_norm_filt_corr_clust)
```

We first check we have the relevant metadata, making sure the key annotations are `factor`:
```{r}
# Check you have the relevant metadata as factors:
tail(colnames(colData(PAG_sceset_qc_norm_filt_corr_clust)))

class(PAG_sceset_qc_norm_filt_corr_clust$mouse.id)
class(PAG_sceset_qc_norm_filt_corr_clust$cell.type)
class(PAG_sceset_qc_norm_filt_corr_clust$PAG.area)
levels(PAG_sceset_qc_norm_filt_corr_clust$celltype_PAGarea)
```

## Step 7.0 | Recap of clustering results
To remind ourselves about which clustering results we have available, we can look at the names of the `colData` slot:
```{r}
names(colData(PAG_sceset_qc_norm_filt_corr_clust)[grep("cluster", names(colData(PAG_sceset_qc_norm_filt_corr_clust)))])
```

If we wanted to export the cluster IDs of each cell in our dataset for a particuler clustering approach, we could do it as follows:
```{r}
#groups <- as.data.frame(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5)
#rownames(groups) <- PAG_sceset_qc_norm_filt_corr_clust$cell.id
#groups
#write.csv(groups, file = str_c(path_for_figures, "SNN_clusters/", "cluster_ids.csv"), quote = FALSE, row.names = TRUE)
```

We can also replot the results of our chosen clustering solution before we attempt to find marker genes:
```{r}
# Set theme parameters
theme_args_SNN_clustering <- theme(plot.title = element_text(size = 11, face = "bold"),
                                   axis.title = element_text(size = 11, face = "plain"),
                                   axis.text = element_text(size = 10, face = "plain"), 
                                   legend.text = element_text(size = 9, face = "plain"), 
                                   strip.text = element_text(size = 10, face = "plain"))


### Using UMAP (CV2) - 15_neighbors_0.01_min_dist ###
UMAP_cv2_15_001_PAGsubdivisions <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                  dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                  colour_by = "PAG.area", shape_by = "cell.type",
                                                  point_alpha = 0.6, point_size = 2, theme_size = 11
                                                  ) + labs(title = "PAG subdivisions", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

# Using HVG from CV2 with rank-based weights
UMAP_cv2_15_001_RankWalktrap_k7 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                  dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                  colour_by = "SNN_clusters_cv2_rank_k7", shape_by = "cell.type", text_by = "SNN_clusters_cv2_rank_k7",
                                                  point_alpha = 0.6, point_size = 2, theme_size = 11
                                                  ) + labs(title = "Rank-Walktrap (k = 7)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

UMAP_cv2_15_001_RankWalktrap_k12 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                  dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                  colour_by = "SNN_clusters_cv2_rank_k12", shape_by = "cell.type", text_by = "SNN_clusters_cv2_rank_k12",
                                                  point_alpha = 0.6, point_size = 2, theme_size = 11
                                                  ) + labs(title = "Rank-Walktrap (k = 12)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

# Using HVG from CV2 with jaccard-based weights
UMAP_cv2_15_001_JaccardLouvain_k5 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                    dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                    colour_by = "SNN_clusters_cv2_jaccard_k5", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k5",
                                                    point_alpha = 0.6, point_size = 2, theme_size = 11
                                                    ) + labs(title = "Jaccard-Louvain (k = 5)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

UMAP_cv2_15_001_JaccardLouvain_k8 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                    dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                    colour_by = "SNN_clusters_cv2_jaccard_k8", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k8",
                                                    point_alpha = 0.6, point_size = 2, theme_size = 11
                                                    ) + labs(title = "Jaccard-Louvain (k = 8)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

UMAP_cv2_15_001_JaccardLouvain_k9 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                    dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                    colour_by = "SNN_clusters_cv2_jaccard_k9", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k9",
                                                    point_alpha = 0.6, point_size = 2, theme_size = 11
                                                    ) + labs(title = "Jaccard-Louvain (k = 9)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

UMAP_cv2_15_001_JaccardLouvain_k13 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                    dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                    colour_by = "SNN_clusters_cv2_jaccard_k13", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k13",
                                                    point_alpha = 0.6, point_size = 2, theme_size = 11
                                                    ) + labs(title = "Jaccard-Louvain (k = 13)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering

# Using SC3 with k = 4
UMAP_cv2_15_001_SC3_4 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust, 
                                                    dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
                                                    colour_by = "sc3_4_clusters", shape_by = "cell.type", text_by = "sc3_4_clusters",
                                                    point_alpha = 0.6, point_size = 2, theme_size = 11
                                                    ) + labs(title = "SC3 (k = 4)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering


# Prepare and save composed plots
library(patchwork)

UMAP_SNN_RankWalktrap_k7 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_RankWalktrap_k7) +
  plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_RankWalktrap_k7

UMAP_SNN_RankWalktrap_k12 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_RankWalktrap_k12) +
  plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_RankWalktrap_k12

UMAP_SNN_JaccardLouvain_k5 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k5) +
  plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k5

UMAP_SNN_JaccardLouvain_k8 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k8) +
  plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k8

UMAP_SNN_JaccardLouvain_k9 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k9) +
  plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k9

UMAP_SNN_JaccardLouvain_k13 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k13) +
  plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k13

UMAP_SC3_k4 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_SC3_4) +
  plot_annotation(title = "SC3 clustering on UMAP (CV2)", tag_levels = "A", 
                  theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) + 
  plot_layout(ncol = 2, guides = "collect")
UMAP_SC3_k4

ggsave(filename = str_c(date, "_UMAP_SNN_rank_k7_clusters.pdf"),
       plot = UMAP_SNN_RankWalktrap_k7,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )

ggsave(filename = str_c(date, "_UMAP_SNN_rank_k12_clusters.pdf"),
       plot = UMAP_SNN_RankWalktrap_k12,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )

ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k5_clusters.pdf"),
       plot = UMAP_SNN_JaccardLouvain_k5,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )

ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k8_clusters.pdf"),
       plot = UMAP_SNN_JaccardLouvain_k8,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )

ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k9_clusters.pdf"),
       plot = UMAP_SNN_JaccardLouvain_k9,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )

ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k13_clusters.pdf"),
       plot = UMAP_SNN_JaccardLouvain_k13,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )

ggsave(filename = str_c(date, "_UMAP_sc3_k4_clusters.pdf"),
       plot = UMAP_SC3_k4,
       device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
       path = path_for_figures,
       width = 8, height = 4, units = "in", dpi = 300,
       family = "Arial", bg = "transparent"
       )
```

Before we begin with the differential expression analysis, we will remove the cells assigned to the macrophage-enriched cluster from the dataset so that they don't affect the results. As we saw in the previous section, there should be 12 cells with high levels of expression for macrophage and immune response markers like "Cd33", "Plaur", "Cxcl16", or "Cd68".
```{r}
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_rank_k7==8), " cells in SNN_var_rank_k7 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_rank_k12==7), " cells in SNN_var_rank_k12 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5==10), " cells in SNN_cv2_jaccard_k5 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k8==1), " cells in SNN_cv2_jaccard_k8 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k9==3), " cells in SNN_cv2_jaccard_k9 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k13==7), " cells in SNN_cv2_jaccard_k13 method")
```

```{r}
# Create a new SingleCellExperiment object by subsetting to leave out the cells from the macrophage-enriched cluster:
PAG_sceset_qc_norm_filt_corr_clust_de <- PAG_sceset_qc_norm_filt_corr_clust[, PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5!=10]
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```

## Step 7.1 | Overview of approaches to identify marker genes between subpopulations
To interpret the clustering results, we can identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways.

Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have caused separate clustering of cells in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster.

### 7.1.1 | Using pairwise t-tests
The Welch t-test is a statistical method that allows us to *test for differences in expression between clusters*. It is quickly computed and has good statistical properties for large numbers of cells (Soneson and Robinson 2018). The `findMarkers()` function from the `scran` package provides a convenience wrapper for marker gene identification between groups of cells, based on running `pairwiseTTests` or related functions and passing the result to `combineMarkers()`. We can use the `findMarkers()` function to perform pairwise comparisons between clusters for each gene (the function performs Welch t-tests on the normalised log-expression values for every gene and between every pair of clusters), which returns a list of `DataFrame`s containing ranked candidate markers for each cluster. The top DE genes are likely to be good candidate markers as they can effectively distinguish between cells in different clusters. Importantly, besides looking at the clustering results, we can also apply this approach to identify genes driving differences between other factors of our metadata, such as `cell.type` or `PAG.area`, by explicitly supplying them via the `groups=` argument.

For each cluster, the DE results of the relevant comparisons are consolidated into a single output table. The relevant `DataFrame` contains log2-fold changes of expression in cluster X over each other cluster, along with several statistics obtained by combining p-values (Simes 1986) across the pairwise comparisons involving X. This allows a set of marker genes to be easily defined by taking the top DE genes from each pairwise comparison between clusters. For example, to construct a marker set for cluster X from the top 10 genes of each comparison, we could filter `marker_set` to retain rows with `Top` less than or equal to 10. Other statistics are also reported for each gene, including the adjusted p-values and the log-fold changes relative to every other cluster.

Of particular interest is the `Top` field. The set of genes with `Top ≤ X` is the union of the top X genes (ranked by p-value) from each pairwise comparison involving our chosen cluster. For example, the set of all genes with `Top` values of 1 contains the gene with the lowest p-value from each comparison. Similarly, the set of genes with `Top` values less than or equal to 10 contains the top 10 genes from each comparison. The `Top` field represents `findMarkers()`’s approach to consolidating multiple pairwise comparisons into a single ranking for each cluster; each `DataFrame` produced by `findMarkers()` will order genes based on the `Top` value by default. We can thus use the `Top` field to identify a set of genes that is guaranteed to distinguish cluster X from any other cluster. If we take all genes with `Top <= 6`, this is equivalent to the union of the top 6 genes from each pairwise comparison.

Each `DataFrame` also contains several other statistics that may be of interest. The `p.value` field contains the combined p-value that is obtained by applying Simes’s method to the pairwise p-values for each gene and represents the evidence against the joint null hypothesis, i.e., that the gene is not DE between cluster X and any other cluster. In a future version of `scran`, the `summary.logFC` field provides a convenient summary of the direction and effect size for each gene, and is defined here as the log-fold change from the comparison with the lowest p-value. Examination of these statistics permits a quick evaluation of the suitability of a candidate marker; if both of these metrics are poor (small log-fold change, large p-value), the gene can most likely be dismissed.

We intentionally use pairwise comparisons between clusters rather than comparing each cluster to the average of all other cells. The latter approach is sensitive to the population composition, potentially resulting in substantially different sets of markers when cell type abundances change in different contexts. In the worst case, the presence of a single dominant subpopulation will drive the selection of top markers for every other cluster, pushing out useful genes that can resolve the various minor subpopulations. Moreover, pairwise comparisons naturally provide more information to interpret of the utility of a marker, e.g., by providing log-fold changes to indicate which clusters are distinguished by each gene.

### 7.1.2 | Focusing on up or downregulated genes and using the log-fold change
_Focusing on up or downregulated genes_
The previous `findMarkers()` call considers both up- and downregulated genes to be potential markers. However, downregulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. To focus on up-regulated markers, we can instead perform a one-sided t-test to identify genes that are upregulated in each cluster compared to the others. This is achieved by setting `direction = "up"` in the `findMarkers()` call.

_Using the log-fold change_
The t-test also allows us to specify a non-zero log-fold change as the null hypothesis. This allows us to consider the magnitude of the log-fold change in our p-value calculations, in a manner that is more rigorous than simply filtering directly on the log-fold changes (McCarthy and Smyth 2009). (Specifically, a simple threshold does not consider the variance and can enrich for genes that have both large log-fold changes and large variances.) We perform this by setting `lfc=` in our `findMarkers()` call - when combined with `direction=`, this tests for genes with log-fold changes that are significantly greater than 1.

These two settings yield a more focused set of candidate marker genes that are upregulated in cluster X. This increased stringency, however, is not without cost. If only upregulated genes are requested from `findMarkers()`, any cluster defined by downregulation of a marker gene will not contain that gene among the top set of features in its `DataFrame`. This is occasionally relevant for subtypes or other states that are distinguished by high versus low expression of particular genes. Similarly, setting an excessively high log-fold change threshold may discard otherwise useful genes. For example, a gene upregulated in a small proportion of cells of a cluster will have a small log-fold change but can still be an effective marker if the focus is on specificity rather than sensitivity.

### 7.1.3 | Choosing which pairwise comparisons to performm
The choice of `pval.type` determines whether the highly ranked genes are those that are DE between the current group and: (1) any other group ("any"), (2) all other groups ("all"), or (3) some other groups ("some"). The result is a named list of `DataFrame`s, each of which contains a sorted marker gene list for the corresponding group. In each `DataFrame`, the top genes are chosen to enable separation of that group from all other groups. Log-fold changes are reported as differences in average x between groups (usually in base 2, depending on the transformation applied to x).

By default, `findMarkers()` will give a high ranking to genes that are differentially expressed in any pairwise comparison. This is because a gene only needs a very low p-value in a single pairwise comparison to achieve a low `Top` value. A more stringent approach would only consider genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. To achieve this, we set `pval.type="all"` in `findMarkers()` to use an intersection-union test (Berger and Hsu 1996) where the combined p-value for each gene is the maximum of the p-values from all pairwise comparisons. A gene will only achieve a low combined p-value if it is strongly DE in all comparisons to other clusters. Combined with `direction="up"`, this can be used to identify unique markers for each cluster. However, this is sensitive to overclustering, as unique marker genes will no longer exist if a cluster is split into two smaller subclusters.

This strategy will only report genes that are highly specific to the cluster of interest. When it works, it can be highly effective as it generates a small focused set of candidate markers. However, any gene that is expressed at the same level in two or more clusters will simply not be detected. This is likely to discard many interesting genes, especially if the clusters are finely resolved with weak separation (such as different clusters within a cell type). To give a concrete example, consider a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells. With `pval.type="all"`, neither Cd4 or Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. In comparison, `pval.type="any"` will detect both of these genes as they will be DE between at least one pair of subpopulations.

If `pval.type="all"` is too stringent yet `pval.type="any"` is too generous, a compromise is to set `pval.type="some"`. For each gene, we apply the Holm-Bonferroni correction across its p-values and take the middle-most value as the combined p-value. This effectively tests the global null hypothesis that at least 50% of the individual pairwise comparisons exhibit no DE. We then rank the genes by their combined p-values to obtain an ordered set of marker candidates. The aim is to improve the conciseness of the top markers for defining a cluster while mitigating the risk of discarding useful genes that are not DE to all other clusters. The downside is that taking this compromise position sacrifices the theoretical guarantees offered at the other two extremes.

_Only relevant for the newer `scran` version, which returns a column with `summary.logFC`._ In both cases, a different method is used to compute the summary effect size compared to `pval.type="any"`. For `pval.type="all"`, the summary log-fold change is defined as that corresponding to the pairwise comparison with the largest p-value, while for `pval.type="some"`, it is defined as the log-fold change for the comparison with the middle-most p-value. This reflects the calculation of the combined p-value and avoids focusing on genes with strong changes in only one comparison.

### 7.1.4 | Handling blocking factors
_Using the `block=` argument_
Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences). If these are not modelled, they can interfere with marker gene detection - most obviously by inflating the variance within each cluster, but also by distorting the log-fold changes if the cluster composition varies across levels of the blocking factor. To avoid these issues, we can set the `block=` argument in the `findMarkers()` call.

For each gene, each pairwise comparison between clusters is performed separately in each level of the blocking factor. The function will then combine p-values from different blocking levels using Stouffer’s Z method to obtain a single p-value per pairwise comparison. (These p-values are further combined across comparisons to obtain a single p-value per gene, using either Simes’s method or an intersection-union test depending on the value of` pval.type=`.) This approach favours genes that exhibit consistent DE in the same direction in each plate.

The `block=` argument works with all tests shown above and is robust to difference in the log-fold changes or variance between batches. However, it assumes that each pair of clusters is present in at least one batch. In scenarios where cells from two clusters never co-occur in the same batch, the comparison will be impossible and NAs will be reported in the output.

_Using the `design=` argument_
Another approach is to define a design matrix containing the batch of origin as the sole factor. `findMarkers()` will then fit a linear model to the log-expression values, similar to the use of `limma` for bulk RNA sequencing data (Ritchie et al. 2015). This handles situations where multiple batches contain unique clusters, as comparisons can be implicitly performed via shared cell types in each batch. There is also a slight increase in power when information is shared across clusters for variance estimation.

The use of a linear model makes some strong assumptions, necessitating some caution when interpreting the results. If the batch effect is not consistent across clusters, the variance will be inflated and the log-fold change estimates will be distorted. Variances are also assumed to be equal across groups, which is not true in general. In particular, the presence of clusters in which a gene is silent will shrink the residual variance towards zero, preventing the model from penalizing genes with high variance in other clusters. Thus, it is generally recommended to use `block=` where possible.

### 7.1.5 | Using the wilcoxon rank sum test
The Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test) is another widely used method for pairwise comparisons between groups of observations. Its strength lies in the fact that *it directly assesses separation between the expression distributions of different clusters*. The WMW test statistic is proportional to the area-under-the-curve (AUC), i.e., the concordance probability, which is the probability of a random cell from one cluster having higher expression than a random cell from another cluster. In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the t-test only does so indirectly via the difference in the means and the intra-group variance.

We can perform WMW tests using the `findMarkers()` function, setting `test="wilcox"`. This returns a list of `DataFrames` containing ranked candidate markers for each cluster. The `direction=`, `lfc=` and `pval.type=` arguments can be specified and have the same interpretation as described for t-tests. The interpretation of `Top` is the same as described for t-tests, and Simes’s method is again used to combine p-values across pairwise comparisons. If we want more focused sets, we can also change `pval.type=` as previously described.

Again, to explore the results in more detail, we can focus on the `DataFrame` for group X. The `DataFrame` contains the AUCs from comparing group X to every other group. A value greater than 0.5 indicates that the gene is upregulated in the current cluster compared to the other cluster, while values less than 0.5 correspond to downregulation. We would typically expect AUCs of 0.7-0.8 for a strongly upregulated candidate marker.

One practical advantage of the WMW test over the Welch t-test is that it is symmetric with respect to differences in the size of the groups being compared. This means that, all else being equal, the top-ranked genes on each side of a DE comparison will have similar expression profiles regardless of the number of cells in each group. In contrast, the t-test will favor genes where the larger group has the higher relative variance as this increases the estimated degrees of freedom and decreases the resulting p-value. This can lead to unappealing rankings when the aim is to identify genes upregulated in smaller groups. The WMW test is not completely immune the variance effects - for example, it will slightly favor detection of DEGs at low average abundance where the greater number of ties at zero deflates the approximate variance of the rank sum statistic - but this is relatively benign as the selected genes are still fairly interesting.

The main disadvantage of the WMW test is that the AUCs are much slower to compute compared to t-statistics. This may be inconvenient for interactive analyses involving multiple iterations of marker detection. We can mitigate this to some extent by parallelizing these calculations using the `BPPARAM=` argument in `findMarkers()`.

### 7.1.6 | Using the binomial test
The binomial test *identifies genes that differ in the proportion of expressing cells between clusters*. This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle “tuning” effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence is easier to validate.

We can perform pairwise binomial tests between clusters using the `findMarkers()` function with `test="binom"`. This returns a list of `DataFrames` containing marker statistics for each cluster such as the `Top` rank and its p-value. Here, the effect size is reported as the log-fold change in this proportion between each pair of clusters. Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other. We can focus on genes that are upregulated in each cluster compared to the others by setting `direction="up"`.

The disadvantage of the binomial test is that its increased stringency can lead to the loss of good candidate markers. Another property of the binomial test is that it will not respond to scaling normalisation. Systematic differences in library size between clusters will not be considered when computing p-values or effect sizes. This is not necessarily problematic for marker gene detection - users can treat this as retaining information about the total RNA content, analogous to spike-in normalisation.

### 7.1.7 | Using custom DE methods
It is also possible to perform marker gene detection based on precomputed DE statistics, which allows us to take advantage of more sophisticated tests in dedicated DE analysis packages in the Bioconductor ecosystem. Count-based statistical methods such as `DESeq2` (Love, Huber, and Anders 2014), `edgeR` (Robinson, McCarthy, and Smyth 2009), `limma` with the `voom` method (Law et al. 2014), can be used to that purpose.

By default, we do not use custom DE methods to perform marker detection, for several reasons. Many of these methods rely on empirical Bayes shrinkage to share information across genes in the presence of limited replication. However, this is unnecessary when there are large numbers of “replicate” cells in each group. These methods also make stronger assumptions about the data (e.g., equal variances for linear models, the distribution of variances during empirical Bayes) that are more likely to be violated in noisy scRNA-seq contexts. From a practical perspective, they require more work to set up and take more time to run. Nonetheless, some custom methods (e.g., MAST) may provide a useful point of difference from the simpler tests.

`MAST` is an R/Bioconductor package for managing and analyzing qPCR and sequencing-based single-cell gene expression data, as well as data from other types of single-cell assays. Apart from reading and storing single-cell assay data, the package also provides functionality for significance testing of differential expression using a Hurdle model, gene set enrichment, facilities for visualizing patterns in residuals indicative of differential expression, and power calculations. `MAST` is based on a zero-inflated negative binomial model. It tests for differential expression using a Hurdle model to combine tests of discrete (0 vs not zero) and continuous (non-zero values) aspects of gene expression. Similar to `DESeq2` and `edgeR`, it uses a linear modelling framework to enable complex models to be considered.

### 7.1.8 | Some comments on p-values and DE analysis
_Data dredging_
All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent. The DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis.

The practical effect of data dredging is best illustrated with a simple simulation. If we simulate i.i.d. normal values, perform k-means clustering and test for DE between clusters of cells with `findMarkers()`, the resulting distribution of p-values is heavily skewed towards low values. Thus, we can detect “significant” differences between clusters even in the absence of any real substructure in the data. This effect arises from the fact that clustering, by definition, yields groups of cells that are separated in expression space. Testing for DE genes between clusters will inevitably yield some significant results as that is how the clusters were defined.

For marker gene detection, this effect is largely harmless as the p-values are used only for ranking. However, it becomes an issue when the p-values are used to define “significant differences” between clusters with respect to an error rate threshold. Meaningful interpretation of error rates require consideration of the long-run behaviour, i.e., the rate of incorrect rejections if the experiment were repeated many times. The concept of statistical significance for differences between clusters is not applicable if clusters and their interpretations are not stably reproducible across (hypothetical) replicate experiments.

_Nature of replication_
The naive application of DE analysis methods will treat counts from the same cluster of cells as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample (i.e., cell culture, animal or patient). DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (Lun and Marioni 2017). The latter is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be replicated. Indeed, the use of cells as replicates only masks the fact that the sample size is actually one in an experiment involving a single biological sample. This reinforces the inappropriateness of using the marker gene p-values to perform statistical inference.

We strongly recommend selecting some markers for use in validation studies with an independent replicate population of cells. A typical strategy is to identify a corresponding subset of cells that express the upregulated markers and do not express the downregulated markers. Ideally, a different technique for quantifying expression would also be used during validation, e.g., fluorescent in situ hybridisation or quantitative PCR. This confirms that the subpopulation genuinely exists and is not an artifact of the scRNA-seq protocol or the computational analysis.

_Further comments_
One consequence of the DE analysis strategy is that markers are defined relative to subpopulations in the same dataset. Biologically meaningful genes will not be detected if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. In practice, this is usually only a problem when the experimental data are provided without any biological context - certainly, we would hope to have some a priori idea about what cells have been captured. For most applications, it is actually desirable to avoid detecting such genes as we are interested in characterizing heterogeneity within the context of a known cell population. Continuing from the example above, the failure to detect T cell markers is of little consequence if we already know we are working with T cells.

Alternatively, marker detection can be performed by treating gene expression as a predictor variable for cluster assignment. For a pair of clusters, we can find genes that discriminate between them by performing inference with a logistic model where the outcome for each cell is whether it was assigned to the first cluster and the lone predictor is the expression of each gene. Treating the cluster assignment as the dependent variable is more philosophically pleasing in some sense, as the clusters are indeed defined from the expression data rather than being known in advance. (Note that this does not solve the data snooping problem.) In practice, this approach effectively does the same task as a Wilcoxon rank sum test in terms of quantifying separation between clusters. Logistic models have the advantage in that they can easily be extended to block on multiple nuisance variables, though this is not typically necessary in most use cases. Even more complex strategies use machine learning methods to determine which features contribute most to successful cluster classification, but this is probably unnecessary for routine analyses.

### 7.1.9 | Our approach
After reading about the different options we have for DE analysis, we will be using Wilcoxon rank sum test to find marker genes for the following situations:

* `cell.type`: we are interested in the gene expression differences between excitatory and inhibitory neurons, as they can tell us about the molecular toolkit available to each type and help us generate hypothesis to test their physiological function. Given that we used transgenic lines to target each cell type, we can use the relevant metadata slot to define the groups we will be comparing.

* `celltype_PAGarea`: we are also interested in the differences between the different PAG subdivisions. Each PAG subdivision has been involved in regulating or driving different behaviours. It would be interesting to know whether the cell types in each subdivision can be modulated via similar or different neurotransmitter and neuromodulatory pathways. We can investigate this by using the compound metadata factor we created by combining `cell.type` and `PAG.area`.

* `cluster.ID`: finally, we are also interested in finding marker genes for the clusters we identified using the clustering approach of our choice. 

The next two chunks of code will be the main blocks of our DE analysis. They will allow us to set the main paramteres for the `findMarkers()` function and fine tune the comparisons we want to make. Before we proceed with the different analysis sections, let us take a look at the main parameters we will need for our call to `findMarkers()` and reset them to their default values:
```{r}
# library(scran)
# # Parameters defaults:
# groups <- NULL # clustering results or group for each cell.
# 
# test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
# comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
# direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
# lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
# 
# block <- NULL # factor specifying the blocking level for each cell.
# design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
# restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
# exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
# print("Parameters back to default values")
```

Finally, the following code allows us to run the main analysis using `findMarkers()` and adapt it to each comparison. We can run the following chunk of code several times, changing the parameters for the `test.type`, `pval.type`, `direction`, `lfc`, and others to extract the results of the comparison we want into a .csv file. It also allows us to assign the results to separate variables so we can plot them later on.
```{r}
# # Set the groups to be compared (cluster IDs or metadata slot of interest):
# groups <- PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5
# 
# # Set the string to be added to the filename when saving the results:
# results_name <- "SNN_clusters_cv2_jaccard_k5_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"
# 
# # Create a variable with the tests you want to run:
# desired_tests <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
# desired_comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
# desired_directions <- c("up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".
# 
# # Initialise an empty `data.frame` to populate with a summary of marker genes:
# SNN_clusters_marker_set_summary <- data.frame()
# 
# # Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
# for (test_choice in desired_tests){
#   test <- test_choice
#   for (comparison_choice in desired_comparisons){
#     comparisons <- comparison_choice
#     for (direction_choice in desired_directions){
#       direction <- direction_choice
#       
#       # Set the remaining parameters:
#       lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
#       block <- PAG_sceset_qc_norm_filt_corr_clust$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
#       design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
#       restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
#       exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
#       
#       # Run findMarkers:
#       marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust,
#                                   groups = groups,
#                                   test.type = test,
#                                   pval.type = comparisons,
#                                   direction = direction,
#                                   lfc = lfc,
#                                   assay.type = "logcounts",
#                                   block = block,
#                                   restrict = restrict,
#                                   exclude = exclude)
#       
#       # Save results:
#       saveRDS(marker_genes, file = str_c(path_for_figures, "SNN_clusters/", results_name, 
#                                          "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
#       
#       # Create a variable with a value for each group name to iterate through:
#       range_of_groups <- 1:paste(length(levels(groups)))
#       
#       # Create and export a separate table of results for each group:
#       for (chosen_group in range_of_groups) {
#         # Compose name of variable where results will be assigned to:
#         marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
#         
#         # Assign results of marker genes for current group to a temp variable:
#         temp_marker_set <- marker_genes[[chosen_group]]
#         
#         #print(marker_set) # Print variable name of current group
#         #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
#         #print(temp_marker_set[1:5, 1:3]) # Print top results for current group
#         
#         # Assign the results to the created variable:
#         assign(marker_set, temp_marker_set) # Assign the results to the created variable
#         
#         # Export the results to a .csv file:
#         write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "SNN_clusters/", "PAG_DE_results_", #                                                 results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
#         # Populate the summary data.frame:        
#         SNN_clusters_marker_set_summary = rbind(SNN_clusters_marker_set_summary,
#                                                 data.frame(marker_set = marker_set,
#                                                            group = levels(groups)[chosen_group],
#                                                            test = test,
#                                                            comparisons = comparisons,
#                                                            direction = direction,
#                                                            genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
#                                                            genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
#                                                            genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
#       }
#     }
#   }
# }
# 
# # Check out the summary: 
# SNN_clusters_marker_set_summary[SNN_clusters_marker_set_summary$test=="t",]
# SNN_clusters_marker_set_summary[SNN_clusters_marker_set_summary$test=="wilcox",]
# SNN_clusters_marker_set_summary[SNN_clusters_marker_set_summary$test=="binom",]
```

If we were to run the previous chunk using the default parameters (the three types of test, the three possible types of comparison, and the three possible directions for testing log-fold changes of gene expression) on our graph-based clustering results (which resulted in 11 clusters), we would get 297 files. By removing the direction `any` (which is the same as having "up" and "down" separately) we can reduce it to 198. We can do this for exploratory analysis if we want to compare the output of each type of test. Otherwise, once we have decided on the type of test and the type of comparisons we want to make, we can reduce the output to 22 files (if we have 11 clusters).

## Step 7.2 | Identifying marker genes between excitatory and inhibitory cells
A main advantage of our experimental design is that we know in advance the type of cell we are sequencing. We can thus take advantage of our metadata to perform comparisons between excitatory and inhibitory cells. For this, we will continue with the `SingleCellExperiment` object we created after excluding the cells assigned to the macrophage-enriched cluster, to ensure they do not mask any biological results:
```{r}
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```

To make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

For the `cell.type` comparisons, the fact that we only have two groups (VGAT and VGluT2 cells) means that all options in `pval.type` will yield the same results, and `direction = "up"` in one `cell.type` will be equivalent to `direction = "down"` in the other `cell.type` and viceversa, so by setting `direction = "up"` we will already obtain all the results we want. We will be using the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$cell.type

# Set the string to be added to the filename when saving the results:
results_name <- "cell_type_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".

# Initialise an empty data.frame to populate with a summary of marker genes:
celltype_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  
  # Set the remaining parameters:
  comparisons <- "all" # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
  direction <- "up" # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".
  lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
  block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
  design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
  restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
  exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
  
  # Run findMarkers:
  marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
                              groups = groups,
                              test.type = test,
                              pval.type = comparisons,
                              direction = direction,
                              lfc = lfc,
                              assay.type = "logcounts",
                              block = block,
                              restrict = restrict,
                              exclude = exclude)
  
  # Save results:
  saveRDS(marker_genes, file = str_c(path_for_figures, "cell_type/", results_name, 
                                     "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
  
  # Create a variable with a value for each group name to iterate through:
  range_of_groups <- 1:paste(length(levels(groups)))
  
  # Create and export a separate table of results for each group:
  for (chosen_group in range_of_groups) {
    # Compose name of variable where results will be assigned to:
    marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
    
    # Assign results of marker genes for current group to a temp variable:
    temp_marker_set <- marker_genes[[chosen_group]]
    
    #print(marker_set) # Print variable name of current group
    #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
    #print(temp_marker_set[1:5, 1:3]) # Print top results for current group
    
    # Assign the results to the created variable:
    assign(marker_set, temp_marker_set)
    
    # Export the results to a .csv file:
    write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "cell_type/", "PAG_DE_results_", 
                                                                     results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
    
    # Populate the summary data.frame:
    celltype_marker_set_summary = rbind(celltype_marker_set_summary,
                                        data.frame(marker_set = marker_set,
                                                   group = levels(groups)[chosen_group],
                                                   test = test,
                                                   comparisons = comparisons,
                                                   direction = direction,
                                                   genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                   genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                   genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
    }
}

# Check out the summary:
celltype_marker_set_summary
```

```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_VGAT_wilcox_all_up[marker_set_VGAT_wilcox_all_up$FDR<0.05,])
head(marker_set_VGluT2_wilcox_all_up[marker_set_VGluT2_wilcox_all_up$FDR<0.05,])
```

Once we have obtained the results for each comparison, we can take a quick look at the top DE genes and filter them according the our custom list of genes we appended to the metadata in the begining of the pipeline.
```{r}
# Take a look at the gene lists we have in the metadata
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

```{r}
# Set the FDR threshold you want to look at.
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

VGAT_de_genes <- rownames(marker_set_VGAT_wilcox_all_up[marker_set_VGAT_wilcox_all_up$FDR<fdr_threshold, ])
VGluT2_de_genes <- rownames(marker_set_VGluT2_wilcox_all_up[marker_set_VGluT2_wilcox_all_up$FDR<fdr_threshold, ])

cat("--- Ion Channels ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]

cat("\n\n--- Neuromodulators and Neuropeptides ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]

cat("\n\n--- Monoamines ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]

cat("\n\n--- Neurotransmitters ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]

cat("\n\n--- Transporters ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]

cat("\n\n--- Transcription Factors ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
```

As a final step, we can assign the names of the differentially expressed genes to the `metadata` slot of the `PAG_sceset_qc_norm_filt_corr_clust_de` object, so we can quickly retrieve it for plotting and summarising our results:
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

# DE genes in VGAT cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_VGAT_wilcox_all_up <- rownames(marker_set_VGAT_wilcox_all_up[marker_set_VGAT_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in VGluT2 cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_VGluT2_wilcox_all_up <- rownames(marker_set_VGluT2_wilcox_all_up[marker_set_VGluT2_wilcox_all_up$FDR<fdr_threshold, ])

names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

## Step 7.3 | Identifying marker genes between cell-type and PAG subdivisions
To find interesting genes across PAG subdivisions while accounting for cell-type, we can make use of the combined metadata factor `celltype_PAGarea`, in which each cell will be grouped according to both their type and their PAG area. For this, we continue with the `SingleCellExperiment` object we created after excluding the cells assigned to the macrophage-enriched cluster:
```{r}
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```

To make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

For the `celltype_PAGarea` comparisons, we are up to eight groups (dmPAG, dlPAG, lPAG, vlPAG and VGAT and VGluT2 cells for each of those subdivisions). In this case, we are only interested in the genes specific to each group, so we will set `pval.type=all`. However, after seeing `all` results in nearly no potential markers, we could think of adding `pval.type=some` to the mix. In this case, though, by doing that we would mainly be getting the cell-type specific markers back in, so we will stick with `pval.type=all`. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$celltype_PAGarea

# Set the string to be added to the filename when saving the results:
results_name <- "celltype_PAGarea_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty data.frame to populate with a summary of marker genes:
celltype_PAGarea_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell.
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:  
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "celltype_PAGarea/", results_name,
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups)))
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]] # Assign results of marker genes for current group to a temp variable
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group

        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set)
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "celltype_PAGarea/", "PAG_DE_results_", 
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        
        # Populate the summary data.frame:
        celltype_PAGarea_marker_set_summary = rbind(celltype_PAGarea_marker_set_summary,
                                                    data.frame(marker_set = marker_set,
                                                               group = levels(groups)[chosen_group],
                                                               test = test,
                                                               comparisons = comparisons,
                                                               direction = direction,
                                                               genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                               genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                               genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
        }
    }
  }
}

# Check out the summary: 
celltype_PAGarea_marker_set_summary
```

```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_VGluT2_dlpag_wilcox_all_up[marker_set_VGluT2_dlpag_wilcox_all_up$FDR<0.1,])
```

For `pval.type=all`, we found no marker genes for any of the groups that passed the false discovery rate threshold of `FDR<0.05`. This suggests that if we want to find genes that are differentially expressed across PAG subdivisions we can either look at the results from `pval.type=some` or repeat the analysis following one of two approaches:

* We could compare PAG subdivisions by using the metadata factor `PAG.area` and ignoring the `cell.type`, which would provide us with marker genes to target specific subdivisions.

* Alternatively we could first subset the dataset by `cell.type` so that we are left with excitatory cells and inhibitory cells separately, and then re-run the comparisons by using the metadata factor `PAG.area`.

### 7.3.1 | Marker genes for PAG subdivisions
We will now follow the first of the two approaches delineated above and perform pairwise comparisons to identify marker genes for PAG subdivisions. For this, we continue with the `SingleCellExperiment` object we created after excluding the cells assigned to the macrophage-enriched cluster:
```{r}
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```

As always, to make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

For the `PAG.area` comparisons, we have four groups (dmPAG, dlPAG, lPAG, vlPAG). In this case, we are mainly interested in the genes specific to each group, so we will set `pval.type=all`. Again, after seeing `all` results in very few potential markers, we could add `pval.type=some` to the mix. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$PAG.area

# Set the string to be added to the filename when saving the results:
results_name <- "PAG_subdivision_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty data.frame to populate with a summary of marker genes:
PAGarea_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell.
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:  
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "PAG_subdivision/", results_name,
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups)))
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]] # Assign results of marker genes for current group to a temp variable
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group

        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set)
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "PAG_subdivision/", "PAG_DE_results_",
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        
        # Populate the summary data.frame:
        PAGarea_marker_set_summary = rbind(PAGarea_marker_set_summary,
                                           data.frame(marker_set = marker_set,
                                                      group = levels(groups)[chosen_group],
                                                      test = test,
                                                      comparisons = comparisons,
                                                      direction = direction,
                                                      genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                      genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                      genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
      }
    }
  }
}

# Check out the summary: 
PAGarea_marker_set_summary
```

As we can see, even when looking only at the differences between PAG subdivisions without paying attention to the cell-type we still don't find many potential marker genes. Interestingly though, we can see there are less genes for dmPAG, dlPAG, and lPAG compared to vlPAG. Given that we are using `pval.type=all` this may be a hint that vlPAG is more different from all the other subdivisions than what dm/dl/l are between themselves. To test this, we could repeat the analysis by pooling the metadata factors from dm/dl/l into a single one and compare dorsal PAG vs vlPAG.
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_dlpag_wilcox_all_up[marker_set_dlpag_wilcox_all_up$FDR<0.05,])
```

Once we have obtained the results for each comparison, we can take a quick look at the top DE genes and filter them according the our custom list of genes we appended to the metadata in the begining of the pipeline. __Hint: only Transcription Factors have some results.__
```{r}
# Set the FDR threshold you want to look at.
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

dmpag_de_genes <- rownames(marker_set_dmpag_wilcox_all_up[marker_set_dmpag_wilcox_all_up$FDR<fdr_threshold, ])
dlpag_de_genes <- rownames(marker_set_dlpag_wilcox_all_up[marker_set_dlpag_wilcox_all_up$FDR<fdr_threshold, ])
lpag_de_genes <- rownames(marker_set_lpag_wilcox_all_up[marker_set_lpag_wilcox_all_up$FDR<fdr_threshold, ])
vlpag_de_genes <- rownames(marker_set_vlpag_wilcox_all_up[marker_set_vlpag_wilcox_all_up$FDR<fdr_threshold, ])

# cat("--- Ion Channels ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# 
# cat("\n\n--- Neuromodulators and Neuropeptides ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# 
# cat("\n\n--- Monoamines ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# 
# cat("\n\n--- Neurotransmitters ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# 
# cat("\n\n--- Transporters ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]

cat("\n\n--- Transcription Factors ---\n")
cat("--- dmPAG cells\n")
dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- dlPAG cells\n")
dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- lPAG cells\n")
lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- vlPAG cells\n")
vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
```

Finally, we assign the names of the differentially expressed genes to the `metadata` slot of the `PAG_sceset_qc_norm_filt_corr_clust_de` object, so we can quickly retrieve it for plotting and summarising our results:
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

## Results from `pval = all`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dmpag_wilcox_all_up <- rownames(marker_set_dmpag_wilcox_all_up[marker_set_dmpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dlpag_wilcox_all_up <- rownames(marker_set_dlpag_wilcox_all_up[marker_set_dlpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_lpag_wilcox_all_up <- rownames(marker_set_lpag_wilcox_all_up[marker_set_lpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vlpag_wilcox_all_up <- rownames(marker_set_vlpag_wilcox_all_up[marker_set_vlpag_wilcox_all_up$FDR<fdr_threshold, ])

## Results from `pval = some`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dmpag_wilcox_some_up <- rownames(marker_set_dmpag_wilcox_some_up[marker_set_dmpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dlpag_wilcox_some_up <- rownames(marker_set_dlpag_wilcox_some_up[marker_set_dlpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_lpag_wilcox_some_up <- rownames(marker_set_lpag_wilcox_some_up[marker_set_lpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vlpag_wilcox_some_up <- rownames(marker_set_vlpag_wilcox_some_up[marker_set_vlpag_wilcox_some_up$FDR<fdr_threshold, ])

# Check results were appended correctly
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

### 7.3.2 | Marker genes for PAG subdivisions within cell type
We can repeat the same strategy used for `PAG.area`, but this time using only either VGAT+ or VGluT2+ neurons. This will test for differences across subdivisions within a particular cell type, thus highlighting genes relevant to either inhibitory or excitatory neurons, which will be less stringent than taking all the groups together.

#### Marker genes for PAG subdivisions in VGAT+ neurons
To make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

For the `PAG.area` comparisons within VGAT+ neurons, we are back to having four groups (dmPAG, dlPAG, lPAG, vlPAG). Given that we are still mainly interested in the genes specific to each group, we will set `pval.type=all`. Again, after seeing `all` results in very few potential markers, we could add `pval.type=some` to the mix. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de[,PAG_sceset_qc_norm_filt_corr_clust_de$cell.type=="VGAT"]$PAG.area

# Set the string to be added to the filename when saving the results:
results_name <- "PAG_subdivision_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty data.frame to populate with a summary of marker genes:
PAGarea_VGAT_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de[,PAG_sceset_qc_norm_filt_corr_clust_de$cell.type=="VGAT"]$mouse.sex # factor specifying the blocking level for each cell.
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:  
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de[,PAG_sceset_qc_norm_filt_corr_clust_de$cell.type=="VGAT"],
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "PAG_subdivision/", results_name, 
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups)))
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_vgat_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]] # Assign results of marker genes for current group to a temp variable
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group

        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set)
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "PAG_subdivision/", "PAG_DE_results_", 
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        
        # Populate the summary data.frame:
        PAGarea_VGAT_marker_set_summary = rbind(PAGarea_VGAT_marker_set_summary,
                                                data.frame(marker_set = marker_set,
                                                           group = levels(groups)[chosen_group],
                                                           test = test,
                                                           comparisons = comparisons,
                                                           direction = direction,
                                                           genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                           genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                           genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
        }
    }
  }
}

# Check out the summary: 
PAGarea_VGAT_marker_set_summary
```

Again, even when looking only at the differences between PAG subdivisions within VGAT+ neurons we only find some potential markers in vlPAG neurons, again hinting at the possibility that vlPAG is more different from all the other subdivisions than what dm/dl/l are between themselves. To test this, we could repeat the analysis by pooling the metadata factors from dm/dl/l into a single one and compare dorsal PAG vs vlPAG.
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_vgat_vlpag_wilcox_all_up[marker_set_vgat_vlpag_wilcox_all_up$FDR<0.05,])
```

In the case of VGAT neurons, even when seting `pval.type=some` we find very little or no markers in dm/dl/lpag neurons. We nonetheless keep the few we obtained:
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

## Results from `pval = all`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_dmpag_wilcox_all_up <- rownames(marker_set_vgat_dmpag_wilcox_all_up[marker_set_vgat_dmpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_dlpag_wilcox_all_up <- rownames(marker_set_vgat_dlpag_wilcox_all_up[marker_set_vgat_dlpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_lpag_wilcox_all_up <- rownames(marker_set_vgat_lpag_wilcox_all_up[marker_set_vgat_lpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_vlpag_wilcox_all_up <- rownames(marker_set_vgat_vlpag_wilcox_all_up[marker_set_vgat_vlpag_wilcox_all_up$FDR<fdr_threshold, ])

## Results from `pval = some`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_dmpag_wilcox_some_up <- rownames(marker_set_vgat_dmpag_wilcox_some_up[marker_set_vgat_dmpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_dlpag_wilcox_some_up <- rownames(marker_set_vgat_dlpag_wilcox_some_up[marker_set_vgat_dlpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_lpag_wilcox_all_up <- rownames(marker_set_vgat_lpag_wilcox_all_up[marker_set_vgat_lpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vgat_vlpag_wilcox_all_up <- rownames(marker_set_vgat_vlpag_wilcox_all_up[marker_set_vgat_vlpag_wilcox_all_up$FDR<fdr_threshold, ])

# Check results were appended correctly
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

#### Marker genes for PAG subdivision in VGluT2+ neurons
We next repeat the same for VGluT2 neurons. To make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

For the `PAG.area` comparisons within VGluT2+ neurons, we again have four groups (dmPAG, dlPAG, lPAG, vlPAG). Given that we are still mainly interested in the genes specific to each group, we will set `pval.type=all`. Again, after seeing `all` results in very few potential markers, we could add `pval.type=some` to the mix. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de[,PAG_sceset_qc_norm_filt_corr_clust_de$cell.type=="VGluT2"]$PAG.area

# Set the string to be added to the filename when saving the results:
results_name <- "PAG_subdivision_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty data.frame to populate with a summary of marker genes:
PAGarea_VGluT2_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de[,PAG_sceset_qc_norm_filt_corr_clust_de$cell.type=="VGluT2"]$mouse.sex # factor specifying the blocking level for each cell.
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:  
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de[,PAG_sceset_qc_norm_filt_corr_clust_de$cell.type=="VGluT2"],
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "PAG_subdivision/", results_name,
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups)))
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_vglut2_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]] # Assign results of marker genes for current group to a temp variable
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group

        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set)
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "PAG_subdivision/", "PAG_DE_results_", 
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        
        # Populate the summary data.frame:
        PAGarea_VGluT2_marker_set_summary = rbind(PAGarea_VGluT2_marker_set_summary,
                                                  data.frame(marker_set = marker_set,
                                                             group = levels(groups)[chosen_group],
                                                             test = test,
                                                             comparisons = comparisons,
                                                             direction = direction,
                                                             genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                             genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                             genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
        }
    }
  }
}

# Check out the summary: 
PAGarea_VGluT2_marker_set_summary
```

Even though we are still looking only at the differences between PAG subdivisions within VGluT2+ neurons, in this case we seem to have slightly more potential markers between dmPAG and dlPAG neurons, and nearly none for lPAG and vlPAG. Whereas for VGAT neurons it was the vlPAG that seemed more different than the other three, in the case of VGluT2 neurons it seems to be dm/dlPAG that have more markers than l/vlPAG.
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_vglut2_dlpag_wilcox_all_up[marker_set_vglut2_dlpag_wilcox_all_up$FDR<0.05,])
```

Again, we store the genes that pass the false discovery rate threshold in the `SingleCellExperiment` object.
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

## Results from `pval = all`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_dmpag_wilcox_all_up <- rownames(marker_set_vglut2_dmpag_wilcox_all_up[marker_set_vglut2_dmpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_dlpag_wilcox_all_up <- rownames(marker_set_vglut2_dlpag_wilcox_all_up[marker_set_vglut2_dlpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_lpag_wilcox_all_up <- rownames(marker_set_vglut2_lpag_wilcox_all_up[marker_set_vglut2_lpag_wilcox_all_up$FDR<fdr_threshold, ])

# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_vlpag_wilcox_all_up <- rownames(marker_set_vglut2_vlpag_wilcox_all_up[marker_set_vglut2_vlpag_wilcox_all_up$FDR<fdr_threshold, ])

## Results from `pval = some`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_dmpag_wilcox_some_up <- rownames(marker_set_vglut2_dmpag_wilcox_some_up[marker_set_vglut2_dmpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_dlpag_wilcox_some_up <- rownames(marker_set_vglut2_dlpag_wilcox_some_up[marker_set_vglut2_dlpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_lpag_wilcox_some_up <- rownames(marker_set_vglut2_lpag_wilcox_some_up[marker_set_vglut2_lpag_wilcox_some_up$FDR<fdr_threshold, ])

# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vglut2_vlpag_wilcox_some_up <- rownames(marker_set_vglut2_vlpag_wilcox_some_up[marker_set_vglut2_vlpag_wilcox_some_up$FDR<fdr_threshold, ])

# Check results were appended correctly
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

## Step 7.4 | Identifying marker genes between clusters
For the `cluster.ID` comparisons, the number of groups will depend on the parameters we chose. For the graph-based clustering results, we will run the pairwise comparisons both for a low value of `k` (e.g. `k=5`, which resulted in 11 clusters) and for a higher value of `k` (e.g. `k = 13`, which resulted in 8 clusters). For the SC3 clustering results, we will run the comparisons for the case of `k=4`, as it cleanly splits dm/dl/lPAG and vlPAG cells for both inhibitory and excitatory cells. Given that we are still mainly interested in the genes specific to each group, we will set `pval.type=all`. Again, after seeing `all` results in very few potential markers, we could add `some` to the mix. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.

### 7.4.1 | Marker genes for SNN clusters (k = 5)
We can start by refreshing how the clustering results looked like when using `k = 5`:
```{r}
UMAP_SNN_JaccardLouvain_k5
```

```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$SNN_clusters_cv2_jaccard_k5

# Set the string to be added to the filename when saving the results:
results_name <- "SNN_clusters_cv2_jaccard_k5_" # one of "SNN_clusters_cv2_jaccard_k13_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty `data.frame` to populate with a summary of marker genes:
SNN_clusters_k5_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- 10 # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "SNN_clusters/", results_name, 
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups))-1) # subtract 1 to account for the exclusion of the macrophage-enriched cluster
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_SNN_k5_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]]
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group
        
        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set) # Assign the results to the created variable
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "SNN_clusters/", "PAG_DE_results_", 
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        # Populate the summary data.frame:        
        SNN_clusters_k5_marker_set_summary = rbind(SNN_clusters_k5_marker_set_summary,
                                                   data.frame(marker_set = marker_set,
                                                              group = levels(groups)[chosen_group],
                                                              test = test,
                                                              comparisons = comparisons,
                                                              direction = direction,
                                                              genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                              genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                              genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
      }
    }
  }
}

# Check out the summary: 
SNN_clusters_k5_marker_set_summary
```

```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_SNN_k5_1_wilcox_some_up[marker_set_SNN_k5_1_wilcox_some_up$FDR<0.05,])[1:2]
```

As it happened in the case of `celltype_PAGarea`, we observe that using the `pval.type=all` finds little or no marker genes that are specific for each cluster. We do have a few when using `pval.type=some`, so we will store the results from both approaches in the `metadata` slot.
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

## Results from `pval = all`
# DE genes in cluster 1
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_1_wilcox_all_up <- rownames(marker_set_SNN_k5_1_wilcox_all_up[marker_set_SNN_k5_1_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 2
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_2_wilcox_all_up <- rownames(marker_set_SNN_k5_2_wilcox_all_up[marker_set_SNN_k5_2_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 3
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_3_wilcox_all_up <- rownames(marker_set_SNN_k5_3_wilcox_all_up[marker_set_SNN_k5_3_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 4
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_4_wilcox_all_up <- rownames(marker_set_SNN_k5_4_wilcox_all_up[marker_set_SNN_k5_4_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 5
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_5_wilcox_all_up <- rownames(marker_set_SNN_k5_5_wilcox_all_up[marker_set_SNN_k5_5_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 6
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_6_wilcox_all_up <- rownames(marker_set_SNN_k5_6_wilcox_all_up[marker_set_SNN_k5_6_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 7
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_7_wilcox_all_up <- rownames(marker_set_SNN_k5_7_wilcox_all_up[marker_set_SNN_k5_7_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 8
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_8_wilcox_all_up <- rownames(marker_set_SNN_k5_8_wilcox_all_up[marker_set_SNN_k5_8_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 9
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_9_wilcox_all_up <- rownames(marker_set_SNN_k5_9_wilcox_all_up[marker_set_SNN_k5_9_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 11 (we removed cluster 10)
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_10_wilcox_all_up <- rownames(marker_set_SNN_k5_10_wilcox_all_up[marker_set_SNN_k5_10_wilcox_all_up$FDR<fdr_threshold, ])

## Results from `pval = some`
# DE genes in cluster 1
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_1_wilcox_some_up <- rownames(marker_set_SNN_k5_1_wilcox_some_up[marker_set_SNN_k5_1_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 2
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_2_wilcox_some_up <- rownames(marker_set_SNN_k5_2_wilcox_some_up[marker_set_SNN_k5_2_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 3
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_3_wilcox_some_up <- rownames(marker_set_SNN_k5_3_wilcox_some_up[marker_set_SNN_k5_3_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 4
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_4_wilcox_some_up <- rownames(marker_set_SNN_k5_4_wilcox_some_up[marker_set_SNN_k5_4_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 5
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_5_wilcox_some_up <- rownames(marker_set_SNN_k5_5_wilcox_some_up[marker_set_SNN_k5_5_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 6
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_6_wilcox_some_up <- rownames(marker_set_SNN_k5_6_wilcox_some_up[marker_set_SNN_k5_6_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 7
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_7_wilcox_some_up <- rownames(marker_set_SNN_k5_7_wilcox_some_up[marker_set_SNN_k5_7_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 8
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_8_wilcox_some_up <- rownames(marker_set_SNN_k5_8_wilcox_some_up[marker_set_SNN_k5_8_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 9
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_9_wilcox_some_up <- rownames(marker_set_SNN_k5_9_wilcox_some_up[marker_set_SNN_k5_9_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 11 (we removed cluster 10)
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k5_10_wilcox_some_up <- rownames(marker_set_SNN_k5_10_wilcox_some_up[marker_set_SNN_k5_10_wilcox_some_up$FDR<fdr_threshold, ])

# Check results were appended correctly
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

### 7.4.2 | Marker genes for SNN clusters (k = 13)
We repeat the analysis for the results obtained using `k = 13`:
```{r}
UMAP_SNN_JaccardLouvain_k13
```

```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$SNN_clusters_cv2_jaccard_k13

# Set the string to be added to the filename when saving the results:
results_name <- "SNN_clusters_cv2_jaccard_k13_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty `data.frame` to populate with a summary of marker genes:
SNN_clusters_k13_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- 7 # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "SNN_clusters/", results_name, 
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups))-1) # subtract 1 to account for the exclusion of the macrophage-enriched cluster
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_SNN_k13_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]]
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group
        
        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set) # Assign the results to the created variable
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "SNN_clusters/", "PAG_DE_results_", 
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        # Populate the summary data.frame:        
        SNN_clusters_k13_marker_set_summary = rbind(SNN_clusters_k13_marker_set_summary,
                                                   data.frame(marker_set = marker_set,
                                                              group = levels(groups)[chosen_group],
                                                              test = test,
                                                              comparisons = comparisons,
                                                              direction = direction,
                                                              genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                              genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                              genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
      }
    }
  }
}

# Check out the summary: 
SNN_clusters_k13_marker_set_summary
```

```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_SNN_k13_3_wilcox_all_up[marker_set_SNN_k13_3_wilcox_all_up$FDR<0.05,])[1:2]
```

Once again, some clusters are left without specific markers when using `pval.type=all`, so we will store the results from both approaches into the `metadata` slot.
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

## Results from `pval = all`
# DE genes in cluster 1
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_1_wilcox_all_up <- rownames(marker_set_SNN_k13_1_wilcox_all_up[marker_set_SNN_k13_1_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 2
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_2_wilcox_all_up <- rownames(marker_set_SNN_k13_2_wilcox_all_up[marker_set_SNN_k13_2_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 3
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_3_wilcox_all_up <- rownames(marker_set_SNN_k13_3_wilcox_all_up[marker_set_SNN_k13_3_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 4
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_4_wilcox_all_up <- rownames(marker_set_SNN_k13_4_wilcox_all_up[marker_set_SNN_k13_4_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 5
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_5_wilcox_all_up <- rownames(marker_set_SNN_k13_5_wilcox_all_up[marker_set_SNN_k13_5_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 6
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_6_wilcox_all_up <- rownames(marker_set_SNN_k13_6_wilcox_all_up[marker_set_SNN_k13_6_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 8 (we removed cluster 7)
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_7_wilcox_all_up <- rownames(marker_set_SNN_k13_7_wilcox_all_up[marker_set_SNN_k13_7_wilcox_all_up$FDR<fdr_threshold, ])

## Results from `pval = some`
# DE genes in cluster 1
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_1_wilcox_some_up <- rownames(marker_set_SNN_k13_1_wilcox_some_up[marker_set_SNN_k13_1_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 2
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_2_wilcox_some_up <- rownames(marker_set_SNN_k13_2_wilcox_some_up[marker_set_SNN_k13_2_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 3
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_3_wilcox_some_up <- rownames(marker_set_SNN_k13_3_wilcox_some_up[marker_set_SNN_k13_3_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 4
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_4_wilcox_some_up <- rownames(marker_set_SNN_k13_4_wilcox_some_up[marker_set_SNN_k13_4_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 5
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_5_wilcox_some_up <- rownames(marker_set_SNN_k13_5_wilcox_some_up[marker_set_SNN_k13_5_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 6
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_6_wilcox_some_up <- rownames(marker_set_SNN_k13_6_wilcox_some_up[marker_set_SNN_k13_6_wilcox_some_up$FDR<fdr_threshold, ])
# DE genes in cluster 8 (we removed cluster 7)
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SNN_k13_7_wilcox_some_up <- rownames(marker_set_SNN_k13_7_wilcox_some_up[marker_set_SNN_k13_7_wilcox_some_up$FDR<fdr_threshold, ])

# Check results were appended correctly
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

### 7.4.3 | Marker genes for SC3 clusters (k = 4)
Finally, we will run the `findMarkers()` function on the results obtained by clustering with SC3 and `k=4`, as it is the clustering approach that better captures the two main clusters for each excitatory and inhibitory cells. This would be the closest to comparing dm/dl versus l/vl without using the actual metadata, and hopefully yields more markers than those obtained with the other approaches. We can first refresh the clustering results using `k = 4`:
```{r}
UMAP_SC3_k4
```

```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.

test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.

block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```

```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$sc3_4_clusters

# Set the string to be added to the filename when saving the results:
results_name <- "SC3_clusters_k4_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"

# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".

# Initialise an empty `data.frame` to populate with a summary of marker genes:
SC3_clusters_k4_marker_set_summary <- data.frame()

# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
  test <- test_choice
  for (comparison_choice in desired_comparisons){
    comparisons <- comparison_choice
    for (direction_choice in desired_directions){
      direction <- direction_choice
      
      # Set the remaining parameters:
      lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
      block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
      design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
      restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
      exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
      
      # Run findMarkers:
      marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
                                  groups = groups,
                                  test.type = test,
                                  pval.type = comparisons,
                                  direction = direction,
                                  lfc = lfc,
                                  assay.type = "logcounts",
                                  block = block,
                                  restrict = restrict,
                                  exclude = exclude)
      
      # Save results:
      saveRDS(marker_genes, file = str_c(path_for_figures, "SC3_clusters/", results_name, 
                                         "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
      
      # Create a variable with a value for each group name to iterate through:
      range_of_groups <- 1:paste(length(levels(groups)))
      
      # Create and export a separate table of results for each group:
      for (chosen_group in range_of_groups) {
        # Compose name of variable where results will be assigned to:
        marker_set <- str_c("marker_set_SC3_k4_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
        
        # Assign results of marker genes for current group to a temp variable:
        temp_marker_set <- marker_genes[[chosen_group]]
        
        #print(marker_set) # Print variable name of current group
        #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
        #print(temp_marker_set[1:5, 1:3]) # Print top results for current group
        
        # Assign the results to the created variable:
        assign(marker_set, temp_marker_set) # Assign the results to the created variable
        
        # Export the results to a .csv file:
        write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "SC3_clusters/", "PAG_DE_results_", 
                                                                         results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
        # Populate the summary data.frame:        
        SC3_clusters_k4_marker_set_summary = rbind(SC3_clusters_k4_marker_set_summary,
                                                   data.frame(marker_set = marker_set,
                                                              group = levels(groups)[chosen_group],
                                                              test = test,
                                                              comparisons = comparisons,
                                                              direction = direction,
                                                              genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
                                                              genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
                                                              genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
      }
    }
  }
}

# Check out the summary: 
SC3_clusters_k4_marker_set_summary
```

Indeed, this time we get a few more markers with the `pval.type=all`, suggesting that we may have been overclustering with the other approaches.
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_SC3_k4_1_wilcox_all_up[marker_set_SC3_k4_1_wilcox_all_up$FDR<0.05,]) # VGluT2 cluster (dm/dl)
head(marker_set_SC3_k4_4_wilcox_all_up[marker_set_SC3_k4_4_wilcox_all_up$FDR<0.05,]) # VGluT2 cluster (l/vl)

head(marker_set_SC3_k4_2_wilcox_all_up[marker_set_SC3_k4_2_wilcox_all_up$FDR<0.05,]) # VGAT cluster (dm/dl)
head(marker_set_SC3_k4_3_wilcox_all_up[marker_set_SC3_k4_3_wilcox_all_up$FDR<0.05,]) # VGAT cluster (l/vl)
```

We can quickly examine whether there are any interesting genes
```{r}
# Set the FDR threshold you want to look at.
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

VGluT2_dmdl_cluster_de_genes <- rownames(marker_set_SC3_k4_1_wilcox_all_up[marker_set_SC3_k4_1_wilcox_all_up$FDR<fdr_threshold, ])
VGluT2_lvl_cluster_de_genes <- rownames(marker_set_SC3_k4_4_wilcox_all_up[marker_set_SC3_k4_4_wilcox_all_up$FDR<fdr_threshold, ])
VGAT_dmdl_cluster_de_genes <- rownames(marker_set_SC3_k4_2_wilcox_all_up[marker_set_SC3_k4_2_wilcox_all_up$FDR<fdr_threshold, ])
VGAT_lvl_cluster_de_genes <- rownames(marker_set_SC3_k4_3_wilcox_all_up[marker_set_SC3_k4_3_wilcox_all_up$FDR<fdr_threshold, ])

cat("--- Ion Channels ---\n")
cat("--- VGluT2 cells (dm/dl cluster)\n")
VGluT2_dmdl_cluster_de_genes[VGluT2_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
cat("--- VGluT2 cells (l/vl cluster)\n")
VGluT2_lvl_cluster_de_genes[VGluT2_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
cat("--- VGAT cells (dm/dl cluster)\n")
VGAT_dmdl_cluster_de_genes[VGAT_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
cat("--- VGAT cells (l/vl cluster)\n")
VGAT_lvl_cluster_de_genes[VGAT_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]

cat("\n\n--- Neuromodulators and Neuropeptides ---\n")
cat("--- VGluT2 cells (dm/dl cluster)\n")
VGluT2_dmdl_cluster_de_genes[VGluT2_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
cat("--- VGluT2 cells (l/vl cluster)\n")
VGluT2_lvl_cluster_de_genes[VGluT2_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
cat("--- VGAT cells (dm/dl cluster)\n")
VGAT_dmdl_cluster_de_genes[VGAT_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
cat("--- VGAT cells (l/vl cluster)\n")
VGAT_lvl_cluster_de_genes[VGAT_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]

cat("\n\n--- Monoamines ---\n")
cat("--- VGluT2 cells (dm/dl cluster)\n")
VGluT2_dmdl_cluster_de_genes[VGluT2_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
cat("--- VGluT2 cells (l/vl cluster)\n")
VGluT2_lvl_cluster_de_genes[VGluT2_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
cat("--- VGAT cells (dm/dl cluster)\n")
VGAT_dmdl_cluster_de_genes[VGAT_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
cat("--- VGAT cells (l/vl cluster)\n")
VGAT_lvl_cluster_de_genes[VGAT_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]

cat("\n\n--- Transporters ---\n")
cat("--- VGluT2 cells (dm/dl cluster)\n")
VGluT2_dmdl_cluster_de_genes[VGluT2_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("--- VGluT2 cells (l/vl cluster)\n")
VGluT2_lvl_cluster_de_genes[VGluT2_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("--- VGAT cells (dm/dl cluster)\n")
VGAT_dmdl_cluster_de_genes[VGAT_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("--- VGAT cells (l/vl cluster)\n")
VGAT_lvl_cluster_de_genes[VGAT_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]

cat("\n\n--- Transcription Factors ---\n")
cat("--- VGluT2 cells (dm/dl cluster)\n")
VGluT2_dmdl_cluster_de_genes[VGluT2_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- VGluT2 cells (l/vl cluster)\n")
VGluT2_lvl_cluster_de_genes[VGluT2_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- VGAT cells (dm/dl cluster)\n")
VGAT_dmdl_cluster_de_genes[VGAT_dmdl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- VGAT cells (l/vl cluster)\n")
VGAT_lvl_cluster_de_genes[VGAT_lvl_cluster_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
```

We store the results in the `metadata`:
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be

## Results from `pval = all`
# DE genes in cluster 1
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SC3_k4_1_wilcox_all_up <- rownames(marker_set_SC3_k4_1_wilcox_all_up[marker_set_SC3_k4_1_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 2
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SC3_k4_2_wilcox_all_up <- rownames(marker_set_SC3_k4_2_wilcox_all_up[marker_set_SC3_k4_2_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 3
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SC3_k4_3_wilcox_all_up <- rownames(marker_set_SC3_k4_3_wilcox_all_up[marker_set_SC3_k4_3_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in cluster 4
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_SC3_k4_wilcox_all_up <- rownames(marker_set_SC3_k4_4_wilcox_all_up[marker_set_SC3_k4_4_wilcox_all_up$FDR<fdr_threshold, ])

# Check results were appended correctly
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```

## Step 7.5 | Save the SingleCellExperiment object
After extracting and saving the differential expression results obtained with `findMarkers()`, we save the `SingleCellExperiment` object so we can use it for data visualisation.
```{r}
saveRDS(PAG_sceset_qc_norm_filt_corr_clust_de, file = "PAG_sceset_qc_norm_filt_corr_clust_de.rds")
print("Part 7 - Done!")
```

```{r}
sessionInfo()
```