-
Notifications
You must be signed in to change notification settings - Fork 0
/
PAG_scRNAseq_analysis_Part7.Rmd
1847 lines (1520 loc) · 139 KB
/
PAG_scRNAseq_analysis_Part7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Topographic, single-cell gene expression profiling of Periaqueductal Gray neurons"
subtitle: "Part VII: differential expression analysis"
author:
- name: "Oriol Pavon Arocas, Sarah F. Olesen, and Tiago Branco"
affiliation: "Sainsbury Wellcome Centre for Neural Circuits and Behaviour, University College London, UK"
email: "oriol.pavon.16@ucl.ac.uk"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
html_notebook:
highlight: pygments
number_sections: FALSE
theme: lumen
toc: TRUE
toc_float: TRUE
#output: rmdformats::readthedown:
#highlight: pygments
---
***
This is a pipeline to analyse single-cell RNA sequencing data from Periaqueductal Gray neurons (1) isolated from acute midbrain slices of transgenic mice using visually guided aspiration via patch pipettes and (2) processed using SMART-seq2 (Picelli et al., Nature Protocols 2014).
This pipeline has been generated after attending the [EMBL-EBI RNA-Sequence Analysis Course](https://www.ebi.ac.uk/training/events/2019/rna-sequence-analysis) and [attending](https://training.csx.cam.ac.uk/bioinformatics/event/2823386) and following the online course on [Analysis of single cell RNA-seq data](https://github.com/hemberg-lab/scRNA.seq.course) by the [Hemberg Lab](https://www.sanger.ac.uk/science/groups/hemberg-group). Many other resources have been used, including the [Orchestrating Single-Cell Analysis with Bioconductor book](https://bioconductor.org/books/release/OSCA/) by Robert Amezquita, Aaron Lun, Stephanie Hicks, and Raphael Gottardo, the [simpleSingleCell workflow in Bioconductor](https://bioconductor.org/packages/3.9/workflows/html/simpleSingleCell.html) maintained by Aaron Lun, the [rnaseqGene workflow](https://bioconductor.org/packages/3.10/workflows/html/rnaseqGene.html) maintained by Michael Love, the [RNAseq123 workflow](https://bioconductor.org/packages/3.10/workflows/html/RNAseq123.html) maintained by Matthew Ritchie, and the [EGSEA123 workflow](https://bioconductor.org/packages/3.10/workflows/html/EGSEA123.html) maintained by Matthew Ritchie.
Other key resources include [Bioconductor](http://www.bioconductor.org/) (Huber et al., Nature Methods 2015), [scRNA-tools](https://www.scrna-tools.org/), `scater` (McCarthy et al., Bioinformatics 2017), `scran` (Lun et al., F1000Res 2016), `SC3` (Kiselev et al., Nature Methods 2017), `Seurat` (Butler et al., Nature Biotechnology 2018), `clusterExperiment` (Risso et al., PLOS Computational Biology 2018), `limma` (Ritchie et al., Nucleic Acids Research 2015), `DESeq2` (Love et al., Genome Biology 2014), `edgeR` (Robinson et al., Bioinformatics 2010), `MAST` (Finak, McDavid, Yajima et al., Genome Biology 2015), `iSEE` (Rue-Albrecht & Marini et al., F1000Research 2018), `t-SNE` (van der Maaten & Hinton, Journal of Machine Learning Research 2008), `UMAP` (McInnes et al., arXiv 2018), and the [Mathematical Statistics and Machine Learning for Life Sciences](https://towardsdatascience.com/tagged/stats-ml-life-sciences) column by Nikolay Oskolkov.
***
# STEP 7 | Differential Expression Analysis
One of the most common types of analyses when working with bulk RNA-seq data is to identify differentially expressed genes. By comparing the genes that change between two conditions, e.g. mutant and wild-type or stimulated and unstimulated, it is possible to characterize the molecular mechanisms underlying the change. Several methods, e.g. `DESeq2` and `edgeR`, have been developed for bulk RNA-seq.
In scRNA-seq we usually do not have a defined set of experimental conditions. Instead, we can identify the cell groups by using an unsupervised clustering approach. Once the groups have been identified one can find differentially expressed genes either by comparing the differences in variance between the groups (like the Kruskal-Wallis test implemented in `SC3`), or by comparing gene expression between clusters in a pairwise manner. In our case, in addition to clusters we have experimental conditions we can test, namely the cell type or PAG subdivision from where the cells were collected, which makes it possible for us to follow either approach.
```{r}
# Set the directory where your data and scripts are:
setwd("D:/Dropbox (UCL - SWC)/Project_transcriptomics/analysis/PAG_scRNAseq_analysis")
# Set the path to save figures from this Part:
path_for_figures <- "D:/Dropbox (UCL - SWC)/Project_transcriptomics/figures/R_figures_Part7_DE/"
date <- Sys.Date()
date <- gsub("-", "_", date)
# Load packages:
library(tidyverse)
library(SingleCellExperiment)
library(scater)
library(scran)
library(ggplot2)
library(pheatmap)
library(DESeq2)
library(edgeR)
library(limma)
library(monocle)
library(MAST)
library(ROCR)
library(patchwork)
library(extrafont)
```
We continue using the `PAG_sceset_qc_norm_filt_corr_clust` after normalisation, filtering, batch correction, and clustering. We should thus have a `corrected` slot in `assays`:
```{r}
# If starting from stored results, load saved filtered dataset from previous Step:
options(stringsAsFactors = FALSE)
PAG_sceset_qc_norm_filt_corr_clust <- readRDS("PAG_sceset_qc_norm_filt_corr_clust.rds") # Contains filtered cells and genes, log-normalised, filtered, corrected, and clustered data
assayNames(PAG_sceset_qc_norm_filt_corr_clust)
reducedDimNames(PAG_sceset_qc_norm_filt_corr_clust)
```
We first check we have the relevant metadata, making sure the key annotations are `factor`:
```{r}
# Check you have the relevant metadata as factors:
tail(colnames(colData(PAG_sceset_qc_norm_filt_corr_clust)))
class(PAG_sceset_qc_norm_filt_corr_clust$mouse.id)
class(PAG_sceset_qc_norm_filt_corr_clust$cell.type)
class(PAG_sceset_qc_norm_filt_corr_clust$PAG.area)
levels(PAG_sceset_qc_norm_filt_corr_clust$celltype_PAGarea)
```
## Step 7.0 | Recap of clustering results
To remind ourselves about which clustering results we have available, we can look at the names of the `colData` slot:
```{r}
names(colData(PAG_sceset_qc_norm_filt_corr_clust)[grep("cluster", names(colData(PAG_sceset_qc_norm_filt_corr_clust)))])
```
If we wanted to export the cluster IDs of each cell in our dataset for a particuler clustering approach, we could do it as follows:
```{r}
#groups <- as.data.frame(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5)
#rownames(groups) <- PAG_sceset_qc_norm_filt_corr_clust$cell.id
#groups
#write.csv(groups, file = str_c(path_for_figures, "SNN_clusters/", "cluster_ids.csv"), quote = FALSE, row.names = TRUE)
```
We can also replot the results of our chosen clustering solution before we attempt to find marker genes:
```{r}
# Set theme parameters
theme_args_SNN_clustering <- theme(plot.title = element_text(size = 11, face = "bold"),
axis.title = element_text(size = 11, face = "plain"),
axis.text = element_text(size = 10, face = "plain"),
legend.text = element_text(size = 9, face = "plain"),
strip.text = element_text(size = 10, face = "plain"))
### Using UMAP (CV2) - 15_neighbors_0.01_min_dist ###
UMAP_cv2_15_001_PAGsubdivisions <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "PAG.area", shape_by = "cell.type",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "PAG subdivisions", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
# Using HVG from CV2 with rank-based weights
UMAP_cv2_15_001_RankWalktrap_k7 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "SNN_clusters_cv2_rank_k7", shape_by = "cell.type", text_by = "SNN_clusters_cv2_rank_k7",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "Rank-Walktrap (k = 7)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
UMAP_cv2_15_001_RankWalktrap_k12 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "SNN_clusters_cv2_rank_k12", shape_by = "cell.type", text_by = "SNN_clusters_cv2_rank_k12",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "Rank-Walktrap (k = 12)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
# Using HVG from CV2 with jaccard-based weights
UMAP_cv2_15_001_JaccardLouvain_k5 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "SNN_clusters_cv2_jaccard_k5", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k5",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "Jaccard-Louvain (k = 5)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
UMAP_cv2_15_001_JaccardLouvain_k8 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "SNN_clusters_cv2_jaccard_k8", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k8",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "Jaccard-Louvain (k = 8)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
UMAP_cv2_15_001_JaccardLouvain_k9 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "SNN_clusters_cv2_jaccard_k9", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k9",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "Jaccard-Louvain (k = 9)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
UMAP_cv2_15_001_JaccardLouvain_k13 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "SNN_clusters_cv2_jaccard_k13", shape_by = "cell.type", text_by = "SNN_clusters_cv2_jaccard_k13",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "Jaccard-Louvain (k = 13)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
# Using SC3 with k = 4
UMAP_cv2_15_001_SC3_4 <- plotReducedDim(PAG_sceset_qc_norm_filt_corr_clust,
dimred = "UMAP_cv2_corrected_15_neighbors_0.01_min_dist",
colour_by = "sc3_4_clusters", shape_by = "cell.type", text_by = "sc3_4_clusters",
point_alpha = 0.6, point_size = 2, theme_size = 11
) + labs(title = "SC3 (k = 4)", x = "UMAP 1", y = "UMAP 2") + theme_args_SNN_clustering
# Prepare and save composed plots
library(patchwork)
UMAP_SNN_RankWalktrap_k7 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_RankWalktrap_k7) +
plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_RankWalktrap_k7
UMAP_SNN_RankWalktrap_k12 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_RankWalktrap_k12) +
plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_RankWalktrap_k12
UMAP_SNN_JaccardLouvain_k5 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k5) +
plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k5
UMAP_SNN_JaccardLouvain_k8 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k8) +
plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k8
UMAP_SNN_JaccardLouvain_k9 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k9) +
plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k9
UMAP_SNN_JaccardLouvain_k13 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_JaccardLouvain_k13) +
plot_annotation(title = "Graph-based clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SNN_JaccardLouvain_k13
UMAP_SC3_k4 <- (UMAP_cv2_15_001_PAGsubdivisions + UMAP_cv2_15_001_SC3_4) +
plot_annotation(title = "SC3 clustering on UMAP (CV2)", tag_levels = "A",
theme = theme(plot.title = element_text(size = 12, face = "bold"), plot.tag = element_text(size = 11))) +
plot_layout(ncol = 2, guides = "collect")
UMAP_SC3_k4
ggsave(filename = str_c(date, "_UMAP_SNN_rank_k7_clusters.pdf"),
plot = UMAP_SNN_RankWalktrap_k7,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
ggsave(filename = str_c(date, "_UMAP_SNN_rank_k12_clusters.pdf"),
plot = UMAP_SNN_RankWalktrap_k12,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k5_clusters.pdf"),
plot = UMAP_SNN_JaccardLouvain_k5,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k8_clusters.pdf"),
plot = UMAP_SNN_JaccardLouvain_k8,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k9_clusters.pdf"),
plot = UMAP_SNN_JaccardLouvain_k9,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
ggsave(filename = str_c(date, "_UMAP_SNN_jaccard_k13_clusters.pdf"),
plot = UMAP_SNN_JaccardLouvain_k13,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
ggsave(filename = str_c(date, "_UMAP_sc3_k4_clusters.pdf"),
plot = UMAP_SC3_k4,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
```
Before we begin with the differential expression analysis, we will remove the cells assigned to the macrophage-enriched cluster from the dataset so that they don't affect the results. As we saw in the previous section, there should be 12 cells with high levels of expression for macrophage and immune response markers like "Cd33", "Plaur", "Cxcl16", or "Cd68".
```{r}
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_rank_k7==8), " cells in SNN_var_rank_k7 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_rank_k12==7), " cells in SNN_var_rank_k12 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5==10), " cells in SNN_cv2_jaccard_k5 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k8==1), " cells in SNN_cv2_jaccard_k8 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k9==3), " cells in SNN_cv2_jaccard_k9 method")
paste0("Macrophage cluster has ", sum(PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k13==7), " cells in SNN_cv2_jaccard_k13 method")
```
```{r}
# Create a new SingleCellExperiment object by subsetting to leave out the cells from the macrophage-enriched cluster:
PAG_sceset_qc_norm_filt_corr_clust_de <- PAG_sceset_qc_norm_filt_corr_clust[, PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5!=10]
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```
## Step 7.1 | Overview of approaches to identify marker genes between subpopulations
To interpret the clustering results, we can identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways.
Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have caused separate clustering of cells in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster.
### 7.1.1 | Using pairwise t-tests
The Welch t-test is a statistical method that allows us to *test for differences in expression between clusters*. It is quickly computed and has good statistical properties for large numbers of cells (Soneson and Robinson 2018). The `findMarkers()` function from the `scran` package provides a convenience wrapper for marker gene identification between groups of cells, based on running `pairwiseTTests` or related functions and passing the result to `combineMarkers()`. We can use the `findMarkers()` function to perform pairwise comparisons between clusters for each gene (the function performs Welch t-tests on the normalised log-expression values for every gene and between every pair of clusters), which returns a list of `DataFrame`s containing ranked candidate markers for each cluster. The top DE genes are likely to be good candidate markers as they can effectively distinguish between cells in different clusters. Importantly, besides looking at the clustering results, we can also apply this approach to identify genes driving differences between other factors of our metadata, such as `cell.type` or `PAG.area`, by explicitly supplying them via the `groups=` argument.
For each cluster, the DE results of the relevant comparisons are consolidated into a single output table. The relevant `DataFrame` contains log2-fold changes of expression in cluster X over each other cluster, along with several statistics obtained by combining p-values (Simes 1986) across the pairwise comparisons involving X. This allows a set of marker genes to be easily defined by taking the top DE genes from each pairwise comparison between clusters. For example, to construct a marker set for cluster X from the top 10 genes of each comparison, we could filter `marker_set` to retain rows with `Top` less than or equal to 10. Other statistics are also reported for each gene, including the adjusted p-values and the log-fold changes relative to every other cluster.
Of particular interest is the `Top` field. The set of genes with `Top ≤ X` is the union of the top X genes (ranked by p-value) from each pairwise comparison involving our chosen cluster. For example, the set of all genes with `Top` values of 1 contains the gene with the lowest p-value from each comparison. Similarly, the set of genes with `Top` values less than or equal to 10 contains the top 10 genes from each comparison. The `Top` field represents `findMarkers()`’s approach to consolidating multiple pairwise comparisons into a single ranking for each cluster; each `DataFrame` produced by `findMarkers()` will order genes based on the `Top` value by default. We can thus use the `Top` field to identify a set of genes that is guaranteed to distinguish cluster X from any other cluster. If we take all genes with `Top <= 6`, this is equivalent to the union of the top 6 genes from each pairwise comparison.
Each `DataFrame` also contains several other statistics that may be of interest. The `p.value` field contains the combined p-value that is obtained by applying Simes’s method to the pairwise p-values for each gene and represents the evidence against the joint null hypothesis, i.e., that the gene is not DE between cluster X and any other cluster. In a future version of `scran`, the `summary.logFC` field provides a convenient summary of the direction and effect size for each gene, and is defined here as the log-fold change from the comparison with the lowest p-value. Examination of these statistics permits a quick evaluation of the suitability of a candidate marker; if both of these metrics are poor (small log-fold change, large p-value), the gene can most likely be dismissed.
We intentionally use pairwise comparisons between clusters rather than comparing each cluster to the average of all other cells. The latter approach is sensitive to the population composition, potentially resulting in substantially different sets of markers when cell type abundances change in different contexts. In the worst case, the presence of a single dominant subpopulation will drive the selection of top markers for every other cluster, pushing out useful genes that can resolve the various minor subpopulations. Moreover, pairwise comparisons naturally provide more information to interpret of the utility of a marker, e.g., by providing log-fold changes to indicate which clusters are distinguished by each gene.
### 7.1.2 | Focusing on up or downregulated genes and using the log-fold change
_Focusing on up or downregulated genes_
The previous `findMarkers()` call considers both up- and downregulated genes to be potential markers. However, downregulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. To focus on up-regulated markers, we can instead perform a one-sided t-test to identify genes that are upregulated in each cluster compared to the others. This is achieved by setting `direction = "up"` in the `findMarkers()` call.
_Using the log-fold change_
The t-test also allows us to specify a non-zero log-fold change as the null hypothesis. This allows us to consider the magnitude of the log-fold change in our p-value calculations, in a manner that is more rigorous than simply filtering directly on the log-fold changes (McCarthy and Smyth 2009). (Specifically, a simple threshold does not consider the variance and can enrich for genes that have both large log-fold changes and large variances.) We perform this by setting `lfc=` in our `findMarkers()` call - when combined with `direction=`, this tests for genes with log-fold changes that are significantly greater than 1.
These two settings yield a more focused set of candidate marker genes that are upregulated in cluster X. This increased stringency, however, is not without cost. If only upregulated genes are requested from `findMarkers()`, any cluster defined by downregulation of a marker gene will not contain that gene among the top set of features in its `DataFrame`. This is occasionally relevant for subtypes or other states that are distinguished by high versus low expression of particular genes. Similarly, setting an excessively high log-fold change threshold may discard otherwise useful genes. For example, a gene upregulated in a small proportion of cells of a cluster will have a small log-fold change but can still be an effective marker if the focus is on specificity rather than sensitivity.
### 7.1.3 | Choosing which pairwise comparisons to performm
The choice of `pval.type` determines whether the highly ranked genes are those that are DE between the current group and: (1) any other group ("any"), (2) all other groups ("all"), or (3) some other groups ("some"). The result is a named list of `DataFrame`s, each of which contains a sorted marker gene list for the corresponding group. In each `DataFrame`, the top genes are chosen to enable separation of that group from all other groups. Log-fold changes are reported as differences in average x between groups (usually in base 2, depending on the transformation applied to x).
By default, `findMarkers()` will give a high ranking to genes that are differentially expressed in any pairwise comparison. This is because a gene only needs a very low p-value in a single pairwise comparison to achieve a low `Top` value. A more stringent approach would only consider genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. To achieve this, we set `pval.type="all"` in `findMarkers()` to use an intersection-union test (Berger and Hsu 1996) where the combined p-value for each gene is the maximum of the p-values from all pairwise comparisons. A gene will only achieve a low combined p-value if it is strongly DE in all comparisons to other clusters. Combined with `direction="up"`, this can be used to identify unique markers for each cluster. However, this is sensitive to overclustering, as unique marker genes will no longer exist if a cluster is split into two smaller subclusters.
This strategy will only report genes that are highly specific to the cluster of interest. When it works, it can be highly effective as it generates a small focused set of candidate markers. However, any gene that is expressed at the same level in two or more clusters will simply not be detected. This is likely to discard many interesting genes, especially if the clusters are finely resolved with weak separation (such as different clusters within a cell type). To give a concrete example, consider a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells. With `pval.type="all"`, neither Cd4 or Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. In comparison, `pval.type="any"` will detect both of these genes as they will be DE between at least one pair of subpopulations.
If `pval.type="all"` is too stringent yet `pval.type="any"` is too generous, a compromise is to set `pval.type="some"`. For each gene, we apply the Holm-Bonferroni correction across its p-values and take the middle-most value as the combined p-value. This effectively tests the global null hypothesis that at least 50% of the individual pairwise comparisons exhibit no DE. We then rank the genes by their combined p-values to obtain an ordered set of marker candidates. The aim is to improve the conciseness of the top markers for defining a cluster while mitigating the risk of discarding useful genes that are not DE to all other clusters. The downside is that taking this compromise position sacrifices the theoretical guarantees offered at the other two extremes.
_Only relevant for the newer `scran` version, which returns a column with `summary.logFC`._ In both cases, a different method is used to compute the summary effect size compared to `pval.type="any"`. For `pval.type="all"`, the summary log-fold change is defined as that corresponding to the pairwise comparison with the largest p-value, while for `pval.type="some"`, it is defined as the log-fold change for the comparison with the middle-most p-value. This reflects the calculation of the combined p-value and avoids focusing on genes with strong changes in only one comparison.
### 7.1.4 | Handling blocking factors
_Using the `block=` argument_
Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences). If these are not modelled, they can interfere with marker gene detection - most obviously by inflating the variance within each cluster, but also by distorting the log-fold changes if the cluster composition varies across levels of the blocking factor. To avoid these issues, we can set the `block=` argument in the `findMarkers()` call.
For each gene, each pairwise comparison between clusters is performed separately in each level of the blocking factor. The function will then combine p-values from different blocking levels using Stouffer’s Z method to obtain a single p-value per pairwise comparison. (These p-values are further combined across comparisons to obtain a single p-value per gene, using either Simes’s method or an intersection-union test depending on the value of` pval.type=`.) This approach favours genes that exhibit consistent DE in the same direction in each plate.
The `block=` argument works with all tests shown above and is robust to difference in the log-fold changes or variance between batches. However, it assumes that each pair of clusters is present in at least one batch. In scenarios where cells from two clusters never co-occur in the same batch, the comparison will be impossible and NAs will be reported in the output.
_Using the `design=` argument_
Another approach is to define a design matrix containing the batch of origin as the sole factor. `findMarkers()` will then fit a linear model to the log-expression values, similar to the use of `limma` for bulk RNA sequencing data (Ritchie et al. 2015). This handles situations where multiple batches contain unique clusters, as comparisons can be implicitly performed via shared cell types in each batch. There is also a slight increase in power when information is shared across clusters for variance estimation.
The use of a linear model makes some strong assumptions, necessitating some caution when interpreting the results. If the batch effect is not consistent across clusters, the variance will be inflated and the log-fold change estimates will be distorted. Variances are also assumed to be equal across groups, which is not true in general. In particular, the presence of clusters in which a gene is silent will shrink the residual variance towards zero, preventing the model from penalizing genes with high variance in other clusters. Thus, it is generally recommended to use `block=` where possible.
### 7.1.5 | Using the wilcoxon rank sum test
The Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test) is another widely used method for pairwise comparisons between groups of observations. Its strength lies in the fact that *it directly assesses separation between the expression distributions of different clusters*. The WMW test statistic is proportional to the area-under-the-curve (AUC), i.e., the concordance probability, which is the probability of a random cell from one cluster having higher expression than a random cell from another cluster. In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the t-test only does so indirectly via the difference in the means and the intra-group variance.
We can perform WMW tests using the `findMarkers()` function, setting `test="wilcox"`. This returns a list of `DataFrames` containing ranked candidate markers for each cluster. The `direction=`, `lfc=` and `pval.type=` arguments can be specified and have the same interpretation as described for t-tests. The interpretation of `Top` is the same as described for t-tests, and Simes’s method is again used to combine p-values across pairwise comparisons. If we want more focused sets, we can also change `pval.type=` as previously described.
Again, to explore the results in more detail, we can focus on the `DataFrame` for group X. The `DataFrame` contains the AUCs from comparing group X to every other group. A value greater than 0.5 indicates that the gene is upregulated in the current cluster compared to the other cluster, while values less than 0.5 correspond to downregulation. We would typically expect AUCs of 0.7-0.8 for a strongly upregulated candidate marker.
One practical advantage of the WMW test over the Welch t-test is that it is symmetric with respect to differences in the size of the groups being compared. This means that, all else being equal, the top-ranked genes on each side of a DE comparison will have similar expression profiles regardless of the number of cells in each group. In contrast, the t-test will favor genes where the larger group has the higher relative variance as this increases the estimated degrees of freedom and decreases the resulting p-value. This can lead to unappealing rankings when the aim is to identify genes upregulated in smaller groups. The WMW test is not completely immune the variance effects - for example, it will slightly favor detection of DEGs at low average abundance where the greater number of ties at zero deflates the approximate variance of the rank sum statistic - but this is relatively benign as the selected genes are still fairly interesting.
The main disadvantage of the WMW test is that the AUCs are much slower to compute compared to t-statistics. This may be inconvenient for interactive analyses involving multiple iterations of marker detection. We can mitigate this to some extent by parallelizing these calculations using the `BPPARAM=` argument in `findMarkers()`.
### 7.1.6 | Using the binomial test
The binomial test *identifies genes that differ in the proportion of expressing cells between clusters*. This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle “tuning” effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence is easier to validate.
We can perform pairwise binomial tests between clusters using the `findMarkers()` function with `test="binom"`. This returns a list of `DataFrames` containing marker statistics for each cluster such as the `Top` rank and its p-value. Here, the effect size is reported as the log-fold change in this proportion between each pair of clusters. Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other. We can focus on genes that are upregulated in each cluster compared to the others by setting `direction="up"`.
The disadvantage of the binomial test is that its increased stringency can lead to the loss of good candidate markers. Another property of the binomial test is that it will not respond to scaling normalisation. Systematic differences in library size between clusters will not be considered when computing p-values or effect sizes. This is not necessarily problematic for marker gene detection - users can treat this as retaining information about the total RNA content, analogous to spike-in normalisation.
### 7.1.7 | Using custom DE methods
It is also possible to perform marker gene detection based on precomputed DE statistics, which allows us to take advantage of more sophisticated tests in dedicated DE analysis packages in the Bioconductor ecosystem. Count-based statistical methods such as `DESeq2` (Love, Huber, and Anders 2014), `edgeR` (Robinson, McCarthy, and Smyth 2009), `limma` with the `voom` method (Law et al. 2014), can be used to that purpose.
By default, we do not use custom DE methods to perform marker detection, for several reasons. Many of these methods rely on empirical Bayes shrinkage to share information across genes in the presence of limited replication. However, this is unnecessary when there are large numbers of “replicate” cells in each group. These methods also make stronger assumptions about the data (e.g., equal variances for linear models, the distribution of variances during empirical Bayes) that are more likely to be violated in noisy scRNA-seq contexts. From a practical perspective, they require more work to set up and take more time to run. Nonetheless, some custom methods (e.g., MAST) may provide a useful point of difference from the simpler tests.
`MAST` is an R/Bioconductor package for managing and analyzing qPCR and sequencing-based single-cell gene expression data, as well as data from other types of single-cell assays. Apart from reading and storing single-cell assay data, the package also provides functionality for significance testing of differential expression using a Hurdle model, gene set enrichment, facilities for visualizing patterns in residuals indicative of differential expression, and power calculations. `MAST` is based on a zero-inflated negative binomial model. It tests for differential expression using a Hurdle model to combine tests of discrete (0 vs not zero) and continuous (non-zero values) aspects of gene expression. Similar to `DESeq2` and `edgeR`, it uses a linear modelling framework to enable complex models to be considered.
### 7.1.8 | Some comments on p-values and DE analysis
_Data dredging_
All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent. The DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis.
The practical effect of data dredging is best illustrated with a simple simulation. If we simulate i.i.d. normal values, perform k-means clustering and test for DE between clusters of cells with `findMarkers()`, the resulting distribution of p-values is heavily skewed towards low values. Thus, we can detect “significant” differences between clusters even in the absence of any real substructure in the data. This effect arises from the fact that clustering, by definition, yields groups of cells that are separated in expression space. Testing for DE genes between clusters will inevitably yield some significant results as that is how the clusters were defined.
For marker gene detection, this effect is largely harmless as the p-values are used only for ranking. However, it becomes an issue when the p-values are used to define “significant differences” between clusters with respect to an error rate threshold. Meaningful interpretation of error rates require consideration of the long-run behaviour, i.e., the rate of incorrect rejections if the experiment were repeated many times. The concept of statistical significance for differences between clusters is not applicable if clusters and their interpretations are not stably reproducible across (hypothetical) replicate experiments.
_Nature of replication_
The naive application of DE analysis methods will treat counts from the same cluster of cells as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample (i.e., cell culture, animal or patient). DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (Lun and Marioni 2017). The latter is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be replicated. Indeed, the use of cells as replicates only masks the fact that the sample size is actually one in an experiment involving a single biological sample. This reinforces the inappropriateness of using the marker gene p-values to perform statistical inference.
We strongly recommend selecting some markers for use in validation studies with an independent replicate population of cells. A typical strategy is to identify a corresponding subset of cells that express the upregulated markers and do not express the downregulated markers. Ideally, a different technique for quantifying expression would also be used during validation, e.g., fluorescent in situ hybridisation or quantitative PCR. This confirms that the subpopulation genuinely exists and is not an artifact of the scRNA-seq protocol or the computational analysis.
_Further comments_
One consequence of the DE analysis strategy is that markers are defined relative to subpopulations in the same dataset. Biologically meaningful genes will not be detected if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. In practice, this is usually only a problem when the experimental data are provided without any biological context - certainly, we would hope to have some a priori idea about what cells have been captured. For most applications, it is actually desirable to avoid detecting such genes as we are interested in characterizing heterogeneity within the context of a known cell population. Continuing from the example above, the failure to detect T cell markers is of little consequence if we already know we are working with T cells.
Alternatively, marker detection can be performed by treating gene expression as a predictor variable for cluster assignment. For a pair of clusters, we can find genes that discriminate between them by performing inference with a logistic model where the outcome for each cell is whether it was assigned to the first cluster and the lone predictor is the expression of each gene. Treating the cluster assignment as the dependent variable is more philosophically pleasing in some sense, as the clusters are indeed defined from the expression data rather than being known in advance. (Note that this does not solve the data snooping problem.) In practice, this approach effectively does the same task as a Wilcoxon rank sum test in terms of quantifying separation between clusters. Logistic models have the advantage in that they can easily be extended to block on multiple nuisance variables, though this is not typically necessary in most use cases. Even more complex strategies use machine learning methods to determine which features contribute most to successful cluster classification, but this is probably unnecessary for routine analyses.
### 7.1.9 | Our approach
After reading about the different options we have for DE analysis, we will be using Wilcoxon rank sum test to find marker genes for the following situations:
* `cell.type`: we are interested in the gene expression differences between excitatory and inhibitory neurons, as they can tell us about the molecular toolkit available to each type and help us generate hypothesis to test their physiological function. Given that we used transgenic lines to target each cell type, we can use the relevant metadata slot to define the groups we will be comparing.
* `celltype_PAGarea`: we are also interested in the differences between the different PAG subdivisions. Each PAG subdivision has been involved in regulating or driving different behaviours. It would be interesting to know whether the cell types in each subdivision can be modulated via similar or different neurotransmitter and neuromodulatory pathways. We can investigate this by using the compound metadata factor we created by combining `cell.type` and `PAG.area`.
* `cluster.ID`: finally, we are also interested in finding marker genes for the clusters we identified using the clustering approach of our choice.
The next two chunks of code will be the main blocks of our DE analysis. They will allow us to set the main paramteres for the `findMarkers()` function and fine tune the comparisons we want to make. Before we proceed with the different analysis sections, let us take a look at the main parameters we will need for our call to `findMarkers()` and reset them to their default values:
```{r}
# library(scran)
# # Parameters defaults:
# groups <- NULL # clustering results or group for each cell.
#
# test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
# comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
# direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
# lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
#
# block <- NULL # factor specifying the blocking level for each cell.
# design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
# restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
# exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
# print("Parameters back to default values")
```
Finally, the following code allows us to run the main analysis using `findMarkers()` and adapt it to each comparison. We can run the following chunk of code several times, changing the parameters for the `test.type`, `pval.type`, `direction`, `lfc`, and others to extract the results of the comparison we want into a .csv file. It also allows us to assign the results to separate variables so we can plot them later on.
```{r}
# # Set the groups to be compared (cluster IDs or metadata slot of interest):
# groups <- PAG_sceset_qc_norm_filt_corr_clust$SNN_clusters_cv2_jaccard_k5
#
# # Set the string to be added to the filename when saving the results:
# results_name <- "SNN_clusters_cv2_jaccard_k5_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"
#
# # Create a variable with the tests you want to run:
# desired_tests <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
# desired_comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
# desired_directions <- c("up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".
#
# # Initialise an empty `data.frame` to populate with a summary of marker genes:
# SNN_clusters_marker_set_summary <- data.frame()
#
# # Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
# for (test_choice in desired_tests){
# test <- test_choice
# for (comparison_choice in desired_comparisons){
# comparisons <- comparison_choice
# for (direction_choice in desired_directions){
# direction <- direction_choice
#
# # Set the remaining parameters:
# lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
# block <- PAG_sceset_qc_norm_filt_corr_clust$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
# design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
# restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
# exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
#
# # Run findMarkers:
# marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust,
# groups = groups,
# test.type = test,
# pval.type = comparisons,
# direction = direction,
# lfc = lfc,
# assay.type = "logcounts",
# block = block,
# restrict = restrict,
# exclude = exclude)
#
# # Save results:
# saveRDS(marker_genes, file = str_c(path_for_figures, "SNN_clusters/", results_name,
# "marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
#
# # Create a variable with a value for each group name to iterate through:
# range_of_groups <- 1:paste(length(levels(groups)))
#
# # Create and export a separate table of results for each group:
# for (chosen_group in range_of_groups) {
# # Compose name of variable where results will be assigned to:
# marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
#
# # Assign results of marker genes for current group to a temp variable:
# temp_marker_set <- marker_genes[[chosen_group]]
#
# #print(marker_set) # Print variable name of current group
# #print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
# #print(temp_marker_set[1:5, 1:3]) # Print top results for current group
#
# # Assign the results to the created variable:
# assign(marker_set, temp_marker_set) # Assign the results to the created variable
#
# # Export the results to a .csv file:
# write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "SNN_clusters/", "PAG_DE_results_", # results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
# # Populate the summary data.frame:
# SNN_clusters_marker_set_summary = rbind(SNN_clusters_marker_set_summary,
# data.frame(marker_set = marker_set,
# group = levels(groups)[chosen_group],
# test = test,
# comparisons = comparisons,
# direction = direction,
# genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
# genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
# genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
# }
# }
# }
# }
#
# # Check out the summary:
# SNN_clusters_marker_set_summary[SNN_clusters_marker_set_summary$test=="t",]
# SNN_clusters_marker_set_summary[SNN_clusters_marker_set_summary$test=="wilcox",]
# SNN_clusters_marker_set_summary[SNN_clusters_marker_set_summary$test=="binom",]
```
If we were to run the previous chunk using the default parameters (the three types of test, the three possible types of comparison, and the three possible directions for testing log-fold changes of gene expression) on our graph-based clustering results (which resulted in 11 clusters), we would get 297 files. By removing the direction `any` (which is the same as having "up" and "down" separately) we can reduce it to 198. We can do this for exploratory analysis if we want to compare the output of each type of test. Otherwise, once we have decided on the type of test and the type of comparisons we want to make, we can reduce the output to 22 files (if we have 11 clusters).
## Step 7.2 | Identifying marker genes between excitatory and inhibitory cells
A main advantage of our experimental design is that we know in advance the type of cell we are sequencing. We can thus take advantage of our metadata to perform comparisons between excitatory and inhibitory cells. For this, we will continue with the `SingleCellExperiment` object we created after excluding the cells assigned to the macrophage-enriched cluster, to ensure they do not mask any biological results:
```{r}
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```
To make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.
test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```
For the `cell.type` comparisons, the fact that we only have two groups (VGAT and VGluT2 cells) means that all options in `pval.type` will yield the same results, and `direction = "up"` in one `cell.type` will be equivalent to `direction = "down"` in the other `cell.type` and viceversa, so by setting `direction = "up"` we will already obtain all the results we want. We will be using the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$cell.type
# Set the string to be added to the filename when saving the results:
results_name <- "cell_type_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"
# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
# Initialise an empty data.frame to populate with a summary of marker genes:
celltype_marker_set_summary <- data.frame()
# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
test <- test_choice
# Set the remaining parameters:
comparisons <- "all" # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
direction <- "up" # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell. Try PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
# Run findMarkers:
marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
groups = groups,
test.type = test,
pval.type = comparisons,
direction = direction,
lfc = lfc,
assay.type = "logcounts",
block = block,
restrict = restrict,
exclude = exclude)
# Save results:
saveRDS(marker_genes, file = str_c(path_for_figures, "cell_type/", results_name,
"marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
# Create a variable with a value for each group name to iterate through:
range_of_groups <- 1:paste(length(levels(groups)))
# Create and export a separate table of results for each group:
for (chosen_group in range_of_groups) {
# Compose name of variable where results will be assigned to:
marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
# Assign results of marker genes for current group to a temp variable:
temp_marker_set <- marker_genes[[chosen_group]]
#print(marker_set) # Print variable name of current group
#print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
#print(temp_marker_set[1:5, 1:3]) # Print top results for current group
# Assign the results to the created variable:
assign(marker_set, temp_marker_set)
# Export the results to a .csv file:
write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "cell_type/", "PAG_DE_results_",
results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
# Populate the summary data.frame:
celltype_marker_set_summary = rbind(celltype_marker_set_summary,
data.frame(marker_set = marker_set,
group = levels(groups)[chosen_group],
test = test,
comparisons = comparisons,
direction = direction,
genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
}
}
# Check out the summary:
celltype_marker_set_summary
```
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_VGAT_wilcox_all_up[marker_set_VGAT_wilcox_all_up$FDR<0.05,])
head(marker_set_VGluT2_wilcox_all_up[marker_set_VGluT2_wilcox_all_up$FDR<0.05,])
```
Once we have obtained the results for each comparison, we can take a quick look at the top DE genes and filter them according the our custom list of genes we appended to the metadata in the begining of the pipeline.
```{r}
# Take a look at the gene lists we have in the metadata
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```
```{r}
# Set the FDR threshold you want to look at.
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be
VGAT_de_genes <- rownames(marker_set_VGAT_wilcox_all_up[marker_set_VGAT_wilcox_all_up$FDR<fdr_threshold, ])
VGluT2_de_genes <- rownames(marker_set_VGluT2_wilcox_all_up[marker_set_VGluT2_wilcox_all_up$FDR<fdr_threshold, ])
cat("--- Ion Channels ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
cat("\n\n--- Neuromodulators and Neuropeptides ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
cat("\n\n--- Monoamines ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
cat("\n\n--- Neurotransmitters ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
cat("\n\n--- Transporters ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("\n\n--- Transcription Factors ---\n")
cat("--- VGAT cells\n")
VGAT_de_genes[VGAT_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("\n--- VGluT2 cells\n")
VGluT2_de_genes[VGluT2_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
```
As a final step, we can assign the names of the differentially expressed genes to the `metadata` slot of the `PAG_sceset_qc_norm_filt_corr_clust_de` object, so we can quickly retrieve it for plotting and summarising our results:
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be
# DE genes in VGAT cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_VGAT_wilcox_all_up <- rownames(marker_set_VGAT_wilcox_all_up[marker_set_VGAT_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in VGluT2 cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_VGluT2_wilcox_all_up <- rownames(marker_set_VGluT2_wilcox_all_up[marker_set_VGluT2_wilcox_all_up$FDR<fdr_threshold, ])
names(metadata(PAG_sceset_qc_norm_filt_corr_clust_de))
```
## Step 7.3 | Identifying marker genes between cell-type and PAG subdivisions
To find interesting genes across PAG subdivisions while accounting for cell-type, we can make use of the combined metadata factor `celltype_PAGarea`, in which each cell will be grouped according to both their type and their PAG area. For this, we continue with the `SingleCellExperiment` object we created after excluding the cells assigned to the macrophage-enriched cluster:
```{r}
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```
To make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.
test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```
For the `celltype_PAGarea` comparisons, we are up to eight groups (dmPAG, dlPAG, lPAG, vlPAG and VGAT and VGluT2 cells for each of those subdivisions). In this case, we are only interested in the genes specific to each group, so we will set `pval.type=all`. However, after seeing `all` results in nearly no potential markers, we could think of adding `pval.type=some` to the mix. In this case, though, by doing that we would mainly be getting the cell-type specific markers back in, so we will stick with `pval.type=all`. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$celltype_PAGarea
# Set the string to be added to the filename when saving the results:
results_name <- "celltype_PAGarea_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"
# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".
# Initialise an empty data.frame to populate with a summary of marker genes:
celltype_PAGarea_marker_set_summary <- data.frame()
# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
test <- test_choice
for (comparison_choice in desired_comparisons){
comparisons <- comparison_choice
for (direction_choice in desired_directions){
direction <- direction_choice
# Set the remaining parameters:
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
# Run findMarkers:
marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
groups = groups,
test.type = test,
pval.type = comparisons,
direction = direction,
lfc = lfc,
assay.type = "logcounts",
block = block,
restrict = restrict,
exclude = exclude)
# Save results:
saveRDS(marker_genes, file = str_c(path_for_figures, "celltype_PAGarea/", results_name,
"marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
# Create a variable with a value for each group name to iterate through:
range_of_groups <- 1:paste(length(levels(groups)))
# Create and export a separate table of results for each group:
for (chosen_group in range_of_groups) {
# Compose name of variable where results will be assigned to:
marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
# Assign results of marker genes for current group to a temp variable:
temp_marker_set <- marker_genes[[chosen_group]] # Assign results of marker genes for current group to a temp variable
#print(marker_set) # Print variable name of current group
#print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
#print(temp_marker_set[1:5, 1:3]) # Print top results for current group
# Assign the results to the created variable:
assign(marker_set, temp_marker_set)
# Export the results to a .csv file:
write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "celltype_PAGarea/", "PAG_DE_results_",
results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
# Populate the summary data.frame:
celltype_PAGarea_marker_set_summary = rbind(celltype_PAGarea_marker_set_summary,
data.frame(marker_set = marker_set,
group = levels(groups)[chosen_group],
test = test,
comparisons = comparisons,
direction = direction,
genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
}
}
}
}
# Check out the summary:
celltype_PAGarea_marker_set_summary
```
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_VGluT2_dlpag_wilcox_all_up[marker_set_VGluT2_dlpag_wilcox_all_up$FDR<0.1,])
```
For `pval.type=all`, we found no marker genes for any of the groups that passed the false discovery rate threshold of `FDR<0.05`. This suggests that if we want to find genes that are differentially expressed across PAG subdivisions we can either look at the results from `pval.type=some` or repeat the analysis following one of two approaches:
* We could compare PAG subdivisions by using the metadata factor `PAG.area` and ignoring the `cell.type`, which would provide us with marker genes to target specific subdivisions.
* Alternatively we could first subset the dataset by `cell.type` so that we are left with excitatory cells and inhibitory cells separately, and then re-run the comparisons by using the metadata factor `PAG.area`.
### 7.3.1 | Marker genes for PAG subdivisions
We will now follow the first of the two approaches delineated above and perform pairwise comparisons to identify marker genes for PAG subdivisions. For this, we continue with the `SingleCellExperiment` object we created after excluding the cells assigned to the macrophage-enriched cluster:
```{r}
dim(PAG_sceset_qc_norm_filt_corr_clust)
dim(PAG_sceset_qc_norm_filt_corr_clust_de)
```
As always, to make sure we don't carry any variables from previous runs, we reset all the parameters to their default values:
```{r}
library(scran)
# Parameters defaults:
groups <- NULL # clustering results or group for each cell.
test <- c("t", "wilcox", "binom") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
comparisons <- c("any", "some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all.
direction <- c("any", "up", "down") # the direction of log-fold changes to be considered in the alternative hypothesis.
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
block <- NULL # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
print("Parameters back to default values")
```
For the `PAG.area` comparisons, we have four groups (dmPAG, dlPAG, lPAG, vlPAG). In this case, we are mainly interested in the genes specific to each group, so we will set `pval.type=all`. Again, after seeing `all` results in very few potential markers, we could add `pval.type=some` to the mix. We will also set `direction = "up"` to focus on genes that are more highly expressed in each group. We will again use the Wilcoxon rank sum test and will block for `mouse.sex` to account for any sex specific differences that we are not interested in.
```{r}
# Set the groups to be compared (cluster IDs or metadata slot of interest):
groups <- PAG_sceset_qc_norm_filt_corr_clust_de$PAG.area
# Set the string to be added to the filename when saving the results:
results_name <- "PAG_subdivision_" # one of "SNN_clusters_cv2_jaccard_k5_", "cell_type_", "PAG_subdivision", "celltype_PAGarea_", "celltype_PAGAPaxis_", "VGAT_VGluT2_expression"
# Create a variable with the tests you want to run:
desired_tests <- c("wilcox") # Pairwise test to perform: t-test "t", Wilcoxon rank sum test "wilcox", binomial test "binom".
desired_comparisons <- c("some", "all") # how p-values are to be combined across pairwise comparisons for a given group: "any", "some", or "all".
desired_directions <- c("up") # the direction of log-fold changes to be considered in the alternative hypothesis: "any", "up", "down".
# Initialise an empty data.frame to populate with a summary of marker genes:
PAGarea_marker_set_summary <- data.frame()
# Perform comparisons of chosen groups with a set of statistical tests and store/export all the results:
for (test_choice in desired_tests){
test <- test_choice
for (comparison_choice in desired_comparisons){
comparisons <- comparison_choice
for (direction_choice in desired_directions){
direction <- direction_choice
# Set the remaining parameters:
lfc <- 0 # positive numeric scalar specifying the log-fold change threshold to be tested against.
block <- PAG_sceset_qc_norm_filt_corr_clust_de$mouse.sex # factor specifying the blocking level for each cell.
design <- NULL # numeric matrix with blocking terms for uninteresting factors (not to be confounded with groups).
restrict <- NULL # the levels of groups for which to perform pairwise comparisons.
exclude <- NULL # the levels of groups for which NOT to perform pairwise comparisons.
# Run findMarkers:
marker_genes <- findMarkers(PAG_sceset_qc_norm_filt_corr_clust_de,
groups = groups,
test.type = test,
pval.type = comparisons,
direction = direction,
lfc = lfc,
assay.type = "logcounts",
block = block,
restrict = restrict,
exclude = exclude)
# Save results:
saveRDS(marker_genes, file = str_c(path_for_figures, "PAG_subdivision/", results_name,
"marker_genes_", test, "_", comparisons, "_", direction, ".rds"))
# Create a variable with a value for each group name to iterate through:
range_of_groups <- 1:paste(length(levels(groups)))
# Create and export a separate table of results for each group:
for (chosen_group in range_of_groups) {
# Compose name of variable where results will be assigned to:
marker_set <- str_c("marker_set_", levels(groups)[chosen_group], "_", test, "_", comparisons, "_", direction)
# Assign results of marker genes for current group to a temp variable:
temp_marker_set <- marker_genes[[chosen_group]] # Assign results of marker genes for current group to a temp variable
#print(marker_set) # Print variable name of current group
#print(sum(temp_marker_set$FDR<0.05)) # Print the number of genes with a FDR below 0.05
#print(temp_marker_set[1:5, 1:3]) # Print top results for current group
# Assign the results to the created variable:
assign(marker_set, temp_marker_set)
# Export the results to a .csv file:
write.csv(temp_marker_set[temp_marker_set$FDR<0.1,], file = str_c(path_for_figures, "PAG_subdivision/", "PAG_DE_results_",
results_name, marker_set, ".csv"), quote = FALSE, row.names = TRUE)
# Populate the summary data.frame:
PAGarea_marker_set_summary = rbind(PAGarea_marker_set_summary,
data.frame(marker_set = marker_set,
group = levels(groups)[chosen_group],
test = test,
comparisons = comparisons,
direction = direction,
genes_FDR_001 = sum(temp_marker_set$FDR<0.01),
genes_FDR_005 = sum(temp_marker_set$FDR<0.05),
genes_FDR_01 = sum(temp_marker_set$FDR<0.1)))
}
}
}
}
# Check out the summary:
PAGarea_marker_set_summary
```
As we can see, even when looking only at the differences between PAG subdivisions without paying attention to the cell-type we still don't find many potential marker genes. Interestingly though, we can see there are less genes for dmPAG, dlPAG, and lPAG compared to vlPAG. Given that we are using `pval.type=all` this may be a hint that vlPAG is more different from all the other subdivisions than what dm/dl/l are between themselves. To test this, we could repeat the analysis by pooling the metadata factors from dm/dl/l into a single one and compare dorsal PAG vs vlPAG.
```{r}
# Take a look at the results from one fo the comparisons
head(marker_set_dlpag_wilcox_all_up[marker_set_dlpag_wilcox_all_up$FDR<0.05,])
```
Once we have obtained the results for each comparison, we can take a quick look at the top DE genes and filter them according the our custom list of genes we appended to the metadata in the begining of the pipeline. __Hint: only Transcription Factors have some results.__
```{r}
# Set the FDR threshold you want to look at.
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be
dmpag_de_genes <- rownames(marker_set_dmpag_wilcox_all_up[marker_set_dmpag_wilcox_all_up$FDR<fdr_threshold, ])
dlpag_de_genes <- rownames(marker_set_dlpag_wilcox_all_up[marker_set_dlpag_wilcox_all_up$FDR<fdr_threshold, ])
lpag_de_genes <- rownames(marker_set_lpag_wilcox_all_up[marker_set_lpag_wilcox_all_up$FDR<fdr_threshold, ])
vlpag_de_genes <- rownames(marker_set_vlpag_wilcox_all_up[marker_set_vlpag_wilcox_all_up$FDR<fdr_threshold, ])
# cat("--- Ion Channels ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.ionchannels]
#
# cat("\n\n--- Neuromodulators and Neuropeptides ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.NeuromodulatorsPeptides]
#
# cat("\n\n--- Monoamines ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Monoamines]
#
# cat("\n\n--- Neurotransmitters ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Neurotransmitters]
#
# cat("\n\n--- Transporters ---\n")
# cat("--- dmPAG cells\n")
# dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
# cat("--- dlPAG cells\n")
# dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
# cat("--- lPAG cells\n")
# lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
# cat("--- vlPAG cells\n")
# vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transporters]
cat("\n\n--- Transcription Factors ---\n")
cat("--- dmPAG cells\n")
dmpag_de_genes[dmpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- dlPAG cells\n")
dlpag_de_genes[dlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- lPAG cells\n")
lpag_de_genes[lpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
cat("--- vlPAG cells\n")
vlpag_de_genes[vlpag_de_genes %in% metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$genes.Transcriptionfactors]
```
Finally, we assign the names of the differentially expressed genes to the `metadata` slot of the `PAG_sceset_qc_norm_filt_corr_clust_de` object, so we can quickly retrieve it for plotting and summarising our results:
```{r}
# Set false discovery rate threshold
fdr_threshold <- 0.05 # try 0.1, 0.05, 0.01 depending on how stringent you want your results to be
## Results from `pval = all`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dmpag_wilcox_all_up <- rownames(marker_set_dmpag_wilcox_all_up[marker_set_dmpag_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in dlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dlpag_wilcox_all_up <- rownames(marker_set_dlpag_wilcox_all_up[marker_set_dlpag_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in lPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_lpag_wilcox_all_up <- rownames(marker_set_lpag_wilcox_all_up[marker_set_lpag_wilcox_all_up$FDR<fdr_threshold, ])
# DE genes in vlPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_vlpag_wilcox_all_up <- rownames(marker_set_vlpag_wilcox_all_up[marker_set_vlpag_wilcox_all_up$FDR<fdr_threshold, ])
## Results from `pval = some`
# DE genes in dmPAG cells
metadata(PAG_sceset_qc_norm_filt_corr_clust_de)$marker_set_dmpag_wilcox_some_up <- rownames(marker_set_dmpag_wilcox_some_up[marker_set_dmpag_wilcox_some_up$FDR<fdr_threshold, ])