-
Notifications
You must be signed in to change notification settings - Fork 0
/
PAG_scRNAseq_analysis_Part3.Rmd
623 lines (526 loc) · 46.5 KB
/
PAG_scRNAseq_analysis_Part3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
---
title: "Topographic, single-cell gene expression profiling of Periaqueductal Gray neurons"
subtitle: "Part III: normalisation"
author:
- name: "Oriol Pavon Arocas, Sarah F. Olesen, and Tiago Branco"
affiliation: "Sainsbury Wellcome Centre for Neural Circuits and Behaviour, University College London, UK"
email: "oriol.pavon.16@ucl.ac.uk"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
html_notebook:
highlight: pygments
number_sections: FALSE
theme: lumen
toc: TRUE
toc_float: TRUE
#output: rmdformats::readthedown:
#highlight: pygments
---
***
This is a pipeline to analyse single-cell RNA sequencing data from Periaqueductal Gray neurons (1) isolated from acute midbrain slices of transgenic mice using visually guided aspiration via patch pipettes and (2) processed using SMART-seq2 (Picelli et al., Nature Protocols 2014).
This pipeline has been generated after attending the [EMBL-EBI RNA-Sequence Analysis Course](https://www.ebi.ac.uk/training/events/2019/rna-sequence-analysis) and [attending](https://training.csx.cam.ac.uk/bioinformatics/event/2823386) and following the online course on [Analysis of single cell RNA-seq data](https://github.com/hemberg-lab/scRNA.seq.course) by the [Hemberg Lab](https://www.sanger.ac.uk/science/groups/hemberg-group). Many other resources have been used, including the [Orchestrating Single-Cell Analysis with Bioconductor book](https://bioconductor.org/books/release/OSCA/) by Robert Amezquita, Aaron Lun, Stephanie Hicks, and Raphael Gottardo, the [simpleSingleCell workflow in Bioconductor](https://bioconductor.org/packages/3.9/workflows/html/simpleSingleCell.html) maintained by Aaron Lun, the [rnaseqGene workflow](https://bioconductor.org/packages/3.10/workflows/html/rnaseqGene.html) maintained by Michael Love, the [RNAseq123 workflow](https://bioconductor.org/packages/3.10/workflows/html/RNAseq123.html) maintained by Matthew Ritchie, and the [EGSEA123 workflow](https://bioconductor.org/packages/3.10/workflows/html/EGSEA123.html) maintained by Matthew Ritchie.
Other key resources include [Bioconductor](http://www.bioconductor.org/) (Huber et al., Nature Methods 2015), [scRNA-tools](https://www.scrna-tools.org/), `scater` (McCarthy et al., Bioinformatics 2017), `scran` (Lun et al., F1000Res 2016), `SC3` (Kiselev et al., Nature Methods 2017), `Seurat` (Butler et al., Nature Biotechnology 2018), `clusterExperiment` (Risso et al., PLOS Computational Biology 2018), `limma` (Ritchie et al., Nucleic Acids Research 2015), `DESeq2` (Love et al., Genome Biology 2014), `edgeR` (Robinson et al., Bioinformatics 2010), `MAST` (Finak, McDavid, Yajima et al., Genome Biology 2015), `iSEE` (Rue-Albrecht & Marini et al., F1000Research 2018), `t-SNE` (van der Maaten & Hinton, Journal of Machine Learning Research 2008), `UMAP` (McInnes et al., arXiv 2018), and the [Mathematical Statistics and Machine Learning for Life Sciences](https://towardsdatascience.com/tagged/stats-ml-life-sciences) column by Nikolay Oskolkov.
***
# STEP 3 | Normalization of cell-specific biases
Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data (Stegle, Teichmann, and Marioni 2015). They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.
Scaling normalisation is the simplest and most commonly used class of normalisation strategies. It involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalised expression values” can then be used for downstream analyses such as clustering and dimensionality reduction.
***
_The following theory has been summarised from the online course on [Analysis of single cell RNA-seq data](https://github.com/hemberg-lab/scRNA.seq.course) by the [Hemberg Lab](https://www.sanger.ac.uk/science/groups/hemberg-group)._
Library sizes vary because scRNA-seq data is often sequenced on highly multiplexed platforms and the total reads which are derived from each cell may differ substantially. Library size must be corrected for by multiplying or dividing each column of the expression matrix by a normalisation factor which is an estimate of the library size relative to the other cells. Many methods to correct for library size have been developped for bulk RNA-seq (e.g. UQ, SF, CPM, RPKM, FPKM, TPM).
* The _Upperquartile (UQ)_: In the upperquartile method (Bullard et al. 2010), each column is divided by the 75% quantile of the counts for each library. Often the calculated quantile is scaled by the median across cells to keep the absolute level of expression relatively consistent. A drawback to this method is that for low-depth scRNA-seq experiments the large number of undetected genes may result in the 75% quantile being zero (or close to it). This limitation can be overcome by generalizing the idea and using a higher quantile (eg. the 99% quantile is the default in scater) or by excluding zeros prior to calculating the 75% quantile.
* The _Relative log expression (RLE) size factor (SF)_: The size factor (SF) was proposed and popularized by DESeq (Anders and Huber 2010). First the geometric mean of each gene across all cells is calculated. The size factor for each cell is the median across genes of the ratio of the expression to the gene geometric mean. A drawback to this method is that since it uses the geometric mean only genes with non-zero expression values across all cells can be used in its calculation, making it unadvisable for large low-depth scRNA-seq experiments.
* Using _CPM (counts per million)_: The simplest way to normalise RNAseq data is to convert it to counts per million (CPM) by dividing each column by its total then multiplying by 1,000,000. _RPKM_ (read per kilobase per milion), _FPKM_ (fragment per kilobase per million) and _TPM_ (transcripts per million) are variants on CPM which further adjust counts by the length of the respective gene/transcript.
+ Spike-ins should be excluded from the calculation of total expression in order to correct for total cell RNA content (we should only use endogenous genes).
+ One potential drawback of CPM is relevant to situations where a dataset contains genes that are both very highly expressed and differentially expressed across the cells. In this case, the total molecules in the cell may depend of whether such genes are on/off in the cell and normalising by total molecules may hide the differential expression of those genes and/or falsely create differential expression for the remaining genes.
* Using _scran (scRNA-seq specific)_: The `scran` package implements a variant on CPM specialized for single-cell data (Lun et al. Genome Biology 2016). Briefly, this method deals with the problem of vary large numbers of zero values per cell by pooling cells together calculating a normalisation factor (similar to CPM) for the sum of each pool. Since each cell is found in many different pools, cell-specific factors can be deconvoluted from the collection of pool-specific factors using linear algebra.
* Using _SCnorm (scRNA-seq specific)_: Another method developed to normalise scRNA-seq data (Bacher et al. Nature Methods 2017).
***
_A note on scaling normalisation strategies_ from the [simpleSingleCell workflow in Bioconductor](https://bioconductor.org/packages/3.9/workflows/html/simpleSingleCell.html) maintained by Aaron Lun.
Scaling normalisation strategies for scRNA-seq data can be broadly divided into two classes. The first class assumes that there exists a subset of genes that are not DE between samples. The second class uses the fact that the same amount of spike-in RNA was added to each cell, and therefore differences in the coverage of the spike-in transcripts can only be due to cell-specific biases (e.g., in capture efficiency or sequencing depth). Scaling normalisation is then applied to equalize spike-in coverage across cells.
The choice between these two normalisation strategies depends on the biology of the cells and the features of interest. If the majority of genes are expected to be DE and there is no reliable house-keeping set, spike-in normalisation may be the only option for removing cell-specific biases. Spike-in normalisation should also be used if differences in the total RNA content of individual cells are of interest. In any particular cell, an increase in the amount of endogenous RNA will not increase spike-in coverage (with or without library quantification). Thus, the former will not be represented as part of the bias in the latter, which means that the effects of total RNA content on expression will not be removed upon scaling. With non-DE normalisation, an increase in RNA content will systematically increase the expression of all genes in the non-DE subset, such that it will be treated as bias and removed.
In our case, we expect our dataset to contain a homogeneous population of neurons, with the majority of genes not differentially expressed between samples. Therefore, we will compute size factors for endogenous genes. We will not use spike-in transcripts as we did not add them in sufficient concentration to be used reliably.
## Step 3.1 | Compute size factors for endogenous genes
We will proceed with the `PAG_sceset_qc` after having applied the relevant QC to our dataset. We will use `scran` to calculate size factors for our samples and then use these for normalisation. Importantly, see *Vallejos et al., Nature Methods 2017* for an explanation of why it is not a good idea to use CPM normalisation and other bulk-based normalisation methods on scRNA-seq data.
```{r}
# Set the directory where your data and scripts are:
setwd("D:/Dropbox (UCL - SWC)/Project_transcriptomics/analysis/PAG_scRNAseq_analysis")
# Set the path to save figures from this Part:
path_for_figures <- "D:/Dropbox (UCL - SWC)/Project_transcriptomics/figures/R_figures_Part3_normalisation/"
date <- Sys.Date()
date <- gsub("-", "_", date)
# Load packages:
library(tidyverse)
library(SingleCellExperiment)
library(scater)
library(scran)
library(ggplot2)
library(sgeostat)
library(mvoutlier)
library(extrafont)
```
```{r}
# If starting from stored results, load saved filtered dataset from previous Step:
options(stringsAsFactors = FALSE)
PAG_sceset_qc <- readRDS("PAG_sceset_qc.rds") # Contains filtered cells and genes
assayNames(PAG_sceset_qc)
```
We should _not_ include spike-ins in the normalisation step, as we want to normalise by the genes each cell endogenously expressed. We can later calculate separate size factors for the spike-in transcripts.
```{r}
# For large datasets of highly heterogeneous data with multiple cell types we should run a preclustering step:
# set.seed(1991)
# qclust <- quickCluster(PAG_sceset_qc, min.size = 30)
# PAG_sceset_qc <- computeSumFactors(PAG_sceset_qc, sizes = 15, clusters = qclust)
# PAG_sceset_qc <- logNormCounts(PAG_sceset_qc)
# For small datasets like ours (< 1000 samples), the quickCluster step above can be ommited and we can run computeSumFactors straight away.
PAG_sceset_qc <- computeSumFactors(PAG_sceset_qc, min.mean = 1, sizes = seq(21, 201, 5), assay.type = "counts")
summary(sizeFactors(PAG_sceset_qc))
```
`scran` sometimes calculates negative or zero size factors. These will completely distort the normalised expression matrix. If you find `scran` has calculated negative size factors (you can see this at the Min. column of the above summary) try increasing the cluster and pool sizes until they are all positive.
Plotting the size factors against the library sizes should show a positive correlation.
```{r}
plot(PAG_sceset_qc$total_million,
sizeFactors(PAG_sceset_qc),
log = "xy",
main = "Correlation between size factors and library size", font.main = 2, cex.main = 1.1,
xlab = "Library size (million reads)", ylab = "Size factor", font.lab = 1, cex.lab = 1,
col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/3, maxColorValue = 255))[factor(PAG_sceset_qc$cell.type)],
type = "p", pch = 19, cex = 1.25, las = 1)
legend("bottomright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/3, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells")) #legend = levels(factor(PAG_sceset_qc$cell.type))
# Save as a pdf:
pdf(file = str_c(path_for_figures, date, "_SizeFactors_LibrarySize.pdf"),
width = 5, height = 5, pointsize = 11,
family = "Arial", bg = "transparent")
plot(PAG_sceset_qc$total_million,
sizeFactors(PAG_sceset_qc),
log = "xy",
main = "Correlation between size factors and library size", font.main = 2, cex.main = 1.1,
xlab = "Library size (million reads)", ylab = "Size factor", font.lab = 1, cex.lab = 1,
col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/3, maxColorValue = 255))[factor(PAG_sceset_qc$cell.type)],
type = "p", pch = 19, cex = 1.25, las = 1)
legend("bottomright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/3, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells")) #legend = levels(factor(PAG_sceset_qc$cell.type))
dev.off()
```
Indeed, our size factors are tightly correlated with the library sizes for all cells. This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist within a homogeneous population of cells.
As we mentioned, we will not normalise using Spike-ins size factors, as we did not add a high enough concentration of ERCCs to achieve a sampling distribution/sequencing coverage constant across cells. If both the deconvolution size factors and the spike-in size factors captured similar technical biases in sequencing depth and capture efficiency, we should observe a positive correlation between them. However, if we plot one against the other, we can clearly see that this is not the case, and spike-in size factors are very low in most cells, reflecting the fact that for most cells the percentage of reads allocated to spike-ins was not high enough to use them for normalisation purposes.
```{r}
PAG_sceset_qc_spikes <- PAG_sceset_qc # Make a copy of the SingleCellExperiment object to avoid altering the current one
PAG_sceset_qc_spikes <- computeSpikeFactors(PAG_sceset_qc_spikes, "ERCC", assay.type = "counts")
to.plot <- data.frame(DeconvFactor = sizeFactors(PAG_sceset_qc),
SpikeFactor = sizeFactors(PAG_sceset_qc_spikes),
CellType = PAG_sceset_qc$cell.type)
ggplot(to.plot, aes(x = DeconvFactor, y = SpikeFactor, color = CellType)) +
geom_point() + scale_x_log10() +
scale_y_log10() + geom_abline(intercept = 0, slope = 1, color = "red")
```
We will thus use spike-ins for QC only, and we will normalise our dataset using the deconvolution size factors.
## Step 3.2 | Apply the size factors to normalise gene expression
We will now calculate normalised expression values using the `logNormCounts()` method from `scater` (McCarthy et al., 2017). This uses the deconvolution size factors for the endogenous genes, and the spike-in-based size factors for the spike-in transcripts. Each expression value can be interpreted as a log-transformed “normalised count”, and can be used in downstream applications like clustering or dimensionality reduction.
The count data are used to compute normalised log-expression values for use in downstream analyses. Each value is defined as the log2-ratio of each count to the size factor for the corresponding cell, after adding a pseudo-count of 1 to avoid undefined values at zero counts. Division of the counts for each gene by its appropriate size factor ensures that any cell-specific biases are removed. If spike-in-specific size factors are present in `SingleCellExperiment`, they will be automatically applied to normalise the spike-in transcripts separately from the endogenous genes.
The log-transformation provides some measure of variance stabilization (Law et al., 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an expression matrix in addition to the other assay elements.
```{r}
PAG_sceset_qc <- logNormCounts(PAG_sceset_qc)
assayNames(PAG_sceset_qc)
```
The log-transformation is useful as differences in the log-values represent log-fold changes in expression. This is important in downstream procedures based on Euclidean distances, which includes many forms of clustering and dimensionality reduction. By operating on log-transformed data, we ensure that these procedures are measuring distances between cells based on log-fold changes in expression.
### 3.2.1 | TPM Normalization
Data from full length protocols may benefit from normalisation methods that take into account gene length (Patel et al, 2014; Kowalczyk et al, 2015; Soneson & Robinson, 2018). A commonly used normalisation method for full-length scRNA-seq data is TPM normalisation (Li et al, 2009), which comes from bulk RNA-seq analysis.
Given that we have full-length read data, we could use the gene lengths we computed at the begining to apply TPM normalisation:
```{r}
#assay(PAG_sceset_qc, "TPM") <- calculateTPM(PAG_sceset_qc,
# lengths = rowData(PAG_sceset_qc_norm)$gene_length,
# exprs_values = "counts",
# size_factors = sizeFactors(PAG_sceset_qc))
#assayNames(PAG_sceset_qc)
```
However, according to this post by [Nikolay Oskolkov](https://towardsdatascience.com/pitfalls-of-data-normalization-bf05d65f1f4c), TPM counts for a given sample sum up to one million, which is a Simplex Space constraint. This means that TPM counts are not better than Library Size normalised counts as they also suffer from the Simplex Space bias, and one should not naively apply e.g. PCA on those counts. For a clearer explanation read the full post on the link above. We will refrain from using TPM and will stick with the normalisation using the `scran` size factors.
### 3.2.2 | Plots after normalisation
During QC we noticed that a large proportion of cells seemed to express both the transporter (VGAT or VGluT2) used to create the transgenic line and the other one. Now that we have normalised and log-transformed the counts and filtered low quality cells, we can re-do the plots showing the expression levels of VGAT and VGluT2 and store them:
```{r}
# We can set any visualization parameters common to all plots at the beginning like this:
theme_args_violin_bymouse <- theme(plot.title = element_text(size = 12, face = "bold"),
axis.title = element_text(size = 11, face = "plain"),
axis.text = element_text(size = 10, face = "plain"),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.text = element_text(size = 9, face = "plain"),
strip.text = element_text(size = 10, face = "plain"))
theme_args_violin_bytype <- theme(plot.title = element_text(size = 12, face = "bold"),
axis.title = element_text(size = 11, face = "plain"),
axis.text = element_text(size = 10, face = "plain"),
legend.text = element_text(size = 9, face = "plain"),
strip.text = element_text(size = 10, face = "plain"))
## Violin plots from VGAT cells by Mouse ID
# Using counts
violin_QC_VGATbymouse_VGATVGluT2_counts <- plotExpression(PAG_sceset_qc[, PAG_sceset_qc$cell.type == "VGAT"], c("Slc32a1", "Slc17a6"), x = "mouse.id",
exprs_values = "counts",
colour_by = "batch.processing", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels in cells from VGAT::Cre animals after QC (raw counts)") + xlab("Mouse ID") + ylab("Expression (raw counts)") +
theme_args_violin_bymouse
violin_QC_VGATbymouse_VGATVGluT2_counts
saveRDS(violin_QC_VGATbymouse_VGATVGluT2_counts, file = str_c(path_for_figures, "violin_QC_VGATbymouse_VGATVGluT2_counts.rds"))
ggsave(filename = str_c(date, "_violin_QC_VGATbymouse_VGATVGluT2_counts.pdf"),
plot = violin_QC_VGATbymouse_VGATVGluT2_counts,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# Using logcounts_raw
violin_QC_VGATbymouse_VGATVGluT2_logcounts_raw <- plotExpression(PAG_sceset_qc[, PAG_sceset_qc$cell.type == "VGAT"], c("Slc32a1", "Slc17a6"), x = "mouse.id",
exprs_values = "logcounts_raw",
colour_by = "batch.processing", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels in cells from VGAT::Cre animals after QC (log2 raw counts)") + xlab("Mouse ID") + ylab("Expression (log2 raw counts)") +
theme_args_violin_bymouse
violin_QC_VGATbymouse_VGATVGluT2_logcounts_raw
saveRDS(violin_QC_VGATbymouse_VGATVGluT2_logcounts_raw, file = str_c(path_for_figures, "violin_QC_VGATbymouse_VGATVGluT2_logcounts_raw.rds"))
ggsave(filename = str_c(date, "_violin_QC_VGATbymouse_VGATVGluT2_logcounts_raw.pdf"),
plot = violin_QC_VGATbymouse_VGATVGluT2_logcounts_raw,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# Using logcounts
violin_QC_VGATbymouse_VGATVGluT2_logcounts <- plotExpression(PAG_sceset_qc[, PAG_sceset_qc$cell.type == "VGAT"], c("Slc32a1", "Slc17a6"), x = "mouse.id",
exprs_values = "logcounts",
colour_by = "batch.processing", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels in cells from VGAT::Cre animals after QC (logcounts)") + xlab("Mouse ID") + ylab("Expression (log-normalised counts)") +
theme_args_violin_bymouse
violin_QC_VGATbymouse_VGATVGluT2_logcounts
saveRDS(violin_QC_VGATbymouse_VGATVGluT2_logcounts, file = str_c(path_for_figures, "violin_QC_VGATbymouse_VGATVGluT2_logcounts.rds"))
ggsave(filename = str_c(date, "_violin_QC_VGATbymouse_VGATVGluT2_logcounts.pdf"),
plot = violin_QC_VGATbymouse_VGATVGluT2_logcounts,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
## Violin plots from VGluT2 cells by Mouse ID
# Using counts
violin_QC_VGluT2bymouse_VGATVGluT2_counts <- plotExpression(PAG_sceset_qc[, PAG_sceset_qc$cell.type == "VGluT2"], c("Slc32a1", "Slc17a6"), x = "mouse.id",
exprs_values = "counts",
colour_by = "batch.processing", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels in cells from VGluT2::Cre animals after QC (raw counts)") + xlab("Mouse ID") + ylab("Expression (raw counts)") +
theme_args_violin_bymouse
violin_QC_VGluT2bymouse_VGATVGluT2_counts
saveRDS(violin_QC_VGluT2bymouse_VGATVGluT2_counts, file = str_c(path_for_figures, "violin_QC_VGluT2bymouse_VGATVGluT2_counts.rds"))
ggsave(filename = str_c(date, "_violin_QC_VGluT2bymouse_VGATVGluT2_counts.pdf"),
plot = violin_QC_VGluT2bymouse_VGATVGluT2_counts,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# Using logcounts_raw
violin_QC_VGluT2bymouse_VGATVGluT2_logcounts_raw <- plotExpression(PAG_sceset_qc[, PAG_sceset_qc$cell.type == "VGluT2"], c("Slc32a1", "Slc17a6"), x = "mouse.id",
exprs_values = "logcounts_raw",
colour_by = "batch.processing", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels in cells from VGluT2::Cre animals after QC (log2 raw counts)") + xlab("Mouse ID") + ylab("Expression (log2 raw counts)") +
theme_args_violin_bymouse
violin_QC_VGluT2bymouse_VGATVGluT2_logcounts_raw
saveRDS(violin_QC_VGluT2bymouse_VGATVGluT2_logcounts_raw, file = str_c(path_for_figures, "violin_QC_VGluT2bymouse_VGATVGluT2_logcounts_raw.rds"))
ggsave(filename = str_c(date, "_violin_QC_VGluT2bymouse_VGATVGluT2_logcounts_raw.pdf"),
plot = violin_QC_VGluT2bymouse_VGATVGluT2_logcounts_raw,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# Using logcounts
violin_QC_VGluT2bymouse_VGATVGluT2_logcounts <- plotExpression(PAG_sceset_qc[, PAG_sceset_qc$cell.type == "VGluT2"], c("Slc32a1", "Slc17a6"), x = "mouse.id",
exprs_values = "logcounts",
colour_by = "batch.processing", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels in cells from VGluT2::Cre animals after QC (logcounts)") + xlab("Mouse ID") + ylab("Expression (log-normalised counts)") +
theme_args_violin_bymouse
violin_QC_VGluT2bymouse_VGATVGluT2_logcounts
saveRDS(violin_QC_VGluT2bymouse_VGATVGluT2_logcounts, file = str_c(path_for_figures, "violin_QC_VGluT2bymouse_VGATVGluT2_logcounts.rds"))
ggsave(filename = str_c(date, "_violin_QC_VGluT2bymouse_VGATVGluT2_logcounts.pdf"),
plot = violin_QC_VGluT2bymouse_VGATVGluT2_logcounts,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
## Violin plots by Cell Type
# Using counts
violin_QC_celltype_VGATVGluT2_counts <- plotExpression(PAG_sceset_qc, c("Slc32a1", "Slc17a6"), x = "cell.type",
exprs_values = "counts",
colour_by = "mouse.id", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels after QC (raw counts)") + xlab("Cell type") + ylab("Expression (raw counts)") +
theme_args_violin_bytype
violin_QC_celltype_VGATVGluT2_counts
saveRDS(violin_QC_celltype_VGATVGluT2_counts, file = str_c(path_for_figures, "violin_QC_celltype_VGATVGluT2_counts.rds"))
ggsave(filename = str_c(date, "_violin_QC_celltype_VGATVGluT2_counts.pdf"),
plot = violin_QC_celltype_VGATVGluT2_counts,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# Using logcounts_raw
violin_QC_celltype_VGATVGluT2_logcounts_raw <- plotExpression(PAG_sceset_qc, c("Slc32a1", "Slc17a6"), x = "cell.type",
exprs_values = "logcounts_raw",
colour_by = "mouse.id", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels after QC (log2 raw counts)") + xlab("Cell type") + ylab("Expression (log2 raw counts)") +
theme_args_violin_bytype
violin_QC_celltype_VGATVGluT2_logcounts_raw
saveRDS(violin_QC_celltype_VGATVGluT2_logcounts_raw, file = str_c(path_for_figures, "violin_QC_celltype_VGATVGluT2_logcounts_raw.rds"))
ggsave(filename = str_c(date, "_violin_QC_celltype_VGATVGluT2_logcounts_raw.pdf"),
plot = violin_QC_celltype_VGATVGluT2_logcounts_raw,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# Using logcounts
violin_QC_celltype_VGATVGluT2_logcounts <- plotExpression(PAG_sceset_qc, c("Slc32a1", "Slc17a6"), x = "cell.type",
exprs_values = "logcounts",
colour_by = "mouse.id", ncol = 2, theme_size = 11, point_alpha = 0.8) +
ggtitle("Transporter expression levels after QC (logcounts)") + xlab("Cell type") + ylab("Expression (log-normalised counts)") +
theme_args_violin_bytype
violin_QC_celltype_VGATVGluT2_logcounts
saveRDS(violin_QC_celltype_VGATVGluT2_logcounts, file = str_c(path_for_figures, "violin_QC_celltype_VGATVGluT2_logcounts.rds"))
ggsave(filename = str_c(date, "_violin_QC_celltype_VGATVGluT2_logcounts.pdf"),
plot = violin_QC_celltype_VGATVGluT2_logcounts,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
```
```{r}
## Scatter plot with counts
# In VGAT cells, plot the VGAT log2 counts against the VGluT2 counts
plot(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGAT"], "counts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGAT"], "counts"),
xlim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", ], "counts"))),
ylim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", ], "counts"))),
main = "Scatter plot of VGAT-VGluT2 expression after QC (raw counts)", font.main = 2, cex.main = 1.1,
xlab = "VGAT (raw counts)", ylab = "VGluT2 (raw counts)", font.lab = 1, cex.lab = 1,
type = "p", pch = 19, cex = 1.25, las = 1,
col = rgb(255, 119, 124, 255/2, maxColorValue = 255))
# Add the equivalent plot for VGluT2 cells
points(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGluT2"], "counts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGluT2"], "counts"),
type = "p", pch = 19, cex = 1, las = 1,
col = rgb(0, 156, 181, 255/6, maxColorValue = 255))
legend("topright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/6, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells"))
# Save the plot
pdf(file = str_c(path_for_figures, date, "_scatter_QC_VGAT-VGluT2_expression_counts.pdf"),
width = 5, height = 5, pointsize = 11,
family = "Arial", bg = "transparent")
plot(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGAT"], "counts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGAT"], "counts"),
xlim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", ], "counts"))),
ylim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", ], "counts"))),
main = "Scatter plot of VGAT-VGluT2 expression after QC (raw counts)", font.main = 2, cex.main = 1.1,
xlab = "VGAT (raw counts)", ylab = "VGluT2 (raw counts)", font.lab = 1, cex.lab = 1,
type = "p", pch = 19, cex = 1.25, las = 1,
col = rgb(255, 119, 124, 255/2, maxColorValue = 255))
points(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGluT2"], "counts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGluT2"], "counts"),
type = "p", pch = 19, cex = 1, las = 1,
col = rgb(0, 156, 181, 255/6, maxColorValue = 255))
legend("topright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/6, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells"))
dev.off()
## Scatter plot with logcounts_raw
# In VGAT cells, plot the VGAT log2 counts against the VGluT2 counts
plot(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGAT"], "logcounts_raw"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGAT"], "logcounts_raw"),
xlim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", ], "logcounts_raw"))),
ylim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", ], "logcounts_raw"))),
main = "Scatter plot of VGAT-VGluT2 expression after QC (log2 raw counts)", font.main = 2, cex.main = 1.1,
xlab = "VGAT (log2 raw counts)", ylab = "VGluT2 (log2 raw counts)", font.lab = 1, cex.lab = 1,
type = "p", pch = 19, cex = 1.25, las = 1,
col = rgb(255, 119, 124, 255/2, maxColorValue = 255))
# Add the equivalent plot for VGluT2 cells
points(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts_raw"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts_raw"),
type = "p", pch = 19, cex = 1, las = 1,
col = rgb(0, 156, 181, 255/6, maxColorValue = 255))
legend("topright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/6, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells"))
# Save the plot
pdf(file = str_c(path_for_figures, date, "_scatter_QC_VGAT-VGluT2_expression_logcounts_raw.pdf"),
width = 5, height = 5, pointsize = 11,
family = "Arial", bg = "transparent")
plot(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGAT"], "logcounts_raw"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGAT"], "logcounts_raw"),
xlim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", ], "logcounts_raw"))),
ylim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", ], "logcounts_raw"))),
main = "Scatter plot of VGAT-VGluT2 expression after QC (log2 raw counts)", font.main = 2, cex.main = 1.1,
xlab = "VGAT (log2 raw counts)", ylab = "VGluT2 (log2 raw counts)", font.lab = 1, cex.lab = 1,
type = "p", pch = 19, cex = 1.25, las = 1,
col = rgb(255, 119, 124, 255/2, maxColorValue = 255))
points(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts_raw"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts_raw"),
type = "p", pch = 19, cex = 1, las = 1,
col = rgb(0, 156, 181, 255/6, maxColorValue = 255))
legend("topright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/6, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells"))
dev.off()
## Scatter plot with logcounts
# In VGAT cells, plot the VGAT log2 counts against the VGluT2 counts
plot(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGAT"], "logcounts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGAT"], "logcounts"),
xlim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", ], "logcounts"))),
ylim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", ], "logcounts"))),
main = "Scatter plot of VGAT-VGluT2 expression after QC (logcounts)", font.main = 2, cex.main = 1.1,
xlab = "VGAT (logcounts)", ylab = "VGluT2 (logcounts)", font.lab = 1, cex.lab = 1,
type = "p", pch = 19, cex = 1.25,
col = rgb(255, 119, 124, 255/2, maxColorValue = 255))
# Add the equivalent plot for VGluT2 cells
points(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts"),
type = "p", pch = 19, cex = 1, las = 1,
col = rgb(0, 156, 181, 255/6, maxColorValue = 255))
legend("topright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/6, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells"))
# Save the plot
pdf(file = str_c(path_for_figures, date, "_scatter_QC_VGAT-VGluT2_expression_logcounts.pdf"),
width = 5, height = 5, pointsize = 11,
family = "Arial", bg = "transparent")
plot(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGAT"], "logcounts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGAT"], "logcounts"),
xlim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", ], "logcounts"))),
ylim = c(0, max(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", ], "logcounts"))),
main = "Scatter plot of VGAT-VGluT2 expression after QC (logcounts)", font.main = 2, cex.main = 1.1,
xlab = "VGAT (logcounts)", ylab = "VGluT2 (logcounts)", font.lab = 1, cex.lab = 1,
type = "p", pch = 19, cex = 1.25, las = 1,
col = rgb(255, 119, 124, 255/2, maxColorValue = 255))
points(assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc32a1", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts"),
assay(PAG_sceset_qc[rownames(PAG_sceset_qc)=="Slc17a6", PAG_sceset_qc$cell.type == "VGluT2"], "logcounts"),
type = "p", pch = 19, cex = 1, las = 1,
col = rgb(0, 156, 181, 255/6, maxColorValue = 255))
legend("topright", col = c(rgb(255, 119, 124, 255/2, maxColorValue = 255), rgb(0, 156, 181, 255/6, maxColorValue = 255)), pch = 19, cex = 0.8, bty = "n",
legend = c("VGAT::Cre cells", "VGluT2::Cre cells"))
dev.off()
```
## Step 3.3 | Compare the effects of different normalisation strategies
To compare the efficiency of different normalisation methods we can use visual inspection of PCA plots and calculation of cell-wise relative log expression via scater's `plotRLE()` function. Namely, cells with many (few) reads have higher (lower) than median expression for most genes resulting in a positive (negative) RLE across the cell, whereas normalised cells have an RLE closer to zero. Relative log expression (RLE) plots are a powerful tool for visualising unwanted variation in high dimensional data. These plots were originally devised for gene expression data from microarrays but can also be used on single-cell expression data. RLE plots are particularly useful for assessing whether a procedure aimed at removing unwanted variation (e.g., scaling normalisation) has been successful.
If we look at the RLE plot, the graph shows a box plot of the RLE values for each cell (each line is a boxplot), the circles indicate the median (nearly all should be at 0 after normalisation), and the line indicates the interquartile range.
If style is “full”, the usual ggplot2 boxplot is created for each cell. Here, the box shows the inter-quartile range and whiskers extend no more than 1.5 times the IQR from the hinge (the 25th or 75th percentile). Data beyond the whiskers are called outliers and are plotted individually. The median (50th percentile) is shown with a white bar. This approach is detailed and flexible, but can take a long time to plot for large datasets.
If style is “minimal”, a Tufte-style boxplot is created for each cell. Here, the median is shown with a circle, the IQR in a grey line, and “whiskers” (as defined above) for the plots are shown with coloured lines. No outliers are shown for this plot style. This approach is more succinct and faster for large numbers of cells.
```{r}
library(ggplot2)
library(sgeostat)
library(mvoutlier)
# We can set any visualization parameters common to all plots at the beginning like this:
theme_args_RLE <- theme(plot.title = element_text(size = 12, face = "bold"),
axis.title = element_text(size = 11, face = "plain"),
axis.text = element_text(size = 10, face = "plain"),
legend.text = element_text(size = 9, face = "plain"))
## Using the "minimal" option
# Without log-transformation
p1 <- plotRLE(PAG_sceset_qc,
exprs_values = "counts",
exprs_logged = TRUE,
style = "minimal", # "full" or "minimal"
legend = TRUE,
colour_by = "batch.processing", # "mouse.id" or "batch.processing"
theme_size = 11
) + ggtitle("Cell-wise relative log expression (raw counts)") + theme_args_RLE
ggsave(filename = str_c(date, "_RLE_minimal_counts.pdf"),
plot = p1,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# With log-transformation
p2 <- plotRLE(PAG_sceset_qc,
exprs_values = "logcounts_raw",
exprs_logged = TRUE,
style = "minimal", # "full" or "minimal"
legend = TRUE,
colour_by = "batch.processing" # "mouse.id" or "batch.processing"
) + ggtitle("Cell-wise relative log expression (log2 raw counts)") + theme_args_RLE
ggsave(filename = str_c(date, "_RLE_minimal_logcounts_raw.pdf"),
plot = p2,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# After scran normalisation
p3 <- plotRLE(PAG_sceset_qc,
exprs_values = "logcounts",
exprs_logged = TRUE,
style = "minimal", # "full" or "minimal"
legend = TRUE,
colour_by = "batch.processing" # "mouse.id" or "batch.processing"
) + ggtitle("Cell-wise relative log expression (log-normalised counts)") + theme_args_RLE
ggsave(filename = str_c(date, "_RLE_minimal_logcounts.pdf"),
plot = p3,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
plot(p1)
plot(p2)
plot(p3)
## Using the "full" option
# Without log-transformation
p1_full <- plotRLE(PAG_sceset_qc,
exprs_values = "counts",
exprs_logged = TRUE,
style = "full", # "full" or "minimal"
legend = TRUE,
colour_by = "mouse.id" # "mouse.id" or "batch.processing"
) + ggtitle("Cell-wise relative log expression (raw counts)") + theme_args_RLE
ggsave(filename = str_c(date, "_RLE_full_counts.pdf"),
plot = p1_full,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# With log-transformation
p2_full <- plotRLE(PAG_sceset_qc,
exprs_values = "logcounts_raw",
exprs_logged = TRUE,
style = "full", # "full" or "minimal"
legend = TRUE,
colour_by = "mouse.id" # "mouse.id" or "batch.processing"
) + ggtitle("Cell-wise relative log expression (log2 raw counts)") + theme_args_RLE
ggsave(filename = str_c(date, "_RLE_full_logcounts_raw.pdf"),
plot = p2_full,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# After scran normalisation
p3_full <- plotRLE(PAG_sceset_qc,
exprs_values = "logcounts",
exprs_logged = TRUE,
style = "full", # "full" or "minimal"
legend = TRUE,
colour_by = "mouse.id" # "mouse.id" or "batch.processing"
) + ggtitle("Cell-wise relative log expression (log-normalised counts)") + theme_args_RLE
ggsave(filename = str_c(date, "_RLE_full_logcounts.pdf"),
plot = p3_full,
device = "pdf", # or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).
path = path_for_figures,
width = 8, height = 4, units = "in", dpi = 300,
family = "Arial", bg = "transparent"
)
# If you use `style = "full"`, the resulting plots will be too heavy to plot at once (around 140MB each), so plot each graph one at a time if that option is selected instead of "minimal":
#plot(p1_full)
#plot(p2_full)
#plot(p3_full)
```
## Step 3.5 | Save the SingleCellExperiment object containing the log-normalised counts
We now have normalised our gene expression values for our dataset, so we can store our `SingleCellExperiment` object and proceed to identifying highly variable genes.
```{r}
# Save the normalised data:
saveRDS(PAG_sceset_qc, file = "PAG_sceset_qc_norm.rds")
print("Part 3 - Done!")
```
```{r}
sessionInfo()
```