Workshop introduction

Instructors

Dr. Stefano Mangiola is currently a Postdoctoral researcher in the laboratory of Prof. Tony Papenfuss. His background spans from biotechnology to bioinformatics and biostatistics. His research focuses on prostate and breast tumour microenvironment, the development of statistical model for the analysis of RNA sequencing data, and data analysis and visualisation interfaces.

Dr. Maria Doyle is the Application and Training Specialist for Research Computing at the Peter MacCallum Cancer Centre in Melbourne, Australia. She has a PhD in Molecular Biology and currently works in bioinformatics and data science education and training. She is passionate about supporting researchers, reproducible research, open source and tidy data.

Overview

This tutorial will present how to perform analysis of single-cell and bulk RNA sequencing data following the tidy data paradigm. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions.

This can be achieved with the integration of packages present in the R CRAN and Bioconductor ecosystem, including tidyseurat, tidySingleCellExperiment, tidybulk, tidyHeatmap and tidyverse. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis. For more information see the tidy transcriptomics blog.

Pre-requisites

Familiarity with tidyverse syntax
Some familiarity with bulk RNA-seq and single cell RNA-seq

Strongly recommended background reading:

https://melbournebioinformatics.github.io/r-intro-biologists/intro_r_biologists.html
https://towardsdatascience.com/coding-in-r-nest-and-map-your-way-to-efficient-code-4e44ba58ee4a by Rebecca O’Dwyer
https://finnstats.com/index.php/2021/04/02/tidyverse-in-r/

Time outline

The workshop format is a 3 hour session consisting of hands-on demos, exercises and Q&A.

Guide

Activity	Time
Part 1 Bulk RNA-seq Core
Hands-on Demos + Exercises	90m
Differential gene expression
Cell type composition analysis
Part 2 Single-cell RNA-seq
Hands-on Demos + Exercises	90m
Single-cell analysis
Pseudobulk analysis
Total	180m

Format: Hands on demos, exercises plus Q&A Interact: Zoom chat, Menti quiz for challenges

Workshop goals and objectives

In exploring and analysing RNA sequencing data, there are a number of key concepts, such as filtering, scaling, dimensionality reduction, hypothesis testing, clustering and visualisation, that need to be understood. These concepts can be intuitively explained to new users, however, (i) the use of a heterogeneous vocabulary and jargon by methodologies/algorithms/packages, (ii) the complexity of data wrangling, and (iii) the coding burden, impede effective learning of the statistics and biology underlying an informed RNA sequencing analysis.

The tidytranscriptomics approach to RNA sequencing data analysis abstracts out the coding-related complexity and provides tools that use an intuitive and jargon-free vocabulary, enabling focus on the statistical and biological challenges.

Learning goals

To understand the key concepts and steps of RNA sequencing data analysis
To approach data representation and analysis though a tidy data paradigm, integrating tidyverse with tidybulk, tidyseurat, tidySingleCellExperiment and tidyHeatmap.

Learning objectives

Recall the key concepts of RNA sequencing data analysis
Apply the concepts to publicly available data
Create plots that summarise the information content of the data and analysis results

Google doc with all links

This Google doc has all the links you need for this workshop.

Material
Menti for Audience Questions and Challenges Quiz
RStudio cloud logins
Feedback survey

Audience Questions

First we have a few questions for you, the audience.

Please either open a tab in your browser or use your phone. Go to Menti (www.menti.com) and type the code given in the Google doc above.

Then please rate the following on a scale of 1-5 (1=no/none, 5=yes/a lot)

I will code along and/or try out exercises during this workshop.
I have experience with transcriptomic analyses
I have experience with tidyverse

What is transcriptomics?

“The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells”

Wikipedia

Why use transcriptomics?

Genome (DNA) pretty stable
Proteome (proteins) harder to measure
Transcriptome (RNA) can measure changes in expression of thousands of coding and non-coding transcripts

Possible experimental design

How does transcriptomics work?

Types of transcriptomic analyses

Differential expression
Cell type composition
Alternative splicing
Novel transcript discovery
Fusions identification
Variant analysis

Topics in bold we will see in this workshop

Bulk RNA sequencing differential expression workflow

Differences between bulk and single-cell RNA sequencing

Shalek and Benson, 2017

Single-cell RNA sequencing analysis workflow

Tidy data and the tidyverse

This workshop demonstrates how to perform analysis of RNA sequencing data following the tidy data paradigm (Wickham and others 2014). The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. For more information, see the R for Data Science chapter on tidy data here.

Tidy data for efficiency, reproducibilty and collaboration

The tidyverse is a collection of packages that can be used to tidy, manipulate and visualise data. We’ll use many functions from the tidyverse in this workshop, such as filter, select, mutate, pivot_longer and ggplot.

Adapted from Getting Started with tidyverse in R

Getting started

Cloud

Easiest way to run this material. Only available during workshop. The organisers will share the access link to the RStudio Cloud project at the start of the workshop. You then click the button to make a copy of the project, as in screenshot below.

RStudio Cloud project

Local

We will use RStudio Cloud during the RPharma workshop. After the workshop, if you want to install on your own computer, see instructions here.

Alternatively, you can view the material at the workshop webpage here.

What is tidytranscriptomics?

tidybulk, tidySummarizedExperiment and tidySingleCellExperiment are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis. A diagram showing how the tidytranscriptomics packages integrate the [tidyverse] (https://www.tidyverse.org/) with Bioconductor and Seurat is shown in the figure below.

Part 1 Bulk RNA sequencing with tidySummarizedExperiment and tidybulk

Tidy Transcriptomics blog

Acknowledgements

Some of the material in Part 1 was adapted from an R RNA sequencing workshop first run here. The use of the airway dataset was inspired by the DESeq2 vignette.

Introduction

Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next-generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA sequencing to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.

There are many steps involved in analysing an RNA sequencing dataset. Sequenced reads are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on in R. While mapping and counting are important and necessary tasks, today, we will be starting from the count data and showing how differential expression analysis can be performed in a friendly way using the Bioconductor package, tidybulk.

First, let’s load all the packages we will need to analyse the data.

Note: you should load the tidybulk and tidySummarizedExperiment libraries after the tidyverse core packages for best integration.

# dataset
library(airway)

# tidyverse core packages
library(tibble)
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(ggplot2)
library(purrr)

# tidyverse-friendly packages
library(plotly)
library(ggrepel)
library(GGally)
library(tidyHeatmap)
library(tidybulk)
# library(tidySummarizedExperiment) # we'll load this below to show what it can do

Plot settings. Set the colours and theme we will use for our plots.

# Use colourblind-friendly colours
friendly_cols <- dittoSeq::dittoColors()

# Set theme
custom_theme <-
  list(
    scale_fill_manual(values = friendly_cols),
    scale_color_manual(values = friendly_cols),
    theme_bw() +
      theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size = 12),
        legend.position = "bottom",
        strip.background = element_blank(),
        axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
        axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
      )
  )

tidySummarizedExperiment

Here we will perform our analysis using the data from the airway package. The airway data comes from the paper by (Himes et al. 2014); and it includes 8 samples from human airway smooth muscle cells, from 4 cell lines. For each cell line treated (with dexamethasone) and untreated (negative control) a sample has undergone RNA sequencing and gene counts have been generated.

# load airway RNA sequencing data
data(airway)

# take a look
airway
#> class: RangedSummarizedExperiment 
#> dim: 64102 8 
#> metadata(1): ''
#> assays(1): counts
#> rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
#> rowData names(0):
#> colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
#> colData names(9): SampleName cell ... Sample BioSample

The data in the airway package is a Bioconductor SummarizedExperiment object. For more information see here.

The tidySummarizedExperiment package enables any SummarizedExperiment object to be displayed and manipulated according to tidy data principles, without affecting any SummarizedExperiment behaviour.

If we load the tidySummarizedExperiment package and then view the airway data it now displays as a tibble. A tibble is the tidyverse table format.

# load tidySummarizedExperiment package
library(tidySummarizedExperiment)

# take a look
airway
#> # A SummarizedExperiment-tibble abstraction: 512,816 × 13
#> [90m# Transcripts=64102 | Samples=8 | Assays=counts[39m
#>    .feature .sample counts SampleName cell  dex   albut Run   avgLength
#>    <chr>    <chr>    <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG000… SRR103…    679 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG000… SRR103…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG000… SRR103…    467 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG000… SRR103…    260 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG000… SRR103…     60 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG000… SRR103…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG000… SRR103…   3251 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG000… SRR103…   1433 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG000… SRR103…    519 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG000… SRR103…    394 GSM1275862 N613… untrt untrt SRR1…       126
#> # … with 40 more rows, and 4 more variables: Experiment <fct>, Sample <fct>,
#> #   BioSample <fct>, GRangesList <list>

Now we can more easily see the data. The airway object contains information about genes and samples, the first column has the Ensembl gene identifier, the second column has the sample identifier and the third column has the gene transcription abundance. The abundance is the number of reads aligning to the gene in each experimental sample. The remaining columns include sample information. The dex column tells us whether the samples are treated or untreated and the cell column tells us what cell line they are from.

We can still interact with the tidy SummarizedExperiment object using commands for SummarizedExperiment objects.

assays(airway)
#> List of length 1
#> names(1): counts

Tidyverse commands

And now we can also use tidyverse commands, such as filter, select, group_by, summarise and mutate to explore the tidy SummarizedExperiment object. Some examples are shown below and more can be seen at the tidySummarizedExperiment website here. We can also use the tidyverse pipe %>%. This ‘pipes’ the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe see here.

We can use filter to choose rows, for example, to see just the rows for the treated samples.

airway %>% filter(dex == "trt")
#> # A SummarizedExperiment-tibble abstraction: 256,408 × 13
#> [90m# Transcripts=64102 | Samples=4 | Assays=counts[39m
#>    .feature .sample counts SampleName cell  dex   albut Run   avgLength
#>    <chr>    <chr>    <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG000… SRR103…    448 GSM1275863 N613… trt   untrt SRR1…       126
#>  2 ENSG000… SRR103…      0 GSM1275863 N613… trt   untrt SRR1…       126
#>  3 ENSG000… SRR103…    515 GSM1275863 N613… trt   untrt SRR1…       126
#>  4 ENSG000… SRR103…    211 GSM1275863 N613… trt   untrt SRR1…       126
#>  5 ENSG000… SRR103…     55 GSM1275863 N613… trt   untrt SRR1…       126
#>  6 ENSG000… SRR103…      0 GSM1275863 N613… trt   untrt SRR1…       126
#>  7 ENSG000… SRR103…   3679 GSM1275863 N613… trt   untrt SRR1…       126
#>  8 ENSG000… SRR103…   1062 GSM1275863 N613… trt   untrt SRR1…       126
#>  9 ENSG000… SRR103…    380 GSM1275863 N613… trt   untrt SRR1…       126
#> 10 ENSG000… SRR103…    236 GSM1275863 N613… trt   untrt SRR1…       126
#> # … with 40 more rows, and 4 more variables: Experiment <fct>, Sample <fct>,
#> #   BioSample <fct>, GRangesList <list>

We can use select to choose columns, for example, to see the sample, cell line and treatment columns.

airway %>% select(.sample, cell, dex)
#> tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A tibble: 512,816 × 3
#>    .sample    cell   dex  
#>    <chr>      <fct>  <fct>
#>  1 SRR1039508 N61311 untrt
#>  2 SRR1039508 N61311 untrt
#>  3 SRR1039508 N61311 untrt
#>  4 SRR1039508 N61311 untrt
#>  5 SRR1039508 N61311 untrt
#>  6 SRR1039508 N61311 untrt
#>  7 SRR1039508 N61311 untrt
#>  8 SRR1039508 N61311 untrt
#>  9 SRR1039508 N61311 untrt
#> 10 SRR1039508 N61311 untrt
#> # … with 512,806 more rows

We can combine group_by and summarise to calculate the total counts for each sample.

airway %>%
    group_by(.sample) %>%
    summarise(total_counts=sum(counts))
#> tidySummarizedExperiment says: A data frame is returned for independent data analysis.
#> # A tibble: 8 × 2
#>   .sample    total_counts
#>   <chr>             <int>
#> 1 SRR1039508     20637971
#> 2 SRR1039509     18809481
#> 3 SRR1039512     25348649
#> 4 SRR1039513     15163415
#> 5 SRR1039516     24448408
#> 6 SRR1039517     30818215
#> 7 SRR1039520     19126151
#> 8 SRR1039521     21164133

We can use mutate to create a column. For example, we could create a new sample_name column that contains shorter sample names. We can remove the SRR1039 prefix that’s present in all of the samples, as shorter names can fit better in some of the plots we will create. We can use mutate together with str_replace to remove the SRR1039 string from the sample column.

airway %>%
  mutate(sample_name=str_remove(.sample, "SRR1039")) %>%
  # select columns to view    
  select(.sample, sample_name)
#> tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A tibble: 512,816 × 2
#>    .sample    sample_name
#>    <chr>      <chr>      
#>  1 SRR1039508 508        
#>  2 SRR1039508 508        
#>  3 SRR1039508 508        
#>  4 SRR1039508 508        
#>  5 SRR1039508 508        
#>  6 SRR1039508 508        
#>  7 SRR1039508 508        
#>  8 SRR1039508 508        
#>  9 SRR1039508 508        
#> 10 SRR1039508 508        
#> # … with 512,806 more rows

Tidybulk workflow: functions/utilities available

Function	Description
`aggregate_duplicates`	Aggregate abundance and annotation of duplicated transcripts in a robust way
`identify_abundant` `keep_abundant`	identify or keep the abundant genes
`keep_variable`	Filter for top variable features
`scale_abundance`	Scale (normalise) abundance for RNA sequencing depth
`reduce_dimensions`	Perform dimensionality reduction (PCA, MDS, tSNE, UMAP)
`cluster_elements`	Labels elements with cluster identity (kmeans, SNN)
`remove_redundancy`	Filter out elements with highly correlated features
`adjust_abundance`	Remove known unwanted variation (Combat)
`test_differential_abundance`	Differential transcript abundance testing (DESeq2, edgeR, voom)
`deconvolve_cellularity`	Estimated tissue composition (Cibersort, llsr, epic, xCell, mcp_counter, quantiseq
`test_differential_cellularity`	Differential cell-type abundance testing
`test_stratification_cellularity`	Estimate Kaplan-Meier survival differences
`test_gene_enrichment`	Gene enrichment analyses (EGSEA)
`test_gene_overrepresentation`	Gene enrichment on list of transcript names (no rank)
`test_gene_rank`	Gene enrichment on list of transcript (GSEA)
`impute_missing_abundance`	Impute abundance for missing data points using sample groupings

Utilities	Description
`get_bibliography`	Get the bibliography of your workflow
`pivot_sample`	Select sample-wise columns/information
`pivot_transcript`	Select transcript-wise columns/information
`describe_transcript`	Add gene description from gene symbol

Setting up the data

We’ll set up the airway data for our RNA sequencing analysis. We’ll create a column with shorter sample names and a column with gene symbols. We can get the gene symbols for these Ensembl gene ids using the Bioconductor annotation package for human, org.Hs.eg.db.

# setup data workflow
counts <-
  airway %>%
  mutate(sample_name = str_remove(.sample, "SRR1039")) %>%
  mutate(symbol = AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    keys = .feature,
    keytype = "ENSEMBL",
    column = "SYMBOL",
    multiVals = "first"
  ))
#> 
#> 'select()' returned 1:many mapping between keys and columns
# take a look
counts
#> # A SummarizedExperiment-tibble abstraction: 512,816 × 15
#> [90m# Transcripts=64102 | Samples=8 | Assays=counts[39m
#>    .feature .sample counts SampleName cell  dex   albut Run   avgLength
#>    <chr>    <chr>    <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG000… SRR103…    679 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG000… SRR103…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG000… SRR103…    467 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG000… SRR103…    260 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG000… SRR103…     60 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG000… SRR103…      0 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG000… SRR103…   3251 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG000… SRR103…   1433 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG000… SRR103…    519 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG000… SRR103…    394 GSM1275862 N613… untrt untrt SRR1…       126
#> # … with 40 more rows, and 6 more variables: Experiment <fct>, Sample <fct>,
#> #   BioSample <fct>, sample_name <chr>, symbol <chr>, GRangesList <list>

Filtering lowly transcribed genes

Genes with very low counts across all libraries provide little evidence for differential expression and they can interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing the power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

We can perform the filtering using tidybulk keep_abundant or identify_abundant. These functions can use the edgeR filterByExpr function described in (Law et al. 2016) to automatically identify the genes with adequate abundance for differential expression testing. By default, this will keep genes with ~10 counts in a minimum number of samples, the number of the samples in the smallest group. In this dataset, the smallest group size is four (as we have four dex-treated samples versus four untreated). Alternatively, we could use identify_abundant to identify which genes are abundant or not (TRUE/FALSE), rather than just keeping the abundant ones.

# Filtering counts
counts_filtered <- counts %>% keep_abundant(factor_of_interest = dex)

# take a look
counts_filtered
#> # A SummarizedExperiment-tibble abstraction: 127,408 × 16
#> [90m# Transcripts=15926 | Samples=8 | Assays=counts[39m
#>    .feature .sample counts SampleName cell  dex   albut Run   avgLength
#>    <chr>    <chr>    <int> <fct>      <fct> <fct> <fct> <fct>     <int>
#>  1 ENSG000… SRR103…    679 GSM1275862 N613… untrt untrt SRR1…       126
#>  2 ENSG000… SRR103…    467 GSM1275862 N613… untrt untrt SRR1…       126
#>  3 ENSG000… SRR103…    260 GSM1275862 N613… untrt untrt SRR1…       126
#>  4 ENSG000… SRR103…     60 GSM1275862 N613… untrt untrt SRR1…       126
#>  5 ENSG000… SRR103…   3251 GSM1275862 N613… untrt untrt SRR1…       126
#>  6 ENSG000… SRR103…   1433 GSM1275862 N613… untrt untrt SRR1…       126
#>  7 ENSG000… SRR103…    519 GSM1275862 N613… untrt untrt SRR1…       126
#>  8 ENSG000… SRR103…    394 GSM1275862 N613… untrt untrt SRR1…       126
#>  9 ENSG000… SRR103…    172 GSM1275862 N613… untrt untrt SRR1…       126
#> 10 ENSG000… SRR103…   2112 GSM1275862 N613… untrt untrt SRR1…       126
#> # … with 40 more rows, and 7 more variables: Experiment <fct>, Sample <fct>,
#> #   BioSample <fct>, sample_name <chr>, symbol <chr>, .abundant <lgl>,
#> #   GRangesList <list>

After running keep_abundant we have a column called .abundant containing TRUE (identify_abundant would have TRUE/FALSE).

Scaling counts to normalise

Scaling of counts (normalisation) is performed to eliminate uninteresting differences between samples due to sequencing depth or composition. A more detailed explanation can be found here. In the tidybulk package, the function scale_abundance generates scaled counts, with scaling factors calculated on abundant (filtered) transcripts and applied to all transcripts. We can choose from different normalisation methods. Here we will use the default edgeR’s trimmed mean of M values (TMM), (Robinson and Oshlack 2010). TMM normalisation (and most scaling normalisation methods) scale relative to one sample.

# Scaling counts
counts_scaled <- counts_filtered %>% scale_abundance()

# take a look
counts_scaled
#> # A SummarizedExperiment-tibble abstraction: 127,408 × 19
#> [90m# Transcripts=15926 | Samples=8 | Assays=counts, counts_scaled[39m
#>    .feature .sample counts counts_scaled SampleName cell  dex   albut Run  
#>    <chr>    <chr>    <int>         <dbl> <fct>      <fct> <fct> <fct> <fct>
#>  1 ENSG000… SRR103…    679         961.  GSM1275862 N613… untrt untrt SRR1…
#>  2 ENSG000… SRR103…    467         661.  GSM1275862 N613… untrt untrt SRR1…
#>  3 ENSG000… SRR103…    260         368.  GSM1275862 N613… untrt untrt SRR1…
#>  4 ENSG000… SRR103…     60          84.9 GSM1275862 N613… untrt untrt SRR1…
#>  5 ENSG000… SRR103…   3251        4601.  GSM1275862 N613… untrt untrt SRR1…
#>  6 ENSG000… SRR103…   1433        2028.  GSM1275862 N613… untrt untrt SRR1…
#>  7 ENSG000… SRR103…    519         734.  GSM1275862 N613… untrt untrt SRR1…
#>  8 ENSG000… SRR103…    394         558.  GSM1275862 N613… untrt untrt SRR1…
#>  9 ENSG000… SRR103…    172         243.  GSM1275862 N613… untrt untrt SRR1…
#> 10 ENSG000… SRR103…   2112        2989.  GSM1275862 N613… untrt untrt SRR1…
#> # … with 40 more rows, and 10 more variables: avgLength <int>,
#> #   Experiment <fct>, Sample <fct>, BioSample <fct>, sample_name <chr>,
#> #   TMM <dbl>, multiplier <dbl>, symbol <chr>, .abundant <lgl>,
#> #   GRangesList <list>

After we run, we should see some columns have been added at the end. The counts_scaled column contains the scaled counts.

We can visualise the difference in abundance densities before and after scaling. As tidybulk output is compatible with tidyverse, we can simply pipe it into standard tidyverse functions such as filter, pivot_longer and ggplot. We can also take advantage of ggplot’s facet_wrap to easily create multiple plots.

counts_scaled %>%

  # Reshaping
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%

  # Plotting
  ggplot(aes(x = abundance + 1, color = sample_name)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme
#> tidySummarizedExperiment says: A data frame is returned for independent data analysis.

In this dataset, the distributions of the counts are not very different to each other before scaling, but scaling does make the distributions more similar. If we saw a sample with a very different distribution, we may need to investigate it.

As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from (Law et al. 2016)).

tidybulk

# tidybulk
airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() %>%
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%
  ggplot(aes(x = abundance + 1, color = sample)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme

base R using edgeR

# Example code, no need to run

# Prepare data set
library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
nsamples <- ncol(dgList)
logcounts <- log2(dgList$counts)

# Setup graphics
col <- RColorBrewer::brewer.pal(nsamples, "Paired")
par(mfrow = c(1, 2))

# Plot raw counts
plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title(main = "Counts")
for (i in 2:nsamples) {
  den <- density(logcounts[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n")

# Plot scaled counts
dgList_norm <- calcNormFactors(dgList)
lcpm_n <- cpm(dgList_norm, log = TRUE)
plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title("Counts scaled")
for (i in 2:nsamples) {
  den <- density(lcpm_n[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n")

Exploratory analyses

Dimensionality reduction

By far, one of the most important plots we make when we analyse RNA sequencing data are principal-component analysis (PCA) or multi-dimensional scaling (MDS) plots. We reduce the dimensions of the data to identify the greatest sources of variation in the data. A principal components analysis is an example of an unsupervised analysis, where we don’t need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the reduce_dimensions function to calculate the dimensions.

# Get principal components
counts_scal_PCA <-
  counts_scaled %>%
  reduce_dimensions(method = "PCA")
#> Getting the 500 most variable genes
#> Fraction of variance explained by the selected principal components
#> # A tibble: 2 × 2
#>   `Fraction of variance`    PC
#>                    <dbl> <int>
#> 1                  0.353     1
#> 2                  0.312     2
#> tidybulk says: to access the raw results do `attr(..., "internals")$PCA`

Challenge: What fraction of variance is explained by PC3?
Select one of the multiple choice options in www.menti.com (code in Google doc).

This joins the result to the counts’ object.

# Take a look
counts_scal_PCA
#> # A SummarizedExperiment-tibble abstraction: 127,408 × 21
#> [90m# Transcripts=15926 | Samples=8 | Assays=counts, counts_scaled[39m
#>    .feature .sample counts counts_scaled SampleName cell  dex   albut Run  
#>    <chr>    <chr>    <int>         <dbl> <fct>      <fct> <fct> <fct> <fct>
#>  1 ENSG000… SRR103…    679         961.  GSM1275862 N613… untrt untrt SRR1…
#>  2 ENSG000… SRR103…    467         661.  GSM1275862 N613… untrt untrt SRR1…
#>  3 ENSG000… SRR103…    260         368.  GSM1275862 N613… untrt untrt SRR1…
#>  4 ENSG000… SRR103…     60          84.9 GSM1275862 N613… untrt untrt SRR1…
#>  5 ENSG000… SRR103…   3251        4601.  GSM1275862 N613… untrt untrt SRR1…
#>  6 ENSG000… SRR103…   1433        2028.  GSM1275862 N613… untrt untrt SRR1…
#>  7 ENSG000… SRR103…    519         734.  GSM1275862 N613… untrt untrt SRR1…
#>  8 ENSG000… SRR103…    394         558.  GSM1275862 N613… untrt untrt SRR1…
#>  9 ENSG000… SRR103…    172         243.  GSM1275862 N613… untrt untrt SRR1…
#> 10 ENSG000… SRR103…   2112        2989.  GSM1275862 N613… untrt untrt SRR1…
#> # … with 40 more rows, and 12 more variables: avgLength <int>,
#> #   Experiment <fct>, Sample <fct>, BioSample <fct>, sample_name <chr>,
#> #   TMM <dbl>, multiplier <dbl>, PC1 <dbl>, PC2 <dbl>, symbol <chr>,
#> #   .abundant <lgl>, GRangesList <list>

For plotting, we can select just the sample-wise information with pivot_sample.

# take a look
counts_scal_PCA %>% pivot_sample()
#> # A tibble: 8 × 15
#>   sample     SampleName cell    dex   albut Run     avgLength Experiment Sample 
#>   <chr>      <fct>      <fct>   <fct> <fct> <fct>       <int> <fct>      <fct>  
#> 1 SRR1039508 GSM1275862 N61311  untrt untrt SRR103…       126 SRX384345  SRS508…
#> 2 SRR1039509 GSM1275863 N61311  trt   untrt SRR103…       126 SRX384346  SRS508…
#> 3 SRR1039512 GSM1275866 N052611 untrt untrt SRR103…       126 SRX384349  SRS508…
#> 4 SRR1039513 GSM1275867 N052611 trt   untrt SRR103…        87 SRX384350  SRS508…
#> 5 SRR1039516 GSM1275870 N080611 untrt untrt SRR103…       120 SRX384353  SRS508…
#> 6 SRR1039517 GSM1275871 N080611 trt   untrt SRR103…       126 SRX384354  SRS508…
#> 7 SRR1039520 GSM1275874 N061011 untrt untrt SRR103…       101 SRX384357  SRS508…
#> 8 SRR1039521 GSM1275875 N061011 trt   untrt SRR103…        98 SRX384358  SRS508…
#> # … with 6 more variables: BioSample <fct>, sample_name <chr>, TMM <dbl>,
#> #   multiplier <dbl>, PC1 <dbl>, PC2 <dbl>

We can now plot the reduced dimensions.

# PCA plot
counts_scal_PCA %>%
  pivot_sample() %>%
  ggplot(aes(x = PC1, y = PC2, colour = dex, shape = cell)) +
  geom_point() +
  geom_text_repel(aes(label = sample_name), show.legend = FALSE) +
  custom_theme

The samples group by treatment on PC1 which is what we hope to see. PC2 separates the N080611 cell line from the other samples, indicating a greater difference between that cell line and the others.

Hierarchical clustering with heatmaps

An alternative to principal component analysis for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. tidybulk has a simple function we can use, keep_variable, to extract the most variable genes which we can then plot with tidyHeatmap.

counts_scal_PCA %>%

  # extract 500 most variable genes
  keep_variable(.abundance = counts_scaled, top = 500) %>%
  as_tibble() %>%
  
  # create heatmap
  heatmap(
    .column = sample_name,
    .row = .feature,
    .value = counts_scaled,
    transform = log1p
  ) %>%
  add_tile(dex) %>%
  add_tile(cell)
#> Getting the 500 most variable genes
#> tidyHeatmap says: (once per session) from release 1.2.3 the grouping labels have white background by default. To add color for one-ay grouping specify palette_grouping = list(c("red", "blue"))
#> Warning in add_annotation(., !!.column, type = "tile", palette_discrete
#> = .data@data %>% : tidyHeatmap says: nested/list column are present in your data
#> frame and have been dropped as their unicity cannot be identified by dplyr.

#> Warning in add_annotation(., !!.column, type = "tile", palette_discrete
#> = .data@data %>% : tidyHeatmap says: nested/list column are present in your data
#> frame and have been dropped as their unicity cannot be identified by dplyr.

In the heatmap, we can see the samples cluster into two groups, treated and untreated, for three of the cell lines, and the cell line (N080611) again is further away from the others.

Tidybulk enables a simplified way of generating a clustered heatmap of variable genes. Compare the code below for tidybulk versus a base R method.

base R using edgeR

# Example code, no need to run

library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
dgList <- calcNormFactors(dgList)
logcounts <- cpm(dgList, log = TRUE)
var_genes <- apply(logcounts, 1, var)
select_var <- names(sort(var_genes, decreasing = TRUE))[1:500]
highly_variable_lcpm <- logcounts[select_var, ]
colours <- c("#440154FF", "#21908CFF", "#fefada")
col.group <- c("red", "grey")[group]
gplots::heatmap.2(highly_variable_lcpm, col = colours, trace = "none", ColSideColors = col.group, scale = "row")

Differential expression

tidybulk integrates several popular methods for differential transcript abundance testing: the edgeR quasi-likelihood (Chen, Lun, and Smyth 2016) (tidybulk default method), edgeR likelihood ratio (McCarthy, Chen, and Smyth 2012), limma-voom (Law et al. 2014) and DESeq2 (Love, Huber, and Anders 2014). A common question researchers have is which method to choose. With tidybulk, we can easily run multiple methods and compare.

We give test_differential_abundance our tidybulk counts object and a formula, specifying the column that contains our groups to be compared. If all our samples were from the same cell line, and there were no additional factors contributing variance, such as batch differences, we could use the formula ~ dex. However, each treated and untreated sample is from a different cell line, so we add the cell line as an additional factor ~ dex + cell.

de_all <-

  counts_scal_PCA %>%

  # edgeR QLT
  test_differential_abundance(
    ~ dex + cell,
    method = "edgeR_quasi_likelihood",
    prefix = "edgerQLT_"
  ) %>%

  # edgeR LRT
  test_differential_abundance(
    ~ dex + cell,
    method = "edgeR_likelihood_ratio",
    prefix = "edgerLR_"
  ) %>%

  # limma-voom
  test_differential_abundance(
    ~ dex + cell,
    method = "limma_voom",
    prefix = "voom_"
  ) %>%

  # DESeq2
  test_differential_abundance(
    ~ dex + cell,
    method = "deseq2",
    prefix = "deseq2_"
  )

# take a look

de_all
#> # A SummarizedExperiment-tibble abstraction: 127,408 × 43
#> [90m# Transcripts=15926 | Samples=8 | Assays=counts, counts_scaled[39m
#>    .feature .sample counts counts_scaled SampleName cell  dex   albut Run  
#>    <chr>    <chr>    <int>         <dbl> <fct>      <fct> <fct> <fct> <fct>
#>  1 ENSG000… SRR103…    679         961.  GSM1275862 N613… untrt untrt SRR1…
#>  2 ENSG000… SRR103…    467         661.  GSM1275862 N613… untrt untrt SRR1…
#>  3 ENSG000… SRR103…    260         368.  GSM1275862 N613… untrt untrt SRR1…
#>  4 ENSG000… SRR103…     60          84.9 GSM1275862 N613… untrt untrt SRR1…
#>  5 ENSG000… SRR103…   3251        4601.  GSM1275862 N613… untrt untrt SRR1…
#>  6 ENSG000… SRR103…   1433        2028.  GSM1275862 N613… untrt untrt SRR1…
#>  7 ENSG000… SRR103…    519         734.  GSM1275862 N613… untrt untrt SRR1…
#>  8 ENSG000… SRR103…    394         558.  GSM1275862 N613… untrt untrt SRR1…
#>  9 ENSG000… SRR103…    172         243.  GSM1275862 N613… untrt untrt SRR1…
#> 10 ENSG000… SRR103…   2112        2989.  GSM1275862 N613… untrt untrt SRR1…
#> # … with 40 more rows, and 34 more variables: avgLength <int>,
#> #   Experiment <fct>, Sample <fct>, BioSample <fct>, sample_name <chr>,
#> #   TMM <dbl>, multiplier <dbl>, PC1 <dbl>, PC2 <dbl>, symbol <chr>,
#> #   .abundant <lgl>, edgerQLT_logFC <dbl>, edgerQLT_logCPM <dbl>,
#> #   edgerQLT_F <dbl>, edgerQLT_PValue <dbl>, edgerQLT_FDR <dbl>,
#> #   edgerLR_logFC <dbl>, edgerLR_logCPM <dbl>, edgerLR_LR <dbl>,
#> #   edgerLR_PValue <dbl>, edgerLR_FDR <dbl>, voom_logFC <dbl>,
#> #   voom_AveExpr <dbl>, voom_t <dbl>, voom_P.Value <dbl>, voom_adj.P.Val <dbl>,
#> #   voom_B <dbl>, deseq2_baseMean <dbl>, deseq2_log2FoldChange <dbl>,
#> #   deseq2_lfcSE <dbl>, deseq2_stat <dbl>, deseq2_pvalue <dbl>,
#> #   deseq2_padj <dbl>, GRangesList <list>

This outputs the columns from each method such as log-fold change (logFC), false-discovery rate (FDR) and probability value (p-value). logFC is log2(treated/untreated).

Comparison of methods

We can visually compare the significance for all methods. We will notice that there is some difference between the methods.

de_all %>%
  pivot_transcript() %>%
  select(edgerQLT_PValue, edgerLR_PValue, voom_P.Value, deseq2_pvalue, feature) %>%
  ggpairs(1:4)

Challenge: Which method detects the largest no. of differentially abundant transcripts, p value adjusted for multiple testing <  0.05 (FDR, adj.P.Val, padj)?
Select one of the multiple choice options in www.menti.com (code in Google doc).

Single method

If we just wanted to run one differential testing method we could do that. The default method is edgeR quasi-likelihood.

counts_de <- counts_scal_PCA %>%
  test_differential_abundance(~ dex + cell)
#> tidybulk says: The design column names are "(Intercept), dexuntrt, cellN061011, cellN080611, cellN61311"
#> tidybulk says: to access the raw results (fitted GLM) do `attr(..., "internals")$edgeR_quasi_likelihood`

Tidybulk enables a simplified way of performing an RNA sequencing differential expression analysis (with the benefit of smoothly integrating with ggplot2 and other tidyverse packages). Compare the code for a tidybulk edgeR analysis versus standard edgeR below.

standard edgeR

# Example code, no need to run

library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
dgList <- calcNormFactors(dgList)
cell <- factor(dgList$samples$cell)
design <- model.matrix(~ group + cell)
dgList <- estimateDisp(dgList, design)
fit <- glmQLFit(dgList, design)
qlf <- glmQLFTest(fit, coef=2)

Volcano plots

Volcano plots are a useful genome-wide tool for checking that the analysis looks good. Volcano plots enable us to visualise the significance of change (p-value) versus the fold change (logFC). Highly significant genes are towards the top of the plot. We can also colour significant genes, e.g. genes with false-discovery rate < 0.05. To decide which genes are differentially expressed, we usually take a cut-off of 0.05 on the FDR (or adjusted P-value), NOT the raw p-value. This is because we are testing many genes (multiple testing), and the chances of finding differentially expressed genes are very high when you do that many tests. Hence we need to control the false discovery rate, the adjusted p-value column in the results table. That is, if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives.

# volcano plot, minimal
counts_de %>%
  ggplot(aes(x = logFC, y = PValue, colour = FDR < 0.05)) +
  geom_point() +
  scale_y_continuous(trans = "log10_reverse") +
  custom_theme

We’ll extract the symbols for a few top genes (by P value) to use in a more informative volcano plot, integrating some of the packages in tidyverse.

topgenes_symbols <-
  counts_de %>%
  pivot_transcript() %>%
  slice_min(PValue, n = 6) %>%
  pull(symbol)

topgenes_symbols
#> [1] "CACNB2"  "ZBTB16"  "PRSS35"  "PDPN"    "SPARCL1" "DUSP1"

counts_de %>%
  pivot_transcript() %>%

  # Subset data
  mutate(significant = FDR < 0.05 & abs(logFC) >= 2) %>%
  mutate(symbol = ifelse(symbol %in% topgenes_symbols, as.character(symbol), "")) %>%

  # Plot
  ggplot(aes(x = logFC, y = PValue, label = symbol)) +
  geom_point(aes(color = significant, size = significant, alpha = significant)) +
  geom_text_repel() +

  # Custom scales
  custom_theme +
  scale_y_continuous(trans = "log10_reverse") +
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2))
#> Scale for 'colour' is already present. Adding another scale for 'colour',
#> which will replace the existing scale.

Automatic bibliography

Tidybulk provides a handy function called get_bibliography that keeps track of the references for the methods used in your tidybulk workflow. The references are in BibTeX format and can be imported into your reference manager.

get_bibliography(counts_de)
#>  @Article{tidybulk,
#>   title = {tidybulk: an R tidy framework for modular transcriptomic data analysis},
#>   author = {Stefano Mangiola and Ramyar Molania and Ruining Dong and Maria A. Doyle & Anthony T. Papenfuss},
#>   journal = {Genome Biology},
#>   year = {2021},
#>   volume = {22},
#>   number = {42},
#>   url = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7},
#>   }
#> @article{wickham2019welcome,
#>   title={Welcome to the Tidyverse},
#>   author={Wickham, Hadley and Averick, Mara and Bryan, Jennifer and Chang, Winston and McGowan, Lucy D'Agostino and Francois, Romain and Grolemund, Garrett and Hayes, Alex and Henry, Lionel and Hester, Jim and others},
#>   journal={Journal of Open Source Software},
#>   volume={4},
#>   number={43},
#>   pages={1686},
#>   year={2019}
#>  }
#> @article{robinson2010edger,
#>   title={edgeR: a Bioconductor package for differential expression analysis of digital gene expression data},
#>   author={Robinson, Mark D and McCarthy, Davis J and Smyth, Gordon K},
#>   journal={Bioinformatics},
#>   volume={26},
#>   number={1},
#>   pages={139--140},
#>   year={2010},
#>   publisher={Oxford University Press}
#>  }
#> @article{robinson2010scaling,
#>   title={A scaling normalization method for differential expression analysis of RNA-seq data},
#>   author={Robinson, Mark D and Oshlack, Alicia},
#>   journal={Genome biology},
#>   volume={11},
#>   number={3},
#>   pages={1--9},
#>   year={2010},
#>   publisher={BioMed Central}
#>  }
#> @Manual{,
#>     title = {R: A Language and Environment for Statistical Computing},
#>     author = {{R Core Team}},
#>     organization = {R Foundation for Statistical Computing},
#>     address = {Vienna, Austria},
#>     year = {2020},
#>     url = {https://www.R-project.org/},
#>   }
#> @article{lund2012detecting,
#>   title={Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates},
#>   author={Lund, Steven P and Nettleton, Dan and McCarthy, Davis J and Smyth, Gordon K},
#>   journal={Statistical applications in genetics and molecular biology},
#>   volume={11},
#>   number={5},
#>   year={2012},
#>   publisher={De Gruyter}
#>     }

Cell type composition analysis

If we are sequencing tissue samples, we may want to know what cell types are present and if there are differences in expression between them. tidybulk has a deconvolve_cellularity function that can help us do this.

For this example, we will use a subset of the breast cancer dataset from The Cancer Genome Atlas (TCGA).

BRCA <- rpharma2021tidytranscriptomics::BRCA 
BRCA
#> # A SummarizedExperiment-tibble abstraction: 198,374 × 5
#> [90m# Transcripts=9017 | Samples=22 | Assays=count[39m
#>    .feature .sample      count  time event_occurred
#>    <chr>    <chr>        <int> <dbl>          <int>
#>  1 A1BG     TCGA-A2-A04P    13   102              1
#>  2 A1BG-AS1 TCGA-A2-A04P    61   102              1
#>  3 A2M      TCGA-A2-A04P 19277   102              1
#>  4 A2MP1    TCGA-A2-A04P     9   102              1
#>  5 AAAS     TCGA-A2-A04P  1917   102              1
#>  6 AACS     TCGA-A2-A04P  1998   102              1
#>  7 AADAT    TCGA-A2-A04P   212   102              1
#>  8 AAGAB    TCGA-A2-A04P  1194   102              1
#>  9 AAK1     TCGA-A2-A04P  1468   102              1
#> 10 AAMP     TCGA-A2-A04P  5798   102              1
#> # … with 40 more rows

With tidybulk, we can easily infer the proportions of cell types within a tissue using one of several published methods (Cibersort (Newman et al. 2015), EPIC (Racle et al. 2017) and llsr (Abbas et al. 2009)). Here we will use Cibersort which provides a default signature called LM22 to define the cell types. LM22 contains 547 genes that identify 22 human immune cell types.

BRCA_cell_type <-
  BRCA %>%
  deconvolve_cellularity(prefix="cibersort: ", cores = 1)

BRCA_cell_type
#> # A SummarizedExperiment-tibble abstraction: 198,374 × 27
#> [90m# Transcripts=9017 | Samples=22 | Assays=count[39m
#>    .feature .sample count  time event_occurred `cibersort: B c… `cibersort: B c…
#>    <chr>    <chr>   <int> <dbl>          <int>            <dbl>            <dbl>
#>  1 A1BG     TCGA-A…    13   102              1            0.201                0
#>  2 A1BG-AS1 TCGA-A…    61   102              1            0.201                0
#>  3 A2M      TCGA-A… 19277   102              1            0.201                0
#>  4 A2MP1    TCGA-A…     9   102              1            0.201                0
#>  5 AAAS     TCGA-A…  1917   102              1            0.201                0
#>  6 AACS     TCGA-A…  1998   102              1            0.201                0
#>  7 AADAT    TCGA-A…   212   102              1            0.201                0
#>  8 AAGAB    TCGA-A…  1194   102              1            0.201                0
#>  9 AAK1     TCGA-A…  1468   102              1            0.201                0
#> 10 AAMP     TCGA-A…  5798   102              1            0.201                0
#> # … with 40 more rows, and 20 more variables: `cibersort: Plasma cells` <dbl>,
#> #   `cibersort: T cells CD8` <dbl>, `cibersort: T cells CD4 naive` <dbl>,
#> #   `cibersort: T cells CD4 memory resting` <dbl>, `cibersort: T cells CD4
#> #   memory activated` <dbl>, `cibersort: T cells follicular helper` <dbl>,
#> #   `cibersort: T cells regulatory (Tregs)` <dbl>, `cibersort: T cells gamma
#> #   delta` <dbl>, `cibersort: NK cells resting` <dbl>, `cibersort: NK cells
#> #   activated` <dbl>, `cibersort: Monocytes` <dbl>, `cibersort: Macrophages
#> #   M0` <dbl>, `cibersort: Macrophages M1` <dbl>, `cibersort: Macrophages
#> #   M2` <dbl>, `cibersort: Dendritic cells resting` <dbl>, `cibersort:
#> #   Dendritic cells activated` <dbl>, `cibersort: Mast cells resting` <dbl>,
#> #   `cibersort: Mast cells activated` <dbl>, `cibersort: Eosinophils` <dbl>,
#> #   `cibersort: Neutrophils` <dbl>

Cell type proportions are added to the tibble as new columns. The prefix makes it easy to reshape the data frame if needed, for visualisation or further analyses.

BRCA_cell_type_long <-
  BRCA_cell_type %>%
  pivot_sample() %>%
  
  # Reshape
  pivot_longer(
    contains("cibersort"),
    names_prefix = "cibersort: ",
    names_to = "cell_type",
    values_to = "proportion"
  )

BRCA_cell_type_long
#> # A tibble: 484 × 5
#>    sample        time event_occurred cell_type                        proportion
#>    <chr>        <dbl>          <int> <chr>                                 <dbl>
#>  1 TCGA-A2-A04P   102              1 cibersort..B.cells.naive             0.201 
#>  2 TCGA-A2-A04P   102              1 cibersort..B.cells.memory            0     
#>  3 TCGA-A2-A04P   102              1 cibersort..Plasma.cells              0     
#>  4 TCGA-A2-A04P   102              1 cibersort..T.cells.CD8               0.0297
#>  5 TCGA-A2-A04P   102              1 cibersort..T.cells.CD4.naive         0     
#>  6 TCGA-A2-A04P   102              1 cibersort..T.cells.CD4.memory.r…     0.0841
#>  7 TCGA-A2-A04P   102              1 cibersort..T.cells.CD4.memory.a…     0     
#>  8 TCGA-A2-A04P   102              1 cibersort..T.cells.follicular.h…     0.0556
#>  9 TCGA-A2-A04P   102              1 cibersort..T.cells.regulatory..…     0.0484
#> 10 TCGA-A2-A04P   102              1 cibersort..T.cells.gamma.delta       0     
#> # … with 474 more rows

We can plot the proportions of immune cell types for each patient.

BRCA_cell_type_long %>%

  # Plot proportions
  ggplot(aes(x = sample, y = proportion, fill = cell_type)) +
  geom_bar(stat = "identity") +
  custom_theme

Challenge: What is the most abundant cell type overall in BRCA samples?
Select one of the multiple choice options in www.menti.com (code in Google doc).

Key Points

Bulk RNA sequencing data can be represented and analysed in a ‘tidy’ way using tidySummarizedExperiment, tidybulk and the tidyverse.
tidySummarizedExperiment enables us to visualise and manipulate a Bioconductor SummarizedExperiment object as if it were in tidy data format.
Some of the key steps in an RNA sequencing analysis are filtering lowly abundant transcripts, adjusting for differences in sequencing depth and composition, testing for differential expression, and visualising the data, which can all be performed in a tidy way with tidybulk.
tidybulk allows streamlined multi-method analyses
tidybulk allow easy analyses of cell-type composition

Part 2 Single-cell RNA sequencing with tidySingleCellExperiment

A typical single-cell RNA sequencing workflow is shown in the Workshop Introduction section. We don’t have time in this workshop to go into depth on each step but you can read more about single-cell RNA sequencing workflows in the online book Orchestrating Single-Cell Analysis with Bioconductor.

In Part 1, we showed how we can study the cell-type composition of a biological sample using bulk RNA sequencing. Single-cell sequencing enables a more direct estimation of cell-type composition and gives a greater resolution. For bulk RNA sequencing, we need to infer the cell types using the abundance of transcripts in the whole sample. With single-cell RNA sequencing, we can directly measure the transcripts in each cell and then classify the cells into cell types.

# load packages
library(tibble)
library(ggplot2)
library(purrr)
library(scater)
library(scran)
library(igraph)
library(batchelor)
library(SingleR)
library(scuttle)
library(EnsDb.Hsapiens.v86)
library(celldex)
library(ggbeeswarm)
friendly_cols <- dittoSeq::dittoColors()

Introduction to tidySingleCellExperiment

The single-cell RNA sequencing data used here is 3000 cells in total, subsetted from 20 samples from 10 peripheral blood mononuclear cell (PBMC) datasets. The datasets are from GSE115189/SRR7244582 (Freytag et al. 2018), SRR11038995 [Cai et al. (2020), SCP345 (singlecell.broadinstitute.org), SCP424 (Ding et al. 2020), SCP591 (Karagiannis et al. 2020) and 10x-derived 6K and 8K datasets (support.10xgenomics.com/). The data is in SingleCellExperiment format. SingleCellExperiment is a very popular container of single-cell RNA sequencing data.

Similar to what we saw with tidySummarizedExperiment, tidySingleCellExperiment package enables any SingleCellExperiment object to be displayed and manipulated according to tidy data principles without affecting any SingleCellExperiment behaviour.

# load pbmc single cell RNA sequencing data
pbmc <- rpharma2021tidytranscriptomics::pbmc

# take a look
pbmc
#> class: SingleCellExperiment 
#> dim: 51958 3000 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(51958): DDX11L1 WASH7P ... RP11-141O19.1 RP11-157J13.1
#> rowData names(0):
#> colnames(3000): CCAGTCACACTGGT-1 ATGAGCACATCTTC-1 ... CTTTGCGAGTACGCCC
#>   GTATTCTAGGCTACGA
#> colData names(7): file orig.ident ... G2M.Score ident
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

If we load the tidySingleCellExperiment package and then view the PBMC data, it displays as a tibble.

library(tidySingleCellExperiment)

pbmc
#> # A SingleCellExperiment-tibble abstraction: 3,000 × 8
#> [90m# Features=51958 | Assays=counts, logcounts[39m
#>    cell    file      orig.ident nCount_RNA nFeature_RNA  S.Score G2M.Score ident
#>    <chr>   <chr>     <chr>           <dbl>        <int>    <dbl>     <dbl> <fct>
#>  1 CCAGTC… ../data/… SeuratPro…       3421          979 -5.42e-2  -0.0107  G1   
#>  2 ATGAGC… ../data/… SeuratPro…       2752          898 -5.01e-2  -0.00416 G1   
#>  3 TATGAA… ../data/… SeuratPro…       2114          937 -2.95e-5  -0.0229  G1   
#>  4 CATATA… ../data/… SeuratPro…       3122         1086 -6.65e-2  -0.0488  G1   
#>  5 GAGGCA… ../data/… SeuratPro…       2341          957 -3.74e-3   0.0241  G2M  
#>  6 AGCTGC… ../data/… SeuratPro…       5472         1758 -5.88e-2   0.00241 G2M  
#>  7 TGATTA… ../data/… SeuratPro…       1258          542 -2.51e-2  -0.0269  G1   
#>  8 ACGAAG… ../data/… SeuratPro…       7683         1926 -1.33e-1  -0.116   G1   
#>  9 CGGCAT… ../data/… SeuratPro…       3500         1092 -6.87e-2  -0.0622  G1   
#> 10 ATAGCG… ../data/… SeuratPro…       3092          974 -1.24e-2  -0.0271  G1   
#> # … with 2,990 more rows

It can be interacted with using SingleCellExperiment commands such as assayNames.

assayNames(pbmc)
#> [1] "counts"    "logcounts"

We can also interact with our object as we do with any tidyverse tibble.

Tidyverse commands

And now we can also use tidyverse commands, such as filter, select and mutate to explore the tidySingleCellExperiment object. Some examples are shown below and more can be seen at the tidySingleCellExperiment website here. We can also use the tidyverse pipe %>%. This ‘pipes’ the output from the command on the left into the command on the right/below. Using the pipe is not essential but it reduces the amount of code we need to write when we have multiple steps. It also can make the steps clearer and easier to see. For more details on the pipe see here.

We can use filter to choose rows, for example, to see just the rows for the treated samples.

pbmc %>% filter(ident == "G1")
#> # A SingleCellExperiment-tibble abstraction: 1,728 × 8
#> [90m# Features=51958 | Assays=counts, logcounts[39m
#>    cell    file      orig.ident nCount_RNA nFeature_RNA  S.Score G2M.Score ident
#>    <chr>   <chr>     <chr>           <dbl>        <int>    <dbl>     <dbl> <fct>
#>  1 CCAGTC… ../data/… SeuratPro…       3421          979 -5.42e-2 -0.0107   G1   
#>  2 ATGAGC… ../data/… SeuratPro…       2752          898 -5.01e-2 -0.00416  G1   
#>  3 TATGAA… ../data/… SeuratPro…       2114          937 -2.95e-5 -0.0229   G1   
#>  4 CATATA… ../data/… SeuratPro…       3122         1086 -6.65e-2 -0.0488   G1   
#>  5 TGATTA… ../data/… SeuratPro…       1258          542 -2.51e-2 -0.0269   G1   
#>  6 ACGAAG… ../data/… SeuratPro…       7683         1926 -1.33e-1 -0.116    G1   
#>  7 CGGCAT… ../data/… SeuratPro…       3500         1092 -6.87e-2 -0.0622   G1   
#>  8 ATAGCG… ../data/… SeuratPro…       3092          974 -1.24e-2 -0.0271   G1   
#>  9 CTATAG… ../data/… SeuratPro…       2637          949 -3.10e-2 -0.0473   G1   
#> 10 ACGTTG… ../data/… SeuratPro…       2379          888 -5.82e-2 -0.000362 G1   
#> # … with 1,718 more rows

We can use select to choose columns, for example, to see the sample, cell, total cellular RNA

pbmc %>% select(cell, nCount_RNA , ident)
#> # A SingleCellExperiment-tibble abstraction: 3,000 × 3
#> [90m# Features=51958 | Assays=counts, logcounts[39m
#>    cell             nCount_RNA ident
#>    <chr>                 <dbl> <fct>
#>  1 CCAGTCACACTGGT-1       3421 G1   
#>  2 ATGAGCACATCTTC-1       2752 G1   
#>  3 TATGAATGGAGGAC-1       2114 G1   
#>  4 CATATAGACTAAGC-1       3122 G1   
#>  5 GAGGCAGACTTGCC-1       2341 G2M  
#>  6 AGCTGCCTTTCATC-1       5472 G2M  
#>  7 TGATTAGATGACTG-1       1258 G1   
#>  8 ACGAAGCTCTGAGT-1       7683 G1   
#>  9 CGGCATCTTCGTAG-1       3500 G1   
#> 10 ATAGCGTGCCCTTG-1       3092 G1   
#> # … with 2,990 more rows

pbmc %>%
  mutate(ident_l=tolower(ident)) %>%
  # select columns to view
  select(ident, ident_l)
#> tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A tibble: 3,000 × 2
#>    ident ident_l
#>    <fct> <chr>  
#>  1 G1    g1     
#>  2 G1    g1     
#>  3 G1    g1     
#>  4 G1    g1     
#>  5 G2M   g2m    
#>  6 G2M   g2m    
#>  7 G1    g1     
#>  8 G1    g1     
#>  9 G1    g1     
#> 10 G1    g1     
#> # … with 2,990 more rows

Join datasets

We can join datasets as if they were tibbles

pbmc %>% bind_rows(pbmc)
#> Warning in bind_rows.SingleCellExperiment(., pbmc): tidySingleCellExperiment
#> says: you have duplicated cell names, they will be made unique.
#> # A SingleCellExperiment-tibble abstraction: 6,000 × 8
#> [90m# Features=51958 | Assays=counts, logcounts[39m
#>    cell    file      orig.ident nCount_RNA nFeature_RNA  S.Score G2M.Score ident
#>    <chr>   <chr>     <chr>           <dbl>        <int>    <dbl>     <dbl> <fct>
#>  1 CCAGTC… ../data/… SeuratPro…       3421          979 -5.42e-2  -0.0107  G1   
#>  2 ATGAGC… ../data/… SeuratPro…       2752          898 -5.01e-2  -0.00416 G1   
#>  3 TATGAA… ../data/… SeuratPro…       2114          937 -2.95e-5  -0.0229  G1   
#>  4 CATATA… ../data/… SeuratPro…       3122         1086 -6.65e-2  -0.0488  G1   
#>  5 GAGGCA… ../data/… SeuratPro…       2341          957 -3.74e-3   0.0241  G2M  
#>  6 AGCTGC… ../data/… SeuratPro…       5472         1758 -5.88e-2   0.00241 G2M  
#>  7 TGATTA… ../data/… SeuratPro…       1258          542 -2.51e-2  -0.0269  G1   
#>  8 ACGAAG… ../data/… SeuratPro…       7683         1926 -1.33e-1  -0.116   G1   
#>  9 CGGCAT… ../data/… SeuratPro…       3500         1092 -6.87e-2  -0.0622  G1   
#> 10 ATAGCG… ../data/… SeuratPro…       3092          974 -1.24e-2  -0.0271  G1   
#> # … with 5,990 more rows

Setting up the data

In this case, we want to polish an annotation column. We will extract the sample, dataset and group information from the file name column into separate columns.

# First take a look at the file column
pbmc %>% select(file)
#> tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A tibble: 3,000 × 1
#>    file                                                
#>    <chr>                                               
#>  1 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#>  2 ../data/GSE115189_2/outs/filtered_feature_bc_matrix/
#>  3 ../data/GSE115189_2/outs/filtered_feature_bc_matrix/
#>  4 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#>  5 ../data/GSE115189_2/outs/filtered_feature_bc_matrix/
#>  6 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#>  7 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#>  8 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#>  9 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#> 10 ../data/GSE115189_1/outs/filtered_feature_bc_matrix/
#> # … with 2,990 more rows

# Create columns for sample, dataset and groups
pbmc <-
  pbmc %>%

  # Extract sample and group
  extract(file, "sample", "../data/([a-zA-Z0-9_]+)/outs.+", remove = FALSE) %>%

  # Extract data source
  extract(file, c("dataset", "groups"), "../data/([a-zA-Z0-9_]+)_([0-9])/outs.+")

# Take a look
pbmc %>% select(sample, dataset, groups)
#> tidySingleCellExperiment says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A tibble: 3,000 × 3
#>    sample      dataset   groups
#>    <chr>       <chr>     <chr> 
#>  1 GSE115189_1 GSE115189 1     
#>  2 GSE115189_2 GSE115189 2     
#>  3 GSE115189_2 GSE115189 2     
#>  4 GSE115189_1 GSE115189 1     
#>  5 GSE115189_2 GSE115189 2     
#>  6 GSE115189_1 GSE115189 1     
#>  7 GSE115189_1 GSE115189 1     
#>  8 GSE115189_1 GSE115189 1     
#>  9 GSE115189_1 GSE115189 1     
#> 10 GSE115189_1 GSE115189 1     
#> # … with 2,990 more rows

Quality control

A key quality control step performed in single-cell analyses is the assessment of the proportion of mitochondrial transcripts. A high mitochondrial count indicates cell death, and it is useful for filtering cells in a dying state.

We get the chromosomal location for each gene in the dataset so we can identify the mitochondrial genes. We’ll get a warning that some of the ids don’t find a match, but it should be just a small proportion.

location <- mapIds(
  EnsDb.Hsapiens.v86,
  keys = rownames(pbmc),
  column = "SEQNAME",
  keytype = "SYMBOL"
)
#> Warning: Unable to map 5313 of 51958 requested IDs.

We’ll first show the mitochondrial analysis for one of the 10 datasets.

one_dataset <- pbmc %>% filter(dataset =="GSE115189")

Next we calculate the mitchondrial content for each cell in the dataset.

mito_info_one_dataset <- perCellQCMetrics(one_dataset, subsets = list(Mito = which(location == "MT")))
mito_info_one_dataset
#> DataFrame with 300 rows and 6 columns
#>                        sum  detected subsets_Mito_sum subsets_Mito_detected
#>                  <numeric> <integer>        <numeric>             <integer>
#> CCAGTCACACTGGT-1      3421       979              132                    11
#> ATGAGCACATCTTC-1      2752       898               33                     9
#> TATGAATGGAGGAC-1      2114       937               21                     7
#> CATATAGACTAAGC-1      3122      1086               54                    10
#> GAGGCAGACTTGCC-1      2341       957               30                    10
#> ...                    ...       ...              ...                   ...
#> GATATCCTAGAAGT-1      2432       925               34                    10
#> GAACCAACTTCCGC-1      1682       819               34                     9
#> CGACGTCTGAGGCA-1      1136       448               16                     8
#> TCAGCAGACTCCAC-1      2807       931               30                     9
#> AACACGTGTACGAC-1      3743      1250              105                    12
#>                  subsets_Mito_percent     total
#>                             <numeric> <numeric>
#> CCAGTCACACTGGT-1             3.858521      3421
#> ATGAGCACATCTTC-1             1.199128      2752
#> TATGAATGGAGGAC-1             0.993377      2114
#> CATATAGACTAAGC-1             1.729660      3122
#> GAGGCAGACTTGCC-1             1.281504      2341
#> ...                               ...       ...
#> GATATCCTAGAAGT-1              1.39803      2432
#> GAACCAACTTCCGC-1              2.02140      1682
#> CGACGTCTGAGGCA-1              1.40845      1136
#> TCAGCAGACTCCAC-1              1.06876      2807
#> AACACGTGTACGAC-1              2.80524      3743

We then label the cells with high mitochondrial content as outliers.

mito_info_one_dataset <- mito_info_one_dataset %>%
      # Converting to tibble
      as_tibble(rownames = "cell") %>%
      # Label cells with high mitochondrial content
      mutate(high_mitochondrion = isOutlier(subsets_Mito_percent, type = "higher"))
mito_info_one_dataset
#> # A tibble: 300 × 8
#>    cell    sum detected subsets_Mito_sum subsets_Mito_de… subsets_Mito_pe… total
#>    <chr> <dbl>    <int>            <dbl>            <int>            <dbl> <dbl>
#>  1 CCAG…  3421      979              132               11            3.86   3421
#>  2 ATGA…  2752      898               33                9            1.20   2752
#>  3 TATG…  2114      937               21                7            0.993  2114
#>  4 CATA…  3122     1086               54               10            1.73   3122
#>  5 GAGG…  2341      957               30               10            1.28   2341
#>  6 AGCT…  5472     1758              169               13            3.09   5472
#>  7 TGAT…  1258      542               41                9            3.26   1258
#>  8 ACGA…  7683     1926              276               14            3.59   7683
#>  9 CGGC…  3500     1092               75               10            2.14   3500
#> 10 ATAG…  3092      974               37               10            1.20   3092
#> # … with 290 more rows, and 1 more variable: high_mitochondrion <otlr.flt>

Finally, we join the mitochondrial information back to the original data so we will be able to filter out the cells with high mitochondrial content.

mito_info_one_dataset <- one_dataset %>%  left_join(mito_info_one_dataset, by = "cell")
mito_info_one_dataset
#> # A SingleCellExperiment-tibble abstraction: 300 × 17
#> [90m# Features=51958 | Assays=counts, logcounts[39m
#>    cell  dataset groups sample orig.ident nCount_RNA nFeature_RNA  S.Score
#>    <chr> <chr>   <chr>  <chr>  <chr>           <dbl>        <int>    <dbl>
#>  1 CCAG… GSE115… 1      GSE11… SeuratPro…       3421          979 -5.42e-2
#>  2 ATGA… GSE115… 2      GSE11… SeuratPro…       2752          898 -5.01e-2
#>  3 TATG… GSE115… 2      GSE11… SeuratPro…       2114          937 -2.95e-5
#>  4 CATA… GSE115… 1      GSE11… SeuratPro…       3122         1086 -6.65e-2
#>  5 GAGG… GSE115… 2      GSE11… SeuratPro…       2341          957 -3.74e-3
#>  6 AGCT… GSE115… 1      GSE11… SeuratPro…       5472         1758 -5.88e-2
#>  7 TGAT… GSE115… 1      GSE11… SeuratPro…       1258          542 -2.51e-2
#>  8 ACGA… GSE115… 1      GSE11… SeuratPro…       7683         1926 -1.33e-1
#>  9 CGGC… GSE115… 1      GSE11… SeuratPro…       3500         1092 -6.87e-2
#> 10 ATAG… GSE115… 1      GSE11… SeuratPro…       3092          974 -1.24e-2
#> # … with 290 more rows, and 9 more variables: G2M.Score <dbl>, ident <fct>,
#> #   sum <dbl>, detected <int>, subsets_Mito_sum <dbl>,
#> #   subsets_Mito_detected <int>, subsets_Mito_percent <dbl>, total <dbl>,
#> #   high_mitochondrion <otlr.flt>

In the interest of time, we load the iteration for all datasets.

pbmc <- rpharma2021tidytranscriptomics::pbmc_mito_info_all_datasets

We can use tidyverse to reshape the data and create beeswarm plots to visualise the mitochondrial content.

pbmc %>%

  # Reshaping
  pivot_longer(c(detected, sum, subsets_Mito_percent)) %>%
  ggplot(aes(
    x = sample, y = value,
    color = high_mitochondrion,
    alpha = high_mitochondrion,
    size = high_mitochondrion
  )) +

  # Plotting
  geom_quasirandom() +
  facet_wrap(~name, scale = "free_y") +

  # Customisation
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1))
#> tidySingleCellExperiment says: A data frame is returned for independent data analysis.

In the faceted plot, “detected” is the number of genes in each of the 10 datasets, “sum” is the total counts.

We then proceed to filter out cells with high mitochondrial content.

pbmc <- pbmc %>% filter(!high_mitochondrion)

Scaling and Integrating

As we are working with multiple datasets, we need to integrate them and adjust for technical variability between them. Here we’ll nest by dataset (batch), normalise within each batch with multiBatchNorm and correct for batch effects with fastMNN.

# Scaling within each dataset
pbmc <-
  pbmc %>%

  # Define batch
  nest(data = -dataset) %>%

  # Normalisation
  mutate(data = multiBatchNorm(data)) %>%

  # Integration
  pull(data) %>%
  fastMNN() %>%

  # Join old information
  left_join(pbmc %>% as_tibble())

Identify clusters

We proceed with identifying cell clusters.

# Assign clusters to the 'colLabels'
# of the SingleCellExperiment object
colLabels(pbmc) <- # from SingleCellExperiment
  pbmc %>%
  buildSNNGraph(use.dimred="corrected") %>% # from scran - shared nearest neighbour
  cluster_walktrap() %$% # from igraph
  membership %>%
  as.factor()

# Reorder columns
pbmc %>% select(label, everything())
#> # A SingleCellExperiment-tibble abstraction: 2,899 × 24
#> [90m# Features=51958 | Assays=reconstructed[39m
#>    cell  label batch groups sample orig.ident nCount_RNA nFeature_RNA  S.Score
#>    <chr> <fct> <int> <chr>  <chr>  <chr>           <dbl>        <int>    <dbl>
#>  1 CCAG… 5         1 1      GSE11… SeuratPro…       3421          979 -5.42e-2
#>  2 ATGA… 3         1 2      GSE11… SeuratPro…       2752          898 -5.01e-2
#>  3 TATG… 7         1 2      GSE11… SeuratPro…       2114          937 -2.95e-5
#>  4 CATA… 11        1 1      GSE11… SeuratPro…       3122         1086 -6.65e-2
#>  5 GAGG… 2         1 2      GSE11… SeuratPro…       2341          957 -3.74e-3
#>  6 AGCT… 4         1 1      GSE11… SeuratPro…       5472         1758 -5.88e-2
#>  7 TGAT… 1         1 1      GSE11… SeuratPro…       1258          542 -2.51e-2
#>  8 ACGA… 1         1 1      GSE11… SeuratPro…       7683         1926 -1.33e-1
#>  9 CGGC… 5         1 1      GSE11… SeuratPro…       3500         1092 -6.87e-2
#> 10 ATAG… 5         1 1      GSE11… SeuratPro…       3092          974 -1.24e-2
#> # … with 2,889 more rows, and 15 more variables: G2M.Score <dbl>, ident <fct>,
#> #   sum <dbl>, detected <int>, subsets_Mito_sum <dbl>,
#> #   subsets_Mito_detected <int>, subsets_Mito_percent <dbl>, total <dbl>,
#> #   high_mitochondrion <lgl>, dataset <chr>, corrected1 <dbl>,
#> #   corrected2 <dbl>, corrected3 <dbl>, corrected4 <dbl>, corrected5 <dbl>

Thanks to tidySingleCellExperiment we can interrogate the object with tidyverse commands and use count to count the number of cells in each cluster.

pbmc %>% count(label)
#> tidySingleCellExperiment says: A data frame is returned for independent data analysis.
#> # A tibble: 11 × 2
#>    label     n
#>    <fct> <int>
#>  1 1       367
#>  2 2       482
#>  3 3       732
#>  4 4       547
#>  5 5       339
#>  6 6        21
#>  7 7       256
#>  8 8        74
#>  9 9        41
#> 10 10       14
#> 11 11       26

Reduce dimensions

Besides PCA which is a linear dimensionality reduction, we can apply neighbour aware methods such as UMAP, to better define locally similar cells.

# Calculate UMAP with scater
pbmc %>%
runUMAP(ncomponents = 2, dimred="corrected") # from scater
#> # A SingleCellExperiment-tibble abstraction: 2,899 × 26
#> [90m# Features=51958 | Assays=reconstructed[39m
#>    cell  batch groups sample orig.ident nCount_RNA nFeature_RNA  S.Score
#>    <chr> <int> <chr>  <chr>  <chr>           <dbl>        <int>    <dbl>
#>  1 CCAG…     1 1      GSE11… SeuratPro…       3421          979 -5.42e-2
#>  2 ATGA…     1 2      GSE11… SeuratPro…       2752          898 -5.01e-2
#>  3 TATG…     1 2      GSE11… SeuratPro…       2114          937 -2.95e-5
#>  4 CATA…     1 1      GSE11… SeuratPro…       3122         1086 -6.65e-2
#>  5 GAGG…     1 2      GSE11… SeuratPro…       2341          957 -3.74e-3
#>  6 AGCT…     1 1      GSE11… SeuratPro…       5472         1758 -5.88e-2
#>  7 TGAT…     1 1      GSE11… SeuratPro…       1258          542 -2.51e-2
#>  8 ACGA…     1 1      GSE11… SeuratPro…       7683         1926 -1.33e-1
#>  9 CGGC…     1 1      GSE11… SeuratPro…       3500         1092 -6.87e-2
#> 10 ATAG…     1 1      GSE11… SeuratPro…       3092          974 -1.24e-2
#> # … with 2,889 more rows, and 18 more variables: G2M.Score <dbl>, ident <fct>,
#> #   sum <dbl>, detected <int>, subsets_Mito_sum <dbl>,
#> #   subsets_Mito_detected <int>, subsets_Mito_percent <dbl>, total <dbl>,
#> #   high_mitochondrion <lgl>, dataset <chr>, label <fct>, corrected1 <dbl>,
#> #   corrected2 <dbl>, corrected3 <dbl>, corrected4 <dbl>, corrected5 <dbl>,
#> #   UMAP1 <dbl>, UMAP2 <dbl>

Is the variability of the 1st UMAP dimension when calculating 2 components (ncomponents = 2) equal/more/less than when calculating 3 components?
Select one of the multiple choice options in www.menti.com (code in Google doc).

Tip: your can use as_tibble() to convert the tibble abstraction to a simple (and light) cell-wise tibble

We can calculate the first 3 UMAP dimensions using the scater framework.

# Calculate UMAP with scater
pbmc <-
  pbmc %>%
  runUMAP(ncomponents = 3, dimred="corrected") # from scater

And we can plot the clusters as a 3D plot using plotly. This time we are colouring by estimated cluster labels to visually check the cluster labels.

pbmc %>%
  plot_ly(
    x = ~`UMAP1`,
    y = ~`UMAP2`,
    z = ~`UMAP3`,
    colors = friendly_cols[1:10],
    color = ~label,
    size=0.5
  )

Cell type classification

Manual cell type classification

We can identify cluster markers (genes). As example, we are selecting the top 10 for each cluster. We can then plot a heatmap of those gene markers across cells.

# Identify top 10 markers per cluster
marker_genes <-
    pbmc %>%
    findMarkers(groups=pbmc$label, assay.type = "reconstructed") %>%  # scran
    as.list() %>%
    map(~ head(.x, 10) %>%  rownames()) %>%
    unlist()

# Plot heatmap
pbmc %>%
  plotHeatmap(                                  # from scater
    features=marker_genes,
    columns=order(pbmc$label),
    colour_columns_by=c("label"),
    exprs_values  = "reconstructed"
  )

Automatic cell type classification

We can infer cell type identities (e.g. T cell) using SingleR (Aran et al. 2019) and manipulate the output using tidyverse. SingleR accepts any log-normalised transcript abundance matrix

# Reference cell types
blueprint <- BlueprintEncodeData()
#> snapshotDate(): 2021-10-18
#> see ?celldex and browseVignettes('celldex') for documentation
#> loading from cache
#> see ?celldex and browseVignettes('celldex') for documentation
#> loading from cache

cell_type <-

  # extracting counts from SingleCellExperiment object
  assays(pbmc)$reconstructed %>%

    # SingleR
  SingleR(
    ref = blueprint,
    labels = blueprint$label.main,
    clusters = pbmc %>% pull(label)
  ) %>%

  # Formatting results
  as.data.frame() %>%
  as_tibble(rownames = "label") %>%
  select(label, first.labels)

cell_type
#> # A tibble: 11 × 2
#>    label first.labels   
#>    <chr> <chr>          
#>  1 1     B-cells        
#>  2 2     NK cells       
#>  3 3     Skeletal muscle
#>  4 4     Fibroblasts    
#>  5 5     Monocytes      
#>  6 6     Smooth muscle  
#>  7 7     NK cells       
#>  8 8     Neutrophils    
#>  9 9     Macrophages    
#> 10 10    B-cells        
#> 11 11    Monocytes

We join the cell type information to our pbmc data.

# Join cell type info
pbmc <-
  pbmc %>%
  left_join(cell_type, by = "label")

# Reorder columns
pbmc %>% select(cell, first.labels, everything())
#> # A SingleCellExperiment-tibble abstraction: 2,899 × 28
#> [90m# Features=51958 | Assays=reconstructed[39m
#>    cell  first.labels batch groups sample orig.ident nCount_RNA nFeature_RNA
#>    <chr> <chr>        <int> <chr>  <chr>  <chr>           <dbl>        <int>
#>  1 CCAG… Monocytes        1 1      GSE11… SeuratPro…       3421          979
#>  2 ATGA… Skeletal mu…     1 2      GSE11… SeuratPro…       2752          898
#>  3 TATG… NK cells         1 2      GSE11… SeuratPro…       2114          937
#>  4 CATA… Monocytes        1 1      GSE11… SeuratPro…       3122         1086
#>  5 GAGG… NK cells         1 2      GSE11… SeuratPro…       2341          957
#>  6 AGCT… Fibroblasts      1 1      GSE11… SeuratPro…       5472         1758
#>  7 TGAT… B-cells          1 1      GSE11… SeuratPro…       1258          542
#>  8 ACGA… B-cells          1 1      GSE11… SeuratPro…       7683         1926
#>  9 CGGC… Monocytes        1 1      GSE11… SeuratPro…       3500         1092
#> 10 ATAG… Monocytes        1 1      GSE11… SeuratPro…       3092          974
#> # … with 2,889 more rows, and 20 more variables: S.Score <dbl>,
#> #   G2M.Score <dbl>, ident <fct>, sum <dbl>, detected <int>,
#> #   subsets_Mito_sum <dbl>, subsets_Mito_detected <int>,
#> #   subsets_Mito_percent <dbl>, total <dbl>, high_mitochondrion <lgl>,
#> #   dataset <chr>, label <chr>, corrected1 <dbl>, corrected2 <dbl>,
#> #   corrected3 <dbl>, corrected4 <dbl>, corrected5 <dbl>, UMAP1 <dbl>,
#> #   UMAP2 <dbl>, UMAP3 <dbl>

Challenge: Which cell type (first.label) has the largest no. of cells?
Select one of the multiple choice options in www.menti.com (code in Google doc).

Pseudobulk analyses

It is sometime useful to aggregate cell-wise transcript abundance into pseudobulk samples. It is possible to explore data and perform hypothesis testing with tools and data-source that we are more familiar with. For example, we can use edgeR in tidybulk to perform differential expression testing. For more details on pseudobulk analysis see here.

Data exploration using pseudobulk samples

To do this, we will use a helper function called aggregate_cells, available in this workshop package, to create a group for each sample.

# Aggregate
pbmc_bulk <-
  rpharma2021tidytranscriptomics::pbmc %>%
  rpharma2021tidytranscriptomics::aggregate_cells(file)

pbmc_bulk %>%

  # Tidybulk operations
  tidybulk::identify_abundant() %>%
  tidybulk::scale_abundance()
#> No group or design set. Assuming all samples belong to one group.
#> # A SummarizedExperiment-tibble abstraction: 1,039,160 × 9
#> [90m# Transcripts=51958 | Samples=20 | Assays=abundance_counts, abundance_logcounts, abundance_counts_scaled[39m
#>    .feature .sample abundance_counts abundance_logco… abundance_count…
#>    <chr>    <chr>              <dbl>            <dbl>            <dbl>
#>  1 7SK.2    ../dat…                1                1             2.78
#>  2 A.1      ../dat…                0                0             0   
#>  3 A.2      ../dat…                0                0             0   
#>  4 A.3      ../dat…                0                0             0   
#>  5 A.5      ../dat…                0                0             0   
#>  6 A1BG     ../dat…                7                7            19.4 
#>  7 A1BG-AS1 ../dat…                3                3             8.33
#>  8 A1CF     ../dat…                0                0             0   
#>  9 A2M      ../dat…                0                0             0   
#> 10 A2M-AS1  ../dat…                1                1             2.78
#> # … with 40 more rows, and 4 more variables: orig.ident <chr>, TMM <dbl>,
#> #   multiplier <dbl>, .abundant <lgl>

Introduction to tidyseurat

In Part 2 we showed how we can study the cell-type composition of a biological sample using bulk RNA sequencing. Single cell sequencing enables a more direct estimation of cell-type composition and gives greater resolution. For bulk RNA sequencing we need to infer the cell types using the abundance of transcripts in the whole sample, with single-cell RNA sequencing we can directly measure the transcripts in each cell and then classify the cells into cell types.

Seurat is a very popular analysis toolkit for single cell RNA sequencing data (Butler et al. 2018; Stuart et al. 2019) .

tidyseurat provides a bridge between the Seurat single-cell package (Butler et al. 2018; Stuart et al. 2019) and the tidyverse (Wickham et al. 2019). It creates an invisible layer that enables viewing the Seurat object as a tidyverse tibble, and provides Seurat-compatible dplyr, tidyr, ggplot and plotly functions.

pbmc_seurat <- rpharma2021tidytranscriptomics::pbmc_seurat
#> Registered S3 method overwritten by 'spatstat.geom':
#>   method     from
#>   print.boxx cli

library(tidyseurat)

pbmc_seurat
#> # A Seurat-tibble abstraction: 80 × 9
#> [90m# Features=230 | Active assay=RNA | Assays=RNA[39m
#>    cell  orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
#>    <chr> <fct>           <dbl>        <int> <fct>           <fct>         <chr> 
#>  1 ATGC… SeuratPro…         70           47 0               A             g2    
#>  2 CATG… SeuratPro…         85           52 0               A             g1    
#>  3 GAAC… SeuratPro…         87           50 1               B             g2    
#>  4 TGAC… SeuratPro…        127           56 0               A             g2    
#>  5 AGTC… SeuratPro…        173           53 0               A             g2    
#>  6 TCTG… SeuratPro…         70           48 0               A             g1    
#>  7 TGGT… SeuratPro…         64           36 0               A             g1    
#>  8 GCAG… SeuratPro…         72           45 0               A             g1    
#>  9 GATA… SeuratPro…         52           36 0               A             g1    
#> 10 AATG… SeuratPro…        100           41 0               A             g1    
#> # … with 70 more rows, and 2 more variables: RNA_snn_res.1 <fct>, file <chr>

It can be interacted with using SingleCellExperiment commands such as assayNames.

Assays(pbmc_seurat)
#> [1] "RNA"

We can also interact with our object as we do with any tidyverse tibble.

Tidyverse commands

We can interact to this object in the same way we interact with SingleCellExperiment, through the tidy paradigm

# Filter
pbmc_seurat %>% filter(groups == "g1")
#> # A Seurat-tibble abstraction: 44 × 9
#> [90m# Features=230 | Active assay=RNA | Assays=RNA[39m
#>    cell  orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
#>    <chr> <fct>           <dbl>        <int> <fct>           <fct>         <chr> 
#>  1 CATG… SeuratPro…         85           52 0               A             g1    
#>  2 TCTG… SeuratPro…         70           48 0               A             g1    
#>  3 TGGT… SeuratPro…         64           36 0               A             g1    
#>  4 GCAG… SeuratPro…         72           45 0               A             g1    
#>  5 GATA… SeuratPro…         52           36 0               A             g1    
#>  6 AATG… SeuratPro…        100           41 0               A             g1    
#>  7 AGAG… SeuratPro…        191           61 0               A             g1    
#>  8 CTAA… SeuratPro…        168           44 0               A             g1    
#>  9 TTGG… SeuratPro…        135           45 0               A             g1    
#> 10 CATC… SeuratPro…         79           43 0               A             g1    
#> # … with 34 more rows, and 2 more variables: RNA_snn_res.1 <fct>, file <chr>

# Select
pbmc_seurat %>% select(cell, nCount_RNA , groups)
#> # A Seurat-tibble abstraction: 80 × 3
#> [90m# Features=230 | Active assay=RNA | Assays=RNA[39m
#>    cell           nCount_RNA groups
#>    <chr>               <dbl> <chr> 
#>  1 ATGCCAGAACGACT         70 g2    
#>  2 CATGGCCTGTGCAT         85 g1    
#>  3 GAACCTGATGAACC         87 g2    
#>  4 TGACTGGATTCTCA        127 g2    
#>  5 AGTCAGACTGCACA        173 g2    
#>  6 TCTGATACACGTGT         70 g1    
#>  7 TGGTATCTAAACAG         64 g1    
#>  8 GCAGCTCTGTTTCT         72 g1    
#>  9 GATATAACACGCAT         52 g1    
#> 10 AATGTTGACAGTCA        100 g1    
#> # … with 70 more rows

# Mutate
pbmc_seurat %>%
  mutate(groups_l=tolower(groups)) %>%
  # select columns to view    
  select(groups, groups_l)
#> tidyseurat says: Key columns are missing. A data frame is returned for independent data analysis.
#> # A tibble: 80 × 2
#>    groups groups_l
#>    <chr>  <chr>   
#>  1 g2     g2      
#>  2 g1     g1      
#>  3 g2     g2      
#>  4 g2     g2      
#>  5 g2     g2      
#>  6 g1     g1      
#>  7 g1     g1      
#>  8 g1     g1      
#>  9 g1     g1      
#> 10 g1     g1      
#> # … with 70 more rows

# Bind datasets
pbmc_seurat %>% bind_rows(pbmc_seurat)
#> Warning in CheckDuplicateCellNames(object.list = objects): Some cell names are
#> duplicated across objects provided. Renaming to enforce unique cell names.
#> # A Seurat-tibble abstraction: 160 × 9
#> [90m# Features=230 | Active assay=RNA | Assays=RNA[39m
#>    cell  orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups
#>    <chr> <chr>           <dbl>        <int> <chr>           <chr>         <chr> 
#>  1 ATGC… SeuratPro…         70           47 0               A             g2    
#>  2 CATG… SeuratPro…         85           52 0               A             g1    
#>  3 GAAC… SeuratPro…         87           50 1               B             g2    
#>  4 TGAC… SeuratPro…        127           56 0               A             g2    
#>  5 AGTC… SeuratPro…        173           53 0               A             g2    
#>  6 TCTG… SeuratPro…         70           48 0               A             g1    
#>  7 TGGT… SeuratPro…         64           36 0               A             g1    
#>  8 GCAG… SeuratPro…         72           45 0               A             g1    
#>  9 GATA… SeuratPro…         52           36 0               A             g1    
#> 10 AATG… SeuratPro…        100           41 0               A             g1    
#> # … with 150 more rows, and 2 more variables: RNA_snn_res.1 <chr>, file <chr>

Key Points

Some basic steps of a single-cell RNA sequencing analysis are dimensionality reduction, cluster identification and cell type classification
tidySingleCellExperiment is an invisible layer that operates on a SingleCellExperiment object and enables us to visualise and manipulate data as if it were a tidy data frame.
tidySingleCellExperiment object is a SingleCellExperiment object so it can be used with any SingleCellExperiment compatible method

Feedback

We would be very grateful if you could please complete the feedback survey so we can gather feedback on the effectiveness and usefulness of this tutorial. The link to the survey is here: https://forms.gle/QVGUqQAjC8Zgg5EaA

Contributing

If you want to suggest improvements for this workshop or ask questions, you can do so as described here.

Reproducibility

Record package and version information with sessionInfo

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] tidyseurat_0.3.0               SeuratObject_4.0.2            
#>  [3] tidySingleCellExperiment_1.4.0 ggbeeswarm_0.6.0              
#>  [5] celldex_1.4.0                  EnsDb.Hsapiens.v86_2.99.0     
#>  [7] ensembldb_2.18.0               AnnotationFilter_1.18.0       
#>  [9] GenomicFeatures_1.46.1         AnnotationDbi_1.56.1          
#> [11] SingleR_1.8.0                  batchelor_1.10.0              
#> [13] igraph_1.2.7                   scran_1.22.0                  
#> [15] scater_1.22.0                  scuttle_1.4.0                 
#> [17] SingleCellExperiment_1.16.0    tidySummarizedExperiment_1.5.1
#> [19] tidybulk_1.6.0                 tidyHeatmap_1.3.1             
#> [21] GGally_2.1.2                   ggrepel_0.9.1                 
#> [23] plotly_4.10.0                  purrr_0.3.4                   
#> [25] ggplot2_3.3.5                  stringr_1.4.0                 
#> [27] readr_2.0.2                    tidyr_1.1.4                   
#> [29] dplyr_1.0.7                    tibble_3.1.5                  
#> [31] airway_1.14.0                  SummarizedExperiment_1.24.0   
#> [33] Biobase_2.54.0                 GenomicRanges_1.46.0          
#> [35] GenomeInfoDb_1.30.0            IRanges_2.28.0                
#> [37] S4Vectors_0.32.0               BiocGenerics_0.40.0           
#> [39] MatrixGenerics_1.6.0           matrixStats_0.61.0            
#> 
#> loaded via a namespace (and not attached):
#>   [1] rappdirs_0.3.3                       
#>   [2] rtracklayer_1.54.0                   
#>   [3] scattermore_0.7                      
#>   [4] ragg_1.2.0                           
#>   [5] bit64_4.0.5                          
#>   [6] knitr_1.36                           
#>   [7] irlba_2.3.3                          
#>   [8] DelayedArray_0.20.0                  
#>   [9] rpart_4.1-15                         
#>  [10] data.table_1.14.2                    
#>  [11] KEGGREST_1.34.0                      
#>  [12] RCurl_1.98-1.5                       
#>  [13] doParallel_1.0.16                    
#>  [14] generics_0.1.1                       
#>  [15] preprocessCore_1.56.0                
#>  [16] ScaledMatrix_1.2.0                   
#>  [17] cowplot_1.1.1                        
#>  [18] RSQLite_2.2.8                        
#>  [19] RANN_2.6.1                           
#>  [20] proxy_0.4-26                         
#>  [21] future_1.23.0                        
#>  [22] bit_4.0.4                            
#>  [23] tzdb_0.2.0                           
#>  [24] spatstat.data_2.1-0                  
#>  [25] xml2_1.3.2                           
#>  [26] httpuv_1.6.3                         
#>  [27] assertthat_0.2.1                     
#>  [28] viridis_0.6.2                        
#>  [29] xfun_0.27                            
#>  [30] hms_1.1.1                            
#>  [31] jquerylib_0.1.4                      
#>  [32] evaluate_0.14                        
#>  [33] promises_1.2.0.1                     
#>  [34] fansi_0.5.0                          
#>  [35] restfulr_0.0.13                      
#>  [36] progress_1.2.2                       
#>  [37] dendextend_1.15.2                    
#>  [38] dbplyr_2.1.1                         
#>  [39] DBI_1.1.1                            
#>  [40] geneplotter_1.72.0                   
#>  [41] htmlwidgets_1.5.4                    
#>  [42] reshape_0.8.8                        
#>  [43] spatstat.geom_2.3-0                  
#>  [44] ellipsis_0.3.2                       
#>  [45] RSpectra_0.16-0                      
#>  [46] annotate_1.72.0                      
#>  [47] deldir_1.0-6                         
#>  [48] biomaRt_2.50.0                       
#>  [49] sparseMatrixStats_1.6.0              
#>  [50] vctrs_0.3.8                          
#>  [51] ROCR_1.0-11                          
#>  [52] abind_1.4-5                          
#>  [53] cachem_1.0.6                         
#>  [54] withr_2.4.2                          
#>  [55] sctransform_0.3.2                    
#>  [56] GenomicAlignments_1.30.0             
#>  [57] prettyunits_1.1.1                    
#>  [58] dittoSeq_1.6.0                       
#>  [59] goftest_1.2-3                        
#>  [60] cluster_2.1.2                        
#>  [61] ExperimentHub_2.2.0                  
#>  [62] lazyeval_0.2.2                       
#>  [63] crayon_1.4.2                         
#>  [64] genefilter_1.76.0                    
#>  [65] edgeR_3.36.0                         
#>  [66] pkgconfig_2.0.3                      
#>  [67] labeling_0.4.2                       
#>  [68] nlme_3.1-153                         
#>  [69] vipor_0.4.5                          
#>  [70] ProtGenerics_1.26.0                  
#>  [71] globals_0.14.0                       
#>  [72] rlang_0.4.12                         
#>  [73] miniUI_0.1.1.1                       
#>  [74] lifecycle_1.0.1                      
#>  [75] filelock_1.0.2                       
#>  [76] BiocFileCache_2.2.0                  
#>  [77] rsvd_1.0.5                           
#>  [78] AnnotationHub_3.2.0                  
#>  [79] polyclip_1.10-0                      
#>  [80] rprojroot_2.0.2                      
#>  [81] lmtest_0.9-38                        
#>  [82] Matrix_1.3-4                         
#>  [83] zoo_1.8-9                            
#>  [84] beeswarm_0.4.0                       
#>  [85] ggridges_0.5.3                       
#>  [86] GlobalOptions_0.1.2                  
#>  [87] pheatmap_1.0.12                      
#>  [88] png_0.1-7                            
#>  [89] viridisLite_0.4.0                    
#>  [90] rjson_0.2.20                         
#>  [91] bitops_1.0-7                         
#>  [92] KernSmooth_2.23-20                   
#>  [93] Biostrings_2.62.0                    
#>  [94] blob_1.2.2                           
#>  [95] DelayedMatrixStats_1.16.0            
#>  [96] shape_1.4.6                          
#>  [97] parallelly_1.28.1                    
#>  [98] beachmat_2.10.0                      
#>  [99] scales_1.1.1                         
#> [100] ica_1.0-2                            
#> [101] memoise_2.0.0                        
#> [102] magrittr_2.0.1                       
#> [103] plyr_1.8.6                           
#> [104] zlibbioc_1.40.0                      
#> [105] compiler_4.1.1                       
#> [106] dqrng_0.3.0                          
#> [107] BiocIO_1.4.0                         
#> [108] RColorBrewer_1.1-2                   
#> [109] clue_0.3-60                          
#> [110] DESeq2_1.34.0                        
#> [111] fitdistrplus_1.1-6                   
#> [112] Rsamtools_2.10.0                     
#> [113] cli_3.1.0                            
#> [114] XVector_0.34.0                       
#> [115] listenv_0.8.0                        
#> [116] pbapply_1.5-0                        
#> [117] patchwork_1.1.1                      
#> [118] mgcv_1.8-38                          
#> [119] MASS_7.3-54                          
#> [120] tidyselect_1.1.1                     
#> [121] stringi_1.7.5                        
#> [122] textshaping_0.3.6                    
#> [123] highr_0.9                            
#> [124] yaml_2.2.1                           
#> [125] BiocSingular_1.10.0                  
#> [126] locfit_1.5-9.4                       
#> [127] grid_4.1.1                           
#> [128] sass_0.4.0                           
#> [129] tools_4.1.1                          
#> [130] future.apply_1.8.1                   
#> [131] parallel_4.1.1                       
#> [132] circlize_0.4.13                      
#> [133] bluster_1.4.0                        
#> [134] foreach_1.5.1                        
#> [135] metapod_1.2.0                        
#> [136] gridExtra_2.3                        
#> [137] farver_2.1.0                         
#> [138] Rtsne_0.15                           
#> [139] digest_0.6.28                        
#> [140] BiocManager_1.30.16                  
#> [141] FNN_1.1.3                            
#> [142] shiny_1.7.1                          
#> [143] Rcpp_1.0.7                           
#> [144] BiocVersion_3.14.0                   
#> [145] later_1.3.0                          
#> [146] RcppAnnoy_0.0.19                     
#> [147] org.Hs.eg.db_3.14.0                  
#> [148] httr_1.4.2                           
#> [149] rpharma2021tidytranscriptomics_0.11.0
#> [150] ComplexHeatmap_2.10.0                
#> [151] colorspace_2.0-2                     
#> [152] tensor_1.5                           
#> [153] reticulate_1.22                      
#> [154] XML_3.99-0.8                         
#> [155] fs_1.5.0                             
#> [156] splines_4.1.1                        
#> [157] uwot_0.1.10                          
#> [158] statmod_1.4.36                       
#> [159] spatstat.utils_2.2-0                 
#> [160] pkgdown_1.6.1                        
#> [161] systemfonts_1.0.3                    
#> [162] xtable_1.8-4                         
#> [163] jsonlite_1.7.2                       
#> [164] R6_2.5.1                             
#> [165] pillar_1.6.4                         
#> [166] htmltools_0.5.2                      
#> [167] mime_0.12                            
#> [168] glue_1.4.2                           
#> [169] fastmap_1.1.0                        
#> [170] BiocParallel_1.28.0                  
#> [171] BiocNeighbors_1.12.0                 
#> [172] class_7.3-19                         
#> [173] interactiveDisplayBase_1.32.0        
#> [174] codetools_0.2-18                     
#> [175] utf8_1.2.2                           
#> [176] spatstat.sparse_2.0-0                
#> [177] lattice_0.20-45                      
#> [178] bslib_0.3.1                          
#> [179] ResidualMatrix_1.4.0                 
#> [180] curl_4.3.2                           
#> [181] leiden_0.3.9                         
#> [182] survival_3.2-13                      
#> [183] limma_3.50.0                         
#> [184] rmarkdown_2.11                       
#> [185] desc_1.4.0                           
#> [186] munsell_0.5.0                        
#> [187] e1071_1.7-9                          
#> [188] GetoptLong_1.0.5                     
#> [189] GenomeInfoDbData_1.2.7               
#> [190] iterators_1.0.13                     
#> [191] reshape2_1.4.4                       
#> [192] gtable_0.3.0                         
#> [193] spatstat.core_2.3-0                  
#> [194] Seurat_4.0.5

References

Abbas, Alexander R, Kristen Wolslegel, Dhaya Seshasayee, Zora Modrusan, and Hilary F Clark. 2009. “Deconvolution of Blood Microarray Data Identifies Cellular Activation Patterns in Systemic Lupus Erythematosus.” PloS One 4 (7): e6098.

Aran, Dvir, Agnieszka P Looney, Leqian Liu, Esther Wu, Valerie Fong, Austin Hsu, Suzanna Chak, et al. 2019. “Reference-Based Analysis of Lung Single-Cell Sequencing Reveals a Transitional Profibrotic Macrophage.” Nature Immunology 20 (2): 163–72.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (5): 411–20.

Cai, Yi, Youchao Dai, Yejun Wang, Qianqing Yang, Jiubiao Guo, Cailing Wei, Weixin Chen, et al. 2020. “Single-Cell Transcriptomics of Blood Reveals a Natural Killer Cell Subset Depletion in Tuberculosis.” EBioMedicine 53 (March): 102686. https://doi.org/10.1016/j.ebiom.2020.102686.

Chen, Yunshun, Aaron TL Lun, and Gordon K Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” F1000Research 5.

Ding, Jiarui, Xian Adiconis, Sean K. Simmons, Monika S. Kowalczyk, Cynthia C. Hession, Nemanja D. Marjanovic, Travis K. Hughes, et al. 2020. “Systematic Comparison of Single-Cell and Single-Nucleus RNA-Sequencing Methods.” Nature Biotechnology 38 (6): 737–46. https://doi.org/10.1038/s41587-020-0465-8.

Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milica Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in r for Medium-Sized 10x Genomics Single-Cell RNA-Sequencing Data.” F1000Research 7 (December): 1297. https://doi.org/10.12688/f1000research.15809.2.

Himes, Blanca E, Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M Whitaker, et al. 2014. “RNA-Seq Transcriptome Profiling Identifies Crispld2 as a Glucocorticoid Responsive Gene That Modulates Cytokine Function in Airway Smooth Muscle Cells.” PloS One 9 (6): e99625.

Karagiannis, Tanya T., John P. Cleary, Busra Gok, Andrew J. Henderson, Nicholas G. Martin, Masanao Yajima, Elliot C. Nelson, and Christine S. Cheng. 2020. “Single Cell Transcriptomics Reveals Opioid Usage Evokes Widespread Suppression of Antiviral Gene Program.” Nature Communications 11 (1). https://doi.org/10.1038/s41467-020-16159-y.

Law, Charity W, Monther Alhamdoosh, Shian Su, Xueyi Dong, Luyi Tian, Gordon K Smyth, and Matthew E Ritchie. 2016. “RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR.” F1000Research 5.

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.” Genome Biology 15 (2): R29.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550.

McCarthy, Davis J, Yunshun Chen, and Gordon K Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97.

Newman, Aaron M, Chih Long Liu, Michael R Green, Andrew J Gentles, Weiguo Feng, Yue Xu, Chuong D Hoang, Maximilian Diehn, and Ash A Alizadeh. 2015. “Robust Enumeration of Cell Subsets from Tissue Expression Profiles.” Nature Methods 12 (5): 453–57.

Racle, Julien, Kaat de Jonge, Petra Baumgaertner, Daniel E Speiser, and David Gfeller. 2017. “Simultaneous Enumeration of Cancer and Immune Cell Types from Bulk Tumor Gene Expression Data.” Elife 6: e26476.

Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biology 11 (3): 1–9.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7): 1888–1902.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686.

Wickham, Hadley, and others. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.

<maria.doyle at petermac.org>↩︎
<mangiola.s at wehi.edu.au>↩︎

Introduction to Tidy Transcriptomics

Maria Doyle, Peter MacCallum Cancer Centre1

Stefano Mangiola, Walter and Eliza Hall Institute2

Workshop introduction

Instructors

Overview

Pre-requisites

Time outline

Workshop goals and objectives

Learning goals

Learning objectives

Google doc with all links

Audience Questions

What is transcriptomics?

Why use transcriptomics?

Possible experimental design

How does transcriptomics work?

Types of transcriptomic analyses

Bulk RNA sequencing differential expression workflow

Differences between bulk and single-cell RNA sequencing

Single-cell RNA sequencing analysis workflow

Tidy data and the tidyverse

Getting started

Cloud

Local

What is tidytranscriptomics?

Part 1 Bulk RNA sequencing with tidySummarizedExperiment and tidybulk

Acknowledgements

Introduction

tidySummarizedExperiment

Tidyverse commands

Tidybulk workflow: functions/utilities available

Setting up the data

Filtering lowly transcribed genes

Scaling counts to normalise

Exploratory analyses

Dimensionality reduction

Hierarchical clustering with heatmaps

Differential expression

Comparison of methods

Single method

Volcano plots

Automatic bibliography

Cell type composition analysis

Key Points

Part 2 Single-cell RNA sequencing with tidySingleCellExperiment

Introduction to tidySingleCellExperiment

Tidyverse commands

Join datasets

Setting up the data

Quality control

Scaling and Integrating

Identify clusters

Reduce dimensions

Cell type classification

Manual cell type classification

Automatic cell type classification

Pseudobulk analyses

Data exploration using pseudobulk samples

Introduction to tidyseurat

Tidyverse commands

Key Points

Feedback

Contributing

Reproducibility

References

Maria Doyle, Peter MacCallum Cancer Centre¹

Stefano Mangiola, Walter and Eliza Hall Institute²