scRNA-seq Analysis of Human Skin Samples

Original paper: Solé-Boldo, L., Raddatz, G., Schütz, S. et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun Biol 3, 188 (2020). https://doi.org/10.1038/s42003-020-0922-4

The paper analyzes five human skin samples from young and old donors and compares them to understand how skin cells change with age. The skin was taken from a sun-protected region to avoid the effects of UV radiation, and the combined dataset includes around 16 thousand cells before QC.

Here, we'll focus primarily on one young sample and try to replicate the paper's findings about fibroblast subpopulations. We won't go as deep as the study's authors do, but we will additionally explore sample integration and functional annotation.

Table of contents

  1. Technical preparation
  2. Cleaning and quality control
  3. Dimensionality reduction and clustering
  4. Marker genes
  5. Functional analysis
  6. Sample integration with Harmony
  7. Closing thoughts

Technical preparation

This notebook is intended to run with a kernel located at /mnt/storage/r0978323/system/.local/share/jupyter/kernels/project_1/kernel.json. It should be provided with the submission, but either way, its purpose is to set up a Python 3.11.11 environment with all the necessary packages. A requirements.txt file with the following contents should replicate the environment:

scanpy[leiden]
anndata==0.10.9
scrublet
numpy==1.26
gprofiler-official
harmonypy

A copy of the output of running the notebook should be present in /mnt/storage/r0978323/project_data/skin_aging/output, and it can be consulted if there are problems with running the notebook independently.

The relevant modules are imported below and a random state is picked for reproducibility purposes:
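
As a rough sketch (the module list and the value of RANDOM_STATE are assumptions, not the original cell), the import cell might look like this:

```python
# Core analysis stack used throughout the notebook (sketch; the actual cell may differ)
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc

# One fixed random state, reused by the PCA, UMAP, Leiden and Scrublet calls below
RANDOM_STATE = 0  # assumed value; any fixed integer gives reproducible runs
```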

Scanpy can automatically save generated plots, though it doesn't provide much fine-grained control, only a global "settings" object. We'll increase the figure size a bit to improve image resolution:
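
A sketch of the corresponding settings cell; the concrete figsize and dpi_save values are assumptions:

```python
# Autosave plots into sc.settings.figdir and bump the figure size/DPI a bit
sc.settings.autosave = True
sc.settings.set_figure_params(figsize=(6, 6), dpi_save=150)  # assumed values
```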

The original files from the study have been downloaded from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130973

Three files have been downloaded to /mnt/storage/r0978323/project_data/skin_aging, unzipped and renamed:

Note that the labels of the data files say "filtered", but that is only with respect to CellRanger. The authors explain their methods like this:

16,062 cells passed the quality control steps performed by Cell Ranger. To remove possible cell doublets, we filtered out cells with more than 7500 expressed genes, and to remove potential apoptotic cells we discarded cells with more than 5% mitochondrial reads. The application of these filters resulted in a final dataset of 15,457 single-cell transcriptomes.

As we'll see below, the "filtered" dataset contains exactly 16,062 cells, so we can be sure that these files will need some additional cleanup. We'll start by creating a local directory that symlinks to the downloaded data:
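
One way the symlinking might look; the local directory name and the renamed file names below are hypothetical (the standard CellRanger v2 names are assumed):

```python
import os

# Hypothetical local layout pointing at the downloaded (unzipped, renamed) files
src_dir = "/mnt/storage/r0978323/project_data/skin_aging"
dst_dir = "data"
os.makedirs(dst_dir, exist_ok=True)
for name in ("matrix.mtx", "genes.tsv", "barcodes.tsv"):  # assumed file names
    link = os.path.join(dst_dir, name)
    if not os.path.exists(link):
        os.symlink(os.path.join(src_dir, name), link)
```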

Scanpy has some trouble reading the matrix and genes with its read_10x_mtx function, so we can't use it as-is. The function expects version 3 of the CellRanger format, but the study used version 2. The problem seems to be related to gzipping in particular, though unzipping doesn't quite fix it. There are more details in this GitHub issue: https://github.com/scverse/scanpy/issues/1916.

There are a number of proposed fixes, but we've chosen here to construct an AnnData object manually:
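
A sketch of the manual construction, assuming the standard CellRanger v2 layout (matrix.mtx with genes in rows, a two-column genes.tsv and a one-column barcodes.tsv) under the hypothetical file names from above:

```python
import anndata as ad
import pandas as pd
from scipy.io import mmread
from scipy.sparse import csr_matrix

# The mtx file stores genes x cells, so transpose to the cells x genes AnnData convention
matrix = csr_matrix(mmread("data/matrix.mtx").T)

genes = pd.read_csv("data/genes.tsv", sep="\t", header=None,
                    names=["gene_id", "gene_symbol"])
barcodes = pd.read_csv("data/barcodes.tsv", sep="\t", header=None, names=["barcode"])

adata = ad.AnnData(
    X=matrix,
    obs=pd.DataFrame(index=barcodes["barcode"].values),
    var=pd.DataFrame({"gene_id": genes["gene_id"].values},
                     index=genes["gene_symbol"].values),
)
adata.var_names_make_unique()  # gene symbols are not guaranteed to be unique
```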

Sample barcoding is described in the individual sample pages on GEO, for instance this one for sample y1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3758115

the cell barcode of this samples for the processed data files is -1

The sample barcodes are ordered from 1 to 5 and they correspond to the following sample names used in the paper:

We've saved the y1 and y2 samples to files, but going forward, we'll mostly work with the 3130 cells of y1.
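
A sketch of the subsetting step; only the y1 / "-1" correspondence is confirmed by the GEO page quoted above, and the output path is hypothetical:

```python
# The barcode suffix ("-1" .. "-5") identifies the sample; "-1" is y1 per its GEO page
adata.obs["barcode_suffix"] = [bc.rsplit("-", 1)[1] for bc in adata.obs_names]

y1 = adata[adata.obs["barcode_suffix"] == "1"].copy()
y1.write("data/y1_raw.h5ad")  # hypothetical path
```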

Cleaning and quality control

Before analyzing cell clusters, we'll perform some cleanup and doublet detection. We'll start by measuring the highest-expressed genes in the downloaded dataset:
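
This can be done with Scanpy's built-in helper; the number of genes shown here is an assumption:

```python
# Plot the genes that account for the largest fraction of counts per cell
sc.pl.highest_expr_genes(adata, n_top=20)
```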

The highest-expressed gene seems to be MALAT1, a long non-coding RNA found in the nucleus, which is what we'd expect from a single-cell experiment. (In fact, a recent bioRxiv preprint argues that it can be used as a quality measure: https://www.biorxiv.org/content/10.1101/2024.07.14.603469v2).

The other genes mostly encode ribosomal proteins, and there are also two mitochondrial genes. We'll filter out most of the cells with high MT- expression, but as we'll see in a bit, ribosomal genes are a bit trickier to deal with.

For now, let's ask Scrublet to predict doublets in the collection:

The automatic threshold picked by Scrublet doesn't seem appropriate. We can clearly see a bimodal distribution in the right-hand plot, but the line is placed to the right of the second peak rather than in the valley between the two. A better threshold might be 0.20, so we'll pick that one, save the resulting predictions and re-plot the histogram:
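
A sketch using the standalone scrublet API; the expected doublet rate is left at its default, and the 0.20 threshold comes from inspecting the histogram:

```python
import scrublet as scr

# Run Scrublet on the raw counts and look at its automatically chosen threshold
scrub = scr.Scrublet(adata.X, random_state=RANDOM_STATE)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram()

# The automatic cutoff lands right of the second peak, so re-call with a manual one
predicted_doublets = scrub.call_doublets(threshold=0.20)
scrub.plot_histogram()

adata.obs["doublet_score"] = doublet_scores
adata.obs["predicted_doublet"] = predicted_doublets
```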

We'll try to visualize doublet locations relative to cell type clusters via UMAP:

Some of the doublets are a bit spread out, but there are also concentrated clusters of them. We would expect doublets to fall between clusters or at their edges. We'll remove them later by clustering with a high resolution parameter. For now, we'll proceed with the rest of the cleanup: dropping genes expressed in very few cells and removing cells that express very few genes:
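
A sketch of the basic filtering; the 3-cell and 200-gene cutoffs are assumptions (common defaults), not necessarily the ones used here:

```python
# Drop genes detected in very few cells and cells expressing very few genes
sc.pp.filter_genes(adata, min_cells=3)    # assumed cutoff
sc.pp.filter_cells(adata, min_genes=200)  # assumed cutoff
```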

We'll try to locate mitochondrial genes by looking for names starting with MT-. A high percentage of mitochondrial counts might indicate dead or apoptotic cells that we should remove. Hemoglobin-heavy cells should also be removed, though as we'll see, this does not seem to be a problem in this particular sample.

We'll note that the HBEGF gene does not seem to be hemoglobin-related, according to both UniProt and the Human Protein Atlas. The "HB" here stands for "heparin-binding", so we'll make sure not to include it in the "hb" set of genes.

We'll also detect ribosomal genes and observe how they distribute.
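
A sketch of how these gene sets could be flagged and turned into per-cell QC metrics; the hemoglobin pattern is one possible way to exclude HBEGF and may need adjusting for other non-hemoglobin HB* names:

```python
# Flag mitochondrial, ribosomal and hemoglobin genes by their symbols
adata.var["mt"] = adata.var_names.str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# Negative lookahead keeps HB* genes but skips heparin-binding HBEGF
adata.var["hb"] = adata.var_names.str.contains(r"^HB(?!EGF)", regex=True)

# Adds n_genes_by_counts, total_counts and pct_counts_{mt,ribo,hb} to adata.obs
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt", "ribo", "hb"],
                           percent_top=None, log1p=False, inplace=True)
```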

Plotting the cells by the number of genes they express, we can see that the bulk is located below 3500 with a spread of outliers above. We'll use 3500 as a threshold later.

Mitochondrial gene percentages are not too high, but we'll use the 5% threshold to exclude potential apoptotic cells. There are essentially no blood cells that we can detect: hemoglobin percentages reach at most around 0.4%. Other samples do have a few, but here we don't need to do anything. Still, we'll add a filter for hb < 5 just to show that it does not change the cell count.

Ribosomal proteins are a different topic. There's a wide range of their expression, up to around 60% in some cases. The 10x genomics documentation suggests that this is not a signal of poor quality (https://kb.10xgenomics.com/hc/en-us/articles/218169723-What-fraction-of-reads-map-to-ribosomal-proteins):

Single Cell 3’ v2 libraries do contain reads mapping to ribosomal protein transcripts (Rps, Rpl). The fraction of reads varies based on cell type and overall cell health.

They note that ribosomal protein transcripts are not the same as rRNA; these transcripts are poly-adenylated (see e.g. https://pmc.ncbi.nlm.nih.gov/articles/PMC1474067/) and do not necessarily indicate quality issues:

Although both ribosomal proteins and ribosomal RNA (rRNA) make up the ribosome complex, ribosomal protein transcripts are not equivalent to ribosomal RNA (rRNA). Ribosomal protein transcript detection will not necessarily correlate with either rRNA or mitochondrial transcripts.

As we'll see later, these are distributed fairly evenly across cell types, so they should hopefully not interfere much with our analysis. We'll use highly variable genes, which is what the 10x Genomics website recommends as well:

Only include in the PC analysis genes that are "highly variable". We do this by selecting the genes with the highest dispersion across the dataset and performing PCA on those genes only. The Seurat package uses a similar approach.

To exclude outliers, however, we'll cut off cells with more than 50% ribosomal protein reads, based on the violin plot above.
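
Putting these thresholds together, the filtering step might look like the sketch below (cutoffs as discussed above):

```python
# Apply the QC cutoffs discussed above; the hb filter is effectively a no-op here
keep = (
    (adata.obs["n_genes_by_counts"] < 3500)
    & (adata.obs["pct_counts_mt"] < 5)
    & (adata.obs["pct_counts_hb"] < 5)
    & (adata.obs["pct_counts_ribo"] < 50)
)
adata = adata[keep].copy()
```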

We still have the MT-CO1 gene expressed, but at a fairly low percentage. Many of the ribosomal protein genes are still in this list, but the scanpy documentation suggests that this is normal:

We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1

Let's write the cleaned dataset of 2728 observations to disk, so we can rerun the next sections independently:
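
A one-line sketch; the output path is hypothetical:

```python
# Persist the QC-filtered dataset so later sections can start from it
adata.write("data/y1_clean.h5ad")  # hypothetical path
```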

Dimensionality reduction and clustering

We'll start by setting up a place for visual inspection of the clustering: a figdir for scanpy that will store the figures for later review. Particularly when generating a large number of figures, we'll opt to store them on disk and attach selected results to markdown cells.
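
For instance (the directory name is an assumption):

```python
# Direct saved figures for this section into a dedicated directory
sc.settings.figdir = "figures/clustering"  # hypothetical path
```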

Now, we'll read the data from the previous section, normalize counts to a sum of 1,000 and store a record of highly variable genes to the AnnData object:
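
A sketch of this step, assuming the hypothetical path from the QC section; it also log-transforms the counts, which Scanpy's default highly-variable-gene flavor expects:

```python
adata = sc.read_h5ad("data/y1_clean.h5ad")  # hypothetical path from the QC section

# Normalize each cell to 1,000 total counts, log-transform, and flag highly variable genes
sc.pp.normalize_total(adata, target_sum=1e3)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
```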

We'll use PCA to find linear combinations of genes that best express the variability in the data. This will not necessarily correspond to cell types, but it can give us guidance towards particular categories of genes.
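
A sketch of the PCA step; with highly variable genes annotated, Scanpy restricts the decomposition to them by default, and the number of components computed here (50) is an assumption:

```python
# Compute the first 50 PCs and inspect the top gene loadings of the leading components
sc.pp.pca(adata, n_comps=50, random_state=RANDOM_STATE)
sc.pl.pca_loadings(adata, components=[1, 2, 3])
```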

We can see some potential marker genes that we should try visualizing. COL6A2, together with other collagen-related genes, seems to be the biggest contributor to the first principal component. CST3, FTL, and a number of HLA (Human Leukocyte Antigen) genes in PC2 suggest a grouping of immune cells. PC3 includes PTPRCAP, CD3D, CD3E, and IL32, which also seem immunity-related. According to the scanpy tutorial, CD3D is a well-known marker for T cells.

Some top marker genes suggested by the paper for different cell types include IL8, LUM, and KRT1, so let's keep those in a list for visualization as well.

For now, let's try to pick a limit of principal components to use for clustering by investigating their explained variability.
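
One way to look at this is Scanpy's variance-ratio plot (a sketch; the number of components shown is an assumption):

```python
# Variance explained per principal component, on a log scale to make the tail visible
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
```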

The paper picks 28 principal components, which seems somewhat reasonable, but it's hard to see a sharp reduction in explained variability until a while later. We'll pick 45 principal components as the point where variability really seems to taper off.

This is a very subjective choice and it's hard to know what's "correct". Increasing the number of components might amplify minor distinctions, while setting it too low may miss important signals of differentiation. Tweaking the number also shifts UMAP cluster positions quite a lot, making it hard to visually determine correspondences. We'll leave this constant in a separate cell, so it's easy to experiment with:
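
The constant cell might simply look like this:

```python
# Number of principal components used for the neighbor graph; easy to tweak and re-run
N_PCS = 45
```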

With that, let's try calculating a UMAP embedding and plotting it with some of the genes we looked at above.
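
A sketch of the embedding and plotting cell; the gene list is an illustrative subset of the markers discussed above and assumes they are present in var_names:

```python
# Build the neighbor graph on the chosen PCs, compute UMAP, and overlay genes of interest
sc.pp.neighbors(adata, n_pcs=N_PCS, random_state=RANDOM_STATE)
sc.tl.umap(adata, random_state=RANDOM_STATE)
sc.pl.umap(adata, color=["COL6A2", "CST3", "CD3D", "LUM", "KRT1"])
```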

The separation looks promising: our principal components have been able to pick out genes specific to the major clusters. The marker genes suggested by the paper also indicate that our results are compatible with theirs so far.

Let's plot the QC-related genes and investigate their distributions across clusters:
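
A sketch of the overlay, using the QC columns computed during cleanup:

```python
# Overlay the per-cell QC metrics from earlier on the UMAP embedding
sc.pl.umap(adata, color=["pct_counts_mt", "pct_counts_ribo", "n_genes_by_counts"])
```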

Mitochondrial percentages are evenly spread out and limited to below 5%, since that was the threshold we picked. There are some outliers with higher gene counts, though some of these may correspond to doublets and will likely go away when we remove them.

Ribosomal genes have a much higher percentage of expression, as we noted before, but they are mostly spread out. There is one broad cluster in the center with fairly high expression and one smaller cluster on the upper right that corresponded to KRT1 in a previous chart. As we'll see, we can find non-ribosomal marker genes for both of those.

At this stage, we might also take a look at doublets in the filtered and normalized datasets to see how they're distributed.
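
For example, coloring the embedding by the Scrublet score saved during QC (a sketch):

```python
# Color the UMAP by the doublet score stored in adata.obs during cleanup
sc.pl.umap(adata, color="doublet_score")
```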

We can see some predicted doublets sprinkled between the center and upper-left clusters, which might indicate doublets formed from these two cell types. A few are present inside islands, where we might expect the clusters to separate at a higher Leiden resolution.

Speaking of which, let's run the Leiden clustering algorithm with different resolutions and decide which ones to focus on. We'll prepare a simple function that runs it at various resolutions and stores the images for later inspection:
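
A sketch of such a helper; the plot file naming relies on the autosave/figdir settings shown earlier:

```python
def leiden_at_resolutions(adata, resolutions, key_prefix="leiden"):
    """Run Leiden clustering at several resolutions and save one UMAP plot per run."""
    for res in resolutions:
        key = f"{key_prefix}_{res}"
        sc.tl.leiden(adata, resolution=res, key_added=key, random_state=RANDOM_STATE)
        # With autosave enabled this writes umap_<key>.png into sc.settings.figdir
        sc.pl.umap(adata, color=key, save=f"_{key}.png")
```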

Let's pick a range of values with varying granularity. We'll pay particular attention to the 0.5 to 0.9 range:
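
The specific resolution values below are an assumed sweep, denser in the 0.5 to 0.9 range of interest:

```python
leiden_at_resolutions(adata, resolutions=[0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.2])
```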