Original paper: Solé-Boldo, L., Raddatz, G., Schütz, S. et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun Biol 3, 188 (2020). https://doi.org/10.1038/s42003-020-0922-4
The paper analyses five human skin samples from young and old donors and compares them to understand how skin cells change over time. The skin is taken from a region that is not exposed to the sun to avoid the effects of UV radiation, and the dataset includes 16,062 cells before the authors' additional quality filtering.
Here, we'll focus primarily on one young sample and try to replicate the paper's findings about fibroblast subpopulations. We won't go as deep into it as the study authors do, but we'll also explore sample integration and functional annotation.
This notebook is intended to run with a kernel located at /mnt/storage/r0978323/system/.local/share/jupyter/kernels/project_1/kernel.json. It should be provided with the submission, but either way, its purpose is to set up a Python 3.11.11 environment with all the necessary packages. A requirements.txt file with the following contents should replicate the environment:
scanpy[leiden]
anndata==0.10.9
scrublet
numpy==1.26
gprofiler-official
harmonypy
A copy of the output of running the notebook should be present in /mnt/storage/r0978323/project_data/skin_aging/output, where it could be investigated if there are problems with running it independently.
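As an optional sanity check that the active kernel matches this environment, the following minimal sketch only prints the interpreter version:
import sys
# Expected to start with "3.11.11" if the project kernel is active
print(sys.version)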
The relevant modules are imported below and a random state is picked for reproducibility purposes:
import os
from pathlib import Path
from warnings import catch_warnings
import scanpy as sc
import scrublet as scr
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
RANDOM_STATE = 978323
Scanpy can automatically save generated plots, though it doesn't provide much fine-grained control, only a global "settings" object. We'll increase the figure size a bit to improve image resolution:
sc.settings.set_figure_params(
facecolor="white",
figsize=(8, 8),
format='png',
)
The original files from the study have been downloaded from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130973
Three files have been downloaded to /mnt/storage/r0978323/project_data/skin_aging, unzipped and renamed:
GSE130973_barcodes_filtered.tsv.gz to barcodes.tsv
GSE130973_genes_filtered.tsv.gz to genes.tsv
GSE130973_matrix_filtered.mtx.gz to matrix.mtx
Note that the labels of the data files say "filtered", but that is only with respect to CellRanger. The authors explain their methods like this:
16,062 cells passed the quality control steps performed by Cell Ranger. To remove possible cell doublets, we filtered out cells with more than 7500 expressed genes, and to remove potential apoptotic cells we discarded cells with more than 5% mitochondrial reads. The application of these filters resulted in a final dataset of 15,457 single-cell transcriptomes.
As we'll see below, the "filtered" dataset contains exactly 16,062 cells, so we can be sure that these files will need some additional cleanup. We'll start by creating a local directory that symlinks to the downloaded data:
os.makedirs('data', exist_ok=True)
downloaded_path = Path('data/downloaded')
if not downloaded_path.is_symlink():
downloaded_path.symlink_to('/mnt/storage/r0978323/project_data/skin_aging')
print(list(downloaded_path.glob('*')))
[PosixPath('data/downloaded/barcodes.tsv'), PosixPath('data/downloaded/genes.tsv'), PosixPath('data/downloaded/matrix.mtx'), PosixPath('data/downloaded/output')]
Scanpy has some trouble reading the matrix and genes with its read_10x_mtx function, so we can't use it as-is. It seems that the function expects version 3 of the CellRanger format, but the study used version 2. This seems to be a problem with gzipping in particular, but unzipping doesn't quite fix it. There are more details in this github issue: https://github.com/scverse/scanpy/issues/1916.
There are a number of proposed fixes, but here we've chosen to construct the AnnData object manually:
adata = sc.read_mtx(downloaded_path / 'matrix.mtx')
adata_bc = pd.read_csv(downloaded_path / 'barcodes.tsv', header=None)
adata_features = pd.read_csv(downloaded_path / 'genes.tsv', header=None, sep='\t')
adata = adata.T
adata.obs['cell_id'] = adata_bc[0].values
adata.obs_names = adata.obs['cell_id']
adata.var['feature_name'] = adata_features[1].values
adata.var_names = adata.var['feature_name']
adata.var_names_make_unique()
# The make-unique method only touches the index, so we need to reassign the actual column:
adata.var['feature_name'] = adata.var.index
print("> Original data, as downloaded:")
print(adata)
> Original data, as downloaded:
AnnData object with n_obs × n_vars = 16062 × 32738
obs: 'cell_id'
var: 'feature_name'
Sample barcoding is described in the individual sample pages on GEO, for instance this one for sample y1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3758115
the cell barcode of this sample for the processed data files is -1
The sample barcodes are ordered from 1 to 5 and they correspond to the following sample names used in the paper:
# Assign sample labels
adata.obs.loc[adata.obs_names.str.endswith('-1'), 'sample_id'] = 'y1'
adata.obs.loc[adata.obs_names.str.endswith('-2'), 'sample_id'] = 'y2'
adata.obs.loc[adata.obs_names.str.endswith('-3'), 'sample_id'] = 'o1'
adata.obs.loc[adata.obs_names.str.endswith('-4'), 'sample_id'] = 'o2'
adata.obs.loc[adata.obs_names.str.endswith('-5'), 'sample_id'] = 'o3'
# Save files for later
adata[adata.obs['sample_id'] == 'y1'].copy().write('data/downloaded_y1.h5ad')
adata[adata.obs['sample_id'] == 'y2'].copy().write('data/downloaded_y2.h5ad')
We've saved the y1 and y2 samples to files, but going forward, we'll mostly work with the 3130 cells of y1.
adata = sc.read('data/downloaded_y1.h5ad')
print("> Sample y1, first young donor:")
print(adata)
> Sample y1, first young donor:
AnnData object with n_obs × n_vars = 3130 × 32738
obs: 'cell_id', 'sample_id'
var: 'feature_name'
Before analyzing cell clusters, we'll perform some cleanup and doublet detection. We'll start by measuring the highest-expressed genes in the downloaded dataset:
# Before cleanup:
sc.pl.highest_expr_genes(adata, n_top=20)
The highest-expressed gene seems to be MALAT1, a long non-coding RNA found in the nucleus, which is what we'd expect from a single-cell experiment. (In fact, a recent bioRxiv preprint argues that it can be used as a quality measure: https://www.biorxiv.org/content/10.1101/2024.07.14.603469v2).
The other genes mostly encode ribosomal proteins, and there are also two mitochondrial genes. We'll filter out cells with high MT- expression, but we'll see in a bit that ribosomal genes are a bit trickier to deal with.
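As an aside, if we wanted to follow the preprint's suggestion and track MALAT1 as a quality measure, the per-cell fraction is easy to compute. This is a quick exploratory sketch only; we don't filter on it anywhere in this notebook:
# Fraction of each cell's counts coming from MALAT1 (exploratory only, not used for filtering)
malat1_counts = np.asarray(adata[:, adata.var_names == 'MALAT1'].X.sum(axis=1)).ravel()
total_counts = np.asarray(adata.X.sum(axis=1)).ravel()
adata.obs['pct_counts_malat1'] = malat1_counts / total_counts * 100
print(adata.obs['pct_counts_malat1'].describe())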
For now, let's ask Scrublet to predict doublets in the collection:
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
min_counts=2,
min_cells=3,
min_gene_variability_pctl=85,
n_prin_comps=30
)
scrub.plot_histogram()
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.53
Detected doublet rate = 0.5%
Estimated detectable doublet fraction = 7.7%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 6.2%
Elapsed time: 30.6 seconds
(<Figure size 640x240 with 2 Axes>,
array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
<Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
dtype=object))
The automatic threshold picked by scrublet doesn't seem appropriate. We can clearly see a bimodal distribution in the right image, but the line is chosen to the right of the second peak rather than in the valley between them. A better threshold might be 0.20, so we'll pick that one, save the resulting predictions and re-plot the histogram:
# Override default call for y1:
predicted_doublets = scrub.call_doublets(threshold=0.2)
# Save doublets and scores for later:
adata.obs['predicted_doublets'] = predicted_doublets
adata.obs['doublet_scores'] = doublet_scores
scrub.plot_histogram()
Detected doublet rate = 2.5%
Estimated detectable doublet fraction = 43.5%
Overall doublet rate:
    Expected   = 6.0%
    Estimated  = 5.7%
(<Figure size 640x240 with 2 Axes>,
array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
<Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
dtype=object))
We'll try to visualize doublet locations relative to cell type clusters via UMAP:
# Ignore UMAP warnings for random state
with catch_warnings(action="ignore"):
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True)
Some of the doublets are a bit spread out, but there are also concentrated clusters. We might expect doublets to appear between clusters or at their edges. We'll remove them later by clustering with a high resolution parameter. For now, we'll proceed with the rest of the cleanup: filtering out genes detected in very few cells and removing cells that express very few genes:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
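(As a side note, if we preferred to drop Scrublet's calls immediately rather than waiting for the high-resolution clustering step, it would be a one-liner; we keep the doublets for now so we can inspect their UMAP locations.)
# Alternative (not executed in this notebook): remove Scrublet-flagged doublets right away
# adata = adata[~adata.obs['predicted_doublets']].copy()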
We'll try to locate mitochondrial genes by looking for names starting with MT-. A high percentage of those might indicate dead cells that we should remove. Hemoglobin-heavy cells should also be removed, though as we'll see, this does not seem to be a problem in this particular sample.
We'll note that the HBEGF gene does not seem to be hemoglobin-related, according to both UniProt and the Human Protein Atlas. The "HB" here stands for "heparin-binding", so we'll make sure not to include it in the "hb" set of genes.
We'll also detect ribosomal genes and observe how they distribute.
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['ribo'] = adata.var_names.str.startswith(("RPS", "RPL"))
adata.var['hb'] = adata.var_names.str.contains("^HB[^(P)]")
adata.var.loc[adata.var_names == 'HBEGF', 'hb'] = False
sc.pp.calculate_qc_metrics(
adata,
qc_vars=['mt', 'ribo', 'hb'],
percent_top=None,
log1p=False,
inplace=True,
)
figure, axes = plt.subplots(1, 4, figsize=(20, 5))
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', show=False, ax=axes[0])
sc.pl.scatter(adata, x='total_counts', y='pct_counts_ribo', show=False, ax=axes[1])
sc.pl.scatter(adata, x='total_counts', y='pct_counts_hb', show=False, ax=axes[2])
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', show=False, ax=axes[3])
figure.show()
sc.pl.violin(
adata,
['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo', 'pct_counts_hb'],
jitter=0.4,
multi_panel=True,
)
Plotting the cells by the number of genes they express, we can see that the bulk is located below 3500 with a spread of outliers above. We'll use 3500 as a threshold later.
Mitochondrial gene percentages are not too high, but we'll use the threshold of 5% to avoid potential apoptotic cells. There are essentially no blood cells that we can detect -- hemoglobin percentages peak at around 0.4%. Other samples do have a few, but here we don't need to do anything. Still, we'll add a filter for hb < 5 just to show that it does not change the cell count.
Ribosomal proteins are a different topic. There's a wide range of their expression, up to around 60% in some cases. The 10x genomics documentation suggests that this is not a signal of poor quality (https://kb.10xgenomics.com/hc/en-us/articles/218169723-What-fraction-of-reads-map-to-ribosomal-proteins):
Single Cell 3’ v2 libraries do contain reads mapping to ribosomal protein transcripts (Rps, Rpl). The fraction of reads varies based on cell type and overall cell health.
They note that ribosomal protein transcripts are not the same as rRNA; they are poly-adenylated (see e.g. https://pmc.ncbi.nlm.nih.gov/articles/PMC1474067/) and do not necessarily indicate quality issues:
Although both ribosomal proteins and ribosomal RNA (rRNA) make up the ribosome complex, ribosomal protein transcripts are not equivalent to ribosomal RNA (rRNA). Ribosomal protein transcript detection will not necessarily correlate with either rRNA or mitochondrial transcripts.
As we'll see later, these are distributed fairly evenly across cell types, so they should hopefully not interfere much with our analysis. We'll use highly variable genes, which is what the 10x genomics website recommends as well:
Only include in the PC analysis genes that are "highly variable". We do this by selecting the genes with the highest dispersion across the dataset and performing PCA on those genes only. The Seurat package uses a similar approach.
To try to exclude outliers, however, we'll cut off cells that include more than 50% ribosomal proteins, based on the violin plot above.
adata = adata[adata.obs.n_genes_by_counts < 3500, :]
print("> After removal by gene count:")
print(adata.shape)
adata = adata[adata.obs.pct_counts_mt < 5, :]
print("After removal by MT percentage:")
print(adata.shape)
adata = adata[adata.obs.pct_counts_hb < 5, :]
print("After removal by HB percentage (note: no change)")
print(adata.shape)
adata = adata[adata.obs.pct_counts_ribo < 50, :]
print("After removal by Ribosomal percentage")
print(adata.shape)
adata_clean = adata.copy()
sc.pl.highest_expr_genes(adata_clean, n_top=20)
> After removal by gene count:
(3076, 17852)
After removal by MT percentage:
(2735, 17852)
After removal by HB percentage (note: no change)
(2735, 17852)
After removal by Ribosomal percentage
(2728, 17852)
We still have the MT-CO1 gene expressed, but it's a fairly low percentage. Many of the ribosomal protein genes are still in this list, but the scanpy documentation suggests that this is normal:
We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1
Let's write the cleaned dataset of 2728 observations to disk, so we can rerun the next sections independently:
adata_clean.write('data/cleaned_y1.h5ad')
We'll start by allocating a place for visual inspection of the clustering results: a scanpy figdir that will store the generated figures. Since we'll be generating a large number of figures, we'll store them on disk and attach selected results to markdown cells.
sc.settings.figdir = 'figures/leiden_y1'
os.makedirs(sc.settings.figdir, exist_ok=True)
Now, we'll read the data from the previous section, normalize counts to a sum of 10,000 per cell, log-transform, and store a record of highly variable genes in the AnnData object:
adata = sc.read('data/cleaned_y1.h5ad')
# Normalize counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Save current snapshot before further rescaling
adata.raw = adata
# Find variable genes
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000, inplace=True)
We'll use PCA to find linear combinations of genes that best express the variability in the data. This will not necessarily correspond to cell types, but it can give us guidance towards particular categories of genes.
# Scale all values to a consistent range of values:
sc.pp.scale(adata, max_value=10)
# Run PCA for 100 components, limiting to highly variable genes
sc.pp.pca(adata, n_comps=100, mask_var="highly_variable")
# Visualize loadings:
sc.pl.pca_loadings(adata, include_lowest=False)
We can see some potential marker genes that we should try visualizing. COL6A2 seems to be the largest contributor to the first principal component, together with other collagen-related genes. CST3, FTL, and a number of HLA (Human Leukocyte Antigen) genes in PC2 suggest a grouping of immune cells. PC3 includes PTPRCAP, CD3D, CD3E, and IL32, which also seem immunity-related. According to the scanpy tutorial, CD3D is a well-known marker for T-cells.
Some top marker genes suggested by the paper for different cell types include IL8, LUM, and KRT1, so let's keep those in a list for visualisation as well.
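For convenience, we can collect these candidates in two small lists (the names pca_markers and paper_markers are just our own labels; the plotting calls below spell the genes out explicitly as well):
# Candidate marker genes picked from the PCA loadings and from the paper
pca_markers = ['COL6A2', 'CST3', 'PTPRCAP']
paper_markers = ['IL8', 'LUM', 'KRT1']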
For now, let's try to pick a limit of principal components to use for clustering by investigating their explained variability.
with plt.rc_context({'figure.figsize': (10, 5)}):
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=100)
The paper picks 28 as its number of principal components, which seems somewhat reasonable, but it's hard to see a sharp reduction in explained variability until quite a bit later. We'll pick principal component 45 as the point where variability really seems to taper off.
This is a very subjective choice and it's hard to know what's "correct". Increasing the number of components might amplify minor distinctions, while setting it too low may miss important signals of differentiation. Tweaking the number also shifts UMAP cluster positions quite a lot, making it hard to visually determine correspondences. We'll leave this constant in a separate cell, so it's easy to experiment with:
PCA_COMPONENTS = 45
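As a rough numerical cross-check of this choice, we can sum the variance ratios that sc.pp.pca stored in adata.uns (a quick sketch; the elbow remains a judgment call either way):
# Cumulative fraction of total variance captured by the first k principal components
var_ratio = adata.uns['pca']['variance_ratio']
for k in (28, PCA_COMPONENTS, 100):
    print(f"First {k} PCs explain {var_ratio[:k].sum():.1%} of the variance")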
With that, let's try calculating a UMAP embedding and plotting it with some of the genes we looked at above.
# UMAP clustering based on number of PCs:
sc.pp.neighbors(adata, n_pcs=PCA_COMPONENTS)
sc.tl.umap(adata, random_state=RANDOM_STATE)
# Labels from PCs:
sc.pl.umap(adata, color=['COL6A2', 'CST3', 'PTPRCAP'])
# Labels from the paper:
sc.pl.umap(adata, color=['IL8', 'LUM', 'KRT1'])
The separation looks promising: our principal components have picked up genes specific to the major clusters. The marker genes suggested by the paper also indicate that our results are compatible with theirs so far.
Let's plot the QC-related genes and investigate their distributions across clusters:
sc.pl.umap(
adata,
color=['pct_counts_mt', 'pct_counts_ribo', 'n_genes_by_counts']
)
Mitochondrial genes are evenly spread out and limited to below 5%, since that was the threshold we picked. There are some outliers with higher gene counts, though some may correspond to doublets and will likely go away when we remove them.
Ribosomal genes have a much higher percentage of expression, as we noted before, but they are mostly spread out. There is one broad cluster in the center with fairly high expression and a smaller one on the upper right that corresponded to KRT1 in a previous chart. As we'll see, we can find non-ribosomal marker genes for both of those.
At this stage, we might also take a look at doublets in the filtered and normalized datasets to see how they're distributed.
sc.pl.umap(
adata,
color=['predicted_doublets', 'doublet_scores']
)
We can see some predicted doublets sprinkled between the center and upper-left clusters, which might indicate doublets formed from these two cell types. A few are present inside islands, where we might expect the clusters to separate at a higher leiden resolution.
Speaking of which, let's run the leiden clustering algorithm with different resolutions and decide which ones to focus on. We'll prepare a simple function that runs it at various resolutions and stores the images for later inspection:
def perform_leiden(adata, resolutions, label):
for res in resolutions:
key = 'leiden_' + str(res)
sc.tl.leiden(
adata,
resolution=res,
flavor="igraph",
# Note: we'll let the algorithm iterate until convergence, it does not take long:
n_iterations=-1,
key_added=key,
random_state=RANDOM_STATE,
)
sc.pl.umap(
adata,
color=[key],
legend_loc='on data',
show=False,
save=f"_{key}_{label}.png",
)
Let's pick a range of values with varying granularity. We'll pay particular attention to the 0.5 to 0.9 range:
perform_leiden(adata, resolutions=[0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 2, 2.5, 5], label="exploratory")
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.1_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.5_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.6_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.7_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.8_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.9_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_2_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_2.5_exploratory.png
WARNING: saving figure to file figures/leiden_y1/umap_leiden_5_exploratory.png
def visualize_resolution_images(label, resolutions):
figure, axes = plt.subplots(1, len(resolutions), figsize=(18, 10))
for ax in axes:
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
for i, resolution in enumerate(resolutions):
axes[i].imshow(mpimg.imread(f'{sc.settings.figdir}/umap_leiden_{resolution}_{label}.png'))
figure.tight_layout()
figure.show()
visualize_resolution_images('exploratory', [0.1, 0.9, 5])