Original paper: Solé-Boldo, L., Raddatz, G., Schütz, S. et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun Biol 3, 188 (2020). https://doi.org/10.1038/s42003-020-0922-4
The paper analyses five human skin samples of young and old donors and compares them to understand what changes in skin cells over time. The skin is taken from a region that is not exposed to the sun to avoid the effects of UV radiation and it includes around 16 thousand cells before QC.
Here, we'll focus primarily on one young sample and try to replicate the paper's findings about fibroblast subpopulations. We won't go as deep into it as the study authors do, but we'll also explore sample integration and functional annotation.
This notebook is intended to run with a kernel located at /mnt/storage/r0978323/system/.local/share/jupyter/kernels/project_1/kernel.json
. It should be provided with the submission, but either way, the purpose of it is to set up a python environment running python 3.11.11 with a collection of all the necessary packages. A requirements.txt
file with these contents should work to replicate the environment:
scanpy[leiden]
anndata==0.10.9
scrublet
numpy==1.26
gprofiler-official
harmonypy
A copy of the output of running the notebook should be present in /mnt/storage/r0978323/project_data/skin_aging/output
, where it could be investigated if there are problems with running it independently.
The relevant modules are imported below and a random state is picked for reproducibility purposes:
import os
from pathlib import Path
from warnings import catch_warnings
import scanpy as sc
import scrublet as scr
import anndata as ad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
RANDOM_STATE = 978323
Scanpy can automatically save generated plots, though it doesn't provide a lot of fine-grained control, but only a global "settings" object. We'll increase the figure size a bit to improve image resolution:
sc.settings.set_figure_params(
facecolor="white",
figsize=(8, 8),
format='png',
)
The original files from the study have been downloaded from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130973
Three files have been downloaded to /mnt/storage/r0978323/project_data/skin_aging
, unzipped and renamed:
GSE130973_barcodes_filtered.tsv.gz
to barcodes.tsv
GSE130973_genes_filtered.tsv.gz
to genes.tsv
GSE130973_matrix_filtered.mtx.gz
to matrix.tsv
Note that the labels of the data files say "filtered", but that is only with respect to CellRanger. The authors explain their methods like this:
16,062 cells passed the quality control steps performed by Cell Ranger. To remove possible cell doublets, we filtered out cells with more than 7500 expressed genes, and to remove potential apoptotic cells we discarded cells with more than 5% mitochondrial reads. The application of these filters resulted in a final dataset of 15,457 single-cell transcriptomes.
As we'll see below, the "filtered" dataset contains exactly 16,062 cells, so we can be sure that these files will need some additional cleanup. We'll start by creating a local directory that symlinks to the downloaded data:
os.makedirs('data', exist_ok=True)
downloaded_path = Path('data/downloaded')
if not downloaded_path.is_symlink():
downloaded_path.symlink_to('/mnt/storage/r0978323/project_data/skin_aging')
print(list(downloaded_path.glob('*')))
[PosixPath('data/downloaded/barcodes.tsv'), PosixPath('data/downloaded/genes.tsv'), PosixPath('data/downloaded/matrix.mtx'), PosixPath('data/downloaded/output')]
Scanpy has some trouble reading the matrix and genes with its read_10x_mtx
function, so we can't use it as-is. It seems that the function expects version 3 of the CellRanger format, but the study used version 2. This seems to be a problem with gzipping in particular, but unzipping doesn't quite fix it. There are more details in this github issue: https://github.com/scverse/scanpy/issues/1916.
There's a number of proposed fixes, but we've chosen here to construct an AnnData matrix manually:
adata = sc.read_mtx(downloaded_path / 'matrix.mtx')
adata_bc = pd.read_csv(downloaded_path / 'barcodes.tsv', header=None)
adata_features = pd.read_csv(downloaded_path / 'genes.tsv', header=None, sep='\t')
adata = adata.T
adata.obs['cell_id'] = adata_bc[0].values
adata.obs_names = adata.obs['cell_id']
adata.var['feature_name'] = adata_features[1].values
adata.var_names = adata.var['feature_name']
adata.var_names_make_unique()
# The make-unique method only touches the index, so we need to reassign the actual column:
adata.var['feature_name'] = adata.var.index
print("> Original data, as downloaded:")
print(adata)
> Original data, as downloaded: AnnData object with n_obs × n_vars = 16062 × 32738 obs: 'cell_id' var: 'feature_name'
Sample barcoding is described in the individual sample pages on GEO, for instance this one for sample y1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3758115
the cell barcode of this samples for the processed data files is -1
The sample barcodes are ordered from 1 to 5 and they correspond to the following sample names used in the paper:
# Assign sample labels
adata.obs.loc[adata.obs_names.str.endswith('-1'), 'sample_id'] = 'y1'
adata.obs.loc[adata.obs_names.str.endswith('-2'), 'sample_id'] = 'y2'
adata.obs.loc[adata.obs_names.str.endswith('-3'), 'sample_id'] = 'o1'
adata.obs.loc[adata.obs_names.str.endswith('-4'), 'sample_id'] = 'o2'
adata.obs.loc[adata.obs_names.str.endswith('-5'), 'sample_id'] = 'o3'
# Save files for later
adata[adata.obs['sample_id'] == 'y1'].copy().write('data/downloaded_y1.h5ad')
adata[adata.obs['sample_id'] == 'y2'].copy().write('data/downloaded_y2.h5ad')
We've saved the y1 and y2 samples to files, but going forward, we'll mostly work with the 3130 cells of y1.
adata = sc.read('data/downloaded_y1.h5ad')
print("> Sample y1, first young donor:")
print(adata)
> Sample y1, first young donor: AnnData object with n_obs × n_vars = 3130 × 32738 obs: 'cell_id', 'sample_id' var: 'feature_name'
Before analyzing cell clusters, we'll perform some cleanup and doublet detection. We'll start by measuring the highest-expressed genes in the downloaded dataset:
# Before cleanup:
sc.pl.highest_expr_genes(adata, n_top=20)
The highest-expressed gene seems to be MALAT1, a long non-coding RNA found in the nucleus, which is what we'd expect from a single-cell experiment. (In fact, a recent bioRxiv preprint argues that it can be used as a quality measure: https://www.biorxiv.org/content/10.1101/2024.07.14.603469v2).
The other genes seem to be mostly for ribosomal proteins, and there are also two mitochondrial reads. We'll filter out most of the cells with high MT- expression, but we'll see in a bit that ribosomal genes are a bit trickier to deal with.
For now, let's ask Scrublet to predict doublets in the collection:
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
min_counts=2,
min_cells=3,
min_gene_variability_pctl=85,
n_prin_comps=30
)
scrub.plot_histogram()
Preprocessing... Simulating doublets... Embedding transcriptomes using PCA... Calculating doublet scores... Automatically set threshold at doublet score = 0.53 Detected doublet rate = 0.5% Estimated detectable doublet fraction = 7.7% Overall doublet rate: Expected = 6.0% Estimated = 6.2% Elapsed time: 30.6 seconds
(<Figure size 640x240 with 2 Axes>, array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>, <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>], dtype=object))
The automatic threshold picked by scrublet doesn't seem appropriate. We can clearly see a bimodal distribution in the right image, but the line is chosen to the right of the second peak rather than in the valley between them. A better threshold might be 0.20, so we'll pick that one, save the resulting predictions and re-plot the histogram:
# Override default call for y1:
predicted_doublets = scrub.call_doublets(threshold=0.2)
# Save doublets and scores for later:
adata.obs['predicted_doublets'] = predicted_doublets
adata.obs['doublet_scores'] = doublet_scores
scrub.plot_histogram()
Detected doublet rate = 2.5% Estimated detectable doublet fraction = 43.5% Overall doublet rate: Expected = 6.0% Estimated = 5.7%
(<Figure size 640x240 with 2 Axes>, array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>, <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>], dtype=object))
We'll try to visualize doublet locations relative to cell type clusters via UMAP:
# Ignore UMAP warnings for random state
with catch_warnings(action="ignore"):
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True)
Some of the doublets are a bit spread out, but there are also concentrated clusters. We might expect that the locations are between clusters or at the edges of them. We'll remove them later by clustering with a high resolution parameter. For now, we'll proceed with the rest of the cleanup, like ignoring genes with low cell expression and removing cells with low gene expression:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
We'll try to locate mitochondrial genes by looking for names starting with MT-
. A high percentage of those might indicate dead cells that we should remove. Hemoglobin-heavy cells should also be removed, though as we'll see, this does not seem to be a problem in this particular sample.
We'll note that the HBEGF gene does not seem to be hemoglobin-related, according to both uniprot and the human protein atlas. The "HB" here stands for "heparin-binding", so we'll make sure not to include it in the "hb" set of genes.
We'll also detect ribosomal genes and observe how they distribute.
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['ribo'] = adata.var_names.str.startswith(("RPS", "RPL"))
adata.var['hb'] = adata.var_names.str.contains("^HB[^(P)]")
adata.var.loc[adata.var_names == 'HBEGF', 'hb'] = False
sc.pp.calculate_qc_metrics(
adata,
qc_vars=['mt', 'ribo', 'hb'],
percent_top=None,
log1p=False,
inplace=True,
)
figure, axes = plt.subplots(1, 4, figsize=(20, 5))
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', show=False, ax=axes[0])
sc.pl.scatter(adata, x='total_counts', y='pct_counts_ribo', show=False, ax=axes[1])
sc.pl.scatter(adata, x='total_counts', y='pct_counts_hb', show=False, ax=axes[2])
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', show=False, ax=axes[3])
figure.show()
sc.pl.violin(
adata,
['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo', 'pct_counts_hb'],
jitter=0.4,
multi_panel=True,
)
Plotting the cells by the number of genes they express, we can see that the bulk is located below 3500 with a spread of outliers above. We'll use 3500 as a threshold later.
Mitochondrial gene percentages are not too high, but we'll use the threshold of 5% to avoid potential apoptotic cells. There are essentially no blood cells that we can detect -- hemoglobin percentages go up to 0.4 at the maximum. Other samples do have a few, but here we don't need to do anything. Still, we'll add a filter for hb < 5
just to show that it does not change the cell count.
Ribosomal proteins are a different topic. There's a wide range of their expression, up to around 60% in some cases. The 10x genomics documentation suggests that this is not a signal of poor quality (https://kb.10xgenomics.com/hc/en-us/articles/218169723-What-fraction-of-reads-map-to-ribosomal-proteins):
Single Cell 3’ v2 libraries do contain reads mapping to ribosomal protein transcripts (Rps, Rpl). The fraction of reads varies based on cell type and overall cell health.
They note that ribosomal proteins are not the same as rRNA, these are poly-adenylated (see e.g. https://pmc.ncbi.nlm.nih.gov/articles/PMC1474067/) and do not neccesarily show quality issues:
Although both ribosomal proteins and ribosomal RNA (rRNA) make up the ribosome complex, ribosomal protein transcripts are not equivalent to ribosomal RNA (rRNA). Ribosomal protein transcript detection will not necessarily correlate with either rRNA or mitochondrial transcripts.
As we'll see later, these are distributed fairly evenly across cell types, so they should hopefully not interfere much with our analysis. We'll use highly variable genes, which is what the 10x genomics website recommends as well:
Only include in the PC analysis genes that are "highly variable". We do this by selecting the genes with the highest dispersion across the dataset and performing PCA on those genes only. The Seurat package uses a similar approach.
To try to exclude outliers, however, we'll cut off cells that include more than 50% ribosomal proteins, based on the violin plot above.
adata = adata[adata.obs.n_genes_by_counts < 3500, :]
print("> After removal by gene count:")
print(adata.shape)
adata = adata[adata.obs.pct_counts_mt < 5, :]
print("After removal by MT percentage:")
print(adata.shape)
adata = adata[adata.obs.pct_counts_hb < 5, :]
print("After removal by HB percentage (note: no change)")
print(adata.shape)
adata = adata[adata.obs.pct_counts_ribo < 50, :]
print("After removal by Ribosomal percentage")
print(adata.shape)
adata_clean = adata.copy()
sc.pl.highest_expr_genes(adata_clean, n_top=20)
> After removal by gene count: (3076, 17852) After removal by MT percentage: (2735, 17852) After removal by HB percentage (note: no change) (2735, 17852) After removal by Ribosomal percentage (2728, 17852)
We still have the MT-CO1 gene expressed, but it's a fairly low percentage. Many of the ribosomal protein genes are still in this list, but the scanpy documentaiton suggests that this is normal:
We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1
Let's write the cleaned dataset of 2728 observations to disk, so we can rerun the next sections independently:
adata_clean.write('data/cleaned_y1.h5ad')
We'll start by allocating a place for visual inspection of clustering. We'll set a figdir
for scanpy that will store the figures for later inspection. Particularly when generating a large amount of figures, we'll opt to store them and attach select results to markdown cells.
sc.settings.figdir = 'figures/leiden_y1'
os.makedirs(sc.settings.figdir, exist_ok=True)
Now, we'll read the data from the previous section, normalize counts to a sum of 1,000 and store a record of highly variable genes to the AnnData object:
adata = sc.read('data/cleaned_y1.h5ad')
# Normalize counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Save current snapshot before further rescaling
adata.raw = adata
# Find variable genes
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000, inplace=True)
We'll use PCA to find linear combinations of genes that best express the variability in the data. This will not necessarily correspond to cell types, but it can give us guidance towards particular categories of genes.
# Scale all values to a consistent range of values:
sc.pp.scale(adata, max_value=10)
# Run PCA for 100 components, limiting to highly variable genes
sc.pp.pca(adata, n_comps=100, mask_var="highly_variable")
# Visualize loadings:
sc.pl.pca_loadings(adata, include_lowest=False)
We can see some potential marker genes that we should try visualizing. COL6A2 seems to be the biggest source of variability for the first principal component with other collagen-related genes. CST3, FTL, and a number of HLA (Human Leukocyte Antigen) genes in PC2 suggest a grouping of immune cells. PC3 includes PTPRCAP, CD3D and CD3E, and IL32, which also seem immunity-related. According to the scanpy tutorial, CD3D is a well-known marker for T-cells.
Some top marker genes suggested by the paper for different cell types include IL8, LUM, KRT1, so let's keep those in a list for visualisation as well.
For now, let's try to pick a limit of principal components to use for clustering by investigating their explained variability.
with plt.rc_context({'figure.figsize': (10, 5)}):
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=100)
The paper picks 28 as their number of principal components, which seems somewhat reasonable, but it's hard to see a sharp reduction in in explained variability for a while longer. We'll pick principal component 45 as a point where it looks as though variability really tapers off.
This is a very subjective choice and it's hard to know what's "correct". Increasing the number of components might amplify minor distinctions, but setting it too low may miss important signals of differentiation. Tweaking the number also shifts umap cluster positions quite a lot, making it hard to visually determine correspondences. We'll leave this constant in a separate cell, so it's easy to experiment with:
PCA_COMPONENTS=45
With that, let's try calculating a UMAP embedding and plotting it with some of the genes we looked at above.
# UMAP clustering based on number of PCs:
sc.pp.neighbors(adata, n_pcs=PCA_COMPONENTS)
sc.tl.umap(adata, random_state=RANDOM_STATE)
# Labels from PCs:
sc.pl.umap(adata, color=['COL6A2', 'CST3', 'PTPRCAP'])
# Labels from the paper:
sc.pl.umap(adata, color=['IL8', 'LUM', 'KRT1'])
The separation looks promising, our principal components have been able to find some genes specific to major clusters. The suggested marker genes by the paper also tell us our results are compatible with theirs so far.
Let's plot the QC-related genes and investigate their distributions across clusters:
sc.pl.umap(
adata,
color=['pct_counts_mt', 'pct_counts_ribo', 'n_genes_by_counts']
)
Mitochondrial genes are evenly spread out and limited to below 5%, since that was the threshold we picked. There are some outliers with higher gene counts, though some may correspond to doublets and will likely go away when we remove them.
Ribosomal genes have a much higher percentage of expression, as we noted before, but they are mostly spread out. There is one broad cluster in the center that has a fairly high expression and one smaller on the upper-right that in a previous chart corresponded to KRT1. As we'll see, we can find non-ribosomal marker genes for both of those.
At this stage, we might also take a look at doublets in the filtered and normalized datasets to see how they're distributed.
sc.pl.umap(
adata,
color=['predicted_doublets', 'doublet_scores']
)
We can see some predicted doublets sprinkled between the center and upper-left clusters, which might be an indication of doublets between these two cell types. A few are present inside of islands, where we might expect the clusters to separate with a higher leiden resolution.
Speaking of which, let's try to run the leiden clustering algorithm with different resolutions and decide which ones we should try to focus on. We'll prepare a simple function that runs it on various resolutions and stores the images for later inspection:
def perform_leiden(adata, resolutions, label):
for res in resolutions:
key = 'leiden_' + str(res)
sc.tl.leiden(
adata,
resolution=res,
flavor="igraph",
# Note: we'll let the algorithm iterate until convergence, it does not take long:
n_iterations=-1,
key_added=key,
random_state=RANDOM_STATE,
)
sc.pl.umap(
adata,
color=[key],
legend_loc='on data',
show=False,
save=f"_{key}_{label}.png",
)
Let's pick a range of values with varying granularity. We'll pay particular attention to the 0.5 to 0.9 range:
perform_leiden(adata, resolutions=[0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 2, 2.5, 5], label="exploratory")
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.1_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.5_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.6_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.7_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.8_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.9_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_2_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_2.5_exploratory.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_5_exploratory.png
def visualize_resolution_images(label, resolutions):
figure, axes = plt.subplots(1, len(resolutions), figsize=(18, 10))
for ax in axes:
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
for i, resolution in enumerate(resolutions):
axes[i].imshow(mpimg.imread(f'{sc.settings.figdir}/umap_leiden_{resolution}_{label}.png'))
figure.tight_layout()
figure.show()
visualize_resolution_images('exploratory', [0.1, 0.9, 5])
The three resolutions we'll focus on are 0.1, 0.9, and 5. The last one is only interesting for picking out doublet clusters with some level of precision. The one on the left has 7 clusters that identify the major subtypes of cells and we should be able to make easy inferences about what kinds of cells they represent. Its number will drop to 6 once we remove some doublets and recalculate.
The middle clustering uses a resolution of 0.9 and is the finest resolution by 0.1 increments that separates the island in the top right (what we'll later show is keratinocytes) into two parts. It also produces a separation of fibroblasts into four clusters and one outlier, which mostly corresponds to the organization the paper settles on. (Though removing doublets seems to produce another axis of separation on what is currently cluster 5).
Let's handle potential doublets by filtering out clusters above a certain threshold of predicted doublets. We'll use the function from our course tutorials with some minor stylistic modifications:
def fraction_of_doublets(adata, resolution):
cluster_labels = np.unique(adata.obs[resolution])
cell_counts = []
doublet_fractions = []
for cluster_label in cluster_labels:
clus_adata = adata[adata.obs[resolution] == cluster_label]
cell_count = clus_adata.shape[0]
doublet_count = sum(clus_adata.obs['predicted_doublets'])
doublet_fractions.append(doublet_count / cell_count)
cell_counts.append(cell_count)
return pd.DataFrame({
'cluster_label': cluster_labels,
'cell_count': cell_counts,
'doublet_fraction': doublet_fractions
}).sort_values('doublet_fraction', ascending=False)
# We pick a very high resolution to find clusters with a large amount of doublets:
doublet_fractions = fraction_of_doublets(adata, resolution='leiden_5')
doublet_fractions.head(10)
cluster_label | cell_count | doublet_fraction | |
---|---|---|---|
36 | 41 | 36 | 0.361111 |
42 | 47 | 21 | 0.190476 |
30 | 36 | 6 | 0.166667 |
33 | 39 | 50 | 0.080000 |
16 | 23 | 60 | 0.066667 |
18 | 25 | 30 | 0.066667 |
5 | 13 | 36 | 0.055556 |
40 | 45 | 38 | 0.052632 |
46 | 6 | 79 | 0.050633 |
0 | 0 | 80 | 0.037500 |
Predicted doublets seem to be mostly spread out across clusters, so the removal of entire clusters may not be too useful. Still, we'll drop the top 3 clusters with a predicted doublet rate above 0.1 and filter out the other cells individiually:
doublet_clusters = [
label
for label
in doublet_fractions[doublet_fractions['doublet_fraction'] >= 0.1]['cluster_label']
]
adata = adata[~adata.obs['leiden_5'].isin(doublet_clusters)].copy()
adata = adata[adata.obs['predicted_doublets'] == False].copy()
Let's rerun clustering on the cleaned dataset:
perform_leiden(adata, resolutions=[0.1, 0.9], label="post_doublet")
visualize_resolution_images('post_doublet', [0.1, 0.9])
WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.1_post_doublet.png WARNING: saving figure to file figures/leiden_y1/umap_leiden_0.9_post_doublet.png
Let's also save a snapshot of this anndata object so we can rerun the next section without repeating the clustering process, if we need to.
adata.write('data/cleaned_clustered_y1.h5ad')
We'll start from the broad clusters defined by the 0.1 resolution and try to understand the functionality of these 7 groups of cells. Let's set up some configuration for this section and then define a class that will let us generate lists and query for particular genes.
sc.settings.figdir = 'figures/markers_y1'
os.makedirs(sc.settings.figdir, exist_ok=True)
os.makedirs('data/markers_y1/all', exist_ok=True)
os.makedirs('data/markers_y1/up', exist_ok=True)
os.makedirs('data/markers_y1/gprofiler', exist_ok=True)
We can calculate gene markers by using Wilcoxon tests of differential expression between the clusters. We'll pick a log-fold threshold of 2 for the up-regulated genes and we'll require certain levels of in-group and out-group fractions. We'll also keep the full list of differentially expressed genes for GOrilla.
We'll process the ranking inside of the constructor and then provide a gene_clusters
method to look up where certain genes are located. We'll also add a simple gprofiler lookup method to enable us to make simple functional enrichment queries.
class MarkerGenes():
def __init__(self, adata, resolution):
# Initialize inputs
self.adata = adata
self.cluster_labels = np.unique(self.adata.obs[f'leiden_{resolution}'])
# Rank genes by comparing raw counts via wilcoxon tests:
# (ignore warnings for fragmented data frames)
with catch_warnings(action='ignore'):
sc.tl.rank_genes_groups(
adata,
f'leiden_{resolution}',
method='wilcoxon',
use_raw=True,
key_added=f"rank_{resolution}",
)
# Only select up-regulated genes specific to the cluster:
sc.tl.filter_rank_genes_groups(
adata,
min_in_group_fraction=0.25,
max_out_group_fraction=0.5,
min_fold_change=2,
key=f"rank_{resolution}",
key_added=f"rank_{resolution}_up",
)
# Collect ranked genes into dictionaries for easy access:
self.gene_dfs = {'all': {}, 'up': {}}
for label in self.cluster_labels:
gene_df = sc.get.rank_genes_groups_df(
adata,
label,
key=f"rank_{resolution}",
)
# We'll only keep significant marker genes
self.gene_dfs['all'][label] = gene_df[gene_df["pvals_adj"] < 0.05].copy()
gene_df = sc.get.rank_genes_groups_df(
adata,
label,
key=f"rank_{resolution}_up",
).dropna()
# We remove NA values that didn't pass the "up" filter
self.gene_dfs['up'][label] = gene_df[gene_df["pvals_adj"] < 0.05].dropna().copy()
def gene_clusters(self, gene_names, type='all'):
"""
Find clusters that include any of the given gene names. The type can
either be "all" for the full list of differentially-expressed genes or
"up" for the upregulated ones.
"""
gene_names = set(gene_names)
result_df = pd.DataFrame({
'clusters': [],
'names': [],
'logfoldchanges': [],
'pvals_adj': [],
})
for cluster in self.cluster_labels:
cluster_df = self.gene_dfs[type][cluster]
found_df = cluster_df[cluster_df['names'].isin(gene_names)].copy()
if len(found_df) > 0:
found_df['clusters'] = cluster
if len(result_df) > 0:
result_df = pd.concat((result_df, found_df[[*result_df.columns]]))
else:
result_df = found_df[[*result_df.columns]]
return result_df
def gprofiler_lookup(self, cluster, sources):
"""
Make an enrichment query to g:Profiler for the given cluster's
upregulated genes. Return a subset of columns from the resulting
dataframe.
"""
gene_df = self.gene_dfs['up'][cluster]
gene_names = [gene for gene in gene_df['names']]
result = sc.queries.enrich(
gene_names,
org='hsapiens',
gprofiler_kwargs={'sources': sources}
)
return result[["source", "name", "description", "p_value"]].copy()
There are 6 clusters at the 0.1 resolution that we can try to understand by using g:Profiler information. For instance, here's a list of biological processes that are upregulated in cluster 0 and a list of possible cell types from the Human Protein Atlas:
marker_genes_broad = MarkerGenes(adata, '0.1')
lookup = marker_genes_broad.gprofiler_lookup('0', ['GO:BP'])
lookup.head(10)
source | name | description | p_value | |
---|---|---|---|---|
0 | GO:BP | T cell activation | "The change in morphology and behavior of a ma... | 2.334303e-14 |
1 | GO:BP | immune system process | "Any process involved in the development or fu... | 1.302104e-13 |
2 | GO:BP | regulation of immune system process | "Any process that modulates the frequency, rat... | 1.595562e-13 |
3 | GO:BP | lymphocyte activation | "A change in morphology and behavior of a lymp... | 1.985179e-13 |
4 | GO:BP | leukocyte activation | "A change in morphology and behavior of a leuk... | 1.228464e-12 |
5 | GO:BP | cell activation | "A multicellular organismal process by which e... | 3.309318e-12 |
6 | GO:BP | immune response | "Any immune system process that functions in t... | 5.339767e-12 |
7 | GO:BP | positive regulation of immune system process | "Any process that activates or increases the f... | 1.432505e-10 |
8 | GO:BP | regulation of leukocyte activation | "Any process that modulates the frequency, rat... | 1.202071e-09 |
9 | GO:BP | mononuclear cell differentiation | "The process in which a relatively unspecializ... | 1.370003e-09 |
lookup = marker_genes_broad.gprofiler_lookup('0', ['HPA'])
lookup.head(5)
source | name | description | p_value | |
---|---|---|---|---|
0 | HPA | tonsil; non-germinal center cells[≥Medium] | tonsil; non-germinal center cells[≥Medium] | 2.712662e-10 |
1 | HPA | lymph node; non-germinal center cells[≥Medium] | lymph node; non-germinal center cells[≥Medium] | 1.723807e-08 |
2 | HPA | tonsil; non-germinal center cells[≥Low] | tonsil; non-germinal center cells[≥Low] | 4.821424e-07 |
3 | HPA | tonsil; non-germinal center cells[High] | tonsil; non-germinal center cells[High] | 5.659463e-07 |
4 | HPA | spleen | spleen | 7.909529e-07 |
Based on these biological processes, it's clear that the cluster is made up of immune cells: lymphocytes, T-cells, or other kinds of leukocytes. We know that this is a skin sample and not one from the tonsils or spleen, but the "lymph node" suggestion by the Human Protein Atlas (HPA) supports the guess of "lymphocites". Since we're exploring this cell map in broad strokes, we can call this cluster "Immunity 1" (since there will be one more).
Let's take a look at cluster 2:
lookup = marker_genes_broad.gprofiler_lookup('2', ['GO:BP'])
lookup.head(5)
source | name | description | p_value | |
---|---|---|---|---|
0 | GO:BP | circulatory system development | "The process whose specific outcome is the pro... | 1.068127e-13 |
1 | GO:BP | blood vessel development | "The process whose specific outcome is the pro... | 7.740081e-12 |
2 | GO:BP | vasculature development | "The process whose specific outcome is the pro... | 1.833645e-11 |
3 | GO:BP | blood vessel morphogenesis | "The process in which the anatomical structure... | 8.658498e-10 |
4 | GO:BP | anatomical structure formation involved in mor... | "The developmental process pertaining to the i... | 1.986941e-08 |
lookup = marker_genes_broad.gprofiler_lookup('2', ['HPA'])
lookup.head(5)
source | name | description | p_value | |
---|---|---|---|---|
0 | HPA | endometrium; smooth muscle cells[High] | endometrium; smooth muscle cells[High] | 4.776062e-09 |
1 | HPA | endometrium; smooth muscle cells[≥Medium] | endometrium; smooth muscle cells[≥Medium] | 3.158520e-07 |
2 | HPA | endometrium; smooth muscle cells[≥Low] | endometrium; smooth muscle cells[≥Low] | 3.287242e-06 |
3 | HPA | skin 2; endothelial cells[≥Medium] | skin 2; endothelial cells[≥Medium] | 1.224378e-04 |
4 | HPA | endometrium | endometrium | 2.325990e-04 |
These might be smooth muscle cells, possibly the ones lining blood vessels, or maybe stem cells that can differentiate into them.
With this in mind, we can rerun the gprofile queries on all clusters and attach some very rough annotations with lists of processes that seem relevant. We'll save the results into files and add the annotations manually. In the comments, we'll quote select biological processes from the g:Profiler CSVs.
for cluster in marker_genes_broad.cluster_labels:
lookup = marker_genes_broad.gprofiler_lookup(cluster, ['GO:BP', 'HPA'])
lookup.to_csv(f'data/markers_y1/gprofiler/cluster_{int(cluster):02}.csv')
broad_annotation = {
'0': 'Immunity 1 (Lymphocytes)', # t-cell/lymphocyte activation, immune response
'1': 'Immunity 2 (Macrophages)', # regulation of immune processes, defense response, macrophages
'2': 'Smooth muscles', # cell motility, actin filament-based process, muscle contraction
'3': 'Fibroblasts', # extracellular matrix organisation, cell adhesion, collagen fibril organisation
'4': 'Development 1 (Epidermis)', # skin/epidermis/epithelium/tissue development
'5': 'Development 2 (Blood vessels)', # angiogenesis, vascular/circulatory system development
}
adata.obs['broad_celltype'] = [
broad_annotation[cluster]
for cluster
in adata.obs['leiden_0.1']
]
with plt.rc_context({'figure.figsize': (6, 6)}):
sc.pl.umap(
adata,
color=['broad_celltype']
)
Let's start with a few statements from the paper to try to map our fine-grained clusters to their descriptions.
Keratinocytes were detected in three clusters (#5, #7 and #15) and their diversity was mainly due to their degree of differentiation. While epidermal stem cells (EpSC) and other undifferentiated progenitors (#7 and #15) expressed markers such as KRT5, KRT14, TP63, ITGA6, and ITGB1, differentiated keratinocytes (#5) were defined by KRT1, KRT10, SBSN, and KRTDAP expression
We'll use the 0.9 resolution and see if we can find these specific genes in our rankings. We'll re-plot our umap clustering to try to locate them visually.
marker_genes_narrow = MarkerGenes(adata, '0.9')
print("> Epidermal stem cells and undifferentiated progenitors:")
clusters_df = marker_genes_narrow.gene_clusters(['KRT5', 'TP63', 'ITGA6', 'ITGB1'], type='up')
print(f"Clusters: {np.unique(clusters_df['clusters'])}")
print(clusters_df)
print("> Differentiated keratinocytes:")
clusters_df = marker_genes_narrow.gene_clusters(['KRT1', 'KRT10', 'SBSN', 'KRTDAP'], type='up')
print(f"Clusters: {np.unique(clusters_df['clusters'])}")
print(clusters_df)
> Epidermal stem cells and undifferentiated progenitors: Clusters: ['15' '18' '7' '8'] clusters names logfoldchanges pvals_adj 136 15 ITGA6 3.402256 2.495204e-14 108 18 ITGA6 3.229118 1.216474e-02 33 7 KRT5 4.440905 5.100525e-16 128 7 TP63 4.200040 6.527335e-08 514 7 ITGA6 2.336281 1.722833e-02 1 8 KRT5 6.768706 1.057143e-64 45 8 TP63 5.512927 2.983660e-27 72 8 ITGA6 3.883892 2.669389e-20 > Differentiated keratinocytes: Clusters: ['7'] clusters names logfoldchanges pvals_adj 2 7 KRT1 7.685723 5.556684e-26 5 7 KRT10 6.142022 1.043802e-25 32 7 KRTDAP 6.632085 5.100525e-16 112 7 SBSN 6.395090 6.068905e-09
with plt.rc_context({'figure.figsize': (6, 6)}):
sc.pl.umap(
adata,
color=['leiden_0.9'],
legend_loc='on data'
)
Cluster 7 on this chart was part of what we originally classifed as "Development 1 (Epidermis)", but now we see that this is because the larger cluster includes undifferentiated stem cells, whose signal hides the presence of keratinocytes. All genes suggested by the paper are present in that cluster with very high log fold changes of 6-7.
Cluster 15 we flagged as development-related, so it makes sense that it's a group of undifferentiated stem cells. Cluster 18 was previously part of the smooth muscle cell group, but we did notice some developmental functions in our initial g:Profile exploration, which might be explained by this group being absorbed into it due to the broad leiden resolution.
We can see that both clusters 7 and 8 express KRT5 and TP63, which the paper suggests should be marker genes that separate the two. The log fold changes are higher for 8, and so are the adjusted p-values, but it might also be the case that integrating several samples or making the resolution finer might make the thresholds cleaner.
The main focus on the study is fibroblasts, so it's interesting to explore how they are clustered in our UMAP. Fibroblasts take up the entire top left cluster, which in our case is split into 5 clusters (and an extra outlier) instead of 4. We'll use the marker genes listed by the paper:
Fibroblasts were identified by their archetypal markers LUM, DCN, VIM, PDGFRA, and COL1A227
print("> Fibroblasts:")
clusters_df = marker_genes_narrow.gene_clusters(['LUM', 'DCN', 'VIM', 'PDGFRA', 'COL1A2'], type='up')
print(f"Clusters: {np.unique(clusters_df['clusters'])})")
print(clusters_df)
> Fibroblasts: Clusters: ['10' '11' '13' '17' '5' '6']) clusters names logfoldchanges pvals_adj 2 10 LUM 5.064789 1.159146e-14 8 11 LUM 3.254276 5.424429e-44 49 11 PDGFRA 2.463679 1.605754e-16 63 13 LUM 2.885251 6.212636e-41 81 17 PDGFRA 2.791304 3.867666e-20 47 5 LUM 2.153567 3.991706e-19 2 6 LUM 3.470551 2.628202e-34
Instead of four clusters, we end up with six, whose gene expression log fold changes do not seem particularly different. More than that, only two of the genes seem to show up in the upregulated list.
We can expand our criteria for log fold changes and look for "all" genes with a log fold change above 1:
clusters_df = marker_genes_narrow.gene_clusters(['LUM', 'DCN', 'VIM', 'PDGFRA', 'COL1A2'], type='all')
clusters_df[clusters_df["logfoldchanges"] > 1]
clusters | names | logfoldchanges | pvals_adj | |
---|---|---|---|---|
2 | 10 | LUM | 5.064789 | 1.159146e-14 |
17 | 10 | DCN | 3.166609 | 5.806327e-07 |
2 | 11 | COL1A2 | 3.448838 | 2.592614e-57 |
8 | 11 | LUM | 3.254276 | 5.424429e-44 |
17 | 11 | DCN | 2.552973 | 4.142319e-32 |
49 | 11 | PDGFRA | 2.463679 | 1.605754e-16 |
1 | 13 | DCN | 5.254685 | 1.639696e-107 |
7 | 13 | COL1A2 | 4.582181 | 5.198708e-90 |
63 | 13 | LUM | 2.885251 | 6.212636e-41 |
344 | 13 | PDGFRA | 1.472978 | 1.452574e-05 |
10 | 17 | COL1A2 | 3.342720 | 1.426196e-59 |
12 | 17 | DCN | 3.472524 | 2.345787e-56 |
81 | 17 | PDGFRA | 2.791304 | 3.867666e-20 |
334 | 17 | LUM | 1.009381 | 3.340726e-04 |
14 | 5 | DCN | 3.225979 | 1.207637e-40 |
47 | 5 | LUM | 2.153567 | 3.991706e-19 |
294 | 5 | PDGFRA | 1.383599 | 5.541931e-03 |
2 | 6 | LUM | 3.470551 | 2.628202e-34 |
11 | 6 | DCN | 2.671365 | 1.921617e-24 |
15 | 6 | COL1A2 | 2.386633 | 2.296702e-23 |
22 | 6 | VIM | 1.105719 | 5.124640e-21 |
157 | 6 | PDGFRA | 1.742508 | 5.751365e-04 |
Cluster 10 looks like an outlier in log fold expression of LUM and DCN. We might be looking at another undetected doublet cluster or a different kind of issue. We'll exclude it in further analysis, since it's also a comparatively small group of cells.
Let's save the fibroblast clusters' marker genes to files, excluding 10. We'll continue with a more detailed functional analysis on them in the g:Profiler website, and submit the full list of differentially-expressed genes to GOrilla to avoid artefacts of our arbitrary log-fold threshold.
for cluster in ['5', '6', '11', '13', '17']:
# Save up-regulated genes with log-fold changes above 2:
marker_genes_narrow.gene_dfs['up'][cluster]["names"].to_csv(
f'data/markers_y1/up/cluster_{int(cluster):02}.txt',
index=False,
header=False
)
# Save all genes, both up- and down-regulated for GOrilla:
marker_genes_narrow.gene_dfs['all'][cluster].sort_values('logfoldchanges', ascending=False)["names"].to_csv(
f'data/markers_y1/all/cluster_{int(cluster):02}.txt',
index=False,
header=False
)
We'll go over the individual clusters one by one and investigate the biological processes associated with them in both g:Profiler and GOrilla. To guide us in the right direction, we'll consult with the paper's analysis and try to map our clusters to their descriptions and see what we can figure out for the extra cluster.
The way they categorize the fibroblast subpopulations is:
Let's see if we can align the clusters we have to that categorization.
We'll start with cluster 5, which seems to have a notable immune response functionality. We can see the relevant GO:BP terms in the g:Profiler screenshot and the relevant cluster in the GOrilla graph:
Cluster | g:Profiler | GOrilla |
---|---|---|
We can also take note of the RelA transcription factor here, otherwise known as TP65, part of the NF-kappa-B complex, which is "Essential for cytokine gene expression in T-cells":
iRegulon also predicts that RelA binds 45 of the upregulated targets in this group:
We can expand a few of the GOrilla clusters to see mostly chemotaxis-related proteins, which according to the study has to do with positioning leukocytes:
Significant examples include inflammatory response, cell chemotaxis or negative regulation of cell proliferation, necessary for the final anchoring of leukocytes
It's a very nice match to the paper's cluster 2, down to the fairly low significance values compared to the other clusters. The p-values do cross a significance threshold, and, as the paper notes, this cluster does not seem enriched in "standard" fibroblast processes like building the extracellular matrix.
Let's give a similar treatment to cluster 11. We can see that we have very high-confidence enrichment for "system development" and "developmental process", which strongly suggests that these are mesenchymal cells, able to differentiate into multiple different cell types.
Cluster | g:Profiler | GOrilla |
---|---|---|
While they do express genes related to extracellular matrix organisation, the main functionality of fibroblasts, they differentiate from the others based on this strong developmental component. The g:Profiler results also include the largest number of transcription factors associated with expression in these cell types:
iRegulon suggests somewhat different factors, though several of them seem development-related like TCF12, involved in neuronal differentiation, EP300, a chromatin-remodeling protein, and NFIC, reportedly "involved in cellular differentiation and stem cell maintenance":
Going forward, it'll be more difficult to find associated biological processes that differentiate the clusters. Looking at clusters 13 and 17, they both score highly significant hits into for ECM organisation and collagen binding. We can see that cluster 13 has some mesenchymal functions, which might suggest the "reticular" subtype.
Cluster | g:Profiler | GOrilla |
---|---|---|
At this point, the authors of the paper seem to have focused more on known marker genes rather than on GO annotations. Cluster 13 has several marker genes with a high log fold change compared to the other clusters that indicate its reticular subtype: WISP2, SLPI, CTHRC1.
Likewise, cluster 17 is mostly focused on the ECM and collagen binding, with some developmental processes mentioned, but also some processes closer to the outer layer of the skin like wound healing.
Cluster | g:Profiler | GOrilla |
---|---|---|
The marker genes upregulated in this cluster include APCDD1, ID1, COL18A1.
So how do we sort cluster 6? So far, we've been able to associate each cluster with one in the paper, either based on the functional annotation or based on specific marker genes. Deciphering the role of this group of cells will be a bit harder. We can start with some of the best-supported biological processes suggested by g:Profiler:
Cluster | g:Profiler |
---|---|
What we can see at the top is an immunity regulation function, but less significant than in the pro-inflammatory cluster 5. It's also mixed with ECM organisation regulation and developmental processes.
GOrilla shows several different clusters of functionality with low, but significant p-values:
Regulation | Catabolic processes | ECM and developmental |
---|---|---|
In terms of marker genes, APOE is a protein expressed in the central and peripheral nervous system (uniprot), PTGDE is present in the tissues of the blood-brain barrier and in leukocytes (uniprot), and the highest-expressed protein by log fold change, DPT, "May serve as a communication link between the dermal fibroblast cell surface and its extracellular matrix environment" (uniprot).
Given the broad spread of functionality, we might speculate that this is a type of signaling cell or possibly a multipotent cell that is able to differentiate into neurons. A KEGG pathway reference that has relatively high support for this cell type is "complement and coagulation cascades": KEGG. This involves both inflammation and ECM changes and might be coordinated by these kinds of cells. Their catabolic functionality may be coming from a larger ribosomal protein expression:
iRegulon finds support for TCF12 and EP300, both present in the mesenchymal cluster. It also suggests STAT4 (which is added to the network as "CAT") that binds 41 targets and is categorized as a "Signal transducer" that is "mainly expressed in hematopoietic cells that plays a critical role in cellular growth, differentiation and immune response".
It's hard to say anything for sure if we're not domain experts. The cell type could also be an artefact of the particular choice of clustering resolution and an abnormal amount of ribosomal protein capture. If we decrease the granularity of clustering, the group splits between clusters 5 (pro-inflammatory) and 11 (mesenchymal). This does lend support to the multipotency of these cells and explains its participation in the aforementioned KEGG pathway.
We've done this analysis on a single sample, but the paper merges the two young samples together. Let's do the same and see if cluster 6 turns out to be a feature of the sample and disappears, or if something stranger happens.
We'll read the y1 data we cleaned and marked with doublets and we'll run through the steps to clean up the y2 samples.
sc.settings.figdir = 'figures/harmony'
os.makedirs(sc.settings.figdir, exist_ok=True)
adata_y1 = sc.read(f"data/cleaned_y1.h5ad")
adata_y2 = sc.read(f"data/downloaded_y2.h5ad")
scrub = scr.Scrublet(adata_y2.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
min_counts=2,
min_cells=3,
min_gene_variability_pctl=85,
n_prin_comps=30
)
# Call it at 0.22:
predicted_doublets = scrub.call_doublets(threshold=0.22)
scrub.plot_histogram()
Preprocessing... Simulating doublets... Embedding transcriptomes using PCA... Calculating doublet scores... Automatically set threshold at doublet score = 0.51 Detected doublet rate = 0.2% Estimated detectable doublet fraction = 8.4% Overall doublet rate: Expected = 6.0% Estimated = 2.5% Elapsed time: 16.9 seconds Detected doublet rate = 1.7% Estimated detectable doublet fraction = 34.4% Overall doublet rate: Expected = 6.0% Estimated = 4.9%
(<Figure size 640x240 with 2 Axes>, array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>, <Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>], dtype=object))
adata_y2.obs['predicted_doublets'] = predicted_doublets
adata_y2.obs['doublet_scores'] = doublet_scores
sc.pp.filter_cells(adata_y2, min_genes=200)
sc.pp.filter_genes(adata_y2, min_cells=3)
adata_y2.var['mt'] = adata_y2.var_names.str.startswith('MT-')
adata_y2.var['ribo'] = adata_y2.var_names.str.startswith(("RPS", "RPL"))
adata_y2.var['hb'] = adata_y2.var_names.str.contains("^HB[^(P)]")
adata_y2.var.loc[adata_y2.var_names == 'HBEGF', 'hb'] = False
sc.pp.calculate_qc_metrics(
adata_y2,
qc_vars=['mt', 'ribo', 'hb'],
percent_top=None,
log1p=False,
inplace=True,
)
sc.pl.violin(
adata_y2,
['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo', 'pct_counts_hb'],
jitter=0.4,
multi_panel=True,
)
# Looks like we should pick a slightly higher threshold for this sample, let's go with 4000:
adata_y2 = adata_y2[adata_y2.obs.n_genes_by_counts < 4000, :].copy()
print(adata_y2.shape)
adata_y2 = adata_y2[adata_y2.obs.pct_counts_mt < 5, :].copy()
print(adata_y2.shape)
# We do have some blood cells to remove:
adata_y2 = adata_y2[adata_y2.obs.pct_counts_hb < 5, :].copy()
print(adata_y2.shape)
(2553, 17272) (2396, 17272) (2353, 17272)
# Save to disk to have a record of this step:
adata_y2.write('data/clean_y2.h5ad')
Now that we have two cleaned datasets with samples, we can merge them and perform the following normalization, scaling, and clustering steps to the combined dataset:
adata = ad.concat(
[adata_y1, adata_y2],
uns_merge="unique",
)
print(adata)
AnnData object with n_obs × n_vars = 5081 × 16599 obs: 'cell_id', 'sample_id', 'predicted_doublets', 'doublet_scores', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Save current snapshot before further rescaling
adata.raw = adata
# Find variable genes
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000, inplace=True)
# PCA
sc.pp.scale(adata, max_value=10)
sc.pp.pca(adata, n_comps=100, mask_var="highly_variable")
# We'll pick the same number of components as before, we'd expect the variances to be compatible:
PCA_COMPONENTS=45
sc.pp.neighbors(adata, n_pcs=PCA_COMPONENTS)
sc.tl.umap(adata, random_state=RANDOM_STATE)
We've calculated a UMAP embedding before applying harmony, so we can plot how the two samples fit together in terms of embedding density:
adata.obsm['X_umap_preharmony'] = adata.obsm['X_umap'].copy()
sc.pl.umap(adata, color="sample_id")
sc.tl.embedding_density(
adata,
basis='umap_preharmony',
groupby='sample_id',
key_added='density_X_umap_sample_id',
)
sc.pl.embedding_density(
adata,
basis='umap_preharmony',
key='density_X_umap_sample_id'
)
There are regions of low density, though the area that we'll later confirm represents the fibroblasts is mostly well-covered by both samples. Let's integrate the principal components with harmony regardless:
sc.external.pp.harmony_integrate(
adata,
'sample_id',
basis='X_pca',
adjusted_basis='X_pca_harmony',
)
sc.pp.neighbors(
adata,
use_rep='X_pca_harmony',
key_added='neighbors_harmony',
n_pcs=PCA_COMPONENTS,
random_state=RANDOM_STATE,
)
sc.tl.umap(adata, neighbors_key='neighbors_harmony', random_state=RANDOM_STATE)
sc.pl.umap(adata, color='sample_id')
2025-01-05 11:24:03,294 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans... 2025-01-05 11:24:06,022 - harmonypy - INFO - sklearn.KMeans initialization complete. 2025-01-05 11:24:06,899 - harmonypy - INFO - Iteration 1 of 10 2025-01-05 11:24:08,637 - harmonypy - INFO - Iteration 2 of 10 2025-01-05 11:24:10,180 - harmonypy - INFO - Iteration 3 of 10 2025-01-05 11:24:11,254 - harmonypy - INFO - Converged after 3 iterations
sc.tl.embedding_density(
adata,
basis='umap',
groupby='sample_id',
key_added='density_X_umap_sample_id',
)
sc.pl.embedding_density(
adata,
basis='umap',
key='density_X_umap_sample_id'
)
The samples' datapoints seem better mixed now. Let's rerun the leiden clustering function on this new sample and visualize a few of the resolutions:
perform_leiden(adata, resolutions=[0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 2, 2.5, 5], label="exploratory")
visualize_resolution_images('exploratory', [0.1, 0.9, 5])
WARNING: saving figure to file figures/harmony/umap_leiden_0.1_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_0.5_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_0.6_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_0.7_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_0.8_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_0.9_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_2_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_2.5_exploratory.png WARNING: saving figure to file figures/harmony/umap_leiden_5_exploratory.png
While the 0.1 resolution shapes up well with a few extra groups showing up, separating at 0.9 ends up splitting fibroblasts into a number of additional groups. Let's clean up doublets to avoid one source of noise:
# Clean doublet clusters
doublet_fractions = fraction_of_doublets(adata, resolution='leiden_5')
print(doublet_fractions.sort_values('doublet_fraction', ascending=False).head(10))
doublet_clusters = [
label
for label
in doublet_fractions[doublet_fractions['doublet_fraction'] >= 0.1]['cluster_label']
]
adata = adata[~adata.obs['leiden_5'].isin(doublet_clusters)].copy()
adata = adata[adata.obs['predicted_doublets'] == False].copy()
perform_leiden(adata, label='post_doublet', resolutions=[0.1, 0.9, 5])
visualize_resolution_images('post_doublet', [0.1, 0.9, 5])
cluster_label cell_count doublet_fraction 38 43 20 0.800000 50 54 11 0.545455 48 52 23 0.130435 30 36 8 0.125000 15 22 65 0.076923 44 49 63 0.063492 17 24 32 0.062500 43 48 19 0.052632 46 50 86 0.046512 51 55 23 0.043478 WARNING: saving figure to file figures/harmony/umap_leiden_0.1_post_doublet.png WARNING: saving figure to file figures/harmony/umap_leiden_0.9_post_doublet.png WARNING: saving figure to file figures/harmony/umap_leiden_5_post_doublet.png
Unfortunately, this doesn't seem to help. Note that cluster 6 is the previous clusters 7 and 8 made up of keratinocytes. We could broaden the resolution to try to maintain fibroblasts into 4-5 clusters, but even at this resolution, keratinocytes can't be separated anymore.
We can confirm that the top cluster consists of fibroblasts by using our MarkerGenes
class to search for the same marker genes as before:
marker_genes = MarkerGenes(adata, '0.9')
clusters_df = marker_genes.gene_clusters(['LUM', 'DCN', 'VIM', 'PDGFRA', 'COL1A2'], type='up')
print(f"Clusters: {np.unique(clusters_df['clusters'])})")
print(clusters_df)
Clusters: ['10' '11' '13' '17' '5' '8' '9']) clusters names logfoldchanges pvals_adj 27 10 LUM 2.612704 1.297550e-31 13 11 LUM 2.889503 2.979106e-51 69 11 PDGFRA 2.095949 2.159267e-18 64 13 LUM 2.840708 3.092165e-59 42 17 PDGFRA 3.099354 4.181933e-55 33 5 LUM 2.268180 1.571616e-49 1 8 LUM 4.666899 4.618369e-22 6 9 LUM 3.760450 5.626671e-47
marker_genes.gprofiler_lookup('13', ['GO:BP']).head(5)
source | name | description | p_value | |
---|---|---|---|---|
0 | GO:BP | extracellular matrix organization | "A process that is carried out at the cellular... | 1.968961e-29 |
1 | GO:BP | extracellular structure organization | "A process that is carried out at the cellular... | 2.221784e-29 |
2 | GO:BP | external encapsulating structure organization | "A process that is carried out at the cellular... | 2.506022e-29 |
3 | GO:BP | cell adhesion | "The attachment of a cell, either to another c... | 3.722681e-15 |
4 | GO:BP | collagen fibril organization | "Any process that determines the size and arra... | 1.088421e-12 |
Both marker genes and biological processes from GO suggest that the top cluster is, in fact, made up of fibroblasts. So why are they split into so many groups now?
One possibility that has to be acknowledged is some technical error of ours in mixing the samples, either through harmony or through the selection of principal components, QC thresholds, etc. It's hard to see where we've gone wrong, but it's certainly not impossible.
Provided we've performed the steps correctly, it could be that fibroblasts simply vary too much between samples compared to all the other cell types. Fibroblasts are a flexible cell type with the ability to differentiate into mesenchymal cells or into epithelial cells, as described by Thiery and Sleeman in "Complex networks orchestrate epithelial–mesenchymal transitions". Between-sample variability may be detectable by leiden clustering even after adjusting for batch effects with Harmony.
It's hard to say why the paper wouldn't run into similar issues when mixing replicates, however. This is a description of their fibroblast clustering:
For the second-level clustering of the fibroblasts (Supplementary Figs. 4a, b) we subsetted the cells and ran again the functions FindNeighbors() and FindClusters(). In this case, we used 20 PCA dimensions for both functions and a resolution of 0.5 for the FindClusters() function. For visualization we re-calculated the UMAP plot with RunUMAP() function with default parameters and using 20 PCA dimensions.
They reduced their number of principal components to 20, but this produces more or less the same results for us. The Seurat functions use a different algorithm, "shared nearest neighbour modularity optimisation", but we'd have to redo the analysis by using Seurat to make the comparison. No easy answer to be found at this time.
This skin sample analysis by Solé-Boldo et al is fascinating, showing the complex variation in cell roles in this outermost layer of the human body, our main contact with the outside world. Fibroblasts show remarkable flexibility and the capacity to take multiple roles: organizing skin tissues, repairing wounds, or differentiating into different types of cells.
While we had some trouble with sample integration, we successfully reproduced the four fibroblast subpopulations that the paper describes and identified them with good support from several different functional enrichment services. Both GO function annotation and transcription factor enrichment were useful in drawing the lines between cell types. And when all else fails, leaning on the paper for hints about specific marker genes worked out well.
The extra cluster that we found could be a mix of cells from the mesenchymal and pro-inflammatory groups, or it could have its own distinct role. In the first place, our taxonomy of types is mostly there to simplify things for our own inferences, and it's unsurprising that cells would rarely fit neatly into each box. Biology is messy, but as long as we accept that and apply our best efforts, we can draw out interesting conclusions both to enhance our understanding, and hopefully to improve our health and well-being.