PRC2 Targets in CpG Islands

Original paper: Moqri, Mahdi et al. "PRC2-AgeIndex as a universal biomarker of aging and rejuvenation." Nature communications vol. 15,1 5956. 16 Jul. 2024, doi:10.1038/s41467-024-50098-2.

The paper by Moqri et al looks at low-methylation regions (LMRs) that are enriched with PRC2 targets. The particular CpG islands that are bound by this transcriptional repressor seem to gain methylation with age at a very consistent rate and the authors suggest that it could be used as a universal biomarker of cellular aging. A really nice illustration of PRC2 binding is available in the graphical abstract of a different paper by Prorok et al, "Loss of Ezh2 function remodels the DNA replication initiation landscape", Cell Reports vol. 42, issue 4, 2023, doi:10.1016/j.celrep.2023.112280:

1-s2.0-S2211124723002917-fx1.jpg

The uniprot entry for EZH2 describes the mechanism very succintly:

Catalytic subunit of the PRC2/EED-EZH2 complex, which methylates 'Lys-9' (H3K9me) and 'Lys-27' (H3K27me) of histone H3, leading to transcriptional repression of the affected target gene. Able to mono-, di- and trimethylate 'Lys-27' of histone H3 to form H3K27me1, H3K27me2 and H3K27me3, respectively. Displays a preference for substrates with less methylation, loses activity when progressively more methyl groups are incorporated into H3K27, H3K27me0 > H3K27me1 > H3K27me2 (PubMed:22323599, PubMed:30923826).

In this project, we'll focus on only one of the samples of their ChIP-seq assay. It comes from epidermis cells from an 84-year-old donor and the antibody used for the pulldown was for EZH2. We won't go into the differential analysis, because the paper compares methylation of the LMRs, rather than differential expression. Instead, we'll investigate the binding sites of EZH2 across the human genome.

Table of contents:

  1. Technical preparation
  2. Sequence download
  3. Quality control
  4. Alignment
  5. Coverage
  6. Peak calling
  7. Peak comparison
  8. Motif discovery
  9. Gene discovery
  10. Closing thoughts

Technical preparation

Before we download any samples, we'll prepare a directory structure of where we'll be storing intermediate and output files. To debug possible issues, we'll collect some logs and figures that we can include in the notebook.

We'll use a GRCh38 (hg38) genome assembly from NCBI: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/. This has been compiled into a bowtie2 reference that we've stored on the gbiomed server under /mnt/storage/r0978323/. A few other tools have been installed there and will be added to the PATH, to make sure we have newer versions that can handle the dataset. For instance, FastQC v0.11.5 raises warnings that it is parsing the FASTQ files incorrectly, but the one we're using, v0.12.1, doesn't have any issues with it.

A copy of the full output from the notebook can be found under /mnt/storage/r0978323/project_data/prc2_age_index/output. Since many of the steps may take a long time to run, the fastqc output files or exploratory leiden images can be evaluated from that directory instead.

Sequence download

We'll start by downloading the replicate sample and its input, performing quality control and aligning the two fastq files to hg38. The two files can be found on GEO:

Sample SRA id SRA run id URL
Old replicate 1 SRX23335288 SRR27667476 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM8027713
Old replicate 1, input SRX23335283 SRR27667481 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM8027708

The GEO accession also includes some of the processed files, like the called peaks: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE253773. We'll later take those and compare them with our results to validate what we've done.

We'll start with the replicate fastq file to examine the results step by step. Then we'll prepare the input and move on to coverage and peak calling. The file's reads are paired, so we'll run fastq-dump with the recommended --split-3 parameter to get reads in both directions. A third file doesn't seem to be created, so we can be confident that all the reads are correctly paired up.

Quality control

The number of reads in the file is 21,073,282 which matches the SRA information exactly. To see what we have, we'll run FastQC on this downloaded file and look for fixable issues.

Unfortunately, the raw set of reads includes adapters that we'll need to clean up: Overrepresented sequences

The paper uses TrimGalore to automatically cut adapters from the sequence. It's a perl script that uses cutadapt underneath and performs other quality checks. It provides a way to run it with paired fastq sequences, so we'll provide both and then go back to examining the resulting files with FastQC:

We are now left with 20,949,697 reads, some number of them being removed after adaptor cuts made them too short. There's a few potential issues that FastQC flags:

Per-base sequence content Per-sequence duplication percentage Per-sequence GC content
per_base_sequence_content.png per_sequence_duplication.png per_sequence_gc_content.png

Since we are looking at a ChIP-seq experiment, we can expect that some motifs will be present at the edges of the reads, so it's natural that there is some variability at the beginning and end of the sequences. This also explains why there is some percentage of sequence duplication and possibly the GC content distribution. The PRC2 complex is known to bind close to CpG islands, so that might also be skewing the GC percentages.

Alignment

Now that we have trimmed fastq files, we can align them to GRCh38. We'll provide both trimmed fastq files as pairs and produce a temporary SAM file with the aligned reads.

Let's convert the SAM files to binary and take a look at the mapping using samtools flagstat:

There is an overall alignment rate of 93.10%, which is not bad. There is some discordant mapping and some unmapped reads. We'll remove them by filtering out the unmapped and secondary alignments and putting the filtered results in a "clean" file:

We no longer have unmapped or discordant sequences in the resulting file. The number of reads is 39,009,846, which divided by two (for both reads in a pair) makes 19,504,923 reads for us to work with.

Let's now sort the BAM file and index it, so we can view it in the Integrated Genome Viewer (IGV).

If we've reached this point, we should be able to remove the fastq files to save on space:

Now we can do the same for the input sequences. We'll loop over both files, just in case we want to ignore the individual steps above and run this combined one. The script will check if a fastq file is already downloaded or processed before attempting to run the time-consuming steps on it.

We won't run the initial FastQC before trimming, but we'll perform a FastQC analysis after that and store it in the fastqc/ directory.

The FastQC report of the input reads doesn't suggest any different issues than the replicate, so we'll move forwards with the next steps. The number of reads of the downloaded file is 23,622,543, which goes down to 23,102,060 after alignment (97.96% overall) and removal of unmapped reads.

Coverage

To avoid having to load all reads into IGV, we'll produce a BigWig file with just the coverage. We'll use a small bin size of 5 to capture some level of detail.

For the effective genome size, we took the advice of the deeptools documentation to run faCount on our genome fasta file:

There are two common alternative ways to calculate this:

  1. The number of non-N bases in the genome.
  2. The number of regions (of some size) in the genome that are uniquely mappable (possibly given some maximal edit distance). Option 1 can be computed using faCount from Kents tools.

We downloaded the tool and ran it on our HG38 fasta file:

The total sums of non-N bases is:

We can now take a look at our BigWig file and compare it with the peaks we can see ourselves on the read alignment. Some screenshots from IGV show reads stacking around the transcription start sites of several genes:

HOXA13 ← DAB2IP → ELOVL2 ←
2025-01-03-115556_1920x1080_scrot.png 2025-01-03-120035_1920x1080_scrot.png 2025-01-03-120205_1920x1080_scrot.png

The ELOVL2 gene is mentioned in our main paper, while DAB2IP was picked as a target from Min et al and HOXA13 from Buschbeck et al. A number of other homeobox genes are also suppressed by the PRC2 complex, as we'll see a bit later.

The BigWig peaks seem to align very nicely with the reads. Now we have to determine which of these are actually significant by comparing the replicate with the input control file.

Peak calling

We'll use macs2 to call peaks based on the read counts with the input reads given as a control. We'll use the same effective genome size as we did for coverage generation and a q-value threshold of 0.05.

We'll execute the given R script that describes the macs2 peak-calling model and examine the images in a later cell below:

As we can see below, the number of peaks we find is 12,344. It's a fairly large amount, but we're looking at a complex that acts quite broadly on the genome, so it's to be expected. We can see from the initial illustration that it doesn't bind directly to DNA sequences but to histones, with preference for low-methylated CpG islands.

We'll create a bed file that we'll use for functional analysis and finish this section by generating a heatmap that visualizes the distribution of normalized reads around the peaks. First, we need to compute a matrix of normalized counts which will take about 15 minutes. In the next cell, we'll generate the actual heatmap.

Heatmap Peak model Cross-correlation
ezh2_old_peaks.png ezh2_old_model_peaks.png ezh2_old_model_cross_correlation.png

The heatmap for EZH2 peaks across the genome is on the left. We can see a very sharp peak in the middle, but a fairly wide tail of other binding around it in a 3kbp radius. It's an interesting shape, because it's somewhat inbetween a narrow and a broad binding pattern. If we explore some of the bindings in IGV, we can sometimes see peaks called both at the starts of genes, and shorter ones in their bodies:

2025-01-04-101532_1920x1080_scrot.png

The model for forward and reverse reads shows them being distributed around 150bp before and after the binding site, confirmed by the cross-correlation measurement pinning the lag between them at 282bp. The model suggests a bimodal distribution of reads, as described by Wilbanks et al in "Evaluation of Algorithm Performance in ChIP-Seq Peak Detection", doi:10.1371/journal.pone.0011471. Even though the complex binds to histones, it seems to bind areas with some specificity and act more like a DNA-binding factor.

Peak comparison

Now that we have a narrow peaks file, we would like to see how it compares to the processed file uploaded by the authors of the study. The GEO accession includes their narrow peak files as well, which we've placed in /mnt/storage/r0978323/project_data/prc2_age_index/downloaded. We'll symlink the files here and upload them to IGV to compare.

We can start by comparing the number of peaks detected between the files:

The old peak replicate that we downloaded has more than twice the peaks we detected! The neonatal sample has more than twice that amount, too.

The fact that the neonatal ChIP-seq would produce more binding sites is to be expected. This complex binds low-methylated regions that increase in methylation with age. It's natural that it would bind to fewer and fewer places over time.

It's harder to judge why our old sample is different from theirs. Let's try a visual inspection of a particular binding site, DAB2IP:

2025-01-03-141452_1920x1080_scrot.png

At the top, we can see the BigWig coverage that we generated, then from top to bottom, we have the downloaded neonatal peaks, the downloaded old peaks, and our own peaks. The neonatal peaks agree in the central area, but add more binding sites in the UTR and first intron of the gene. If we zoom out and render a track with the reads stored in the bam file...

2025-01-03-142105_1920x1080_scrot.png

There's certainly some visible peaks in the downloaded sample that our tool did not consider to be significant. An attempt was made to loosen the q-value threshold from 0.05 to 0.10, but that only resulted in about 200 more peaks. The paper does not specify a q-value threshold, but the default is 0.05, so it's hard to say if that's the reason for the discrepancy. They used a slightly older version of MACS2, but it seems unlikely that their algorithm would change differential calling like that. Here's what they have to say about peak calling in their "Methods" section:

The following processing steps were conducted as per the ENCODE3 ChIP-Seq Pipeline. [...] MACS2 (version 2.2.7.1) callpeak was run on each individual tissue with replicates merged for each tissue/passage, and a fold enrichment signal track was built using the bdgcmp (narrowpeak files were also called for each replicate separately for differential binding analysis, see next section).

The "next section" goes into differential analysis, but does not elaborate on the specifics of their calling method. The bdgcmp command should deduct noise, so it seems more likely to remove peaks than to call extra ones. One thing that might be causing the discrepancy is that we mapped the reads to a genome that includes scaffolds that are not mapped to chromosomes, which may have diverted some reads that would have been stacked in our missing regions.

Still, the peaks we see in IGV are close to the ones we call, so we would expect that the differences should not change the gene list we end up too much. Let's start by inspecting enriched motifs in RSAT and then find a list of genes close to the binding sites through GREAT.

Motif discovery

We'll start by extracting FASTA sequences for the bed file we created earlier. We'll use the same genome we used for mapping:

RSAT returns a lot of interesting information about the kinds of motifs that the PRC2 complex associates with. We can start with sequence composition:

2025-01-03-190132_1920x1080_scrot.png

Peak lengths seem to separate into two clusters, one with ~280bp and one with 1000bp peaks. It's a bit odd to see one peak of length an entire kilobasepair, but we can imagine that some regions of heterochromatin might be densely bound with PRC2. In order to limit the results to the subset of short, more targeted bindings, we tried RSAT's "Reduce peak sequences" option to cut peaks to +/-150bp around the center, but it didn't produce notable changes in the results:

2025-01-04-102209_1920x1080_scrot.png

What we can say is that the central region around the peaks is very even in terms of nucleotide composition with a noticeable bias towards C and G sites. The most common dinucleotide pairs are GG and CC with GC close behind. Only CG sites are fairly depleted, likely due to CpG methylation, but they're still more common than GT, TA, and AT, suggesting a high GC content and a presence of (yet) unmethylated CpG islands. The found motifs also show a strong preference for stretches of C and G:

Motifs part 1 Motifs part 2
2025-01-04-103030_1920x1080_scrot.png 2025-01-04-103033_1920x1080_scrot.png

Some of the logos show a very strong consensus with almost 2 bits of information for CGGAGC and CCCCTC, for instance. Others show weaker per-base informational content, but are quite broad, like positions_7nt_m1 running for 24 bases and made up exclusively of Gs and Cs, supporting the documented preference of the complex for CpG islands.

Gene discovery

RSAT doesn't find any genes for these motifs in the databases we tried, likely because they're calibrated for transcription factors, where PRC2 is a repressive complex that binds to histones. We can still find target genes by using the online tool GREAT.

In order to submit our bed file to GREAT, we'll need to relabel it, since NCBI's human genome uses labels like NC_ for chromosomes, while NT_ and NW_ represent contigs or scaffolds (HGVS Nomenclature). We'll run a script to relabel the chromosomes and remove non-chromosomal reads:

In the search settings, we'll remove the distal extension of 1000kbp completely by setting it to 0. We'd only like to find genes that PRC2 represses directly. The biological processes GREAT suggests for these peaks are mostly related to development and cell differentiation:

2025-01-04-105014_1920x1080_scrot.png

This is only the top 20 processes, but we can already see very strong significance values. We've stored the full TSV of GO:BP results and we can explore it to see that around 30% of the found processes are related to development, differentiation, or morphogenesis:

The Polycomb Repressive Complex 2 is part of the polycomb complexes known from Drosophila to suppress Hox genes, so seeing these terms enriched makes perfect sense. The list of genes that GREAT finds is 1912 entries long:

2025-01-04-105858_1920x1080_scrot.png

It's easier to look through it programmatically. We've saved a copy of the two lists, gene-to-region and region-to-gene, and we can look for some particular targets important to development and cell differentiation:

If we were in doubt that PRC2 interacts with organismal development and cell differentiation, these are just a few examples that might dispel that.

It's interesting to check whether the peak file we downloaded from the GEO accession finds the same genes. Recall that it listed more detected peaks, but we assumed those peaks were co-located next to ours. We can upload a bed file from the downloaded narrowPeak file and see how many results we get from that:

Frustratingly, it seems there's around 900 more genes that are found with these extra peaks:

2025-01-04-110834_1920x1080_scrot.png

We can download that list and rerun the above greps to investigate the targets:

It's interesting, because the second list is not a complete superset of the first. For example, our peaks are missing HOXA11 and HOXC11, but the downloaded peaks miss HOXC12. Here's an IGV visualization of where peaks are located relative to HOXA11:

2025-01-04-111652_1920x1080_scrot.png

We can see that in this case it's a matter of degree -- our peaks are there, but they're far enough that GREAT does not consider them as interfering with transcription. A stricter sensitivity of peak detection might mean we miss this (possible) interaction.

HOXC12 is an odd case:

2025-01-04-111854_1920x1080_scrot.png

Both our peaks and the downloaded ones seem to cover and repress HOXC12, but GREAT only shows this as a result in our peaks and now the downloaded ones. What might be happening is that GREAT only considers the center of the peak, so the broad downloaded one isn't detected, even if it visibly covers the entire area.

Closing thoughts

Building our own bowtie2 database out of the downloaded NCBI genome might have made our lives more complicated than they needed to be. Sticking to UCSC could have saved us some work and might have made our results more compatible with the ones in the study. Still, it was a useful experience, a peek into the complexity of the work that goes into deriving inferences from genomic data.

The ChIP-seq experiment not being a classic transcription factor gives us a different view on the results than we would get from a TP53 assay, for instance. The bindings we found were mostly punctate around transcription starts, but they would also spread across genes, which seems like a reasonable pattern for a histone-bound repressor. Motif enrichment gave us strong support for PRC2 binding in CpG islands, and gene enrichment showed us very significant associations with developmental genes, supporting our previous knowledge of the complex being part of the polycomb group proteins with vital roles in epigenetic silencing during and after embryonal development.

While this part of the analysis is just a small part of the original study, it was an interesting exploration. There are some remaining questions about the exact concordance of our results with the ones in the paper, but ultimately, our qualitative inferences are consistent with what we can find in the literature.