TransWikia.com

Chromatin accessibility level on promoters with different CpG ratio calculation

Bioinformatics Asked by krushnach Chandra on August 22, 2021

I want to categorize promoters based upon high, intermediate and low-CpG content (high-CpG-density promoters (HCPs), intermediate-CpG-density promoters(ICPs), and low-CpG-density promoters (LCPs)).

So the data I have is for promoter which i have annotated 1000+/- around TSS and taken them as something less than 1kb is promoter region and beyond distal.

In term of tool i have used chipseeker to annotate and do the above step.

So now if i have to find the CpG density as i have mentioned above how do i get that?

One Answer

Below I import the peaks call in narrowbed format from macs2 into a GRanges object. If you have a bed file, you can just use rtracklayer for it:

library(rtracklayer)
library(BSgenome.Hsapiens.UCSC.hg38)

peaks_narrowbed = "https://www.encodeproject.org/files/ENCFF846JAO/@@download/ENCFF846JAO.bed.gz"

extraCols <- c(signalValue = "numeric", pValue = "numeric",
                          qValue = "numeric", peak = "integer")
gr <- import(peaks_narrowbed, format = "BED",extraCols = extraCols)

head(gr)

GRanges object with 6 ranges and 6 metadata columns:
      seqnames        ranges strand |
         <Rle>     <IRanges>  <Rle> |
  [1]     chr1   10049-10326      * |
  [2]     chr1 180686-181040      * |
  [3]     chr1 186648-186903      * |
  [4]     chr1 191293-191604      * |
  [5]     chr1 267872-268146      * |
  [6]     chr1 586110-586324      * |
                                                   name     score signalValue
                                            <character> <numeric>   <numeric>
  [1] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_1       135     7.10907
  [2] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_2       309    11.65888
  [3] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_3        81     5.40289
  [4] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_4        56     4.54981
  [5] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_5       145     7.39343
  [6] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_6       116     6.54035

Then we extract the sequences that are within these regions, you can of course redefine the region by extending it from the peak:

library(Biostrings)
SEQ = getSeq(Hsapiens,gr)
head(SEQ)

A DNAStringSet instance of length 6
    width seq
[1]   278 TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...CCCCAACCCCAACCCCAACCCCAACCCCAACCC
[2]   355 CTAACCCCTAACCCCTAACCCTAACCCTACCCTA...CTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGA
[3]   256 GCAACCACCTGAGCGCGGGCATCCTGTGTGCAGA...CCCCTTCTTTCCATTGGTTTAATTAGGAACGGG
[4]   312 CAATCAGCAGGGACCGTGCACTCTCTTGGAGCCA...CCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCA
[5]   275 TGGATGTGTGCATTTTCCTGAGAGGAAAGCTTTC...TTGTTTAGTTTGCGTTGTGTTTCTCCAACTTTG
[6]   215 GAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAG...CCAACTTTGTGCCTCATCAGGAAAAGCTTTGGA

This is the CG, and you can decide how you want to bin your regions:

freq = dinucleotideFrequency(SEQ,as.prob=TRUE)
hist(freq[,"CG"],br=100)

enter image description here

Correct answer by StupidWolf on August 22, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP