
How to annotate gene length to a list of gene symbols using UCSC data?

Bioinformatics Asked by DN1 on February 11, 2021

I have a list of HGNC gene symbols and want to get the length of each gene. I already describe these genes using features from many UCSC datasets, so I am wondering whether there is a UCSC dataset I can also get gene length from.

I’ve been looking through the data downloadable from the UCSC Table Browser, aiming to find the start and end of each gene and subtract them to get gene length. However, there are a lot of files, and I’m not sure which dataset to use that will also match my HGNC gene symbols.

One Answer

You could answer this with a few command-line operations. This assumes the hg38 assembly and that convert2bed (part of BEDOPS) is installed.

First, get a list of genes from GENCODE:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > genes.bed

Then use grep to filter those genes with your list of HGNC symbols:

$ grep -wFf hgnc_symbols.txt genes.bed > filtered_genes.bed
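One caveat with the grep filter above: `-w` matches a symbol anywhere on the line, so a symbol such as A1BG would also keep the row for A1BG-AS1, because the hyphen counts as a word boundary. A minimal sketch of a stricter filter follows. It assumes that the attributes string sits in column 10 of the BED output and contains a `gene_name=` entry, as in GENCODE GFF3; the demo file names and the two sample rows are made up for illustration.

```shell
# Hypothetical demo inputs: one wanted symbol and two BED rows in the
# 10-column layout assumed for convert2bed output from GFF3.
printf 'A1BG\n' > demo_symbols.txt
printf 'chr19\t100\t600\tgene1\t.\t-\tHAVANA\tgene\t.\tID=gene1;gene_name=A1BG\n' > demo_filter_genes.bed
printf 'chr19\t500\t900\tgene2\t.\t+\tHAVANA\tgene\t.\tID=gene2;gene_name=A1BG-AS1\n' >> demo_filter_genes.bed

# First file: load the symbols into a lookup table.
# Second file: keep only rows whose gene_name value matches a symbol exactly.
awk -F'\t' '
    NR == FNR { wanted[$1]; next }
    {
        n = split($10, attrs, ";")
        for (i = 1; i <= n; i++) {
            # "gene_name=" is 10 characters, so the value starts at position 11
            if (attrs[i] ~ /^gene_name=/ && (substr(attrs[i], 11) in wanted)) {
                print
                break
            }
        }
    }
' demo_symbols.txt demo_filter_genes.bed > demo_filtered.bed
```

With real data, `hgnc_symbols.txt` and `genes.bed` would take the place of the two demo inputs.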

You can run this through awk to get lengths:

$ awk -vFS="\t" '{ print $3-$2 }' filtered_genes.bed > filtered_gene_lengths.txt
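The command above writes lengths only, so the output has to be matched back to symbols by line order. Since BED start coordinates are 0-based and end coordinates are exclusive, `$3-$2` is the full span of the gene. As a sketch, the symbol could be printed next to each length. This again assumes the attributes string is in column 10 and carries a `gene_name=` entry; the demo file and its coordinates are invented for illustration.

```shell
# Hypothetical demo rows in the 10-column layout assumed for convert2bed output.
printf 'chr1\t100\t600\tgene1\t.\t+\tHAVANA\tgene\t.\tID=gene1;gene_name=GENE_A\n' > demo_length_genes.bed
printf 'chr2\t1000\t4000\tgene2\t.\t-\tHAVANA\tgene\t.\tID=gene2;gene_name=GENE_B\n' >> demo_length_genes.bed

# Print "symbol<TAB>length" per row; fall back to NA if no gene_name is found.
awk -F'\t' -v OFS='\t' '{
    name = "NA"
    n = split($10, attrs, ";")
    for (i = 1; i <= n; i++) {
        # "gene_name=" is 10 characters, so the value starts at position 11
        if (attrs[i] ~ /^gene_name=/) {
            name = substr(attrs[i], 11)
        }
    }
    print name, $3 - $2
}' demo_length_genes.bed > demo_gene_lengths.txt
```

Running the same awk program on `filtered_genes.bed` would give a two-column table that can be joined to other per-gene features by symbol.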

Answered by Alex Reynolds on February 11, 2021
