Basic question about finding mRNA sequence in transcriptome

Question

I am a computer scientist just starting out in bioinformatics, and would appreciate any guidance:
I have an mRNA sequence (an isoform) that is about 4000 base pairs (bp) long; its origin is provided in the link.
I also have a bam file/fasta file of the entire transcriptome of a kidney cell (see Data Access tab).
My objective is to quantify the expression level of this mRNA sequence in the transcriptome.
How does one get started? Should I use the original format (TenX bam, about 10 GB) or SRA archive data (fasta, about 3.5 GB)? Why are the files so different in size?
I have a MacBook with samtools installed and 16 GB of RAM. Is this enough to process the data, and what basic commands should be used to quantify how much of the transcriptome overlaps the 4000 bp sequence?

geek_y · Answer

Basically, you want to know the expression of the VHL gene in that 10x dataset. The authors have provided the barcode, matrix and feature files as supplementary files on GEO (GSE131685). You can use those files to build the gene expression matrix, as described on the 10x website:
library(Matrix)

# Directory holding the three 10x output files (adjust to your own path)
matrix_dir <- "/opt/sample345/outs/filtered_feature_bc_matrix/"
barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")

# Sparse count matrix: rows are features (genes), columns are cell barcodes
mat <- readMM(file = matrix.path)

feature.names <- read.delim(features.path,
                            header = FALSE,
                            stringsAsFactors = FALSE)

barcode.names <- read.delim(barcode.path,
                            header = FALSE,
                            stringsAsFactors = FALSE)

# V1 of the features file holds the feature IDs; V2 holds the gene symbols
colnames(mat) <- barcode.names$V1
rownames(mat) <- feature.names$V1

The resulting mat object can then be loaded into Seurat for any downstream analysis. All of this can be done on a laptop.
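If you would rather see what pulling a single gene out of those three files involves, here is a minimal Python sketch that uses only the standard library (it is not the R/Seurat route, just an illustration). It assumes the usual 10x layout: features.tsv.gz has the gene symbol in its second column, barcodes.tsv.gz has one barcode per line, and matrix.mtx.gz is in Matrix Market coordinate format with integer counts. The function name gene_counts_per_cell is made up for this example.

```python
import gzip
from collections import defaultdict


def gene_counts_per_cell(matrix_dir, gene_symbol):
    """Return {cell_barcode: count} for one gene from 10x-style matrix files."""
    # Assumption: column 2 of features.tsv.gz holds the gene symbol.
    with gzip.open(f"{matrix_dir}/features.tsv.gz", "rt") as fh:
        symbols = [line.rstrip("\n").split("\t")[1] for line in fh]
    # Matrix Market indices are 1-based; a symbol may match more than one row.
    rows = {i + 1 for i, s in enumerate(symbols) if s == gene_symbol}
    if not rows:
        raise KeyError(f"{gene_symbol} not found in features.tsv.gz")

    with gzip.open(f"{matrix_dir}/barcodes.tsv.gz", "rt") as fh:
        barcodes = [line.strip() for line in fh]

    counts = defaultdict(int)
    with gzip.open(f"{matrix_dir}/matrix.mtx.gz", "rt") as fh:
        # Skip the '%' header/comment lines; the first other line is the
        # "n_rows n_cols n_entries" size line, which is consumed here too.
        for line in fh:
            if not line.startswith("%"):
                break
        # Remaining lines are "row col value" triplets.
        for line in fh:
            if not line.strip():
                continue
            row, col, value = line.split()
            if int(row) in rows:
                counts[barcodes[int(col) - 1]] += int(value)
    return dict(counts)
```

Calling gene_counts_per_cell(matrix_dir, "VHL") would return only the cells with a nonzero count for that gene; every other barcode has an implicit count of zero, because the matrix is stored sparsely.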
If you want to re-quantify the gene expression for whatever reason, you need to re-run the Cell Ranger pipeline with a GTF of your choice (which needs some preparation). That may not work on your laptop: it requires a lot of memory and long running times, so it is best submitted as a job on a cluster.

swbarnes2 · Answer

Should I use the original format (TenX bam, about 10 GB) or SRA
archive data (fasta, about 3.5 GB)? Why are the files so different in
size?

I strongly recommend you hold your horses; how do you think you can analyze something if you don't even know what you are looking at?
Since this is a "TenX" bam, you can probably do the job with samtools and grep. Looking up the 10x Genomics bam tags might be helpful, and you might need to look up SAM flags too.
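To make the tags-and-flags idea a bit more concrete, here is a rough Python sketch of the same filtering done on the text that samtools view prints. It assumes the bam carries the Cell Ranger tags CB (corrected cell barcode), UB (corrected UMI) and GN (gene name, possibly a semicolon-separated list); check a few records of your own file before trusting that. The function name count_gene_umis is hypothetical, and skipping unmapped (0x4) and secondary (0x100) records is just one reasonable choice of SAM flag filter.

```python
def count_gene_umis(sam_lines, gene_name):
    """Count reads and distinct (cell barcode, UMI) pairs tagged with gene_name."""
    n_reads = 0
    molecules = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100:  # unmapped or secondary alignment
            continue
        # Optional tags start at column 12 and look like TAG:TYPE:VALUE.
        tags = {}
        for field in fields[11:]:
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        # Assumption: GN may list several gene names separated by ';'.
        if gene_name not in tags.get("GN", "").split(";"):
            continue
        n_reads += 1
        # Reads sharing a barcode and UMI come from the same molecule.
        if "CB" in tags and "UB" in tags:
            molecules.add((tags["CB"], tags["UB"]))
    return n_reads, len(molecules)
```

One way to use it is to pipe samtools view output into a small script that calls count_gene_umis(sys.stdin, "VHL"). It streams line by line, so memory should not be the bottleneck on a laptop, although a pass over a 10 GB bam will take a while.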

My objective is to quantify mRNA expression levels of the mRNA
sequence in the transcriptome.

You understand that this isn't really possible? With spike-ins, maybe, but you don't have those. All you can do is work out the proportion of reads belonging to that transcript (or gene, really, unless you want to put in a great deal more work, beyond what I believe 10x Genomics does). Which I guess you want to do by lumping all the cells together and just pretending there is no QC to do?
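In other words, what you end up with is a relative number. A minimal sketch of that calculation, assuming you already have per-cell counts for the gene and per-cell totals over all genes (both dictionaries are hypothetical inputs, as is the function name), could look like this. Scaling to counts per 10,000 is a common convention, not something this dataset prescribes.

```python
def relative_abundance(gene_counts, total_counts, scale=10_000):
    """Per-cell share of counts coming from one gene, scaled to `scale`."""
    result = {}
    for cell, total in total_counts.items():
        if total == 0:
            # A cell with no counts at all carries no information; skip it.
            continue
        # Cells missing from gene_counts had zero counts for the gene.
        result[cell] = scale * gene_counts.get(cell, 0) / total
    return result
```

This only says what fraction of each cell's captured molecules came from the gene. It says nothing about absolute copy number, and it is only as good as the QC that decided which cells to keep.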
