TransWikia.com

How to identify which scaffold a read belongs to, inside a .sam file?

Bioinformatics Asked by fullmooninu on November 5, 2020

I have an assembly in a fasta file. By aligning the raw reads against it, we produced a .bam file, which I converted to .sam.

The alignment lines in the .sam file look like this:

A00321:42:HLLVYDSXX:2:2302:6153:3505    99      NODE_1_length_3415511_cov_137.721502    16      60      128M    =       607     742     CGATTAGTCCGGCCAAATCGCCGTCGAGCGCAATGAACATAACGGTCTTGCCCTCAGCGCGCAGCGCATCGGCCTTGGCGTCGATTGTGGAGTGCTCGACGCCCATGATGTCCATCATAGCACCATTG        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        RX:Z:TTGAGGGTATAGTAGT   QX:Z:FFFFFFFFFFFFFFFF   TR:Z:GACACCG    TQ:Z:FFFFFFF    BC:Z:AGTTGCAG   QT:Z:FFFFFFFF   XS:i:-10        AS:i:0  XM:Z:0  AM:Z:0  XT:i:1  RG:Z:over_1kb:LibraryNotSpecified:1:unknown_fc:0        OM:i:60

Split into the mandatory fields, a record looks something like this:

QNAME: A00321:42:HLLVYDSXX:1:1644:2248:3881
FLAG: 99
RNAME: NODE_1_length_3415511_cov_137.721502
POS: 1
MAPQ: 60
CIGAR: 1S127M
RNEXT: =
PNEXT: 536
TLEN: 386
SEQ:  ATCGGGTCTGACACCGCGATTAGTCCGGCCAAATCGCCGTCGAGCGCAATGAACATAACGGTCTTGCCCTCAGCGCGCAGCGCATCGGCCTTGGCGTCGATTGTGGAGTGCTCGACGCCCATGATGTC
QUAL: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I’m actually interested in the metadata. I want to know how the RX: and BC: fields are distributed across the scaffolds in the original assembly.

I assumed the .sam file already contains information about the assembly used to produce it. If I’m wrong, please correct me.

What I want to do is, for each read in the .sam file, find its position in the assembled scaffold and record: Read_ID, Scaffold_ID, Read_Position_Inside_Scaffold, RX, BC

Then I want to use that database to analyse the distribution of RX and BC inside each scaffold.

That’s what I want.

Ultimately, what I’m trying to do is evaluate the quality of my assemblies based on the barcode distribution.

I’m good at programming and parsing; I’m just having trouble figuring out where, inside the .sam file, I can find the scaffold and scaffold position of each read.

One Answer

What I want to do is, for each read in the .sam file, find its position in the assembled scaffold and record: Read_ID, Scaffold_ID, Read_Position_Inside_Scaffold, RX, BC

I'm just having trouble figuring out where, inside the .sam file, I can find the scaffold and scaffold position of each read.

You already listed all of those among the mandatory fields:

  • ReadID -> QNAME
  • ScaffoldID -> RNAME
  • Position -> POS (1-based leftmost mapped position on the scaffold)
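
As a rough illustration of that mapping, here is a minimal Python sketch that pulls the five requested values out of each alignment line. It assumes the .sam file is tab-delimited as the SAM format requires, and that RX and BC appear as optional TAG:TYPE:VALUE fields after QUAL, as in the example record in the question. The function names parse_sam_records and barcode_counts_by_scaffold are made up for this example, and a library such as pysam would be the more robust choice for real data.

```python
from collections import Counter, defaultdict


def parse_sam_records(lines):
    """Yield (read_id, scaffold_id, position, RX, BC) for each mapped read.

    `lines` is any iterable of SAM text lines, e.g. an open file handle.
    RX or BC is None when the tag is absent from a record.
    """
    for line in lines:
        if line.startswith("@"):  # header lines (@HD, @SQ, @RG, ...)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4 or rname == "*":  # read is unmapped: no scaffold to report
            continue
        # Optional fields look like TAG:TYPE:VALUE; the value may contain ':'
        tags = {}
        for field in fields[11:]:
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        yield qname, rname, pos, tags.get("RX"), tags.get("BC")


def barcode_counts_by_scaffold(lines):
    """Count how often each BC value occurs on each scaffold."""
    counts = defaultdict(Counter)
    for _qname, rname, _pos, _rx, bc in parse_sam_records(lines):
        if bc is not None:
            counts[rname][bc] += 1
    return dict(counts)
```

With a real file this could be used as `with open("reads.sam") as handle: rows = list(parse_sam_records(handle))`, where reads.sam is a placeholder file name. Note that POS is 1-based and refers to the first aligned base, so soft-clipped bases at the start of the read (for example the 1S in 1S127M) are not counted.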

Assembly evaluation is a broad topic, but other software you can use includes BUSCO, QUAST or something like LTRi. Here's a more in-depth guide by SciLifeLab in Sweden. Usually, when aligning reads back to an assembly, you care about how the mapped insert size distribution compares with the theoretical insert size distribution, which tells you about collapsed or expanded regions (i.e. structural misassemblies). This is something that tools like FRC_Align do.
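
To get a first look at the mapped insert sizes per scaffold, something like the following sketch could work. It is not how FRC_Align operates; it only collects the TLEN field (column 9) from properly paired primary alignments and summarises it with a median. It assumes a tab-delimited .sam file, and the function names are made up for this example.

```python
from collections import defaultdict
from statistics import median


def insert_sizes_by_scaffold(lines):
    """Collect observed template lengths (TLEN) per scaffold.

    Only properly paired, primary alignments are used, and only the
    leftmost mate of each pair (TLEN > 0), so each pair is counted once.
    """
    sizes = defaultdict(list)
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag, rname, tlen = int(fields[1]), fields[2], int(fields[8])
        if flag & 0x900:  # secondary (0x100) or supplementary (0x800)
            continue
        if not flag & 0x2:  # not flagged as a proper pair by the aligner
            continue
        if tlen <= 0:  # rightmost mate, or template length unavailable
            continue
        sizes[rname].append(tlen)
    return dict(sizes)


def median_insert_sizes(lines):
    """Return the median observed insert size for each scaffold."""
    return {
        scaffold: median(values)
        for scaffold, values in insert_sizes_by_scaffold(lines).items()
    }
```

Scaffolds whose median differs markedly from the library's expected insert size would be candidates for a closer look. Restricting to proper pairs is a simplification, since the aligner's proper-pair flag already excludes many of the discordant pairs that point to misassemblies.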

Answered by Bastian Schiffthaler on November 5, 2020
