TransWikia.com

How do I read a SAM file into a string using ifstream in C++?

Bioinformatics Asked on April 15, 2021

I am trying to read a .sam file into a single string in C++. My end goal is to read this string into a vector, where tabs indicate separations between elements of the vector. After making a vector containing all of the data from the sam file, I will separate the relevant data types (chromosome name, locus, etc.) into their own vectors based on position in the master vector.
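As a rough illustration of the tab-splitting step described above, here is a minimal sketch. The helper name `split_on_tabs` is hypothetical and not part of any library. It assumes the whole file is already held in one `std::string` and that every tab marks a boundary between elements.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` at every tab character and
// returns the pieces in order. Empty fields between two consecutive
// tabs are kept; an empty field after a trailing tab is dropped,
// because std::getline finds nothing left to extract at that point.
std::vector<std::string> split_on_tabs(const std::string& text) {
    std::vector<std::string> fields;
    std::istringstream stream(text);
    std::string field;
    // std::getline with a custom delimiter reads up to the next tab.
    while (std::getline(stream, field, '\t')) {
        fields.push_back(field);
    }
    return fields;
}
```

Calling `split_on_tabs(sam_string)` would give the "master vector"; the per-column vectors could then be filled by stepping through it by index.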

The sam file was pre-fetched from the SRA database, split into paired-end reads, and aligned to the reference genome using the following Linux commands:

prefetch --output-directory ./ SRR
fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR.sra
bwa mem -M reference_genome.fasta SRR_1.fastq SRR_2.fastq > SRR.sam

To make the file easier to read in C++, I altered the .sam file in Linux by removing the header, replacing newline characters with tabs, and replacing gaps between "columns" with true tabs rather than spaces. Since the number of spaces between each column was variable, I accomplished this last part by replacing successively smaller numbers of spaces (12, 11, 10, etc) with tabs. Here is the Linux command line code for the three steps listed above:

grep -v '^@' SRR.sam > headerless_SRR.sam

# Condense sam files into a single line
sed 's/\n/\t/g' headerless_SRR.sam > condensed_SRR.sam

# Replace all spaces in sam files with tabs. Since some columns have multiple spaces, the spaces must be replaced in progressively smaller chunks
sed 's/            /\t/g' condensed_SRR.sam > tch_SRR.sam     # "tch" is tabbed, condensed, headerless
sed 's/           /\t/g' tch_SRR.sam > tch2_SRR.sam     # Repeat this line with progressively smaller numbers of spaces until all spaces are replaced with single tabs
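For comparison, the SAM specification defines alignment lines as TAB-delimited, with header lines starting with `@` and the reference name (RNAME) and position (POS) as the third and fourth fields. If the file written by bwa follows that layout, the preprocessing above might be replaceable by reading the original file line by line in C++. The sketch below is one way to do that under that assumption; the `AlignmentSummary` struct and the `read_alignments` function are hypothetical names chosen for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type holding only the two columns of interest.
struct AlignmentSummary {
    std::string rname;  // reference (chromosome) name, SAM column 3
    std::string pos;    // leftmost mapping position, SAM column 4
};

// Reads SAM text from any input stream, skips header lines, and
// collects RNAME and POS from each alignment line.
// Assumes fields are separated by single tab characters.
std::vector<AlignmentSummary> read_alignments(std::istream& sam) {
    std::vector<AlignmentSummary> records;
    std::string line;
    while (std::getline(sam, line)) {
        // Header lines begin with '@'; blank lines carry no data.
        if (line.empty() || line[0] == '@') {
            continue;
        }
        std::istringstream fields(line);
        std::string qname, flag, rname, pos;
        // Pull the first four tab-separated fields; lines with fewer
        // than four fields are silently skipped.
        if (std::getline(fields, qname, '\t') &&
            std::getline(fields, flag, '\t') &&
            std::getline(fields, rname, '\t') &&
            std::getline(fields, pos, '\t')) {
            records.push_back({rname, pos});
        }
    }
    return records;
}
```

Passing a `std::ifstream` opened on `SRR.sam` to `read_alignments` would process one line at a time, so the whole file never has to sit in a single string.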

To open the resulting file and read it into a string, I wrote the following C++ program, compiled it with g++ on Linux, and ran the resulting .out file with "./":

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>

int main(){
    // Creates an input file stream object and opens the sam file for reading
    std::ifstream sam_file;
    sam_file.open("Sample_Sam.txt", std::ifstream::in);
    // Reads the first character in the sam file and stores it in a char
    char sam_character = sam_file.get();
    // Creates a string into which characters can be pushed
    std::string sam_string;
    // While the stream is still good, append the current character to the string and read the next character from the file
    while(sam_file.good()){
        sam_string.push_back(sam_character);
        sam_character = sam_file.get();
    }
    sam_file.close();
}

When I run the program on a dummy sam file containing random blocks of data, it runs fine, and I can even write the resulting string to std::cout or to another file. However, when I run it on a "tch_SRR" sam file, the program hangs without producing any output. Does anyone know how I could fix this?
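One variation that might help narrow this down is to check that the file actually opened and to copy it in bulk rather than one character at a time. The sketch below is not a confirmed fix for the hang. The function name `read_file_to_string` is hypothetical, and it assumes the file fits in memory.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the full contents of `path` as one
// string, or throws std::runtime_error if the file cannot be opened.
std::string read_file_to_string(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        throw std::runtime_error("could not open " + path);
    }
    std::ostringstream buffer;
    // Streams the file's entire buffer into `buffer` in one call.
    buffer << in.rdbuf();
    return buffer.str();
}
```

Printing the size of the returned string right after the call would show whether the read itself finishes or whether the time is being spent elsewhere.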
