Converting aligned fasta to plink ped/bed

Question

I have an alignment of multiple sequences in a FASTA file (output from MAFFT), for which I would like to simulate a phenotype using plink, but for that I need to have my alignment in a PED file or plink BED. There’s plenty online about converting in the other direction, but the only mention I found of converting FASTA to BED is about aligning a raw FASTQ file for use in plink – my data is already aligned! How can I do this conversion?

Note that "How to convert FASTA to BED" is about the UCSC BED format, not the plink binary PED.

fasta format conversion plink


PPK · Accepted Answer

An alignment can be the result of two slightly different analyses:
There is multiple sequence alignment (which is what you get from MAFFT) where sequences are aligned so that similar regions are on top of each other. This may require introducing indels (insertions / deletions) if a particular region is absent in some of the sequences.
Then there is alignment to a reference sequence. This is how variant data is usually obtained, and it is the use case most tools are built for. That is why the solution in the link you provided includes aligning the data to a reference sequence. Usually the reference sequence is much larger (e.g. the human genome), and the result of the alignment is a file (.bam) that tells you where the query sequences match in the reference.
I did a quick search for ways of converting a multiple sequence alignment (MSA) to VCF. There is a tool called msa2vcf in the Jvarkit collection of utilities that can do this. The example is for the CLUSTALW format, but FASTA is accepted as well.
You should check that indels were treated correctly, because these are the most likely to cause trouble.
Then you can simply convert the VCF to PLINK format using PLINK.
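To make the MSA-to-VCF idea concrete, here is a minimal Python sketch. It is not how msa2vcf works internally, just an illustration under several assumptions of mine: the first sequence in the file acts as the reference, positions are alignment columns (not coordinates in any real genome), the chromosome label "1" is a placeholder, each sequence is written as a homozygous diploid genotype, and only biallelic A/C/G/T columns are kept. Columns containing gaps or ambiguous bases are not converted but reported, so you can inspect the indel regions yourself. The function names are hypothetical.

```python
import io


def read_aligned_fasta(handle):
    """Parse an aligned FASTA into {name: sequence}; all lengths must match."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the first word of the header
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.upper())
    seqs = {n: "".join(parts) for n, parts in seqs.items()}
    if len({len(s) for s in seqs.values()}) > 1:
        raise ValueError("sequences differ in length; input is not an alignment")
    return seqs


def msa_to_vcf_lines(seqs, chrom="1"):
    """Return (vcf_lines, skipped_columns) for biallelic SNP columns only."""
    names = list(seqs)
    ref = seqs[names[0]]  # assumption: first sequence is treated as the reference
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    lines = ["##fileformat=VCFv4.2", "\t".join(header + names)]
    skipped = []
    for i in range(len(ref)):
        column = [seqs[n][i] for n in names]
        alleles = set(column)
        if len(alleles) == 1:
            continue  # invariant column, nothing to report
        if not alleles <= set("ACGT") or len(alleles) != 2:
            skipped.append(i + 1)  # gap, ambiguous base, or multiallelic site
            continue
        alt = (alleles - {ref[i]}).pop()
        # assumption: one sequence per sample, encoded as a homozygous genotype
        gts = ["0/0" if base == ref[i] else "1/1" for base in column]
        fields = [chrom, str(i + 1), f"pos{i + 1}", ref[i], alt, ".", ".", ".", "GT"]
        lines.append("\t".join(fields + gts))
    return lines, skipped


# Toy alignment: column 2 is a SNP, column 5 contains a gap.
example = io.StringIO(">s1\nACGT-A\n>s2\nATGTCA\n>s3\nACGTCA\n")
lines, skipped = msa_to_vcf_lines(read_aligned_fasta(example))
print("\n".join(lines))
print("columns needing manual inspection:", skipped)
```

If the lines are written to a file, PLINK 1.9 should be able to read it with `--vcf` and write a binary fileset with `--make-bed` (or PED/MAP with `--recode`). Whether the homozygous encoding is appropriate depends on your organism and on what the phenotype simulation expects.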
