Bioinformatics Asked by Empiromancer on September 3, 2021
I have an alignment of multiple sequences in a FASTA file (output from MAFFT), for which I would like to simulate a phenotype using plink, but for that I need to have my alignment in a PED file or plink BED. There’s plenty online about converting in the other direction, but the only mention I found of converting FASTA to BED is about aligning a raw FASTQ file for use in plink – my data is already aligned! How can I do this conversion?
Note that How to convert FASTA to BED is about the UCSC BED format, not the plink binary PED.
An alignment can be the result of two slightly different analyses:
There is multiple sequence alignment (which is what you get from MAFFT) where sequences are aligned so that similar regions are on top of each other. This may require introducing indels (insertions / deletions) if a particular region is absent in some of the sequences.
Then there is alignment to a reference sequence. This is usually how variant data is obtained, and most tools deal with this use case, which is why the solution in the link you provided includes aligning the data to a reference sequence. Usually the reference sequence is much larger (e.g. the human genome), and the result of the alignment is a file (.bam) that tells you where the query sequences match the reference.
I did a quick search for ways of converting a multiple sequence alignment (MSA) to VCF. There is a tool called msa2vcf in the Jvarkit collection of utilities that can do this. The example is for the CLUSTALW format, but FASTA is accepted as well.
You should check that indels were treated correctly, because these are the most likely to cause trouble.
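One way to do that check is to list the variable columns of the alignment yourself and compare the ones containing gaps against the records in the VCF. The following is only a minimal sketch, not how msa2vcf works internally; the function names are made up for illustration, and it assumes a plain aligned FASTA where gaps are written as "-".

```python
from collections import Counter


def read_aligned_fasta(path):
    """Return {name: sequence} from an aligned FASTA file (hypothetical helper)."""
    seqs = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in seqs.items()}


def variable_columns(seqs):
    """Yield (1-based column, allele counts, has_gap) for every non-constant column."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; is this really an alignment?")
    for i, column in enumerate(zip(*seqs.values()), start=1):
        counts = Counter(column)
        if len(counts) > 1:
            # Assumption: "-" marks a gap, as in MAFFT FASTA output.
            yield i, dict(counts), "-" in counts
```

Columns reported with has_gap set to True are the ones worth looking up in the VCF. Keep in mind that the positions here are alignment columns, which may not match the coordinates the converter writes.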
Then you can simply convert the VCF to PLINK format (text PED or binary BED) using PLINK itself.
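As a rough sketch of that last step, assuming PLINK 1.9 is installed as plink on your PATH: --vcf reads the VCF, --make-bed writes the binary .bed/.bim/.fam set, and --recode writes text .ped/.map. The Python wrapper and the file names below are placeholders, not part of either tool.

```python
import shutil
import subprocess


def plink_vcf_command(vcf_path, out_prefix, binary=True):
    """Build the PLINK command line for converting a VCF (hypothetical helper)."""
    # --make-bed -> out_prefix.bed/.bim/.fam; --recode -> out_prefix.ped/.map
    fmt_flag = "--make-bed" if binary else "--recode"
    return ["plink", "--vcf", vcf_path, fmt_flag, "--out", out_prefix]


def convert_vcf(vcf_path, out_prefix, binary=True):
    """Run the conversion; assumes the executable is called 'plink'."""
    if shutil.which("plink") is None:
        raise RuntimeError("plink executable not found on PATH")
    subprocess.run(plink_vcf_command(vcf_path, out_prefix, binary), check=True)


if __name__ == "__main__":
    # Placeholder file names for illustration only.
    print(" ".join(plink_vcf_command("alignment.vcf", "alignment")))
```

Depending on how the sample IDs and the contig name come out in the VCF, PLINK may ask for extra options such as --double-id or --allow-extra-chr; its error message says which one applies.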
Correct answer by PPK on September 3, 2021