Finding discriminating positions for a sequence alignment column

I don’t have any theoretical background in biology, so forgive me if my question is a bit.. well..dumb. I’m trying to use a Monte Carlo approach to find the discriminating position of a given sequence alignment column. So essentially, given a sequence alignment, I want to find the position at which we can distinguish active and inactive conformations.

So, here is my code for the Monte Carlo method:

def MonteCarlo(S, max_iter, threshold):
"""
Computes the discriminating position for a sequence column

Input: S = sequence column, max_iter = max iterations, threshold = threshold of acceptance for boltzmann distribution (between 0 and 1)
Output: 2 classes
"""
n = len(S)
S_copy = S.copy()
num_class1 = int(0.30*n) #number in class1
print(num_class1)
for _ in range(num_class1): #intialisation
rand_position = random.randint(0,len(S)-1)
if S_copy[rand_position] == 'X':
rand_position = random.randint(0,len(S)) #reselect if used position
S_copy[rand_position] = 'X'

#Monte Carlo with Insertions and Deletions

s_entropy = entropy(S)
n_iter = 0 #counter for iterations
class_1 = dict() #class 1 determined by X's
class_2 = dict() #class 2 is rest of S
for i in range(len(S)):
if S_copy[i] == 'X':
class_1[i] =S[i]
else:
class_2[i] =S[i]

res = score(S, class_1, class_2)

while n_iter < max_iter:
r = random.random()

if r < 0.25: #25% chance to do an insertion
(c1,c2,pos) = Insertion(S, class_1, class_2)

elif r < 0.5: #25% chance to do a deletion
(c1,c2,pos) = Deletion(S, class_1, class_2)

else: # 50% chance to do a swap
(c1,c2,pos) = Swap(S, class_1, class_2)

res2 = score(S, c1, c2)

#Boltzmann Distribution coefficient
delta_S = abs(res2 - res)
lambda_, N = 0.2, 19
size = 6000
ind1 = random.randint(0,size - 1)
rv = boltzmann.rvs(lambda_, N, size=size)/N
beta = rv[ind1] #pick float between 0 and 1 with boltzmann distribution
if  (res2 > res) or (beta*delta_S > threshold): #if better score i.e. higher entropy gain
res = res2
class_1 = c1
class_2 = c2

n_iter += 1

return class_1, class_2, pos



Here I’m using the Boltzmann distribution as a probability measure. All the other functions (swap, insertion, deletion, entropy) calculate pretty much what the name says (if you need more clarification on my side, I’d be happy to provide it!)

Let me explain what my code is trying to do: Given a sequence alignment (as a list) say [‘M’,’M’,’M’,…,’L’,’L’,’L’] I first set a bunch of random positions and mark them out as ‘X’ (the number of Xs are determined by num_class1). Next, in the while loop I do swaps,insertions, and deletions with the goal of obtaining a better score. This meansI’m adding, removing, or swapping positions between class1 and class2. Eventually, my code returns two classes and the discriminating position.

While my code works, I don’t think it’s giving the right result. When I use a base case example of a list with two letters ‘M’ and ‘L with a clear discriminating position, I get random results.

Could someone explain theoretically (or via code) how one can find the discriminating position using a Monte Carlo method as above?

Thanks!

Without asking further questions, my suspicion would be how indels (gaps) are counted in your code against the aligment. Its the classic place for alignment mis-position errors. In other words, based on my experience your calculation probably performs fine, but linking the calculated alignment position against the actual observed alignment position is classic area of logical errors and needs careful monitoring of variables using code breakpoints within your loops.

Answered by M__ on March 25, 2021

Related Questions

maftools tcgaCompare plot: x-axis labels getting clipped

1  Asked on August 22, 2021

scRNA-Seq: Account for sequencing depth and gene length?

0  Asked on August 22, 2021 by bluescreen

How to get a list of genes corresponding to the list of SNPs (rs ids)?

6  Asked on August 22, 2021

Merge / Reconciliate several de novo transcriptome assemblies with different kmers

2  Asked on August 22, 2021 by pitagoras-alves

How are the values of prop.part() and prop.clades() calculated?

0  Asked on August 22, 2021 by namenlos

How does picard’s MarkDuplicate handle unmapped reads?

1  Asked on August 22, 2021 by init_js

Can Maven (mass spectrometry software) read mzML files?

1  Asked on August 22, 2021 by sren

Java error when trying to run UBCG jar (java.lang.ClassNotFoundException)

1  Asked on August 22, 2021 by maloki

How can i get data set of illness and not illness people?

1  Asked on August 22, 2021 by user5520049

How I change a function not to need an input

1  Asked on August 22, 2021 by exhausted

DESeqDataSetFromTximport all(lengths > 0) is not TRUE

2  Asked on August 22, 2021

3  Asked on August 22, 2021

How do I get cromwell to create soft-links?

0  Asked on August 22, 2021 by greg-dougherty

VCF spec: is it possible to have other alleles in addition to the MISSING value (‘.’) in ALTS?

0  Asked on August 22, 2021 by revl

How can I restrict the search space when querying GEO?

1  Asked on August 22, 2021 by an-ignorant-wanderer

1  Asked on August 21, 2021 by astro_guy

Human Herpes Virus, fusion motif in protein vs. SARS-CoV-2?

1  Asked on August 21, 2021

Identification of starting point for the synthesis of a protein

1  Asked on August 21, 2021 by roslup