soedinglab / ccmgen Goto Github PK

License: GNU Affero General Public License v3.0

Python 6.38% C 0.88% HTML 92.71% Shell 0.03%

ccmgen's Issues

Model with 2 states for amino acids ?

Hi,

I would like to infer a model on a generated data set where "proteins" have only 2 states available for each site. For example :

MSA = [ [ 1, 0, 1, ... , 0, 1, 0],
         ......,
        [0, 0, 1, ...., 1, 1, 0]]

Could it be possible to infer a model with only 2 states for each amino acid?
Furthermore, could I use CCMgen for generating data with only 2 states based on the field inferred?

Thank you for your help

when gaps included in x_single, there is an error in sampling

the ccmpred.parameter_handling.structured_to_linear() function linearizes x_single and x_pair.
by default it is assumed that x_single is a vector of size (L,20), if the flag nogapstate=False, it's assumed x_single is (L,21).

So far so good... but then if you try sample sequences using:
x = ccmpred.parameter_handling.structured_to_linear(...,nogapstate=False)
ccmpred.sampling.generate_mcmc_sample(x,...)

the "x" appears to be incorrectly parsed by function, assuming x_single is (L,20), and sampling is out of wack.

Appears this assumption (N_ALPHA - 1) is hard-coded in:
ccmpred/objfun/cd/cext/cd.h
#define X1_INDEX(i,a) (i) * (N_ALPHA - 1) + (a)
#define E1(i,a) x[X1_INDEX(i,a)]

then later in:
ccmpred/objfun/cd/cext/cd.c
void compute_conditional_probs(...) {
int nsingle = ncol * (N_ALPHA - 1);
for (a = 0; a < N_ALPHA - 1; a++) {cond_probs[a] = E1(i,a);}
...
}

AttributeError: 'str' object has no attribute 'decode' (scipy1.7)

Hello,

When running the simple example provided in your wiki, CCMpred failed with the following error in ccmpred/algorithm/lbfgs.py, line 91, in minimize:
AttributeError: 'str' object has no attribute 'decode'
It seems to be because I am using a newer version of scipy (scipy1.7.1).

Replacing line 91 of ccmpred/algorithm/lbfgs.py:
"message": res.message.decode("utf-8"),
with
"message": res.message,
fixed it for me.

Error with CCMgen --tree-newick

Hi,

I am trying to use CCMgen with a tree inferred by fasttree2.

Two problems :

The "intermediate" clades (not the leaf) of the tree inferred have no names (None). That creates an error in the generation of sequence :
File "/home/andonis/anaconda3/lib/python3.8/site-packages/ccmpred/sampling/__init__.py", line 169, in sample_to_neff_increasingly msa_sampled = np.empty((nseq, ncol), dtype="uint8") TypeError: 'NoneType' object cannot be interpreted as an integer

To avoid this error, I changed the function def split_tree(tree, id0): in the file trees.py. I replaced :
id_to_node = dict((cl.name, cl) for cl in bfs_iterator(tree.clade))
by
id_to_node = dict((cl.name,cl) for cl in bfs_iterator(tree.clade) if not cl.name is None )

The tree inferred by fasttree has no clade with the name "root". So, I added the name "root" to the first clade :

tree = Phylo.read("file.fasta.tree","newick")
tree.clade.name = "root"
tree.clade.branch_length = 0
Phylo.write(tree,"file_root.fasta.tree","newick")

Now, if I do
ccmgen 1atzA.braw.gz new_msa.fas --tree-newick file_root.fasta.tree --alnfile file.fasta --mcmc-sample-random-gapped --mcmc-burn-in 0 --mutation-rate-nef
the program properly finished. But, I am scared that I over simplified my tree with my solution 1)

Thank you for your help!

Interpretation of compared output

I am trying to use the raw prediction matrix output from CCMpred (-r option), but could not find any documentation on how these files are structured (sorry if I missed an obvious documentation).

I figure the first L lines are the h_i parameters for the each position in the protein and the 21 lines x 21 entries blocks afterwards are the J_ij coupling coefficients for all position pairs (including a gap state).

My questions are:

What is the ordering of the amino acids / gap state?
How are the position pairs enumerated and which position corresponds to rows of the 21x21 matrices and which position corresponds to the columns?

Thanks a lot for your help

Joern

How to use --seq0-file option ?

Hi,

I would like to sample sequence on a tree by starting from a random sequence (instead of polyA sequence). For doing that, I am creating a fasta file with one random sequence that I use with --seq0-file option.

ccmgen --tree-newick my.tree  --seq0-file  seq0_file.fasta --mutation-rate 1 1atzA.braw.gz msa.fas

In the output, I see :
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
which means that the option --seq0-file is not considered.

Have you any idea why? I am adding the file to reproduce this issue url file

Thank you for your help !

The total output of this command is :

  ┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻  version 1.0.0
  ┃  ┃  ┃┃┃┃╺┓┣╸ ┃┗┫  Vorberg, Seemayer and Soeding (2018)
  ┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹  https://github.com/soedinglab/ccmgen

Using 36 threads for OMP parallelization.

Successfully loaded model parameters from /home/gerardos/analyse_sequence/Code_for_cluster/field_ccmpred/1atzA.braw.gz.
Created newick tree with 23633 leaves, 41705 nodes, avg branch length=0.091, depth_min=7.4035e-01, depth_max=4.2180e+00

Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
AKSQFVAAISHELQSPLNAILMFSRILSRLTEHGREAFGDIRRSGMHALELIDGLLRFSRIEAGVLIIEDNGADRHGMSLLLESTGAQRVADAATGERGLQAFRAGGLGCVLVDLSLPGRSGQAMIGAIQEADRRIPAIVISGHVDAQAVRRCMRFGAEAFLTKPFEPSVLAEAVN
avg number of amino acid substitutions (parent -> child): 16.0

Alignment with 23633 sequences was sampled with mutation rate 1.0 and has Neff 4.08072

Pseudo-likelihood optimization does not take pseudo-counts into account

Hello,

Since pseudo-likelihood is optimized on the MSA and not on the frequencies, pseudo-counts are not actually used (except for v* centering).

Do you plan on working on this in a new version ? Or do you think it simply cannot be fixed because of the nature of pseudo-likelihood optimization itself ?

Best regards

nmut approximation via rounded expectation can be inaccurate in edge cases.

CCMgen/ccmpred/sampling/__init__.py

Line 37 in d874668

nmut = int(clade.branch_length * mutation_rate * ncol)

The current implementation uses the rounded expectation over the full sequence as the number of mutations nmut . This can be problematic in edge cases (sequences very short or mutation rate very low).

A better alternative in these cases would be drawing nmut probabilistically:

nmut = np.random.poisson(lam=clade.branch_length*mutation_rate, size=ncol).sum()

soedinglab / ccmgen Goto Github PK

ccmgen's Issues

Model with 2 states for amino acids ?

when gaps included in x_single, there is an error in sampling

AttributeError: 'str' object has no attribute 'decode' (scipy1.7)

Error with CCMgen --tree-newick

Interpretation of compared output

How to use --seq0-file option ?

Pseudo-likelihood optimization does not take pseudo-counts into account

nmut approximation via rounded expectation can be inaccurate in edge cases.

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent