soedinglab / ccmgen Goto Github PK
View Code? Open in Web Editor NEWLicense: GNU Affero General Public License v3.0
License: GNU Affero General Public License v3.0
CCMgen/ccmpred/sampling/__init__.py
Line 37 in d874668
The current implementation uses the rounded expectation over the full sequence as the number of mutations nmut
. This can be problematic in edge cases (sequences very short or mutation rate very low).
A better alternative in these cases would be drawing nmut
probabilistically:
nmut = np.random.poisson(lam=clade.branch_length*mutation_rate, size=ncol).sum()
Hi,
I would like to infer a model on a generated data set where "proteins" have only 2 states available for each site. For example :
MSA = [ [ 1, 0, 1, ... , 0, 1, 0],
......,
[0, 0, 1, ...., 1, 1, 0]]
Could it be possible to infer a model with only 2 states for each amino acid?
Furthermore, could I use CCMgen for generating data with only 2 states based on the field inferred?
Thank you for your help
the ccmpred.parameter_handling.structured_to_linear() function linearizes x_single and x_pair.
by default it is assumed that x_single is a vector of size (L,20), if the flag nogapstate=False, it's assumed x_single is (L,21).
So far so good... but then if you try sample sequences using:
x = ccmpred.parameter_handling.structured_to_linear(...,nogapstate=False)
ccmpred.sampling.generate_mcmc_sample(x,...)
the "x" appears to be incorrectly parsed by function, assuming x_single is (L,20), and sampling is out of wack.
Appears this assumption (N_ALPHA - 1) is hard-coded in:
ccmpred/objfun/cd/cext/cd.h
#define X1_INDEX(i,a) (i) * (N_ALPHA - 1) + (a)
#define E1(i,a) x[X1_INDEX(i,a)]
then later in:
ccmpred/objfun/cd/cext/cd.c
void compute_conditional_probs(...) {
int nsingle = ncol * (N_ALPHA - 1);
for (a = 0; a < N_ALPHA - 1; a++) {cond_probs[a] = E1(i,a);}
...
}
Hello,
When running the simple example provided in your wiki, CCMpred failed with the following error in ccmpred/algorithm/lbfgs.py, line 91, in minimize:
AttributeError: 'str' object has no attribute 'decode'
It seems to be because I am using a newer version of scipy (scipy1.7.1).
Replacing line 91 of ccmpred/algorithm/lbfgs.py:
"message": res.message.decode("utf-8"),
with
"message": res.message,
fixed it for me.
Hi,
I would like to sample sequence on a tree by starting from a random sequence (instead of polyA sequence). For doing that, I am creating a fasta file with one random sequence that I use with --seq0-file option.
ccmgen --tree-newick my.tree --seq0-file seq0_file.fasta --mutation-rate 1 1atzA.braw.gz msa.fas
In the output, I see :
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
which means that the option --seq0-file
is not considered.
Have you any idea why? I am adding the file to reproduce this issue url file
Thank you for your help !
The total output of this command is :
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 36 threads for OMP parallelization.
Successfully loaded model parameters from /home/gerardos/analyse_sequence/Code_for_cluster/field_ccmpred/1atzA.braw.gz.
Created newick tree with 23633 leaves, 41705 nodes, avg branch length=0.091, depth_min=7.4035e-01, depth_max=4.2180e+00
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
AKSQFVAAISHELQSPLNAILMFSRILSRLTEHGREAFGDIRRSGMHALELIDGLLRFSRIEAGVLIIEDNGADRHGMSLLLESTGAQRVADAATGERGLQAFRAGGLGCVLVDLSLPGRSGQAMIGAIQEADRRIPAIVISGHVDAQAVRRCMRFGAEAFLTKPFEPSVLAEAVN
avg number of amino acid substitutions (parent -> child): 16.0
Alignment with 23633 sequences was sampled with mutation rate 1.0 and has Neff 4.08072
I am trying to use the raw prediction matrix output from CCMpred (-r option), but could not find any documentation on how these files are structured (sorry if I missed an obvious documentation).
I figure the first L lines are the h_i parameters for the each position in the protein and the 21 lines x 21 entries blocks afterwards are the J_ij coupling coefficients for all position pairs (including a gap state).
My questions are:
Thanks a lot for your help
Joern
Hi,
I am trying to use CCMgen with a tree inferred by fasttree2.
Two problems :
The "intermediate" clades (not the leaf) of the tree inferred have no names (None). That creates an error in the generation of sequence :
File "/home/andonis/anaconda3/lib/python3.8/site-packages/ccmpred/sampling/__init__.py", line 169, in sample_to_neff_increasingly msa_sampled = np.empty((nseq, ncol), dtype="uint8") TypeError: 'NoneType' object cannot be interpreted as an integer
To avoid this error, I changed the function def split_tree(tree, id0):
in the file trees.py. I replaced :
id_to_node = dict((cl.name, cl) for cl in bfs_iterator(tree.clade))
by
id_to_node = dict((cl.name,cl) for cl in bfs_iterator(tree.clade) if not cl.name is None )
The tree inferred by fasttree has no clade with the name "root". So, I added the name "root" to the first clade :
tree = Phylo.read("file.fasta.tree","newick")
tree.clade.name = "root"
tree.clade.branch_length = 0
Phylo.write(tree,"file_root.fasta.tree","newick")
Now, if I do
ccmgen 1atzA.braw.gz new_msa.fas --tree-newick file_root.fasta.tree --alnfile file.fasta --mcmc-sample-random-gapped --mcmc-burn-in 0 --mutation-rate-nef
the program properly finished. But, I am scared that I over simplified my tree with my solution 1)
Thank you for your help!
Hello,
Since pseudo-likelihood is optimized on the MSA and not on the frequencies, pseudo-counts are not actually used (except for v* centering).
Do you plan on working on this in a new version ? Or do you think it simply cannot be fixed because of the nature of pseudo-likelihood optimization itself ?
Best regards
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.