Git Product home page Git Product logo

ccmgen's People

Contributors

odannis avatar sseemayer avatar susannvorberg avatar ulupo avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

ccmgen's Issues

nmut approximation via rounded expectation can be inaccurate in edge cases.

nmut = int(clade.branch_length * mutation_rate * ncol)

The current implementation uses the rounded expectation over the full sequence as the number of mutations nmut . This can be problematic in edge cases (sequences very short or mutation rate very low).

A better alternative in these cases would be drawing nmut probabilistically:

nmut = np.random.poisson(lam=clade.branch_length*mutation_rate, size=ncol).sum()

Model with 2 states for amino acids ?

Hi,

I would like to infer a model on a generated data set where "proteins" have only 2 states available for each site. For example :

MSA = [ [ 1, 0, 1, ... , 0, 1, 0],
         ......,
        [0, 0, 1, ...., 1, 1, 0]] 

Could it be possible to infer a model with only 2 states for each amino acid?
Furthermore, could I use CCMgen for generating data with only 2 states based on the field inferred?

Thank you for your help

when gaps included in x_single, there is an error in sampling

the ccmpred.parameter_handling.structured_to_linear() function linearizes x_single and x_pair.
by default it is assumed that x_single is a vector of size (L,20), if the flag nogapstate=False, it's assumed x_single is (L,21).

So far so good... but then if you try sample sequences using:
x = ccmpred.parameter_handling.structured_to_linear(...,nogapstate=False)
ccmpred.sampling.generate_mcmc_sample(x,...)

the "x" appears to be incorrectly parsed by function, assuming x_single is (L,20), and sampling is out of wack.

Appears this assumption (N_ALPHA - 1) is hard-coded in:
ccmpred/objfun/cd/cext/cd.h
#define X1_INDEX(i,a) (i) * (N_ALPHA - 1) + (a)
#define E1(i,a) x[X1_INDEX(i,a)]

then later in:
ccmpred/objfun/cd/cext/cd.c
void compute_conditional_probs(...) {
int nsingle = ncol * (N_ALPHA - 1);
for (a = 0; a < N_ALPHA - 1; a++) {cond_probs[a] = E1(i,a);}
...
}

How to use --seq0-file option ?

Hi,

I would like to sample sequence on a tree by starting from a random sequence (instead of polyA sequence). For doing that, I am creating a fasta file with one random sequence that I use with --seq0-file option.

ccmgen --tree-newick my.tree  --seq0-file  seq0_file.fasta --mutation-rate 1 1atzA.braw.gz msa.fas

In the output, I see :
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
which means that the option --seq0-file is not considered.

Have you any idea why? I am adding the file to reproduce this issue url file

Thank you for your help !

The total output of this command is :

  ┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻  version 1.0.0
  ┃  ┃  ┃┃┃┃╺┓┣╸ ┃┗┫  Vorberg, Seemayer and Soeding (2018)
  ┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹  https://github.com/soedinglab/ccmgen

Using 36 threads for OMP parallelization.

Successfully loaded model parameters from /home/gerardos/analyse_sequence/Code_for_cluster/field_ccmpred/1atzA.braw.gz.
Created newick tree with 23633 leaves, 41705 nodes, avg branch length=0.091, depth_min=7.4035e-01, depth_max=4.2180e+00

Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
AKSQFVAAISHELQSPLNAILMFSRILSRLTEHGREAFGDIRRSGMHALELIDGLLRFSRIEAGVLIIEDNGADRHGMSLLLESTGAQRVADAATGERGLQAFRAGGLGCVLVDLSLPGRSGQAMIGAIQEADRRIPAIVISGHVDAQAVRRCMRFGAEAFLTKPFEPSVLAEAVN
avg number of amino acid substitutions (parent -> child): 16.0

Alignment with 23633 sequences was sampled with mutation rate 1.0 and has Neff 4.08072

Interpretation of compared output

I am trying to use the raw prediction matrix output from CCMpred (-r option), but could not find any documentation on how these files are structured (sorry if I missed an obvious documentation).

I figure the first L lines are the h_i parameters for the each position in the protein and the 21 lines x 21 entries blocks afterwards are the J_ij coupling coefficients for all position pairs (including a gap state).

My questions are:

  • What is the ordering of the amino acids / gap state?
  • How are the position pairs enumerated and which position corresponds to rows of the 21x21 matrices and which position corresponds to the columns?

Thanks a lot for your help

Joern

Error with CCMgen --tree-newick

Hi,

I am trying to use CCMgen with a tree inferred by fasttree2.

Two problems :

  1. The "intermediate" clades (not the leaf) of the tree inferred have no names (None). That creates an error in the generation of sequence :
    File "/home/andonis/anaconda3/lib/python3.8/site-packages/ccmpred/sampling/__init__.py", line 169, in sample_to_neff_increasingly msa_sampled = np.empty((nseq, ncol), dtype="uint8") TypeError: 'NoneType' object cannot be interpreted as an integer

    To avoid this error, I changed the function def split_tree(tree, id0): in the file trees.py. I replaced :
    id_to_node = dict((cl.name, cl) for cl in bfs_iterator(tree.clade))
    by
    id_to_node = dict((cl.name,cl) for cl in bfs_iterator(tree.clade) if not cl.name is None )

  2. The tree inferred by fasttree has no clade with the name "root". So, I added the name "root" to the first clade :

    tree = Phylo.read("file.fasta.tree","newick")
    tree.clade.name = "root"
    tree.clade.branch_length = 0
    Phylo.write(tree,"file_root.fasta.tree","newick")
    

Now, if I do
ccmgen 1atzA.braw.gz new_msa.fas --tree-newick file_root.fasta.tree --alnfile file.fasta --mcmc-sample-random-gapped --mcmc-burn-in 0 --mutation-rate-nef
the program properly finished. But, I am scared that I over simplified my tree with my solution 1)

Thank you for your help!

Pseudo-likelihood optimization does not take pseudo-counts into account

Hello,

Since pseudo-likelihood is optimized on the MSA and not on the frequencies, pseudo-counts are not actually used (except for v* centering).

Do you plan on working on this in a new version ? Or do you think it simply cannot be fixed because of the nature of pseudo-likelihood optimization itself ?

Best regards

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.