Git Product home page Git Product logo

ccmgen's Issues

Model with 2 states for amino acids ?

Hi,

I would like to infer a model on a generated data set where "proteins" have only 2 states available for each site. For example :

MSA = [ [ 1, 0, 1, ... , 0, 1, 0],
         ......,
        [0, 0, 1, ...., 1, 1, 0]] 

Could it be possible to infer a model with only 2 states for each amino acid?
Furthermore, could I use CCMgen for generating data with only 2 states based on the field inferred?

Thank you for your help

when gaps included in x_single, there is an error in sampling

the ccmpred.parameter_handling.structured_to_linear() function linearizes x_single and x_pair.
by default it is assumed that x_single is a vector of size (L,20), if the flag nogapstate=False, it's assumed x_single is (L,21).

So far so good... but then if you try sample sequences using:
x = ccmpred.parameter_handling.structured_to_linear(...,nogapstate=False)
ccmpred.sampling.generate_mcmc_sample(x,...)

the "x" appears to be incorrectly parsed by function, assuming x_single is (L,20), and sampling is out of wack.

Appears this assumption (N_ALPHA - 1) is hard-coded in:
ccmpred/objfun/cd/cext/cd.h
#define X1_INDEX(i,a) (i) * (N_ALPHA - 1) + (a)
#define E1(i,a) x[X1_INDEX(i,a)]

then later in:
ccmpred/objfun/cd/cext/cd.c
void compute_conditional_probs(...) {
int nsingle = ncol * (N_ALPHA - 1);
for (a = 0; a < N_ALPHA - 1; a++) {cond_probs[a] = E1(i,a);}
...
}

Error with CCMgen --tree-newick

Hi,

I am trying to use CCMgen with a tree inferred by fasttree2.

Two problems :

  1. The "intermediate" clades (not the leaf) of the tree inferred have no names (None). That creates an error in the generation of sequence :
    File "/home/andonis/anaconda3/lib/python3.8/site-packages/ccmpred/sampling/__init__.py", line 169, in sample_to_neff_increasingly msa_sampled = np.empty((nseq, ncol), dtype="uint8") TypeError: 'NoneType' object cannot be interpreted as an integer

    To avoid this error, I changed the function def split_tree(tree, id0): in the file trees.py. I replaced :
    id_to_node = dict((cl.name, cl) for cl in bfs_iterator(tree.clade))
    by
    id_to_node = dict((cl.name,cl) for cl in bfs_iterator(tree.clade) if not cl.name is None )

  2. The tree inferred by fasttree has no clade with the name "root". So, I added the name "root" to the first clade :

    tree = Phylo.read("file.fasta.tree","newick")
    tree.clade.name = "root"
    tree.clade.branch_length = 0
    Phylo.write(tree,"file_root.fasta.tree","newick")
    

Now, if I do
ccmgen 1atzA.braw.gz new_msa.fas --tree-newick file_root.fasta.tree --alnfile file.fasta --mcmc-sample-random-gapped --mcmc-burn-in 0 --mutation-rate-nef
the program properly finished. But, I am scared that I over simplified my tree with my solution 1)

Thank you for your help!

Interpretation of compared output

I am trying to use the raw prediction matrix output from CCMpred (-r option), but could not find any documentation on how these files are structured (sorry if I missed an obvious documentation).

I figure the first L lines are the h_i parameters for the each position in the protein and the 21 lines x 21 entries blocks afterwards are the J_ij coupling coefficients for all position pairs (including a gap state).

My questions are:

  • What is the ordering of the amino acids / gap state?
  • How are the position pairs enumerated and which position corresponds to rows of the 21x21 matrices and which position corresponds to the columns?

Thanks a lot for your help

Joern

How to use --seq0-file option ?

Hi,

I would like to sample sequence on a tree by starting from a random sequence (instead of polyA sequence). For doing that, I am creating a fasta file with one random sequence that I use with --seq0-file option.

ccmgen --tree-newick my.tree  --seq0-file  seq0_file.fasta --mutation-rate 1 1atzA.braw.gz msa.fas

In the output, I see :
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
which means that the option --seq0-file is not considered.

Have you any idea why? I am adding the file to reproduce this issue url file

Thank you for your help !

The total output of this command is :

  ┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻  version 1.0.0
  ┃  ┃  ┃┃┃┃╺┓┣╸ ┃┗┫  Vorberg, Seemayer and Soeding (2018)
  ┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹  https://github.com/soedinglab/ccmgen

Using 36 threads for OMP parallelization.

Successfully loaded model parameters from /home/gerardos/analyse_sequence/Code_for_cluster/field_ccmpred/1atzA.braw.gz.
Created newick tree with 23633 leaves, 41705 nodes, avg branch length=0.091, depth_min=7.4035e-01, depth_max=4.2180e+00

Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
AKSQFVAAISHELQSPLNAILMFSRILSRLTEHGREAFGDIRRSGMHALELIDGLLRFSRIEAGVLIIEDNGADRHGMSLLLESTGAQRVADAATGERGLQAFRAGGLGCVLVDLSLPGRSGQAMIGAIQEADRRIPAIVISGHVDAQAVRRCMRFGAEAFLTKPFEPSVLAEAVN
avg number of amino acid substitutions (parent -> child): 16.0

Alignment with 23633 sequences was sampled with mutation rate 1.0 and has Neff 4.08072

Pseudo-likelihood optimization does not take pseudo-counts into account

Hello,

Since pseudo-likelihood is optimized on the MSA and not on the frequencies, pseudo-counts are not actually used (except for v* centering).

Do you plan on working on this in a new version ? Or do you think it simply cannot be fixed because of the nature of pseudo-likelihood optimization itself ?

Best regards

nmut approximation via rounded expectation can be inaccurate in edge cases.

nmut = int(clade.branch_length * mutation_rate * ncol)

The current implementation uses the rounded expectation over the full sequence as the number of mutations nmut . This can be problematic in edge cases (sequences very short or mutation rate very low).

A better alternative in these cases would be drawing nmut probabilistically:

nmut = np.random.poisson(lam=clade.branch_length*mutation_rate, size=ncol).sum()

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.