jambler24 / gengraph

A repository for the GenGraph toolkit for the creation and manipulation of graph genomes

License: GNU General Public License v3.0

Python 100.00%

gengraph's Issues

Work on the ancestral genome function

This function was created to test two things: building a similarity matrix based on shared nodes, and creating a consensus genome from the nodes most often traversed through the graph. The node counts are weighted using the similarity matrix so that multiple closely related species are down-weighted.

The similarity matrix is not being created properly, and the resultant trees are not a true representation of the phylogeny.
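
As a reference point for debugging, here is a minimal sketch of one way a shared-node similarity matrix could be computed. It assumes each isolate can be reduced to the set of node IDs its path visits, and it uses the Jaccard index as the similarity measure. Both are assumptions for illustration and not necessarily what GenGraph does; the function name and the example node sets are hypothetical.

```python
from itertools import combinations


def jaccard_similarity_matrix(node_ids_by_isolate):
    """Return {(a, b): score}, where score is the Jaccard index of the
    node sets visited by isolates a and b (shared nodes / all nodes)."""
    matrix = {(name, name): 1.0 for name in node_ids_by_isolate}
    for a, b in combinations(sorted(node_ids_by_isolate), 2):
        nodes_a = node_ids_by_isolate[a]
        nodes_b = node_ids_by_isolate[b]
        union = nodes_a | nodes_b
        score = len(nodes_a & nodes_b) / len(union) if union else 0.0
        # The matrix is symmetric, so store both orientations.
        matrix[(a, b)] = matrix[(b, a)] = score
    return matrix


# Hypothetical node sets for three isolates.
paths = {
    "H37Rv": {"n1", "n2", "n3", "n4"},
    "CDC1551": {"n1", "n2", "n3", "n5"},
    "F11": {"n1", "n6"},
}
sim = jaccard_similarity_matrix(paths)
print(sim[("H37Rv", "CDC1551")], sim[("H37Rv", "F11")])
```

Comparing the output of the real function against a small hand-checkable case like this may help locate where the matrix goes wrong.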

Novice to python

I am having this problem. Can you help, please? Thank you.
devina@devina-HP-ProBook-450-G3:~/GenGraph$ python3 gengraphTool.py make_genome_graph --seq_file analysis.txt --out_file_name test --recreate_check
Traceback (most recent call last):
File "gengraphTool.py", line 83, in <module>
parsed_input_dict = parse_seq_file(args.seq_file)
File "/home/devina/GenGraph/gengraph.py", line 548, in parse_seq_file
A_seq_label_dict[a_seq_file['aln_name']] = a_seq_file['seq_name']
KeyError: 'seq_name'
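
A KeyError on 'seq_name' suggests that the header row of the sequence file was not parsed into the expected column names, for example because the columns are separated by spaces instead of tabs or the header line is missing. That diagnosis is an assumption, since analysis.txt is not shown. The sketch below checks a file before it is handed to GenGraph; the column names come from the example sequences.txt elsewhere on this page, and read_seq_table is a hypothetical helper, not a GenGraph function.

```python
import csv
import io

REQUIRED_COLUMNS = ("seq_name", "aln_name", "seq_path", "annotation_path")


def read_seq_table(handle):
    """Parse a tab-separated sequence table and fail early with a
    readable message if the header does not contain the expected columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(
            f"header is missing column(s) {missing}; found {header}. "
            "Check that the first line is the header and that columns "
            "are separated by tabs."
        )
    return list(reader)


# In practice the handle would come from open("analysis.txt", newline="").
example = io.StringIO(
    "seq_name\taln_name\tseq_path\tannotation_path\n"
    "H37Rv\tseq0\t/path/to/H37Rv.fa\tNA\n"
)
rows = read_seq_table(example)
print(rows[0]["seq_name"])
```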

Substrings of other isolate names as isolate names

If the name of one isolate is a substring of another isolate's name, an error results. For example, if one sequence is isolate "CDC1551" and another is "C", an error occurs during refine_initGraph. This is most likely due to the line
if isolate in data['ids']:
which should be replaced with a stricter check.
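
A minimal sketch of a stricter check, assuming data['ids'] is a single string of isolate names joined by commas (the separator is an assumption; adjust it to whatever the graph actually stores). With a plain string, `in` performs a substring search, which is why "C" matches "CDC1551". Splitting first turns it into an exact-name comparison. isolate_in_node is a hypothetical helper name.

```python
def isolate_in_node(isolate, ids_field, sep=","):
    """Return True only if `isolate` is exactly one of the names
    listed in `ids_field`, not merely a substring of one of them."""
    return isolate in ids_field.split(sep)


node_ids = "CDC1551,H37Rv"

print("C" in node_ids)                      # True: substring match, the bug
print(isolate_in_node("C", node_ids))       # False: exact-name match
print(isolate_in_node("CDC1551", node_ids)) # True
```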

Look into using PyPy

It is possible that this tool could benefit from the JIT compiler.

[Question] vg versus GenGraph

Hi!

I am quite new to this field and I am struggling to understand the difference between vg and this library. Can you help me?

Thank you in advance and thank you for this project!

[HELP / FEATURE] is it possible to use a single fasta file containing many genomes?

The National Library of Medicine (https://www.ncbi.nlm.nih.gov/datasets) only allows you to download a list of genomes as a single FASTA file (extension .fna). I tried to provide those files as input, but it did not work.
Is this my fault? Have you encountered similar problems?
Thank you!
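
One possible workaround, sketched under the assumption that GenGraph expects one sequence per FASTA file (as in the example sequences.txt, where each row points at its own .fa file): split the multi-record file into one file per record and list those files in the sequence table. split_multi_fasta is a hypothetical helper, not part of GenGraph. Note that in an NCBI genome download each record is usually a single replicon or contig, so record boundaries and genome boundaries may not coincide.

```python
from pathlib import Path


def split_multi_fasta(fasta_path, out_dir):
    """Write each record of a multi-record FASTA file to its own file.
    Returns a list of (record_id, output_path) tuples in input order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out_handle = None
    with open(fasta_path) as in_handle:
        for line in in_handle:
            if line.startswith(">"):
                if out_handle:
                    out_handle.close()
                fields = line[1:].split()
                record_id = fields[0] if fields else f"record_{len(written) + 1}"
                # Keep file names portable: replace unusual characters.
                safe = "".join(
                    c if c.isalnum() or c in "._-" else "_" for c in record_id
                )
                out_path = out_dir / f"{safe}.fa"
                out_handle = open(out_path, "w")
                written.append((record_id, out_path))
            if out_handle:
                out_handle.write(line)
    if out_handle:
        out_handle.close()
    return written
```

Each returned path can then be used as the seq_path of one row in the sequence file.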

[BUG?] Error: progressiveMauve_call error: output of progressiveMauve empty

When I run this command:
python ./gengraphTool.py make_genome_graph --seq_file TestGraphs/sequences.txt --out_file_name test
sequences.txt:

seq_name	aln_name	seq_path	annotation_path
H37Rv	seq0	/Users/filippo/Desktop/workspace/GenGraph/TestGraphs/H37Rv.fa	NA
H37Rv1	seq1	/Users/filippo/Desktop/workspace/GenGraph/TestGraphs/H37Rv1.fa	Na
H37Rv2	seq2	/Users/filippo/Desktop/workspace/GenGraph/TestGraphs/H37Rv2.fa	N

I got the error:
progressiveMauve_call error: output of progressiveMauve empty

I worked around the error by changing line 2775 in gengraph.py:
old line: number_of_lines = 3 ----- new line: number_of_lines = 2

But I'm not sure about the fix.

Thank you!

Gengraph running problem

Hi,
One question: why does the program get stuck when I start running it?

(base) devina@Devinas-MacBook-Pro ~ % python3 GenGraph/gengraphTool.py make_genome_graph --seq_file Documents/anagengraph.txt --out_file_name Documents/output
Conducting progressiveMauve
progressiveMauve

It got stuck.

I am using a Mac.
Processor: 2.7 GHz Intel Core i7
Memory: 16 GB
Input: two sequences, 4.5 MB each
Thank you for your precious help
Devina

No such file or directory: 'globalAlignment_khush.backbone'

Hey, I am trying to run the example in your repo (sequences.txt) with some modifications on my local system, but I am running into this problem.

$ python3 ./gengraphTool.py make_genome_graph --seq_file sequences.txt --out_file_name khush --recreate_check
Running GenGraph Toolkit
Creating genome graph
[OrderedDict([('seq_name', 'H37Rv'), ('aln_name', 'seq0'), ('seq_path', '/home/noob/Documents/IIITD/tavlab/strainflow/GenGraph-master/TestGraphs/H37Rv.fa'), ('annotation_path', 'NA')]), OrderedDict([('seq_name', 'H37Rv1'), ('aln_name', 'seq1'), ('seq_path', '/home/noob/Documents/IIITD/tavlab/strainflow/GenGraph-master/TestGraphs/H37Rv1.fa'), ('annotation_path', 'Na')]), OrderedDict([('seq_name', 'H37Rv2'), ('aln_name', 'seq2'), ('seq_path', '/home/noob/Documents/IIITD/tavlab/strainflow/GenGraph-master/TestGraphs/H37Rv2.fa'), ('annotation_path', 'Na')])]
Conducting progressiveMauve
progressiveMauve Complete
Traceback (most recent call last):
  File "./gengraphTool.py", line 136, in <module>
    genome_aln_graph = bbone_to_initGraph(bbone_file, parsed_input_dict)
  File "/home/noob/Documents/IIITD/tavlab/strainflow/GenGraph-master/gengraph.py", line 1616, in bbone_to_initGraph
    backbone_lol = input_parser(bbone_file)
  File "/home/noob/Documents/IIITD/tavlab/strainflow/GenGraph-master/gengraph.py", line 1189, in input_parser
    in_file = open(file_path, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'globalAlignment_khush.backbone'

Can you please help me out?

Also, one more question: can I make a directed de Bruijn graph using this library?

Incorrect version being displayed.

The current release still shows 'Welcome to GenGraph v0.1' in the help even though the Git release is tagged as 0.3.

Sequence recreate check taking too long

The sequence recreate check should be faster. This is related to the seq_recreate_check() function and is being looked into.

Need help running the code.

Hello, I'm a student trying to create a graph similar to Figure 3 of the GenGraph paper. I've been trying to get the code to run for more than a week and keep hitting one error after another. The latest one is this:

FileNotFoundError: [WinError 2] The system cannot find the file specified

Full output:
C:\Users\eros1\anaconda3\Lib\site-packages\GenGraph>python ./gengraphTool.py make_genome_graph --seq_file C:\Users\eros1\OneDrive\documents\Summer2022_Genome\E.coli_tab.txt --out_file_name test

Conducting progressiveMauve
({'seq0': 'K-12', 'seq1': 'Nissle-1917', 'seq2': 'O157:H7'}, {'K-12': 'C:/Users/eros1/OneDrive/Documents/Summer2022_Genome/E. Coli k-12.fasta', 'Nissle-1917': 'C:/Users/eros1/OneDrive/Documents/Summer2022_Genome/E. Coli Nissle 1917.fasta', 'O157:H7': 'C:/Users/eros1/OneDrive/Documents/Summer2022_Genome/E. Coli O157H7.fasta'}, ['C:/Users/eros1/OneDrive/Documents/Summer2022_Genome/E. Coli k-12.fasta', 'C:/Users/eros1/OneDrive/Documents/Summer2022_Genome/E. Coli Nissle 1917.fasta', 'C:/Users/eros1/OneDrive/Documents/Summer2022_Genome/E. Coli O157H7.fasta'], {'K-12': 'NA', 'Nissle-1917': 'NA', 'O157:H7': 'NA'})

Traceback (most recent call last):
File "./gengraphTool.py", line 87, in <module>
progressiveMauve_alignment(parsed_input_dict[2], args.out_file_name)
File "C:\Users\eros1\anaconda3\Lib\site-packages\GenGraph\gengraph.py", line 1949, in progressiveMauve_alignment
return call(progressiveMauve_call, stdout=open(os.devnull, 'wb'))
File "C:\Users\eros1\anaconda3\lib\subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "C:\Users\eros1\anaconda3\lib\subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\eros1\anaconda3\lib\subprocess.py", line 1311, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
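
In this traceback the FileNotFoundError is raised inside CreateProcess, which means Windows could not find the program being launched, not one of the FASTA files. Since the failing call is the progressiveMauve call, the most likely explanation is that progressiveMauve is not installed or not on the PATH, though that is an inference from the traceback. A minimal sketch of a check that could be run before the alignment step (find_executable is a hypothetical helper, not a GenGraph function):

```python
import shutil


def find_executable(name="progressiveMauve", explicit_path=None):
    """Resolve an executable either from an explicit path or from PATH.
    Raises FileNotFoundError with a readable message if it is missing."""
    candidate = explicit_path or name
    resolved = shutil.which(candidate)
    if resolved is None:
        raise FileNotFoundError(
            f"Could not find '{candidate}'. Install it, add its folder "
            "to PATH, or pass the full path to the executable."
        )
    return resolved
```

Running `find_executable()` in the same terminal used to launch gengraphTool.py shows whether the aligner is visible to Python at all.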

NetworkX v2.4 support

As of NetworkX v2.4, G.node has been removed; use G.nodes instead:
https://networkx.github.io/documentation/stable/release/release_2.4.html
Code that still uses G.node results in an error.

TypeError: set_node_attributes() missing 1 required positional argument: 'values'

I haven't made any changes to the original scripts and am still facing this issue.
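
For reference, a short sketch of the NetworkX 2.x style for both points. Node data is read through G.nodes, and set_node_attributes takes its arguments in the order (G, values, name); in NetworkX 1.x the order was (G, name, values), so a TypeError like the one above usually indicates that the installed NetworkX version and the calling code disagree. The node names and attribute values below are made up for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_node("n1", sequence="ACGT")

# NetworkX >= 2.4: use G.nodes (G.node was removed).
seq = G.nodes["n1"]["sequence"]

# NetworkX >= 2.0: values first, attribute name second.
nx.set_node_attributes(G, {"n1": "H37Rv,CDC1551"}, "ids")

print(nx.__version__, seq, G.nodes["n1"]["ids"])
```

Printing `nx.__version__` is a quick way to confirm which version is actually being imported.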

Remove excess debugging information

Excess print commands need to be removed for a cleaner UI.

Running problem

hi
Issue : Exception FileNotOpened thrown from
Unknown() in gnFileSource.cpp 67
Called by Unknown()
Traceback (most recent call last):
File "./GenGraph/gengraphTool.py", line 102, in <module>
genome_aln_graph = bbone_to_initGraph(bbone_file, parsed_input_dict)
File "/GenGraph/gengraph.py", line 830, in bbone_to_initGraph
iso_length = len(input_parser(input_dict[1][iso])[0]['DNA_seq'])
TypeError: 'NoneType' object is not subscriptable

Environment: Docker Toolbox on Windows 10

Add support for multiple cores

The node realignment step can be parallelised. More research is required into the best way to do this.
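
One option to evaluate is a process pool from the standard library, sketched below. This assumes that each node can be realigned independently of the others and that the inputs and results can be pickled. realign_node is a stand-in that only reports the longest sequence in a node; it is not GenGraph's realignment code, and the node data is invented.

```python
from concurrent.futures import ProcessPoolExecutor


def realign_node(node):
    """Stand-in for the real per-node realignment. Here it just returns
    the node ID and the length of the longest sequence in the node."""
    node_id, sequences = node
    return node_id, max(len(seq) for seq in sequences)


def realign_nodes_parallel(nodes, workers=2):
    """Run realign_node over all nodes using a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(realign_node, nodes))


if __name__ == "__main__":
    # Hypothetical nodes: (node_id, [sequence per isolate]).
    nodes = [("n1", ["ACGT", "ACG"]), ("n2", ["TT", "TTGA", "T"])]
    print(realign_nodes_parallel(nodes))
```

Processes are used here instead of threads because CPU-bound Python code does not generally speed up with threads. Whether this helps in practice depends on how much time is spent inside the external aligner versus in Python.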

Convert / import vg gfa file

Create a function to convert to and from the GFA format used by vg. The format is defined here:

https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md

This may be tricky as at first glance vg does not use a coordinate system for keeping track of relative nucleotide positions. Will look into this.
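
A minimal sketch of the export direction, following the GFA1 record types in the linked spec: H (header), S (segment), L (link) and P (path). It assumes the graph has already been reduced to plain dictionaries and lists, that every node is traversed in the forward orientation, and that links carry no overlap; none of this reflects GenGraph's internal data structures. Writing each isolate as a P line may be one way to keep positional information, since an isolate's coordinates can be recovered by walking its path.

```python
def graph_to_gfa1(segments, links, paths=None):
    """Serialise a simple graph to GFA1 text.

    segments: {segment_name: sequence}
    links:    iterable of (from_segment, to_segment), forward strand only
    paths:    optional {path_name: [segment_name, ...]}
    """
    lines = ["H\tVN:Z:1.0"]
    for name, sequence in segments.items():
        lines.append(f"S\t{name}\t{sequence}")
    for source, target in links:
        # '+' orientations and a 0M overlap: blunt, forward-strand links.
        lines.append(f"L\t{source}\t+\t{target}\t+\t0M")
    for path_name, segment_names in (paths or {}).items():
        walk = ",".join(f"{name}+" for name in segment_names)
        lines.append(f"P\t{path_name}\t{walk}\t*")
    return "\n".join(lines) + "\n"


# Hypothetical three-node graph with one SNP bubble.
gfa_text = graph_to_gfa1(
    segments={"1": "ACG", "2": "T", "3": "A"},
    links=[("1", "2"), ("1", "3")],
    paths={"H37Rv": ["1", "2"], "CDC1551": ["1", "3"]},
)
print(gfa_text)
```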

Add progressiveMauve path as a flag for gengraphTool.py
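
A sketch of what such a flag could look like with argparse. The flag name --progressive_mauve_path is a suggestion only and does not exist in gengraphTool.py at the time of this issue; the default keeps the current behaviour of looking the program up on the PATH.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="GenGraph toolkit (sketch)")
    parser.add_argument(
        "--progressive_mauve_path",  # hypothetical flag name
        default="progressiveMauve",
        help="Path to the progressiveMauve executable "
             "(default: look it up on PATH)",
    )
    return parser


args = build_parser().parse_args(
    ["--progressive_mauve_path", "/opt/mauve/progressiveMauve"]
)
print(args.progressive_mauve_path)
```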

deepcopy

In the fasta_alignment_to_subnet() function, there is a
copy.deepcopy(true_start)
that according to profiling is taking way too long. A suggested solution is using
g = cPickle.loads(cPickle.dumps(a, -1))
as suggested here:
https://stackoverflow.com/questions/24756712/deepcopy-is-extremely-slow
Will try this first, but otherwise the whole fasta_alignment_to_subnet() function could do with improvement.
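
For Python 3, the same trick can be sketched with the pickle module, since cPickle only exists in Python 2 and Python 3's pickle uses the C implementation automatically when available. It only works for objects that can be pickled, and whether it is actually faster for the object passed in fasta_alignment_to_subnet() would need to be confirmed by profiling. fast_deepcopy is a hypothetical helper name.

```python
import pickle


def fast_deepcopy(obj):
    """Deep-copy a picklable object by round-tripping it through pickle."""
    return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


original = {"n1": {"ids": ["H37Rv", "CDC1551"], "start": 1}}
duplicate = fast_deepcopy(original)

# The copy is equal but fully independent of the original.
duplicate["n1"]["ids"].append("F11")
print(original["n1"]["ids"], duplicate["n1"]["ids"])
```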

set mauve scratch path

The --scratch-path-1 value is hard-coded. It needs to become a relative path.
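
One way this could be approached, as a sketch: create the scratch directory at run time inside the working directory and build the command from it. build_mauve_call is a hypothetical helper, the --output option is included on the assumption that the call already uses it, and any other options GenGraph passes are omitted here.

```python
import tempfile
from pathlib import Path


def build_mauve_call(seq_paths, out_name, work_dir=".", mauve="progressiveMauve"):
    """Build the progressiveMauve argument list with a scratch directory
    created under work_dir instead of a hard-coded location."""
    scratch = Path(tempfile.mkdtemp(prefix="mauve_scratch_", dir=work_dir))
    call = [
        mauve,
        f"--output={out_name}",
        f"--scratch-path-1={scratch}",
    ] + list(seq_paths)
    return call, scratch
```

The returned scratch path can be removed with shutil.rmtree once the alignment has finished.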