atifrahman / hawk
Hitting associations with k-mers
License: GNU General Public License v3.0
After running countKmers, I found that the file sorted_files.txt had incorrect specification of the sample folders.
The sample folders were indicated as Reads_catrim.Hcor_SM05L02_Red.fastq.gz, etc.
The sorted_files.txt file included lines like this:
/Hcor/Reads_Reads_catrim.Hcor_SM05L02_Red.fastq.gz/Reads_catrim.Hcor_SM05L02_Red.fastq.gz
Obviously, an extra Reads_ prefix was added to the folder specification.
I changed this to a single "Reads_" with:
sed -i.bak 's/Reads_Reads_/Reads_/' sorted_files.txt
But it would be good to fix this in the code itself.
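For anyone hitting the same thing, a quick sanity check after the sed (just a suggestion, not part of the pipeline) is to confirm that no doubled prefixes remain:
grep -c 'Reads_Reads_' sorted_files.txt    # should print 0 once the paths are fixed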
Hi,
while running runHAWK I'm getting the following error message:
Loading required package: iterators
Loading required package: parallel
[,1]
[1,] 1
[2,] 1
[3,] 1
[4,] 1
[5,] 1
[6,] 1
[7,] 1
[8,] 1
[9,] 1
[10,] 1
[11,] 0
[12,] 0
[13,] 0
[14,] 0
[15,] 0
[16,] 0
[17,] 0
[18,] 0
[19,] 0
[20,] 0
Error in read.table(con, nrow = CHUNK_SIZE) : no lines available in input
Execution halted
The origin of this error seems to be related to this line:
Rscript $hawkDir/log_reg_case.R
Looking at log_reg_case.R, the con variable reads from the file case_out_w_bonf_top.kmerDiff; however, this file seems to be OK, with the expected number of lines:
$ wc -l case_out_w_bonf_top.kmerDiff
200000 case_out_w_bonf_top.kmerDiff
$ head case_out_w_bonf_top.kmerDiff
AAAAAAAAAAAAAATAATAAACATTCGAAAA 8244 3442 0.000000e+00 560 430 553 1224 1295 795 742 1006 762 877 206 814 308 200 231 127 306 147 202 168
AAAAAAAAAAAAATAATAAACATTCGAAAAA 8946 3615 0.000000e+00 610 470 619 1311 1397 870 803 1123 806 937 215 890 326 200 233 127 309 167 203 175
AAAAAAAAAAAATAATAAACATTCGAAAAAC 9153 3691 0.000000e+00 649 468 639 1319 1398 883 816 1148 818 1015 218 900 352 202 231 129 305 198 200 170
AAAAAAAAAAAGTCTGCCTTTTCTCTGGAGA 14081 5252 0.000000e+00 2894 1042 748 2000 1252 1179 1022 1232 999 1713 1473 976 919 164 160 55 217 58 70 42
AAAAAAAAAAATAATAAACATTCGAAAAACA 11324 4550 0.000000e+00 855 536 784 1553 1489 1071 944 1453 1111 1528 284 1111 455 257 270 154 352 277 219 202
AAAAAAAAAAATTTTTTTAATGATACGGCGA 25637 14508 0.000000e+00 3813 1093 1501 5835 2198 1966 1683 1328 3183 3037 3704 2160 3139 611 620 88 359 332 239 166
AAAAAAAAAATAATAAACATTCGAAAAACAA 11531 4580 0.000000e+00 884 565 808 1586 1489 1073 924 1505 1126 1571 277 1134 479 253 268 152 339 281 218 204
AAAAAAAAAATTAAAGCTCCGCGGAGCATCG 4874 1644 0.000000e+00 206 277 486 465 660 502 184 659 731 704 229 559 117 124 36 51 83 27 14 54
AAAAAAAAAATTTTTTTAATGATACGGCGAC 41213 22332 0.000000e+00 6029 1621 2347 9495 3585 3250 2457 1962 5330 5137 5413 3433 4948 932 963 140 564 526 383 273
AAAAAAAAAATTTTTTTCAAGCAGAAGACGG 13056 6739 0.000000e+00 2036 556 780 2873 1111 1059 755 575 1732 1579 1713 1071 1654 211 252 24 115 108 75 81
Any ideas or workarounds for this issue?
Many thanks,
Pedro
Edit:
I think this is the same as #6, which suggests that it is actually OK to continue as long as the pvals_*_top.txt files have the same number of lines as the respective *_out_w_bonf_top.kmerDiff files.
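A quick way to apply that check, based on the workaround described in #6 (the assumption being that the p-value files and the top kmerDiff files should correspond line for line):
wc -l pvals_case_top.txt case_out_w_bonf_top.kmerDiff
wc -l pvals_control_top.txt control_out_w_bonf_top.kmerDiff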
I was wondering about the following test failures/skips. The modified version of Jellyfish 2 installed correctly without error, but when I ran make check, some tests were skipped or failed that, from what I understand, should not have been. Opening up those scripts, it seems to me these might be machine resource issues (not enough memory), so I'll try on a better machine. If anyone can suggest what the problem might be, I'd appreciate it. - Robert
make[3]: Entering directory '/home/ubuntu/HAWK/supplements/jellyfish-2.2.10'
PASS: tests/generate_sequence.sh
FAIL: tests/parallel_hashing.sh
PASS: tests/merge.sh
PASS: tests/bloom_filter.sh
SKIP: tests/big.sh
PASS: tests/subset_hashing.sh
PASS: tests/multi_file.sh
PASS: tests/bloom_counter.sh
FAIL: tests/large_key.sh
SKIP: tests/sam.sh
PASS: tests/small_mers.sh
SKIP: tests/swig_python.sh
SKIP: tests/swig_ruby.sh
SKIP: tests/swig_perl.sh
PASS: unit_tests/unit_tests.sh
make[4]: Entering directory '/home/ubuntu/HAWK/
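If it helps with debugging, and assuming the bundled Jellyfish uses the stock automake test harness, a single failing test can usually be re-run on its own and its per-test log inspected:
cd supplements/jellyfish-2.2.10
make check TESTS='tests/parallel_hashing.sh'    # re-run just the failing test
cat tests/parallel_hashing.log                  # per-test log written by the harness (exact name may differ)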
Hello,
I have two questions relating to HAWK.
./jellyfish/err.hpp:98:39: error: throw will always call terminate() [-Werror=terminate]
I have tried different compiler versions to no avail. However, installing the latest version of the non-HAWK jellyfish works fine. What are the disadvantages of not using the included jellyfish?
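In case it is useful, one possible workaround (untested here, and assuming the failure happens while building the bundled jellyfish with its standard autoconf setup) is to stop GCC from promoting that particular warning to an error:
./configure CXXFLAGS="-O3 -Wno-error=terminate"    # demote -Werror=terminate back to a plain warning
make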
Best wishes,
Jack
Okay, moving farther along in the pipeline, I'm having trouble getting smartpca to work properly (which I think is the first indication that I'm falling off the rails). Specifically, I get:
$ $hawkDir/smartpca -p $hawkDir/parfile.txt
... snip ....
## To get Tracy-Widom statistics: recompile smartpca with TWTAB correctly specified in Makefile, or
just run twstats (see README file in POPGEN directory)
kurtosis snps indivs
eigenvector 1 1.065 1.000
eigenvector 2 -nan -nan
population: 0 Case 2
population: 1 Control 0 ***
... more data ...
eigenvector 1:means
Case 0.000
Control 0.000
*** warning: bad anova popsizes too small
eigenvector 2:means
Case 0.000
Control 0.000
*** warning: bad anova popsizes too small
##end of smartpca run
Happy to figure out how to get copies of my gwas_eigenstratX*
files to you, but this is what they look like (in case the format of any of them seems way off):
$ head gwas_eigenstratX*
==> gwas_eigenstratX.geno <==
1 0
0 1
1 0
1 0
0 1
0 1
1 0
0 1
1 0
1 0
==> gwas_eigenstratX.ind <==
TSI_0 U Case
TSI_1 U Case
TSI_2 U Case
TSI_3 U Case
TSI_4 U Case
TSI_5 U Case
TSI_6 U Case
TSI_7 U Case
TSI_8 U Case
TSI_9 U Case
==> gwas_eigenstratX.snp <==
AAAAAAAAAAAAAAAAAAAAAAAGAGACCCC 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAACAATCAG 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAGATGTTGC 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAGGCGAACA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAACTTCTA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAACCAAACGA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAGCTCTTA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAACGAGGCT 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAGGGATGG 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAGCATACC 1 0.000000 0
==> gwas_eigenstratX.total <==
884693180
873243637
817020044
820305452
812869594
811915218
774600059
877385766
874686116
878343007
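The 'population: 1 Control 0' line above looks like smartpca is not seeing any Control individuals, so one thing worth checking (just a guess on my part) is how the case/control labels come out of the .ind file:
awk '{print $3}' gwas_eigenstratX.ind | sort | uniq -c    # count individuals per label in the third column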
HAWK has apparently run successfully on my data and I have an interesting case_kmers.fasta file that I could assemble with ABYSS.
However, no control_kmers.fasta file was produced.
Any thoughts on why and how to recover it?
Greetings! I encountered the following error message while running runHawk.
Loading required package: iterators
Loading required package: parallel
[,1]
[1,] 1
[2,] 0
[3,] 0
Error in { :
task 1 failed - "variable lengths differ (found for 'totals[, 1]')"
Calls: %dopar% -> <Anonymous>
Execution halted
It looks like it's related to one of the R scripts, but other than that the error message provides very few clues to assist with debugging. Do you have any idea what problem this message might indicate?
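'variable lengths differ' usually means two vectors of different lengths ended up in the same model fit, so one thing I would check (an assumption on my part about how the inputs line up) is whether the per-sample input files all have the same number of rows:
wc -l sorted_files.txt total_kmer_counts.txt gwas_info.txt    # these should presumably each have one line per sample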
Thanks!
Hello,
I'm trying to install HAWK 1.5.0. I downloaded it with git clone but cannot find any file that corresponds to hawk-X.Y.Z-beta.tar.
I tried unzipping the download and running make from the master folder, but none of the run files seem to be in the folder either. Could you please give some guidance?
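In case it helps, the rough workflow pieced together from other issues in this tracker (so a sketch, not official instructions) is to build directly from the clone and then copy the run scripts out of supplements:
git clone https://github.com/atifrahman/HAWK.git
cd HAWK
make                                      # other issues here mention running make in the main HAWK directory
cp supplements/runHawk supplements/countKmers /path/to/my/analysis/    # hypothetical target directory; edit the paths inside the copies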
I've been testing this out, and in the first step of the pipeline (using jellyfish) it processes a _kmers_sorted.txt file for each sample. These are quite large for me (12-18G), which is annoyingly large for 200 samples. If I gzip them, will subsequent steps in the pipeline still work?
Hello,
I tried running Hawk, but I got stuck when running the runHawk script. I reran the runHawk script with the -eux bash flags (bash -eux ./runHawk) in order to check what happens in each step and where the error occurs. The log file is here: HAWK_error.txt
The command line that fails is Rscript ~/HAWK/hawk-0.9.8-beta/log_reg_case.R. However, although ~/HAWK/hawk-0.9.8-beta/hawk 47 70 seems to work, I could verify that case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff are both empty, and it seems this is the reason the Rscript fails.
Could I have any help on this?
Thank you in advance.
Kind regards.
Hi,
I have 91 samples and it is taking too long (already 18 days) to run "hawk.out 42 49"; the hawk_out.txt file is still empty. I set noThread=10; what is wrong? The read coverage of each sample is about 8x and the genome size is 2.2 Gb.
Thank you!
Hello,
Would it be possible to provide a quickstart guide or some kind of basic tutorial that explains how to set up the file and directory structure? It is currently not very clear to me, at least.
For example, is the directory where my Illumina files (.fastq only, or .fastq.gz?) reside required to be named "Reads"? Do my files need to have the prefix "Reads"?
Apologies for the low level technical question.
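Until there is a tutorial, one pragmatic way to answer the naming question (a workaround, not official guidance) is to look at what the counting script itself expects:
grep -n 'Reads' supplements/countKmers    # see whether directory or file names are assumed to carry a Reads prefix
grep -n 'fastq' supplements/countKmers    # and whether it looks for .fastq, .fastq.gz, or both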
Hi,
how can I get the percentage of total k-mers that are significant (in case and control separately)? If I count lines in case_out_wo_bonf.kmerDiff and case_out_w_bonf.kmerDiff I get different numbers, and the file total_kmers.txt, I guess, contains the total sum but not separated by group.
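In case a rough number is enough, something along these lines would give a percentage, assuming total_kmers.txt really does hold the overall k-mer totals (its exact layout is a guess here, so the awk may need adjusting):
sig_case=$(wc -l < case_out_w_bonf.kmerDiff)
total=$(awk '{s+=$1} END{print s}' total_kmers.txt)     # assumption: summing the first column gives the overall total
echo "$sig_case $total" | awk '{printf "%.4f%%\n", 100*$1/$2}'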
Thanks
Hello,
I am trying to run HAWK 0.9.8 beta (downloaded from http://atifrahman.github.io/HAWK/) on bacterial strains where I do not have a reference genome. I managed to run HAWK 0.8.3 Beta on these strains, but I can't do it for the new HAWK version associated with the publication https://elifesciences.org/articles/32920. I am blocked when running EIG6.0.1-Hawk/bin/smartpca -V -p <hawk_dir>/parfile.txt. The main problem is that smartpca takes 3 input files:
genotypename: gwas_eigenstratX.geno
snpname: gwas_eigenstratX.snp
indivname: gwas_eigenstratX.ind
but the previous steps (k-mer counting with the script supplements/countKmers and the lines before invoking smartpca in the script supplements/runHawk) did not produce gwas_eigenstratX.geno and gwas_eigenstratX.snp, only gwas_eigenstratX.ind. Since these files do not exist, smartpca fails, and since all 3 input files are required, I cannot run it. I also did not find any documentation in this regard.
Could you point me in the right direction on how to solve this?
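One way to narrow this down (only a suggestion) is to find which step of the pipeline is supposed to write the missing files and whether it ran at all:
grep -n 'eigenstratX' supplements/runHawk supplements/countKmers    # locate the commands that should produce the .geno and .snp inputs
ls -l gwas_eigenstratX.*                                            # see which of the three inputs actually exist and their sizes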
Thank you in advance.
I ran runHawk and the output fasta (below) was not a FASTA file. There are numbers where there should be sequences. Is it supposed to look like this?
0.000000e+00
65
0.000000e+00
47
0.000000e+00
36
0.000000e+00
40
0.000000e+00
47
0.000000e+00
787
0.000000e+00
45
0.000000e+00
36
0.000000e+00
89
0.000000e+00
49
0.000000e+00
813
0.000000e+00
43
0.000000e+00
39
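For comparison, the k-mer FASTA files described in other issues here look like ordinary FASTA (a '>' header line followed by a 31-mer), so a quick check would be:
head -n 4 case_kmers.fasta        # expect alternating '>' header lines and sequence lines
grep -c '^>' case_kmers.fasta     # number of k-mer records, as counted in other issues here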
Hi Atif,
I'm running through the pipeline with the Earle bacterial data set to make sure I understand all the steps and what they're doing, and when I get to the runHawk stage, the case_out_w_bonf.kmerDiff and case_out_wo_bonf.kmerDiff files have the exact same number of lines:
$ wc -l earle/case_out_w_bonf.kmerDiff
1505783 earle/case_out_w_bonf.kmerDiff
$ wc -l earle/case_out_wo_bonf.kmerDiff
1505783 earle/case_out_wo_bonf.kmerDiff
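One way to see whether the Bonferroni filtering actually did anything (assuming the fourth column of the kmerDiff files is the p-value, as in the excerpts posted in other issues) is to compare the largest p-value in each file:
awk 'BEGIN{m=0} $4>m{m=$4} END{print m}' earle/case_out_w_bonf.kmerDiff
awk 'BEGIN{m=0} $4>m{m=$4} END{print m}' earle/case_out_wo_bonf.kmerDiff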
Any guesses what might be going on here? Thanks!
Hello! I am currently getting this error message while executing runHawk (the complete log is in the attached file).
(base) user@DESKTOP-NOLVR42:/mnt/e/TestHAWK/Test2MG/Gwas2$ ./runHawk
fatalx:
(putgtypes)
./runHawk: line 30: 32451 Aborted (core dumped) $eigenstratDir/bin/smartpca -p $hawkDir/parfile.txt > log_eigen.txt
Error while null model optimizing at kmer 0 to 0
singularity error : 0 nan error : 1
Null model optimization has exited early due to singularity error
Achived error 0.0625 , Exited at iteration 0
Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error 0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 0 at thread 0
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 1 at thread 1
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 5 at thread 5
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 4 at thread 4
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 2 at thread 2
alt model : 1 1 1 1
This always happens while "log_reg_case" is being executed.
I am using the jellyfish version present in the supplements folder, with 4 samples: two cases and two controls.
I would appreciate some help.
I'm trying to get HAWK to work on a pooled sample (data akin to that in Tehranchi et al., 2016, Cell 165: 730-741, but that's neither here nor there), but the version of Jellyfish provided doesn't want to work with my reads. Specifically, it very quickly gives me the error:
$ ~/Downloads/HAWK/supplements/jellyfish-Hawk/bin/jellyfish count -C -m 31 -t 30 -s 40 <( pigz -dc *.fastq.gz )
terminate called after throwing an instance of 'SquareBinaryMatrix::SingularMatrix'
what(): Matrix is singular
However, the official release of jellyfish (v1.1.12) works just fine (or at least takes longer to error out than the couple of minutes I've waited so far). Any idea what's going on? I'm running on Ubuntu 18.04 LTS, if that makes a difference...
Hi,
Thanks for this exciting repository.
I was using the runHAWK script on some of my own data and on the provided E. coli examples. I run into the same error on both data sets, at the loop https://github.com/atifrahman/HAWK/blob/master/log_reg_case.R#L66-L109
It always exits on
Error in read.table(con, nrow = CHUNK_SIZE) : no lines available in input
Execution halted
However, the files pvals_case_top.txt and pvals_control_top.txt do have 200,000 lines each, so I assume I am OK to continue despite this error message?
wc -l pvals_control_top.txt
200000 pvals_control_top.txt
wc -l pvals_case_top.txt
200000 pvals_case_top.txt
If so, I have another query. I am not sure if the execution of convertToFasta is correct. After running runHAWK, case_kmers.fasta contained 5664 k-mers while control_kmers.fasta contained 0. Is this correct? I didn't receive any error messages suggesting that it terminated prematurely, and the rm commands at the end of runHAWK were executed.
grep -c '^>' case_kmers.fasta
5664
grep -c '^>' control_kmers.fasta
0
These numbers don't seem to be in line with what is significant in the two pvals_*_top_merged_sorted.txt files.
awk '$1 < 0.05' pvals_case_top_merged_sorted.txt | wc -l
181402
awk '$1 < 0.05' pvals_control_top_merged_sorted.txt | wc -l
144012
Finally, is there a reason that only the top 200,000 kmers from each *out_w_bonf_sorted.kmerDiff are used? Would there ever be a situation where you think this should be increased or decreased?
Thanks in advance for your help!
Hello- thanks for developing this very intriguing software.
I'm currently exploring its use for several resequenced genomes (~1 GB each). I'm curious what the expected runtimes are. Currently, I am running it in a shared/HPC configuration, and when running the runHAWK script I am unsure what the wall time/memory requirements are. Any pointers or benchmarks?
Second, it is not immediately clear what the values output by the HAWK.cpp program indicate; they are many orders of magnitude larger than the total number of k-mers in my dataset, and I cannot infer what they correspond to. Some clarity on this would help me benchmark the program on my own data.
Hi again,
Once I worked through the previous issues with my test data, I ramped up to run a full analysis on 266 samples (140 cases and 126 controls). Everything was running smoothly and it produced all expected output (including significant hits this time!), but I was printing stdout/stderr to a file and discovered the following in that file:
terminate called after throwing an instance of 'alglib::ap_error'
./runHawk: line 53: 33265 Aborted (core dumped) $hawkDir/log_reg_case.out -t $noThread $sex_confounder_arg -p $noPC > pvals_case_top.txt
terminate called after throwing an instance of 'alglib::ap_error'
./runHawk: line 53: 35031 Aborted (core dumped) $hawkDir/log_reg_control.out -t $noThread $sex_confounder_arg -p $noPC > pvals_control_top.txt
As I said, all expected output files were generated regardless of this error, but I'm concerned that those results might not be trustworthy if something failed during the run. Can you provide any insight on this?
I'm running on a Linux machine with 64 CPUs and 252 GB of RAM, and running the exact same code that previously worked (without error) for my ~20-sample test dataset. I haven't modified anything for threading, and according to htop it looks like HAWK is running on 16 threads by default on this system.
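One way to see whether the aborted log_reg runs truncated anything (a sanity check, not a guarantee that the results are sound) is to compare the number of p-values written against the number of k-mers that went in:
wc -l pvals_case_top.txt case_out_w_bonf_top.kmerDiff
wc -l pvals_control_top.txt control_out_w_bonf_top.kmerDiff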
Thanks!
Hi,
I git clone'd the most recent HAWK repository (1.5.0 BETA) and was able to get the modified version of jellyfish installed and run for my samples. I am at the runHawk step; after copying that script to my scratch space, modifying it, and carrying out make in the main HAWK directory, I have tried to run it as described in the readme file. I am consistently getting an error almost immediately:
./runHawk: line 15: 63826 Segmentation fault (core dumped) $hawkDir/preProcess.out
The number after line 15 changes, but the segmentation fault always occurs at this step. The program continues running and starts to work on the hawk.out step, given what I'm seeing in htop, but I wanted to see if the preProcess error can be fixed (or if it should be ignored, since the program continues to run)?
Thanks!
runHAWK, sorted_files.txt, total_kmer_counts.txt, and gwas_info.txt are in the same directory. gwas_info.txt contains
Hyp38749 M Control
Hyp38750 M Control
Hyp38760 F Case
Hyp38874 F Case
Hyp38883 F Case
When I execute './runHAWK', it seg faults on '/home/roex0050/bin/HAWK-0.9.8-beta/hawk 3 2' with the message:
Hyp38760 file doesn't exist
F file doesn't exist
Case file doesn't exist
I am running the 0.9.8 beta release on Ubuntu 18.04.4 LTS.
Did I configure something wrong maybe?
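The 'file doesn't exist' messages echo the columns of gwas_info.txt, which makes me suspect hawk is reading past the end of sorted_files.txt. The first things I would check (my assumption about the expected inputs) are that the per-sample files have one line per sample and that every path listed in sorted_files.txt exists:
wc -l sorted_files.txt gwas_info.txt total_kmer_counts.txt
while read -r f; do [ -e "$f" ] || echo "missing: $f"; done < sorted_files.txt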
Hi,
I'm running into an error when calling the log_reg_case.out program.
$ $hawkDir/log_reg_case.out -t $noThread $sex_confounder_arg -p $noPC > pvals_case_top.txt
terminate called after throwing an instance of 'alglib::ap_error'
Aborted
The values of the variables set above are: noThread=1, noPC=2, and sex_confounder_arg is not set.
The number of lines in the (truncated) output is much less than those in the original sorted files:
$ wc -l pvals_case_top.txt
790000 pvals_case_top.txt
$ wc -l case_out_w_bonf_top.kmerDiff
107759269 case_out_w_bonf_top.kmerDiff
Any ideas on how to fix this?
Many thanks,
Pedro
Hi,
is it possible to use a different k-mer size with HAWK? I see this definition in kmer.h:
#define KMER_LENGTH 31
and was wondering whether changing the value and compiling HAWK again will suffice, or whether there are other internals that depend on a k-mer size of 31.
Thanks,
Pedro
Edit:
other lines that seem to need changing are int kmerLength=31; in the files kmersearch.cpp (L23), kmersummary.cpp (L25), bonf_fasta.cpp (L25) and hawk.cpp (L27). I can change all these lines for a different k-mer size, but my initial question remains: will changing the values and compiling HAWK again suffice, or are there other internals that depend on a k-mer size of 31?
Edit 2:
And apparently there are also a few cases in hawk.cpp where one would need to change the calls to getKmer() in order to pass the correct value.
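For what it's worth, the constant edits listed above could be scripted; this only covers the hard-coded 31s identified in this thread and not the getKmer() call sites mentioned in Edit 2, so treat it as a starting point rather than a supported option:
NEWK=25    # example value only; whether anything other than 31 actually works is the open question
sed -i "s/#define KMER_LENGTH 31/#define KMER_LENGTH $NEWK/" kmer.h
sed -i "s/int kmerLength=31;/int kmerLength=$NEWK;/" kmersearch.cpp kmersummary.cpp bonf_fasta.cpp hawk.cpp
make clean && make    # assuming the Makefile has a clean target; otherwise just re-run make
# the jellyfish counting step would also need a matching -m value (jellyfish count -m $NEWK ...)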