atifrahman / hawk
Hitting associations with k-mers
License: GNU General Public License v3.0
After running countKmers, I found that the file sorted_files.txt had incorrect specification of the sample folders.
The sample folders were indicated as Reads_catrim.Hcor_SM05L02_Red.fastq.gz, etc.
The sorted_files.txt file included lines like this:
/Hcor/Reads_Reads_catrim.Hcor_SM05L02_Red.fastq.gz/Reads_catrim.Hcor_SM05L02_Red.fastq.gz
Obviously, an extra Reads_ prefix was added to the folder specification.
I changed this to a single "Reads_" with:
sed -i.bak 's/Reads_Reads_/Reads_/' sorted_files.txt
But it would be good to fix this in the code itself.
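For anyone hitting the same thing, a quick sanity check after the sed (just a suggestion, not part of the pipeline) is to confirm that no doubled prefixes remain:
grep -c 'Reads_Reads_' sorted_files.txt    # should print 0 once the paths are fixed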
Hi,
while running runHAWK I'm getting the following error message:
Loading required package: iterators
Loading required package: parallel
[,1]
[1,] 1
[2,] 1
[3,] 1
[4,] 1
[5,] 1
[6,] 1
[7,] 1
[8,] 1
[9,] 1
[10,] 1
[11,] 0
[12,] 0
[13,] 0
[14,] 0
[15,] 0
[16,] 0
[17,] 0
[18,] 0
[19,] 0
[20,] 0
Error in read.table(con, nrow = CHUNK_SIZE) : no lines available in input
Execution halted
The origin of this error seems to be related to this line:
Rscript $hawkDir/log_reg_case.R
Looking at log_reg_case.R, the con variable reads from the file case_out_w_bonf_top.kmerDiff; however, this file seems to be OK, with the expected number of lines:
$ wc -l case_out_w_bonf_top.kmerDiff
200000 case_out_w_bonf_top.kmerDiff
$ head case_out_w_bonf_top.kmerDiff
AAAAAAAAAAAAAATAATAAACATTCGAAAA 8244 3442 0.000000e+00 560 430 553 1224 1295 795 742 1006 762 877 206 814 308 200 231 127 306 147 202 168
AAAAAAAAAAAAATAATAAACATTCGAAAAA 8946 3615 0.000000e+00 610 470 619 1311 1397 870 803 1123 806 937 215 890 326 200 233 127 309 167 203 175
AAAAAAAAAAAATAATAAACATTCGAAAAAC 9153 3691 0.000000e+00 649 468 639 1319 1398 883 816 1148 818 1015 218 900 352 202 231 129 305 198 200 170
AAAAAAAAAAAGTCTGCCTTTTCTCTGGAGA 14081 5252 0.000000e+00 2894 1042 748 2000 1252 1179 1022 1232 999 1713 1473 976 919 164 160 55 217 58 70 42
AAAAAAAAAAATAATAAACATTCGAAAAACA 11324 4550 0.000000e+00 855 536 784 1553 1489 1071 944 1453 1111 1528 284 1111 455 257 270 154 352 277 219 202
AAAAAAAAAAATTTTTTTAATGATACGGCGA 25637 14508 0.000000e+00 3813 1093 1501 5835 2198 1966 1683 1328 3183 3037 3704 2160 3139 611 620 88 359 332 239 166
AAAAAAAAAATAATAAACATTCGAAAAACAA 11531 4580 0.000000e+00 884 565 808 1586 1489 1073 924 1505 1126 1571 277 1134 479 253 268 152 339 281 218 204
AAAAAAAAAATTAAAGCTCCGCGGAGCATCG 4874 1644 0.000000e+00 206 277 486 465 660 502 184 659 731 704 229 559 117 124 36 51 83 27 14 54
AAAAAAAAAATTTTTTTAATGATACGGCGAC 41213 22332 0.000000e+00 6029 1621 2347 9495 3585 3250 2457 1962 5330 5137 5413 3433 4948 932 963 140 564 526 383 273
AAAAAAAAAATTTTTTTCAAGCAGAAGACGG 13056 6739 0.000000e+00 2036 556 780 2873 1111 1059 755 575 1732 1579 1713 1071 1654 211 252 24 115 108 75 81
Any ideas or workarounds for this issue?
Many thanks,
Pedro
Edit:
I think this is the same as #6, which suggests that it is actually OK to continue as long as the pvals_*_top.txt files have the same number of lines as the respective *_out_w_bonf_top.kmerDiff files.
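A quick way to apply that check, based on the workaround described in #6 (the assumption being that the p-value files and the top kmerDiff files should correspond line for line):
wc -l pvals_case_top.txt case_out_w_bonf_top.kmerDiff
wc -l pvals_control_top.txt control_out_w_bonf_top.kmerDiff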
I was wondering about the following test failures/skips. The modified version of Jellyfish 2 installed correctly without error, but when I ran make check, some tests were skipped or failed that, from what I understand, should not have been. Opening up those scripts, it seems to me these might be machine resource issues (not enough memory), so I'll try on a better machine. If anyone can suggest what the problem might be, I'd appreciate it. - Robert
make[3]: Entering directory '/home/ubuntu/HAWK/supplements/jellyfish-2.2.10'
PASS: tests/generate_sequence.sh
FAIL: tests/parallel_hashing.sh
PASS: tests/merge.sh
PASS: tests/bloom_filter.sh
SKIP: tests/big.sh
PASS: tests/subset_hashing.sh
PASS: tests/multi_file.sh
PASS: tests/bloom_counter.sh
FAIL: tests/large_key.sh
SKIP: tests/sam.sh
PASS: tests/small_mers.sh
SKIP: tests/swig_python.sh
SKIP: tests/swig_ruby.sh
SKIP: tests/swig_perl.sh
PASS: unit_tests/unit_tests.sh
make[4]: Entering directory '/home/ubuntu/HAWK/
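If it helps with debugging, and assuming the bundled Jellyfish uses the stock automake test harness, a single failing test can usually be re-run on its own and its per-test log inspected:
cd supplements/jellyfish-2.2.10
make check TESTS='tests/parallel_hashing.sh'    # re-run just the failing test
cat tests/parallel_hashing.log                  # per-test log written by the harness (exact name may differ)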
Hello,
I have two questions relating to HAWK.
./jellyfish/err.hpp:98:39: error: throw will always call terminate() [-Werror=terminate]
I have tried different compiler versions to no avail. However, installing the latest version of the non-HAWK jellyfish works fine. What are the disadvantages of not using the included jellyfish?
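In case it is useful, one possible workaround (untested here, and assuming the failure happens while building the bundled jellyfish with its standard autoconf setup) is to stop GCC from promoting that particular warning to an error:
./configure CXXFLAGS="-O3 -Wno-error=terminate"    # demote -Werror=terminate back to a plain warning
make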
Best wishes,
Jack
Okay, moving farther along in the pipeline, I'm having trouble getting smartpca to work properly (which I think is the first indication that I'm falling off the rails). Specifically, I get:
$ $hawkDir/smartpca -p $hawkDir/parfile.txt
... snip ....
## To get Tracy-Widom statistics: recompile smartpca with TWTAB correctly specified in Makefile, or
just run twstats (see README file in POPGEN directory)
kurtosis snps indivs
eigenvector 1 1.065 1.000
eigenvector 2 -nan -nan
population: 0 Case 2
population: 1 Control 0 ***
... more data ...
eigenvector 1:means
Case 0.000
Control 0.000
*** warning: bad anova popsizes too small
eigenvector 2:means
Case 0.000
Control 0.000
*** warning: bad anova popsizes too small
##end of smartpca run
Happy to figure out how to get copies of my gwas_eigenstratX*
files to you, but this is what they look like (in case the format of any of them seems way off):
$ head gwas_eigenstratX*
==> gwas_eigenstratX.geno <==
1 0
0 1
1 0
1 0
0 1
0 1
1 0
0 1
1 0
1 0
==> gwas_eigenstratX.ind <==
TSI_0 U Case
TSI_1 U Case
TSI_2 U Case
TSI_3 U Case
TSI_4 U Case
TSI_5 U Case
TSI_6 U Case
TSI_7 U Case
TSI_8 U Case
TSI_9 U Case
==> gwas_eigenstratX.snp <==
AAAAAAAAAAAAAAAAAAAAAAAGAGACCCC 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAACAATCAG 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAGATGTTGC 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAGGCGAACA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAACTTCTA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAACCAAACGA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAGCTCTTA 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAACGAGGCT 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAGGGATGG 1 0.000000 0
AAAAAAAAAAAAAAAAAAAAAAAAGCATACC 1 0.000000 0
==> gwas_eigenstratX.total <==
884693180
873243637
817020044
820305452
812869594
811915218
774600059
877385766
874686116
878343007
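The 'population: 1 Control 0' line above looks like smartpca is not seeing any Control individuals, so one thing worth checking (just a guess on my part) is how the case/control labels come out of the .ind file:
awk '{print $3}' gwas_eigenstratX.ind | sort | uniq -c    # count individuals per label in the third column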
HAWK has apparently run successfully on my data and I have an interesting case_kmers.fasta file that I could assemble with ABYSS.
However, no control_kmers.fasta file was produced.
Any thoughts on why and how to recover it?
Greetings! I encountered the following error message while running runHawk.
Loading required package: iterators
Loading required package: parallel
[,1]
[1,] 1
[2,] 0
[3,] 0
Error in { :
task 1 failed - "variable lengths differ (found for 'totals[, 1]')"
Calls: %dopar% -> <Anonymous>
Execution halted
It looks like it's related to one of the R scripts, but other than that the error message provides very few clues to assist with debugging. Do you have any idea what problem this message might indicate?
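'variable lengths differ' usually means two vectors of different lengths ended up in the same model fit, so one thing I would check (an assumption on my part about how the inputs line up) is whether the per-sample input files all have the same number of rows:
wc -l sorted_files.txt total_kmer_counts.txt gwas_info.txt    # these should presumably each have one line per sample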
Thanks!
Hello,
I'm trying to install HAWK 1.5.0. I downloaded it with git clone but cannot find any file that corresponds to hawk-X.Y.Z-beta.tar.
I tried unzipping the download and running make from the master folder, but none of the run files seem to be in the folder either. Could you please give some guidance?
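In case it helps, the rough workflow pieced together from other issues in this tracker (so a sketch, not official instructions) is to build directly from the clone and then copy the run scripts out of supplements:
git clone https://github.com/atifrahman/HAWK.git
cd HAWK
make                                      # other issues here mention running make in the main HAWK directory
cp supplements/runHawk supplements/countKmers /path/to/my/analysis/    # hypothetical target directory; edit the paths inside the copies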
I've been testing this out, and in the first step of the pipeline (using jellyfish) it processes a _kmers_sorted.txt file for each sample. These are quite large for me (12-18G), which is annoyingly large for 200 samples. If I gzip them, will subsequent steps in the pipeline still work?
Hello,
I tried running Hawk, but I got stuck when running the runHawk script. I reran the runHawk script with the -eux bash flags (bash -eux ./runHawk) in order to check what happens in each step and where the error occurs. The log file is here: HAWK_error.txt
The command line that fails is Rscript ~/HAWK/hawk-0.9.8-beta/log_reg_case.R. However, although ~/HAWK/hawk-0.9.8-beta/hawk 47 70 seems to work, I could verify that case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff are both empty, and it seems this is the reason the Rscript fails.
Could I have any help on this?
Thank you in advance.
Kind regards.
Hi,
I have 91 samples and it is taking too long (already 18 days) to run "hawk.out 42 49"; the hawk_out.txt file is still empty. I set noThread=10; what is wrong? The read coverage of each sample is about 8x and the genome size is 2.2 Gb.
Thank you!
Hello,
Would it be possible to provide a quickstart guide or some kind of basic tutorial that explains how to set up the file and directory structure? It is currently not very clear to me, at least.
For example, is the directory where my Illumina files (.fastq only, or .fastq.gz?) reside required to be named "Reads"? Do my files need to have the prefix "Reads"?
Apologies for the low level technical question.
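Until there is a tutorial, one pragmatic way to answer the naming question (a workaround, not official guidance) is to look at what the counting script itself expects:
grep -n 'Reads' supplements/countKmers    # see whether directory or file names are assumed to carry a Reads prefix
grep -n 'fastq' supplements/countKmers    # and whether it looks for .fastq, .fastq.gz, or both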
Hi,
how can I get the percentage of total k-mers that are significant (in case and control separately)? If I count lines in case_out_wo_bonf.kmerDiff and case_out_w_bonf.kmerDiff I get different numbers, and the file total_kmers.txt, I guess, contains the total sum but not separated by group.
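In case a rough number is enough, something along these lines would give a percentage, assuming total_kmers.txt really does hold the overall k-mer totals (its exact layout is a guess here, so the awk may need adjusting):
sig_case=$(wc -l < case_out_w_bonf.kmerDiff)
total=$(awk '{s+=$1} END{print s}' total_kmers.txt)     # assumption: summing the first column gives the overall total
echo "$sig_case $total" | awk '{printf "%.4f%%\n", 100*$1/$2}'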
Thanks
Hello,
I am trying to run HAWK 0.9.8 beta (downloaded from http://atifrahman.github.io/HAWK/) on bacterial strains where I do not have a reference genome. I managed to run HAWK 0.8.3 Beta on these strains, but I can't do it for the new HAWK version associated with the publication https://elifesciences.org/articles/32920. I am blocked when running EIG6.0.1-Hawk/bin/smartpca -V -p <hawk_dir>/parfile.txt. The main problem is that smartpca takes 3 input files:
genotypename: gwas_eigenstratX.geno
snpname: gwas_eigenstratX.snp
indivname: gwas_eigenstratX.ind
but the previous steps (k-mer counting with the script supplements/countKmers and the lines before invoking smartpca in the script supplements/runHawk) did not produce gwas_eigenstratX.geno and gwas_eigenstratX.snp, only gwas_eigenstratX.ind. Since these files do not exist, smartpca fails, and since all 3 input files are required, I cannot run it. I also did not find any documentation in this regard.
Could you point me in the right direction on how to solve this?
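One way to narrow this down (only a suggestion) is to find which step of the pipeline is supposed to write the missing files and whether it ran at all:
grep -n 'eigenstratX' supplements/runHawk supplements/countKmers    # locate the commands that should produce the .geno and .snp inputs
ls -l gwas_eigenstratX.*                                            # see which of the three inputs actually exist and their sizes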
Thank you in advance.
I ran runHawk and the output fasta (below) was not a FASTA file. There are numbers where there should be sequences. Is it supposed to look like this?
0.000000e+00
65
0.000000e+00
47
0.000000e+00
36
0.000000e+00
40
0.000000e+00
47
0.000000e+00
787
0.000000e+00
45
0.000000e+00
36
0.000000e+00
89
0.000000e+00
49
0.000000e+00
813
0.000000e+00
43
0.000000e+00
39
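For comparison, the k-mer FASTA files described in other issues here look like ordinary FASTA (a '>' header line followed by a 31-mer), so a quick check would be:
head -n 4 case_kmers.fasta        # expect alternating '>' header lines and sequence lines
grep -c '^>' case_kmers.fasta     # number of k-mer records, as counted in other issues here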
Hi Atif,
I'm running through the pipeline with the Earle bacterial data set to make sure I understand all the steps and what they're doing, and when I get to the runHawk stage, the case_out_w_bonf.kmerDiff and case_out_wo_bonf.kmerDiff files have the exact same number of lines:
$ wc -l earle/case_out_w_bonf.kmerDiff
1505783 earle/case_out_w_bonf.kmerDiff
$ wc -l earle/case_out_wo_bonf.kmerDiff
1505783 earle/case_out_wo_bonf.kmerDiff
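One way to see whether the Bonferroni filtering actually did anything (assuming the fourth column of the kmerDiff files is the p-value, as in the excerpts posted in other issues) is to compare the largest p-value in each file:
awk 'BEGIN{m=0} $4>m{m=$4} END{print m}' earle/case_out_w_bonf.kmerDiff
awk 'BEGIN{m=0} $4>m{m=$4} END{print m}' earle/case_out_wo_bonf.kmerDiff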
Any guesses what might be going on here? Thanks!
Hello! I am currently getting this error message while executing runHawk (the complete log is in the attached file).
(base) user@DESKTOP-NOLVR42:/mnt/e/TestHAWK/Test2MG/Gwas2$ ./runHawk
fatalx:
(putgtypes)
./runHawk: line 30: 32451 Aborted (core dumped) $eigenstratDir/bin/smartpca -p $hawkDir/parfile.txt > log_eigen.txt
Error while null model optimizing at kmer 0 to 0
singularity error : 0 nan error : 1
Null model optimization has exited early due to singularity error
Achived error 0.0625 , Exited at iteration 0
Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error Error while alt model optimizing at kmer 0 to 416
singularity error : 0 nan error : 1
Alt model optimization has exited early due to singularity error
Achived error 0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
0.0625 , Exited at iteration 0
Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 0 at thread 0
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 1 at thread 1
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 5 at thread 5
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 4 at thread 4
alt model : Nan error happend or log_likelihood_ratio is <0 while estimating log_liklelihood_ratio at kmer no. 2 at thread 2
alt model : 1 1 1 1
This always happens while "log_reg_case" is being executed.
I am using the jellyfish version present in the supplements folder, with 4 samples: two cases and two controls.
I would appreciate some help.
I'm trying to get HAWK to work on a pooled sample (data akin to that in Tehranchi et al., 2016, Cell 165: 730-741, but that's neither here nor there), but the version of Jellyfish provided doesn't want to work with my reads. Specifically, it very quickly gives me the error:
$ ~/Downloads/HAWK/supplements/jellyfish-Hawk/bin/jellyfish count -C -m 31 -t 30 -s 40 <( pigz -dc *.fastq.gz )
terminate called after throwing an instance of 'SquareBinaryMatrix::SingularMatrix'
what(): Matrix is singular
However, the official release of jellyfish (v1.1.12) works just fine (or at least takes longer to error out than the couple of minutes I've waited so far). Any idea what's going on? I'm running on Ubuntu 18.04 LTS, if that makes a difference...
Hi,
Thanks for this exciting repository.
I was using the runHAWK script on some of my own data and on the provided E. coli examples. I run into the same error on both data sets, at the loop https://github.com/atifrahman/HAWK/blob/master/log_reg_case.R#L66-L109
It always exits on
Error in read.table(con, nrow = CHUNK_SIZE) : no lines available in input
Execution halted
However, the files pvals_case_top.txt and pvals_control_top.txt do have 200,000 lines each, so I assume I am OK to continue despite this error message?
wc -l pvals_control_top.txt
200000 pvals_control_top.txt
wc -l pvals_case_top.txt
200000 pvals_case_top.txt
If so, I have another query. I am not sure if the execution of convertToFasta is correct. After running runHAWK, case_kmers.fasta contained 5664 k-mers while control_kmers.fasta contained 0. Is this correct? I didn't receive any error messages suggesting that it terminated prematurely, and the rm commands at the end of runHAWK were executed.
grep -c '^>' case_kmers.fasta
5664
grep -c '^>' control_kmers.fasta
0
These numbers don't seem to be in line with what is significant in the two pvals_*_top_merged_sorted.txt files.
awk '$1 < 0.05' pvals_case_top_merged_sorted.txt | wc -l
181402
awk '$1 < 0.05' pvals_control_top_merged_sorted.txt | wc -l
144012
Finally, is there a reason that only the top 200,000 kmers from each *out_w_bonf_sorted.kmerDiff are used? Would there ever be a situation where you think this should be increased or decreased?
Thanks in advance for your help!
Hello- thanks for developing this very intriguing software.
I'm currently exploring its use for several resequenced genomes (~1 GB each). I'm curious what the expected runtimes are. Currently, I am running it in a shared/HPC configuration, and when running the runHAWK script I am unsure what the wall time/memory requirements are. Any pointers or benchmarks?
Second, it is not immediately clear what the values output by the HAWK.cpp program indicate; they are many orders of magnitude larger than the total number of k-mers in my dataset, and I cannot infer what they correspond to. Some clarity on this would help me benchmark the program on my own data.
Hi again,
Once I worked through the previous issues with my test data, I ramped up to run a full analysis on 266 samples (140 cases and 126 controls). Everything was running smoothly and it produced all expected output (including significant hits this time!), but I was printing stdout/stderr to a file and discovered the following in that file:
terminate called after throwing an instance of 'alglib::ap_error'
./runHawk: line 53: 33265 Aborted (core dumped) $hawkDir/log_reg_case.out -t $noThread $sex_confounder_arg -p $noPC > pvals_case_top.txt
terminate called after throwing an instance of 'alglib::ap_error'
./runHawk: line 53: 35031 Aborted (core dumped) $hawkDir/log_reg_control.out -t $noThread $sex_confounder_arg -p $noPC > pvals_control_top.txt
As I said, all expected output files were generated regardless of this error, but I'm concerned that those results might not be trustworthy if something failed during the run. Can you provide any insight on this?
I'm running on a Linux machine with 64 CPUs and 252 GB of RAM, and running the exact same code that previously worked (without error) for my ~20-sample test dataset. I haven't modified anything for threading, and according to htop it looks like HAWK is running on 16 threads by default on this system.
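One way to see whether the aborted log_reg runs truncated anything (a sanity check, not a guarantee that the results are sound) is to compare the number of p-values written against the number of k-mers that went in:
wc -l pvals_case_top.txt case_out_w_bonf_top.kmerDiff
wc -l pvals_control_top.txt control_out_w_bonf_top.kmerDiff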
Thanks!
Hi,
I git clone'd the most recent HAWK repository (1.5.0 BETA) and was able to get the modified version of jellyfish installed and run for my samples. I am at the runHawk step; after copying that script to my scratch space, modifying it, and carrying out make in the main HAWK directory, I have tried to run it as described in the readme file. I am consistently getting an error almost immediately:
./runHawk: line 15: 63826 Segmentation fault (core dumped) $hawkDir/preProcess.out
The number after line 15 changes, but the segmentation fault always occurs at this step. The program continues running and starts to work on the hawk.out step, given what I'm seeing in htop, but I wanted to see if the preProcess error can be fixed (or if it should be ignored, since the program continues to run)?
Thanks!
runHAWK, sorted_files.txt, total_kmer_counts.txt, and gwas_info.txt are in the same directory. gwas_info.txt contains
Hyp38749 M Control
Hyp38750 M Control
Hyp38760 F Case
Hyp38874 F Case
Hyp38883 F Case
When I execute './runHAWK', it seg faults on '/home/roex0050/bin/HAWK-0.9.8-beta/hawk 3 2' with the message:
Hyp38760 file doesn't exist
F file doesn't exist
Case file doesn't exist
I am running the 0.9.8 beta release on Ubuntu 18.04.4 LTS.
Did I configure something wrong maybe?
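The 'file doesn't exist' messages echo the columns of gwas_info.txt, which makes me suspect hawk is reading past the end of sorted_files.txt. The first things I would check (my assumption about the expected inputs) are that the per-sample files have one line per sample and that every path listed in sorted_files.txt exists:
wc -l sorted_files.txt gwas_info.txt total_kmer_counts.txt
while read -r f; do [ -e "$f" ] || echo "missing: $f"; done < sorted_files.txt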
Hi,
I'm running into an error when calling the log_reg_case.out program.
$ $hawkDir/log_reg_case.out -t $noThread $sex_confounder_arg -p $noPC > pvals_case_top.txt
terminate called after throwing an instance of 'alglib::ap_error'
Aborted
The values of the variables set above are: noThread=1, noPC=2, and sex_confounder_arg is not set.
The number of lines in the (truncated) output is much less than those in the original sorted files:
$ wc -l pvals_case_top.txt
790000 pvals_case_top.txt
$ wc -l case_out_w_bonf_top.kmerDiff
107759269 case_out_w_bonf_top.kmerDiff
Any ideas on how to fix this?
Many thanks,
Pedro
Hi,
is it possible to use a different k-mer size with HAWK? I see this definition in kmer.h:
#define KMER_LENGTH 31
and was wondering whether changing the value and compiling HAWK again will suffice, or whether there are other internals that depend on a k-mer size of 31.
Thanks,
Pedro
Edit:
other lines that seem to need changing are int kmerLength=31; in the files kmersearch.cpp (L23), kmersummary.cpp (L25), bonf_fasta.cpp (L25) and hawk.cpp (L27). I can change all these lines for a different k-mer size, but my initial question remains: will changing the values and compiling HAWK again suffice, or are there other internals that depend on a k-mer size of 31?
Edit 2:
And apparently there are also a few cases in hawk.cpp where one would need to change the calls to getKmer() in order to pass the correct value.
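For what it's worth, the constant edits listed above could be scripted; this only covers the hard-coded 31s identified in this thread and not the getKmer() call sites mentioned in Edit 2, so treat it as a starting point rather than a supported option:
NEWK=25    # example value only; whether anything other than 31 actually works is the open question
sed -i "s/#define KMER_LENGTH 31/#define KMER_LENGTH $NEWK/" kmer.h
sed -i "s/int kmerLength=31;/int kmerLength=$NEWK;/" kmersearch.cpp kmersummary.cpp bonf_fasta.cpp hawk.cpp
make clean && make    # assuming the Makefile has a clean target; otherwise just re-run make
# the jellyfish counting step would also need a matching -m value (jellyfish count -m $NEWK ...)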