
facs's People

Contributors

brainstorm, davemessina, guillermo-carrasco, henrikstranneheim, okulev, tzcoolman, wmarquesr

facs's Issues

facs.remove() does not find contaminated read on testsuite

@tzcoolman, can you test facs against tests/data/synthetic_fastq/test200.fastq and let me know why the ecoli read does not end up in the _contam file? Could it be related to the k-mer length?

(facs)[roman@biologin facs 0]$ ./facs remove -q ../tests/data/synthetic_fastq/test200.fastq -r ../tests/data/bloom/eschColi_K12.bloom
source->../tests/data/synthetic_fastq/test200.fastq
obj_file->../tests/data/bloom/eschColi_K12.bloom
prefix->(null)
match->../tests/data/synthetic_fastq/test200_eschColi_K12_contam.fastq
mis->../tests/data/synthetic_fastq/test200_eschColi_K12_clean.fastq
finish processing...
(facs)[roman@biologin facs 0]$ cat ../tests/data/synthetic_fastq/test200_eschColi_K12_contam.fastq
(facs)[roman@biologin facs 0]$

Thanks!
Roman

Spike randomised read

For testing purposes, instead of spiking a single static Ecoli read, we could generate a random one using SimNGS. To do so, and to make sure that there is no bias, we can parametrise SimNGS so it does not introduce any kind of random error.

Another good feature would be spiking a read from any of the supported organisms instead of only Ecoli -- of course not from the target organism.
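
A minimal sketch of the idea in Python, assuming the contaminant reference is already available as a plain string. It does not call SimNGS; `random_read` and `spike` are hypothetical helper names, and the toy reference is made up for illustration. The read is cut error-free from a random position and appended as one FASTQ record:

```python
import random

def random_read(reference, length, rng):
    """Cut an error-free read of `length` bases from a random position."""
    if length > len(reference):
        raise ValueError("read length exceeds reference length")
    start = rng.randrange(len(reference) - length + 1)
    return reference[start:start + length]

def spike(fastq_text, read, name="spiked_contaminant"):
    """Append one FASTQ record with a constant quality string."""
    record = "@{0}\n{1}\n+\n{2}\n".format(name, read, "I" * len(read))
    if fastq_text and not fastq_text.endswith("\n"):
        fastq_text += "\n"
    return fastq_text + record

rng = random.Random(42)  # fixed seed keeps a test run reproducible
reference = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCC"  # toy stand-in for a genome
read = random_read(reference, 10, rng)
spiked = spike("@r1\nACGT\n+\nIIII\n", read)
```

Passing a different reference string per organism would cover the second suggestion, as long as the caller excludes the target organism.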

facs build

  1. The -l option does not work, and there are no instructions on how the output will be handled, i.e. how each generated Bloom filter file will be named and where it will be placed. It is also unclear how the -o flag should be used in conjunction with the -l flag.
  2. Has support for using directories been removed?
  3. It would be nice to have some more information on parameters and output, e.g.:
    Set -k to 15 nucleotides
    Set -e to 0.005
    Using file/list as input
    Bloom filter(s) will be written to ...

Have FACS report full paths to sample

Hello Enze!

Can you tweak FACS so that it reports absolute paths to the sample(s)?

For instance:

/home/Enze/facs/tests/data/synthetic_fastq/simngs_phiX_100.fastq

instead of just: simngs_phiX_100.fastq?

... still with correct JSON formatting of course ;)
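
A rough Python sketch of what the request amounts to, assuming the report is built as a dictionary before being serialised. The `report` function is hypothetical, and the field names are borrowed from JSON output quoted elsewhere in these issues:

```python
import json
import os

def report(sample, bloom_file, total_reads, contaminated_reads):
    """Build a JSON report with absolute paths; field names are illustrative."""
    doc = {
        "sample": os.path.abspath(sample),
        "bloom_file": os.path.abspath(bloom_file),
        "total_read_count": total_reads,
        "contaminated_reads": contaminated_reads,
    }
    # json.dumps takes care of quoting and escaping, so the output stays valid JSON.
    return json.dumps(doc, indent=4, sort_keys=True)

out = report("tests/data/synthetic_fastq/simngs_phiX_100.fastq",
             "tests/data/bloom/eschColi_K12.bloom", 100, 0)
```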

Thanks!

Fix fastq_screen test

The tests for fastq_screen are failing, apparently when generating the fastq_screen.conf file:

ERROR
Runs fastq_screen tests against synthetically generated fastq files folder ... ERROR

======================================================================
ERROR: Downloads and installs fastq_screen locally, generates fastq_screen.conf file
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/bubo/home/h25/guilc/facs/tests/test_fastqscreen.py", line 62, in test_1_fetch_fastqscreen
    shutil.move(fscreen_path, self.progs)
NameError: global name 'fscreen_path' is not defined

======================================================================
ERROR: Runs fastq_screen tests against synthetically generated fastq files folder
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/bubo/home/h25/guilc/facs/tests/test_fastqscreen.py", line 72, in test_2_run_fastq_screen
    cfg = open(os.path.join(self.progs, "fastq_screen.conf"), 'rU')
IOError: [Errno 2] No such file or directory: '/bubo/home/h25/guilc/facs/tests/data/bin/fastq_screen.conf'

----------------------------------------------------------------------

omp pragmas ignored during compilation

During compilation (using the current GitHub code on our cluster) all the omp pragmas are ignored, so that the final binary runs single-threaded.

e.g.:

big_query.c:159: warning: ignoring #pragma omp parallel
big_query.c:161: warning: ignoring #pragma omp single
big_query.c:165: warning: ignoring #pragma omp task

I've got openmp installed (/usr/lib/libgomp.so.1 is present). Can you suggest what else might be causing this/what I can do to fix it?

Many thanks

facs classify -h

  1. The header is wrong: it says "./facs remove" but should say "./facs classify".
  2. -t tolerance rate, default is 0.0005. See issue #15.
  3. A "Bad address" message appears when running facs classify -h.

facs remove -o

When using the -o option:
-o /proj/b2012037/private/henrik/Example/Example1

yields:
/proj/b2012037/private/henrik/Example/Example1Example1_Homo_sapiens.GRCh37.70_nochr_clean.fastq

which makes "-o" look more like an output directory option.
When '-o' is specified, the suffix "Bloom_filter.clean" or "Bloom_filter.contaminants" should be appended to the given file name (if any), e.g. Example1_Homo_sapiens_GRCh37.clean and Example1_Homo_sapiens_GRCh37.contaminants. If the "-o" path ends in a "/", indicating a directory, then the complete name as in the first example should be used.
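
One possible reading of this proposal, sketched in Python. `output_names` is a hypothetical helper, not part of FACS, and the "_contam.fastq" suffix for the directory case is an assumption based on the file names shown in other issues:

```python
import os

def output_names(out, sample, bloom):
    """Return (clean, contaminants) paths following the proposed -o rule."""
    sample_id = os.path.splitext(os.path.basename(sample))[0]
    bloom_id = os.path.splitext(os.path.basename(bloom))[0]
    if out.endswith("/"):
        # Directory: build the full default name inside it.
        stem = os.path.join(out, "{0}_{1}".format(sample_id, bloom_id))
        return stem + "_clean.fastq", stem + "_contam.fastq"
    # File prefix: only append the Bloom filter name and a suffix.
    stem = "{0}_{1}".format(out, bloom_id)
    return stem + ".clean", stem + ".contaminants"
```

With this rule the sample name is never glued onto a user-supplied prefix, which avoids the "Example1Example1" duplication shown above.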

facs remove

Throws a segmentation fault when using the Example2 dataset.

facs query -h

The help text is not up to date. Is this really the only option?
Options:
-r reference bloom filter to query against

python: returning a JSON document is gone

Somewhere between these two builds:

https://travis-ci.org/SciLifeLab/facs/builds/13155812#L629
https://travis-ci.org/SciLifeLab/facs/builds/13128914#L629

the Python JSON output for facs.query got lost:

$ python
Python 2.7.1 (r271:86832, Feb 14 2011, 14:03:18) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import facs
>>> facs.query('../tests/data/synthetic_fastq/simngs_phiX_100.fastq', '../tests/data/bloom/eschColi_K12.bloom')

''
>>> 

@tzcoolman Could you please help me fix this one? It seems to originate from pull request #75: while introducing changes for MPI, something slipped under my radar :-/

It is probably related to the new handling of the report() function in that PR (https://github.com/SciLifeLab/facs/pull/75/files#diff-f011c1f4c85573e7323c6a9d9dc721e8L226), but I am not sure...

This directly affects the benchmarks since the JSON results are not being reported, so we need this fixed ASAP.

Thanks Enze!

python: Bad address when inverting arguments for facs.query

An endless stream of error messages is displayed:

Problem reading Bloom filter: Bad address
Problem reading Bloom filter: Bad address
Problem reading Bloom filter: Bad address
(...)

when accidentally inverting the arguments for facs.query, i.e.:

facs.query('../tests/data/bloom/eschColi_K12.bloom',
           '../tests/data/synthetic_fastq/simngs_phiX_100.fastq')

Instead of:

facs.query('../tests/data/synthetic_fastq/simngs_phiX_100.fastq',
           '../tests/data/bloom/eschColi_K12.bloom')
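
Until the C layer rejects a non-Bloom file cleanly, a guard on the Python side could catch the swap early. This is only a sketch: `check_query_args` is a hypothetical helper, and it assumes Bloom filters carry a ".bloom" extension, as they do in the test data:

```python
def check_query_args(query_path, bloom_path):
    """Raise ValueError when the two paths look swapped.

    Assumes Bloom filters end in '.bloom', as in the test data.
    """
    if query_path.endswith(".bloom") and not bloom_path.endswith(".bloom"):
        raise ValueError(
            "arguments look inverted: expected (query_file, bloom_file)")
    if not bloom_path.endswith(".bloom"):
        raise ValueError("not a Bloom filter file: " + bloom_path)
    return query_path, bloom_path
```

The returned pair could then be passed on to facs.query unchanged.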

Giving a sequence to classify (query) rather than a file with sequences

In general I'm surprised that most sequence aligners/mappers/searchers can only take a file (e.g. a fastq file) as input. Sometimes I want to classify a particular sequence on demand.

For me, it would be ideal if, in the Python interface, in addition to doing

facs.query("contaminated_sample.fastq.gz", "ecoli.bloom")

You could also do something along the lines of

facs.query(seq="ATACGTTACATAACATTGAAACTGGAGGGGGAAAGAAAACCAAAAGACCAGCTTGTTCCTTCACATGGCAC", "ecoli.bloom")

and it would return True or False.
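
To illustrate the requested behaviour, here is a small Python sketch that classifies a single sequence by k-mer lookup. It is not how FACS works internally: a plain `set` stands in for the Bloom filter, and `query_sequence`, the k-mer size and the 0.8 threshold are all illustrative assumptions:

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def query_sequence(seq, reference_kmers, k, threshold=0.8):
    """Return True when enough k-mers of seq are found in the reference.

    `reference_kmers` is a plain set standing in for a Bloom filter, and
    `threshold` is an illustrative cut-off, not a FACS default.
    """
    total = len(seq) - k + 1
    if total <= 0:
        return False  # sequence shorter than one k-mer
    hits = sum(1 for kmer in kmers(seq, k) if kmer in reference_kmers)
    return hits / float(total) >= threshold

reference = "ATACGTTACATAACATTGAAACTGGAGGGGGAAAGAAAACC"
ref_set = set(kmers(reference, 8))
```

A real Bloom filter would answer the same membership question with a small false positive rate instead of exactly.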

Merge remove and remove_l in the same codebase (remove.c & remove.h)

And fix compiler warnings:

cc -c   simple_remove.c -o simple_remove.o -O3 -DFIFO -D_FILE_OFFSET_BITS=64 -D_LARGE_FILE -Wall -fopenmp -g -DNODEBUG -lm -lz
cc -c   simple_remove_l.c -o simple_remove_l.o -O3 -DFIFO -D_FILE_OFFSET_BITS=64 -D_LARGE_FILE -Wall -fopenmp -g -DNODEBUG -lm -lz
simple_remove_l.c: In function 'fasta_process_ml':
simple_remove_l.c:225: warning: unused variable 'sign'
simple_remove_l.c: In function 'save_result_ml':
simple_remove_l.c:289: warning: suggest parentheses around '&&' within '||'
simple_remove_l.c:289: warning: value computed is not used
simple_remove_l.c:292: warning: suggest parentheses around '&&' within '||'
simple_remove_l.c:292: warning: value computed is not used
simple_remove_l.c: In function 'all_save':
simple_remove_l.c:427: warning: implicit declaration of function 'save_result'

Transparency of output

  1. Currently, if -o is not specified, the output is written to the same path as the binary file. I think it is bad to clutter the binary directory; we should use the directory from which the program was called instead.
  2. Nothing states what the output file will be called when the default is used. This is not obvious and should be documented in the help text. The name should also reflect which module was used, e.g. Testing1_check.info, Testing1_query.info.
  3. There are few numbers in the output file and no documentation of what they stand for. It would be great to have a tab-separated file with an explanatory annotation for each number, so that the user knows what each number means, in a format that is easy to parse.
  4. The output from FACS2.0 should also state which dataset was used in the analysis, and it should be easy to see and parse which statistics belong to which Bloom filter if several were used in the same analysis.
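
Points 3 and 4 could look roughly like the following Python sketch. The column layout, the `write_report` helper and the descriptions are assumptions made for illustration; only the statistic names are taken from the JSON output quoted in another issue:

```python
import csv
import io

# (field name, human-readable description) pairs; descriptions are illustrative.
FIELDS = [
    ("total_read_count", "Number of reads in the sample"),
    ("contaminated_reads", "Reads classified as contaminated"),
    ("contamination_rate", "contaminated_reads / total_read_count"),
]

def write_report(handle, module, sample, bloom_file, stats):
    """Write one annotated, tab-separated row per statistic."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["module", "sample", "bloom_file", "field", "value", "description"])
    for name, description in FIELDS:
        writer.writerow([module, sample, bloom_file, name, stats[name], description])

buf = io.StringIO()
write_report(buf, "query", "Testing1.fastq", "eschColi_K12.bloom",
             {"total_read_count": 100, "contaminated_reads": 5,
              "contamination_rate": 0.05})
lines = buf.getvalue().splitlines()
```

Because every row repeats the module, sample and Bloom filter, rows from several filters can be concatenated into one file and still be told apart.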

facs remove -h

  1. Header says "---contamination remove---". Should say "---facs remove---".
  2. "Tolerant rate" is perhaps not ideal. Maybe "threshold value" is a better name.
  3. "(The program will automatically select a value if you don't provide any.)" I would prefer: "The program will automatically estimate a proper threshold value from the reference size and K-mer length."
  4. "-l a list containing all bloom files." --> "-l a list containing all Bloom filter names in ..." followed by what the list format should be.
  5. "-r reference bloom filter file or dir" --> "-r Bloom filter file or directory"
  6. "!!! either -r or -l can only be allowed each time !!!". See build issue!

test_simngs.py fails if executed alone

Hi,

This test fails if executed alone, before any other test has been run:


Generates a synthetic library and runs with built-in simNGS runfile ... ERROR

======================================================================
ERROR: Generates a synthetic library and runs with built-in simNGS runfile
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/guillem/facs/tests/test_simngs.py", line 57, in test_2_run_simNGS
    p1 = subprocess.Popen(cl1, stdout=subprocess.PIPE)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

----------------------------------------------------------------------

The problem is that the folder tests/data has not been created.
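
One way to remove the dependency on test order is to create the folder in the test's own setUp. The sketch below uses a temporary directory in place of the repository's tests/ folder, and the class name is hypothetical:

```python
import os
import tempfile
import unittest

class SimNGSTestSketch(unittest.TestCase):
    def setUp(self):
        # A temporary root stands in for the repository's tests/ folder.
        self.root = tempfile.mkdtemp()
        self.data_dir = os.path.join(self.root, "data")
        # Create the folder here so the test does not depend on run order.
        os.makedirs(self.data_dir, exist_ok=True)

    def test_data_dir_exists(self):
        self.assertTrue(os.path.isdir(self.data_dir))
```

Note that `exist_ok` requires Python 3.2 or later; on Python 2.7 the equivalent is an `os.path.isdir` check before `os.makedirs`.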

Thanks!

License

I have not found any license information. Am I missing something?

facs remove Example3

*** glibc detected *** ./facs: free(): invalid pointer: 0x00002abf680db010 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3fa6875916]
./facs[0x402c52]
./facs[0x407dc8]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3fa681ecdd]
./facs[0x4016d9]
======= Memory map: ========
00400000-0040c000 r-xp 00000000 00:15 3107197168 /lynx/cvol/v26/b2012037/private/facs/facs/facs
0060b000-0060c000 rw-p 0000b000 00:15 3107197168 /lynx/cvol/v26/b2012037/private/facs/facs/facs
00981000-0132c000 rw-p 00000000 00:00 0 [heap]
38ad000000-38ad015000 r-xp 00000000 08:03 416403 /lib64/libz.so.1.2.3
38ad015000-38ad214000 ---p 00015000 08:03 416403 /lib64/libz.so.1.2.3
38ad214000-38ad215000 r--p 00014000 08:03 416403 /lib64/libz.so.1.2.3
38ad215000-38ad216000 rw-p 00015000 08:03 416403 /lib64/libz.so.1.2.3
3fa6400000-3fa6420000 r-xp 00000000 08:03 394073 /lib64/ld-2.12.so
3fa661f000-3fa6620000 r--p 0001f000 08:03 394073 /lib64/ld-2.12.so
3fa6620000-3fa6621000 rw-p 00020000 08:03 394073 /lib64/ld-2.12.so
3fa6621000-3fa6622000 rw-p 00000000 00:00 0
3fa6800000-3fa6989000 r-xp 00000000 08:03 394074 /lib64/libc-2.12.so
3fa6989000-3fa6b89000 ---p 00189000 08:03 394074 /lib64/libc-2.12.so
3fa6b89000-3fa6b8d000 r--p 00189000 08:03 394074 /lib64/libc-2.12.so
3fa6b8d000-3fa6b8e000 rw-p 0018d000 08:03 394074 /lib64/libc-2.12.so
3fa6b8e000-3fa6b93000 rw-p 00000000 00:00 0
3fa6c00000-3fa6c83000 r-xp 00000000 08:03 412474 /lib64/libm-2.12.so
3fa6c83000-3fa6e82000 ---p 00083000 08:03 412474 /lib64/libm-2.12.so
3fa6e82000-3fa6e83000 r--p 00082000 08:03 412474 /lib64/libm-2.12.so
3fa6e83000-3fa6e84000 rw-p 00083000 08:03 412474 /lib64/libm-2.12.so
3fa7400000-3fa7417000 r-xp 00000000 08:03 394568 /lib64/libpthread-2.12.so
3fa7417000-3fa7617000 ---p 00017000 08:03 394568 /lib64/libpthread-2.12.so
3fa7617000-3fa7618000 r--p 00017000 08:03 394568 /lib64/libpthread-2.12.so
3fa7618000-3fa7619000 rw-p 00018000 08:03 394568 /lib64/libpthread-2.12.so
3fa7619000-3fa761d000 rw-p 00000000 00:00 0
3fa7c00000-3fa7c07000 r-xp 00000000 08:03 394947 /lib64/librt-2.12.so
3fa7c07000-3fa7e06000 ---p 00007000 08:03 394947 /lib64/librt-2.12.so
3fa7e06000-3fa7e07000 r--p 00006000 08:03 394947 /lib64/librt-2.12.so
3fa7e07000-3fa7e08000 rw-p 00007000 08:03 394947 /lib64/librt-2.12.so
3fa8400000-3fa840d000 r-xp 00000000 08:03 264890 /usr/lib64/libgomp.so.1.0.0
3fa840d000-3fa860c000 ---p 0000d000 08:03 264890 /usr/lib64/libgomp.so.1.0.0
3fa860c000-3fa860d000 rw-p 0000c000 08:03 264890 /usr/lib64/libgomp.so.1.0.0
3fa9000000-3fa9016000 r-xp 00000000 08:03 412532 /lib64/libgcc_s-4.4.6-20120305.so.1
3fa9016000-3fa9215000 ---p 00016000 08:03 412532 /lib64/libgcc_s-4.4.6-20120305.so.1
3fa9215000-3fa9216000 rw-p 00015000 08:03 412532 /lib64/libgcc_s-4.4.6-20120305.so.1
2abf07949000-2abf0794a000 rw-p 00000000 00:00 0
2abf07968000-2abf0796d000 rw-p 00000000 00:00 0
2abf0796d000-2abf27be7000 r--s 00000000 00:15 2726700565 /lynx/cvol/v1/b2012094/Innocentive/data/Example/Example3.fq
2abf27be7000-2ac0bac37000 rw-p 00000000 00:00 0
2ac0bac37000-2ac0bac38000 ---p 00000000 00:00 0
2ac0bac38000-2ac0bae38000 rw-p 00000000 00:00 0
2ac0bae38000-2ac0bae39000 ---p 00000000 00:00 0
2ac0bae39000-2ac0bb039000 rw-p 00000000 00:00 0
2ac0bb039000-2ac0bb03a000 ---p 00000000 00:00 0
2ac0bb03a000-2ac0bb23a000 rw-p 00000000 00:00 0
2ac0bb23a000-2ac0bb23b000 ---p 00000000 00:00 0
2ac0bb23b000-2ac0bb43b000 rw-p 00000000 00:00 0
2ac0bb43b000-2ac0bb43c000 ---p 00000000 00:00 0
2ac0bb43c000-2ac0bb63c000 rw-p 00000000 00:00 0
2ac0bb63c000-2ac0bb63d000 ---p 00000000 00:00 0
2ac0bb63d000-2ac0bb83d000 rw-p 00000000 00:00 0
2ac0bb83d000-2ac0bb83e000 ---p 00000000 00:00 0
2ac0bb83e000-2ac0bba43000 rw-p 00000000 00:00 0
2ac0bc000000-2ac0bcbf4000 rw-p 00000000 00:00 0
2ac0bcbf4000-2ac0c0000000 ---p 00000000 00:00 0
2ac0c4000000-2ac0c4a2b000 rw-p 00000000 00:00 0
2ac0c4a2b000-2ac0c8000000 ---p 00000000 00:00 0
2ac0cc000000-2ac0cca3a000 rw-p 00000000 00:00 0
2ac0cca3a000-2ac0d0000000 ---p 00000000 00:00 0
2ac0d4000000-2ac0d4a48000 rw-p 00000000 00:00 0
2ac0d4a48000-2ac0d8000000 ---p 00000000 00:00 0
2ac0dc000000-2ac0de76c000 rw-p 00000000 00:00 0
2ac0de76c000-2ac0e0000000 ---p 00000000 00:00 0
2ac0e4000000-2ac0e49ae000 rw-p 00000000 00:00 0
2ac0e49ae000-2ac0e8000000 ---p 00000000 00:00 0
2ac0ec000000-2ac0ec9ef000 rw-p 00000000 00:00 0
2ac0ec9ef000-2ac0f0000000 ---p 00000000 00:00 0
7fff76134000-7fff7614a000 rw-p 00000000 00:00 0 [stack]
7fff761ff000-7fff76200000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
/var/spool/slurmd/job3017171/slurm_script: line 17: 19178 Aborted (SIGABRT) (core dumped)

fastq_screen.py

@brainstorm @guillermo-carrasco
Roman, I think that in this test file you added all reference libraries to the .conf file. But if you don't specify which reference library should be used when running the test, this could cause conflicts in the final results. Are you sure this is a good idea?

Remove the generation of *.info files

All the relevant information is already available on stdout as JSON-like formatted output:

{
        "total_read_count": 1804624,
        "contaminated_reads": 4253,
        "total_hits": 2148086,
        "contamination_rate": 0.002357,
        "bloom_file":"facs/tests/data/bloom/Arabidopsis_thaliana_TAIR10.bloom"
}

Not only do the *.info files add nothing to the information that is already present, they also get overwritten (only the last run is kept) if multiple queries are performed against a particular sample.
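
If the stdout report is the single source of truth, a caller can keep the results of several queries side by side without any intermediate file. A short Python sketch, assuming each run prints one valid JSON document; `collect` is a hypothetical helper and the numbers are toy values:

```python
import json

def collect(results, stdout_text):
    """Parse one JSON report and store it under its Bloom filter name."""
    doc = json.loads(stdout_text)
    results[doc["bloom_file"]] = doc
    return results

# Toy reports standing in for the stdout of two separate runs.
run1 = '{"total_read_count": 100, "contaminated_reads": 5, "bloom_file": "a.bloom"}'
run2 = '{"total_read_count": 100, "contaminated_reads": 0, "bloom_file": "b.bloom"}'
results = {}
collect(results, run1)
collect(results, run2)
```

Keying by `bloom_file` means a second query against the same sample adds an entry instead of replacing the previous one.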

facs remove

I still do not get facs remove to finish within 1 hour on the example1 dataset after updating to the latest version.

facs build -h

  1. The header of the output is still "Bloom build". It should be "facs build".
  2. "Filename" is spelled inconsistently: choose either "filename" or "file name"; do not use both in the same document.
  3. "!!! either -r or -l can only be allowed each time !!!" would be better as "!!! Use either -r or -l !!!".
  4. The help says the K-mer default is 21, but isn't the default automatically estimated from the reference size? The help text should reflect that. Also, why is K-mer spelled K_mer?
  5. "-e error rate" should be more specific, e.g. "Bloom filter false positive frequency", and maybe the flag name should then be changed as well.

Further filtering/cleaning and unification of results in iriscouch for the benchmark

Now that the reporting functionality works in single thread for all programs, we should put some effort into normalizing/unifying the data reported to the CouchDB database and plot accordingly via matplotlib or D3.

I have disabled the generation of dummy fastq files, since what we want is only simNGS-generated reads (i.e. simngs_hostORG_contamORG_numREADS.fastq) for the different species with a single ecoli read spiked in (for instance).

@guillermo-carrasco, @b97pla. Let me know if this sketch needs more clarification; I will be happy to adapt the code to your needs.

gzip support lacking for "facs remove"

When Daniel Lundin was looking into using facs to screen for rRNA in his dataset, he found that "facs query" supports reading from gzipped files, but not "facs remove". Why is this?

Facs version

The FACS version in the help text says 0.1; it should say 2.0.

Python's FACS remove needs to capture stdout/stderr

Right now, it is not handled properly, since the strings returned by facs remove cannot be handled by python:

http://stackoverflow.com/questions/2420317/why-does-my-hello-world-python-c-module-work-correctly-in-everything-but-idle

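#include <Python.h>

/* Write a NUL-terminated string to Python's sys.stdout rather than C stdout. */
void writeout(const char* nullterminated)
{
    PyObject* sysmod = PyImport_ImportModuleNoBlock("sys");
    if (sysmod == NULL)
        return;  /* import failed; a Python exception is already set */
    PyObject* pystdout = PyObject_GetAttrString(sysmod, "stdout");
    if (pystdout != NULL) {
        PyObject* result = PyObject_CallMethod(pystdout, "write", "s", nullterminated);
        Py_XDECREF(result);
        Py_DECREF(pystdout);
    }
    Py_DECREF(sysmod);
}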

Command line options

Trying out facs query for the first time, I was surprised about getting an error message for

./facs query -b hg19.bloom -q /proj/b2012094/Innocentive/data/Example/Example1.fq
query: invalid option -- 'b'

(hg19.bloom is a filter for hg19 which I previously built using facs build) The instructions say:

Options:
-b reference Bloom filter to query against
-q FASTA/FASTQ file containing the query
-l input list containing all Bloom filters, one per line
-r single input Bloom filters
-t threshold value
-s sampling rate, default is 1 so it reads the whole query file

so I thought I should use the -b flag, but apparently I am supposed to use -r. Perhaps the command line option instructions could be clarified a bit?

Deconseq does not report filter and returns incorrect contamination rate

Contamination rate should never be >100...

{
   "_id": "e206960e0662df946a20e43b4f000c36",
   "_rev": "1-15137438404e50191513c6ee194c92e4",
   "sample": "tests/data/synthetic_fastq/simngs_phiX_1000000.fastq",
   "contamination_rate": 1301,
   "start_timestamp": "2013-12-02 12:08:17.599624Z",
   "total_reads": 500000,
   "end_timestamp": "2013-12-02 12:15:26.696080Z"
}
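
A value of 1301 suggests that the reported number is not a fraction of the total reads at all. Whatever the cause, a range check before the document is written to the database would catch it. A minimal Python sketch, where `contamination_rate` is a hypothetical helper and the rate is assumed to be contaminated reads divided by total reads:

```python
def contamination_rate(contaminated_reads, total_reads):
    """Return the contaminated fraction, guaranteed to lie in [0, 1]."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= contaminated_reads <= total_reads:
        raise ValueError("contaminated_reads must be between 0 and total_reads")
    return contaminated_reads / float(total_reads)
```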

python: do not bail out when file is not found

When a file is not found, the Python interface should ideally return to the Python interpreter:

>>> facs.query("jarl", "karl")
karl: No such file or directory
>>> 

Instead of exiting to the underlying shell as it does now:

>>> facs.query("jarl", "karl")
karl: No such file or directory
$ 
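
As a stop-gap on the Python side, the paths could be checked before they reach the C extension, so that a missing file raises an exception instead of terminating the interpreter. This is a sketch only: `checked_paths` is a hypothetical helper, and the proper fix is presumably for the extension to set a Python exception rather than exit:

```python
import os

def checked_paths(query_path, bloom_path):
    """Raise IOError for a missing input instead of letting the C layer exit."""
    for path in (query_path, bloom_path):
        if not os.path.isfile(path):
            raise IOError("{0}: No such file or directory".format(path))
    return query_path, bloom_path
```

The caller can then catch the IOError and stay at the interpreter prompt.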
