brentp / bio-playground
Miscellaneous scripts for bioinformatics/genomics that don't merit their own repo.
License: MIT License
I am trying to generate IGV screenshots automatically for easier inspection of variants in a VCF file. Using the IGV wrapper here, I followed the commands in the example:
Python 2.7.12 (default, Mar 1 2021, 11:38:31)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import igv
>>> igv=igv.IGV()
>>> igv=igv.genome('hg19')
>>> igv=igv.load('http://www.broadinstitute.org/igvdata/1KG/pilot2Bams/NA12878.SLX.bam')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'load'
>>>
Any idea why?
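A likely cause, judging only from the traceback: the wrapper's methods appear to return IGV's reply as a string, so `igv=igv.genome('hg19')` rebinds the name `igv` to that string and the next call fails. (The earlier `igv=igv.IGV()` also shadows the module name.) The sketch below illustrates the pattern with a hypothetical stand-in class, `FakeIGV`, which is not the real wrapper and does not talk to IGV.

```python
class FakeIGV:
    """Hypothetical stand-in: each method returns a reply string, not self."""

    def __init__(self):
        self.commands = []

    def _send(self, command):
        # A real wrapper would write to IGV's socket; here we only record it.
        self.commands.append(command)
        return "OK"

    def genome(self, name):
        return self._send("genome " + name)

    def load(self, url):
        return self._send("load " + url)


# Keep the object under its own name and store replies separately.
viewer = FakeIGV()
reply = viewer.genome("hg19")
reply = viewer.load(
    "http://www.broadinstitute.org/igvdata/1KG/pilot2Bams/NA12878.SLX.bam"
)

# Reproducing the reported mistake: the name now refers to a str.
broken = FakeIGV()
broken = broken.genome("hg19")
try:
    broken.load("anything")
    error = None
except AttributeError as exc:
    error = str(exc)
```

With the real wrapper, the equivalent change would be to call `igv.genome('hg19')` and `igv.load(...)` without assigning the result back to `igv`.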
Hi!
If I run:
fastq filter --adjust 33 --unique input.fq > output.fq
it always reports (to stderr) an incorrect total number of records (one too many).
Moreover, for the last read, rl.size() is always one larger than expected.
I'm not familiar with C++, but I would appreciate your feedback and help.
I would like to use your fastq program to return the list of unique FASTQ records
and to add the number of duplicates found to each FASTQ header.
Example:
FASTQ_header:read01
FASTQ_sequence
+
FASTQ_low_quality
FASTQ_header:read02
FASTQ_sequence
+
FASTQ_high_quality
FASTQ_header:read01:2
FASTQ_sequence
+
FASTQ_high_quality
Many thanks in advance,
pepap
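The requested behaviour can be sketched in Python, independently of the C++ tool. This is only one reading of the example above: records with the same sequence are collapsed, the first header is kept with the duplicate count appended, and the quality string with the highest summed score is retained. The function names `read_fastq` and `unique_with_counts` are hypothetical.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    # Skipping blank lines keeps a trailing newline from producing a
    # phantom record at the end of the file.
    non_empty = (line.rstrip("\r\n") for line in lines if line.strip())
    # zip over the same iterator four times groups lines into records.
    for header, seq, _plus, qual in zip(*[non_empty] * 4):
        yield header, seq, qual


def unique_with_counts(records):
    """Collapse records sharing a sequence; append the count to the header."""
    best = {}  # sequence -> [first header, best quality so far, count]
    for header, seq, qual in records:
        if seq not in best:
            best[seq] = [header, qual, 1]
            continue
        entry = best[seq]
        entry[2] += 1
        # Same sequence means same length, so summed scores are comparable.
        if sum(map(ord, qual)) > sum(map(ord, entry[1])):
            entry[1] = qual
    for seq, (header, qual, count) in best.items():
        yield "%s:%d" % (header, count), seq, qual
```

Note that this keeps every unique sequence in memory, so it is suited to illustrating the output format rather than to very large files.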
I created ruby-igv based on igv.py. Thanks.
https://github.com/kojix2/ruby-igv
According to the Wikipedia article on the FASTQ format (https://en.wikipedia.org/wiki/FASTQ_format), the standard Sanger Phred+33 encoding has a lower bound of 33 and an upper bound of 73. Narrowing the range may speed up the guess.
bio-playground / reads-utils / select-random-pairs.py
The current solution allows the same pair to be selected more than once.
I believe rand_records = sorted(random.sample(xrange(records), int(N)))
would be an elegant solution.
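A minimal Python 3 sketch of the suggestion (`range` replaces Python 2's `xrange`). `random.sample` draws without replacement, so no index can appear twice. The function name and the `seed` parameter are hypothetical additions for illustration.

```python
import random


def pick_record_indices(total_records, n, seed=None):
    """Return n distinct record indices in ascending order."""
    rng = random.Random(seed)
    # sample() raises ValueError if n is larger than total_records.
    return sorted(rng.sample(range(total_records), int(n)))
```

Sorting the indices lets the input files be read once, front to back, while emitting only the selected pairs.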
Hi Brent,
I kind of have another issue. I have two lanes of HiSeq, paired-end 2x100bp.
I previously removed duplicates lane by lane: I would concatenate the 2x100bp reads into 1x200bp (using Galaxy's joiner tool), then run fastqClean. To remove duplicates, I run "fastq filter --adjust 64 --unique ". Then I split the fake 200bp reads back into 2x100bp reads using Galaxy's tool.
I can do this fine on individual lanes. But it would make more sense to remove duplicates overall. The concatenated data from the two lanes make a FASTQ file of 84 gigabytes. I launched your tool (an older version that removes all sequences with an N)... now, after 45 minutes, I'm up to 212 GB of memory usage, and increasing. My machine has only 256 GB of RAM, so I'll probably have to kill it.
So how about a low-mem version :) (even if it runs a lot slower)
cheers,
y
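One possible direction for a lower-memory variant, sketched in Python rather than in the tool's C++: remember a fixed-size digest of each sequence instead of the sequence itself, and stream records out as they are read. This is an assumption about what would help, not how the tool works. Memory still grows with the number of unique reads (16 bytes of digest plus container overhead each), and a hash collision would drop a non-duplicate read, although that is extremely unlikely with a 128-bit digest.

```python
import hashlib


def unique_records(lines):
    """Yield the first FASTQ record seen for each distinct sequence."""
    seen = set()  # 16-byte digests of sequences already emitted
    non_empty = (line.rstrip("\r\n") for line in lines if line.strip())
    for header, seq, plus, qual in zip(*[non_empty] * 4):
        digest = hashlib.blake2b(seq.encode("ascii"), digest_size=16).digest()
        if digest in seen:
            continue
        seen.add(digest)
        # Emitting immediately also preserves the input order.
        yield header, seq, plus, qual
```

Unlike a keep-the-best-quality approach, this keeps whichever copy appears first, which is the price of not holding records in memory.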
This script does not include Illumina 1.8+, which ranges from 33 to 74.
It seems the range for Illumina 1.5+ should go up to 105, not 104.
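To make the range discussion concrete, here is a small sketch of min/max-based guessing. The bounds in `RANGES` are assumptions assembled from the issues above and the Wikipedia FASTQ article, not a copy of the table in the script, and `possible_encodings` is a hypothetical name.

```python
# Assumed (min, max) ASCII codes per encoding; verify before relying on them.
RANGES = {
    "Sanger": (33, 73),
    "Illumina-1.8": (33, 74),
    "Solexa": (59, 104),
    "Illumina-1.3": (64, 104),
    "Illumina-1.5": (66, 105),
}


def possible_encodings(qualities, ranges=RANGES):
    """Return (names, min_code, max_code) for encodings that fit the data."""
    codes = [ord(c) for q in qualities for c in q]
    lo, hi = min(codes), max(codes)  # raises ValueError on empty input
    names = sorted(
        name for name, (rmin, rmax) in ranges.items()
        if lo >= rmin and hi <= rmax
    )
    return names, lo, hi
```

With these bounds, a single `i` (ASCII 105) in the data is enough to leave Illumina 1.5 as the only candidate, which is why the upper bound of 104 versus 105 matters.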
In this python file, you cite the paper:
http://www.nslij-genetics.org/wli/pub/ieee-embs06.pdf
But this link goes to a sketchy site with advertisements for things like Viagra. I'm guessing that the domain ownership must have changed?
I'm posting this as an issue 1. so you're aware and 2. because I'm curious what the actual paper is.
Thank you!
./fastq filter --adjust 64 --unique /path/to/your.fasta > unique.fasta
reorders things unnecessarily. Don't know if it would slow things down to keep the same order... (but not a real problem)
thanks!
y
Correct me if I'm wrong, but does the guess-encoding script not process base quality strings (e.g., bb_eeeeeggggfiiiiiiiiiiiiihhiifhiiiiiihiiiiiiifffc)?
The example at the top of the script uses cut -f 5 when the 5th column is the mapping quality, not the base quality string. Shouldn't it be cut -f 11?
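For reference, in the SAM format column 5 is MAPQ and column 11 is QUAL. A Python sketch equivalent to selecting column 11 while skipping header lines might look like this (`base_qualities` is a hypothetical name):

```python
def base_qualities(sam_lines):
    """Yield the QUAL field (column 11) of each SAM alignment line."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines such as @HD and @SQ carry no reads
        fields = line.rstrip("\n").split("\t")
        # '*' means the quality string was not stored for this read.
        if len(fields) >= 11 and fields[10] != "*":
            yield fields[10]
```

The resulting strings are what a quality-encoding guesser would need as input.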
A recent change to guess-encoding.py makes it output the first two columns in square brackets, like this:
['Sanger', 'Illumina-1.8'] 42 71
Before, it would have been
Sanger Illumina-1.8 42 71
This seems like a bug. Am I wrong?
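Output like this usually appears when a Python list is converted to a string directly instead of being joined. The sketch below shows the difference. `format_guess` and its `sep` parameter are hypothetical and are not taken from the script.

```python
def format_guess(encodings, qmin, qmax, sep="\t"):
    """Join encoding names and the observed min/max into one flat line."""
    return sep.join(list(encodings) + [str(qmin), str(qmax)])


names = ["Sanger", "Illumina-1.8"]
bracketed = str(names)  # the list's repr, brackets and quotes included
flat = format_guess(names, 42, 71, sep=" ")
```

Here `bracketed` is the form with square brackets and quotes, while `flat` reproduces the earlier plain layout.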
Hi,
have you considered combining matplotlib with asynchronous calls (the multiprocessing module?) to produce an interactive environment to "explore" the dataset?
I am not interested in biology, btw, just in the coding.
Regards!
Hi @brentp, I would like to ask whether there is a way to load BAM files from the working directory rather than from URLs (as shown in the example).
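IGV's batch `load` command accepts file paths as well as URLs, so a local file should work if the wrapper forwards its argument unchanged (an assumption about the wrapper). Because IGV resolves paths from its own working directory, not from the Python script's, passing an absolute path is the safer choice. A hypothetical helper that builds the command text:

```python
import os


def load_command(bam_path):
    """Build an IGV batch 'load' command using an absolute path."""
    # IGV also expects the BAM index file to sit next to the BAM file.
    return "load " + os.path.abspath(bam_path)
```

With the wrapper, the same idea would be to pass `os.path.abspath('your.bam')` wherever the example passes a URL.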
It is desirable to narrow down the possible encodings that guess-encoding.py produces, using heuristics beyond checking the min/max.
I have added some to improve the ability to uniquely detect or eliminate Illumina 1.5 format, by considering the unused scores in that scheme and the frequency of the Illumina 1.5 "special" quality score (B).
I would be happy to work to integrate those modifications into this version, if there is interest in my doing so.
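As an illustration of the kind of heuristic described, and not the contributor's actual code: in Illumina 1.5 (Phred+64) scores 0 and 1 are unused, so the characters `@` and `A` should never occur, while `B` (score 2) is the special indicator and tends to be frequent. The function name and the returned keys below are hypothetical.

```python
from collections import Counter


def illumina15_evidence(qualities):
    """Summarise evidence for or against Illumina 1.5 in quality strings."""
    counts = Counter(c for q in qualities for c in q)
    total = sum(counts.values())
    # '@' (64) and 'A' (65) encode scores 0 and 1, unused in Illumina 1.5.
    unused_seen = (counts["@"] + counts["A"]) > 0
    b_fraction = counts["B"] / total if total else 0.0
    return {"rules_out_illumina_1.5": unused_seen, "b_fraction": b_fraction}
```

What fraction of `B` counts as "frequent" is left open here; any threshold would need to be chosen from real data.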
Hi Brent,
Just thought you might want to know about an issue with the IGV socket response (related to your igv Python class); see here for more info:
https://groups.google.com/forum/?fromgroups=#!topic/igv-help/uS-a5EFOZC4
Briefly, IGV doesn't send stderr output back over the socket when an error occurs. It only responds with 'OK' as long as it was able to connect to the port, which doesn't reflect whether the internal operation (goto, load, sort, etc.) finished successfully. It is one-way communication.
Do you think capturing the IGV stderr via "subprocess.Popen" is a good idea?
Many thanks,
Saeed
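A minimal sketch of the idea, assuming IGV is started from Python and actually writes useful error text to stderr (neither is confirmed here). A background thread drains the child's stderr into a queue so the pipe never fills up while the program keeps running. `start_with_stderr_queue` is a hypothetical helper, and `argv` stands for whatever command launches IGV on your machine.

```python
import queue
import subprocess
import threading


def start_with_stderr_queue(argv):
    """Start a process and collect its stderr lines in a thread-safe queue."""
    proc = subprocess.Popen(argv, stderr=subprocess.PIPE, text=True)
    lines = queue.Queue()

    def pump():
        # Iterating the pipe blocks until the child writes or exits.
        for line in proc.stderr:
            lines.put(line.rstrip("\n"))

    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    return proc, lines, thread
```

After sending a socket command, the caller could check the queue with `lines.get_nowait()` for new messages. Matching a stderr line to the command that caused it would still be heuristic, since the two channels are not synchronised.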