p-koo / libre
License: MIT License
The function accepts an activation parameter but does not use it.
The FASTA file format apparently supports ; as a comment marker. We can implement this by ignoring lines that begin with ;.
Commit d885338 moves the one-hot functions, so the tests need to be moved too. The tests do not pass at this time.
The parse_fasta function assumes that each sequence takes up exactly one line, but that is not always the case: sequences can span multiple lines. I have an implementation here that handles multiple lines correctly.
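As a rough illustration of the multi-line case (not the implementation linked above), a parser can accumulate sequence lines until it reaches the next header. The name parse_fasta_lines and the (description, sequence) return shape are assumptions made for this sketch. It also skips lines that begin with ;, as suggested in the comment issue.

```python
from typing import Iterable, List, Tuple


def parse_fasta_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (description, sequence) pairs from FASTA-formatted lines."""
    records: List[Tuple[str, str]] = []
    description = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ';' comment lines.
        if not line or line.startswith(";"):
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    # Close the final record.
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """;this is a comment
>seq1 first
ACGT
ACGT
>seq2
GGCC
""".splitlines()
records = parse_fasta_lines(example)
```

Passing an open file handle instead of a list of lines works the same way, since the function only iterates over its input.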
The function convert_one_hot can return wrong results if any letter in the sequences is not in the alphabet. A check should be added to make sure all the values in the sequences are in the alphabet.
The check could be this:

    unknown = set(np.unique(s)).difference(alphabet)
    if unknown:
        raise ValueError("letter found that is not in alphabet")
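A self-contained version of that check might look like the sketch below. The helper name check_alphabet and the default alphabet "ACGT" are assumptions for illustration. It also assumes the input is a list of Python strings, so it splits them into single characters first.

```python
import numpy as np


def check_alphabet(sequences, alphabet="ACGT"):
    """Raise ValueError if any character in `sequences` is not in `alphabet`."""
    # Split every sequence into single characters so np.unique sees letters.
    letters = np.array([letter for seq in sequences for letter in seq])
    unknown = set(np.unique(letters).tolist()).difference(alphabet)
    if unknown:
        raise ValueError(f"letters not in alphabet: {sorted(unknown)}")
```

Reporting the offending letters in the error message makes it easier to spot, for example, N bases or lowercase input.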
I have an implementation to one-hot encode sequences that is 3x faster than the current implementation.
This is the current implementation:
My implementation does not strip or pad the sequences, but I can add that.
filter_encode_metatable has some hard-coded queries, which might be better exposed as parameters. For example, the function keeps only rows with File assembly GRCh38, but I can imagine that a different assembly might sometimes be desired.
Perhaps we can rework this function to let the user pass in a query? pandas.DataFrame.query() might be useful here.
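One possible shape for this, as a sketch only: the function takes the table and a query string and hands the string to DataFrame.query(). The name filter_metatable and the toy table are made up for illustration. Column names that contain spaces, such as File assembly, have to be wrapped in backticks inside the query string.

```python
import pandas as pd


def filter_metatable(metatable: pd.DataFrame, query: str) -> pd.DataFrame:
    """Keep the rows of `metatable` that satisfy a pandas query string."""
    return metatable.query(query).reset_index(drop=True)


# Toy stand-in for an ENCODE metadata table (values are illustrative).
metatable = pd.DataFrame(
    {
        "File assembly": ["GRCh38", "hg19", "GRCh38"],
        "File format": ["bed narrowPeak", "bed narrowPeak", "bigWig"],
    }
)
subset = filter_metatable(metatable, '`File assembly` == "GRCh38"')
```

The current behavior could be kept by making the existing hard-coded query the default value of the query argument.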
JAX has an example of a one-hot implementation: https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html#utility-and-loss-functions

    import jax.numpy as jnp

    def one_hot(x, k, dtype=jnp.float32):
        """Create a one-hot encoding of x of size k."""
        return jnp.array(x[:, None] == jnp.arange(k), dtype)

We could try a similar implementation. It might be faster than the current version.
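The same broadcasting idea can be sketched in plain NumPy for DNA strings. This is an untimed illustration, not a benchmarked replacement. The name one_hot_sequences is hypothetical, and the sketch assumes ASCII sequences that all have the same length.

```python
import numpy as np


def one_hot_sequences(sequences, alphabet="ACGT", dtype=np.float32):
    """One-hot encode equal-length sequences into shape (N, L, len(alphabet))."""
    n = len(sequences)
    # Turn the characters into byte codes and arrange them as an (N, L) array.
    codes = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)
    codes = codes.reshape(n, -1)
    alphabet_codes = np.frombuffer(alphabet.encode("ascii"), dtype=np.uint8)
    # Broadcasting (N, L, 1) against (A,) gives an (N, L, A) boolean array.
    return (codes[:, :, None] == alphabet_codes).astype(dtype)


encoded = one_hot_sequences(["ACGT", "TTAA"])
```

With this approach a letter outside the alphabet produces an all-zero row rather than an error, so it would still need an alphabet check beforehand.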
Several functions in wrangle.py are also defined in singletask.py. We should choose one implementation of each and remove the other.
The function enforce_constant_size accepts a path to a bed file and writes a bed file. This feels more like command-line behavior, where input and output are files. I suggest that this function take a pandas dataframe representation of a bed file and return a modified dataframe. The processing script that uses this function should take care of loading and saving files, if that is necessary.
I have an implementation of this here: https://github.com/kaczmarj/rotation-koo-lab/blob/310bdf589c59f4ef9ef8511a479e69925c4bbcf3/chip-seq-to-hdf5-dataset/chipseq_utils.py#L102-L132
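To make the proposed interface concrete, here is a minimal sketch of a dataframe-in, dataframe-out version. It is not the linked implementation. The column names chrom, start, and end, the choice to center each window on the interval midpoint, and the choice to drop intervals that would start below zero are all assumptions made for illustration.

```python
import pandas as pd


def enforce_constant_size_df(bed: pd.DataFrame, window: int) -> pd.DataFrame:
    """Return a copy of `bed` with every interval resized to `window` bases."""
    out = bed.copy()
    # Center the fixed-size window on the midpoint of each interval.
    midpoint = (out["start"] + out["end"]) // 2
    out["start"] = midpoint - window // 2
    out["end"] = out["start"] + window
    # Drop intervals that would start before the beginning of the chromosome.
    return out[out["start"] >= 0].reset_index(drop=True)


bed = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"], "start": [100, 10, 500], "end": [180, 30, 900]}
)
resized = enforce_constant_size_df(bed, window=200)
```

Because the function never touches the filesystem, it can be tested on small in-memory tables, and the calling script decides how to read and write the bed files.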
pandas.read_csv infers the compression used (if any) by default. See the documentation at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html. So we can remove the compression argument.
The .to_csv() method also infers compression. If you pass a filename with a .gz extension, the file will be gzip-compressed.
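A quick round trip shows the inference at work. The file name peaks.bed.gz and the table contents are made up for this example. Neither call passes a compression argument.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [0, 50], "end": [10, 90]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "peaks.bed.gz"
    # No compression argument: pandas infers gzip from the ".gz" extension.
    df.to_csv(path, sep="\t", index=False)
    # gzip files start with the magic bytes 0x1f 0x8b.
    is_gzip = path.read_bytes()[:2] == b"\x1f\x8b"
    # Reading also infers the compression from the extension.
    roundtrip = pd.read_csv(path, sep="\t")
```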
The function make_directory can be replaced with

    pathlib.Path("path/to/directory").mkdir(exist_ok=True)
See https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir. Pathlib has lots of useful operations for paths.
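A small example of the replacement follows. The directory names are hypothetical. Adding parents=True is an assumption about the desired behavior: it also creates missing intermediate directories, which exist_ok=True alone does not.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "results" / "motifs"
    # parents=True also creates the missing "results" directory;
    # exist_ok=True makes a repeated call a no-op instead of an error.
    target.mkdir(parents=True, exist_ok=True)
    target.mkdir(parents=True, exist_ok=True)
    created = target.is_dir()
```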