kevinhu / cancer_data
A unified downloader+preprocessor for cancer genomics datasets
Home Page: https://cancer_data.kevinhu.io
License: MIT License
Hi, thanks for putting this repo together, it looks very handy.
On cancer_data version 0.1.0, I tried to download the tcga_normalized_gene_expression dataset via cancer_data.download("tcga_normalized_gene_expression"), but this failed with the message:
243iB [00:00, 58.9kiB/s]
EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz does not match provided md5sum. Attempting second download.
Downloading https://pancanatlas.xenahubs.net/download/EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz
243iB [00:00, 36.1kiB/s]
Second download of EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz failed. Recommend manual inspection.
Yet when I manually inspect the md5sum of the .gz, everything looks ok:
import hashlib
import cancer_data
schema_md5 = cancer_data.schema().loc['tcga_normalized_gene_expression']['downloaded_md5']
fname = "/Users/pat/Downloads/EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz"
with open(fname, "rb") as f:
    data = f.read()
observed_md5 = hashlib.md5(data).hexdigest()
assert schema_md5 == observed_md5 # "5fbfb5a4854a2cfc8a95c3ada5379fd4"
Am I doing something silly? Thanks in advance.
There are many changes in the recent DepMap release.
Hi, it looks like in cases where two schema entries share the same downloaded_name, any operation that touches both will show md5 mismatches and rerun the download step.
You could put a download_as field in the schema, or form a different filename out of a combination of fields guaranteed to be unique. Then rename the file after download.
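As a rough illustration of the second suggestion, here is a minimal sketch of deriving a collision-free local filename from the schema entry id plus downloaded_name. The helper unique_download_name, the "__" separator, and the two example entries are hypothetical, made up for illustration; this is not how cancer_data currently names files.

```python
def unique_download_name(entry_id: str, downloaded_name: str) -> str:
    """Build a local filename that is unique per schema entry.

    Hypothetical helper: the entry id is assumed to be unique within the
    schema, so prefixing it keeps two entries that share a downloaded_name
    from mapping to the same file on disk.
    """
    return f"{entry_id}__{downloaded_name}"


# Made-up schema entries that share one downloaded_name (illustration only).
entries = {
    "dataset_a": {"downloaded_name": "shared_file.gz"},
    "dataset_b": {"downloaded_name": "shared_file.gz"},
}

# Map each entry id to the name its download would be renamed to.
local_names = {
    entry_id: unique_download_name(entry_id, entry["downloaded_name"])
    for entry_id, entry in entries.items()
}

for entry_id, name in local_names.items():
    print(entry_id, "->", name)
```

With names like these, each entry's md5 would be checked against its own file, so touching both entries should no longer trigger a mismatch and a re-download.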