maayanlab / archs4
ARCHS4 RNA-seq processing scripts and web server pages.
License: Other
Hi and thank you for this great resource! I have been trying to download the HDF5 files using the auto-generated R scripts, but I keep running into this error regardless of the dataset I try to download.
"File download ran into problems. Please try to download again."
Do you have any recommendations on how to fix this?
Best,
Dylan
API backend does not seem to return genes
I downloaded the R code for human esophagus from https://amp.pharm.mssm.edu/archs4/data.html and ran it in R.
An error occurred at the step that retrieves information from the compressed data:
samples = h5read(destination_file, "meta/Sample_geo_accession")
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.
This problem has been reported here by others, but I could not solve it by applying the suggested solution because I am new to R. Could anyone guide me through it? Thanks a lot.
Hi,
I noticed that there is new ARCHS4 data (from 2024), yet the gene correlation files are old (2018).
I was wondering if the correlation data on the site itself is up to date and if so, is there a way to download it as a file?
Thanks!
I downloaded the R code for human esophagus from https://amp.pharm.mssm.edu/archs4/data.html and ran it in R.
An error occurred at the step that retrieves information from the compressed data:
samples = h5read(destination_file, "meta/Sample_geo_accession")
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) :
HDF5. File accessibilty. Unable to open file.
I reinstalled the "rhdf5" package, but the problem persisted.
It seems that the transcript information is incomplete for the mouse transcript-level expression matrix (mouse_hiseq_eid_1.0.h5).
The data/expression matrix has quantitation for 178,136 transcripts; however, the meta/ensemblid and meta/transcriptlength objects only have 98,492 entries each.
The dimensionality of these same objects in human_hiseq_eid_1.0.h5, however, appears to be concordant.
Hi,
Thank you for your data. Could you please tell me which version of ensembl you use to create human_matrix.h5 v8 (Date: 2/2020)? I need to calculate gene-level TPM from transcript data. Thank you.
Best,
Zheng
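For anyone with the same goal: because TPM values are already length-normalized, a gene-level TPM is usually obtained by summing the TPMs of that gene's transcripts. The Python sketch below shows only that aggregation step. The function name gene_level_tpm, the identifiers, and the transcript-to-gene mapping are hypothetical placeholders; the real mapping would have to come from the same Ensembl release that was used to build the file, which is the open question above.

```python
from collections import defaultdict


def gene_level_tpm(transcript_tpm, transcript_to_gene):
    """Sum transcript-level TPM values into gene-level TPM values.

    transcript_tpm:     dict of transcript ID -> TPM for one sample
    transcript_to_gene: dict of transcript ID -> gene ID (hypothetical mapping,
                        assumed to match the annotation used for quantification)

    Returns (gene_tpm, unmapped), where unmapped lists the transcript IDs that
    have no gene in the mapping and were therefore left out of the sums.
    """
    gene_tpm = defaultdict(float)
    unmapped = []
    for transcript_id, value in transcript_tpm.items():
        gene_id = transcript_to_gene.get(transcript_id)
        if gene_id is None:
            # Typical symptom of an annotation-version mismatch.
            unmapped.append(transcript_id)
            continue
        gene_tpm[gene_id] += value
    return dict(gene_tpm), unmapped
```

A long unmapped list is a hint that the mapping comes from a different annotation release than the data.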
The human_transcript.h5 (v8) file (rounded TPMs) seems to be missing almost all of the metadata for the data in this file.
For instance, the human_transcript_v8.h5/meta/* directory in the HDF5 file only has a Sample_channel_count file in it (no transcript IDs, or anything else).
Hello @lachmann12 :
I ran a differential expression analysis on ARCHS4 data, but when I use DESeq, DESeq2, or edgeR to find differentially expressed genes (DEGs), the number of DEGs is zero for GSE49110 in ARCHS4. When I use raw counts from GEO, I get more than 100 DEGs. This is not a one-off. The correlation between GEO and ARCHS4 for GSM1193921 is over 0.7. I cannot work out why the number of DEGs is zero for GSE49110 in ARCHS4.
ARCHS4 is a great database, I love it.
Best wishes,
It would be great if the human gene-level HDF5 file included a meta/gene_ensemblid object, like the mouse file does, so that users can use those a bit more confidently in downstream analysis.
The human_matrix.h5 (v8) file I took for a spin when v8 first came out does not include them.
Hey,
I hope this is the right place to ask these questions - please point me in the right direction if not.
There were two questions that arose while working with the downloadable gene expression h5 files:
Thank you!
Hi,
Why are the gene counts from elysium in float format whereas the gene counts from archs4 in integer?
Thanks.
Hi,
I have about 700+ fastq files (16 GSE IDs) I would like to submit to get gene expression files. Is there a way for me to programmatically do this instead of going to the elysium/biojupies webpage and uploading the file each time?
Thanks and good day.
In mouse matrix v9 there are only 307268 elements in the probabilities vector, but there are 360627 samples in total.
Is this expected?
Am I right that I can match these probabilities with the first 307268 samples?
Not an issue with the GH repo per se, but I noticed that the latest ARCHS4 file for human gene expression (human_gene_v2.5.h5) fails a file integrity check.
Reproduce:
wget https://s3.dev.maayanlab.cloud/archs4/files/human_gene_v2.5.h5
sha1sum human_gene_v2.5.h5
Compare the result with the expected SHA1 from the download page:
Expected SHA1: a7b21b55515959add7b1d620371bc4b2fb610976
Actual SHA1: ae96de0519b9f008b0dc3a9f944ee9007daf2f6a
To make sure it wasn't just a network issue on my end, I reconstructed the "etag" for the file on S3. The expected etag is 7c26b4ebb22b89d795968bf37df4b5e4-5706 (based on the output of curl -I https://s3.dev.maayanlab.cloud/archs4/files/human_gene_v2.5.h5). I verified that the multi-part file etag I calculated for the file on my disk does match this expected etag.
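The two checks described above can be sketched in Python as follows. This assumes the commonly observed S3 multipart convention: the MD5 of the concatenated binary MD5 digests of each part, followed by a dash and the part count. The upload part size is not stated in this report, so part_size is a parameter you would have to determine or guess. The helper names file_sha1 and multipart_etag are illustrative and not part of any ARCHS4 tooling.

```python
import hashlib


def file_sha1(path, block_size=1024 * 1024):
    """Return the hex SHA1 of a file, read in blocks to keep memory use low."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def multipart_etag(path, part_size):
    """Return an S3-style multipart ETag ("<md5 of part md5s>-<part count>").

    part_size is the upload part size in bytes (an assumption; it must match
    the size used when the file was uploaded). Each part is held in memory
    while it is hashed.
    """
    part_digests = []
    with open(path, "rb") as handle:
        while True:
            part = handle.read(part_size)
            if not part:
                break
            part_digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

For example, multipart_etag("human_gene_v2.5.h5", 8 * 1024 * 1024) would try an 8 MiB part size, which is a common client default but only a guess here. If the etag matches while the SHA1 does not, the local copy is most likely identical to the stored object, and the mismatch lies between the stored object and the published checksum.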
Hello, I had a few questions around the metadata of downloaded h5 files. Namely:
Those are my main queries for now!
Thank you in advance,
Edgar
Hi @AviMaayan and @lachmann12
Thanks for the work here.
I read through the license page (https://github.com/MaayanLab/archs4/blob/master/LICENSE), and it is not clear to me whether utilities such as gget can be used to query the database programmatically when the results of those queries will be used in R&D work or presentations in biotech/pharma.
As an example:
gget archs4 -w tissue ACE2
Thanks in advance
Hi,
What is the pipeline used to create human_matrix_v2.1.2.h5 and mouse_matrix_v2.1.2.h5 ? Is it the same as the pipeline mentioned in the 2018 paper?
Thanks.
Hi there,
We love ARCHS4 and would like to extract transcript (all isoforms) expression data for our gene (ILRUN) in different tissues. There are three transcripts for ILRUN in Ensembl:
ENST00000374023.8
ENST00000374026.7
ENST00000374021.1
However, only the latter two are retrieved from your human_transcript_v7.h5 file. The first, ENST00000374023.8, is isoform a, which is believed to be the dominantly expressed transcript (principal isoform).
Looking forward to your response.
Kind regards,
Marina
Thank you kindly,
Marina
Hi again,
When I upload a fastq.gz file to elysium from a server (tunneled via SMB), I get the following error:
RequestTimeout: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
Please advise. Thanks.
Hello!
I noticed today that if you attempt to download multiple files at once, it can spawn over 100 downloads for each one on Chrome. My example is that I tried to download from the mouse+sample page several tissue-specific gene expression files at once (Image attached).
It took a while to start downloading -- and when it did start, it downloaded about 300 zipped folders totaling nearly 4 GB. I am also fairly sure it downloaded the wrong files since the gene count tsv in the colon folder only had 285 samples -- though according to the page I downloaded it from, it should contain 1169 samples.
Anyways, hope this is helpful! I really love ARCHS4 and despite this issue I think it's an amazing tool!
Best,
Henry Miller
Hello @lachmann12,
I was wondering why the mouse transcript quantification only has 98,492 rows. The metadata says you used Ensembl v90, which has 131,195 unique transcripts in the GTF file and 109,282 in the cDNA file provided by Ensembl. Was there any additional filtering?
Thank you!
Hello,
The latest ARCHS4 (ARCHS4 Version 2.3) is based on Ensembl 107.
Thus, I assume the Entrez gene symbols were annotated based on GENCODE v41.
However, there is no such information for 'human_matrix_v1.11.h5'.
That data was released on 11-16-2021, so I thought the gene symbols were annotated based on GENCODE v38, which was released in 05-2021. However, 2,971 of the 35,238 genes in 'human_matrix_v1.11.h5' do not match GENCODE v38 gene names.
So which GENCODE version was used for the 'human_matrix_v1.11.h5' gene name annotation?
Dear ARCHS4 developers,
As in the title, I was wondering what the difference is between these two files. I noticed that the values and the number of genes are different.
Is archs4.f the more recent one?
Thanks!
When an ARCHS4 query is performed for a particular gene, and the results page (e.g. https://amp.pharm.mssm.edu/archs4/gene/ACE2) displays numerous tables of output data, it would be helpful to allow users to select each individual table for download as a file; offering multiple formats such as CSV, TSV, and TXT would be convenient.
Hi @lachmann12. I really appreciate this resource; it is truly a great help. I noticed this too: as you can see in the attached screenshot, the values of each entry are not identical. Should I infer that these were at the transcript level rather than the gene level?
Hi,
When I use the R script (h5read(destination_file, "meta/genes/gene_symbol")) the matrix generated includes genes with prefix ENSG along with regular gene symbols. Why is this?
Thanks and good day.
Are there instructions somewhere on how to run this pipeline? I would like to expand upon it, but it's not clear to me what the process is or what the dependencies are.
Much appreciated.
When I download an R script to read gene expression data for human data, the initial variables are as follows:
destination_file = "human_matrix_v10.h5"
extracted_expression_file = "GSE30017_expression_matrix.tsv"
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v10.h5"
I see that on the ARCHS4 downloads page there is a "human_matrix_v11.h5" available. Should the h5 file that the R script prompts the user to download be the updated "v11" data?
Hello ARCHS4 team.
Thanks for developing this database and making it so freely available. The fact that all the raw files were uniformly processed and kallisto counts were directly shared is pretty awesome.
My issue is regarding the GEO accession: GSE57872 for homo sapiens. In GEO database, there are 800+ samples for this study and processed data is made available for 500+ after removal of low quality cells.
But in ARCHS4, only 80-odd samples are available from this study. May I know what filters or criteria were used during processing to reject the remaining cells?
I could not find this information in the ARCHS4 publication or in the code shared here. (I tried to go through it as best I could.)
I hope this isn't something silly, as no one has raised this kind of issue before.
The ordering of the entries in meta/genes has changed between the v4 and v5 mouse_matrix.h5 files, and the corresponding meta/gene_ensemblid wasn't updated to match.
For the v5 dataset, it looks like the expression values found in data/expression likely correspond to the re-ordered meta/genes entries, which makes the v5 meta/gene_ensemblid entries wrong, and likely the other gene-level metadata in the v5 matrix as well (i.e. I just checked that the Entrez IDs haven't changed from v4, which means they would also be incorrect).
library(rhdf5)
library(dplyr)

# Paths to the two releases being compared.
v4.h5 <- "mouse_matrix_v4.h5"
v5.h5 <- "mouse_matrix_v5.h5"

# Put gene names and Ensembl IDs from both files side by side, row for row.
ginfo <- tibble(
  v4name = h5read(v4.h5, "meta/genes"),
  v4ens  = h5read(v4.h5, "meta/gene_ensemblid"),
  v5name = h5read(v5.h5, "meta/genes"),
  v5ens  = h5read(v5.h5, "meta/gene_ensemblid"))
head(ginfo)
# A tibble: 6 x 4
#   v4name  v4ens              v5name        v5ens
#   <chr>   <chr>              <chr>         <chr>
# 1 A1bg    ENSMUSG00000022347 0610007P14Rik ENSMUSG00000022347
# 2 A1cf    ENSMUSG00000052595 0610009B22Rik ENSMUSG00000052595
# 3 A2m     ENSMUSG00000030111 0610009L18Rik ENSMUSG00000030111
# 4 A3galt2 ENSMUSG00000028794 0610009O20Rik ENSMUSG00000028794
# 5 A4galt  ENSMUSG00000047878 0610010F05Rik ENSMUSG00000047878
# 6 A4gnt   ENSMUSG00000037953 0610010K14Rik ENSMUSG00000037953

# The Ensembl IDs are identical across versions even though the gene names were reordered.
all.equal(ginfo$v4ens, ginfo$v5ens)
# [1] TRUE
(cc @lachmann12 )
Hi,
When I download certain gene expression files (e.g. GSE121380) from the generated R scripts I run into the following error:
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
Calls: t ... tryCatch -> tryCatchList -> tryCatchOne ->
Error: Error in h5checktype(). H5Identifier not valid.
Execution halted
I've tried using up to 120 GB of memory and I still get the same error.
Please advise. Thanks.
All RNA-seq and ChIP-seq sample and signature search (ARCHS4) (https://maayanlab.cloud/archs4/) is a resource that provides access to gene and transcript counts uniformly processed from all human and mouse RNA-seq experiments from GEO and SRA. The ARCHS4 website provides the uniformly processed data for download and programmatic access in H5 format, and as a 3-dimensional interactive viewer and search engine. Users can search and browse the data by metadata enhanced annotations, and can submit their own gene sets for search. Subsets of selected samples can be downloaded as a tab delimited text file that is ready for loading into the R programming environment. To generate the ARCHS4 resource, the kallisto aligner is applied in an efficient parallelized cloud infrastructure. Human and mouse samples are aligned against the most recent Ensembl annotation (Ensembl 107).
Hi,
The datasets look very good here.
I wish to download the files to look at the human data. I am most interested in the ages of the donors across all of the tissues. Is this metadata embedded within these H5 files? Or is there a separate metadata file available that contains such information? I would love to get hold of this information before working on the main files.
Many thanks.
Hi,
In issue #30, you shared how to obtain gene abundance values from the transcript expression levels. I would like to know how to obtain CPM and TPM values from these gene abundance values (gene_abundance.tsv). From what I understand some normalization is already performed to obtain gene_abundance.tsv. Can I still just perform the regular calculations for CPM and TPM?
Thanks.
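For reference, the standard CPM and TPM formulas can be sketched in Python as below. This is a minimal sketch that assumes the input values are raw, unnormalized counts for one sample and that a length in bases is available for every gene. Whether gene_abundance.tsv meets that assumption is not established here, and the function names and the lengths_bp input are hypothetical.

```python
def cpm(counts):
    """Counts per million: scale each count by the sample's total count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has zero total counts")
    return {gene: value / total * 1e6 for gene, value in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6.

    lengths_bp maps each gene to a length in bases (hypothetical input; for
    gene-level data this is usually some effective or representative length).
    """
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("sample has zero total length-normalized counts")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}
```

Both functions work on one sample at a time, and each result sums to one million within a sample. If the input has already been length-normalized, applying tpm again would normalize by length twice, so it is worth confirming what the input values represent first.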
Is the status of a given sample/experiment captured somewhere in the metadata for a given dataset download? I'd like to know if a given series/experiment was tagged with a phenotype, e.g. "breast cancer", or something similar.
I am interested in RNA-seq datasets that were rRNA-depleted. Would I be able to search for that in the ARCHS4 interface?
Please let me know.