Comments (10)
I understand. It is far faster than scraping.
I've modified the query to yield the title:
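# Human genes that have an English Wikipedia article
SELECT DISTINCT ?item ?ncbi_gene ?itemLabel ?pageTitleEN
WHERE {
  # English Wikipedia article about the item, and its title
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
  # Replace spaces with underscores to match the page name used in URLs
  BIND(REPLACE(STR(?title), " ", "_") AS ?pageTitleEN)
  # Items with an NCBI gene ID (P351) found in Homo sapiens (P703)
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}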
Edit: Added a way to replace spaces with _
from gene-hints.
Is this what we're looking for?
SELECT DISTINCT ?item ?ncbi_gene ?itemLabel
WHERE {
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
I would need to verify that the data matches, as the above query yields ~58k lines, whereas the homo-sapiens.TSV file contains ~29k lines.
I'm not sure how to retrieve whether the data is trending yet. What do we look for on that?
from gene-hints.
Nice, @gabedemattos, this is close! Your query returns the Wikidata item label, but we'll want the Wikipedia article title.
For example, the first result of that query is Q13566093, which has Wikidata item label "C2" and Wikipedia article title "Complement component 2".
> how to retrieve whether the data is trending
Your query will likely be plugged into wikipedia_trends/generate_gene_page_map.py, replacing the web scraping approach there. The 58k figure you note is double the roughly 29k known human genes, so we'll refine your query's results and output the product as wikipedia_trends/gene_page_map.tsv.
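One way to see where the doubling comes from is to count how many result rows share an NCBI gene ID. This is a minimal sketch, assuming the query results have already been loaded as a list of dicts keyed by the query's variable names. The helper name and the sample rows are made up for illustration and are not real Wikidata values.

```python
from collections import Counter


def duplicated_gene_ids(rows):
    """Return {ncbi_gene: row_count} for gene IDs that appear in more than one row."""
    counts = Counter(row["ncbi_gene"] for row in rows)
    return {gene_id: n for gene_id, n in counts.items() if n > 1}


# Made-up sample rows shaped like the query's output columns (not real data)
rows = [
    {"item": "Q_FAKE_1", "ncbi_gene": "111", "itemLabel": "GENE1"},
    {"item": "Q_FAKE_1", "ncbi_gene": "111", "itemLabel": "GENE1 alias"},
    {"item": "Q_FAKE_2", "ncbi_gene": "222", "itemLabel": "GENE2"},
]

print(duplicated_gene_ids(rows))
```

If most gene IDs show up exactly twice, the extra rows likely come from one repeated variation per gene rather than from extra genes, which would narrow down what to refine.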
The Wikidata query approach will be more correct and faster. Our web scraping approach is ~99% correct in mapping gene symbols to Wikipedia page names, and to my understanding takes a few hours. The Wikidata approach is theoretically 100% correct and would likely take less than a minute. The higher correctness is the main benefit.
Your refined gene_page_map.tsv will then be automatically picked up by wikipedia_trends/process_trends.py, which determines how the gene is trending in Wikipedia page views.
@snwessel, does that sound reasonable?
from gene-hints.
That sounds great, and I'm glad to hear that this approach is working out so well!
The only thing I would add is that when you output the results to gene_page_map.tsv, you'll need to convert any spaces in the title to underscores to match what the Wikipedia trends files use. This converts the page title, like "Complement component 2", to the string used at the end of its corresponding Wikipedia page URL: https://en.wikipedia.org/wiki/Complement_component_2
You can do this in Python with .replace(" ", "_")
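For instance, a small sketch of that conversion (the helper name is hypothetical):

```python
def title_to_page_name(title):
    """Convert a Wikipedia article title to the form used at the end of its URL."""
    return title.replace(" ", "_")


page_name = title_to_page_name("Complement component 2")
url = "https://en.wikipedia.org/wiki/" + page_name
print(url)
```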
Also, once you have an updated wikipedia_trends/gene_page_map.tsv, feel free to rerun the process_trends.py script locally to check that the results still look reasonable. The script should take only about 10-15 minutes to run.
from gene-hints.
Also @gabedemattos let me know if you want to pair program on any of this, or if it would be helpful for me to walk you through the existing code. I would be happy to meet anytime!
from gene-hints.
Sure. I did some quick experiments using PySpark to process trends on a private cluster. The processing is much faster, but running the cluster adds overhead, so I'll investigate a few other tools that allow parallel API fetching from MediaWiki. That comes at the cost of more network bandwidth: about 200 MB if we fetch all of the human genes in one go, which at one run per day is about 6 GB per month. Would you please let me know if this bandwidth is feasible?
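The monthly figure follows from the daily estimate, assuming a 30-day month and 1 GB = 1000 MB:

```python
DAILY_MB = 200        # estimated download per run, from the figure above
RUNS_PER_MONTH = 30   # assumption: one run per day over a 30-day month

monthly_gb = DAILY_MB * RUNS_PER_MONTH / 1000  # assumption: 1 GB = 1000 MB
print(f"{monthly_gb:g} GB per month")
```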
There's a problem with the SPARQL query. By adding the title, it returns about 11k results.
@snwessel Thanks! I'll schedule a meeting for next week.
from gene-hints.
> 6 GB per month. Would you please let me know if this bandwidth is feasible?
Yes, the usage limits for GitHub Actions list no bandwidth limit.
> There's a problem with the SPARQL query. By adding the title, it returns about 11k results.
That actually sounds correct! Per https://en.wikipedia.org/wiki/Gene_Wiki#Number_of_gene_articles,
"The human genome contains an estimated 20,000–25,000 protein-coding genes.[5] The goal of the Gene Wiki project is to create seed articles for every notable human gene, that is, every gene whose function has been assigned in the peer-reviewed scientific literature. Approximately half of human genes have assigned function, therefore the total number of articles seeded by the Gene Wiki project would be expected to be in the range of 10,000–15,000. To date, approximately 11,000 articles have been created or augmented to include Gene Wiki project content.[6]"
from gene-hints.
@eweitz I don't have permission to create a branch, but here's the Python implementation of the SPARQL code, which is far faster.
It requires the SPARQLWrapper and pandas libraries.
It also requires renaming the column, either there or in process_trends.py.
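A minimal sketch of what such a script might look like. This is not the implementation referenced above: it uses only the Python standard library rather than SPARQLWrapper and pandas, and the function names, the User-Agent string, the output column names (`symbol`, `page_title`), and the output path are all assumptions made for illustration. The Wikidata item label is used as a stand-in for the gene symbol, which may not always hold.

```python
import csv
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT DISTINCT ?item ?ncbi_gene ?itemLabel ?pageTitleEN
WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
  BIND(REPLACE(STR(?title), " ", "_") AS ?pageTitleEN)
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
"""


def fetch_bindings(query):
    """Run the query against the Wikidata endpoint and return the result bindings."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "gene-page-map-sketch/0.1",  # hypothetical identifier
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def bindings_to_rows(bindings):
    """Flatten SPARQL JSON bindings into (label, page_title) pairs."""
    return [(b["itemLabel"]["value"], b["pageTitleEN"]["value"]) for b in bindings]


def write_tsv(rows, path):
    """Write the rows as a tab-separated file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["symbol", "page_title"])  # assumed column names
        writer.writerows(rows)


def main():
    # Performs a network request; the output path is an assumption
    write_tsv(bindings_to_rows(fetch_bindings(QUERY)), "gene_page_map.tsv")
```

Calling `main()` would run the query and write the file. The header names would need to match whatever process_trends.py expects, per the renaming note above.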
from gene-hints.
Looking good! I invited you to this repo soon after our first chat. You should see the invitation email from GitHub in your inbox around then.
Once you accept that, you can create a branch or commit code to this repo however you like. That will also ease declaring those dependencies in requirements.txt.
from gene-hints.
Closing the issue.
from gene-hints.