Wikidata — a project related to Wikipedia — has a service that would allow us to get W

Use SPARQL to get Wikipedia article title given gene ID about gene-hints HOT 10 CLOSED

eweitz commented on May 26, 2024

Use SPARQL to get Wikipedia article title given gene ID

from gene-hints.

Comments (10)

gabelmattos commented on May 26, 2024

I understand. It is infinitely faster than scraping.

I've modified the query to yield the title:

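# Human genes that have an English Wikipedia article
SELECT DISTINCT ?item ?ncbi_gene ?itemLabel ?pageTitleEN
WHERE {
  # The English Wikipedia article about the item, and its title
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
  # Replace spaces with underscores, as in Wikipedia URLs
  BIND(REPLACE(STR(?title), " ", "_") AS ?pageTitleEN) .
  # P351: Entrez (NCBI) Gene ID; P703: found in taxon; Q15978631: Homo sapiens
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}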

Edit: Added a way to replace spaces with _

gabelmattos commented on May 26, 2024

Is this what we're looking for?

SELECT DISTINCT ?item ?ncbi_gene ?itemLabel
WHERE {
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

I would need to verify that the data matches as the above query yields ~58k lines, whereas the homo-sapiens.TSV file contains ~29k lines.
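
One way to start that check is to count how many rows share a gene ID, since repeated IDs would inflate the row count. A minimal sketch, assuming the query output has been loaded as (gene ID, label) pairs; the function name and the sample rows are made up for illustration, and repeated IDs are only one possible cause of the gap:

```python
from collections import Counter


def summarize_gene_rows(rows):
    """Summarize (gene_id, label) rows: total rows, distinct IDs, repeated IDs."""
    counts = Counter(gene_id for gene_id, _label in rows)
    repeated = [gene_id for gene_id, n in counts.items() if n > 1]
    return {
        "rows": sum(counts.values()),
        "distinct_gene_ids": len(counts),
        "repeated_gene_ids": len(repeated),
    }


# Made-up rows, only to show the shape of the input.
sample_rows = [("1", "GENE_A"), ("1", "GENE_A alias"), ("2", "GENE_B")]
print(summarize_gene_rows(sample_rows))
```

If the number of distinct gene IDs is close to the line count of the TSV file, the extra rows are probably repeats and not extra genes.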

I'm not sure how to retrieve whether the data is trending yet. What do we look for on that?

eweitz commented on May 26, 2024

Nice, @gabedemattos, this is close! Your query returns the Wikidata item label, but we'll want the Wikipedia article title.

For example, the first result of that query is Q13566093, which has Wikidata item label "C2" and Wikipedia article title "Complement component 2".

> how to retrieve whether the data is trending

Your query will likely be plugged into wikipedia_trends/generate_gene_page_map.py, replacing the web scraping approach there. The 58k figure you note is double the roughly 29k known human genes, so we'll refine your query's results and output the product as wikipedia_trends/gene_page_map.tsv.

The Wikidata query approach will be more correct and faster. Our web scraping approach is ~99% correct in mapping gene symbols to Wikipedia page names, and to my understanding takes a few hours. The Wikidata approach is theoretically 100% correct and would likely take less than a minute. The higher correctness is the main benefit.

Your refined gene_page_map.tsv will then be automatically picked up in wikipedia_trends/process_trends.py, which determines how the gene is trending in Wikipedia page views.

@snwessel, does that sound reasonable?

snwessel commented on May 26, 2024

That sounds great, and I'm glad to hear that this approach is working out so well!

The only thing I would add is that when you output the results to gene_page_map.tsv, you'll need to convert any spaces in the title to underscores to match what the Wikipedia trends files use. This converts a page title like "Complement component 2" to the string used at the end of its corresponding Wikipedia page URL: https://en.wikipedia.org/wiki/Complement_component_2

You can do this in Python with .replace(" ", "_")
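
As a small illustration of that conversion, here is a sketch using the example title from earlier in the thread; the helper name is hypothetical and not taken from the repo:

```python
def title_to_page_name(title: str) -> str:
    """Convert a Wikipedia article title to the form used at the end of its URL."""
    return title.replace(" ", "_")


page_name = title_to_page_name("Complement component 2")
print(page_name)  # Complement_component_2
print("https://en.wikipedia.org/wiki/" + page_name)
```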

Also, once you have an updated wikipedia_trends/gene_page_map.tsv, feel free to rerun the process_trends.py script locally to check that the results still look reasonable. The script should only take like 10-15 minutes to run.

snwessel commented on May 26, 2024

Also @gabedemattos let me know if you want to pair program on any of this, or if it would be helpful for me to walk you through the existing code. I would be happy to meet anytime!

gabelmattos commented on May 26, 2024

Sure. I did some quick experiments using PySpark to process trends on a private cluster. The processing is much faster, but running the cluster adds overhead, so I'll investigate a few other tools that allow parallel API fetching from MediaWiki. That would still raise network usage to about 200 MB per run if we fetch all of the human genes in one go; run daily (200 MB x 30 days), that comes to about 6 GB per month. Would you please let me know if this bandwidth is feasible?

There's a problem with the SPARQL query. By adding the title, it returns about 11k results.

@snwessel Thanks! I'll schedule a meeting for next week.

eweitz commented on May 26, 2024

> 6 GB per month. Would you please let me know if this bandwidth is feasible?

Yes, the documented usage limits for GitHub Actions list no bandwidth limit.

> There's a problem with the SPARQL query. By adding the title, it returns about 11k results.

That actually sounds correct! Per https://en.wikipedia.org/wiki/Gene_Wiki#Number_of_gene_articles,

"The human genome contains an estimated 20,000–25,000 protein-coding genes.[5] The goal of the Gene Wiki project is to create seed articles for every notable human gene, that is, every gene whose function has been assigned in the peer-reviewed scientific literature. Approximately half of human genes have assigned function, therefore the total number of articles seeded by the Gene Wiki project would be expected to be in the range of 10,000–15,000. To date, approximately 11,000 articles have been created or augmented to include Gene Wiki project content.[6]"

gabelmattos commented on May 26, 2024

@eweitz I don't have permission to create a branch, but here's the Python implementation of the SPARQL code, which is far faster than scraping.
It requires the SPARQLWrapper and pandas libraries.

It also requires renaming the column, either there or in process_trends.py.
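
For illustration, here is a minimal sketch of running the query from Python using only the standard library instead of SPARQLWrapper and pandas. It is not the implementation referred to above. The function names, the User-Agent string, and the TSV column names are assumptions; the endpoint is the public Wikidata Query Service.

```python
import csv
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_URL = "https://query.wikidata.org/sparql"

# Human genes (P703 = Homo sapiens) with an Entrez Gene ID (P351)
# and an English Wikipedia article.
QUERY = """
SELECT DISTINCT ?ncbi_gene ?title WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
}
"""


def bindings_to_rows(results):
    """Turn SPARQL JSON results into (gene_id, page_name) tuples."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Spaces become underscores to match Wikipedia page URLs.
        page_name = binding["title"]["value"].replace(" ", "_")
        rows.append((binding["ncbi_gene"]["value"], page_name))
    return rows


def fetch_gene_page_rows(query=QUERY):
    """Run the query against the Wikidata Query Service (needs network access)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{WDQS_URL}?{params}",
        # Hypothetical User-Agent; replace with one that identifies your project.
        headers={"User-Agent": "gene-page-map-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return bindings_to_rows(json.load(response))


def write_tsv(rows, path, header=("ncbi_gene", "page_name")):
    """Write rows to a TSV file; the header names here are assumed, not the repo's."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
```

Calling write_tsv(fetch_gene_page_rows(), "gene_page_map.tsv") would then produce the map file. The header would still need to be changed to whatever column names process_trends.py expects.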

eweitz commented on May 26, 2024

Looking good! I invited you to this repo soon after our first chat. You should find the invitation email from GitHub in your inbox, dated around then.

Once you accept that, you can create a branch or commit code to this repo however you like. That will also ease declaring those dependencies in requirements.txt.

gabelmattos commented on May 26, 2024

Closing the issue.
