
Comments (10)

gabelmattos avatar gabelmattos commented on May 26, 2024

I understand. It is infinitely faster than scraping.

I've modified the query to yield the title:

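# Human genes that have an English Wikipedia article, with the article title
SELECT DISTINCT ?item ?ncbi_gene ?itemLabel ?pageTitleEN
WHERE {
  # ?article is the English Wikipedia page about the Wikidata item
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
  # Replace spaces with underscores to match Wikipedia URL page names
  BIND(REPLACE(STR(?title), " ", "_") AS ?pageTitleEN)
  # P351: Entrez (NCBI) Gene ID; P703: found in taxon; Q15978631: Homo sapiens
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
}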

Edit: Added a way to replace spaces with _

from gene-hints.

gabelmattos avatar gabelmattos commented on May 26, 2024

Is this what we're looking for?

SELECT DISTINCT ?item ?ncbi_gene ?itemLabel
WHERE {
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
}

I would need to verify that the data matches as the above query yields ~58k lines, whereas the homo-sapiens.TSV file contains ~29k lines.
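One way to start that check is to count how many distinct items and distinct gene IDs the result rows contain, since a row count that is double the expected number may come from duplicated items. The sketch below assumes the rows have already been loaded as (item, ncbi_gene, label) tuples; the sample values are made-up placeholders, not real query output.

```python
def summarize(rows):
    """Count total rows, distinct items, and distinct gene IDs.

    rows: iterable of (item, ncbi_gene, label) tuples (assumed layout).
    """
    rows = list(rows)
    return {
        "rows": len(rows),
        "distinct_items": len({row[0] for row in rows}),
        "distinct_gene_ids": len({row[1] for row in rows}),
    }


# Made-up placeholder rows: one item appears twice with different gene IDs
sample = [
    ("item-1", "100", "A"),
    ("item-1", "101", "A"),
    ("item-2", "200", "B"),
]
print(summarize(sample))
```

If the distinct item count is much lower than the row count, the extra lines are likely repeated items rather than additional genes.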

I'm not sure how to retrieve whether the data is trending yet. What do we look for on that?


eweitz avatar eweitz commented on May 26, 2024

Nice, @gabedemattos, this is close! Your query returns the Wikidata item label, but we'll want the Wikipedia article title.

For example, the first result of that query is Q13566093, which has Wikidata item label "C2" and Wikipedia article title "Complement component 2".

> how to retrieve whether the data is trending

Your query will likely be plugged into wikipedia_trends/generate_gene_page_map.py, replacing the web scraping approach there. The 58k figure you note is double the roughly 29k known human genes, so we'll refine your query's results and output the product as wikipedia_trends/gene_page_map.tsv.

The Wikidata query approach will be more correct and faster. Our web scraping approach is ~99% correct in mapping gene symbols to Wikipedia page names and, to my understanding, takes a few hours. The Wikidata approach is theoretically 100% correct and would likely take less than a minute. The higher correctness is the main benefit.

Your refined gene_page_map.tsv will then be automatically picked up in wikipedia_trends/process_trends.py, which determines how each gene is trending in Wikipedia page views.

@snwessel, does that sound reasonable?


snwessel avatar snwessel commented on May 26, 2024

That sounds great, and I'm glad to hear that this approach is working out so well!

The only thing I would add is that when you output the results to gene_page_map.tsv, you'll need to convert any spaces in the title to underscores to match what the Wikipedia trends files use. This converts a page title like "Complement component 2" to the string used at the end of its corresponding Wikipedia page URL: https://en.wikipedia.org/wiki/Complement_component_2

You can do this in Python with .replace(" ", "_")
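As a minimal sketch of that conversion (the function name title_to_page_name is just an illustrative choice, not something from the repo):

```python
def title_to_page_name(title: str) -> str:
    """Convert a Wikipedia article title to the form used at the end of its URL."""
    return title.replace(" ", "_")


page_name = title_to_page_name("Complement component 2")
print(page_name)
print("https://en.wikipedia.org/wiki/" + page_name)
```

This only handles spaces. Whether other characters need any special treatment depends on the format of the trends files.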

Also, once you have an updated wikipedia_trends/gene_page_map.tsv, feel free to rerun the process_trends.py script locally to check that the results still look reasonable. The script should only take about 10-15 minutes to run.


snwessel avatar snwessel commented on May 26, 2024

Also @gabedemattos let me know if you want to pair program on any of this, or if it would be helpful for me to walk you through the existing code. I would be happy to meet anytime!


gabelmattos avatar gabelmattos commented on May 26, 2024

Sure. I did some quick experiments using PySpark to process trends on a private cluster. Although the processing is much faster, running the cluster adds overhead, so I'll investigate a few other tools that allow parallel API fetching from MediaWiki. That still comes at the cost of higher network usage: about 200 MB if we fetch all of the human genes in one go. If it runs every day, it consumes about 6 GB per month. Would you please let me know if this bandwidth is feasible?
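The arithmetic behind that monthly figure, assuming one run per day over a 30-day month and decimal gigabytes:

```python
MB_PER_RUN = 200      # approximate size of one full fetch, from the estimate above
RUNS_PER_MONTH = 30   # assumption: one run per day, 30-day month

monthly_mb = MB_PER_RUN * RUNS_PER_MONTH
monthly_gb = monthly_mb / 1000  # decimal gigabytes
print(f"{monthly_gb:g} GB per month")
```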

There's a problem with the SPARQL query. By adding the title, it returns about 11k results.

@snwessel Thanks! I'll schedule a meeting for next week.


eweitz avatar eweitz commented on May 26, 2024

> 6 GB per month. Would you please let me know if this bandwidth is feasible?

Yes. The usage limits for GitHub Actions list no bandwidth limit.

> There's a problem with the SPARQL query. By adding the title, it returns about 11k results.

That actually sounds correct! Per https://en.wikipedia.org/wiki/Gene_Wiki#Number_of_gene_articles,

"The human genome contains an estimated 20,000–25,000 protein-coding genes.[5] The goal of the Gene Wiki project is to create seed articles for every notable human gene, that is, every gene whose function has been assigned in the peer-reviewed scientific literature. Approximately half of human genes have assigned function, therefore the total number of articles seeded by the Gene Wiki project would be expected to be in the range of 10,000–15,000. To date, approximately 11,000 articles have been created or augmented to include Gene Wiki project content.[6]"


gabelmattos avatar gabelmattos commented on May 26, 2024

@eweitz I don't have permission to create a branch, but here's the Python implementation of the SPARQL code, which is far faster than scraping.
It requires the SPARQLWrapper and pandas libraries.

It also requires renaming the column, either there or in process_trends.
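The attached implementation is not reproduced in this thread. As a rough sketch of what such a script might look like, the following runs the query from earlier in the thread against the public Wikidata query endpoint and writes tab-separated output. It deliberately swaps in Python's standard library for SPARQLWrapper and pandas so that it has no dependencies. The column names, the User-Agent string, and the use of the item label as the gene symbol are assumptions for illustration, and the label language is fixed to "en" because [AUTO_LANGUAGE] is a placeholder for the query service's web interface.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT DISTINCT ?item ?ncbi_gene ?itemLabel ?title
WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
  ?item wdt:P351 ?ncbi_gene ;
        wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
"""


def fetch_bindings(query: str = QUERY) -> list:
    """Run the query against the Wikidata endpoint (makes a network call)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "gene-page-map-sketch/0.1",  # hypothetical identifier
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def bindings_to_tsv(bindings: list) -> str:
    """Convert SPARQL JSON result bindings to TSV text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["symbol", "page_title"])  # hypothetical column names
    for binding in bindings:
        symbol = binding["itemLabel"]["value"]  # assumes the label is the symbol
        page_title = binding["title"]["value"].replace(" ", "_")
        writer.writerow([symbol, page_title])
    return out.getvalue()
```

Calling bindings_to_tsv(fetch_bindings()) and writing the result to wikipedia_trends/gene_page_map.tsv would then produce the map, with the header row adjusted to whatever process_trends expects.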


eweitz avatar eweitz commented on May 26, 2024

Looking good! I invited you to this repo soon after our first chat. You should see the invitation email from GitHub in your inbox from around then.

Once you accept that, you can create a branch or commit code to this repo however you like. That will also ease declaring those dependencies in requirements.txt.


gabelmattos avatar gabelmattos commented on May 26, 2024

Closing the issue.

