softwareimpacthackathon2023_institutionaloss's People

Contributors

bandrow, dtkaczyk, jmelot, jrhoads, kaitlynhair, schatzopoulos, sfisher

softwareimpacthackathon2023_institutionaloss's Issues

Write single script that runs the `by_name` and `human_curated` pipelines end to end

Right now this is split across two scripts. It would be nice to have it all in one script that takes Anita's spreadsheet as input and outputs the final set of software to ROR links in the form consumed by consolidate_links.py. I think it would also be nice to convert this to python, now that we are out of hackathon mode, for the sake of consistency (and we could use #22 here and in #20). Would you be interested in working on this @sfisher ?

Consider integration with Wikidata

Wikidata has (incomplete) information about research software, research organizations as well as researchers, research publications and a number of other things that may be relevant here, so integrating that data (or enriching it with some of yours) may be worth a thought.

For some pointers, see

Write reusable python function to link DOIs and PMIDs to author affiliation RORs

Now that we're out of hackathon mode, I think it would be useful to refactor our code so we can call a single script per data source to do the end to end data retrieval for that source (see #16).

Kaitlyn wrote several very helpful scripts to map DOIs to author affiliations (see resources/rrid_dataset_mapped_to_openalex.R and resources/joss_dataset_mapped_to_openalex.R). I think it would be useful to refactor these a little bit such that we had a reusable python function like so:

def work_to_affiliation_rors(ident: str, ident_type: str) -> list:
    """
    Given a work identifier and the type of that identifier (one of 'doi' or 'pmid'), 
    retrieves the ROR ids of the author affiliations for that work.
    :param ident: Identifier for a work
    :param ident_type: Type of that identifier (either 'doi' or 'pmid')
    :return: List of ROR ids corresponding to author affiliations of that work 
    """
    pass

We could then reuse this function in the various data pipelines that depend on this capability.
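
A minimal sketch of how the proposed function might be filled in, assuming the OpenAlex REST API serves single works at `https://api.openalex.org/works/doi:<doi>` and `https://api.openalex.org/works/pmid:<pmid>`, and that each returned work lists ROR ids under `authorships[].institutions[].ror`. The helper `extract_affiliation_rors` is a hypothetical name introduced here so the parsing step can be exercised without network access. Retries, rate limiting and handling of works missing from OpenAlex are left out.

```python
import json
import urllib.parse
import urllib.request

# Assumed OpenAlex endpoint for looking up a single work by external id.
OPENALEX_WORKS_URL = "https://api.openalex.org/works/"


def extract_affiliation_rors(work: dict) -> list:
    """Collect distinct ROR ids, in order of appearance, from an OpenAlex-style
    work record (hypothetical helper, not part of the existing repo)."""
    rors = []
    for authorship in work.get("authorships") or []:
        for institution in authorship.get("institutions") or []:
            ror = institution.get("ror")
            if ror and ror not in rors:
                rors.append(ror)
    return rors


def work_to_affiliation_rors(ident: str, ident_type: str) -> list:
    """Retrieve the ROR ids of the author affiliations for a DOI or PMID."""
    if ident_type not in ("doi", "pmid"):
        raise ValueError(f"ident_type must be 'doi' or 'pmid', got {ident_type!r}")
    # Keep '/' unescaped because DOIs contain it; escape everything else.
    url = f"{OPENALEX_WORKS_URL}{ident_type}:{urllib.parse.quote(ident, safe='/')}"
    with urllib.request.urlopen(url, timeout=30) as response:
        work = json.load(response)
    return extract_affiliation_rors(work)
```

Splitting the parsing from the HTTP call keeps the part most likely to change (the response shape) testable offline.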

@kaitlynhair I would be grateful for your thoughts here. I am a little torn - I want to make sure everyone feels comfortable continuing to contribute if they like and I think you said you currently prefer R, but at the same time I think we should converge on a single language for our core data pipelines (this probably matters less for data analysis scripts we might use in the paper, for example), and I think python is something at least most people on the team have some familiarity with. What do you think - and/or, would you be interested in taking this on?

Write reusable python function that links institution names to RORs

We use this capability in at least two places - #20 and #22. I think it would be nice if we had a reusable python function that we could use to link institution names to their "best" ROR id (if available) using OpenAlex, e.g. something like

def get_best_institution_ror(institution_name: str) -> str:
    """
    Given an institution name, try to find the ROR of its best match. Return the ROR id if
    available, else None
    :param institution_name: Institution name (e.g. "Massachusetts Institute of Technology")
    :return: ROR id of best match (e.g. https://ror.org/042nb2s44)
    """
    pass
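
One possible way to fill in this stub, sketched under the assumption that the OpenAlex institutions endpoint (`https://api.openalex.org/institutions?search=<name>`) returns matches ordered by relevance, each with a `ror` field. Here "best" simply means the top-ranked result that has a ROR id, which is an assumption rather than an agreed definition. `pick_first_ror` is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed OpenAlex endpoint for searching institutions by name.
OPENALEX_INSTITUTIONS_URL = "https://api.openalex.org/institutions"


def pick_first_ror(results: list) -> Optional[str]:
    """Return the ROR id of the first result that has one, else None
    (hypothetical helper; assumes results are already ranked best-first)."""
    for institution in results:
        ror = institution.get("ror")
        if ror:
            return ror
    return None


def get_best_institution_ror(institution_name: str) -> Optional[str]:
    """Search OpenAlex for an institution name and return the top match's ROR id."""
    query = urllib.parse.urlencode({"search": institution_name})
    url = f"{OPENALEX_INSTITUTIONS_URL}?{query}"
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return pick_first_ror(payload.get("results") or [])
```

The return annotation is `Optional[str]` because the function returns None when nothing matches.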

See also the related thinking in #17

Would you be interested in taking this on @sfisher ?

Figure out how to select "best ROR" for github people/organizations

We matched github repo owners from ORCA to ROR using urls, with the results written to orca_org_rors.json. The data needs to be cleaned up further, though - many orgs have multiple RORs. We should figure out how to select the "best" ROR. Ideas:

  • [Dominika] Use institutional hierarchy to e.g. choose org highest in the hierarchy
  • Use string similarity between github org name and ROR org name (would only work for github organizations, though, not users)
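
The string-similarity idea could be sketched roughly as follows, using `difflib.SequenceMatcher` from the standard library. The shape of the input (a mapping from candidate ROR id to ROR organization name) and the `min_score` threshold of 0.6 are assumptions for illustration; the actual layout of orca_org_rors.json may differ.

```python
from difflib import SequenceMatcher
from typing import Optional


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names, in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def pick_best_ror(github_org_name: str, candidates: dict,
                  min_score: float = 0.6) -> Optional[str]:
    """Pick the candidate ROR whose organization name is most similar to the
    github org name. `candidates` maps ROR id -> ROR organization name
    (hypothetical input shape). Returns None if no candidate reaches min_score."""
    best_ror, best_score = None, 0.0
    for ror, ror_name in candidates.items():
        score = name_similarity(github_org_name, ror_name)
        if score > best_score:  # ties keep the earlier candidate
            best_ror, best_score = ror, score
    return best_ror if best_score >= min_score else None
```

A threshold keeps the function from returning a poor match when none of the candidate names resemble the github name; the right value would need to be tuned against curated data.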

Create human-curated evaluation set

A simple and useful version of this would be to randomly sample 100 distinct repos returned by each method and manually track down all their contributors' organizations. The results could be recorded in a csv with four columns: repo name, contributor github username (if available), contributor organization, and contributor organization's ROR id (if available). We could then use this data to evaluate the effectiveness of each method.
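
A rough sketch of the sampling step and the csv layout described above. The column names and the `repos_by_method` input (method name mapped to the repos it returned) are assumptions; the fixed seed is only there to make the sample reproducible.

```python
import csv
import random

# Assumed column names for the four columns described in this issue.
COLUMNS = [
    "repo_name",
    "contributor_github_username",
    "contributor_organization",
    "contributor_organization_ror",
]


def sample_repos(repos_by_method: dict, n: int = 100, seed: int = 0) -> dict:
    """Randomly sample up to n distinct repos for each method."""
    rng = random.Random(seed)
    sampled = {}
    for method, repos in repos_by_method.items():
        distinct = sorted(set(repos))  # sort so the sample depends only on the seed
        sampled[method] = rng.sample(distinct, min(n, len(distinct)))
    return sampled


def write_template(sampled: dict, out) -> None:
    """Write a csv with one starter row per sampled repo; curators then fill in
    (and duplicate) rows, one per contributor."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for repo in sorted({r for repos in sampled.values() for r in repos}):
        writer.writerow({"repo_name": repo})  # other columns are left empty
```

Repos returned by more than one method appear once in the template, so each only needs to be curated once.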

If anyone is interested in working on this but needs help putting the data together or has other questions, please reply to this issue!

Create single script per link extraction method

Right now, many of our data pipelines involve chaining together several scripts. For the sake of easier replication/reuse, we should have a single script per data pipeline that reads raw inputs and outputs software to ror links, as well as whatever other information the pipeline is capable of outputting. Tracking this in more granular issues under:
