jmelot / softwareimpacthackathon2023_institutionaloss
Linking open-source software to research organizations
License: MIT License
In some cases, a software mention might be followed by multiple formal citations of the form [number1, number2] (e.g. [10,14]). Currently, in such a situation the script extracts only the first number. It would be good to modify it so that all numbers are extracted and mapped to ROR IDs.
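A minimal regex sketch of the multi-number extraction (the function name `extract_citation_numbers` is hypothetical, not a name from this repo):

```python
import re

def extract_citation_numbers(text: str) -> list[int]:
    """Extract all citation numbers from bracketed markers like [10,14] or [3]."""
    numbers = []
    # Match bracketed groups of comma-separated integers, e.g. "[10, 14]".
    for match in re.finditer(r"\[(\d+(?:\s*,\s*\d+)*)\]", text):
        numbers.extend(int(n) for n in match.group(1).split(","))
    return numbers

print(extract_citation_numbers("as shown in [10,14] and [3]"))  # [10, 14, 3]
```

Each extracted number could then be looked up in the paper's reference list and mapped to a ROR ID as the script already does for the first number.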
This may help us identify common spurious linkages. If you would like to work on this but need help getting started, please comment on this issue!
We provide an open dataset for different impact indicators on Zenodo: 10.5281/zenodo.4386934.
We could also consider different impact indicators (like the popularity or influence of publications) beyond citation counts.
relevant publication: https://dl.acm.org/doi/abs/10.1145/3442442.3451369
We have a description of the method in TheStackDataset.md, but I don't think we have any code committed to this repo that allows us to run this end to end. Given a directory of JSONL files from The Stack, can we write a python script that does the affiliation extraction and (using #23) outputs mappings between RORs and software? Would you be interested in contributing this @dtkaczyk ?
Right now this is partially implemented per the discussion in the README. Use #17 when available to link DOIs to ROR ids. Would you be interested in taking this on @schatzopoulos ?
Right now this is implemented as two scripts. It would be nice to have this all in one script that takes Anita's spreadsheet as input and outputs the final set of software-to-ROR links in the form consumed by consolidate_links.py. I think it would also be nice to convert this to python, now that we are out of hackathon mode, for the sake of consistency (and we could use #22 here and in #20). Would you be interested in working on this @sfisher ?
Wikidata has (incomplete) information about research software, research organizations, researchers, research publications, and a number of other things that may be relevant here, so integrating that data (or enriching it with some of yours) may be worth a thought.
For some pointers, see
Now that we're out of hackathon mode, I think it would be useful to refactor our code so we can call a single script per data source to do the end to end data retrieval for that source (see #16).
Kaitlyn wrote several very helpful scripts to map DOIs to author affiliations (see resources/rrid_dataset_mapped_to_openalex.R and resources/joss_dataset_mapped_to_openalex.R). I think it would be useful to refactor these a little bit such that we had a reusable python function like so:
def work_to_affiliation_rors(ident: str, ident_type: str) -> list:
"""
Given a work identifier and the type of that identifier (one of 'doi' or 'pmid'),
retrieves the ROR ids of the author affiliations for that work.
:param ident: Identifier for a work
:param ident_type: Type of that identifier (either 'doi' or 'pmid')
:return: List of ROR ids corresponding to author affiliations of that work
"""
pass
We could then reuse this function in the various data pipelines that depend on this capability.
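One possible sketch of this function, assuming the OpenAlex works endpoint (which accepts `doi:` and `pmid:` prefixed identifiers and includes ROR ids in its `authorships` field). The parsing is split into a pure helper, `rors_from_work`, so it can be tested without network access; both names are illustrative, not committed code:

```python
import json
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works"

def rors_from_work(work: dict) -> list:
    """Collect the distinct ROR ids from an OpenAlex work record's authorships."""
    rors = set()
    for authorship in work.get("authorships", []):
        for institution in authorship.get("institutions", []):
            if institution.get("ror"):
                rors.add(institution["ror"])
    return sorted(rors)

def work_to_affiliation_rors(ident: str, ident_type: str) -> list:
    """
    Given a work identifier and the type of that identifier (one of 'doi' or 'pmid'),
    retrieves the ROR ids of the author affiliations for that work.
    """
    if ident_type not in ("doi", "pmid"):
        raise ValueError(f"Unsupported identifier type: {ident_type}")
    url = f"{OPENALEX_WORKS}/{ident_type}:{ident}"
    with urllib.request.urlopen(url) as resp:
        return rors_from_work(json.load(resp))
```

Deduplicating via a set keeps the output stable when several co-authors share an affiliation.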
@kaitlynhair I would be grateful for your thoughts here. I am a little torn: I want to make sure everyone feels comfortable continuing to contribute if they like, and I think you said you currently prefer R. At the same time, I think we should converge on a single language for our core data pipelines (this probably matters less for the data analysis scripts we might use in the paper, for example), and python is something at least most people on the team have some familiarity with. What do you think? And/or, would you be interested in taking this on?
Right now this is the work of get_dois_and_repos_from_joss.py and joss_dataset_mapped_to_openalex.R. We should refactor so that we have a single script that takes JOSS data as input and outputs links between software and RORs. Depends on #17
Some ORCID profiles contain github links, which we could use to link github contributors to institutions. (Inspired by @Daniel-Mietchen)
We use this capability in at least two places - #20 and #22. I think it would be nice if we had a reusable python function that we could use to link institution names to their "best" ROR id (if available) using OpenAlex, e.g. something like
def get_best_institution_ror(institution_name: str) -> str:
"""
Given an institution name, try to find the ROR of its best match. Return the ROR id if
available, else None
:param institution_name: Institution name (e.g. "Massachusetts Institute of Technology")
:return: ROR id of best match (e.g. https://ror.org/042nb2s44)
"""
pass
See also the related thinking in #17
Would you be interested in taking this on @sfisher ?
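A minimal sketch of what this could look like, assuming the OpenAlex institutions search endpoint (which returns matches ranked by relevance, with a `ror` field on each result). The result-picking step is factored into `best_ror_from_results`, a hypothetical helper, so it can be tested without network access:

```python
import json
import urllib.parse
import urllib.request

def best_ror_from_results(results: list):
    """Return the ROR id of the first (best-ranked) result that has one, else None."""
    for institution in results:
        if institution.get("ror"):
            return institution["ror"]
    return None

def get_best_institution_ror(institution_name: str):
    """
    Given an institution name, try to find the ROR of its best match. Return the
    ROR id if available, else None.
    """
    query = urllib.parse.urlencode({"search": institution_name, "per-page": 1})
    url = f"https://api.openalex.org/institutions?{query}"
    with urllib.request.urlopen(url) as resp:
        return best_ror_from_results(json.load(resp).get("results", []))
```

Trusting OpenAlex's top-ranked result is the simplest policy; a stricter version could also compare the returned display name against the query before accepting the match.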
We matched github repo owners from ORCA to ROR using urls, with the results written to orca_org_rors.json. The data needs to be cleaned up further, though: many orgs have multiple RORs. We should figure out how to select the "best" ROR. Ideas:
We have a few different scripts glued together to execute this pipeline right now. We should have a single script that takes the CZI data as input and outputs mappings from software to RORs.
Depends on #17
Once #24 is done. Would you be interested in taking this on @schatzopoulos ? Happy to help point you to the relevant parts of consolidate_links.py if needed!
A simple and useful version of this would be to randomly sample 100 distinct repos returned by each method and manually track down all their contributors' organizations. The results could be recorded in a csv with four columns: repo name, contributor github username (if available), contributor organization, and contributor organization's ROR id (if available). We could then use this data to evaluate the effectiveness of each method.
If anyone is interested in working on this but needs help putting the data together or has other questions, please reply to this issue!
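The sampling and CSV-template step described above could be sketched like this, one output file per method (the function name and signature are illustrative; a fixed seed keeps the sample reproducible):

```python
import csv
import random

def write_evaluation_sample(repos: list, out_path: str, n: int = 100, seed: int = 0) -> None:
    """
    Randomly sample up to n distinct repos returned by one linking method and
    write an evaluation template CSV; the last three columns are filled in by
    hand while manually tracking down each repo's contributors.
    """
    distinct = sorted(set(repos))
    sample = random.Random(seed).sample(distinct, min(n, len(distinct)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["repo_name", "contributor_github_username",
                        "contributor_organization", "contributor_organization_ror"])
        for repo in sample:
            # One row per repo to start; annotators can duplicate rows for
            # repos with multiple contributors.
            writer.writerow([repo, "", "", ""])
```

Comparing the filled-in ROR column against each method's predicted RORs would then give a per-method accuracy estimate.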
Right now, many of our data pipelines involve chaining together several scripts. For the sake of easier replication/reuse, we should have a single script per data pipeline that reads raw inputs and outputs software to ror links, as well as whatever other information the pipeline is capable of outputting. Tracking this in more granular issues under: