jmelot / softwareimpacthackathon2023_institutionaloss
Linking open-source software to research organizations
License: MIT License
In some cases, a software mention might be followed by multiple formal citations of the form [number1, number2] (e.g. [10,14]). Currently, in such a situation the script extracts only the first number. It would be good to modify it so that all numbers are extracted and mapped to ROR IDs.
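A minimal regex sketch of the multi-number extraction (the function name `extract_citation_numbers` is hypothetical, not a name from this repo):

```python
import re

def extract_citation_numbers(text: str) -> list[int]:
    """Extract all citation numbers from bracketed markers like [10,14] or [3]."""
    numbers = []
    # Match bracketed groups of comma-separated integers, e.g. "[10, 14]".
    for match in re.finditer(r"\[(\d+(?:\s*,\s*\d+)*)\]", text):
        numbers.extend(int(n) for n in match.group(1).split(","))
    return numbers

print(extract_citation_numbers("as shown in [10,14] and [3]"))  # [10, 14, 3]
```

Each extracted number could then be looked up in the paper's reference list and mapped to a ROR ID as the script already does for the first number.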
This may help us identify common spurious linkages. If you would like to work on this but need help getting started, please comment on this issue!
We provide an open dataset for different impact indicators on Zenodo: 10.5281/zenodo.4386934.
We could also consider different impact indicators (like the popularity or influence of publications) beyond citation counts.
relevant publication: https://dl.acm.org/doi/abs/10.1145/3442442.3451369
We have a description of the method in TheStackDataset.md, but I don't think we have any code committed to this repo that allows us to run this end to end. Given a directory of JSONL files from The Stack, can we write a python script that does the affiliation extraction and (using #23) outputs mappings between RORs and software? Would you be interested in contributing this @dtkaczyk ?
Right now this is partially implemented per the discussion in the README. Use #17 when available to link DOIs to ROR ids. Would you be interested in taking this on @schatzopoulos ?
Right now this is implemented as two scripts. It would be nice to have this all in one script that takes Anita's spreadsheet as input and outputs the final set of software-to-ROR links in the form consumed by consolidate_links.py. I think it would also be nice to convert this to python, now that we are out of hackathon mode, for the sake of consistency (and we could use #22 here and in #20). Would you be interested in working on this @sfisher ?
Wikidata has (incomplete) information about research software, research organizations, researchers, research publications, and a number of other things that may be relevant here, so integrating that data (or enriching it with some of yours) may be worth a thought.
For some pointers, see
Now that we're out of hackathon mode, I think it would be useful to refactor our code so we can call a single script per data source to do the end to end data retrieval for that source (see #16).
Kaitlyn wrote several very helpful scripts to map DOIs to author affiliations (see resources/rrid_dataset_mapped_to_openalex.R and resources/joss_dataset_mapped_to_openalex.R). I think it would be useful to refactor these a little bit such that we had a reusable python function like so:
def work_to_affiliation_rors(ident: str, ident_type: str) -> list:
"""
Given a work identifier and the type of that identifier (one of 'doi' or 'pmid'),
retrieves the ROR ids of the author affiliations for that work.
:param ident: Identifier for a work
:param ident_type: Type of that identifier (either 'doi' or 'pmid')
:return: List of ROR ids corresponding to author affiliations of that work
"""
pass
We could then reuse this function in the various data pipelines that depend on this capability.
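One possible sketch of this function, assuming the OpenAlex works endpoint (which accepts `doi:` and `pmid:` prefixed identifiers and includes ROR ids in its `authorships` field). The parsing is split into a pure helper, `rors_from_work`, so it can be tested without network access; both names are illustrative, not committed code:

```python
import json
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works"

def rors_from_work(work: dict) -> list:
    """Collect the distinct ROR ids from an OpenAlex work record's authorships."""
    rors = set()
    for authorship in work.get("authorships", []):
        for institution in authorship.get("institutions", []):
            if institution.get("ror"):
                rors.add(institution["ror"])
    return sorted(rors)

def work_to_affiliation_rors(ident: str, ident_type: str) -> list:
    """
    Given a work identifier and the type of that identifier (one of 'doi' or 'pmid'),
    retrieves the ROR ids of the author affiliations for that work.
    """
    if ident_type not in ("doi", "pmid"):
        raise ValueError(f"Unsupported identifier type: {ident_type}")
    url = f"{OPENALEX_WORKS}/{ident_type}:{ident}"
    with urllib.request.urlopen(url) as resp:
        return rors_from_work(json.load(resp))
```

Deduplicating via a set keeps the output stable when several co-authors share an affiliation.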
@kaitlynhair I would be grateful for your thoughts here. I am a little torn: I want to make sure everyone feels comfortable continuing to contribute if they like, and I think you said you currently prefer R. At the same time, I think we should converge on a single language for our core data pipelines (this probably matters less for the data analysis scripts we might use in the paper, for example), and python is something at least most people on the team have some familiarity with. What do you think? And/or, would you be interested in taking this on?
Right now this is the work of get_dois_and_repos_from_joss.py and joss_dataset_mapped_to_openalex.R. We should refactor so that we have a single script that takes JOSS data as input and outputs links between software and RORs. Depends on #17
Some ORCID profiles contain github links, which we could use to link github contributors to institutions. (Inspired by @Daniel-Mietchen)
We use this capability in at least two places - #20 and #22. I think it would be nice if we had a reusable python function that we could use to link institution names to their "best" ROR id (if available) using OpenAlex, e.g. something like
def get_best_institution_ror(institution_name: str) -> str:
"""
Given an institution name, try to find the ROR of its best match. Return the ROR id if
available, else None
:param institution_name: Institution name (e.g. "Massachusetts Institute of Technology")
:return: ROR id of best match (e.g. https://ror.org/042nb2s44)
"""
pass
See also the related thinking in #17
Would you be interested in taking this on @sfisher ?
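A minimal sketch of what this could look like, assuming the OpenAlex institutions search endpoint (which returns matches ranked by relevance, with a `ror` field on each result). The result-picking step is factored into `best_ror_from_results`, a hypothetical helper, so it can be tested without network access:

```python
import json
import urllib.parse
import urllib.request

def best_ror_from_results(results: list):
    """Return the ROR id of the first (best-ranked) result that has one, else None."""
    for institution in results:
        if institution.get("ror"):
            return institution["ror"]
    return None

def get_best_institution_ror(institution_name: str):
    """
    Given an institution name, try to find the ROR of its best match. Return the
    ROR id if available, else None.
    """
    query = urllib.parse.urlencode({"search": institution_name, "per-page": 1})
    url = f"https://api.openalex.org/institutions?{query}"
    with urllib.request.urlopen(url) as resp:
        return best_ror_from_results(json.load(resp).get("results", []))
```

Trusting OpenAlex's top-ranked result is the simplest policy; a stricter version could also compare the returned display name against the query before accepting the match.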
We matched github repo owners from ORCA to ROR using urls, with the results written to orca_org_rors.json. The data needs to be cleaned up further, though: many orgs have multiple RORs. We should figure out how to select the "best" ROR. Ideas:
We have a few different scripts glued together to execute this pipeline right now. We should have a single script that takes the CZI data as input and outputs mappings from software to RORs.
Depends on #17
Once #24 is done. Would you be interested in taking this on @schatzopoulos ? Happy to help point you to the relevant parts of consolidate_links.py if needed!
A simple and useful version of this would be to randomly sample 100 distinct repos returned by each method and manually track down all their contributors' organizations. The results could be recorded in a csv with four columns: repo name, contributor github username (if available), contributor organization, and contributor organization's ROR id (if available). We could then use this data to evaluate the effectiveness of each method.
If anyone is interested in working on this but needs help putting the data together or has other questions, please reply to this issue!
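The sampling and CSV-template step described above could be sketched like this, one output file per method (the function name and signature are illustrative; a fixed seed keeps the sample reproducible):

```python
import csv
import random

def write_evaluation_sample(repos: list, out_path: str, n: int = 100, seed: int = 0) -> None:
    """
    Randomly sample up to n distinct repos returned by one linking method and
    write an evaluation template CSV; the last three columns are filled in by
    hand while manually tracking down each repo's contributors.
    """
    distinct = sorted(set(repos))
    sample = random.Random(seed).sample(distinct, min(n, len(distinct)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["repo_name", "contributor_github_username",
                        "contributor_organization", "contributor_organization_ror"])
        for repo in sample:
            # One row per repo to start; annotators can duplicate rows for
            # repos with multiple contributors.
            writer.writerow([repo, "", "", ""])
```

Comparing the filled-in ROR column against each method's predicted RORs would then give a per-method accuracy estimate.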
Right now, many of our data pipelines involve chaining together several scripts. For the sake of easier replication/reuse, we should have a single script per data pipeline that reads raw inputs and outputs software to ror links, as well as whatever other information the pipeline is capable of outputting. Tracking this in more granular issues under: