asreview / paper-megameta-postprocessing-screeningresults

Home Page: https://www.asreview.ai

License: MIT License

R 79.87% Jupyter Notebook 14.93% Python 5.20%
asreview mega-meta systematic-review deduplication


Scripts for Post-Processing Mega-Meta Results

The repository is part of the so-called Mega-Meta study on reviewing factors contributing to substance use, anxiety, and depressive disorders. The study protocol has been preregistered at PROSPERO. The procedure for obtaining the search terms, the exact search query, and the selection of key papers by expert consensus can be found on the Open Science Framework.

The screening was conducted in the software ASReview (Van de Schoot et al., 2020), using the protocol described in Hofstee et al. (2021). The server installation is described in Melnikov (2021), and the training of the hyperparameters for the CNN model is described by Teijema et al. (2021). The data can be found on DANS [LINK NEEDED].

The current repository contains the post-processing scripts to:

  1. Merge the three output files after screening in ASReview;
  2. Obtain missing DOIs;
  3. Apply another round of deduplication (a first round of deduplication was applied before the screening started);
  4. Deal with noisy labels corrected in two rounds of quality checks.

The scripts in the current repository result in a single dataset that can be used for future meta-analyses. The dataset itself is available on DANS [NEEDS LINK].

Datasets

Test Data

The /data folder contains test files that can be used to test the pipeline.

NOTE: Before using these test files, make sure that the empirical data is not saved in the /data folder, because the next step will overwrite those files.
  1. Open the pre-processing.Rproject in RStudio;
  2. Open scripts/change_test_file_names.R and run the script. The test files will now have the same file names as the empirical data (a sketch of this renaming follows the list);
  3. Continue with Running the complete pipeline.
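
For illustration, a minimal sketch of the kind of renaming change_test_file_names.R performs. The test file names below are placeholders, not the repository's actual names; the real mapping lives in the script itself:

# Sketch only: placeholder test file names mapped onto the empirical names.
test_files <- c(
  "test-anxiety.xlsx"    = "anxiety-screening-CNN-output.xlsx",
  "test-depression.xlsx" = "depression-screening-CNN-output.xlsx",
  "test-substance.xlsx"  = "substance-screening-CNN-output.xlsx"
)
for (from in names(test_files)) {
  file.rename(file.path("data", from), file.path("data", test_files[[from]]))
}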

Results of the test data

To check whether the pipeline worked correctly on the test data, verify the following values in the output (a sketch for checking the deduplication count programmatically follows this list):

  • Within the crossref_doi_retrieval.ipynb script, 33 of the 42 DOIs should be retrieved.
  • After two rounds of deduplication in master_script_deduplication.R the total number of relevant papers (sum of the values in the composite_label column) should be 21.
  • After running the quality_check function in master_script_quality_check.R the number of changed labels should be:
    • Quality check 1: 7
    • Quality check 2: 6
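
These checks can also be scripted. A minimal sketch for the deduplication figure, assuming the output file and column names described later in this README:

library(readxl)

# Sketch: after running the pipeline on the test data, the sum of the
# composite_label column in the deduplicated output should equal 21.
df <- read_xlsx("output/megameta_asreview_deduplicated.xlsx")
stopifnot(sum(df$composite_label, na.rm = TRUE) == 21)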

Empirical Data

The empirical data is available on DANS [NEEDS LINK]. Request access, download the files, and add the required data to the /data folder.

Data File Names

The following nine datasets should be available in /data:

The three export datasets with the partly labelled data after screening in ASReview:

  • anxiety-screening-CNN-output.xlsx
  • depression-screening-CNN-output.xlsx
  • substance-screening-CNN-output.xlsx

The three datasets resulting from Quality Check 1:

  • anxiety-incorrectly-excluded-records.xlsx
  • depression-incorrectly-excluded-records.xlsx
  • substance-incorrectly-excluded-records.xlsx

The three datasets resulting from Quality Check 2:

  • anxiety-incorrectly-included-records
  • depression-incorrectly-included-records
  • substance-incorrectly-included-records

Requirements to get started

To get started:

  1. Open the pre-processing.Rproject in RStudio;
  2. Open scripts/master_script_merging_after_asreview.R;
  3. If necessary, install the required packages by uncommenting the relevant lines and running them (see the sketch after this list);
  4. Make sure that at least the following columns are present in the data:
    • title
    • abstract
    • included
    • year (may be spelled differently as this can be changed within crossref_doi_retrieval.ipynb)
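
A minimal sketch of such an install block; the exact package list lives in the commented lines of the script itself, and the packages named below are assumptions based on functions mentioned in this README:

# Sketch: install any missing packages; the list here is an assumption,
# see the commented lines in master_script_merging_after_asreview.R.
required <- c("readxl", "writexl", "dplyr", "tidyr", "stringr")
missing  <- setdiff(required, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)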

Running the complete pipeline

  1. Open the pre-processing.Rproject in RStudio and run scripts/master_script_merging_after_asreview.R to merge the datasets. At the end of the merging script, the file megameta_asreview_merged.xlsx is created and saved in /output.
  2. Run scripts/crossref_doi_retrieval.ipynb in Jupyter Notebook to retrieve the missing DOIs (you might need to install the package tqdm first: pip install tqdm). The output from the DOI retrieval is stored in /output as megameta_asreview_doi_retrieved.xlsx. Note: this step might take some time! To decrease the run time significantly, follow the steps in the Improving DOI retrieval speed section.
  3. For the deduplication part, return to the R project in RStudio and open and run scripts/master_script_deduplication.R. The result is stored in /output as megameta_asreview_deduplicated.xlsx.
  4. Two quality checks are performed. Manually change the labels
    1. of incorrectly excluded records to included;
    2. of incorrectly included records to excluded.
      The data that should be corrected is available on DANS. This step should add the following columns to the dataset (a sketch of the correction logic follows this list):
  • quality_check_1(0->1) (1, 2, 3, NA): This column indicates for which subjects a record was falsely excluded:
    • 1 = anxiety
    • 2 = depression
    • 3 = substance-abuse
  • quality_check_2(1->0) (1, 2, 3, NA): This column indicates for which subjects a record was falsely included:
    • 1 = anxiety
    • 2 = depression
    • 3 = substance-abuse
  • depression_included_corrected (0, 1, NA): Combining the information from the depression_included and quality_check columns, this column contains the inclusion/exclusion/not seen labels after correction.
  • substance_included_corrected (0, 1, NA): Combining the information from the substance_included and quality_check columns, this column contains the inclusion/exclusion/not seen labels after correction.
  • anxiety_included_corrected (0, 1, NA): Combining the information from the anxiety_included and quality_check columns, this column contains the inclusion/exclusion/not seen labels after correction.
  • composite_label_corrected (0, 1, NA): A column indicating whether a record was included in at least one of the corrected_subject columns: The results after taking the quality checks into account.
  5. OPTIONAL: Create ASReview plugin-ready data by running the script master_script_process_data_for_asreview_plugin.R. This script creates a new folder in the output folder, data_for_plugin, containing several versions of the dataset created in step 4. See Data for the ASReview plugin for more information.
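
To make the correction in step 4 concrete, here is a minimal sketch of how a corrected label column could be derived from the original label and the two quality-check columns. Column names follow the Result section below; the repository's actual implementation may differ:

library(dplyr)

# Sketch: derive anxiety_included_corrected from anxiety_included plus the
# quality-check columns, where the value 1 marks the anxiety subject.
df <- df %>%
  mutate(anxiety_included_corrected = case_when(
    `quality_check_1(0->1)` == 1 ~ 1,  # falsely excluded -> corrected to included
    `quality_check_2(1->0)` == 1 ~ 0,  # falsely included -> corrected to excluded
    TRUE ~ anxiety_included            # otherwise keep the original label
  ))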

Improving DOI retrieval speed

The speed of the DOI retrieval can be improved with the following steps:

  1. Split the dataset into smaller chunks by running the split_input_file.ipynb script. Within this script there is an option to set the number of chunks. If the records cannot be split evenly, the last chunk might be smaller than the others.
  2. For each chunk, create a copy of the chunk_0.py file and place it in the split folder. Rename chunk_0.py to chunk_1.py, chunk_2.py, etc., for each created chunk.
  3. Within each file, change script_number = "0" to script_number = "1", script_number = "2", etc.
  4. Run each chunk_*.py file in the split folder simultaneously from a separate console. The script stores the console output to a respective result_chunk_*.txt file.
  5. Use the second half of merge_files.ipynb to merge the results of the chunk_*.py scripts.
  6. The resulting file is stored in the same way as crossref_doi_retrieval.ipynb would store it.

After each chunk has been run, the split folder should contain each chunk_*.py script together with its result_chunk_*.txt file (see the "Split folder" screenshot in the repository).
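
The retrieval itself is implemented as a Python notebook in this repository. Purely as an illustration of what a single lookup does, here is a sketch in R that queries the public Crossref REST API for the best bibliographic match of one title; the helper function is hypothetical:

library(httr)
library(jsonlite)

# Sketch: ask the Crossref REST API (https://api.crossref.org/works) for the
# best bibliographic match of a record title and return its DOI, if any.
get_doi <- function(title) {
  resp <- GET("https://api.crossref.org/works",
              query = list(query.bibliographic = title, rows = 1))
  if (status_code(resp) != 200) return(NA_character_)
  items <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$message$items
  if (length(items) == 0) return(NA_character_)
  items$DOI[[1]]
}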

Deduplication strategy

Keeping in mind that deduplication is never perfect, scripts/master_script_deduplication.R contains a function to deduplicate the records in a very conservative way. The assumption is that it is better to miss duplicates in the data than to falsely deduplicate distinct records.

Deduplication within master_script_deduplication.R is therefore based on two rounds. The first round uses the digital object identifier (DOI) to identify duplicates. However, many DOIs are still missing, even after DOI retrieval, and in some cases the DOIs differ between otherwise seemingly identical records. Therefore, an extra round of deduplication is applied to the data. This conservative strategy was devised with the help of @bmkramer. The code used a deduplication script by @terrymyc as inspiration.

The exact strategy of the second deduplication round is as follows:

  1. Set all columns needed for deduplication (see below) to lowercase characters and remove any punctuation marks.
  2. Count duplicates identified using the conservative deduplication strategy. This strategy identifies duplicates based on:
  • Author
  • Title
  • Year
  • Journal or ISSN (if either the journal or the ISSN is an exact match, together with the above, the record is marked as a duplicate)
  3. Count duplicates identified using a less conservative deduplication strategy. This strategy identifies duplicates based on:
  • Author
  • Title
  • Year
  4. Deduplicate using the strategy from step 2.

The deduplication script also prints the number of identified duplicates for both the conservative strategy and the less conservative strategy based on only author, title, and year, so the impact of the two deduplication strategies can be compared. A sketch of the conservative matching logic follows.
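
A minimal sketch of the conservative matching round, assuming columns named author, title, year, journal, and issn; the actual implementation is in scripts/deduplicate_conservative.R:

library(stringr)

# Sketch: normalise the matching columns, then flag a record as a duplicate
# when author, title and year match AND either journal or issn also matches.
normalise <- function(x) str_remove_all(str_to_lower(x), "[[:punct:]]")

df$author_n  <- normalise(df$author)
df$title_n   <- normalise(df$title)
df$journal_n <- normalise(df$journal)

key_journal <- paste(df$author_n, df$title_n, df$year, df$journal_n)
key_issn    <- paste(df$author_n, df$title_n, df$year, df$issn)
is_duplicate <- duplicated(key_journal) | duplicated(key_issn)

cat("Conservative duplicates found:", sum(is_duplicate), "\n")
df_deduplicated <- df[!is_duplicate, ]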

Data for the ASReview plugin

The script master_script_process_data_for_asreview_plugin.R creates a new folder in the output folder, data_for_plugin, containing several versions of the dataset created in step 4 (a sketch of the filtering follows the list).

  1. megameta_asreview_partly_labelled: A dataset where a column called label_included is added, which is an exact copy of composite_label_corrected.
  2. megameta_asreview_only_potentially_relevant: A dataset with only those records that have a 1 in composite_label_corrected.
  3. megameta_asreview_potentially_relevant_depression: A dataset with only those records that have a 1 in depression_included_corrected.
  4. megameta_asreview_potentially_relevant_substance: A dataset with only those records that have a 1 in substance_included_corrected.
  5. megameta_asreview_potentially_relevant_anxiety: A dataset with only those records that have a 1 in anxiety_included_corrected.
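
A minimal sketch of how one of these subsets could be produced; the paths and the data frame name are assumptions, see the script itself for the actual implementation:

library(dplyr)
library(writexl)

# Sketch: keep only records marked relevant after correction and write them
# to the data_for_plugin folder (created if it does not exist yet).
dir.create("output/data_for_plugin", recursive = TRUE, showWarnings = FALSE)
relevant <- filter(df, composite_label_corrected == 1)
write_xlsx(relevant,
           "output/data_for_plugin/megameta_asreview_only_potentially_relevant.xlsx")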

[INSTRUCTIONS FOR PLUGIN?]

Post-processing functions

  • change_test_file_names.R - This script converts the filenames of the test files to the empirical data file names.
  • merge_datasets.R - This script contains a function to merge the datasets. A unique included column is added for each dataset before the merge.
  • composite_label.R - This script contains a function to create a column with the final inclusions.
  • print_information_datasets.R - This script contains a function to print information on the datasets.
  • identify_duplicates.R - This script contains a function to identify duplicate records in the dataset.
  • deduplicate_doi.R - This script contains a function to deduplicate the records, based on DOI, while maintaining all information.
  • deduplicate_conservative.R - This script contains a function to deduplicate the records in a conservative way, based on title, author, year, and journal/ISSN.

Result

The result of running all master scripts up until step 4 in this repository is the file output/megameta_asreview_quality_checked.xlsx. In this dataset, the following columns have been added:

  • index (1-165045): A simple indexing column running from 1 to 165045. Some numbers are not present because the corresponding records were removed during deduplication.
  • unique_record (0, 1, NA): Indicates whether the record has a unique DOI. This is NA when no DOI is present.
  • depression_included (0, 1, NA): A column indicating whether a record was included in depression.
  • anxiety_included (0, 1, NA): A column indicating whether a record was included in anxiety.
  • substance_included (0, 1, NA): A column indicating whether a record was included in substance abuse.
  • composite_label (0, 1, NA): A column indicating whether a record was included in at least one of the subjects.
  • quality_check_1(0->1) (1, 2, 3, NA): This column indicates for which subjects a record was falsely excluded:
    • 1 = anxiety
    • 2 = depression
    • 3 = substance-abuse
  • quality_check_2(1->0) (1, 2, 3, NA): This column indicates for which subjects a record was falsely included:
    • 1 = anxiety
    • 2 = depression
    • 3 = substance-abuse
  • depression_included_corrected (0, 1, NA): Combining the information from the depression_included and quality_check columns, this column contains the inclusion/exclusion/not seen labels after correction.
  • substance_included_corrected (0, 1, NA): Combining the information from the substance_included and quality_check columns, this column contains the inclusion/exclusion/not seen labels after correction.
  • anxiety_included_corrected (0, 1, NA): Combining the information from the anxiety_included and quality_check columns, this column contains the inclusion/exclusion/not seen labels after correction.
  • composite_label_corrected (0, 1, NA): A column indicating whether a record was included in at least one of the corrected_subject columns: The results after taking the quality checks into account.

For all columns containing only 0s, 1s, and NAs, a 0 indicates a negative (for example, excluded), while a 1 indicates a positive (for example, included). NA means Not Available.

Funding

This project is funded by a grant from the Centre for Urban Mental Health, University of Amsterdam, The Netherlands.

License

The content in this repository is published under the MIT license.

Contact

For any questions or remarks, please send an email to the ASReview-team or Marlies Brouwer.

Contributors

jteijema, lhofstee, rensvandeschoot, sagevdbrand

Issues

request for descriptive stats

I would very much like to obtain a table with descriptive statistics including:

Generic stats:

  • total number of records per subject area
  • missing information (abstracts, titles, DOI)
  • number of prior relevant/irrelevant papers used in the first phase
  • number of labelled records in the first phase (plus % relevant)
  • number of labelled records in the second phase (plus % relevant)

Quality stats:

  • number of irrelevant papers which appeared to be relevant after screening by a 2nd screener
  • number of relevant papers which appeared to be irrelevant after screening by a 2nd screener

Data for quality check 2 is incomplete

This issue is meant as an extra reminder that the data for quality check 2 (articles which have been incorrectly included) should be updated! Thus far we are working with the preliminary results.

This issue can be resolved when the final data for quality check 2 is available and the master-script is adapted to import this instead of the preliminary results.

Conservative deduplication does not run

It appears that there is an issue with the conservative deduplication strategy:
After the DOI retrieval in Python, loading the data back into R for the deduplication part added a few columns:

> # IMPORTING RESULTS
> ## from doi retrieval 
> df <- read_xlsx(paste0(OUTPUT_PATH, DOI_RETRIEVED_PATH))
New names:
* `` -> ...1

which caused a hiccup in the conservative deduplication part:

New names:
* ...1 -> ...6
New names:
* ...1 -> ...6
 Error: not compatible: 
not compatible: 
- Cols in y but not x: `...1`.
- Cols in x but not y: `...6`.

Run `rlang::last_error()` to see where the error occurred. 

This issue causes the conservative deduplication function to fail and therefore needs to be repaired.

create two datasets

Can you create two datasets as output:

  • one with all the information for the quality checks
  • one clean dataset which can be used for future studies

For the second dataset there should be five columns:

  • (ir)relevant for each of the three topic areas (output of the combined screening phases using ASReview)
  • misclassified (as part of the quality checks 1->0 or 0->1)
  • final label which can be used for future studies

In this second dataset, records should appear only once.

Quality checks unclear

While reading your impressive documentation, it remains unclear to me what you did in step 4 of the post-processing, 'Deal with noisy labels corrected in two rounds of quality checks'.
I cannot find a script with a similar name or an explanation.
Could you point me to where it is, or if not add it to the documentation?

add file with requirements

A file requirements.txt needs to be added containing a list of required R-packages including version information.

rlang returns an error

This code chunk returns an error:

# First pivot the title and doi columns
mismatch_included_no_source <- mir %>% 
  select(-contains("source")) %>%
  pivot_longer(cols = ends_with(c("title","doi")),
    names_to = c("intended_subject", ".value"),
    names_pattern = "(.+)_(.+)"
  )

error:

Error: `cols` must select at least one column.
Run `rlang::last_error()` to see where the error occurred.

solve duplicates

After merging the three datasets, it appeared there are still some duplicates in the dataset. This holds for relevant papers, irrelevant papers, and unseen papers. We would need a script that searches for more DOIs, for example in Crossref, so that we can apply another round of deduplication based on DOIs, plus a script for title matching.
