cross-care

TODO


Checkpoint 1 - Arxiv POC (Proof of Concept)

  • Finalize Keyword Collection

    • Cancer Types:
      • Awaiting a response from an oncology ontologist on appropriate vocabulary.
    • Normalization Terms / Other Diseases (in progress)
    • Cancer Drugs (in progress)
  • Race + Gender

  • Test Running Co-occurrence Analysis

    • Using only Arxiv (local database).

Checkpoint 2 - Full Dataset Co-occurrences

  • Expand Run to Complete Datasets

    • Redpajama
    • Pile
    • Mimic Notes
  • Considerations

    • Should an internal dataset also be included?
  • Setup Virtual Machine

    • Storage Requirement: 5TB
    • Need to find AWS credit for this setup.

Repository Structure (Desired):

LLM_Bias/
│
├── data/
│   ├── counts/  
│   ├── raw_data/  (too large to store in the repo)
│   └── filtered/   (also too large to store in the repo)
│
├── src/
│   ├── __init__.py
│   ├── jsonl_data_filtering.py
│   └── data_analysis.py
│
├── dicts/
│   ├── dict_medical.py
│   ├── dict_gender.py
│   └── dict_racial.py
│
├── keywords/
│   ├── keywords_medical.py
│   ├── keywords_gender.py
│   └── keywords_racial.py
│
├── tests/
│   └── test_data_filtering.py
│
├── docs/
│   └── README.md
│
├── notebooks/
│   └── exploratory_analysis.ipynb
│
├── results/
│   └── figures/
│
├── configs/
│   ├── stackexchange.yaml
│   ├── arxiv.yaml
│   ├── wikipedia.yaml
│   ├── c4.yaml
│   ├── books.yaml
│   ├── github.yaml
│   └── commoncrawl.yaml
│
└── main.py

Extracting Relevant Text from Training Data:

Step 1:

Define keyword dictionaries that relate to each disease, race, and gender.
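
For illustration, a minimal sketch of what such dictionaries might look like in Python (the real vocabularies live in dicts/dict_medical.py, dicts/dict_gender.py, and dicts/dict_racial.py; the entries below are illustrative assumptions, not the project's actual terms):

# Hypothetical keyword dictionaries; the real ones live in dicts/.
# Each key maps a canonical concept to its surface-form variants.
dict_medical = {
    "breast cancer": ["breast cancer", "breast carcinoma", "mammary carcinoma"],
    "hiv/aids": ["hiv", "aids", "human immunodeficiency virus"],
}
dict_gender = {
    "female": ["woman", "women", "female"],
    "male": ["man", "men", "male"],
}
dict_racial = {
    "black": ["black", "african american"],
    "white": ["white", "caucasian"],
}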

Step 2:

Filter all documents from pre-training data that mention disease AND (gender OR race).
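
As a minimal sketch of this predicate, assuming simple whole-word matching over flat term lists (the repo's jsonl_data_filtering.py may match differently, e.g. over the dictionaries sketched above):

import json
import re

# Illustrative term lists; the real vocabularies live in dicts/.
medical_terms = ["breast cancer", "hiv", "aids"]
gender_terms = ["woman", "women", "man", "men"]
racial_terms = ["black", "white", "asian", "hispanic"]

def contains_any(text, terms):
    """True if any term appears in text as a whole word (case-insensitive)."""
    return any(re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE) for t in terms)

def keep_document(text):
    """Keep documents that mention a disease AND (a gender OR a race) term."""
    return contains_any(text, medical_terms) and (
        contains_any(text, gender_terms) or contains_any(text, racial_terms)
    )

# Stream a .jsonl file and keep only the matching documents.
with open("data/raw_data/wikipedia.jsonl") as f:
    kept = [doc for doc in map(json.loads, f) if keep_document(doc["text"])]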

Step 3:

Deal with ambiguous keywords, e.g., ensuring that mentions of 'white' and 'black' refer to race rather than literal colors ("the black car was for sale"), and that 'aids' refers to the disease rather than the unrelated noun (hearing or walking aids) or verb ("she aids him"). This is done with a biomedical NER tagger configured to keep only keyword matches classified as pertaining to a disease or to a person's background. In this step, all irrelevant occurrences of the keywords are flagged so they do not count in the subsequent co-occurrence analysis.
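
The exact tagger is not specified here; as a hedged sketch, a scispaCy model such as en_ner_bc5cdr_md (which labels DISEASE and CHEMICAL entities) could validate the disease-side matches:

import spacy

# Assumption: scispaCy's en_ner_bc5cdr_md model; the project's actual
# tagger and configuration may differ.
nlp = spacy.load("en_ner_bc5cdr_md")

def is_disease_mention(text, keyword):
    """True if `keyword` occurs inside a span the NER model tags as DISEASE."""
    keyword = keyword.lower()
    return any(ent.label_ == "DISEASE" and keyword in ent.text.lower()
               for ent in nlp(text).ents)

print(is_disease_mention("She was diagnosed with AIDS in 2001.", "aids"))  # likely True
print(is_disease_mention("He wears hearing aids.", "aids"))                # likely False

Demographic keywords would need a different check (biomedical taggers do not label race or gender); a general-purpose model's NORP entity label is one option.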

Datasets

Note: all datasets analyzed are English-language.

  • Arxiv (88 GB)
    • Total data loaded and filtered (Keyword Present - Medical AND (Racial OR Gender)): 77788
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): 4.6 GB
    • Filtered Size (Ambiguous Keywords Filtering):
    • Metadata Keys: timestamp, yymm, arxiv_id, language, url
  • GitHub (213 GB)
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): 2.8 GB
    • Filtered Size (Ambiguous Keywords Filtering):
  • Stackexchange (74.5 GB)
    • Total data loaded and filtered: 9212
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): 70.1 MB
    • Filtered Size (Ambiguous Keywords Filtering):
    • Metadata Keys: language, url, timestamp, question_score
  • Wikipedia (112 GB; 20.3 GB when filtered for English only)
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): 1.6 GB
    • Filtered Size (Ambiguous Keywords Filtering):
    • Metadata Keys: title, url, language, timestamp
  • Commoncrawl
    • 2019-30 folder: 238 GB
    • 2020-05 folder: 285 GB
    • 2021-04 folder: 274 GB
    • 2022-05 folder: 251 GB
    • 2023-06 folder: 289 GB
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): XX GB
  • C4 (807 GB)
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): 19.4 GB
    • Total data loaded and filtered: 2340188
  • Books (100.4 GB)
    • Filtered Size (Keyword Present - Medical AND (Racial OR Gender)): 52.3 GB
    • Filtered Size after removing LaTeX formatting: 40.5 GB
    • Total data loaded and filtered: 76297

Downloading the Training Data:

To download only the files pertaining to RedPajama, follow the steps below. Detailed download documentation is available from the upstream RedPajama project.

# Download the urls.txt file, which contains URLs to all the dataset files
wget 'https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt'

# Get the URLs for each of the datasets in RedPajama
grep "arxiv" urls.txt > arxiv_urls.txt
# ... [repeat for other datasets]

# Use the modified script to download only the specific files
while read -r line; do
    dload_loc=${line#https://data.together.xyz/redpajama-data-1T/v1.0.0/}
    mkdir -p "$(dirname "$dload_loc")"
    wget "$line" -O "$dload_loc"
done < arxiv_urls.txt
# ... [repeat for other datasets]


# For Wikipedia, select only English articles:
jq -c 'select(.meta.language == "en")' wiki.jsonl > wiki_en.jsonl

# If you don't have jq:

## on Ubuntu:
sudo apt-get install jq

## on macOS with Homebrew:
brew install jq

How to Use

Configuration Files

Configuration files are used to specify the parameters for data filtering and analysis. These are stored in the configs/ directory and are written in YAML format. Each dataset should have its own configuration file.

Example of a configuration file (wikipedia.yaml):

data:
  input_file_path: 'data/raw_data/wikipedia.jsonl'
  output_folder_path: 'data/filtered/Wikipedia'
  metadata_keys: 
    - "language"
    - "url"
    - "timestamp"
processing:
  remove_latex: true
  save_file: true
  filename: 'wikipedia_filtered.csv'
  total_texts_filename: 'tot_texts_wiki.txt'
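
As a minimal sketch of how such a config might be consumed, assuming PyYAML (the actual loading code in the scripts may differ):

import sys
import yaml  # PyYAML: pip install pyyaml

def load_config(config_name):
    """Load configs/<config_name>.yaml into a nested dict."""
    with open(f"configs/{config_name}.yaml") as f:
        return yaml.safe_load(f)

config = load_config(sys.argv[1])  # e.g. `python main_single_file.py wikipedia`
input_path = config["data"]["input_file_path"]
remove_latex = config["processing"]["remove_latex"]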

Data Filtering and Analysis Scripts

Two main scripts are provided for data processing:

  • main_folders.py: For datasets organized in folders with multiple .jsonl files.
  • main_single_file.py: For datasets contained in a single .jsonl file.

Running the Scripts

To run the scripts, use the following command in your terminal, replacing [script_name] with the name of the script you want to run (main_folders or main_single_file) and [config_name] with the name of your configuration file (without the .yaml extension):

python [script_name].py [config_name]

Example:

python main_single_file.py wikipedia

This will:

  • Filter the data based on the terms defined in the dictionaries located in the dicts/ folder.
  • Perform a co-occurrence analysis.
  • Save the filtered data and analysis results in the specified output directories.
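
As a hedged illustration of the co-occurrence step, the sketch below counts documents in which a disease term and a demographic term appear together; the project's analysis code may count at a different granularity (e.g. sentence or window level), and the "text" column name is an assumption:

import csv
import re
from collections import Counter
from itertools import product

# Illustrative term lists; the real vocabularies live in dicts/.
medical_terms = ["breast cancer", "hiv"]
demographic_terms = ["women", "men", "black", "white"]

def mentions(text, term):
    """Whole-word, case-insensitive match."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None

cooccurrence = Counter()
with open("data/filtered/Wikipedia/wikipedia_filtered.csv", newline="") as f:
    for row in csv.DictReader(f):  # assumes a "text" column in the filtered CSV
        for med, demo in product(medical_terms, demographic_terms):
            if mentions(row["text"], med) and mentions(row["text"], demo):
                cooccurrence[(med, demo)] += 1

for (med, demo), n in cooccurrence.most_common():
    print(f"{med} x {demo}: {n}")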

Contributing

Contributions that improve the code or add functionality are welcome! Please follow the existing code structure and comment your code appropriately.

Contributors

lassehhansen, moreirap12, shan23chen, gallifantjack, daniellebitt
