
OHSUMED: An interactive retrieval evaluation and new large test collection for research

### Note: The files were processed to make them compatible with the data sprint @HuggingFace (Dec 2020)

About the dataset

The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The National Library of Medicine has agreed to make the MEDLINE references in the test database available for experimentation, restricted to the following conditions:

  1. The data will not be used in any non-experimental clinical, library, or other setting.
  2. Any human users of the data will explicitly be told that the data is incomplete and out-of-date.

The OHSUMED document collection was obtained by William Hersh ([email protected]) and colleagues for the experiments described in the papers below:

Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.

Hersh WR, Hickam DH, Use of a multi-application computer workstation in a clinical setting, Bulletin of the Medical Library Association, 1994, 82: 382-389.

Data Fields

Here are the field definitions:

| Column Marker | Definition | Key Name | Notes (if any) |
| --- | --- | --- | --- |
| .I | Sequential identifier | seq_id | Important: documents should be processed in this order |
| .U | MEDLINE identifier (UI) | medline_ui | Used for relevance judgements |
| .M | Human-assigned MeSH terms (MH) | mesh_terms | |
| .T | Title (TI) | title | |
| .P | Publication type (PT) | publication_type | |
| .W | Abstract (AB) | abstract | |
| .A | Author (AU) | author | |
| .S | Source (SO) | source | |

Note: some abstracts are truncated at 250 words and some references have no abstracts at all (titles only). We do not have access to the full text of the documents.
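As a rough sketch, a file in the original marker-based layout can be parsed into one dict per document using the markers and key names from the table above. This is a minimal illustration, not the repo's notebook code; the sample record below is abbreviated and made up, and real records may differ in detail (e.g. multi-line field values).

```python
# Map the column markers from the field table to the key names used here.
FIELD_NAMES = {
    ".I": "seq_id",
    ".U": "medline_ui",
    ".M": "mesh_terms",
    ".T": "title",
    ".P": "publication_type",
    ".W": "abstract",
    ".A": "author",
    ".S": "source",
}

def parse_records(text):
    """Return one dict per document; a `.I` line starts a new record."""
    records, current, pending = [], None, None
    for line in text.splitlines():
        marker, _, rest = line.partition(" ")
        if marker == ".I":
            if current is not None:
                records.append(current)
            current = {"seq_id": rest.strip()}
            pending = None
        elif marker in FIELD_NAMES and current is not None:
            field = FIELD_NAMES[marker]
            if rest.strip():          # value on the same line as the marker
                current[field] = rest.strip()
                pending = None
            else:                     # value expected on the following line(s)
                pending = field
        elif pending is not None and current is not None and line.strip():
            # Continuation / value line for the pending field.
            value = current.get(pending, "")
            current[pending] = (value + " " + line.strip()).strip()
    if current is not None:
        records.append(current)
    return records

# Abbreviated, made-up sample record for illustration only.
sample = """.I 1
.U
87049087
.T
Some illustrative title.
.W
A short illustrative abstract.
"""
docs = parse_records(sample)
```

Because documents are keyed by `.I`, the records come out in sequence order, matching the processing-order note in the table.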

Pre-processing steps

The train/test distribution is as follows; note that the test split is larger than the train split:

Train: 54,710
Test:  293,858
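As a quick sanity check, the split proportion can be computed directly from the counts above:

```python
# Split sizes as stated above.
train_size = 54_710
test_size = 293_858

# Train makes up only a small fraction of the collection.
train_frac = train_size / (train_size + test_size)
print(f"train fraction: {train_frac:.1%}")  # roughly 16% train, 84% test
```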

The original dataset is in an unusual layout, where a single logical row is spread across multiple lines. It's something similar to this:

    Row 1 Col 1
    Row 1
    Col 2
    Row 1
    Col 3
    ...
    Row 2 Col 1
    Row 2
    Col 2
    Row 2
    Col 3


Check the notebook in the repo for the pre-processing details.
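The overall reshaping step, going from per-document dicts to a flat tabular file, can be sketched like this. It is a simplified stand-in for the notebook, not its actual code; the output filename and the two sample records are made up for illustration.

```python
import csv

# Column order follows the field table above.
COLUMNS = ["seq_id", "medline_ui", "mesh_terms", "title",
           "publication_type", "abstract", "author", "source"]

# Two made-up records standing in for the parsed documents.
records = [
    {"seq_id": "1", "medline_ui": "87049087",
     "title": "First illustrative title."},
    {"seq_id": "2", "medline_ui": "87049088",
     "title": "Second illustrative title.",
     "abstract": "A short illustrative abstract."},
]

# Hypothetical output filename; one row per document, missing fields left empty.
with open("ohsumed_flat.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(records)
```

Writing one row per `.I` record is what turns the multi-line layout shown above into an ordinary table.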


Contributors

skyprince999

