ConCH
Unsupervised concept extraction from clinical text.
Currently submitted to the Journal of Biomedical Informatics.
ConCH
(Concept CHecker) extracts concepts by first extracting noun phrases from a corpus using a chunker or a parser. These noun phrases are then turned into vector representations through composition of their constituent word vectors. In the paper, we use the mean
function as a composition function, but other functions can also be used.
These phrase representations are then compared to similar representations of concepts, which are composed from the textual descriptions and names given to concepts in an ontology.
The main take-aways from the paper are that the addition of context, in the form of windows, in these representations helps very little, while the phrase representations themselves are good enough to extract a variety of concepts.
License
GPL-3.0
requirements
- numpy
- sklearn
- reach
- tqdm
- lxml
- regex
usage
Conch requires the following to work.
- A set of word vectors
- A mapping from concepts to their descriptions or strings
- A parser
If you want to replicate the experiments in the paper, you will need access to the i2b2-2010 challenge corpus.
If you have access to the i2b2-2010 challenge corpus, please run all the preprocessing scripts in conch.preprocessing
to extract noun phrases, and convert the gold standard data to IOB
format. We currently offer a conversion script from UIMA
XML
format to IOB
format. If you use another parser or chunker, you will have to write your own converter.
Concept representations are also created using a preprocessing script. The input to this script is a dictionary (we use a JSON file), with the UMLS CUIs as keys, and the descriptions as lists of strings.