

Enhancing-X-ray-Image-Text-Matching

Project B 044169 – Spring 2022

Technion – Israel Institute of Technology

Mayan Leavitt and Edan Kinderman





📝 Summary

Our project aimed to improve the matching of two X-ray scans with their corresponding radiology report, using the SGRAF image-text matching model as a baseline. To achieve this, we tested various loss functions, architectures, and training methods.

Through our experimentation, we successfully incorporated the second X-ray scan into our models and achieved significantly better results. Our research provides insights into enhancing the accuracy of image-text matching, which can have important implications for medical diagnosis and treatment.


🫁 The SGRAF Model

The model extracts features from the given image and text, and learns vector-based similarity representations between different regions of the image and words of the text. A SAF (Similarity Attention Filtration) module then processes these alignment vectors with attention mechanisms, emphasizing significant alignments and suppressing less meaningful ones. The module outputs a matching score indicating the compatibility between the image and the text. For more details, see the original article [1].
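To make the idea concrete, here is a minimal sketch (not the authors' implementation) of attention filtration: region-word similarity vectors are softmax-weighted and aggregated into a single scalar matching score. All names and the two projection vectors are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def saf_score(sim_vectors, w_att, w_score):
    """Hypothetical sketch of Similarity Attention Filtration (SAF).

    sim_vectors: (n, d) similarity representations for n region-word alignments
    w_att:       (d,)   attention projection (stands in for a learned layer)
    w_score:     (d,)   scoring projection (stands in for a learned layer)
    """
    # Attention logits per alignment; significant alignments get larger weight
    logits = sim_vectors @ w_att
    att = np.exp(logits - logits.max())
    att /= att.sum()                      # softmax over the n alignments
    # Filtration: attention-weighted sum of the similarity vectors
    filtered = att @ sim_vectors          # (d,)
    # Scalar matching score for the image-text pair
    return float(filtered @ w_score)

rng = np.random.default_rng(0)
score = saf_score(rng.normal(size=(5, 8)), rng.normal(size=8), rng.normal(size=8))
```

In the real model the projections are learned end-to-end and the similarity vectors come from the feature extractors; this sketch only shows the shape of the computation.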


🩺 Data

We used the MIMIC-CXR dataset, which contains studies consisting of a frontal image, a lateral image, and a radiology report. Existing image-text matching models often ignore the lateral image, even though it contains critical information.


💭 Proposed Improvements

  1. Evaluate different loss functions: bi-directional ranking loss (BRL), NT-Xent [2], and their weighted sum.
  2. Train two regular SGRAF models simultaneously, one per viewpoint, and use learned weights to average their similarity scores.
  3. Concatenate the features of the two image types to obtain a single input.
  4. Use positional encoding [4] to differentiate between the two viewpoints.

📊 Comparison

Here is an evaluation of each model's ability to match the images with the correct text. A higher R@K value (the fraction of queries whose correct match appears among the top K retrieved results) indicates better retrieval performance, i.e. a better alignment between the images and the corresponding text.
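For readers unfamiliar with the metric, R@K can be computed roughly as follows (a generic sketch, not our evaluation code):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim: (n, n) score matrix where sim[i, i] is the true image-text pair.
    Returns the percentage of images whose matching text ranks in the top k."""
    order = np.argsort(-sim, axis=1)          # text indices, best to worst
    hits = [i in order[i, :k] for i in range(sim.shape[0])]
    return 100.0 * np.mean(hits)

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
r1 = recall_at_k(sim, 1)   # only image 0's true text is ranked first
r2 = recall_at_k(sim, 2)   # image 2's true text also makes the top 2
```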

Here is a comparison of the basic models, which were trained on only one image type (frontal or lateral).

| Image type | Loss    | R@1 | R@5  | R@10 |
|------------|---------|-----|------|------|
| Frontal    | BRL     | 0.5 | 4.2  | 8.5  |
| Lateral    | BRL     | 0.5 | 1.5  | 3.1  |
| Frontal    | NT-Xent | 6.6 | 18.6 | 27.2 |
| Lateral    | NT-Xent | 5.0 | 13.9 | 21.1 |
| Frontal    | Sum     | 3.3 | 10.4 | 15.4 |
| Lateral    | Sum     | 0.3 | 2.0  | 3.4  |

Here is a comparison of the "double" model family, in which each model has two encoders, one per image type (frontal and lateral). These models are trained on both image types.

| Model type         | Learned weights | Shared text encoder | R@1 | R@5  | R@10 |
|--------------------|-----------------|---------------------|-----|------|------|
| Uniform Average    | ✗               | ✗                   | 8.1 | 21.3 | 29.3 |
| Weighted Average   | ✗               | ✗                   | 8.2 | 21.2 | 29.5 |
| Double Model       | ✓               | ✗                   | 6.7 | 21.1 | 30.4 |
| Light Double Model | ✓               | ✓                   | 8.5 | 22.5 | 31.5 |
| Pretrained Model   | ✓               | ✗                   | 8.1 | 21.0 | 29.6 |
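Averaging the two viewpoints' similarity scores can be sketched like this (a hypothetical illustration of the idea, with a single sigmoid-parameterized weight so the mixture stays in (0, 1); the actual learned parameterization may differ):

```python
import numpy as np

def combine_scores(s_frontal, s_lateral, alpha_logit):
    """Hypothetical weighted average of the two viewpoints' similarity scores.
    alpha_logit stands in for a learned scalar parameter."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))    # sigmoid keeps weight in (0, 1)
    return alpha * s_frontal + (1.0 - alpha) * s_lateral

s_f = np.array([0.8, 0.2])
s_l = np.array([0.4, 0.6])
uniform = combine_scores(s_f, s_l, 0.0)     # alpha = 0.5 -> plain average
biased = combine_scores(s_f, s_l, 10.0)     # alpha near 1 -> mostly frontal
```

With `alpha_logit = 0`, this reduces to the Uniform Average row of the table.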

Here is a comparison of the "concatenation" model family, which receives as input a text and a concatenation of the frontal and lateral image features. Some of these models were trained with positional encoding [4] added to the image features.

| Model type                   | Positional encoding | R@1 | R@5  | R@10 |
|------------------------------|---------------------|-----|------|------|
| Basic Concatenation          | ✗                   | 7.4 | 20.2 | 29.9 |
| Tagged Features              | ✗                   | 6.6 | 20.1 | 27.0 |
| Constant Positional Encoding | ✓                   | 7.4 | 18.8 | 27.2 |
| Full Positional Encoding     | ✓                   | 7.5 | 20.6 | 28.0 |
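Tagging the concatenated features by viewpoint can be sketched with the sinusoidal encoding of [4]: one encoding vector per viewpoint, added to all region features of that view. This is an illustrative sketch under that assumption, not the project's exact scheme, and the function names are hypothetical.

```python
import numpy as np

def sinusoidal_pe(n_pos, d):
    """Sinusoidal positional encoding from [4], shape (n_pos, d)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    # Even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def tag_viewpoints(frontal, lateral):
    """Concatenate region features from both views, adding a per-view encoding
    so the model can tell which regions came from which viewpoint."""
    d = frontal.shape[1]
    pe = sinusoidal_pe(2, d)              # one encoding vector per viewpoint
    return np.concatenate([frontal + pe[0], lateral + pe[1]], axis=0)

f = np.zeros((3, 8))                      # 3 frontal region features
l = np.zeros((3, 8))                      # 3 lateral region features
out = tag_viewpoints(f, l)                # (6, 8) tagged and concatenated
```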

We can see that using the lateral images improves results compared to using frontal images alone. In addition, training two models at once achieves the best performance, while concatenating the image features is a cheaper way to combine the viewpoints.


👨‍💻 Files and Usage

| File Name          | Description                          |
|--------------------|--------------------------------------|
| average_eval.py    | Evaluates two trained models jointly |
| data_xray.py       | Handles data loading and batching    |
| evaluation_xray.py | Evaluates a trained model            |
| model_xray.py      | The model implementations            |
| opts_xray.py       | Runs experiments using scripts       |
| train_xray.py      | Trains a model                       |

🙌 How to Run the Code

You can train a regular SGRAF model on the MIMIC-CXR dataset, using only frontal images, with this script:

python opts_xray.py --model_name '../checkpoint/<model_name>' --view 'frontal' --model_num <number> --model_type 'regular_model' --batch_size 64 --num_epochs 40

🙌 References and credits

  • Project supervisor: Gefen Dawidowicz. Some of the algorithms were implemented based on her code.
  • [1] H. Diao et al., "Similarity Reasoning and Filtration for Image-Text Matching," AAAI Conference on Artificial Intelligence, 2021.
  • [2] T. Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations," PMLR, pp. 1597-1607, 2020.
  • [3] Z. Ji et al., "Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment," MLMI, pp. 110-119, 2021.
  • [4] A. Vaswani et al., "Attention Is All You Need," Advances in Neural Information Processing Systems, 2017.
