juliandewit / kaggle_ndsb2017
Kaggle datascience bowl 2017
License: MIT License
Hi Julian,
I am a little confused about the NDSB dataset. You gave two preprocessing solutions, one for the LUNA dataset and one for the NDSB dataset. Where did you get the NDSB dataset from? Also, at the following link you show that you create 3D chunks of the scans. How exactly do you create them?
http://juliandewit.github.io/kaggle-ndsb2017/
The LUNA16 challenge also provides documentation for extracting patches. Why do you use 3D chunks of the scans rather than the patches obtained from the LUNA dataset?
Thank you!
I am getting an error in step1_preprocess_luna16.py:
Computer: DESKTOP-AVK4MS4
0 patient: 1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260
Img array: (121, 512, 512)
Annos: 0
Origin (x,y,z): [-198.100006 -195. -335.209991]
Spacing (x,y,z): [ 0.76171899 0.76171899 2.5 ]
Rescale: [ 0.76171899 0.76171899 2.5 ]
Direction: [ 1. 0. 0. 0. 1. 0. 0. 0. 1.]
Direction: [ 1. 0. 0. 0. 1. 0. 0. 0. 1.]
(390, 390)
1 patient: 1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492
Img array: (119, 512, 512)
Annos: 1
Origin (x,y,z): [-182.5 -190. -313.75]
Spacing (x,y,z): [ 0.74218798 0.74218798 2.5 ]
Rescale: [ 0.74218798 0.74218798 2.5 ]
Direction: [ 1. 0. 0. 0. 1. 0. 0. 0. 1.]
Direction: [ 1. 0. 0. 0. 1. 0. 0. 0. 1.]
(380, 380)
Node org (x,y,z,diam): (-100.57, 67.26, -231.82, 6.44)
Node tra (x,y,z,diam): (110.0, 347.0, 33.0)
Traceback (most recent call last):
File "step1_preprocess_luna16.py", line 718, in <module>
process_pos_annotations_patient2()
File "step1_preprocess_luna16.py", line 642, in process_pos_annotations_patient2
process_pos_annotations_patient(src_path, patient_id)
File "step1_preprocess_luna16.py", line 280, in process_pos_annotations_patient
center_float_percent = center_float_rescaled / patient_imgs.swapaxes(0,2).shape
ValueError: bad axis2 argument to swapaxes
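A note on the traceback above: NumPy raises this error when `swapaxes(0, 2)` is called on an array that has fewer than three dimensions, so `patient_imgs` was not a 3D volume at that point. Why it was not 3D (for example, missing or unreadable extracted slices) cannot be told from the log. The sketch below only reproduces the mechanism and shows an explicit shape check; `require_volume` is a hypothetical helper, not a function from the repository.

```python
import numpy as np


def require_volume(images):
    """Return `images` as a 3D array, or raise a descriptive error.

    Hypothetical helper for illustration; not part of the repository.
    """
    arr = np.asarray(images)
    if arr.ndim != 3:
        raise ValueError("expected a 3D (z, y, x) volume, got shape %s" % (arr.shape,))
    return arr


# A single 2D slice has no axis 2, so swapaxes(0, 2) fails.
slice_2d = np.zeros((380, 380))
try:
    slice_2d.swapaxes(0, 2)
    swap_failed = False
except ValueError:
    # Newer NumPy versions raise AxisError, which is a subclass of ValueError.
    swap_failed = True

# A proper stack of slices works as expected.
volume = require_volume(np.zeros((119, 380, 380)))
swapped_shape = volume.swapaxes(0, 2).shape
```

Checking the shape right after loading makes the failure point at the loading step instead of at the later arithmetic.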
Hi juliandewit,
Your work is great! How should I cite your work in a paper? Which papers should be added to the references?
Thanks
Gu Yu
Hey @juliandewit. Thank you so much for sharing such great knowledge. I wonder if we could get the CT viewer you used while in the competition.
I know that you may not have published it with this repo for some good reason, but still, please share the code for the CT viewer. I would like to visualize the data and see the output in the CT viewer.
I hope you'll understand.
Thank you.
Hi, julian,
I am trying to build a nodule detector based on your work. Thank you very much for sharing it.
May I ask some questions?
You use several types of training data:
labels from LIDC, v2 from LUNA16, LUNA16 false positives, NDSB, and non-lung tissue edges.
So, in the training stage, apart from the non-lung tissue edges, are all the others positive samples? And is the label YES (the cube contains a nodule) for positive samples and NO for the non-lung tissue edges?
Another question: when predicting, a 64*64*64 cube is fed to the net. Is the result whether the cube contains a nodule, together with the probability?
Any information will be welcomed!
Hi juliandewit,
Your source code helps me a lot. I have another question. I found the function dice_coef at line 207 in step2_train_mass_segmenter. The function returns (2. * intersection + 100) / (K.sum(y_true_f) + K.sum(y_pred_f) + 100).
The definition of the Dice coefficient does not contain 100. It seems it should be (2. * intersection) / (K.sum(y_true_f) + K.sum(y_pred_f)).
Why did you add 100 to both the numerator and the denominator? Thanks for your help.
Gu Yu
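The thread does not record the author's reason. A common reason for adding the same constant to the numerator and denominator of a Dice score is smoothing: it avoids a division by zero when both masks are empty and softens the score for very small masks. The sketch below is plain Python, not the Keras code from the repository, and only shows what the constant does numerically; the function name `dice` and the toy masks are assumptions made for illustration.

```python
def dice(y_true, y_pred, smooth=0.0):
    """Dice coefficient on flat lists, with an optional smoothing constant."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    denominator = sum(y_true) + sum(y_pred) + smooth
    return (2.0 * intersection + smooth) / denominator


empty = [0.0] * 16
tiny_true = [1.0] + [0.0] * 15
tiny_pred = [0.0, 1.0] + [0.0] * 14  # one predicted pixel, off by one position

# Without smoothing, two empty masks give 0 / 0.
try:
    dice(empty, empty)
    plain_empty_failed = False
except ZeroDivisionError:
    plain_empty_failed = True

smoothed_empty = dice(empty, empty, smooth=100)        # 100 / 100
plain_tiny = dice(tiny_true, tiny_pred)                # 0 / 2
smoothed_tiny = dice(tiny_true, tiny_pred, smooth=100)  # 100 / 102
```

With a large constant such as 100, a one-pixel miss barely lowers the score, which also means the smoothed value is no longer the textbook Dice coefficient for small masks.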
The code is:
model.fit_generator(train_gen, len(train_files) / 1, 12, validation_data=holdout_gen, nb_val_samples=len(holdout_files) / 1, callbacks=[checkpoint, checkpoint_fixed_name, learnrate_scheduler])
The URL is:
https://github.com/juliandewit/kaggle_ndsb2017/blob/master/step2_train_nodule_detector.py#L387
Why divide by 1? Should it perhaps be divided by the batch size?
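Some context for the question above: `nb_val_samples` is a Keras 1.x argument name, and in that API the second positional argument of `fit_generator` is `samples_per_epoch`, which counts samples rather than batches. Keras 2 replaced it with `steps_per_epoch`, which counts batches. The sketch below only shows the arithmetic that separates the two conventions; the helper name and the numbers are hypothetical and are not taken from the repository.

```python
import math


def steps_for(sample_count, batch_size):
    """Number of batches needed to see every sample once (Keras 2 style)."""
    return int(math.ceil(sample_count / float(batch_size)))


sample_count = 1000  # hypothetical number of training files
batch_size = 16      # hypothetical batch size

samples_per_epoch = sample_count                     # Keras 1.x convention
steps_per_epoch = steps_for(sample_count, batch_size)  # Keras 2 convention
```

Under the sample-counting convention no division by the batch size is needed, which may be why the divisor was left at 1; that reading is an inference, not something the thread confirms.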
I get this error when I run step1b_preprocess_make_train_cubes.py. The error is thrown for some luna16_manual_labels files, and I don't understand what is happening with those specific CSV files.
I have seen a few .png images in the luna16_train_cubes_manual folder.
Error output:
Computer: DESKTOP-AVK4MS4
1.3.6.1.4.1.14519.5.2.1.6279.6001.128881800399702510818644205032
0 1.3.6.1.4.1.14519.5.2.1.6279.6001.128881800399702510818644205032 2
1.3.6.1.4.1.14519.5.2.1.6279.6001.160216916075817913953530562493
1 1.3.6.1.4.1.14519.5.2.1.6279.6001.160216916075817913953530562493 1
1.3.6.1.4.1.14519.5.2.1.6279.6001.161002239822118346732951898613
1.3.6.1.4.1.14519.5.2.1.6279.6001.167919147233131417984739058859
3 1.3.6.1.4.1.14519.5.2.1.6279.6001.167919147233131417984739058859 1
1.3.6.1.4.1.14519.5.2.1.6279.6001.170825539570536865106681134236
4 1.3.6.1.4.1.14519.5.2.1.6279.6001.170825539570536865106681134236 1
1.3.6.1.4.1.14519.5.2.1.6279.6001.172845185165807139298420209778
5 1.3.6.1.4.1.14519.5.2.1.6279.6001.172845185165807139298420209778 3
1.3.6.1.4.1.14519.5.2.1.6279.6001.173931884906244951746140865701
6 1.3.6.1.4.1.14519.5.2.1.6279.6001.173931884906244951746140865701 2
1.3.6.1.4.1.14519.5.2.1.6279.6001.227968442353440630355230778531
7 1.3.6.1.4.1.14519.5.2.1.6279.6001.227968442353440630355230778531 1
1.3.6.1.4.1.14519.5.2.1.6279.6001.230491296081537726468075344411
8 1.3.6.1.4.1.14519.5.2.1.6279.6001.230491296081537726468075344411 1
1.3.6.1.4.1.14519.5.2.1.6279.6001.241717018262666382493757419144
9 1.3.6.1.4.1.14519.5.2.1.6279.6001.241717018262666382493757419144 1
1.3.6.1.4.1.14519.5.2.1.6279.6001.246225645401227472829175288633
Traceback (most recent call last):
File "step1b_preprocess_make_train_cubes.py", line 271, in <module>
make_pos_annotation_images_manual()
File "step1b_preprocess_make_train_cubes.py", line 139, in make_pos_annotation_images_manual
images = helpers.load_patient_images(patient_id, settings.LUNA16_EXTRACTED_IMAGE_DIR, "*" + CUBE_IMGTYPE_SRC + ".png")
File "C:\Users\Sangryal\Downloads\sathya\helpers.py", line 78, in load_patient_images
res = numpy.vstack(images)
File "C:\Users\Sangryal\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 234, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: need at least one array to concatenate
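The final error above is what `numpy.vstack` raises when it is given an empty list, which means no images were found for that patient with the given file pattern. Why none were found (a wrong directory setting, a skipped preprocessing step, or something else) is not visible in the log. The sketch below shows a guard that reports the problem more directly; `stack_slices` is a hypothetical stand-in and not the repository's `load_patient_images`.

```python
import numpy as np


def stack_slices(images, pattern):
    """Stack 2D slices into a (z, y, x) volume, failing early if none exist.

    Hypothetical helper for illustration only.
    """
    if len(images) == 0:
        raise FileNotFoundError("no images matched %r" % pattern)
    return np.vstack([im.reshape((1,) + im.shape) for im in images])


# np.vstack on an empty list reproduces the error from the traceback.
try:
    np.vstack([])
    vstack_failed = False
except ValueError:
    vstack_failed = True

volume = stack_slices([np.zeros((4, 4)), np.ones((4, 4))], "*_i.png")
```

The pattern string `"*_i.png"` is only a placeholder for whatever pattern the caller actually builds.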
What is the value of LUNA_16_TRAIN_DIR2D2? I cannot find it in the settings.py file.
Hi Julian,
Congratulations on such great work. I just have a few questions about the directories where you store the data. In 'settings.py', I see you are referring to the following locations:
BASE_DIR_SSD
BASE_DIR
EXTRA_DATA_DIR
NDSB3_RAW_SRC_DIR
LUNA16_RAW_SRC_DIR
I am confused about which folder contains what. Where am I supposed to store the NDSB data, and where the LUNA16 dataset?
Thank you so much.
Hi,
I'm running step2_train_nodule_detector.py on a Linux machine with a Titan X GPU.
It takes close to 16 hours to complete a single epoch, whereas the README.md says the total time for 12 epochs is 8 hours. I'm using an Anaconda2 Python environment.
Can you please help me with this? What am I missing?
I ran the script step2_nodule_detector.py and, for model 2 (LUNA16 annotations + NDSB positive annotations), I got the following error:
File "", line 2, in
train(train_full_set=True, load_weights_path=None, ndsb3_holdout=0, manual_labels=True, model_name="luna_posnegndsb_v1", fold_count=2)
File "", line 12, in train
train_files, holdout_files = get_train_holdout_files(train_percentage=80, ndsb3_holdout=ndsb3_holdout, manual_labels=manual_labels, full_luna_set=train_full_set, fold_count=fold_count)
File "", line 113, in get_train_holdout_files
pos_sample_path = pos_samples[pos_idx]
IndexError: list index out of range
Thanks in advance!
Please reply as soon as possible.
I downloaded trained_models.rar and decompressed it. Several files were extracted, but I don't know what each of them stands for.
Could you tell me?
Hi, julian,
Your work is great. Thanks for sharing.
I downloaded resource.rar, and it contains several folders with different data. As far as I know, the data in the folder 'luna16_annotations' is from LUNA16 and LIDC-IDRI. What about the other folders?
Thanks
Liu Peng
Hi Julian,
This script generates many CSV files. Can you please tell me what exactly each of these functions generates?
process_lidc_annotations(only_patient=None, agreement_threshold=0)
1.process_pos_annotations_patient2()
2.process_excluded_annotations_patients(only_patient=None)
3.process_luna_candidates_patients(only_patient_id=None)
4.process_auto_candidates_patients()
Thank You.
In your submission, only the probability of having cancer is calculated. How would you calculate the accuracy of your submission?
Hi Julian,
I am more interested in your learned features than in predicting the final outcome through the network. I was wondering if it would be possible to extract features from the intermediate layers of your 3D network. Or do you by any chance know of any trained networks (preferably trained on 3D images) that can easily be used for feature extraction?
Thanks for your help in advance,
laleh
Hi, Julian,
In the function predict_cubes() of step3_predict_nodules.py, you predict all the 32*32*32 cubes of each patient. There is a cube-skipping condition at line 266:
if cube_mask.sum() < 2000:
    skipped_count += 1
Could you explain why?
Thanks
tjliupeng
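The thread does not include the author's explanation. Read literally, the quoted condition skips any cube whose mask values add up to less than 2000, that is, cubes where the mask covers little or nothing. The sketch below illustrates that reading only. How the mask is encoded in the repository (0/1 or 0/255 values, for example) is not stated here, so the meaning of the threshold in voxels is an assumption; `should_skip` is a hypothetical helper.

```python
import numpy as np


def should_skip(cube_mask, min_mask_sum=2000):
    """True when the summed mask inside the cube is below the threshold.

    Hypothetical helper; the threshold default mirrors the quoted condition.
    """
    return cube_mask.sum() < min_mask_sum


# A cube with no mask at all is skipped.
empty_cube = np.zeros((32, 32, 32), dtype=np.uint8)

# A cube fully covered by a 0/1 mask sums to 32 * 32 * 32 = 32768 and is kept.
full_cube = np.ones((32, 32, 32), dtype=np.uint8)

skip_empty = bool(should_skip(empty_cube))
skip_full = bool(should_skip(full_cube))
```

Skipping such cubes reduces the number of predictions per patient; whether speed or false-positive reduction was the motivation is not confirmed by the source.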
Thank you so much for your sharing.
I have a question about the batchnorm layer.
In step2_train_mass_segmenter.py, from line 314, there is the architecture of the 2D U-Net. In each block, the layers go like this: input -> batchnorm -> conv1 -> relu -> conv2 -> relu -> pooling -> output
In other papers, batch norm layers are traditionally placed between the conv layers and the ReLU, in order to avoid gradient explosion. So I wonder why you put batchnorm before conv. Is there some theory supporting this order, or is it a new trick in CNNs?
Of course I know there is no "correct" position for every layer, and your work performs quite well. Congratulations on the challenge!
Can anyone help me with the links to download the dataset required to run this repo?
Dear Julian,
How can I draw a ROC curve?
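As a general illustration of the question above, a ROC curve plots the true positive rate against the false positive rate while the decision threshold sweeps over the predicted scores. The sketch below computes those points in plain Python; it is not code from the repository, the function names and the four-sample toy data are made up, and tied scores are not treated specially.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / float(negatives))
        tpr.append(tp / float(positives))
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum((fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2.0
               for k in range(len(fpr) - 1))


labels = [0, 0, 1, 1]            # toy ground truth
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities
fpr, tpr = roc_points(labels, scores)
auc = area_under_curve(fpr, tpr)
```

The two lists can then be passed to any plotting library, with `fpr` on the horizontal axis and `tpr` on the vertical axis.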
Hello Juliandewit.
I have some DICOM files from lung cancer patients, not the Kaggle ones. I ran step1_preprocess_ndsb.py and step3.py with the DICOM files in the NDSB3_RAW_SRC_DIR directory from settings.py, with resources.rar extracted into the ./resources/ directory and train_models.rar extracted into the ./models/ directory.
But I got 9 empty folders in the NDSB3_NODULE_DETECTION_DIR directory. Following what you said in the README.md, I thought I would get what I wanted, but now I am a little confused and am reading the code.
Could you please tell me how to use the trained models to do nodule detection on other DICOM files?
Thank you.
I ran step3.py separately.
I downloaded the code and tried to run it on my own computer (with the trained model and the LUNA16 training data as the test set). However, something is wrong in the result: all nodules are detected at (0,0,0), and the "diameter_mm" values are all negative numbers.
I tried to debug step3.py and found something: at lines 60-62, "center_x", "center_y" and "center_z" equal 0.0 no matter what the input image is.
How can I fix this problem? Waiting for your reply...
Hi, Julian,
After training, I ran step3_predict_nodules.py and the result is odd: all the x, y, z coordinates are 0.0, and the values in the diameter_mm column are all negative.
What could be the problem?
Thanks
Hi, julian,
Thank you for sharing.
I am reading and running your code. I have some problems with the CSV files in the resource folder. You have answered similar questions before. You said that the luna16_falsepos_labels folder was generated automatically. Can you tell me how?
Thank you very much.
Hi Julian,
Take an example from resources/ndsb3_manual_label:
::::::::::::::
id,x,y,z,d,mal,dmm
0,0.7380484,0.4426079,0.4596774,0.08382452,1,0
0,0.6142763,0.630854,0.3790323,0.0589391,1,0
0,0.7439424,0.6382002,0.3790323,0.07203662,1,0
0,0.6660118,0.630854,0.3360215,0.09299278,1,0
There are two fields, d and dmm. Are they (predicted) diameters, or are they mal scores?
I am facing an error when I run these functions in step1_preprocess_luna16.py:
if True:
    process_pos_annotations_patient2()
    process_excluded_annotations_patients(only_patient=None)
Error:
File "C:\Users\Sangryal\Downloads\sathya\kaggle_ndsb2017\kaggle_ndsb2017-master\helpers.py", line 77, in load_patient_images
images = [im.reshape((1,) + im.shape) for im in images]
File "C:\Users\Sangryal\Downloads\sathya\kaggle_ndsb2017\kaggle_ndsb2017-master\helpers.py", line 77, in
images = [im.reshape((1,) + im.shape) for im in images]
AttributeError: 'NoneType' object has no attribute 'reshape'
Hi, juliandewit:
Great work! Thank you for sharing your code.
I have a question. When I ran step2_train_nodule_detector.py, each epoch took 4 hours: it waited for 3 hours and then ran model.fit_generator for 1 hour. The GPU in my computer is a GTX 980 Ti G1. How can I cut down the running time?
Thanks!
Gu Yu
Traceback (most recent call last):
File "C:\Program Files (x86)\JetBrains\PyCharm 2016.3.2\helpers\pydev\pydevd.py", line 1596, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files (x86)\JetBrains\PyCharm 2016.3.2\helpers\pydev\pydevd.py", line 974, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files (x86)\JetBrains\PyCharm 2016.3.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "D:/Python/code/Kaggle/Bowl2017/place2-2/step4_train_submissions.py", line 399, in <module>
combine_nodule_predictions(None, train_set=False, nodule_th=0.7, extensions=[model_variant])
File "D:/Python/code/Kaggle/Bowl2017/place2-2/step4_train_submissions.py", line 130, in combine_nodule_predictions
target_path = settings.BASE_DIR + "xgboost_trainsets/" "train" + extension + ".csv" if train_set else settings.BASE_DIR + "xgboost_trainsets/" + "submission" + extension + ".csv"
UnboundLocalError: local variable 'extension' referenced before assignment
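The traceback above does not show the body of `combine_nodule_predictions`, so the actual cause cannot be read from it. In general, Python raises `UnboundLocalError` when a function uses a local name on a path where it was never assigned. One common way to get there is to use a loop variable after a loop that ran zero times. The sketch below reproduces only that general mechanism; `target_name` is a made-up function.

```python
def target_name(extensions):
    """Build a file name from the last extension seen in the loop."""
    for extension in extensions:
        pass  # per-extension work would go here
    # `extension` is local to this function; it is unbound if the loop never ran.
    return "submission" + extension + ".csv"


ok_name = target_name(["_v1"])

try:
    target_name([])
    unbound_raised = False
except UnboundLocalError:
    unbound_raised = True
```

Assigning the variable before it is first read, or moving the code that uses it inside the loop, removes this class of error.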
Hi, Julian,
I just started running your step3_predict_nodules.py
using your trained model.
I found it only ran on 1 GPU even though I assigned 2 GPUs to it with
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
I also commented out config.gpu_options.per_process_gpu_memory_fraction = 0.5
because I am allowed to use both GPUs fully, but it was still slow.
Could you let me know how to run the code on multiple GPUs? Thanks.
Hi Julian,
In step1_preprocess_ndsb.py, should I extract both the Stage 1 and Stage 2 NDSB data into the same directory, or into two separate ones?
i.e.
/data/ndsb3_extracted_images/<patient_dirs>
or
/data/ndsb3_extracted_images/stage1/<patient_dirs>
/data/ndsb3_extracted_images/stage2/<patient_dirs>
?
Hi, julian,
Your work is great. Thanks for sharing.
I downloaded resource.rar, and it contains several folders with different data. As far as I know, the data in the folder 'luna16_annotations' is from LUNA16 and LIDC-IDRI, and the data in the folders 'luna16_manual_labels' and 'ndsb3_manual_labels' was generated manually. What about the others, such as annotations_excluded.csv and candidates_V2.csv in the folder 'luna16_annotations', the folder 'luna16_falsepos_labels', and the folder 'segmenter_traindata'?
Thanks
Cao jiehui
Hi, juliandewit
Why did you select "sigmoid" and "None" rather than "ReLU" or "LeakyReLU" in the last layers of the CNN? Are ReLU or LeakyReLU better than sigmoid? Thanks.
The code is below:
out_class = Convolution3D(1, 1, 1, 1, activation="sigmoid", name="out_class_last")(last64)
out_class = Flatten(name="out_class")(out_class)
out_malignancy = Convolution3D(1, 1, 1, 1, activation=None, name="out_malignancy_last")(last64)
out_malignancy = Flatten(name="out_malignancy")(out_malignancy)
Best regards
Gu Yu
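The thread does not record the author's answer. As general background, a sigmoid maps any input into the open interval (0, 1), so its output can be read as a probability, while `activation=None` in Keras means no activation is applied (a linear output), which is the usual choice for a regression target. ReLU would clip every negative value to zero. The sketch below shows these three behaviours on a few numbers in plain Python; it does not use Keras and the input values are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


raw_outputs = [-4.0, 0.0, 4.0]  # arbitrary pre-activation values

as_probability = [sigmoid(v) for v in raw_outputs]  # always between 0 and 1
as_relu = [relu(v) for v in raw_outputs]            # unbounded above, zero below
as_linear = list(raw_outputs)                       # unchanged, any real value
```

Whether these properties were the reason for the choice in the repository is an assumption here, not something the source states.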
In step4_train_submissions.py you load a csv file as such:
mass_df = pandas.read_csv(settings.BASE_DIR + "masses_predictions.csv")
Is this file created somewhere in the previous steps? or is it one that is downloaded beforehand? I can't seem to find where it is created or located.
Thanks,
Teaghan
In step3_predict_nodules.py, there are
settings.MANUAL_ANNOTATIONS_LABELS_DIR
pos_labels_dir = settings.LUNA_NODULE_LABELS_DIR. What are the values for these?
Thanks in advance
Has anyone run this code completely, with the proper required outputs? Please reply.
Thank you!
What if there is a DICOM file whose slices[0].ImageOrientationPatient != [1.000000, 0.000000, 0.000000, 0.000000, 1.000000, 0.000000]?
Should I just discard it, or do some other preprocessing?
Hi Julian, who created the CSV files in resources/luna16_train_cubes_manual? What is the reason for creating this data? Sorry, I have read your blog, but I still can't understand it.
Hi Julian,
Can I just change step3 to predict the LUNA data and make a submission for LUNA16?
Hi Julian, I don't know what '-> Model' means in
def get_net(input_shape=(CUBE_SIZE, CUBE_SIZE, CUBE_SIZE, 1), load_weight_path=None, features=False, mal=False) -> Model:
When I ran this stage's code, some problems occurred. Is this related to the Python version? My version is 2.7.
Could you please tell me the answer?
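For reference, the `-> Model` part is a return annotation. Function annotations are Python 3 syntax, and Python 2.7 rejects a definition written this way with a SyntaxError before any code runs. The sketch below shows the syntax on a made-up function (`cube_shape` is not from the repository); the annotation is only metadata and does not change what the function returns.

```python
def cube_shape(cube_size: int = 32) -> tuple:
    """Return an input shape; the annotations document the expected types."""
    return (cube_size, cube_size, cube_size, 1)


# Annotations are stored on the function object and are not enforced at runtime.
annotations = cube_shape.__annotations__
default_shape = cube_shape()
```

Running the repository code under Python 3 is one way past this particular error; the thread does not say whether other Python 2 incompatibilities exist.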
Hello, Julian,
Thank you for offering the code and great work.
I have a question about the data weighting of the training set.
I notice that there are several data sources in the training set, such as labels from LIDC, v2 from LUNA16, LUNA16 false positives, NDSB, and non-lung tissue edges. Among them, the LIDC labels and the LUNA16 nodules should be the positive samples, and the others are negative samples (their labels are 0,0).
But there are far more negative samples than positive ones, so the set is unbalanced. What ratios should be used when combining the training set? I think 1 (positive) : 1 (false positive) : 2 (non-lung tissue or edge) might make sense, because too many negative samples would dilute the accuracy.
Would you please give me some suggestions on this issue?
Hi, julian,
How do you extract the overlays from the images for the U-Net mass detector?
The overlays are in "resources\segmenter_traindata\*_o.png". How do you generate those files?
Thank you!
Hey Julian, this is such impressive work. I would like to ask about the hardware you used for training and storing files. Did you use the AWS cloud (EC2 + S3), or do you have local resources?
Hi Julian
the code works like a charm for me. Thanks!
However, I wonder how to calculate back to the absolute slice position of a nodule.
Currently, one receives coordinates such as 0.1834, 0.5272, 0.71179
in the CSV file for x, y and z.
How can I calculate which slice the respective nodule is in (Z axis), and how many voxels it is from the left (X) and from the top (Y)?
I tried multiplying the values by the image shape, e.g. (261, 512, 512),
but that gives strange-looking results, e.g. 47.8674, 269.9264, 364.43648.
Do you know a way to get the correct absolute values?
Thanks a lot
Willi
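One possible explanation for the strange values above is axis order. A shape such as (261, 512, 512) is laid out as (z, y, x), while the CSV columns are x, y, z, so multiplying element by element pairs x with the slice count. The sketch below assumes the CSV values are fractions of each axis extent and that the array is in (z, y, x) order; both are assumptions, and if the fractions refer to a rescaled volume, the shape of that volume should be used instead. `fractions_to_voxels` is a hypothetical helper.

```python
def fractions_to_voxels(frac_xyz, shape_zyx):
    """Convert fractional (x, y, z) coordinates to voxel positions.

    Assumes `shape_zyx` is ordered (slices, rows, columns).
    """
    fx, fy, fz = frac_xyz
    nz, ny, nx = shape_zyx
    return (fx * nx, fy * ny, fz * nz)


x, y, z = fractions_to_voxels((0.1834, 0.5272, 0.71179), (261, 512, 512))
slice_index = int(round(z))  # nearest slice along the Z axis
```

With the numbers from the post, this places the nodule near slice 186, about 94 voxels from the left and about 270 voxels from the top.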