
Comments (18)

FabianIsensee commented on September 18, 2024

Hi,
I think @dzimmerer would be the best person to explain this as this was his idea. If I remember correctly, the rationale is that python dicts are mutable: if we modified data_dict within the transforms, we would also modify the original dictionary that was passed to the transform as an argument. By unpacking it and packing it back together, we create a new dictionary and thus do not modify the original one.
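Roughly, the call pattern looks like this (a simplified sketch of the idea, not the actual batchgenerators code):

class AddKeyTransform:
    def __call__(self, **data_dict):
        # data_dict is a brand-new dict built from the keyword arguments, so
        # adding or removing keys here does not touch the caller's dictionary
        data_dict['new_key'] = 42
        return data_dict

original = {'data': 'some array', 'seg': 'some other array'}
result = AddKeyTransform()(**original)  # unpack; the transform repacks
print('new_key' in original, 'new_key' in result)

--> False True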
I can only encourage you to use our own Dataloader. It is pretty simple to use, allows maximum flexibility and it integrates seamlessly with the rest of our pipeline. If you need help migrating I'd be happy to help out.
Please note that for our approach to multiprocessing we offload the entire data loading pipeline into background workers, including the dataloader (which is the 'source'). In order to avoid getting duplicate batches you need to make sure you set the seeds in the MultiThreadedAugmenter (or leave them None) and don't use seeding in the dataloader. I always do my experiments by randomly drawing examples from my dataset and I don't guarantee that an epoch actually iterates over all examples (thus my Dataloaders will always return random examples indefinitely). If you want to iterate over your dataset and see each example once per epoch, please refer to the file multithreaded_dataloading.py in the examples folder :-)
Best,
Fabian


aharrison-paii commented on September 18, 2024

Thanks for the quick response!
I see. I confess that I'm often confused by the way python keeps track of references, so please correct me if I'm wrong. But if you're creating a new dict for each transform, doesn't that mean that when you modify the "data" field, for example, the original reference to "data" is kept intact by the caller's dictionary? In that case, as we go through the transform chain, we'd be keeping multiple versions of "data" in system memory as it gets modified.

And thank you for the warning regarding the epochs, I did read that in your documentation. Migrating could be a good option; it seems very straightforward and it looks quite nice. To help me assess, I'm wondering if you've ever benchmarked your dataloader vs. pytorch's? Totally fine if not, but it would be nice to know if you happen to have it on hand.

Thanks again!


FabianIsensee commented on September 18, 2024

Hi,

I must confess that I have never used pytorch dataloaders. I started developing batchgenerators when I started my PhD and back then there were no pytorch dataloaders. Also, as far as I am aware we are the only framework that really does 3D images (afaik pytorch does not handle this). So using anything but batchgenerators does not make sense for me.
So no, I have not benchmarked ours vs pytorch. We should probably do that at some point (as I am pretty confident we are not slower). One of our lab members, @ekmungi, originally used pytorch dataloaders, ran into problems and switched to batchgenerators. Even before his bug appeared pytorch was slower (but that was on Windows, his batchgenerators experiment was on linux): pytorch/pytorch#12831

If you have a pytorch dataloader for public data (ideally something like cifar or the like) then I'd be happy to run performance tests.

I see. I confess that I'm often confused by the way python keeps track of references, so please correct me if I'm wrong. But if you're creating a new dict for each transform, doesn't that mean that when you modify the "data" field, for example, the original reference to "data" is kept intact by the caller's dictionary? In that case, as we go through the transform chain, we'd be keeping multiple versions of "data" in system memory as it gets modified.

I also get confused a lot and it is difficult to tell exactly, unless you try it out :-) . While we create a new dict each time (and thus do not modify the old dictionary, for example when we move keys around or add key-value pairs), the actual contents of the dictionaries are shared and can still be changed.
This should illustrate it (and why we use **kwargs):

import numpy as np

content = np.ones((2, 2, 224, 224))
a = {0: content}
b = a  # b is just another name for the same dict object
a[0] = 'haha'
print(b[0])

--> 'haha'

content = np.ones((2, 2, 224, 224))
a = {0: content}
b = {**a}  # unpack a into a brand-new dict with the same key-value pairs
a[0] = 'haha'
print(b[0].shape)

--> (2, 2, 224, 224)
As you can see, in the second example b is a new dictionary, so changes we make to a do not affect it.

Now let's see what happens to content itself:

content = np.ones((2, 2, 224, 224))
a = {0: content}
b = {**a}
a[0] *= 0  # in-place modification of the shared numpy array
print(b[0].sum())

--> 0

As you can see, while we created a new dictionary b that has its own key-value pairs, the actual value of content is modified in both dictionaries. That is because the value stored under key 0 is once again a reference to a numpy array, namely a reference to content (haha, gotcha). When creating b, b[0] becomes another reference to content, not a copy of content, so no additional memory is allocated. (If we were to use deepcopy then new memory would be allocated.)

That said, most of our transforms do not modify data in place, and therefore the original content in the original data_dict will most likely not be changed. I am saying 'most likely' on purpose: I know we try to avoid in-place operations, but I have to confess I don't know for sure. The reason we avoid in-place modification is that @dzimmerer does not like it. But honestly I think in-place would be better because we would save memory...
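To make the in-place distinction concrete (again just an illustration):

import numpy as np

# out of place: the transform builds a new array, so the caller's array is untouched
data_dict = {'data': np.ones((2, 2, 224, 224))}
new_dict = {**data_dict}
new_dict['data'] = new_dict['data'] * 2
print(data_dict['data'].max())

--> 1.0

# in place: the shared array itself is modified, so the caller sees the change too
data_dict = {'data': np.ones((2, 2, 224, 224))}
new_dict = {**data_dict}
new_dict['data'] *= 2
print(data_dict['data'].max())

--> 2.0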

Best,

Fabian


FabianIsensee commented on September 18, 2024

I may have to retract my previous statement. I think we do modify data and seg in place, so after going through an augmentation they will be modified in both the original and the new dictionary. I will have to check though; I'll get back to you about that later.


aharrison-paii commented on September 18, 2024

Thanks Fabian,

RE: pytorch dataloading and 3D volumes, I get them working together by creating a transform that loads data, e.g., one that can load a Nifti image. So I create a pytorch Dataset, and the first transform in the transform chain is the data loading, followed by any pre-processing transforms that I want, e.g., like those from batchgenerators. This dataset can then be provided as an argument to the pytorch dataloader, which handles batching and multi-processing. The pytorch dataloader is completely agnostic to how the data is loaded. I think it's a flexible approach, barring any possible inefficiencies.
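For illustration, a rough sketch of that setup (hypothetical names; the Nifti loading here assumes nibabel):

import numpy as np
from torch.utils.data import Dataset

class LoadNifti:
    # hypothetical loading transform: turns a path stored in the dict into an array
    def __call__(self, data_dict):
        import nibabel as nib  # assumed to be available
        data_dict['data'] = np.asarray(nib.load(data_dict['path']).dataobj, dtype=np.float32)
        return data_dict

class DictDataset(Dataset):
    # the dataset only knows about paths; loading and pre-processing live in the transform chain
    def __init__(self, paths, transforms):
        self.paths = paths
        self.transforms = transforms  # e.g. [LoadNifti(), SomeNormalize(), SomeCrop()]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        data_dict = {'path': self.paths[idx]}
        for t in self.transforms:
            data_dict = t(data_dict)
        return data_dict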


FabianIsensee commented on September 18, 2024

Erm. That's like Frankenstein's monster of data loading. If you need help implementing your batchgenerators dataloader just let me know =) We do multiprocessing etc, too.


aharrison-paii commented on September 18, 2024

That's a strong statement, I'd be interested to understand why you say that.

I can easily swap in and out different loaders if my dataset format changes for some reason, just as I can swap in and out different transforms. How the data loading is performed is now encapsulated in a self-contained module (in this case a transform that turns a path into a numpy array, for example), rather than being part of a dataset or dataloader, meaning the logic of the two is separated. What am I missing?

I am certainly comfortable using your batchgenerators dataloader for my purposes (as your repo is very nice and well documented), but I'm just trying to understand your philosophy before committing to a data loading pipeline that is not part of native pytorch.


FabianIsensee commented on September 18, 2024

Hey, sorry I did not mean to offend. I was just surprised that you would interleave both frameworks, given that any of these two would work by themselves. That just sounds to me like you are going through extra hassles.
But then again maybe your pipeline is a lot less complicated than I believe it to be :-)

Maybe we need a few extra examples for Dataloaders and pipelines in our repository to illustrate how easily and quickly things can be interchanged. Usually you would write your own dataloader for batchgenerators, which needs to be derived from SlimDataLoaderBase, and implement the logic for loading the data yourself. Ideally you implement a member function that loads a given index of your dataset. Then, if your dataset format changes, you just derive from your dataloader, override the loading member function and you're good to go.
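For reference, a minimal sketch of what that could look like (assuming the SlimDataLoaderBase interface, which stores the constructor's data argument as self._data and asks you to implement generate_train_batch):

import numpy as np
from batchgenerators.dataloading.data_loader import SlimDataLoaderBase

class RandomSampleLoader(SlimDataLoaderBase):
    # draws random samples indefinitely; swapping the dataset format only
    # requires overriding load_sample
    def __init__(self, sample_files, batch_size):
        super().__init__(sample_files, batch_size)

    def load_sample(self, sample_file):
        # replace this with your actual loading logic (npy, nifti, ...)
        return np.load(sample_file)

    def generate_train_batch(self):
        chosen = np.random.choice(self._data, self.batch_size, replace=True)
        data = np.stack([self.load_sample(f) for f in chosen])
        return {'data': data, 'sample_files': list(chosen)}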

In your setup, why do you need the pytorch dataloader at all? As far as I understand it the actual loading is done by some transform. Is it because you want to make use of the pytorch sampler?

Do you have some standard pytorch dataloader + pipeline that you could share? Ideally for something like MNIST or CIFAR10, so that we could implement the same setup in batchgenerators. That would make things a lot easier for new users.


FabianIsensee commented on September 18, 2024

I have found this:
https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/cifar10_tutorial.py

I will do some experiments soon-ish and post the results =)


aharrison-paii commented on September 18, 2024

It's ok, I'm not offended, I'm just trying to understand. The more information the better, even if it's counter to my thinking.

The idea is that a transform does not really have to be part of a framework. It can be found in the library we're writing up, it can be found in an outside source, e.g., batchgenerators, or it can be something that a user quickly wrote up; as long as it's programmed to modify a data_dict, it doesn't matter.

I agree that your approach of deriving from a base Dataloader is quite flexible and sensible. We are looking for even greater flexibility, however; that is, logic as self-contained and componentized as possible. Mixing and matching different data formats and/or transforms is as simple as mixing and matching the individual components within the transform chain. In essence, this makes dataloading as flexible as pre-processing: adding a pre-processing transform does not require deriving a new class, so why should the same not apply to dataloading?

We do not need a pytorch dataloader at all, in fact. That is why I was interested in learning more about your repo. But for now we use the pytorch dataloader because (1) it seamlessly batches and does multi-processing, so it's something we don't have to program ourselves; (2) any upgrades to pytorch dataloading will get propagated to us automatically should there be advances on their end; (3) we can be reasonably confident it will be well maintained for the near to mid-term; and (4) we use pytorch models anyway, so using pytorch frameworks for dataloading is natural.

I'm not sure what I'm allowed to share from our code, but I can probably write up and share on my personal github a simple Nifti dataloader that interfaces with your spatial transforms and is then fed to a pytorch dataloader. That could be tested on the Decathlon dataset, for example.

Thanks again for your responses


FabianIsensee commented on September 18, 2024

Now that you put it this way it makes more sense. Once again, I apologize for my previous response. That sounds like a pretty neat framework you are building up there. Given that you are using pytorch anyway, it totally makes sense to use their dataloader and batching infrastructure. We try to keep batchgenerators framework agnostic and that is pretty much the only reason I implemented this myself (and because way back when I started, pytorch didn't exist yet). I must acknowledge that the pytorch devs are way more knowledgeable than I am, so it is safe to assume that their way of doing multiprocessing is way more thorough than what I hacked together.

About the decathlon data: It would be nice to see what you are doing there, although I would like to issue a fair warning as well. If you are dealing with segmentation tasks, doing it the way you just described (loading nifti, preprocessing on the fly) will bring you a lot of headaches (unless your deep learning servers are also extremely CPU heavy). If you were to use the same preprocessing pipeline as I do, then imagine this: for each training patch (which in the case of Liver etc. would be way smaller than the entire image), you would have to 1) load the entire image (not just the patch; that's not possible as far as I know), 2) do resampling on a 3d image (which can be optimized to have runtime proportional to the number of voxels in the patch instead of the voxels in the entire image, but is still computationally expensive), 3) run intensity normalization. Worse, this process would be repeated every time you load a patch, so no computation is re-used and CPU time is wasted. Again, I don't know what you are doing, but if you were doing it this way then there would be a lot of room for improvement =)
(in all my projects I never work with nifti data, I always do the preprocessing first, then load the already preprocessed data)
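Schematically, the idea is just this (a sketch with hypothetical helper names):

import os
import numpy as np

def preprocess_dataset(case_ids, load_and_preprocess, out_dir):
    # resample/normalize every case exactly once and store the result as .npy,
    # so that training-time loading is just reading arrays from disk
    os.makedirs(out_dir, exist_ok=True)
    for case_id in case_ids:
        np.save(os.path.join(out_dir, case_id + '.npy'), load_and_preprocess(case_id))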

Over the next few days I will prepare an example to compare the performance of batchgenerators vs pytorch. I expect them to be pretty even, but it will be a good confirmation for us as well that we did not do anything wrong.


aharrison-paii commented on September 18, 2024

Totally on the same page with you regarding pre-processing on the fly vs ahead of time. One thing you can do to limit memory transfers is to load one volume, draw multiple windows from it, and then repeat with a different volume. This limits some of the randomness, but may not do so to a degree that's harmful. Even so, it does not address the heavy CPU demand of resampling that you outlined.
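Schematically (a sketch, assuming volumes stored as (C, X, Y, Z) arrays at least as large as the patch, and a hypothetical load_fn):

import numpy as np

def windows_per_volume(volume_paths, load_fn, patch_size, patches_per_volume=4):
    # load one volume at a time and draw several random patches from it before
    # moving on, trading some sampling randomness for fewer expensive volume loads
    while True:
        vol = load_fn(volume_paths[np.random.randint(len(volume_paths))])
        for _ in range(patches_per_volume):
            corner = [np.random.randint(0, s - p + 1) for s, p in zip(vol.shape[1:], patch_size)]
            slices = tuple(slice(c, c + p) for c, p in zip(corner, patch_size))
            yield vol[(slice(None),) + slices]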


FabianIsensee commented on September 18, 2024

Indeed. In the context of the decathlon I have talked to a number of participants and surprisingly many of them reported that they had difficulties getting their pipelines right. I think that a lot of what we did was just getting the entire infrastructure right.

I however would not run several patches coming from one patient. As you said, this reduces randomness. Of course I don't have hard data at hand but my gut feeling tells me this could cost you some dice points depending on how heterogeneous the dataset is.

The more I think about the pytorch dataloaders, the more I realize that we are using very different names for the same concepts.
Basically, what we call DataLoader in batchgenerators is a combination of Sampler and Dataset in pytorch. What pytorch calls DataLoader is what our MultiThreadedAugmenter is.


FabianIsensee commented on September 18, 2024

Alrighty. I did some initial experiments.
This is just CIFAR10, so the batches are very small in size.

I have two implementations for batchgenerators:

  • 'batchgenerators' uses the same structure as the pytorch dataloader, so there is a Dataset class where samples are loaded via __getitem__ and the transforms are applied by the Dataset class as well. Here I make use of the pytorch default_collate function to put samples into a batch. Multiprocessing is done by plugging this thing into the MultiThreadedAugmenter of batchgenerators. If you want to compare multiprocessing performance, compare this one to pytorch.
  • 'high performance batchgenerators': what I would typically do. I implemented a batchgenerators DataLoader class that loads the data and creates batches. It has prior knowledge about the dataset so I can implement it more efficiently. It uses the typical batchgenerators pipeline: DataLoader, Compose of transforms, MultiThreadedAugmenter

My pytorch dataloader is as follows:
I use the same dataset class as in 'batchgenerators', apply the transforms in the Dataset class as well, and use the same collate_fn. The only difference here is that I use the pytorch DataLoader to do the multiprocessing.

My benchmark runs as follows (I use the same pipeline for all loaders): I initialize the loader, iterate once over CIFAR10 (as warm-up), then do 3 epochs on CIFAR10 and record the time it takes. To put some load on the CPU I apply SpatialTransform to all samples.
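The timing scheme itself is nothing fancy; roughly this (a sketch, assuming each iteration over the loader yields one finite epoch, as with a pytorch DataLoader):

import time

def time_epochs(loader, num_epochs=3):
    # one full warm-up epoch so that worker start-up cost is not measured
    for _ in loader:
        pass
    start = time.time()
    for _ in range(num_epochs):
        for _ in loader:
            pass
    return time.time() - start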

Here are the results:

batch_size 50
num_workers 8
pin_memory True
num_epochs 3
batchgenerators took 21.3094 seconds
high performance batchgenerators 20.5960 seconds
pytorch took 23.6012 seconds

Effectively, batchgenerators and pytorch have pretty much the same speed in realistic workloads. 'high performance batchgenerators' is consistently a little faster, but not by much.
However, there is one scenario where pytorch has an edge: when RAM IO is the bottleneck (just loading batches from CIFAR10 at insane speed with no transforms except NumpyToTensor). Here pytorch is a little faster because it makes use of shared memory (and thus copies the data one time fewer than batchgenerators). This is not a realistic workload, but I thought I'd share it because I find it interesting.
If you are interested I can share the code I used to generate these results.

Best,
Fabian


aharrison-paii commented on September 18, 2024

Sorry for the delay in my response; I had a busy day yesterday.

Indeed. In the context of the decathlon I have talked to a number of participants and surprisingly many of them reported that they had difficulties getting their pipelines right. I think that a lot of what we did was just getting the entire infrastructure right.

Totally agreed. I read your arXiv paper and thought it was bang on, and also a really important contribution - there is too much "flim-flam". A lot of raw performance just comes from a good implementation, and without that, the performance improvements seen with architectural additions could just as well be Brownian motion.

I however would not run several patches coming from one patient. As you said, this reduces randomness. Of course I don't have hard data at hand but my gut feeling tells me this could cost you some dice points depending on how heterogeneous the dataset is.

I've experienced this to work out OK, but I haven't rigorously characterized it, and like you said, a lot of performance is dataset dependent. It would depend on how many you sample per volume, e.g., not all your samples per batch need to come from one volume; they can come from 2 or more, just with some repetitions.

The more I think about the pytorch dataloaders, the more I realize that we are using very different names for the same concepts.
Basically, what we call DataLoader in batchgenerators is a combination of Sampler and Dataset in pytorch. What pytorch calls DataLoader is what our MultiThreadedAugmenter is.

Yes, definitely.

Finally, thank you so much for those benchmarks! Very informative! Nice to see you meet or even edge out pytorch dataloading - so kudos.

I think this thread got sidelined from the initial topic of **kwargs. You may still consider moving to a positional argument for the sake of compatibility, but that'll have to depend on your own needs of course. Shall I go ahead and close this issue?


FabianIsensee commented on September 18, 2024

Sorry for the delay in my response; I had a busy day yesterday.

No worries, I was quite busy yesterday myself. In fact, this discussion actually made me delve deeper into pytorch dataloaders and understand how they do it. Most importantly, it made me realize that some functionality was still missing from batchgenerators, such as simply iterating over a dataset in a multithreaded environment. I had something like that implemented in the examples previously, but as I never used it, it was quite clunky and not straightforward to use. That has now changed, but it took some time to get right (see DataLoader in my development branch; documentation still missing).

I think this thread got sidelined from the initial topic of **kwargs

In the context of my previous statement, I don't mind some off-topic discussion at all. In fact I welcome it, as it motivates me to go ahead and learn new things I would otherwise not have learned, so thank you very much for that! Also, it is not like there are 100 other people here that we would be distracting, so I don't mind it at all.

It has to be said that me matching/edging out pytorch must be interpreted in the context of the experiment I did. What I compared is the performance of batchgenerators multiprocessing vs pytorch multiprocessing by marginalizing everything else (same data loader, same dataset, same transform). There are several things I did not compare such as
a) the performance of individual transforms (batchgenerators implementation vs pytorch implementation of, say, a rotation)
b) the performance if the return value is just a numpy array, not a pytorch tensor (pytorch's default_collate will always return a torch tensor; in batchgenerators you can choose. If I return just a numpy array in batchgenerators I can get a lot faster in some scenarios).
c) I only tested what I believe to be a realistic scenario where some load is put on the CPU. If I don't do that, the whole speed test collapses down to which framework does fewer copies of the data in RAM, which is pytorch in the setting where a torch tensor is returned, and batchgenerators in the other setting.

You can have a look at my benchmark code in /examples/cifar. I changed a lot of things last night and have not rerun the benchmark since, so results may differ.

Totally agreed. I read your arXiv paper and thought it was bang on, and also a really important contribution - there is too much "flim-flam". A lot of raw performance just comes from a good implementation, and without that, the performance improvements seen with architectural additions could just as well be Brownian motion.

Thank you so much for this statement. That is exactly what I intended to achieve! There is more merit to the Brownian motion argument than you may think ;-) I do not trust any segmentation paper anymore that does not beat the state of the art in some reputable benchmark. I have read way too many papers that provide insufficient experimental results but at the same time claim to have found the holy grail of image segmentation. If you know what the variability of an experiment on some dataset is (such as the Lung task where you can get +- 5 dice points on some split of the training data by just running an unseeded experiment twice) you can show that any arbitrary architectural modification gives an improvement if you just run enough experiments (and if you don't do statistical tests with correction for multiple hypotheses etc).

You may still consider moving to a positional argument for the sake of compatibility, but that'll have to depend on your own needs of course.

I have insufficient experience with pytorch transforms/Dataset/DataLoaders. What would be required to be compatible with them?

Best,
Fabian


extragoya commented on September 18, 2024

Commenting now with my personal account, I didn't realize I was using my work account, and we are moving away from that domain anyway, so it will soon be defunct.

Thanks for your clarifications about your experiments, I think they are informative as I was interested in pitting your Dataloader (in pytorch nomenclature) vs. the pytorch one, without the confounding factors of different transform implementations.

To make your batchgenerators transforms compatible with pytorch transforms and the pytorch Compose function, all you would have to do is change the **data_dict argument to data_dict. In fact, I've already done that with your code and have mixed and matched it with other transforms I've created plus other pytorch transforms.

Again, this is just the transforms, which I think should be orthogonal to the Dataset and DataLoaders. With that change, I believe you can inter-operate just as readily with a pytorch Dataset/DataLoader, or your own MultiThreadedAugmenter, or however else you want to load data and batch it, as these downstream components don't care how the data was transformed (or whether it was unpacked); they just want a dict.
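Alternatively, instead of editing the transforms themselves, the calling-convention difference can be bridged with a tiny adapter (a sketch, hypothetical class name):

class KwargsToDictAdapter:
    # wraps a batchgenerators-style transform, called as t(**data_dict), so it can be
    # used where a torchvision-style Compose expects a callable of the form t(data_dict)
    def __init__(self, bg_transform):
        self.bg_transform = bg_transform

    def __call__(self, data_dict):
        return self.bg_transform(**data_dict)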


SouthMemory commented on September 18, 2024

Great work and great issue!
