Hello 👋. Let's assume I have an entity: <div class="highlight h

Memory size exhausted for large collections about collection HOT 15 CLOSED

loophp commented on September 15, 2024 3

Memory size exhausted for large collections

from collection.

Comments (15)

jazithedev commented on September 15, 2024 2

Not sure if increasing the memory would be the best solution. For instance, having your code example we could probably exchange this line:

$c = Collection::fromIterable($input)->shuffle();

for

$c = Collection::fromIterable($input);
$toShuffle = $c->all();
shuffle($toShuffle);
$c = Collection::fromIterable($toShuffle);

and it will work without reaching the memory limit.
(assuming the last 3 lines of code are encapsulated within a method of a class, and $c is a parameter of that method)

I've made such change as above in my production code, and it has confirmed my guesses. It seems the implementation of shuffle(...) from RandomIterableAggregate somehow takes a lot of unneeded memory 🤔.

from collection.

drupol commented on September 15, 2024 2

Closing the issue, version 7.5.1 has been released and fix the issue.

Thank you all for your participation!

from collection.

drupol commented on September 15, 2024 1

Nice work coming up with that implementation! Sounds like the performance gains would be definitely something desirable. If you don't see any downsides we should definitely go with it.

OK ! I'll push the PR further ASAP.

An additional approach would be enhance the Collection::shuffle method with a parameter that allows using PHP's shuffle function instead - either a boolean or a closure. However, we could also argue that this might not be necessary, since for a very large collection the user can still transform to array and apply the function on that.

Your suggestion is precisely what I expected, and I believe you're already familiar with my stance on such modifications. While integrating this feature might improve performance, it raises a question: if we do this for shuffle, why not apply the same logic to other operations like map, filter, etc...? This is a path I'm hesitant to go down. Implementing these changes for one method might set a precedent for others, leading us into a cycle of continuous modifications, exceptions and added complexity, which I believe is best avoided. Using loophp/collection does affect performance, but its fully lazy nature is a trade-off we've consciously chosen. Adding such features would definitely increase the complexity of our codebase, something I'm actively working to reduce at all cost. I hope you understand my position on this matter. While I'm open to further discussion, my current inclination is to maintain our existing approach without introducing these kinds of specific alterations.

from collection.

drupol commented on September 15, 2024

Hello,

In this scenario, it appears that the problem may not be entirely due to the Collection itself.

When using operations such as shuffle, count and all, the entire collection needs to be evaluated all at once. This approach negates the advantages of lazy evaluation, which is likely the primary cause of the Fatal error you're encountering.

After making some tests:

<?php

declare(strict_types=1);

namespace App;

use loophp\collection\Collection;

include __DIR__ . '/vendor/autoload.php';

class MyEntity {
    public function __construct(
        public int $id,
    ) {
    }
};

$input = [];

for ($i = 0; $i < 2000; ++$i) {
    $input[] = new MyEntity(id: $i);
}

$c = Collection::fromIterable($input)->shuffle();

foreach ($c as $v) {
    var_dump($v);
}

It looks like this also fails.

I guess that's simply because 128M is not enough?

Efficiently shuffling a collection requires keeping track of items in an array(source: https://github.com/loophp/iterators/blob/main/src/RandomIterableAggregate.php), I guess that's why we have that behaviour. In this case, you have no other choice than raising the memory.

Using the PHP shuffle method requires using an array, what we try to avoid when using Collection. The implementation of shuffle in PHP is unknown to me, but the one I implemented in RandomIterableAggregate is Fisher-Yates.

from collection.

drupol commented on September 15, 2024

Do you see a better way to fix RandomIterableAggregate ? I'm definitely willing to improve it if you have ideas.

from collection.

jazithedev commented on September 15, 2024

Honestly, I didn't check it. I merely just recently found out what exactly is crashing my production code and how to fix it 😅.

from collection.

drupol commented on September 15, 2024

It would be nice to have a bit of help on this. I did my best implementing the Iterator aggregates, but there might be room for improvements.

from collection.

AlexandruGG commented on September 15, 2024

I had a little look at this as well 🎲 .

Indeed it seems that in comparison with the PHP shuffle() function the one from Collection supports way fewer elements before running into the memory issue. However, even with just the PHP shuffle and using array only, you still get into memory trouble around 1.5 million elements - so this should be expected.

Regarding the implementation, from what I've read the shuffle() function also uses the Fisher-Yates algo.

One potential cause could be that the Collection implementation uses recursion: https://github.com/loophp/iterators/blob/main/src/RandomIterableAggregate.php#L61.

The PHP implementation is a bit more complex but doesn't seem to use recursion: https://github.com/php/php-src/blob/765382eeeaffee74e0f0f677ccf5b3504abe337a/ext/standard/array.c#L3107.

I see a couple of routes to take:
a) refactor the RandomIterableAggregate to not use recursion
b) refactor the RandomIterableAggregate to use the PHP shuffle() function internally -> I noticed this is what other PHP collection implementations are doing; even for "lazy" collections they transform to array, use the function, then transform back to a Collection

@drupol what do you think?

Also @jazithedev would you be up for improving the implementation with the above if we agree on the approach?

from collection.

drupol commented on September 15, 2024

Hi Alex!

Good to see you around and have your POV on this.

a) I'm aware that most probably the memory issue comes from recursion, I've already tried to look for a replacement to this, but as of today I haven't found a better way to do this.
b) I'm trying to avoid converting things back and forth (this is a nonsense in a lazy collection library), but I have to admit that for doing a valid shuffling, there's no other option.

TBH, at this point, I don't know yet what is the best path here.

I guess using shuffle with an array will be the faster solution since we don't do the shuffling in userland, maybe this worth a try... are you willing to help on this?

from collection.

drupol commented on September 15, 2024

I've just published a draft PR refactoring the RandomIterableAggregate and removing recursion, while still implementing FisherYates.

Using the following test:

<?php

declare(strict_types=1);

namespace App;

use loophp\collection\Collection;

include __DIR__ . '/vendor/autoload.php';

class MyEntity
{
    public function __construct(
        public int $id,
    ) {}
}

$input = function (int $size) {
    for ($i = 0; $size > $i; ++$i) {
        yield new MyEntity(id: $i);
    }
};

$size = 100000;
$c = Collection::fromCallable($input, [$size])->shuffle();

foreach ($c as $v) {}

That's the one you posted on the first post, I can go up to 80000 items ! (Using 128Mb and PHP 8.2)
That's already a huge improvement!

However, could you please have a look and let me know what you think of it?

from collection.

jazithedev commented on September 15, 2024

I'll try to check all out tomorrow 🙂.

from collection.

AlexandruGG commented on September 15, 2024

I've just published a draft PR refactoring the RandomIterableAggregate and removing recursion, while still implementing FisherYates.

Using the following test:
<?php

declare(strict_types=1);

namespace App;

use loophp\collection\Collection;

include __DIR__ . '/vendor/autoload.php';

class MyEntity
{
    public function __construct(
        public int $id,
    ) {}
}

$input = function (int $size) {
    for ($i = 0; $size > $i; ++$i) {
        yield new MyEntity(id: $i);
    }
};

$size = 150000;
$c = Collection::fromCallable($input, [$size])->shuffle();

foreach ($c as $v) {}
That's the one you posted on the first post, I can go up to 80000 items ! (Using 128Mb and PHP 8.2) That's already a huge improvement!

However, could you please have a look and let me know what you think of it?

Hey @drupol,

Nice work coming up with that implementation! Sounds like the performance gains would be definitely something desirable. If you don't see any downsides we should definitely go with it.

An additional approach would be enhance the Collection::shuffle method with a parameter that allows using PHP's shuffle function instead - either a boolean or a closure. However, we could also argue that this might not be necessary, since for a very large collection the user can still transform to array and apply the function on that.

from collection.

drupol commented on September 15, 2024

It should be fixed on master branch, please feel free to test and report some feedback

from collection.

jazithedev commented on September 15, 2024

Hey @drupol. I've tested the newest branch and indeed it does not timeout for bigger collections now 🙂. I apologize I could not help more 🙁.

from collection.

drupol commented on September 15, 2024

That's so great to see! Going to cut a bugfix release this morning.

from collection.

Memory size exhausted for large collections about collection HOT 15 CLOSED

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent