Comments (15)
Not sure if increasing the memory would be the best solution. For instance, having your code example we could probably exchange this line:
$c = Collection::fromIterable($input)->shuffle();
for
$c = Collection::fromIterable($input);
$toShuffle = $c->all();
shuffle($toShuffle);
$c = Collection::fromIterable($toShuffle);
and it will work without reaching the memory limit.
(assuming the last 3 lines of code are encapsulated within a method of a class, and $c
is a parameter of that method)
I've made such change as above in my production code, and it has confirmed my guesses. It seems the implementation of shuffle(...)
from RandomIterableAggregate
somehow takes a lot of unneeded memory 🤔.
from collection.
Closing the issue, version 7.5.1 has been released and fix the issue.
Thank you all for your participation!
from collection.
Nice work coming up with that implementation! Sounds like the performance gains would be definitely something desirable. If you don't see any downsides we should definitely go with it.
OK ! I'll push the PR further ASAP.
An additional approach would be enhance the Collection::shuffle method with a parameter that allows using PHP's shuffle function instead - either a boolean or a closure. However, we could also argue that this might not be necessary, since for a very large collection the user can still transform to array and apply the function on that.
Your suggestion is precisely what I expected, and I believe you're already familiar with my stance on such modifications. While integrating this feature might improve performance, it raises a question: if we do this for shuffle
, why not apply the same logic to other operations like map
, filter
, etc...? This is a path I'm hesitant to go down. Implementing these changes for one method might set a precedent for others, leading us into a cycle of continuous modifications, exceptions and added complexity, which I believe is best avoided. Using loophp/collection
does affect performance, but its fully lazy nature is a trade-off we've consciously chosen. Adding such features would definitely increase the complexity of our codebase, something I'm actively working to reduce at all cost. I hope you understand my position on this matter. While I'm open to further discussion, my current inclination is to maintain our existing approach without introducing these kinds of specific alterations.
from collection.
Hello,
In this scenario, it appears that the problem may not be entirely due to the Collection itself.
When using operations such as shuffle
, count
and all
, the entire collection needs to be evaluated all at once. This approach negates the advantages of lazy evaluation, which is likely the primary cause of the Fatal error
you're encountering.
After making some tests:
<?php
declare(strict_types=1);
namespace App;
use loophp\collection\Collection;
include __DIR__ . '/vendor/autoload.php';
class MyEntity {
public function __construct(
public int $id,
) {
}
};
$input = [];
for ($i = 0; $i < 2000; ++$i) {
$input[] = new MyEntity(id: $i);
}
$c = Collection::fromIterable($input)->shuffle();
foreach ($c as $v) {
var_dump($v);
}
It looks like this also fails.
I guess that's simply because 128M is not enough?
Efficiently shuffling a collection requires keeping track of items in an array(source: https://github.com/loophp/iterators/blob/main/src/RandomIterableAggregate.php), I guess that's why we have that behaviour. In this case, you have no other choice than raising the memory.
Using the PHP shuffle
method requires using an array, what we try to avoid when using Collection. The implementation of shuffle
in PHP is unknown to me, but the one I implemented in RandomIterableAggregate
is Fisher-Yates.
from collection.
Do you see a better way to fix RandomIterableAggregate
? I'm definitely willing to improve it if you have ideas.
from collection.
Honestly, I didn't check it. I merely just recently found out what exactly is crashing my production code and how to fix it 😅.
from collection.
It would be nice to have a bit of help on this. I did my best implementing the Iterator aggregates, but there might be room for improvements.
from collection.
I had a little look at this as well 🎲 .
Indeed it seems that in comparison with the PHP shuffle()
function the one from Collection supports way fewer elements before running into the memory issue. However, even with just the PHP shuffle and using array
only, you still get into memory trouble around 1.5 million elements - so this should be expected.
Regarding the implementation, from what I've read the shuffle()
function also uses the Fisher-Yates algo.
One potential cause could be that the Collection implementation uses recursion: https://github.com/loophp/iterators/blob/main/src/RandomIterableAggregate.php#L61.
The PHP implementation is a bit more complex but doesn't seem to use recursion: https://github.com/php/php-src/blob/765382eeeaffee74e0f0f677ccf5b3504abe337a/ext/standard/array.c#L3107.
I see a couple of routes to take:
a) refactor the RandomIterableAggregate to not use recursion
b) refactor the RandomIterableAggregate to use the PHP shuffle()
function internally -> I noticed this is what other PHP collection implementations are doing; even for "lazy" collections they transform to array, use the function, then transform back to a Collection
@drupol what do you think?
Also @jazithedev would you be up for improving the implementation with the above if we agree on the approach?
from collection.
Hi Alex!
Good to see you around and have your POV on this.
a) I'm aware that most probably the memory issue comes from recursion, I've already tried to look for a replacement to this, but as of today I haven't found a better way to do this.
b) I'm trying to avoid converting things back and forth (this is a nonsense in a lazy collection library), but I have to admit that for doing a valid shuffling, there's no other option.
TBH, at this point, I don't know yet what is the best path here.
I guess using shuffle
with an array will be the faster solution since we don't do the shuffling in userland, maybe this worth a try... are you willing to help on this?
from collection.
I've just published a draft PR refactoring the RandomIterableAggregate
and removing recursion, while still implementing FisherYates.
Using the following test:
<?php
declare(strict_types=1);
namespace App;
use loophp\collection\Collection;
include __DIR__ . '/vendor/autoload.php';
class MyEntity
{
public function __construct(
public int $id,
) {}
}
$input = function (int $size) {
for ($i = 0; $size > $i; ++$i) {
yield new MyEntity(id: $i);
}
};
$size = 100000;
$c = Collection::fromCallable($input, [$size])->shuffle();
foreach ($c as $v) {}
That's the one you posted on the first post, I can go up to 80000
items ! (Using 128Mb and PHP 8.2)
That's already a huge improvement!
However, could you please have a look and let me know what you think of it?
from collection.
I'll try to check all out tomorrow 🙂.
from collection.
I've just published a draft PR refactoring the
RandomIterableAggregate
and removing recursion, while still implementing FisherYates.Using the following test:
<?php declare(strict_types=1); namespace App; use loophp\collection\Collection; include __DIR__ . '/vendor/autoload.php'; class MyEntity { public function __construct( public int $id, ) {} } $input = function (int $size) { for ($i = 0; $size > $i; ++$i) { yield new MyEntity(id: $i); } }; $size = 150000; $c = Collection::fromCallable($input, [$size])->shuffle(); foreach ($c as $v) {}
That's the one you posted on the first post, I can go up to
80000
items ! (Using 128Mb and PHP 8.2) That's already a huge improvement!However, could you please have a look and let me know what you think of it?
Hey @drupol,
Nice work coming up with that implementation! Sounds like the performance gains would be definitely something desirable. If you don't see any downsides we should definitely go with it.
An additional approach would be enhance the Collection::shuffle
method with a parameter that allows using PHP's shuffle
function instead - either a boolean or a closure. However, we could also argue that this might not be necessary, since for a very large collection the user can still transform to array and apply the function on that.
from collection.
It should be fixed on master
branch, please feel free to test and report some feedback
from collection.
Hey @drupol. I've tested the newest branch and indeed it does not timeout for bigger collections now 🙂. I apologize I could not help more 🙁.
from collection.
That's so great to see! Going to cut a bugfix release this morning.
from collection.
Related Issues (20)
- `Partition` Operation - Awkward to use? HOT 18
- Add find/search/where/single/firstWhere method HOT 23
- Modify `all` operation to prevent data loss HOT 6
- PHPStan 1.0 upgrade HOT 1
- Typed collection support HOT 17
- API oddities HOT 6
- Reduction operations should return a single value HOT 9
- Issue with cache and fromCallable HOT 13
- PHPStan reporting an error for missing optional parameters HOT 6
- [Question] Rename Collection interface to CollectionInterface HOT 1
- Dependency Dashboard HOT 54
- Collection interface doesn't extend Countable HOT 9
- Weird interplay between Collection and PDO result set HOT 13
- Plus operation RFC HOT 10
- Palm cannot infer types when using some operations HOT 4
- Unexpected behavior of pair operation over empty collection HOT 1
- [Feature request] Implement stable sorting HOT 12
- Issue with distinct HOT 3
- Distinct is slow HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from collection.