Git Product home page Git Product logo

Comments (12)

weavejester avatar weavejester commented on July 19, 2024 1

I should be able to do it. The hard part is not the code, now, but the tests :)

from medley.

weavejester avatar weavejester commented on July 19, 2024

That's not a bad idea. I'd change the default initializer, though, and replace get with find in order to account for nil values.

(defn group-with
  ([kf cf coll]
   (group-with kf cf (cf) coll))
  ([kf cf init coll]
   (reduce (fn [m v]
             (let [k (kf v)]
               (assoc m k (if-some [kv (find m k)]
                            (cf init (val kv))
                            init))))
           {}
           coll)))

The reason is that aggregation functions typically have a initial value that can be determined by calling the function with zero arguments:

(conj) => []
(+) => 0
(*) => 1

I also wonder, should the kf be before or after cf?

(group-with first conj [] coll)
;; or
(group-with conj [] first coll)

I guess following transduce it would make sense for init to be third, which would mean cf would be second, but I'd be interested in hearing your opinion.

from medley.

maxrothman avatar maxrothman commented on July 19, 2024

replace get with find in order to account for nil values

Agreed, that's an improvement.


change the default initializer...aggregation functions typically have a initial value that can be determined by calling the function with zero arguments

While I see the elegance in that approach, I think it would limit the utility of the function. Several of the examples in the top post would no longer work (or would require more complicated collation functions to work), such as the following:

;; Pick winner based on max-key
(group-with first (partial max-key count) data)

;; Pick first found rather than last found like index-by
(group-with first (fn [a _] a) data)

It's true that in most functions with multiple arities, the shorter arities act as shorthands for the longest one, but the proposed function more follows the model of reduce, where the arities have slightly different functionality. One way to think about it is that the arity-4 form performs aggregation while the arity-3 form performs selection.

Though the arity-2 form of reduce is often maligned, I think in this case it makes sense to follow that model. The reason reduce/2 is "bad" is because apply can do everything it does better because all of the interesting binary-ish functions take varargs. With group-with/3, the collation function is a little like a comparator: it takes 2 things of the same type and picks one. It's more like a classical binary reduction (reduce op [x1 x2 x3 ...] -> x1 op x2 op x3 op ...)


I also wonder, should the kf be before or after cf?

I haven't thought about it deeply, but to me it seems natural for cf to be next to init, and for init to be next-to-last, which implies the order kf cf init coll. Which of these feels more natural to you?

  • Index by first, collating by conj with initial value []
  • Collate using conj, index by first, for collation use the initial value []

Even disregarding the awkwardness of having to say collate twice, I feel like the first matches my mental information hierarchy better.


Finally, I'm interested in your thoughts on what to call this function. group-with? index-with? collate-by?

from medley.

weavejester avatar weavejester commented on July 19, 2024

I've given it some thought, and I believe you're right regarding the 3-arity form. The parallels between the 3-arity group-with and the 2-arity reduce make sense.

Regarding what to call it, group-with makes some sense. group-by-with would perhaps be more accurate but doesn't read well. We could also call it collate or collate-by - the latter you suggested as well. My inclination is to call it collate-by in order to fit with group-by and index-by, while giving the impression of a more generate purpose function.

from medley.

maxrothman avatar maxrothman commented on July 19, 2024

I agree collate-by is an accurate description, and correctly puts it in a category with index-by and group-by. My only hesitation is, do most potential users know what "collate" means, and do they associate it with the problem they would use the function to solve? Like, if what I knew I was looking for was "a more general group-by" and I ctrl+f'd medley's docs, that wouldn't necessarily turn up collate-by. But is this a real issue? ¯\_(ツ)_/¯

So what's the next step, should I put a PR together? I confess I don't have a ton of experience with optimizing functions like this, but I'm open to learning.

from medley.

weavejester avatar weavejester commented on July 19, 2024

If we add "group-by" to the docstring, it would turn up the right function if you Ctrl-F'd the docs.

Feel free to put together a PR. Take a quick look over the CONTRIBUTING.md document if you can - it's not very long.

In terms of optimising, reduce is an efficient iterator, so the only efficiency gain I can see beyond that is to use a transient map:

(defn- collate-by* [kf cf initf coll]
  (persistent!
   (reduce (fn [m v]
             (let [k (kf v)]
               (assoc! m k (if-let [kv (find m k)] 
                             (cf (val kv) v)
                             (initf v)))))
           (transient {}) 
           coll)))

(defn collate-by
  ([kf cf coll]
   (collate-by* kf cf identity coll))
  ([kf cf init coll]
   (collate-by* kf cf #(cf init %) coll)))

However, now that I write that out, I'm wondering if having an initf function instead of an init value would be a more general purpose function. It doesn't follow the pattern of reduce, but unlike reduce we can't easily get the first element of each group. That is, we know that (reduce f coll) is the same as (reduce f (first coll) (rest coll)), but we can't easily do this for collate-by.

Transducing functions do have a 1 argument form, i.e. (conj x) => x , but in a transducer this is for being called on the final result, rather than the initial value, so the purpose is different.

So would this be better?

(defn collate-by
  ([keyf collatef coll]
    (collate-by keyf collatef identity coll))
  ([keyf collatef initf coll]
   (persistent!
    (reduce (fn [m v]
              (let [k (keyf v)]
                (assoc! m k ((if-let [kv (find m k)] 
                              (collatef (val kv) v)
                              (initf v)))))
            (transient {}) 
            coll))))

The advantage of this design is that the 3-arity functionality is a subset of the 4-arity functionality, and it potentially opens up more use-cases (though I'm unsure what they would be).

The disadvantage is that you'd need to write:

(collate-by kf cf #(cf init %) coll)

But most collections have functions for this already, e.g.

(collate-by kf conj vector coll)    ;; group in vectors
(collate-by kf conj hash-set coll)  ;; group in sets
(collate-by kf conj list coll)      ;; group in lists

from medley.

maxrothman avatar maxrothman commented on July 19, 2024

If we add "group-by" to the docstring, it would turn up the right function if you Ctrl-F'd the docs.

Good point, we should do the same with index-by too.

However, now that I write that out, I'm wondering if having an initf function instead of an init value would be a more general purpose function...

To me it seems like the tradeoff here is between familiarity/clarity and generality. On one hand, people already know how reduce works, and [] is a little easier on the eyes than vector. Also, not having to use cf in two places makes it easier to use anonymous functions. On the other, an init function is more general.

My instinct says that though the initf approach is more general, if we can't come up with a single usecase for it then the demand for it would be pretty rare, which tips the balance towards familiarity/clarity in my view. The only thing I could come up with was that if there was some collection from a library or something that had no 0-arity constructor, then it would be useful. But that seems pretty exotic, and in every other case, using the 0-arity constructor would suffice.

from medley.

weavejester avatar weavejester commented on July 19, 2024

I'll give it some thought - there's no need for a hasty decision - but I'm currently leaning toward initf.

In either case we'll need a function that uses initf, as that's the common functionality between the two arities. The only question is whether or not we make that function public, or wrap it in another function that replaces initf with init.

My inclination is to avoid wrappers that make the API more familiar, at the cost of distancing ourselves from the underlying implementation. This seems a case where initf would be "simple" and init would be "easy", and in general we should favor the former over the latter.

from medley.

maxrothman avatar maxrothman commented on July 19, 2024

I was reminded of this issue today, and figured i'd check in! Shall we get the ball rolling again?

from medley.

maxrothman avatar maxrothman commented on July 19, 2024

Having thought more about the initf vs init question, I realized one material advantage of initf over init: it supports collating into mutable types (e.g. ArrayList, JS arrays, etc.). I think the use case would be rare, but when you needed it, it would be nice to have it there.

On the other hand, the major inconvenience of having to write out #(cf init %) I think would come not from writing it out per se, but from having to bind cf to a name so you could reference it in the cf and initf args. I do think that inconvenience is significant though, since it forces you to bind an otherwise small anonymous fn to a name and add another level of nesting to the code.

So I think the important question is: how often would it come up? If a type has a constructor that takes at least one argument, then you can always use the constructor rather than writing out an anonymous fn. All Clojure types have constructors like that (even numbers via * and +, yay monoids), and the only one I've seen in the wild that doesn't is this interval set library, which provides empty and mark fns but no constructor for non-empty sets. There very well might be more, but they're likely to also be somewhat exotic library-provided types.

So this leaves us with a tradeoff between an uncommon use case and another uncommon use case, but at least now I think the tradeoffs are concrete.

I think there are also two possible compromises in addition to the previously discussed approaches:

  • Check the value of init with fn? and wrap it in a fn if it isn't already in one. This could possibly create a moment of confusion if someone tried to aggregate fns (e.g. with comp) but this seems so exotic as to not be significant. It's slightly magic, but maybe not so bad?
  • Use kwargs, i.e. :init [] or :initf vector. The newish unified call syntax makes this really nice to do, though users of pre-1.11 Clojure wouldn't have that nicety. I think there's also a midway version of this approach: have a 3-arity that accepts an :initf kwarg and a 4-arity with a static init arg. That would preserve the convenience of a regular old init arg for the vast majority of usecases, and let users break out the initf kwarg in the small number of cases where it was actually needed.

Thoughts?

from medley.

weavejester avatar weavejester commented on July 19, 2024

Thanks for the update, and apologies for the delay in getting back to you. After giving it some thought, my preference leans toward initf implementation. This may be less convenient under certain circumstances, but under most use-cases there should be no difference, and it's the simpler implementation - insofar that the init version would essentially be a wrapper over the initf version. I hadn't considered that it would work better with mutable collections, but you're right, and that does seem like another minor point in favour of initf.

from medley.

maxrothman avatar maxrothman commented on July 19, 2024

Sounds good, do you want me to submit a PR or would you prefer to do it? I think the code is pretty much already in this issue.

from medley.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.