Git Product home page Git Product logo

Comments (13)

ghislainfourny avatar ghislainfourny commented on September 27, 2024 1

Reproduced on a smaller dataset:

for $i in parallelize((
    { "commits" : [ { "author" : "Einstein" } ]},
    { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ]}
  ))
  let $k := $i.commits[].author
  group by $r := 1
  return {"repo":$r, "count":count($k)}

from rumble.

ghislainfourny avatar ghislainfourny commented on September 27, 2024 1

The fix is on the way.

#1162

In fact it was something different: the counts were correctly computed, but not marked as counts, so they were treated as items and "the counts were counted again" (to 1).

from rumble.

wzrain avatar wzrain commented on September 27, 2024

This info log also needs to be changed: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/GroupByClauseSparkIterator.java#L576. It can be very misleading due to the incorrect clause name printed.

from rumble.

ghislainfourny avatar ghislainfourny commented on September 27, 2024

@wzrain many thanks for your insights. This is now fixed and will make it into the next release.

from rumble.

wzrain avatar wzrain commented on September 27, 2024

@ghislainfourny Thanks for the quick response on the issue! Although now the query I mentioned in my original comment:

for $i in json-file("git-archive.json") 
let $k := $i.payload.commits[].author 
group by $r := $i.repo.id 
order by count($k) descending 
count $c where $c <= 5 
return {"repo":$r, "count":count($k)}

leads to a new error when tested on the latest master branch:

[ERROR] An error has occurred: cannot resolve '`k.sequence.sequence`' given input columns: [__auto_generated_subquery_name.grouping_columns, __auto_generated_subquery_name.i, __auto_generated_subquery_name.k.sequence, __auto_generated_subquery_name.r.sequence]; line 1 pos 23;
'Aggregate [grouping_columns#10], ['sum('cardinality('`k.sequence.sequence`)) AS k.sequence#11, first(r.sequence#6, false) AS r.sequence#13]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [i#1, k.sequence#3, r.sequence#6, createGroupingColumns(struct(r.sequence, r.sequence#6)) AS grouping_columns#10]
      +- SubqueryAlias input5d89821c16cc497a8b7a7789987b4434
         +- Project [i#1, k.sequence#3, letClauseUDF(struct(i, i#1)) AS r.sequence#6]
            +- SubqueryAlias inputc0a2110f1ad6434ea979e0156fec7463
               +- Project [i#1, letClauseUDF(struct(i, i#1)) AS k.sequence#3]
                  +- SubqueryAlias inputc5128f29cf45433eaae533dfa6d619e2
                     +- LogicalRDD [i#1], false

We should investigate this 🙈. Please contact us or file an issue on GitHub with your query. 

from rumble.

wzrain avatar wzrain commented on September 27, 2024

@ghislainfourny Thanks for the quick response on the issue! Although now the query I mentioned in my original comment:

for $i in json-file("git-archive.json") 
let $k := $i.payload.commits[].author 
group by $r := $i.repo.id 
order by count($k) descending 
count $c where $c <= 5 
return {"repo":$r, "count":count($k)}

leads to a new error when tested on the latest master branch:

[ERROR] An error has occurred: cannot resolve '`k.sequence.sequence`' given input columns: [__auto_generated_subquery_name.grouping_columns, __auto_generated_subquery_name.i, __auto_generated_subquery_name.k.sequence, __auto_generated_subquery_name.r.sequence]; line 1 pos 23;
'Aggregate [grouping_columns#10], ['sum('cardinality('`k.sequence.sequence`)) AS k.sequence#11, first(r.sequence#6, false) AS r.sequence#13]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [i#1, k.sequence#3, r.sequence#6, createGroupingColumns(struct(r.sequence, r.sequence#6)) AS grouping_columns#10]
      +- SubqueryAlias input5d89821c16cc497a8b7a7789987b4434
         +- Project [i#1, k.sequence#3, letClauseUDF(struct(i, i#1)) AS r.sequence#6]
            +- SubqueryAlias inputc0a2110f1ad6434ea979e0156fec7463
               +- Project [i#1, letClauseUDF(struct(i, i#1)) AS k.sequence#3]
                  +- SubqueryAlias inputc5128f29cf45433eaae533dfa6d619e2
                     +- LogicalRDD [i#1], false

We should investigate this 🙈. Please contact us or file an issue on GitHub with your query. 

Reproducible with the smaller dataset:

for $i in parallelize((
    { "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
    { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
let $k := $i.commits[].author
group by $r := $i.repo
return {"repo":$r, "count":count($k)}

from rumble.

ghislainfourny avatar ghislainfourny commented on September 27, 2024

Fixed in #1159

@wzrain feel free to close after confirming that the fix works for you.

from rumble.

wzrain avatar wzrain commented on September 27, 2024

@ghislainfourny Thanks for the follow-up. Although I don't really agree that the queries produce the expected result. For the query on the smaller dataset mentioned above:

for $i in parallelize((
    { "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
    { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
let $k := $i.commits[].author
group by $r := $i.repo
return {"repo":$r, "count":count($k)}

I think it should return

{ "repo" : "r1", "count" : 2 }
{ "repo" : "r2", "count" : 1 }

instead it returns

{ "repo" : "r1", "count" : 1 }
{ "repo" : "r2", "count" : 1 }

from rumble.

ghislainfourny avatar ghislainfourny commented on September 27, 2024

You are right. I will reiterate. I think I know where the issue is: since $k is recognized as a count-only, the column for $k is incorrectly initialized with constant 1s but it should be initialized with the pre-group $k count.

from rumble.

wzrain avatar wzrain commented on September 27, 2024

You are right. I will reiterate. I think I know where the issue is: since $k is recognized as a count-only, the column for $k is incorrectly initialized with constant 1s but it should be initialized with the pre-group $k count.

Exactly. This is actually what this issue is initially about. :)

from rumble.

wzrain avatar wzrain commented on September 27, 2024

@ghislainfourny Thanks for the fix! The CountFix branch looks good to me now. I will close the issue once the branch is merged.

from rumble.

ghislainfourny avatar ghislainfourny commented on September 27, 2024

@wzrain does it now work? :)

from rumble.

wzrain avatar wzrain commented on September 27, 2024

@ghislainfourny Thanks for the fix! It works now. Issue closed. :)

from rumble.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.