Issue: These two queries: for $i in json-file("git-a

Reproduced on a smaller dataset: <div class="snippet-clipboard-content notranslate

The fix is on the way. <a class="issue-link js-issue-link" data-erro

This info log also needs to be changed: <a href="https://github.com/RumbleDB/rumble/bl

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

Fixed in <a class="issue-link js-issue-link" data-error-text="Failed to load title" da

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

optimization issue of count after group by about rumble HOT 13 CLOSED

wzrain commented on September 27, 2024

optimization issue of count after group by

from rumble.

Comments (13)

ghislainfourny commented on September 27, 2024 1

Reproduced on a smaller dataset:

for $i in parallelize((
    { "commits" : [ { "author" : "Einstein" } ]},
    { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ]}
  ))
  let $k := $i.commits[].author
  group by $r := 1
  return {"repo":$r, "count":count($k)}

from rumble.

ghislainfourny commented on September 27, 2024 1

The fix is on the way.

#1162

In fact it was something different: the counts were correctly computed, but not marked as counts, so they were treated as items and "the counts were counted again" (to 1).

from rumble.

wzrain commented on September 27, 2024

This info log also needs to be changed: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/GroupByClauseSparkIterator.java#L576. It can be very misleading due to the incorrect clause name printed.

from rumble.

ghislainfourny commented on September 27, 2024

@wzrain many thanks for your insights. This is now fixed and will make it into the next release.

from rumble.

wzrain commented on September 27, 2024

@ghislainfourny Thanks for the quick response on the issue! Although now the query I mentioned in my original comment:

for $i in json-file("git-archive.json") 
let $k := $i.payload.commits[].author 
group by $r := $i.repo.id 
order by count($k) descending 
count $c where $c <= 5 
return {"repo":$r, "count":count($k)}

leads to a new error when tested on the latest master branch:

[ERROR] An error has occurred: cannot resolve '`k.sequence.sequence`' given input columns: [__auto_generated_subquery_name.grouping_columns, __auto_generated_subquery_name.i, __auto_generated_subquery_name.k.sequence, __auto_generated_subquery_name.r.sequence]; line 1 pos 23;
'Aggregate [grouping_columns#10], ['sum('cardinality('`k.sequence.sequence`)) AS k.sequence#11, first(r.sequence#6, false) AS r.sequence#13]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [i#1, k.sequence#3, r.sequence#6, createGroupingColumns(struct(r.sequence, r.sequence#6)) AS grouping_columns#10]
      +- SubqueryAlias input5d89821c16cc497a8b7a7789987b4434
         +- Project [i#1, k.sequence#3, letClauseUDF(struct(i, i#1)) AS r.sequence#6]
            +- SubqueryAlias inputc0a2110f1ad6434ea979e0156fec7463
               +- Project [i#1, letClauseUDF(struct(i, i#1)) AS k.sequence#3]
                  +- SubqueryAlias inputc5128f29cf45433eaae533dfa6d619e2
                     +- LogicalRDD [i#1], false

We should investigate this 🙈. Please contact us or file an issue on GitHub with your query.

from rumble.

wzrain commented on September 27, 2024

@ghislainfourny Thanks for the quick response on the issue! Although now the query I mentioned in my original comment:

for $i in json-file("git-archive.json") 
let $k := $i.payload.commits[].author 
group by $r := $i.repo.id 
order by count($k) descending 
count $c where $c <= 5 
return {"repo":$r, "count":count($k)}

leads to a new error when tested on the latest master branch:

[ERROR] An error has occurred: cannot resolve '`k.sequence.sequence`' given input columns: [__auto_generated_subquery_name.grouping_columns, __auto_generated_subquery_name.i, __auto_generated_subquery_name.k.sequence, __auto_generated_subquery_name.r.sequence]; line 1 pos 23;
'Aggregate [grouping_columns#10], ['sum('cardinality('`k.sequence.sequence`)) AS k.sequence#11, first(r.sequence#6, false) AS r.sequence#13]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [i#1, k.sequence#3, r.sequence#6, createGroupingColumns(struct(r.sequence, r.sequence#6)) AS grouping_columns#10]
      +- SubqueryAlias input5d89821c16cc497a8b7a7789987b4434
         +- Project [i#1, k.sequence#3, letClauseUDF(struct(i, i#1)) AS r.sequence#6]
            +- SubqueryAlias inputc0a2110f1ad6434ea979e0156fec7463
               +- Project [i#1, letClauseUDF(struct(i, i#1)) AS k.sequence#3]
                  +- SubqueryAlias inputc5128f29cf45433eaae533dfa6d619e2
                     +- LogicalRDD [i#1], false

We should investigate this 🙈. Please contact us or file an issue on GitHub with your query.

Reproducible with the smaller dataset:

for $i in parallelize((
    { "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
    { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
let $k := $i.commits[].author
group by $r := $i.repo
return {"repo":$r, "count":count($k)}

from rumble.

ghislainfourny commented on September 27, 2024

Fixed in #1159

@wzrain feel free to close after confirming that the fix works for you.

from rumble.

wzrain commented on September 27, 2024

@ghislainfourny Thanks for the follow-up. Although I don't really agree that the queries produce the expected result. For the query on the smaller dataset mentioned above:

for $i in parallelize((
    { "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
    { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
let $k := $i.commits[].author
group by $r := $i.repo
return {"repo":$r, "count":count($k)}

I think it should return

{ "repo" : "r1", "count" : 2 }
{ "repo" : "r2", "count" : 1 }

instead it returns

{ "repo" : "r1", "count" : 1 }
{ "repo" : "r2", "count" : 1 }

from rumble.

ghislainfourny commented on September 27, 2024

You are right. I will reiterate. I think I know where the issue is: since $k is recognized as a count-only, the column for $k is incorrectly initialized with constant 1s but it should be initialized with the pre-group $k count.

from rumble.

wzrain commented on September 27, 2024

You are right. I will reiterate. I think I know where the issue is: since $k is recognized as a count-only, the column for $k is incorrectly initialized with constant 1s but it should be initialized with the pre-group $k count.

Exactly. This is actually what this issue is initially about. :)

from rumble.

wzrain commented on September 27, 2024

@ghislainfourny Thanks for the fix! The CountFix branch looks good to me now. I will close the issue once the branch is merged.

from rumble.

ghislainfourny commented on September 27, 2024

@wzrain does it now work? :)

from rumble.

wzrain commented on September 27, 2024

@ghislainfourny Thanks for the fix! It works now. Issue closed. :)

from rumble.

optimization issue of count after group by about rumble HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent