Comments (13)
Reproduced on a smaller dataset:
for $i in parallelize((
{ "commits" : [ { "author" : "Einstein" } ]},
{ "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ]}
))
let $k := $i.commits[].author
group by $r := 1
return {"repo":$r, "count":count($k)}
from rumble.
The fix is on the way.
In fact, the cause was different: the counts were computed correctly but not marked as counts, so they were treated as ordinary items and "the counts were counted again" (yielding 1).
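A minimal plain-Python sketch of that explanation, for illustration only. It is not RumbleDB's code: the `Column` class and its `is_count` flag are hypothetical names standing in for however the engine marks a pre-aggregated column.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Column:
    """Hypothetical stand-in for the column of a grouped variable."""
    items: List[Any]        # materialized items, or [precomputed_count]
    is_count: bool = False  # True when items[0] already holds a count


def count(col: Column) -> int:
    # A column flagged as a count returns the stored number directly.
    if col.is_count:
        return col.items[0]
    # Otherwise the items themselves are counted.
    return len(col.items)


# The three authors of the small dataset, already aggregated to one number.
precomputed = 3
flagged = Column([precomputed], is_count=True)
unflagged = Column([precomputed])  # flag lost: the situation described above

print(count(flagged))    # 3
print(count(unflagged))  # 1, the count itself is counted as a single item
```

With the flag missing, the stored number 3 is just one item in a sequence, so counting it gives 1.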
from rumble.
This info log also needs to be changed: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/GroupByClauseSparkIterator.java#L576. It can be very misleading because it prints the wrong clause name.
from rumble.
@wzrain many thanks for your insights. This is now fixed and will make it into the next release.
from rumble.
@ghislainfourny Thanks for the quick response on the issue! However, the query I mentioned in my original comment:
for $i in json-file("git-archive.json")
let $k := $i.payload.commits[].author
group by $r := $i.repo.id
order by count($k) descending
count $c where $c <= 5
return {"repo":$r, "count":count($k)}
leads to a new error when tested on the latest master branch:
[ERROR] An error has occurred: cannot resolve '`k.sequence.sequence`' given input columns: [__auto_generated_subquery_name.grouping_columns, __auto_generated_subquery_name.i, __auto_generated_subquery_name.k.sequence, __auto_generated_subquery_name.r.sequence]; line 1 pos 23;
'Aggregate [grouping_columns#10], ['sum('cardinality('`k.sequence.sequence`)) AS k.sequence#11, first(r.sequence#6, false) AS r.sequence#13]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [i#1, k.sequence#3, r.sequence#6, createGroupingColumns(struct(r.sequence, r.sequence#6)) AS grouping_columns#10]
      +- SubqueryAlias input5d89821c16cc497a8b7a7789987b4434
         +- Project [i#1, k.sequence#3, letClauseUDF(struct(i, i#1)) AS r.sequence#6]
            +- SubqueryAlias inputc0a2110f1ad6434ea979e0156fec7463
               +- Project [i#1, letClauseUDF(struct(i, i#1)) AS k.sequence#3]
                  +- SubqueryAlias inputc5128f29cf45433eaae533dfa6d619e2
                     +- LogicalRDD [i#1], false
We should investigate this 🙈. Please contact us or file an issue on GitHub with your query.
from rumble.
Reproducible with the smaller dataset:
for $i in parallelize((
{ "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
{ "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
let $k := $i.commits[].author
group by $r := $i.repo
return {"repo":$r, "count":count($k)}
from rumble.
Fixed in #1159
@wzrain feel free to close after confirming that the fix works for you.
from rumble.
@ghislainfourny Thanks for the follow-up. However, I don't think the queries produce the expected result yet. For the query on the smaller dataset mentioned above:
for $i in parallelize((
{ "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
{ "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
let $k := $i.commits[].author
group by $r := $i.repo
return {"repo":$r, "count":count($k)}
I think it should return
{ "repo" : "r1", "count" : 2 }
{ "repo" : "r2", "count" : 1 }
instead it returns
{ "repo" : "r1", "count" : 1 }
{ "repo" : "r2", "count" : 1 }
from rumble.
You are right. I will iterate on this again. I think I know where the issue is: since $k is recognized as a count-only variable, the column for $k is incorrectly initialized with constant 1s, but it should be initialized with the pre-group count of $k.
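The difference can be sketched in plain Python on the small dataset above. This only models the described behavior under stated assumptions; `grouped_counts` and the `initial` callback are hypothetical names, not part of RumbleDB.

```python
from collections import defaultdict

# One tuple per input object: the grouping key and the pre-group value of $k.
tuples = [
    {"repo": "r2", "k": ["Einstein"]},
    {"repo": "r1", "k": ["Goedel", "Ramanujan"]},
]


def grouped_counts(tuples, initial):
    """Sum a per-tuple initial value for $k within each group."""
    totals = defaultdict(int)
    for t in tuples:
        totals[t["repo"]] += initial(t)
    return dict(sorted(totals.items()))


# Described bug: every tuple contributes a constant 1.
buggy = grouped_counts(tuples, lambda t: 1)
# Described fix: every tuple contributes its pre-group count of $k.
fixed = grouped_counts(tuples, lambda t: len(t["k"]))

print(buggy)  # {'r1': 1, 'r2': 1}
print(fixed)  # {'r1': 2, 'r2': 1}
```

The constant-1 initialization only gives the right answer when every tuple binds $k to exactly one item, which is why the wrong result shows up for "r1" and not for "r2".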
from rumble.
Exactly. This is actually what this issue was originally about. :)
from rumble.
@ghislainfourny Thanks for the fix! The CountFix branch looks good to me now. I will close the issue once the branch is merged.
from rumble.
@wzrain does it now work? :)
from rumble.
@ghislainfourny Thanks for the fix! It works now. Issue closed. :)
from rumble.