Comments (7)
Here is the Scalding test:
import com.twitter.scalding._

class TinyJoinAndMergeJob(args: Args) extends Job(args) {
  import TinyJoinAndMergeJob._

  // Read each input as a single-column pipe keyed by 'id
  val people = peopleInput.read.mapTo(0 -> 'id) { v: Int => v }
  val messages = messageInput.read
    .mapTo(0 -> 'id) { v: Int => v }
    .joinWithTiny('id -> 'id, people)

  // Merge both pipes and count rows per id
  (messages ++ people).groupBy('id)(_.size('count)).write(output)
}
Here is the input for the tests:
object TinyJoinAndMergeJob {
  val peopleInput = TypedTsv[Int]("input1")
  val peopleData = List(1, 2, 3, 4)
  val messageInput = TypedTsv[Int]("input2")
  val messageData = List(1, 2, 3)
  val output = TypedTsv[(Int, Int)]("output")
  val outputData = List((1, 2), (2, 2), (3, 2), (4, 1))
}
Previous behavior:
(1,2,3) joins against (1,2,3,4), creating (1,2,3).
(1,2,3) + (1,2,3,4) -> (1,1,2,2,3,3,4).
(1,1,2,2,3,3,4) count by key -> ((1,2), (2,2), (3,2), (4,1)).
Current behavior: every key is being over-counted by one.
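The expected counts can be checked without Scalding or Cascading at all. The following is a minimal sketch using plain Scala collections; the object name ExpectedCounts and its helper values are hypothetical, and only the fixture data is taken from the test above. It mirrors the three steps (join, merge, count by key), not how Scalding executes them.

```scala
object ExpectedCounts {
  // Fixture data copied from the TinyJoinAndMergeJob companion object
  val peopleData: List[Int] = List(1, 2, 3, 4)
  val messageData: List[Int] = List(1, 2, 3)

  // Step 1: inner join on id. People ids are unique, so each matching
  // message id yields exactly one joined row.
  val peopleIds: Set[Int] = peopleData.toSet
  val joined: List[Int] = messageData.filter(peopleIds.contains)

  // Step 2: merge the joined messages with people (the ++ in the job)
  val merged: List[Int] = joined ++ peopleData

  // Step 3: count rows per id, sorted by id for a stable comparison
  val counts: List[(Int, Int)] =
    merged.groupBy(identity)
      .map { case (id, rows) => (id, rows.size) }
      .toList
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    println(counts) // prints List((1,2), (2,2), (3,2), (4,1))
}
```

The result matches outputData, so any job output that differs from it points at the join or merge step rather than at the fixture data.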
from cascading.
I think this is also possibly related to cascading.avro.AvroScheme: the third-party package is leaking an old cascading 2.X version onto the classpath. That project seems to be completely abandoned right now.
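One way to confirm a classpath leak like this is to ask the JVM which jar a class was actually loaded from. This is a minimal sketch using only standard JVM reflection; the object name ClasspathCheck and its methods are hypothetical, and the default class names are assumptions about what is worth checking (cascading.avro.AvroScheme is the class named above).

```scala
import scala.util.Try

object ClasspathCheck {
  // Returns the jar or directory a class was loaded from, if the JVM exposes it.
  // Classes from the JDK itself typically have no code source, so this is None.
  def locationOf(cls: Class[_]): Option[String] =
    Option(cls.getProtectionDomain)
      .flatMap(pd => Option(pd.getCodeSource))
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)

  // Looks a class up by name without initializing it and reports its origin
  def describe(name: String): String =
    Try(Class.forName(name, false, getClass.getClassLoader)).toOption match {
      case Some(cls) => s"$name -> ${locationOf(cls).getOrElse("unknown location")}"
      case None      => s"$name -> not on classpath"
    }

  def main(args: Array[String]): Unit = {
    // Class names to inspect; pass your own as arguments to override
    val names =
      if (args.nonEmpty) args.toList
      else List("cascading.flow.Flow", "cascading.avro.AvroScheme")
    names.foreach(n => println(describe(n)))
  }
}
```

Run from the same classpath as the failing job, the printed locations show whether the cascading classes resolve to the expected jar or to one pulled in transitively by another dependency.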
So there is this: https://github.com/cwensel/cascading-avro
Thanks for the fork and link. We're still trying to set up some discussion with the scalding project to figure out what a potential upgrade path might look like. Since it's not my project, it's not clear to me what is a "must have" to support. I will try to get back once I have more info on what the plan is.
It makes sense to push the avro fork out to Maven Central, but I don't have time to patch it since it uses Maven as its build. If you guys need it, feel free to push PRs to it so we can get it out. I also don't have a build server for it, so it will need GitHub workflow action(s) as well.
Note, the hardest part of pushing to Maven Central will be getting the private keys on the build server. We can collaborate on that bit.
@daniel-sudz has this been resolved?
@cwensel I'm going to close this. I think it's unlikely that scalding will adopt cascading 4.X in the near future, so I don't really have time to look into it.