lintool / bespin
Reference implementations of data-intensive algorithms in MapReduce and Spark
Home Page: http://bespin.io/
License: Other
Cloud9 has a bunch of integration tests that weren't copied over to Bespin - bring them over.
The reference implementation of PageRank throws away accuracy when calculating missing mass (RunPageRankBasic.java:456) by bringing the log-probability back into linear space to compute the missing mass.
This introduces error that grows as the missing mass gets smaller and as more iterations are run. Specifically, it slightly affects the solution values for an assignment of a certain systems course taught using this library: by 0.00001 on public local test cases, and potentially more on blind cloud test cases. (It does not affect the values in the README example.)
The patch is here (too lazy to pull-request): https://pastebin.com/A9gV6jRi
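For illustration, here is a minimal Python sketch of the precision problem. It is not the contents of the patch above and not the Bespin code itself; it assumes the total mass is available as a single log-probability log_total, and the function names are hypothetical. Computing 1 - exp(log_total) directly loses the result once the missing mass drops below double-precision resolution, while expm1 and log1p keep it:

```python
import math


def missing_mass_naive(log_total):
    """Missing mass computed by going back to linear space first."""
    return 1.0 - math.exp(log_total)


def log_missing_mass(log_total):
    """Return log(1 - exp(log_total)) for log_total < 0, staying accurate.

    Near zero, expm1 avoids the cancellation in 1 - exp(x);
    for very negative inputs, log1p avoids rounding 1 - tiny to 1.
    """
    if log_total >= 0.0:
        raise ValueError("log_total must be negative (total mass below 1)")
    if log_total > -math.log(2.0):
        return math.log(-math.expm1(log_total))
    return math.log1p(-math.exp(log_total))


# Hypothetical value: total mass is 1 - 1e-20, so the missing mass is 1e-20.
log_total = -1e-20
naive = missing_mass_naive(log_total)            # cancels to exactly 0.0
stable = math.exp(log_missing_mass(log_total))   # stays close to 1e-20
```

Whether the linked patch takes this route is not something I can confirm from the issue text; the sketch only shows why staying in log space matters when the missing mass is tiny.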
When I followed the initial steps and ran the word count example by executing
hadoop jar target/bespin-1.0.1-SNAPSHOT-fatjar.jar io.bespin.scala.mapreduce.wordcount.WordCount --input data/Shakespeare.txt --output wc-smr-combiner
I got this error message:
17/04/03 14:10:38 INFO wordcount.WordCount$: Input: data/Shakespeare.txt
17/04/03 14:10:38 INFO wordcount.WordCount$: Output: wc-smr-combiner
17/04/03 14:10:38 INFO wordcount.WordCount$: Number of reducers: 1
17/04/03 14:10:38 INFO wordcount.WordCount$: Use in-mapper combining: false
17/04/03 14:10:38 INFO wordcount.WordCount$: Use in-mapper histogram: false
Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:131)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:139)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:206)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:185)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:237)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:522)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:505)
at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:80)
at org.apache.hadoop.mapreduce.Job.<init>(Job.java:100)
at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:71)
at io.bespin.scala.mapreduce.wordcount.WordCount$.run(WordCount.scala:152)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at io.bespin.scala.mapreduce.wordcount.WordCount$.main(WordCount.scala:184)
at io.bespin.scala.mapreduce.wordcount.WordCount.main(WordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
This link suggests it might be caused by conflicting packages:
http://stackoverflow.com/questions/37116688/nosuchmethoderror-org-slf4j-spi-locationawarelogger-log
Am I the first one to hit this problem, and how can I solve it?
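One way to check for such a conflict is to look for jars that bundle the class named in the exception. The Python sketch below is a generic diagnostic, not something from the Bespin or Hadoop tooling. Which directories to search is an assumption on my part (for example, the Hadoop lib directory and the directory holding the fat jar), and the function name is hypothetical:

```python
import zipfile
from pathlib import Path

# Class named in the NoSuchMethodError, as a path inside a jar.
CLASS_ENTRY = "org/slf4j/spi/LocationAwareLogger.class"


def jars_containing(class_entry, search_dirs):
    """Return the paths of all jars under search_dirs that contain class_entry."""
    hits = []
    for directory in search_dirs:
        for jar in sorted(Path(directory).rglob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if class_entry in zf.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                # Skip files that have a .jar suffix but are not valid archives.
                continue
    return hits
```

If this reports more than one jar, the copies may come from different slf4j versions, which would be consistent with the NoSuchMethodError above. It does not prove that is the cause.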
At some appropriate checkpoint, we should publish Maven artifacts.
Cloud9 has a bunch of unit tests that weren't copied over to Bespin - bring them over.
Documentation can be taken from the 2018w iteration of my big data course:
https://lintool.github.io/bigdata-2018w/assignment7-451.html
PageRank-related classes still use commons-cli for parsing args. Refactor to args4j.
The current build command, mvn clean package, results in a build failure due to HTTP 501 errors.
Detailed error:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 33.970 s
[INFO] Finished at: 2020-09-13T15:40:40-04:00
[INFO] Final Memory: 38M/1514M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project bespin: Could not resolve dependencies for project io.bespin:bespin:jar:1.0.5-SNAPSHOT: Failed to collect dependencies at org.apache.spark:spark-core_2.11:jar:2.3.1 -> net.java.dev.jets3t:jets3t:jar:0.9.4 -> commons-codec:commons-codec:jar:1.15-SNAPSHOT: Failed to read artifact descriptor for commons-codec:commons-codec:jar:1.15-SNAPSHOT: Could not transfer artifact commons-codec:commons-codec:pom:1.15-SNAPSHOT from/to maven (http://repo.maven.apache.org/maven2/): Failed to transfer file: http://repo.maven.apache.org/maven2/commons-codec/commons-codec/1.15-SNAPSHOT/commons-codec-1.15-SNAPSHOT.pom. Return code is: 501 , ReasonPhrase:HTTPS Required. -> [Help 1]
This is because the Maven repository now requires HTTPS (the 501 response above says "HTTPS Required"). It can be fixed by using https in the repository tag of pom.xml:
<repositories>
<repository>
<id>maven</id>
<url>https://repo.maven.apache.org/maven2/</url> <!-- was http://repo.maven.apache.org/maven2/ -->
</repository>
</repositories>
Source: this was reported and resolved on the CS451 Piazza; I just copied it over here.
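To check a pom.xml for other repositories still using plain http, something like the following Python sketch could help. It is a hypothetical helper, not part of Bespin or Maven; it only assumes the usual POM layout in which repository and pluginRepository elements have a url child:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://maven.apache.org/POM/4.0.0}'."""
    return tag.rsplit("}", 1)[-1]


def insecure_repository_urls(pom_text):
    """Return every repository URL in the POM that starts with http://."""
    root = ET.fromstring(pom_text)
    urls = []
    for elem in root.iter():
        if local_name(elem.tag) in ("repository", "pluginRepository"):
            for child in elem:
                text = (child.text or "").strip()
                if local_name(child.tag) == "url" and text.startswith("http://"):
                    urls.append(text)
    return urls
```

Calling insecure_repository_urls(open("pom.xml").read()) on the unpatched file should list the repo.maven.apache.org URL shown above; an empty list means no plain-http repositories were found.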
Write a tokenizer so the definition of "word" is consistent across all examples...
Examples: