
rhdfs's People

Contributors

piccolbo


rhdfs's Issues

from.dfs produces "file does not exist" error

Hi,
I set up R and Hadoop using the Cloudera QuickStart VM, CDH 5.3.

R version 3.1.2, VirtualBox Manager 4.3.20, running on Mac OS X 10.7.5.
I followed the blog
http://www.r-bloggers.com/integration-of-r-rstudio-and-hadoop-in-a-virtualbox-cloudera-demo-vm-on-mac-os-x/
to set up R and Hadoop, and turned off MR2/YARN; instead I am using MR1.

Everything seems to work fine except for the from.dfs function.

I am using the simple example in R:
small.ints <- to.dfs(1:1000)
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
df <- as.data.frame(from.dfs(out))
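
For context, a minimal setup sketch of the environment assumed before this snippet; the paths here are assumptions taken from the CDH-style locations quoted in other reports on this page, not from this issue, and must be adjusted to what the QuickStart VM actually ships.

Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")                                       # assumed path
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")  # assumed path
library(rhdfs)
library(rmr2)
hdfs.init()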

from.dfs produces the following error. If you could be of any help, I'd greatly appreciate it. Thank you very much. -EK

When I use it I get the error:
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/128432
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/422
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/122
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Error when running hdfs.init()

Opened on behalf of @yoonus786

Hi, I am able to run hdfs.init() on the master node, but I get an error when running it on a slave node.

hdfs.init()
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.io.IOException: failure to login
Please let me know what the problem might be here.

Thanks,
Yoonus

Getting error while loading rhdfs library in R

Hi,
I am getting an error while loading rhdfs in R.
I have already loaded library(rJava) successfully.

I cannot solve it. Could someone help? Thanks in advance.

library("rhdfs")
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Error: package/namespace load failed for ‘rhdfs’

I tried reconfiguring rJava, but it did not change anything.
canil@ubuntu:/$ sudo R CMD javareconf
Java interpreter : /usr/bin/java
Java version : 1.7.0_13
Java home path : /usr/lib/jvm/java-7-oracle/jre
Java compiler : /usr/bin/javac
Java headers gen.: /usr/bin/javah
Java archive tool: /usr/bin/jar
NOTE: Your JVM has a bogus java.library.path system property!
Trying a heuristic via sun.boot.library.path to find jvm library...
Java library path: $(JAVA_HOME)/lib/amd64:$(JAVA_HOME)/lib/amd64/server
JNI linker flags : -L$(JAVA_HOME)/lib/amd64 -L$(JAVA_HOME)/lib/amd64/server -ljvm
JNI cpp flags : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux

Updating Java configuration in /etc/R
Done.

And I have already set HADOOP_CMD, which is the path to my hadoop binary, like so:

export HADOOP_CMD=/usr/local/hadoop/bin/hadoop

What should I do?
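
A hedged suggestion rather than a confirmed fix: an export in the shell is not always visible inside an R or RStudio session, so setting the variable from within R immediately before loading the package rules that out. The path below is the one quoted above.

Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")  # same path as the export above
library(rhdfs)
hdfs.init()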

RHDFS and kerberos

I am facing an issue when interacting with HDFS from the R shell. rhdfs is properly installed. Does rhdfs support Kerberos?
The underlying cluster uses Pivotal HD as the Hadoop distribution and is secured using Kerberos. The error message is below.

14/01/30 04:47:50 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
14/01/30 04:47:50 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
14/01/30 04:47:50 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :

java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host2.com/192.168.74.113"; destination host is: "host.com":8020;
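
As a first diagnostic, a sketch under the assumption that the cluster expects a Kerberos ticket cache: the "Failed to find any Kerberos tgt" message normally means the user running R holds no valid ticket, so obtaining one before hdfs.init() helps isolate whether rhdfs itself is at fault. The principal below is hypothetical.

system("kinit user@EXAMPLE.COM")  # hypothetical principal; prompts for a password in a terminal R session
library(rhdfs)
hdfs.init()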

Unable to read MR job output using R

Environment:
Hortonworks 2.1 cluster integrated with Kerberos and Active Directory
R version: 3.1.3

Issue:
I am trying to run a simple MR job using R on a Kerberos enabled Hadoop cluster. The R code is given below:
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.4.0.2.1.5.0-695.jar")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
library(rhdfs)
library(rmr2)
hdfs.init()
ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))

Up to this point the mapreduce job runs successfully, but when I try to access the results using the following command, an error is thrown:
from.dfs(calc)

The error is "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 8 elements".

The same error is thrown when accessing the output of any MR job (wordcount, pi value).

The traceback() function displays the following:
7: scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE,
fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip,
multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes,
flush = flush, encoding = encoding, skipNul = skipNul)
6: read.table(textConnection(hdfs("ls", fname, intern = TRUE)),
skip = 1, col.names = c("permissions", "links", "owner",
"group", "size", "date", "time", "path"), stringsAsFactors = FALSE)
5: hdfs.ls(fname)
4: part.list(fname)
3: lapply(src, function(x) system(paste(hadoop.streaming(), "dumptb",
rmr.normalize.path(x), ">>", rmr.normalize.path(dest))))
2: dumptb(part.list(fname), tmp)
1: from.dfs(calc)

Please let me know how to resolve this issue.
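
For reference, the failing read.table call in the traceback above is parsing the text emitted by hadoop fs -ls, so dumping that text directly shows whether extra warning lines from the Kerberos-enabled client are what breaks the expected 8-column layout. A small diagnostic sketch; the path is hypothetical and HADOOP_CMD is assumed to be set as in the script above.

raw <- system2(Sys.getenv("HADOOP_CMD"), c("fs", "-ls", "/user/someuser"),  # hypothetical path
               stdout = TRUE, stderr = TRUE)
cat(raw, sep = "\n")  # inspect the lines hdfs.ls() will try to parse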

HADOOP_CMD getting lost...

Hi Antonio,

I've run into a scenario where I call a mapreduce (rmr) from a shell script inside a mapper (this is how Oozie launches a shell action).

Here is the flow:
Oozie Launcher Job -> Launcher map-only task where the shell script (Rscript myscript.r) executes -> StreamJob -> Mappers / Reducers

myscript.r

Sys.setenv(JAVA_HOME="/usr/jdk64/jdk1.7.0_67")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME="/usr/hdp/2.2.6.0-2800/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar")
library("rhdfs")
library("rmr2")
hdfs.init()
library(Matrix)

Logs

Launcher job (the job that launches the map-only task)
In this map task, the shell script is executed as a system call (Rscript myscript.r). Here HADOOP_CMD is set correctly, but an error from the mr function is logged (probably due to the streaming mapper failing when an hdfs function is called from inside the mr function). Launcher (mapper) log:

Loading required package: methods
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop

Be sure to run hdfs.init()
Please review your hadoop settings. See help(hadoop.settings)
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  : 
  hadoop streaming failed with error code 15
Calls: getTags -> mapreduce -> mr

The rmr streaming call starts a StreamJob, which can have mappers and reducers. StreamJob (mapper) log:

    Log Type: stderr
    Log Upload Time: Thu Aug 06 19:24:41 -0400 2015

    Log Length: 2722

    Loading objects:
    Loading objects:
      backend.parameters
      combine
    Please review your hadoop settings. See help(hadoop.settings)
      combine.file
      combine.line
      debug
      default.input.format
      default.output.format
      in.folder
      in.memory.combine
      input.format
      libs
      map
      map.file
      map.line
      out.folder
      output.format
      pkg.opts
      postamble
      preamble
      profile.nodes
      reduce
      reduce.file
      reduce.line
      rmr.global.env
      rmr.local.env
      save.env
      tempfile
      vectorized.reduce
      verbose
      work.dir
    Loading required package: methods
    Loading required package: rJava
    Loading required package: rhdfs
    Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
      call: fun(libname, pkgname)
      error: Environment variable HADOOP_CMD must be set before loading package rhdfs
    Warning in FUN(X[[i]], ...) : can't load rhdfs
    Loading required package: rmr2
    Loading required package: Matrix

However, calling the myscript.r from command line works fine.

Here is my question:
Should rmr propagate the environment variables in this case, or should it be the responsibility of the environment to provide the value of the HADOOP_CMD variable?
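
One avenue worth testing, offered as a sketch rather than confirmed rmr2 behaviour: hadoop streaming can forward environment variables to its map/reduce tasks with the -cmdenv option, and rmr2's backend.parameters argument is intended for passing extra options to the hadoop backend, so something along these lines may keep HADOOP_CMD visible inside the StreamJob tasks. Verify how your rmr2 version translates backend.parameters into streaming options before relying on it.

res <- mapreduce(
  input = some.input,  # placeholder input
  map = function(k, v) keyval(k, v),
  backend.parameters = list(
    hadoop = list(cmdenv = "HADOOP_CMD=/usr/bin/hadoop")  # assumed to become -cmdenv HADOOP_CMD=...
  )
)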

hdfs.read() cannot load all data from huge csv file on hdfs

Hi,
I have many huge CSV files (more than 20 GB each) on my Hortonworks HDP 2.0.6.0 GA cluster, and
I use the following code to read a file from HDFS:


Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
library(rmr2);
library(rhdfs);
library(lubridate);
hdfs.init();
f = hdfs.file("/etl/rawdata/201202.csv","r",buffersize=104857600);
m = hdfs.read(f);
c = rawToChar(m);
data = read.table(textConnection(c), sep = ",");


When I use dim(data) to verify, it shows me the following:
[1] 1523 7


But it should actually be 134279407 instead of 1523.
I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49 ...", and there is a thread on the hadoop-hdfs-user mailing list ("why can FSDataInputStream.read() only read 2^17 bytes in hadoop 2.0?").
Ref.
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201403.mbox/%3CCAGkDawm2ivCB+rNaMi1CvqpuWbQ6hWeb06YAkPmnOx=8PqbNGQ@mail.gmail.com%3E

Is it a bug of hdfs.read() in rhdfs-1.0.8?

Best Regards,
James Chang
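
An editorial workaround sketch that stays within the API used above: call hdfs.read() in a loop and concatenate the chunks, under the assumption that it returns NULL or an empty raw vector once the file is exhausted. Note that materialising a 20 GB file in memory this way is still likely to be impractical.

f <- hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
chunks <- list()
repeat {
  m <- hdfs.read(f)                        # returns at most one buffer per call
  if (is.null(m) || length(m) == 0) break  # assumed end-of-file signal
  chunks[[length(chunks) + 1]] <- m
}
hdfs.close(f)
data <- read.table(textConnection(rawToChar(do.call(c, chunks))), sep = ",")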

Unsupported major.minor version 51.0 with hdfs.init()

I know this has to do with the difference between the Java versions used at compile time and at runtime; however, I think I have set all the environment variables properly, so I don't really know what is still causing this issue.

$ java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
$ javac -version
javac 1.7.0_79
$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home
$ hadoop version
Hadoop 2.7.1
> Sys.getenv("JAVA_HOME")
[1] "/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home"
> library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/local/Cellar/hadoop/2.7.1/bin/hadoop

Be sure to run hdfs.init()
Warning message:
package ‘rJava’ was built under R version 3.1.3
> hdfs.init()
Error in .jnew("org/apache/hadoop/conf/Configuration") : 
  java.lang.UnsupportedClassVersionError: org/apache/hadoop/conf/Configuration : Unsupported major.minor version 51.0

I also set JAVA_HOME in Hadoop's hadoop-env.sh to the 1.7.0 JDK:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home

I would really appreciate it if someone could point out what's going on here.
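
One quick check worth adding here, as a sketch assuming rJava loads: major.minor 51.0 is the Java 7 class-file version, so the error suggests the JVM embedded in R is older than 1.7 even though the java on the shell PATH is 1.7.0_79. Asking that embedded JVM for its version directly confirms or rules this out.

library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.version")  # version of the JVM rJava attached to R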

reading output of Hadoop job from RHDFS problem!

Hi,
I have run kmeans.R and the job finished successfully,
but the output file cannot be read or reached; it always tells me it is not permitted. I even tried sudo.
The output is in /tmp/Rtmpm1Mc34/filef281df27152.
How can I see the output, from R or from the terminal?

And secondly, how can I set the output file in HDFS? I mean, when I do hdfs.ls(".") it should be listed. (See the sketch after the log below.)

13/02/24 21:57:18 INFO streaming.StreamJob: Job complete: job_201302242120_0006
13/02/24 21:57:18 INFO streaming.StreamJob: Output: /tmp/Rtmpm1Mc3/filef281df27152

Thanks in advance
(I hope this is the right place this time.)
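
On the second question, a hedged sketch in which the output path and input are placeholders: rmr2's mapreduce() accepts an explicit output location, which keeps the result out of the temporary /tmp/Rtmp... area and makes it visible to hdfs.ls().

out <- mapreduce(input = some.input,                 # placeholder input
                 output = "/user/me/kmeans-output",  # hypothetical HDFS path you can write to
                 map = function(k, v) keyval(k, v))
from.dfs(out)        # read the result back into R
hdfs.ls("/user/me")  # the output directory should now appear in the listing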

Error installing rhdfs_1.0.8.tar.gz

[186946@01HW524744 hadd]$ sudo R CMD INSTALL rhdfs_1.0.8.tar.gz

  • installing to library ‘/usr/lib64/R/library’
  • installing source package ‘rhdfs’ ...
    ** R
    ** inst
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    converting help for package ‘rhdfs’
    finding HTML links ... done
    hdfs-file-access html
    hdfs-file-manip html
    hdfs.defaults html
    hdfs.file-level html
    initialization html
    rhdfs html
    text.files html
    ** building package indices
    ** testing if installed package can be loaded
    Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
    call: fun(libname, pkgname)
    error: Environment variable HADOOP_CMD must be set before loading package rhdfs
    Error: loading failed
    Execution halted
    ERROR: loading failed

* removing ‘/usr/lib64/R/library/rhdfs’

When I try [186946@01HW524744 hadd]$ sudo -E R CMD INSTALL rhdfs_1.0.8.tar.gz

  • installing to library ‘/usr/lib64/R/library’
  • installing source package ‘rhdfs’ ...
    ** R
    ** inst
    ** preparing package for lazy loading
    Error : .onLoad failed in loadNamespace() for 'rJava', details:
    call: dyn.load(file, DLLpath = DLLpath, ...)
    error: unable to load shared object '/usr/lib64/R/library/rJava/libs/rJava.so':
    libjvm.so: cannot open shared object file: No such file or directory
    Error : package ‘rJava’ could not be loaded
    ERROR: lazy loading failed for package ‘rhdfs’
  • removing ‘/usr/lib64/R/library/rhdfs’

HADOOP_CMD is set in .bashrc
[186946@01HW524744 hadd]$ echo $HADOOP_CMD

/usr/bin/hadoop

Please help
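
A hedged suggestion rather than a confirmed fix: plain sudo usually drops HADOOP_CMD from the environment (hence the first failure), while sudo -E then trips over rJava's library path. Installing from inside an R session that has the variable set sidesteps both, assuming you can write to a library location (add a lib = argument if not).

Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")  # same value as in .bashrc
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")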

I cannot read the output of a mapreduce job

I cannot read the output of a mapreduce job.

The code:

data=to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
print(res())

[1] "/tmp/Rtmpr5Xv1g/file34916a6426bf"

And then....

from.dfs(res)

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /tmp/Rtmpr5Xv1g/file34916a6426bf/_logs
...
...

Finally,

hdfs.ls("/tmp/Rtmpr5Xv1g/file34916a6426bf")

permission owner group size modtime
1 -rw------- daniel supergroup 0 2013-05-13 18:24
2 drwxrwxrwt daniel supergroup 0 2013-05-13 18:23
3 -rw------- daniel supergroup 448 2013-05-13 18:24
4 -rw------- daniel supergroup 122 2013-05-13 18:23
file
1 /tmp/Rtmpr5Xv1g/file34916a6426bf/_SUCCESS
2 /tmp/Rtmpr5Xv1g/file34916a6426bf/_logs
3 /tmp/Rtmpr5Xv1g/file34916a6426bf/part-00000
4 /tmp/Rtmpr5Xv1g/file34916a6426bf/part-00001

I note that /tmp/Rtmpr5Xv1g/file34916a6426bf/_logs is a directory

Why does the program try to read the file "_logs" when it is a directory?

Thanks in advance

Alfonso
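
An editorial note, not a confirmed fix: the _logs entry holds job-history files rather than key/value data, so removing it before retrying from.dfs() is one way to test whether it is what trips up dumptb. The helper name below comes from rhdfs's file-manipulation functions and should be checked against the installed version.

hdfs.rm("/tmp/Rtmpr5Xv1g/file34916a6426bf/_logs")  # hdfs.del()/hdfs.delete() in some versions
from.dfs(res)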

I cannot read the output of a mapreduce job

Hello, I could use some help with this error: from.dfs does not give me the data from the mapreduce job, and I get an error instead. Any idea what the fault may be?
This is the code I'm using
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-7-openjdk-i386")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar")
library('rmr2')
data=to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
from.dfs(res)

This is what appears on the console:

Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")

Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-7-openjdk-i386")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar")
library('rmr2')
data=to.dfs(1:10)
OpenJDK Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
17/04/10 18:12:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/10 18:12:46 INFO compress.CodecPool: Got brand-new compressor [.deflate]
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
OpenJDK Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
17/04/10 18:12:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/10 18:12:50 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
packageJobJar: [/tmp/hadoop-unjar5133688999707817678/] [] /tmp/streamjob6112579330814301418.jar tmpDir=null
17/04/10 18:12:51 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.0.24:8050
17/04/10 18:12:52 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.0.24:8050
17/04/10 18:12:53 INFO mapred.FileInputFormat: Total input paths to process : 1
17/04/10 18:12:53 INFO mapreduce.JobSubmitter: number of splits:2
17/04/10 18:12:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491851865431_0014
17/04/10 18:12:54 INFO impl.YarnClientImpl: Submitted application application_1491851865431_0014
17/04/10 18:12:54 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1491851865431_0014/
17/04/10 18:12:54 INFO mapreduce.Job: Running job: job_1491851865431_0014
17/04/10 18:13:02 INFO mapreduce.Job: Job job_1491851865431_0014 running in uber mode : false
17/04/10 18:13:02 INFO mapreduce.Job: map 0% reduce 0%
17/04/10 18:13:10 INFO mapreduce.Job: map 50% reduce 0%
17/04/10 18:13:11 INFO mapreduce.Job: map 100% reduce 0%
17/04/10 18:13:12 INFO mapreduce.Job: Job job_1491851865431_0014 completed successfully
17/04/10 18:13:13 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=220440
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=973
HDFS: Number of bytes written=244
HDFS: Number of read operations=14
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=2
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=13405
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=13405
Total vcore-seconds taken by all map tasks=13405
Total megabyte-seconds taken by all map tasks=13726720
Map-Reduce Framework
Map input records=3
Map output records=0
Input split bytes=180
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=124
CPU time spent (ms)=2450
Physical memory (bytes) snapshot=296972288
Virtual memory (bytes) snapshot=1424506880
Total committed heap usage (bytes)=217579520
File Input Format Counters
Bytes Read=793
File Output Format Counters
Bytes Written=244
17/04/10 18:13:13 INFO streaming.StreamJob: Output directory: /tmp/file649a194b8a0e
from.dfs(res)
OpenJDK Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
17/04/10 18:13:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/10 18:13:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
Line 1 does not have 8 elements

HDP 2.0 / Hadoop 2 compatibility

From the wiki:

model <- 3
modelfilename <- "my_smart_unique_name"
modelfile <- hdfs.file(modelfilename, "w")
hdfs.write(model, modelfile)
[1] TRUE
hdfs.close(modelfile)
Error in fh$sync : no field, method or inner class called 'sync'

Is 2.0 support in the works?

Recursive hdfs.read on directory path

hdfs.init()
modelfilename <- "<PATH_TO_DIRECTORY>"
modelfile = hdfs.read(modelfilename, "r")
m <- hdfs.read(modelfile)
head(m)

Console :

> modelfilename <- "<PATH_TO_DIRECTORY>"
> modelfile = hdfs.read(modelfilename, "r")
Error in con$fh : $ operator is invalid for atomic vectors
> m <- hdfs.read(modelfile)
> head(m)
[1] 06 f7 9f 04 50 28
> 

Access to the data is OK, but there is an error on "con$fh".
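
A hedged sketch of the intended directory read, using only calls that appear elsewhere on this page: hdfs.read() expects the connection returned by hdfs.file(), not a path string (which is what triggers the con$fh error above), so each file in the directory is opened and read individually. The listing's path column is taken as its last column rather than by name, since the column naming varies between the outputs shown in these issues.

hdfs.init()
dirpath <- "<PATH_TO_DIRECTORY>"
listing <- hdfs.ls(dirpath)              # data frame describing the directory contents
paths <- listing[[ncol(listing)]]        # assumed: last column holds the full paths
chunks <- lapply(paths, function(p) {
  con <- hdfs.file(p, "r")
  on.exit(hdfs.close(con))
  hdfs.read(con)                         # raw vector (first buffer only; loop for large files)
})
m <- do.call(c, chunks)
head(m)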

Cannot install rhdfs on my EC2 Ubuntu instances

Hi,

I am trying to install rhdfs on my EC2 Ubuntu instances. I first installed the packages rhdfs depends on, such as rJava, plyr, and rmr2. But when I tried to run the following command to install rhdfs,

sudo R CMD INSTALL /data/tarfiles/rhdfs_1.0.8.tar.gz

I got the following error message:

** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Error: loading failed

I tried opening R and running Sys.getenv("HADOOP_CMD"), and it does give me the right Hadoop path. So the error message is really puzzling to me.

Can anybody help?

Max
