azure / azurestor
R interface to Azure storage accounts
License: Other
Hi there!
First of all, thanks for the great work on the AzureR packages; I'm actively using AzureStor and it works wonderfully. Looking at the documentation, there are no details about editing or appending content to a file already stored in a storage container.
My current use case is saving logs in Azure Storage. Is it possible to consider this enhancement? If not, is there a workaround for this use case in the current AzureStor?
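A possible workaround sketch, assuming no native append support in the package: download the existing blob into memory, append the new lines locally, and re-upload the whole blob. `append_lines_to_blob` and `combine_log` are hypothetical helpers, not AzureStor functions.

```r
# Pure helper: combine existing text with new log lines
combine_log <- function(existing, new_lines)
    paste0(existing, paste0(new_lines, "\n", collapse=""))

# Hypothetical: emulate appending to a block blob by
# download-modify-reupload. Not atomic; fine for low-volume logs.
append_lines_to_blob <- function(cont, blobname, new_lines)
{
    existing <- tryCatch(
        # download_blob(dest=NULL) returns the blob contents as a raw vector
        rawToChar(AzureStor::download_blob(cont, src=blobname, dest=NULL)),
        error=function(e) ""   # treat a missing blob as empty
    )
    AzureStor::upload_blob(cont,
        src=textConnection(combine_log(existing, new_lines)),
        dest=blobname)
}
```

This rewrites the whole blob on every call, so it only suits small log files; true append-blob support would need changes in the package itself.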
Thank you! 🤓
Yor Castaño
I am trying to use the download_from_url function from the AzureStor package. I keep getting the below error
Error in download_blob_internal(container, src, dest, blocksize = blocksize, : Forbidden (HTTP 403). Failed to complete Storage Services operation. Message: .
I know that it should work and the permissions are set up correctly for the storage account because I am able to download the blob using Python.
> devtools::install_github("Azure/AzureQstor")
Downloading GitHub repo Azure/AzureQstor@master
from URL https://api.github.com/repos/Azure/AzureQstor/zipball/master
Installing AzureQstor
"C:/PROGRA~1/R/R-34~1.3PA/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD \
INSTALL \
"C:/Users/Hanieh_local/AppData/Local/Temp/RtmpExcszx/devtools6f58179f2f93/Azure-AzureQstor-cb830d7" \
--library="C:/Users/Hanieh_local/Documents/R/win-library/3.4" --install-tests
* installing *source* package 'AzureQstor' ...
** R
** tests
** preparing package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace 'AzureStor' 2.0.1 is being loaded, but >= 3.0.0 is required
ERROR: lazy loading failed for package 'AzureQstor'
* removing 'C:/Users/Hanieh_local/Documents/R/win-library/3.4/AzureQstor'
In R CMD INSTALL
Installation failed: Command failed (1)
Using an earlier version of AzureStor, we were able to successfully upload files into the blob storage and retrieve them.
After the update to version 3.0.0 the list_storage_files function fails with the following error message:
Error in as.POSIXct.default(df$`Creation-Time`, format = "%a, %d %b %Y %H:%M:%S", : do not know how to convert 'df$`Creation-Time`' to class “POSIXct”
After some investigation, we found this issue to be caused by line 332 in: https://github.com/Azure/AzureStor/blob/36317e84a81e3f6496324d3efc6faf56c65e233e/R/blob_client_funcs.R
In our case the Creation-Time is not part of the properties returned on this line:
lst <- res$Blobs
Our initial thought was to delete the files and upload them again using AzureStor 3.0.0, unfortunately, this didn't solve the issue at hand.
I want to move or copy blobs between two containers that may or may not be in the same storage account. It would be nice to have a method to do so. It would also be nice to be able to supply a vector of names for a multi-copy version of this.
Hi, awesome package. I've struggled to figure out how to extend it or add new methods.
Is it possible to add the Storage API Metadata functions?
Hi all,
I notice list_blob_containers does not iterate on the NextMarker like the list_blobs function does.
I need list_blob_containers to return the entire list of containers in my account, not just the first few elements.
Example :
> connection<- AzureStor::blob_endpoint(endpoint = endpoint, key = key)
> names(AzureStor::list_blob_containers(connection))
[1] "db-tab-1" "db-tab-2"
But in reality I have more containers; I should get something like
> connection<- AzureStor::blob_endpoint(endpoint = endpoint, key = key)
> names(AzureStor::list_blob_containers(connection))
[1] "db-tab-1" "db-tab-2" "db-tab-3" "db-tab-4"
So to fix that, just include the iteration on NextMarker
😄 or did I miss something in my code? 🤔
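A generic sketch of the NextMarker iteration being requested, using a hypothetical page-fetching function `fetch_page(marker)` in place of the raw REST call (it is not part of the AzureStor API):

```r
# fetch_page(marker) is assumed to return list(items=..., NextMarker=...),
# where NextMarker is NULL or "" on the final page.
list_all <- function(fetch_page)
{
    out <- list()
    marker <- NULL
    repeat {
        page <- fetch_page(marker)
        out <- c(out, page$items)
        marker <- page$NextMarker
        if(is.null(marker) || marker == "")
            break
    }
    out
}
```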
Thank you!
The storage_download function may work on larger files, but I hit weird timeouts on very small files (the iris dataset, as shown in the documentation examples). I put together a reproducible example to demonstrate. I tracked the hang down to the place in the code where the size of the file is computed for the progress bar, but I haven't yet figured out exactly what about the logic causes it.
Perhaps the source of the issue will be apparent to someone more familiar with the package. If not, perhaps the logic could be borrowed from the list_blobs function instead... it may not be as exact, but I had no trouble getting file sizes for the small files using that function.
Code to produce error:
username="myusername"
storage_account<-"mystorageaccountname"
container_name<-"mycontainername"
access_key<-"myaccesskey"
storage_endpoint_url<-"https://${storage_account}.blob.core.windows.net"
storage_endpoint_url_resolved<-glue::glue(storage_endpoint_url,.open="${")
bl_endp_key <- AzureStor::storage_endpoint(storage_endpoint_url_resolved,
key=access_key
)
AzureStor::list_storage_containers(bl_endp_key)
cont<-AzureStor::create_storage_container(bl_endp_key,name=container_name)
json <- jsonlite::toJSON(iris, pretty=TRUE, auto_unbox=TRUE)
con <- textConnection(json)
AzureStor::storage_upload(cont, src=con, dest="iris.json")
AzureStor::list_blobs(cont)
### will fail on next line (times out ten times in a row)
## it gets stuck here: https://github.com/Azure/AzureStor/blob/a71fcf88a9bd7b97c8ab221aaf3ced813d0265db/R/blob_transfer_internal.R#L83
## if I change the http_verb from "HEAD" to "GET" it finishes, so perhaps there's something wrong in the logic
## with regard to very small file sizes?
rawvec <- AzureStor::storage_download(cont, src="iris.json", dest=NULL)
rawToChar(rawvec)
### alternate workaround to show the file is fine and downloadable
tenant_id<-"mytenantid"
subscription_id<-"mysubscriptionid"
resource_group_name<-"myresourcegroupname"
az <- AzureRMR::create_azure_login(tenant=tenant_id,auth_type="device_code")
sub <- az$get_subscription(subscription_id)
rg <- sub$get_resource_group(resource_group_name)
stor <- rg$get_resource(type="Microsoft.Storage/storageAccounts", name=storage_account)
rdevstor1 <- rg$get_storage_account(storage_account)
my_sas<-rdevstor1$get_account_sas(permissions="r",services="bf", start=Sys.Date(), expiry=Sys.Date() + 31)
my_blob_endpoint<-rdevstor1$get_blob_endpoint()
l_endp_sas <- AzureStor::storage_endpoint(storage_endpoint_url_resolved, sas=my_sas)
adm_cont <- AzureStor::storage_container(l_endp_sas, container_name)
test_url<-paste0(storage_endpoint_url_resolved,"/",container_name,"/iris.json?",my_sas)
test_download<-jsonlite::fromJSON(test_url)
testthat::expect_equal(test_download,jsonlite::fromJSON(jsonlite::toJSON(iris)))
I just tested it for blob, but I assume the issue may apply to other endpoint types as well. You seem to be missing a forward slash in the print function print.blob_container <- function(x, ...) of the blob endpoint object in AzureStor/R/blob_client_funcs.R.
The function returns as follows
> cat(sprintf("URL: %s\n", paste0("https://myblob.blob.core.windows.net", "mycontainer")))
URL: https://myblob.blob.core.windows.netmycontainer
when it should return
URL: https://myblob.blob.core.windows.net/mycontainer
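A minimal sketch of the suggested fix, joining the endpoint URL and container name with an explicit separator (the variable names are illustrative, not the package's internals):

```r
endpoint_url <- "https://myblob.blob.core.windows.net"
container_name <- "mycontainer"

# Buggy: paste0(endpoint_url, container_name) drops the separator.
# Fixed: insert the "/" explicitly.
container_url <- paste0(endpoint_url, "/", container_name)
cat(sprintf("URL: %s\n", container_url))
```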
I could try a pull request, but I have never done that before on an open-source GitHub R project, so I'm not sure how helpful it would be (it would also take some time getting used to the conventions and such). Let me know.
I am trying to list the folders within a specific container using the following code
AzureStor::list_storage_files(
containers[["data"]],
dir = "path/to/folder",
recursive = FALSE,
info = "name"
)
However, the recursive = FALSE option seems to have no effect and presents the file paths for all files in each of the folders (i.e. recursive = TRUE).
Looking into this, I cannot see where recursive is used within list_blobs(), even though it is an option to that function.
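Until recursive = FALSE is honoured, one workaround sketch is to list everything and keep only the entries with no further "/" beyond the requested directory. `filter_top_level` is an illustrative helper, not part of the package:

```r
# Keep only names directly under `dir` (no deeper "/" in the remainder)
filter_top_level <- function(names, dir)
{
    prefix <- if(nzchar(dir)) paste0(sub("/+$", "", dir), "/") else ""
    rest <- substring(names[startsWith(names, prefix)], nchar(prefix) + 1)
    rest[!grepl("/", rest, fixed=TRUE)]
}
```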
Hi,
When I try to get Blob storage data from RStudio Server on my VM (CentOS 7), I get the following error message.
Error in download_blob_internal(container, src, dest, blocksize = blocksize, :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
Could you please tell me the solution to this problem?
Here's the code I tried
library(AzureStor, lib.loc = "/usr/lib64/R/library")
endp <- storage_endpoint("https://mystorageaccount.blob.core.windows.net", key="myaccesskey")
cont <- storage_container(endp, "mycontainer")
storage_download(cont, "myblob.csv",
"myblob_local.csv",
overwrite = T)
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] AzureStor_3.2.3
loaded via a namespace (and not attached):
[1] httr_1.4.2 compiler_3.6.0 R6_2.4.1 AzureRMR_2.3.5
[5] tools_3.6.0 curl_4.3 rappdirs_0.3.1 AzureAuth_1.2.4
[9] mime_0.9 openssl_1.4.2 askpass_1.1
blob storage info:
・Performance/Access tier: Standard/Hot
・Replication: Read-access geo-redundant storage (RA-GRS)
・Account kind: StorageV2 (general purpose v2)
・Location: Japan East, Japan West
*Type of blob is Block blob
VM info:
・OS: Linux (CentOS 7.8.2003)
・SKU: centos-76
・Size: Basic A2 (2 vcpus, 3.5 GiB memory)
・Location: Japan East
Hello there,
First of all, thanks again for the great R interface to Azure you provide, hongooi73, and your strong support.
It seems that the AzureStor functions do not support uploading files and folders whose names contain UTF-8 characters.
As the following snap shows, it is possible to upload a folder and files with UTF-8 accents to a blob container; these were uploaded with Microsoft Azure Storage Explorer:
If I list the above container, I get the following output:
> AzureStor::list_storage_files(container = cont)
name size isdir
1 Img NA TRUE
2 Img/1-B1.png 170273 FALSE
3 Img/2-F NA TRUE
4 Img/2-F/RB1.png 397330 FALSE
5 Img/2-F/RB1.tif 519996 FALSE
6 Img/3-R1-é.tif 215388 FALSE
7 Img/4-ùûüÿ€àâæçéèêëïîôœ—– NA TRUE
8 Img/4-ùûüÿ€àâæçéèêëïîôœ—–/1-B1.png 170273 FALSE
9 Img/4-ùûüÿ€àâæçéèêëïîôœ—–/3-R1-é.tif 215388 FALSE
I would like to upload the same folder to an Azure Blob container with R commands instead of Microsoft Azure Storage Explorer :
└───Img
│ 1-B1.png
│ 3-R1-é.tif
│
├───2-F
│ RB1.png
│ RB1.tif
│
└───4-ùûüÿ€àâæçéèêëïîôœ—–
1-B1.png
3-R1-é.tiff
Thus I use the following commands to upload:
library("AzureStor", "AzureRMR", "AzureAuth")
# without AZcopy ##########
#login ####
token <- AzureAuth::get_azure_token(
resource = sto_url,
tenant = tenant,
app = aad_id,
username = az_log,
password = az_pwd,
auth_type = "resource_owner",
use_cache = F
)
stopifnot(AzureAuth::is_azure_token(token))
stopifnot(token$validate())
endp_tok <- AzureStor::blob_endpoint(sto_url, token = token)
cont <- AzureStor::storage_container(endp_tok, cont_name)
# uploading ####
img_path_from <- "./Img"
# simple recursive uploading
img_to <- "ImgS"
files <- list.files(path = img_path_from, recursive = TRUE)
for (i in 1:length(files)) {
AzureStor::storage_upload(
container = cont,
src = paste0(img_path_from, "/", files[i]),
dest = paste0(img_to, "/", files[i]) ,
use_azcopy = F
)
}
# multiple uploading
img_to <- "ImgM"
storage_multiupload(
container = cont,
src = paste0(img_path_from, "/*"),
dest = img_to,
use_azcopy = F,
recursive = TRUE
)
And I got the following errors:
# simple recursive uploading
|==============================================================================| 100%
|==============================================================================| 100%
|==============================================================================| 100%
|================== | 23%
Error in process_storage_response(response, match.arg(http_status_handler), :
Bad Request (HTTP 400). Failed to complete Storage Services operation. Message:
InvalidUri
The requested URI does not represent any resource on the server.
RequestId:57e7118e-401e-004e-065c-5ba0ae000000
Time:2020-07-16T10:30:23.2985328Z.
# multiple uploading
Creating background pool
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: Bad Request (HTTP 400). Failed to complete Storage Services operation. Message:
InvalidUri
The requested URI does not represent any resource on the server.
RequestId:a31743fd-d01e-004c-0b5c-5b1e16000000
Time:2020-07-16T10:31:30.9125139Z.
Only the files without UTF-8 accents in their paths were uploaded, with or without multiupload:
> # listing ####
> AzureStor::list_storage_files(container = cont)
name size isdir
1 ImgM NA TRUE
2 ImgM/1-B1.png 170273 FALSE
3 ImgM/2-F NA TRUE
4 ImgM/2-F/RB1.png 397330 FALSE
5 ImgM/2-F/RB1.tif 519996 FALSE
6 ImgS NA TRUE
7 ImgS/1-B1.png 170273 FALSE
8 ImgS/2-F NA TRUE
9 ImgS/2-F/RB1.png 397330 FALSE
10 ImgS/2-F/RB1.tif 519996 FALSE
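A possible workaround sketch while non-ASCII names are rejected: percent-encode each path component of the destination before uploading, leaving the "/" separators intact. utils::URLencode is base R; whether the service then stores the decoded name depends on how the package builds the request URL, so treat this as an experiment rather than a fix:

```r
# Percent-encode each component of a blob path, preserving "/"
encode_blob_path <- function(path)
{
    parts <- strsplit(path, "/", fixed=TRUE)[[1]]
    paste(vapply(parts, utils::URLencode, character(1), reserved=TRUE),
          collapse="/")
}
```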
I tried with use_azcopy = TRUE, but it was likewise impossible to upload the files with UTF-8 characters, as the outputs below show:
# a simple file name
# 1-B1.png
Using azcopy binary C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe
Running "C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe" copy ./Img/1-B1.png \
"https://#############notshow############/ImgS/1-B1.png" \
--blob-type BlockBlob --block-size-mb 16
INFO: Scanning...
INFO: AZCOPY_OAUTH_TOKEN_INFO is set.
INFO: Authenticating to destination using Azure AD
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
Job 6fdac609-3340-0943-4b14-9e6ee066d277 has started
Log file is located at: C:\Users\ajallais\.azcopy\6fdac609-3340-0943-4b14-9e6ee066d277.log
0.0 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total,
Job 6fdac609-3340-0943-4b14-9e6ee066d277 summary
Elapsed Time (Minutes): 0.0333
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 170273
Final Job Status: Completed
# a more complicated path name
# 3-R1-é.tif
Running "C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe" copy "./Img/3-R1-é.tif" \
"https://#############notshow############/ImgS/3-R1-é.tif" \
--blob-type BlockBlob --block-size-mb 16
INFO: Scanning...
INFO: AZCOPY_OAUTH_TOKEN_INFO is set.
INFO: Authenticating to destination using Azure AD
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support
failed to perform copy command due to error: cannot start job due to error: cannot scan the path \\?\C:\Users\ajallais\Documents\Script\GIT-Script\Script\R\TestBigFiles\7-UTF8Char\Img\3-R1-�.tif, please verify that it is a valid.
Error in processx::run(get_azcopy_path(), args, env = env, echo_cmd = !silent, :
System command 'azcopy.exe' failed, exit status: 1, stdout & stderr were printed
Type .Last.error.trace to see where the error occured
> .Last.error.trace
Stack trace:
1. AzureStor::storage_upload(container = cont, src = paste0(img_path_from, ...
2. AzureStor:::storage_upload.blob_container(container = cont, src = paste0(img_path_f ...
3. AzureStor:::upload_blob(container, ...)
4. AzureStor:::azcopy_upload(container, src, dest, type = type, ...
5. AzureStor:::call_azcopy_from_storage(container$endpoint, "copy", ...
6. AzureStor:::call_azcopy(..., env = auth$env)
7. processx::run(get_azcopy_path(), args, env = env, echo_cmd = !silent, ...
8. throw(new_process_error(res, call = sys.call(), echo = echo, ...
x System command 'azcopy.exe' failed, exit status: 1, stdout & stderr were printed
My environment parameters are the following:
I would appreciate any help or suggestions.
Have a good summer holiday if you are taking one!
I am writing R code in an Azure Databricks notebook that is supposed to explore a data lake, since the latter is supplied by many (partially unknown) sources. Using AzureStor, I started with the following code
AzureStor::adls_endpoint(endpoint = "https://<myDLName>.dfs.core.windows.net", key = myAccessKey) -> endP
AzureStor::list_storage_containers(endP) -> storage_containers
paste0("https://", myDLName,".dfs.core.windows.net/", names(storage_containers)[1] ) -> path
AzureStor::adls_filesystem(path, myAccessKey) -> myFileSys
AzureStor::list_adls_files(myFileSys, "/") -> myFiles
which outputs an R data.frame named "myFiles" with columns like "name" and "isDirectory", which I really appreciate. Now, if for some row of the "myFiles" data.frame "isDirectory" is TRUE, I would like my code to continue the exploration, i.e. look into the directory, list the objects in there, and so on. Just extending the endpoint URL by "/myFolder" and using this
At the end of the day, the task is to read all files (mostly .csv) which are present in an unknown folder structure into the session. How can that be achieved using AzureStor?
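Recent AzureStor versions accept recursive=TRUE in list_adls_files, but the walk can also be written by hand. The sketch below takes the listing function as an argument so the traversal logic stays separate from the Azure call; in real use you would pass something like function(d) AzureStor::list_adls_files(myFileSys, d):

```r
# list_dir(dir) must return a data.frame with columns `name` and
# `isDirectory`, as AzureStor::list_adls_files does.
collect_files <- function(list_dir, dir="/")
{
    df <- list_dir(dir)
    files <- df$name[!df$isDirectory]
    for(d in df$name[df$isDirectory])
        files <- c(files, collect_files(list_dir, d))
    files
}
```

The resulting paths can then be read one by one, e.g. with storage_download and read.csv.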
Hi,
First of all, thanks again for the great R interface to Azure you provide, and your strong support.
I was wondering how I could display the progress bar of the following job:
storage_multiupload(container = contAC, src=paste0(img_out_path, "/*"), dest=folder_name, use_azcopy=FALSE, recursive = TRUE)
Indeed, even though I have run the following command beforehand: options(azure_storage_progress_bar=TRUE), nothing is printed to the prompt.
Sometimes (in a RShiny app, but not during the execution of a single command at the prompt) it says: creating background pool, but nothing more.
Is it because I do not have enough files to send, or because my connection is too fast? Or is it because I am not using the options() command properly?
Do you have any recommendations?
Thanks in advance for your help,
Sincerely yours
When you have a large blob container with a deep hierarchy, it's often more efficient to traverse it as a hierarchy. This is supported in the REST API:
https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs
As well as the .NET API:
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-list?tabs=dotnet
Currently, the call to is_endpoint_url
enforces the default Azure URL format, also documented in the source code:
endpoint URL must be of the form {scheme}://{acctname}.{type}.{etc}
However, the Azure storage emulator tool Azurite does not follow this schema.
Instead of <http|https>://<account-name>.<service-name>.core.windows.net/<resource-path>, Azurite uses http://<local-machine-address>:<port>/<account-name>/<resource-path>.
Is it possible to disable the format check of the endpoint URL when calling the blob_endpoint function directly?
rdevstor1 <- rg$get_storage("rdevstor1")
should read
rdevstor1 <- rg$get_storage_account("rdevstor1")
I am using the AzureSDK storage emulator for development.
https://docs.microsoft.com/en-us/azure/storage/common/storage-use-emulator
Can anyone help me with creating the storage endpoint to access the Azure storage emulator?
Tried the following:
library(AzureStor)
endp <- storage_endpoint("http://127.0.0.1:10000/devstoreaccount1")
Error: Unknown endpoint type
endp<-storage_endpoint("UseDevelopmentStorage=true")
Error: Unknown endpoint type
First off, I wanted to say that this package is great. Nice work.
Looking at the documentation, it appears I can upload individual objects with rawConnection and storage_upload, or multiple files with storage_multiupload.
Is it possible to use storage_multiupload to upload multiple objects that are in memory in R to blob/ADLS storage? If not, is this something that could be added as an enhancement, if it's technically feasible?
I know I can write the objects to temp storage and then multi-upload the individual files, but that incurs unnecessary file I/O and can be inefficient, especially for the large objects we have in memory.
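Until connections are supported there, a workaround sketch is to loop over the in-memory objects and call storage_upload with a text connection for each. `upload_objects` is an illustrative helper, and the sequential loop gives up the parallelism of storage_multiupload:

```r
# objs: named list of character strings already serialized in memory
# (e.g. CSV or JSON text); the names are used as destination paths.
upload_objects <- function(cont, objs)
{
    for(nm in names(objs))
        AzureStor::storage_upload(cont, src=textConnection(objs[[nm]]),
                                  dest=nm)
}
```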
thank you very much for your time.
anand
Hi,
First of all, thanks for the work you do on the AzureR family!
As shown in the following snap:
(The above files were uploaded with the following command: storage_multiupload(cont, paste0(img_path,"*"), dest="azcopy", use_azcopy=TRUE))
When I list the different elements of a blob container with the following command: list_storage_files(cont, info="all"), the isdir column indicates isdir=TRUE even when the element is an image.
I can manage what I want to do with the Content-Type column, but I am still wondering whether the isdir column indicates if the element is a file or a directory. If that is not the case, I think the name isdir could be reformulated; what do you think?
Sincerely yours,
The requests for new pages in list_blobs() don't include the prefix parameter.
This means filtering on dir only applies to the first page of blobs listed. All other pages are unfiltered.
AzureStor/R/blob_client_funcs.R
Line 342 in b0dcdff
It would be helpful if one could define a query in the call to list_storage_files(), e.g. files modified after some date, or simply maxresult = some value. Not sure if that is even possible in the REST API, but it seems so judging from MS Docs: REST API Reference - List Blobs. A similar question was asked on SO in the context of a C# client; see StackOverflow: Getting the latest file modified from Azure Blob.
Hi,
First of all, thanks again for the great R interface to Azure you provide, and your strong support.
I was wondering how I could display, or better yet record in an object (a dataframe, for example), the summary of a multiupload job run with the following command:
storage_multiupload(container = contAC, src=paste0(img_out_path, "/*"), dest=folder_name, use_azcopy=FALSE, recursive = TRUE)
I know that it is possible to display such information with use_azcopy=TRUE, but I prefer not to use it.
Thanks in advance for your help,
Sincerely yours
Hi. I'm doing some batch processing and I would like after each batch the R process to write into a queue so that other parts of the pipeline can pick up the process.
Is there any time-frame for azure storage queue support?
I am using download_blob inside lapply to download a list of blobs. When I have a large list of blobs, it sometimes (not always) fails after downloading some of them, with the error below. Any thoughts on a better way to code this? Is there some way to avoid it? The problem is that the program hangs and does not throw an error until the Esc key is hit by the user. Thank you in advance!
lapply(unique(totag_sub$download_path),
function(x){
download_blob( container = cont_image,
src = x,
dest =paste0(tagging_location,"/",x) ,overwrite = T)
})
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|=========== | 9%
Connection error, retrying (1 of 10)
|
Error in download_blob_internal(container, src, dest, blocksize = blocksize, :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:a1a2e308-901e-0062-5902-c4087b000000
Time:2020-01-05T19:57:57.9797635Z
Request date header too old: 'Sun, 05 Jan 2020 11:42:07 GMT'.
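The "Request date header too old" message suggests a stalled transfer was retried with a stale signed request. A workaround sketch is to wrap each download in a fresh tryCatch retry so a failed call is re-issued (and re-signed) from scratch; `with_retries` is an illustrative helper, and the retry count and sleep are arbitrary choices:

```r
# Retry a zero-argument function that may fail transiently,
# re-invoking it from scratch on each attempt.
with_retries <- function(f, retry_times=3, sleep=5)
{
    res <- NULL
    for(i in seq_len(retry_times)) {
        res <- tryCatch(f(), error=function(e) e)
        if(!inherits(res, "error"))
            return(res)
        Sys.sleep(sleep)
    }
    stop(res)
}
```

Inside the lapply, each download then becomes with_retries(function() download_blob(cont_image, src=x, dest=paste0(tagging_location, "/", x), overwrite=TRUE)).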
Hello,
Is there a way to avoid the prompt when we load the package?
This might be an issue when it is run as a script.
What it says on the tin: list_adls_files returns only the first 5000 files when used on a directory with more than 5000 files.
Since I do not know of any public data lakes with a large number of files, I cannot give a reproducible example, but the following fails when used on an ADLS gen2 data lake filesystem I have access to:
library(AzureAuth)
library(AzureStor)
token <- get_azure_token(resource="https://storage.azure.com",
tenant=TENANT_ID,
app=SERVICE_PROV_ID,
password=SECRET)
lake <- storage_endpoint("https://STORAGE_ACCOUNT.dfs.core.windows.net", token=token)
fs <- storage_container(lake, "FILESYSTEM_NAME")
files <- list_adls_files(fs, "PATH_TO_DIR_WITH_+5000_FILES", "name", FALSE)
length(files) == 5000 results in TRUE.
This results in storage_multidownload silently(!) downloading fewer files than expected.
I am writing an application in a Databricks notebook using the R language. The code must be capable of looking into storage containers located in a Data Lake gen2 in order to list all their contents.
So far I am doing the following for development
endpoint <- AzureStor::adls_endpoint(endpoint = "https://<myStorageName>.dfs.core.windows.net", key = "<myStorageKey>")
storage_containers <- AzureStor::list_storage_containers(endpoint)
which works perfectly for me. This way I see all the containers in the data lake and can access them.
However, in order to become productive I must ban the key from the code, which I hope is possible in principle using Key Vault, since reading in a file works for me like this:
%py spark.conf.set("fs.azure.account.key.<myStorageName>.dfs.core.windows.net", dbutils.secrets.get(scope = "myScopeName", key = "mySecret") )
SparkR::read.df(path = "abfss://<myContainerName>@<myStorageName>.dfs.core.windows.net/<myFolderName>/<myFileName>.csv", source = "csv") -> mySparkDataFrame
I started by trying to define the key vault applying the AzureKeyVault package
vault <- AzureKeyVault::key_vault(url ="https://<myKeyVaultName>.vault.azure.net")
but this query runs forever, such that I must cancel it manually. I guess this is an authentication issue?!
So, what is the correct way of defining a storage endpoint using the secret stored in a key vault, or of authenticating, respectively?
Thanks in advance for the support!
Hi,
I'm not exactly sure how to work with a SAS key using this package.
ep <- storage_endpoint("https://xxx.blob.core.windows.net/", sas=my_key)
list_storage_containers(ep)
#Error in do_storage_call(endpoint$url, "/", options = list(comp = "list"), :
# Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
#AuthenticationFailed
#Server failed to authenticate the request. Make sure the value of Authorization header is formed #correctly including the signature.
#RequestId:b7275fcc-f01e-0020-613b-9bb78c000000
#Time:2019-11-14T22:29:55.5202907Z
#The specified signed resource is not allowed for the this resource level.
(cont=storage_container(ep, the_container_name))
#Azure blob container 'the_container_name'
#URL: https://xxx.blob.core.windows.net/the_container_name
#Access key: <none supplied>
#Azure Active Directory access token: <none supplied>
#Account shared access signature: <hidden>
#Storage API version: 2018-03-28
list_storage_files(container = cont)
#Error in do_storage_call(endp$url, path, options = options, headers = headers, :
# Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
#AuthenticationFailed
#Server failed to authenticate the request. Make sure the value of Authorization header is formed #correctly including the signature.
#RequestId:b7275fdd-f01e-0020-703b-9bb78c000000
#Time:2019-11-14T22:29:55.5332998Z
#Signature did not match. String to sign used was
#
#
#/xxx/the_container_name
#sas_id.
I also attached the access rights of the two SAS keys I tried; neither works.
Is it possible to write an internal package repository to Azure storage and then read it from there (without mounting the storage)? An example could use the miniCRAN or drat package for creating the repositories.
The storage_upload function doesn't have a parameter to enable sending the MD5 along with the file.
I can calculate the base64-encoded MD5 and manually add the header "x-ms-blob-content-md5" to a PUT operation, and it works fine (similar to what I think you get using the --put-md5 option for azcopy).
Is there some other way I'm missing to do this with less effort? If not, perhaps adding an option to the storage_upload function would be nice.
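For the manual route described above, the base64-encoded MD5 can be computed with the openssl package; the header name x-ms-blob-content-md5 comes from the Blob REST API, and wiring it into the request is left to whatever low-level call is used:

```r
library(openssl)

# Base64-encoded MD5 of a file's contents, in the form expected
# by the x-ms-blob-content-md5 header
blob_content_md5 <- function(filename)
    base64_encode(md5(file(filename)))
```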
Calling list_blobs(cont) with an HNS-enabled account results in
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Current retry logic fails to check expiry date of OAuth token, which means that for a long-running transfer, the token can expire during a retry. Solution is to move the token validation check inside the retry loop.
I created an Azure workspace and a compute instance; after the compute instance was created, I launched RStudio from the column named Application URI. RStudio loads, but when I run the command install.packages("AzureStor"), the installation fails with the following error:
Using PKG_CFLAGS=
Using PKG_LIBS=-lxml2
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libxml-2.0 was not found. Try installing:
ERROR: configuration failed for package ‘xml2’
I have been using upload_blob in production for months, and just 3 days ago it began to fail. I am uploading a CSV from an active R session to a blob storage container. This is not a reprex, but here is the code.
It begins to upload (rapidly at first) and then pauses for several minutes before printing "Connection error, retrying". Eventually, I get an error message. Thoughts? The CSV is only 4510 rows of 11 columns, so not a big file, and this used to run almost instantly.
# create a text connection to the tagged data to upload
con_tagged<-textConnection(object = format_csv(tagged_upload))
# upload it to the blob container
upload_blob(cont_labels, src=con_tagged,
dest=paste0("tagged_",round(as.numeric(Sys.time())*1000),".csv"),
use_azcopy = F)
|======================================================== | 55%Connection error, retrying (1 of 10)
|===================== | 21%Connection error, retrying (2 of 10)
|================== | 17%Connection error, retrying (3 of 10)
|========================================== | 41%Connection error, retrying (4 of 10)
|============== | 14%Error in process_storage_response(response, match.arg(http_status_handler), :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:4945a495-901e-001a-479e-ed322b000000
Time:2020-02-27T18:46:45.1205751Z
Request date header too old: 'Thu, 27 Feb 2020 18:31:44 GMT'.
I just tried the example from the docs:
copy_url_to_storage(
container = my_container,
src = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
dest = "iris.csv")
For me it fails with following error:
Error in UseMethod("copy_from_url") : no applicable method for 'copy_from_url' applied to an object of class "c('blob_container', 'storage_container')"
My container object is properly created.
> class(mycontainer)
[1] "blob_container" "storage_container"
I can call other methods on it that work, e.g. list_storage_files(mycontainer)
My current version is AzureStor_2.1.1.9000.
Is there a possibility to read/write (any file type) directly into R memory from ADLS gen2 storage? For example: on a gen2 file storage I have a text.csv file which I would like to read with read_csv into a dataframe.
The current solution I came up with is to first download the file to a temporary local directory and then use read_csv.
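AzureStor's download functions already support dest=NULL, which returns the file contents as a raw vector instead of writing to disk (as used elsewhere on this page with download_blob); a sketch of reading a CSV straight into a data.frame, assuming a text file:

```r
# Read a CSV from a storage container straight into a data.frame,
# with no temporary file on disk.
read_storage_csv <- function(container, path)
{
    raw_bytes <- AzureStor::storage_download(container, src=path, dest=NULL)
    read.csv(text=rawToChar(raw_bytes))
}
```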
Hi,
There appears to be a bug in AzureStor::list_adls_files() when more than 5k files exist in the filesystem. An -almost- reprex (I presumably shouldn't share our key :) ):
library(AzureStor)
endpoint_url <- "<endpoint>.dfs.core.windows.net"
filesystem_name <- "<filesystem_name>"
key = "<access_key>"
endpoint <- AzureStor::storage_endpoint(endpoint = endpoint_url, key = key)
filesystem <- AzureStor::adls_filesystem(endpoint, name = filesystem_name)
AzureStor::list_adls_files(filesystem, recursive = TRUE)
gives an error:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
In our case, the repeat block in list_adls_files generates an eight-column dataframe in the first loop, with names:
> names(out)
[1] "contentLength" "etag" "group" "isDirectory" "lastModified" "name" "owner"
[8] "permissions"
In the second loop, the content of out has only seven columns, missing the "isDirectory" column:
> names(out2)
[1] "contentLength" "etag" "group" "lastModified" "name" "owner" "permissions"
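The failure matches rbind()'s requirement that all pieces share the same columns; a sketch of the kind of fix, padding each page with NA for any missing columns before binding (illustrative, not the package's actual patch):

```r
# rbind two data.frames that may have different columns, filling
# missing columns with NA
rbind_fill <- function(df1, df2)
{
    all_cols <- union(names(df1), names(df2))
    for(col in setdiff(all_cols, names(df1))) df1[[col]] <- NA
    for(col in setdiff(all_cols, names(df2))) df2[[col]] <- NA
    rbind(df1[all_cols], df2[all_cols])
}
```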
Good morning,
what package do you recommend to access ADLSgen1 ?
Thank you
Deletes PARENT dirs, not contents of subdir (!)
Hi,
I am unable to download blobs from my storage account. I followed the documentation in the readme; this is my code:
I can successfully create an endpoint and get a container with
bl_endp_key <- storage_endpoint("url", "key")
cont <- storage_container(bl_endp_key, "container_name")
I can further successfully list my containers and files in the blob like this:
list_storage_containers(bl_endp_key)
list_storage_files(cont)
But when I try to download a blob, the call takes very long and fails due to a connection error. It then tries to reconnect 10 times and fails every time. This is the code I use:
rawvec <- download_blob(cont, src="blob_name", dest=NULL)
My container has container access level (anonymous read access for containers and blobs). I used the Azure storage account access key to create the endpoint. I can access the same blob in python using the same storage account access key.
Do you know what I am doing wrong?
Is support for Azure table storage on the roadmap? We are looking for functionality like this: table storage python.
Or is there another package which provides this kind of functionality?