azure / azurestor Goto Github PK

View Code? Open in Web Editor NEW

61.0 18.0 18.0 756 KB

R interface to Azure storage accounts

License: Other

R 100.00%

r azure-storage azure-storage-blob azure-storage-file azure-data-lake azure-sdk-r

azurestor's Issues

Support for append/edit content to existing files

Hi there!

First of all thanks for this great work in the AzureR package, I'm using actively AzureStor and your package works awesome. Looking at the documentation there are no details about editing or appending content to a file already located in a Storage Container.

My current use case is to save logs in Azure Storage. Is it possible to consider this enhancement? On the contrary case, do you consider there is any workaround for this case in the current AzureStor?

Thank you! 🤓

Yor Castaño

download_from_url function not working

I am trying to use the download_from_url function from the AzureStor package. I keep getting the below error

Error in download_blob_internal(container, src, dest, blocksize = blocksize, : Forbidden (HTTP 403). Failed to complete Storage Services operation. Message: .

I know that it should work and the permissions are set up correctly for the storage account because I am able to download the blob using Python.

'AzureStor' 2.0.1 is being loaded, but >= 3.0.0 is required

> devtools::install_github("Azure/AzureQstor")
Downloading GitHub repo Azure/AzureQstor@master
from URL https://api.github.com/repos/Azure/AzureQstor/zipball/master
Installing AzureQstor
"C:/PROGRA~1/R/R-34~1.3PA/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL  \
  "C:/Users/Hanieh_local/AppData/Local/Temp/RtmpExcszx/devtools6f58179f2f93/Azure-AzureQstor-cb830d7"  \
  --library="C:/Users/Hanieh_local/Documents/R/win-library/3.4" --install-tests 

* installing *source* package 'AzureQstor' ...
** R
** tests
** preparing package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace 'AzureStor' 2.0.1 is being loaded, but >= 3.0.0 is required
ERROR: lazy loading failed for package 'AzureQstor'
* removing 'C:/Users/Hanieh_local/Documents/R/win-library/3.4/AzureQstor'
In R CMD INSTALL
Installation failed: Command failed (1)

Add blob transfer from URL

Creation-Time is not part of the properties, but is expected in list_blobs

Using an earlier version of AzureStor, we were able to successfully upload files into the blob storage and retrieve them.
After the update to version 3.0.0 the list_storage_files function fails with the following error message:

Error in as.POSIXct.default(df$`Creation-Time`, format = "%a, %d %b %Y %H:%M:%S", : do not know how to convert 'df$`Creation-Time`' to class “POSIXct”

After soem investigation, we have found this issue to be causes by line 332 in: https://github.com/Azure/AzureStor/blob/36317e84a81e3f6496324d3efc6faf56c65e233e/R/blob_client_funcs.R

In our case the Creation-Time is not part of the properties returned on this line:
lst <- res$Blobs

Our initial thought was to delete the files and upload them again using AzureStor 3.0.0, unfortunately, this didn't solve the issue at hand.

function to move or copy blobs from one container to a different container

I am wanting to move or copy blobs between two containers that may or may not be in the same storage account. It would be nice to have a method to do so. Also, it would be nice to be able to supply a vector of names to have a multi copy version or this.

GET/SET Blob Metadata

Hia, awesome package. I've struggled to figure out how to extend or add new methods.

Is it possible to add the Storage API Metadata functions?

list_blob_containers, missing nextmarker iteration

Hi all,

I notice list_blob_containers does not iterate on the NextMarker like the list_blobs function.

I need list_blob_containers to return the entire list of containers from my account and not just the first elements.

Example :

> connection<- AzureStor::blob_endpoint(endpoint = endpoint, key = key)
> names(AzureStor::list_blob_containers(connection))
[1] "db-tab-1"             "db-tab-2"

But in reality, I have more containers, I should have something like

> connection<- AzureStor::blob_endpoint(endpoint = endpoint, key = key)
> names(AzureStor::list_blob_containers(connection))
[1] "db-tab-1"             "db-tab-2"             "db-tab-3"             "db-tab-4"

So to fix that, just include the iteration on NextMarker 😄 or did I miss something on my code? 🤔

Thank you!

storage_download fails on very small file

The storage_download function may work on larger files, but I caused weird timeouts on very small files (the iris dataset, as shown in the documentation examples). I put together a reproducible example to demonstrate. I tracked down the place in the code where the process hangs to the place where the size of the file is computed for the progress bar, but haven't yet figured out exactly what about the logic is causing it to hang.

Perhaps the source of the issue will be obviously apparent to someone more familiar with the package. If not, perhaps the logic could instead be borrowed from the list_blobs function... it may not be as exact, but I had no trouble getting file sizes for the small files using this function.

Code to produce error:

username="myusername"
storage_account<-"mystorageaccountname"
container_name<-"mycontainername"
access_key<-"myaccesskey"

storage_endpoint_url<-"https://${storage_account}.blob.core.windows.net"
storage_endpoint_url_resolved<-glue::glue(storage_endpoint_url,.open="${")

bl_endp_key <- AzureStor::storage_endpoint(storage_endpoint_url_resolved, 
                                    key=access_key
                                    )
AzureStor::list_storage_containers(bl_endp_key)

cont<-AzureStor::create_storage_container(bl_endp_key,name=container_name)
json <- jsonlite::toJSON(iris, pretty=TRUE, auto_unbox=TRUE)
con <- textConnection(json)
AzureStor::storage_upload(cont, src=con, dest="iris.json")

AzureStor::list_blobs(cont)
### will fail on next line (times out ten times in a row)
## it gets stuck here: https://github.com/Azure/AzureStor/blob/a71fcf88a9bd7b97c8ab221aaf3ced813d0265db/R/blob_transfer_internal.R#L83
## if I change the http_verb from "HEAD" to "GET" it finishes, so perhaps there's something wrong in the logic
## with regard to very small file sizes?  
rawvec <- AzureStor::storage_download(cont, src="iris.json", dest=NULL)
rawToChar(rawvec)


### alternate workaround to show the file is fine and downloadable

tenant_id<-"mytenantid"
subscription_id<-"mysubscriptionid"
resource_group_name<-"myresourcegroupname"

az <- AzureRMR::create_azure_login(tenant=tenant_id,auth_type="device_code")
sub <- az$get_subscription(subscription_id)
rg <- sub$get_resource_group(resource_group_name)
stor <- rg$get_resource(type="Microsoft.Storage/storageAccounts", name=storage_account)
rdevstor1 <- rg$get_storage_account(storage_account)
my_sas<-rdevstor1$get_account_sas(permissions="r",services="bf", start=Sys.Date(), expiry=Sys.Date() + 31)

my_blob_endpoint<-rdevstor1$get_blob_endpoint()
l_endp_sas <- AzureStor::storage_endpoint(storage_endpoint_url_resolved, sas=my_sas)
adm_cont <- AzureStor::storage_container(l_endp_sas, container_name)

test_url<-paste0(storage_endpoint_url_resolved,"/",container_name,"/iris.json?",my_sas)
test_download<-jsonlite::fromJSON(test_url)
testthat::expect_equal(test_download,jsonlite::fromJSON(jsonlite::toJSON(iris)))

Endpoint print function is missing a forward slash

I just tested it for blob but assume issue may prevail for other endpoint types. You seem to be missing a forward slash in the print function print.blob_container <- function(x, ...) of the blob endpoint object in AzureStor/R/blob_client_funcs.R.

The function returns as follows

$ cat(sprintf("URL: %s\n", paste0("https://myblob.blob.core.windows.net", "mycontainer")))
> URL: https://myblob.blob.core.windows.netmycontainer

when it should return

URL: https://myblob.blob.core.windows.net/mycontainer

I could try a pull request but have never done that before on some open source GitHub R project so not sure how helpful that would be (as it would also take some time getting used to the conventions and such). Lmk.

Resume transfer after interruption

list_storage_files recursive = FALSE does not work

I am trying to list the folders within a specific container using the following code

AzureStor::list_storage_files(
  containers[["data"]],
  dir = "path/to/folder",
  recursive = FALSE,
  info = "name"
)

However the recursive = FALSE option seems to have no effect and presents the file paths for all files in each of the folders (i.e. recursive = TRUE).

Looking into this, I cannot see where recursive is used within list_blobs() even though it is an option to that function.

storage_download fails( Forbidden (HTTP 403))

hi,
When I try to get the Blob storage data from the Rstudio Server in my VM (CentOS7), I get the following error massage.

Error in download_blob_internal(container, src, dest, blocksize = blocksize,  : 
  Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:

Could you please tell me the solution to this problem?
Here's the code I tried

library(AzureStor, lib.loc = "/usr/lib64/R/library")
endp <- storage_endpoint("https://mystorageaccount.blob.core.windows.net", key="myaccesskey")
cont <- storage_container(endp, "mycontainer")
storage_download(cont, "myblob.csv", 
                 "myblob_local.csv",
                 overwrite = T)

.
sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] AzureStor_3.2.3

loaded via a namespace (and not attached):
 [1] httr_1.4.2      compiler_3.6.0  R6_2.4.1        AzureRMR_2.3.5 
 [5] tools_3.6.0     curl_4.3        rappdirs_0.3.1  AzureAuth_1.2.4
 [9] mime_0.9        openssl_1.4.2   askpass_1.1

blob storage info:
・ Performance/Access tier：Standard/Hot
・Replication：Read-access geo-redundant storage (RA-GRS)
・ account kind：StorageV2 (general purpose v2)
・ location：Japan East, Japan West
*type of blob is Block blob

VM info
・ OS：Linux (centos 7.8.2003)
・ SKU：centos-76
・ size：Basic A2 (2 vcpus, 3.5 GiB memory)
・ location：Japan East

Uploading file and folder with a non-ascii char in their name

Hello there,

First of all, thanks again for the great R interface to Azure you provide hongooi73, and your strong support.

It seems that the AzureStor functions do not support the uploading of files and folders which contain UTF-8 characters.

As the following snap shows, it is possible to upload folder and file with UTF-8 accents in a blob container, uploaded here with Microsoft Azure Storage Explorer :

If I list the above container, I got the following output

> AzureStor::list_storage_files(container = cont)
                                  name   size isdir
1                                  Img     NA  TRUE
2                         Img/1-B1.png 170273 FALSE
3                              Img/2-F     NA  TRUE
4                      Img/2-F/RB1.png 397330 FALSE
5                      Img/2-F/RB1.tif 519996 FALSE
6                       Img/3-R1-é.tif 215388 FALSE
7            Img/4-ùûüÿ€àâæçéèêëïîôœ—–     NA  TRUE
8   Img/4-ùûüÿ€àâæçéèêëïîôœ—–/1-B1.png 170273 FALSE
9 Img/4-ùûüÿ€àâæçéèêëïîôœ—–/3-R1-é.tif 215388 FALSE

I would like to upload the same folder to an Azure Blob container with R commands instead of Microsoft Azure Storage Explorer :

└───Img
    │   1-B1.png
    │   3-R1-é.tif
    │
    ├───2-F
    │       RB1.png
    │       RB1.tif
    │
    └───4-ùûüÿ€àâæçéèêëïîôœ—–
            1-B1.png
            3-R1-é.tiff

Thus I use the folowing command to upload:

library("AzureStor", "AzureRMR", "AzureAuth")

# without AZcopy ##########

#login ####
token <- AzureAuth::get_azure_token(
  resource = sto_url,
  tenant = tenant,
  app = aad_id,
  username = az_log,
  password = az_pwd,
  auth_type = "resource_owner",
  use_cache = F
)
stopifnot(AzureAuth::is_azure_token(token))
stopifnot(token$validate())

endp_tok <- AzureStor::blob_endpoint(sto_url, token = token)
cont <- AzureStor::storage_container(endp_tok, cont_name)

# uploading ####
img_path_from <- "./Img"

# simple recursive uploading
img_to <- "ImgS"
files <- list.files(path = img_path_from, recursive = TRUE)
for (i in 1:length(files)) {
  AzureStor::storage_upload(
    container = cont,
    src = paste0(img_path_from, "/", files[i]),
    dest = paste0(img_to, "/", files[i]) ,
    use_azcopy = F
  )
}

# multiple uploading
img_to <- "ImgM"
storage_multiupload(
  container = cont,
  src = paste0(img_path_from, "/*"),
  dest = img_to,
  use_azcopy = F,
  recursive = TRUE
)

And I got the following errors :

# simple recursive uploading

  |==============================================================================| 100%
  |==============================================================================| 100%
  |==============================================================================| 100%
  |==================                                                            |  23%
  Error in process_storage_response(response, match.arg(http_status_handler),  : 
  Bad Request (HTTP 400). Failed to complete Storage Services operation. Message:
InvalidUri
The requested URI does not represent any resource on the server.
RequestId:57e7118e-401e-004e-065c-5ba0ae000000
Time:2020-07-16T10:30:23.2985328Z.


# multiple uploading

Creating background pool
Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: Bad Request (HTTP 400). Failed to complete Storage Services operation. Message:
InvalidUri
The requested URI does not represent any resource on the server.
RequestId:a31743fd-d01e-004c-0b5c-5b1e16000000
Time:2020-07-16T10:31:30.9125139Z.

Only the files without UTF-8 accent in their path have been uploaded even with or without multiupload:

> # listing ####
> AzureStor::list_storage_files(container = cont)
               name   size isdir
1              ImgM     NA  TRUE
2     ImgM/1-B1.png 170273 FALSE
3          ImgM/2-F     NA  TRUE
4  ImgM/2-F/RB1.png 397330 FALSE
5  ImgM/2-F/RB1.tif 519996 FALSE
6              ImgS     NA  TRUE
7     ImgS/1-B1.png 170273 FALSE
8          ImgS/2-F     NA  TRUE
9  ImgS/2-F/RB1.png 397330 FALSE
10 ImgS/2-F/RB1.tif 519996 FALSE

I tried with use_azcopy = T but it was impossible as well to upload the files with UTF-8 characters, as the below outputs show :

# a simple file name
# 1-B1.png

Using azcopy binary C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe
Running "C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe" copy ./Img/1-B1.png \
  "https://#############notshow############/ImgS/1-B1.png" \
  --blob-type BlockBlob --block-size-mb 16
INFO: Scanning...
INFO: AZCOPY_OAUTH_TOKEN_INFO is set.
INFO: Authenticating to destination using Azure AD
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job 6fdac609-3340-0943-4b14-9e6ee066d277 has started
Log file is located at: C:\Users\ajallais\.azcopy\6fdac609-3340-0943-4b14-9e6ee066d277.log

0.0 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 


Job 6fdac609-3340-0943-4b14-9e6ee066d277 summary
Elapsed Time (Minutes): 0.0333
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 170273
Final Job Status: Completed



# a more complicated path name
# 3-R1-é.tif

Running "C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe" copy "./Img/3-R1-é.tif" \
  "https://#############notshow############/ImgS/3-R1-é.tif" \
  --blob-type BlockBlob --block-size-mb 16
INFO: Scanning...
INFO: AZCOPY_OAUTH_TOKEN_INFO is set.
INFO: Authenticating to destination using Azure AD
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

failed to perform copy command due to error: cannot start job due to error: cannot scan the path \\?\C:\Users\ajallais\Documents\Script\GIT-Script\Script\R\TestBigFiles\7-UTF8Char\Img\3-R1-ï¿½.tif, please verify that it is a valid.

Error in processx::run(get_azcopy_path(), args, env = env, echo_cmd = !silent,  : 
  System command 'azcopy.exe' failed, exit status: 1, stdout & stderr were printed
Type .Last.error.trace to see where the error occured



> .Last.error.trace 

 Stack trace:

 1. AzureStor::storage_upload(container = cont, src = paste0(img_path_from,  ...
 2. AzureStor:::storage_upload.blob_container(container = cont, src = paste0(img_path_f ...
 3. AzureStor:::upload_blob(container, ...)
 4. AzureStor:::azcopy_upload(container, src, dest, type = type,  ...
 5. AzureStor:::call_azcopy_from_storage(container$endpoint, "copy",  ...
 6. AzureStor:::call_azcopy(..., env = auth$env)
 7. processx::run(get_azcopy_path(), args, env = env, echo_cmd = !silent,  ...
 8. throw(new_process_error(res, call = sys.call(), echo = echo,  ...

 x System command 'azcopy.exe' failed, exit status: 1, stdout & stderr were printed

My environment parameters are the following :

R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Platform: x86_64-w64-mingw32/x64 (64-bit)
Microsoft Windows 10 Professionnel
AzCopy 10.4.3
packageVersion('AzureStor') ‘3.2.2’
packageVersion('AzureRMR') ‘2.3.4’
packageVersion('AzureAuth') ‘1.2.4’

I will appreciate any help or suggestion,
Have good summer holidays if you have

Exploration of sub-folders/directories in a file system of a data lake

I am writing R-code in an Azure Databricks notebook that is supposed to explore a data lake since latter is supplied by many (partially unknown) sources. Using AzureStor I started doing the following code

AzureStor::adls_endpoint(endpoint = "https://<myDLName>.dfs.core.windows.net", key = myAccessKey) -> endP
AzureStor::list_storage_containers(endP) -> storage_containers
paste0("https://", myDLName,".dfs.core.windows.net/", names(storage_containers)[1] ) -> path
AzureStor::adls_filesystem(path, mykAccessLKey) -> myFileSys
AzureStor::list_adls_files(myFileSys, "/") -> myFiles

which outputs a R data.frame named "myFiles" having columns like "name" and "isDirectory" which I really appreciate. Now, if for some row of the "myFiles" data.frame "isDirectory" is TRUE, I would my code like to go with the exploration, i.e. look into the directory, list the objects in there and so on. Just extending the endpoint url by "/myFolder" and using this

In the end of the day, the task is to read all files (mostly .csv) into the session wich are present in an unknown folder structure. How can that be achieved using AzureStor?

Progress bar falt with storage_multiupload() even with options(azure_storage_progress_bar=TRUE)

Hi,

First of all, thanks again for the great R interface to Azure, you provide, and your strong support.

I was wondering how could I display the progress bar of the following Job :

storage_multiupload(container = contAC, src=paste0(img_out_path, "/*"), dest=folder_name, use_azcopy=FALSE, recursive = TRUE)

Indeed, even if I have run before the following command : options(azure_storage_progress_bar=TRUE) , nothing is printed to the prompt.
Sometimes in a RShiny app, but not during the execution of a single command in the prompt), it says : creating background pool, but nothing more.

Is it because I have not enough files to send, or that my connection is to fast ?
Or is it cause of I do not use properly the options() command ?
Do you have any recommandations ?

Thanks in advance for your help,
Sincerely yours

Add support for blob hierarchies to list_blobs

When you have a large blob container with a deep hierarchy, it's often more efficient to traverse it like a hierarchy. This is supported in the REST API :

https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs

As well as the .NET API:

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-list?tabs=dotnet

Allow using custom endpoint URL format to support Azurite

Currently, the call to is_endpoint_url enforces the default Azure URL format, also documented in the source code:

endpoint URL must be of the form {scheme}://{acctname}.{type}.{etc}

However, the Azure storage emulator tool Azurite does not follow this schema.

Instead of <http|https>://<account-name>.<service-name>.core.windows.net/<resource-path>, Azurite uses http://<local-machine-address>:<port>/<account-name>/<resource-path>.

Is it possible to disable the format check of the endpoint URL in case of directly calling blob_endpoint function?

typo in readme under admin interface

rdevstor1 <- rg$get_storage("rdevstor1")

should read

rdevstor1 <- rg$get_storage_account("rdevstor1")

Accessing storage emulator from Azure SDK

I am using the AzureSDK storage emulator for development.
https://docs.microsoft.com/en-us/azure/storage/common/storage-use-emulator

Can anyone help me with creating the storage endpoint to access the Azure storage emulator?

Tried the following:
library(AzureStor)
endp <- storage_endpoint("http://127.0.0.1:10000/devstoreaccount1")

Error: Unknown endpoint type

endp<-storage_endpoint("UseDevelopmentStorage=true")
Error: Unknown endpoint type

Support for multiple objects with storage_multiupload

First off, wanted to state that this package is great. Nice work.

Looking at the documentation it appears I can load individual objects with rawConnection and storage_upload or use multiple files with storage_multiupload.

Is it possible to use storage_multiupload to upload multiple objects that are in memory in R to blob/adls storage upstream? If not, is this something we can add as an enhancement if it's technically feasible.

I know I can write the objects to temp storage and then just multi-upload the individual files but it will incur unnecessary file I/O and can be inefficient especially for the large objects we have in memory.

thank you very much for your time.

anand

list_storage_files isdir=TRUE for files after uploading with azcopy

Hi,
First of all, thanks for the works you do with the AzureRFamily !
As it is shown in the following snap :

(The above files were uploaded with the followng command : storage_multiupload(cont, paste0(img_path,"*"), dest="azcopy", use_azcopy=TRUE))

When I listing the different elements of a blob container with the following command :
list_storage_files (cont, info ="all"), the isdir colunm indicates isdir=TRUE even if it is an image.

I can manage what I want to do with the Content-Type column, but I am still wondering if the isdir column indicates whether the element is a file or directory. If it is not the case I think the name isdir could be reformulated, what do you think ?

Sincerely yours,

`list_blobs` doesn't honor `dir` when paging is used.

The requests for new pages in list_blobs() don't include the prefix parameter.
This means filtering on dir only applies to the first page of blobs listed. All other pages are unfiltered.

AzureStor/R/blob_client_funcs.R

Line 342 in b0dcdff

 res <- do_container_op(container, options=list(comp="list", restype="container", marker=res$NextMarker[[1]])) 

Support identity SAS

Limit output of list_storage_files

Would be helpful if one could define a query in the call to list_storage_files(), e.g. files modified after some date or simply maxresult = some value. Not sure if that is even possible in the rest api but it seems so telling form MS Docs: REST API Reference - List Blobs. A similar question was asked on SO in the context of a C# client, see StackOverflow: Getting the latest file modified from Azure Blob.

Recording MutliUpload Job Summary

Hi,

First of all, thanks again for the great R interface to Azure, you provide, and your strong support.

I was wondering how could I display or even better record in an object (as a dataframe for example), the summary of a multiupload job with the following command :

storage_multiupload(container = contAC, src=paste0(img_out_path, "/*"), dest=folder_name, use_azcopy=FALSE, recursive = TRUE)

I know that it is possible to display such informations with use_azcopy=T, but I prefer to not use it.

Thanks in advance for your help,
Sincerely yours

Queue storage package available

Hi. I'm doing some batch processing and I would like after each batch the R process to write into a queue so that other parts of the pipeline can pick up the process.

Is there any time-frame for azure storage queue support?

download_blob fails inside lapply.... sometimes

I am using download_blob inside lapply to download a list of blobs. When I have a large list of blobs it sometimes (not always) fails after downloading some of the blobs with the error below. Any thoughts on a better way to code this? Is there some way to avoid this? The problem is the program hangs and does not through an error until the Esc key is hit by the user. Thank you in advance!

lapply(unique(totag_sub$download_path),
                    function(x){
                      download_blob( container = cont_image,
                                     src = x,
                                     dest =paste0(tagging_location,"/",x) ,overwrite = T)
                    })

|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|=========== | 9%
Connection error, retrying (1 of 10)
|

Error in download_blob_internal(container, src, dest, blocksize = blocksize, :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:a1a2e308-901e-0062-5902-c4087b000000
Time:2020-01-05T19:57:57.9797635Z
Request date header too old: 'Sun, 05 Jan 2020 11:42:07 GMT'.

Support passing OAuth token to azcopy

avoid prompt when library(AzureStor)

Hello,

is there a way to avoid the prompt when we load the package?
This might be an issue when run as a script?

Directories created in blob containers via portal not detected as such

list_adls_files limited to 5000 results

What it says on the tin: list_adls_files returns only the first 5000 files when used on a directory with more than 5000 files.

Since I do not know of any public data lakes with a large number of files I cannot give a reproducible example, but the following fails when used on a ADLS version 2 datalake filesystem I have access to

library(AzureAuth)
library(AzureStor)

token <- get_azure_token(resource="https://storage.azure.com",
                         tenant=TENANT_ID,
                         app=SERVICE_PROV_ID,
                         password=SECRET)

lake <- storage_endpoint("https://STORAGE_ACCOUNT.dfs.core.windows.net", token=token)

fs <- storage_container(lake, "FILESYSTEM_NAME")

files <- list_adls_files(fs, "PATH_TO_DIR_WITH_+5000_FILES", "name", FALSE)

length(files) == 5000

results in TRUE.

This results in storage_multidownload silently(!) downloading fewer files than expected.

Provide async upload/download

Define storage endpoint using key vault

I am writing an application in a databricks notebook using R-language. The code must be capable to look into storage containers located in a data lake gen2 in order to list all their contents.

So far I am doing the following for development

endpoint <- AzureStor::adls_endpoint(endpoint = "https://<myStorageName>.dfs.core.windows.net", key = "<myStorageKey>") storage_containers <- AzureStor::list_storage_containers(endPoint)

which works perfectly for me. This way I see all the containers in the data lake and can access them.

However, in order to become productive I must ban from the code which I hope is possible in principle using key vault since reading in a file works for me like this using

%py spark.conf.set("fs.azure.account.key.<myStorageName>.dfs.core.windows.net", dbutils.secrets.get(scope = "myScopeName", key = "mySecret") )

SparkR::read.df(path = "abfss://<myContainerName>@<myStorageName>.dfs.core.windows.net/<myFolderName>/<myFileName>.csv", source = "csv") -> mySparkDataFrame

I started by trying to define the key vault applying the AzureKeyVault package

vault <- AzureKeyVault::key_vault(url ="https://<myKeyVaultName>.vault.azure.net")

but this query runs forever such that I must cancle it manually. I guess this is an authentification issue?!

So, what is the correct way of defining a storage endpoint using the secret stored in a key vault, or of authentifying, respectively?

Thanks in advance for the support!

How to use with SAS?

Hi,
not exactly sure how to work with SAS key using this package.

ep <- storage_endpoint("https://xxx.blob.core.windows.net/", sas=my_key) 

list_storage_containers(ep)
#Error in do_storage_call(endpoint$url, "/", options = list(comp = "list"),  : 
#  Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
#AuthenticationFailed
#Server failed to authenticate the request. Make sure the value of Authorization header is formed #correctly including the signature.
#RequestId:b7275fcc-f01e-0020-613b-9bb78c000000
#Time:2019-11-14T22:29:55.5202907Z
#The specified signed resource is not allowed for the this resource level.

(cont=storage_container(ep, the_container_name))
#Azure blob container 'the_container_name'
#URL: https://xxx.blob.core.windows.net/the_container_name
#Access key: <none supplied>
#Azure Active Directory access token: <none supplied>
#Account shared access signature: <hidden>
#Storage API version: 2018-03-28

list_storage_files(container = cont)
#Error in do_storage_call(endp$url, path, options = options, headers = headers,  : 
#  Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
#AuthenticationFailed
#Server failed to authenticate the request. Make sure the value of Authorization header is formed #correctly including the signature.
#RequestId:b7275fdd-f01e-0020-703b-9bb78c000000
#Time:2019-11-14T22:29:55.5332998Z
#Signature did not match. String to sign used was 
#
#
#/xxx/the_container_name
#sas_id.

also attached the assess right of the two sas keys I tried. none works

xxx

Examples request (Creating CRAN)

Is it possible to write their internal repository and then read it from Azure storage (without mounting the storage)? The example could be using miniCRAN or drat package for creating the repositories.

Sending md5 with uploads

The storage_upload function doesn't have a parameter to enable sending the md5 along with the file.

I can calculate the base64 encoded md5 and manually add the header "x-ms-blob-content-md5" to a PUT operation, and it works fine (similar to what I think you get using the --put-md5 option for az-copy).

Is there some other way I'm missing to do this with less effort? If not, perhaps adding an option to the storage_upload function would be nice.

Blob/ADLS interop broken

list_blobs(cont) with a hns-enabled account results in

Error in rbind(deparse.level, ...) :
  numbers of columns of arguments do not match

Token can expire during retry

Current retry logic fails to check expiry date of OAuth token, which means that for a long-running transfer, the token can expire during a retry. Solution is to move the token validation check inside the retry loop.

AzureStor fails to install in Azure Compute Instance

I created an Azure Workspace and a Compute Instance, after the compute instance is created I launched R Studio from the column named Application URI. The R Studio load but when I run the command install.packages("AzureStor") it not installed it displays the following error:

Using PKG_CFLAGS=
Using PKG_LIBS=-lxml2
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libxml-2.0 was not found. Try installing:

deb: libxml2-dev (Debian, Ubuntu, etc)
rpm: libxml2-devel (Fedora, CentOS, RHEL)
csw: libxml2_dev (Solaris)
If libxml-2.0 is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libxml-2.0.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'

ERROR: configuration failed for package ‘xml2’

removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/xml2’
Warning in install.packages :
installation of package ‘xml2’ had non-zero exit status
ERROR: dependency ‘xml2’ is not available for package ‘roxygen2’
removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/roxygen2’
Warning in install.packages :
installation of package ‘roxygen2’ had non-zero exit status
ERROR: dependency ‘xml2’ is not available for package ‘rversions’
removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/rversions’
Warning in install.packages :
installation of package ‘rversions’ had non-zero exit status
ERROR: dependency ‘xml2’ is not available for package ‘AzureStor’
removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/AzureStor’
Warning in install.packages :
installation of package ‘AzureStor’ had non-zero exit status

upload_blob failing

I have been using upload_blob in production for months and just 3 days ago it began to fail. I am uploading a CSV from an active r session to a blob storage container. This is not a repex but here is the code.
It begins to upload (rapidly at first) and then pauses for several minutes before printing "connection failure. Eventually, I get an error message. Thoughts? the csv is only 4510 rows of 11 columns so not a big file and this used to run almost instantly.

# create a text connection to the tagged data to upload
con_tagged<-textConnection(object = format_csv(tagged_upload))
 # upload it to the blob container
upload_blob(cont_labels, src=con_tagged,
       dest=paste0("tagged_",round(as.numeric(Sys.time())*1000),".csv"),
       use_azcopy = F)

|======================================================== | 55%Connection error, retrying (1 of 10)
|===================== | 21%Connection error, retrying (2 of 10)
|================== | 17%Connection error, retrying (3 of 10)
|========================================== | 41%Connection error, retrying (4 of 10)
|============== | 14%Error in process_storage_response(response, match.arg(http_status_handler), :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:4945a495-901e-001a-479e-ed322b000000
Time:2020-02-27T18:46:45.1205751Z
Request date header too old: 'Thu, 27 Feb 2020 18:31:44 GMT'.

copy_url_to_storage fails with error in UseMethod

Just tried the example from the docs

copy_url_to_storage(
  container = my_container,
  src = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
  dest = "iris.csv")

For me it fails with following error:

Error in UseMethod("copy_from_url") : no applicable method for 'copy_from_url' applied to an object of class "c('blob_container', 'storage_container')"

My container object is properly created.

> class(mycontainer)
[1] "blob_container"    "storage_container"

I can call other methods on it that work, e.g. list_storage_files(mycontainer)

My current verison is at AzureStor_2.1.1.9000.

Write/read directly to memory

Is there a possibility to read/write (for any file type) directly into R memory from a adls gen2 storage? For example: on a gen2 file storage I have a text.csv file which I like to read with read_csv to a dataframe.

The current solution I came up with is first to download the file to a temporary local directory and then use read_csv.

Bug in AzureStor::list_adls_files()

Hi,

There appears to be a bug in AzureStor::list_adls_files() when more than 5k files exist in the filesystem. An -almost- reprex (I presumably shouldn't share our key :) ):

library(AzureStor)

endpoint_url <- "<endpoint>.dfs.core.windows.net"
filesystem_name <- "<filesystem_name>"
key = "<access_key>"

endpoint <- AzureStor::storage_endpoint(endpoint = endpoint_url, key = key)
filesystem <- AzureStor::adls_filesystem(endpoint, name = filesystem_name)

AzureStor::list_adls_files(filesystem, recursive = TRUE)

gives an error:

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

In our case, the repeat-block in list_adls_files generates an eight-column dataframe in the first loop, with names:

> names(out)
[1] "contentLength" "etag"          "group"         "isDirectory"   "lastModified"  "name"          "owner"        
[8] "permissions"

In the second loop, the content of out has only seven columns, missing the "isDirectory" column:

> names(out2)
[1] "contentLength" "etag"          "group"         "lastModified"  "name"          "owner"         "permissions"

ADLSgen1

Good morning,

what package do you recommend to access ADLSgen1 ?

Thank you

delete_azure_dir(recursive=TRUE) is broken

Deletes PARENT dirs, not contents of subdir (!)

Allow passing vector of filenames to multi_upload/download

Unable to download blob from container

Hi,

I am unable to download blobs from my storage account. I followed the documentation in the readme, this is my code:

I can successfully create an endpoint and get a container with

bl_endp_key <- storage_endpoint("url", "key")
cont <- storage_container(bl_endp_key, "container_name")

I can further successfully list my containers and files in the blob like this:

list_storage_containers(bl_endp_key)
list_storage_files(cont)

But when I try to download a blob, the call takes very long and fails due to a connectionerror. It then tries to reconnect for 10 times and fails every time. This is the code i use:

rawvec <- download_blob(cont, src="blob_name", dest=NULL)

My container has container access level (anonymous read access for containers and blobs). I used the Azure storage account access key to create the endpoint. I can access the same blob in python using the same storage account access key.

Do you know what I am doing wrong?

Table storage package now available

Is support for azure table storage on the roadmap? We are looking for functionality like this: table storage python.

Or is there another package which provided this kind of functionality?

azure / azurestor Goto Github PK

azurestor's Issues

Recommend Projects

Recommend Topics

Recommend Org