azurestor's Introduction

AzureStor

This package implements both an admin- and client-side interface to Azure Storage Services. The admin interface uses R6 classes and extends the framework provided by AzureRMR. The client interface provides several S3 methods for efficiently managing storage and performing file transfers.

The primary repo for this package is at https://github.com/Azure/AzureStor; please submit issues and PRs there. It is also mirrored at the Cloudyr org at https://github.com/cloudyr/AzureStor. You can install the development version of the package with devtools::install_github("Azure/AzureStor").

Storage endpoints

The interface for accessing storage is similar across blobs, files and ADLSGen2. You call the storage_endpoint function and provide the endpoint URI, along with your authentication credentials. AzureStor will figure out the type of storage from the URI.

AzureStor supports all the different ways you can authenticate with a storage endpoint:

  • Blob storage supports authenticating with an access key, shared access signature (SAS), or an Azure Active Directory (AAD) OAuth token;
  • File storage supports access key and SAS;
  • ADLSgen2 supports access key and AAD token.

In the case of an AAD token, you can also provide an object obtained via AzureAuth::get_azure_token(). If you do this, AzureStor can automatically refresh the token for you when it expires.

# various endpoints for an account: blob, file, ADLS2
bl_endp_key <- storage_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
fl_endp_sas <- storage_endpoint("https://mystorage.file.core.windows.net", sas="my_sas")
ad_endp_tok <- storage_endpoint("https://mystorage.dfs.core.windows.net", token="my_token")

# alternative (recommended) way of supplying an AAD token
token <- AzureRMR::get_azure_token("https://storage.azure.com",
                                   tenant="myaadtenant", app="app_id", password="mypassword"))
ad_endp_tok2 <- storage_endpoint("https://mystorage.dfs.core.windows.net", token=token)

Listing, creating and deleting containers

AzureStor provides a rich framework for managing storage. The following generics allow you to manage storage containers:

  • storage_container: get a storage container (blob container, file share or ADLS filesystem)
  • create_storage_container
  • delete_storage_container
  • list_storage_containers
# example of working with containers (blob storage)
list_storage_containers(bl_endp_key)
cont <- storage_container(bl_endp_key, "mycontainer")
newcont <- create_storage_container(bl_endp_key, "newcontainer")
delete_storage_container(newcont)

Files and blobs

The following functions are for working with objects within a storage container:

  • list_storage_files: list files/blobs in a directory (defaults to the root directory)
  • create_storage_dir/delete_storage_dir: create or delete a directory
  • delete_storage_file: delete a file or blob
  • storage_file_exists: check that a file or blob exists
  • storage_upload/storage_download: transfer a file to or from a storage container
  • storage_multiupload/storage_multidownload: transfer multiple files in parallel to or from a storage container
  • get_storage_properties: get the properties of a storage object
  • get_storage_metadata/set_storage_metadata: get and set user-defined metadata for a storage object (see the sketch after the example below)
# example of working with files and directories (ADLSgen2)
cont <- storage_container(ad_endp_tok, "myfilesystem")
list_storage_files(cont)
create_storage_dir(cont, "newdir")
storage_download(cont, "/readme.txt")
storage_multiupload(cont, "N:/data/*.*", "newdir")  # uploading everything in a directory

Uploading and downloading

AzureStor includes a number of extra features to make transferring files efficient.

Parallel connections

As noted above, you can transfer multiple files in parallel using the storage_multiupload/download functions. These functions utilise a background process pool supplied by AzureRMR to do the transfers in parallel, which usually results in major speedups when transferring multiple small files. The pool is created the first time a parallel file transfer is performed, and persists for the duration of the R session; this means you don't have to wait for the pool to be (re-)created each time.

# uploading/downloading multiple files at once: use a wildcard to specify files to transfer
storage_multiupload(cont, src="N:/logfiles/*.zip")
storage_multidownload(cont, src="/monthly/jan*.*", dest="~/data/january")

# or supply a vector of file specs as the source and destination
src <- c("file1.csv", "file2.csv", "file3.csv")
dest <- file.path("data/", src)
storage_multiupload(cont, src=src, dest=dest)
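
The background pool can also be managed explicitly via AzureRMR, for example to release its resources once your transfers are done. A minimal sketch, assuming AzureRMR's init_pool/delete_pool interface:

# optionally manage the background process pool yourself
AzureRMR::init_pool(5)   # create (or recreate) a pool of 5 processes
AzureRMR::delete_pool()  # shut the pool down when no more transfers are needed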

File format helpers

AzureStor includes convenience functions to transfer data in a number of commonly used formats: RDS, RData, TSV (tab-delimited), CSV, and CSV2 (semicolon-delimited). These work via connections and so don't create temporary files on disk.

# save an R object to storage and read it back again
obj <- list(n=42L, x=pi, c="foo")
storage_save_rds(obj, cont, "obj.rds")
objnew <- storage_load_rds(cont, "obj.rds")
identical(obj, objnew)  # TRUE

# reading/writing data to CSV format
storage_write_csv(mtcars, cont, "mtcars.csv")
mtnew <- storage_read_csv(cont, "mtcars.csv")
all(mapply(identical, mtcars, mtnew))  # TRUE
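
The other formats follow the same naming pattern; a brief sketch, assuming the analogous helpers storage_save_rdata/storage_load_rdata and storage_write_delim/storage_read_delim:

# save multiple objects in RData format
x <- 10; y <- letters
storage_save_rdata(x, y, container=cont, file="xy.rdata")
storage_load_rdata(cont, "xy.rdata")  # loads x and y into the current environment

# reading/writing data in tab-delimited (TSV) format
storage_write_delim(mtcars, cont, "mtcars.tsv")
mtnew2 <- storage_read_delim(cont, "mtcars.tsv")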

Transfer to and from connections

You can upload a (single) in-memory R object via a connection, and similarly, you can download a file to a connection, or return it as a raw vector. This lets you transfer an object without having to create a temporary file as an intermediate step.

# uploading serialized R objects via connections
json <- jsonlite::toJSON(iris, pretty=TRUE, auto_unbox=TRUE)
con <- textConnection(json)
storage_upload(cont, src=con, dest="iris.json")

rds <- serialize(iris, NULL)
con <- rawConnection(rds)
storage_upload(cont, src=con, dest="iris.rds")

# downloading files into memory: as a raw vector with dest=NULL, and via a connection
rawvec <- storage_download(cont, src="iris.json", dest=NULL)
rawToChar(rawvec)

con <- rawConnection(raw(0), "r+")
storage_download(cont, src="iris.rds", dest=con)
unserialize(con)

Copy from URLs (blob storage only)

The copy_url_to_storage function lets you transfer the contents of a URL directly to storage, without having to download it to your local machine first. The multicopy_url_to_storage function does the same, but for a vector of URLs. Currently, these only work for blob storage.

# copy from a public URL: Iris data from UCI machine learning repository
copy_url_to_storage(cont,
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    "iris.csv")

# copying files from another storage account, by appending a SAS to the URL(s)
sas <- "?sv=...."
files <- paste0("https://srcstorage.blob.core.windows.net/container/file", 0:9, ".csv", sas)
multicopy_url_to_storage(cont, files)

Appending (blob storage only)

AzureStor supports uploading to append blobs. An append blob is composed of blocks and is optimized for append operations; it is well suited for data that is constantly growing but should not be modified once written, such as server logs.

To upload to an append blob, specify type="AppendBlob" in the storage_upload call. To append data (rather than overwriting an existing blob), include the argument append=TRUE. See ?upload_blob for more details.

# create a new append blob
storage_upload(cont, src="logfile1.csv", dest="logfile.csv", type="AppendBlob")

# appending to an existing blob
storage_upload(cont, src="logfile2.csv", dest="logfile.csv", type="AppendBlob", append=TRUE)

Interface to AzCopy

AzureStor includes an interface to AzCopy, Microsoft's high-performance commandline utility for copying files to and from storage. To take advantage of this, simply include the argument use_azcopy=TRUE on any upload or download function. AzureStor will then call AzCopy to perform the file transfer, rather than using its own internal code. In addition, a call_azcopy function is provided to let you use AzCopy for any task.

# use azcopy to download
myfs <- storage_container(ad_endp_tok, "myfilesystem")
storage_download(myfs, "/incoming/bigfile.tar.gz", "/data", use_azcopy=TRUE)

# use azcopy to sync a local and remote dir
call_azcopy("sync", "c:/local/path", "https://mystorage.blob.core.windows.net/mycontainer", "--recursive=true')

For more information, see the AzCopy repo on GitHub.

Note that AzureStor uses AzCopy version 10. It is incompatible with versions 8.1 and earlier.

Admin interface

Finally, AzureStor's admin-side interface allows you to easily create and delete storage accounts, as well as obtain access keys and generate a SAS. Here is a sample workflow:

library(AzureStor)

# authenticate with Resource Manager
az <- AzureRMR::get_azure_login("mytenant")
sub1 <- az$get_subscription("subscription_id")
rg <- sub1$get_resource_group("resgroup")


# get an existing storage account
rdevstor1 <- rg$get_storage_account("rdevstor1")
rdevstor1
#<Azure resource Microsoft.Storage/storageAccounts/rdevstor1>
#  Account type: Storage 
#  SKU: name=Standard_LRS, tier=Standard 
#  Endpoints:
#    blob: https://rdevstor1.blob.core.windows.net/
#    queue: https://rdevstor1.queue.core.windows.net/
#    table: https://rdevstor1.table.core.windows.net/
#    file: https://rdevstor1.file.core.windows.net/ 
# ...

# retrieve admin keys
rdevstor1$list_keys()

# create a shared access signature (SAS)
rdevstor1$get_account_sas(permissions="rw")

# obtain an endpoint object for accessing storage (will have the access key included by default)
rdevstor1$get_blob_endpoint()
#Azure blob storage endpoint
#URL: https://rdevstor1.blob.core.windows.net/
# ...

# create a new storage account
blobstor2 <- rg$create_storage_account("blobstor2", location="australiaeast", kind="BlobStorage")

# delete it (will ask for confirmation)
blobstor2$delete()

azurestor's People

Contributors

cantpitch · grthr · hadley · hanstwins · hongooi73 · koderkow · microsoft-github-policy-service[bot] · mikkmart · msdcanderson

azurestor's Issues

storage_download fails on very small file

The storage_download function may work on larger files, but I encountered weird timeouts on very small files (the iris dataset, as shown in the documentation examples). I put together a reproducible example to demonstrate. I tracked down the point in the code where the process hangs: it's where the size of the file is computed for the progress bar, but I haven't yet figured out exactly what about the logic causes it to hang.

Perhaps the source of the issue will be obvious to someone more familiar with the package. If not, perhaps the logic could be borrowed from the list_blobs function instead; it may not be as exact, but I had no trouble getting file sizes for the small files that way.

Code to produce error:

username="myusername"
storage_account<-"mystorageaccountname"
container_name<-"mycontainername"
access_key<-"myaccesskey"

storage_endpoint_url<-"https://${storage_account}.blob.core.windows.net"
storage_endpoint_url_resolved<-glue::glue(storage_endpoint_url,.open="${")

bl_endp_key <- AzureStor::storage_endpoint(storage_endpoint_url_resolved, 
                                    key=access_key
                                    )
AzureStor::list_storage_containers(bl_endp_key)

cont<-AzureStor::create_storage_container(bl_endp_key,name=container_name)
json <- jsonlite::toJSON(iris, pretty=TRUE, auto_unbox=TRUE)
con <- textConnection(json)
AzureStor::storage_upload(cont, src=con, dest="iris.json")

AzureStor::list_blobs(cont)
### will fail on next line (times out ten times in a row)
## it gets stuck here: https://github.com/Azure/AzureStor/blob/a71fcf88a9bd7b97c8ab221aaf3ced813d0265db/R/blob_transfer_internal.R#L83
## if I change the http_verb from "HEAD" to "GET" it finishes, so perhaps there's something wrong in the logic
## with regard to very small file sizes?  
rawvec <- AzureStor::storage_download(cont, src="iris.json", dest=NULL)
rawToChar(rawvec)


### alternate workaround to show the file is fine and downloadable

tenant_id<-"mytenantid"
subscription_id<-"mysubscriptionid"
resource_group_name<-"myresourcegroupname"

az <- AzureRMR::create_azure_login(tenant=tenant_id,auth_type="device_code")
sub <- az$get_subscription(subscription_id)
rg <- sub$get_resource_group(resource_group_name)
stor <- rg$get_resource(type="Microsoft.Storage/storageAccounts", name=storage_account)
rdevstor1 <- rg$get_storage_account(storage_account)
my_sas<-rdevstor1$get_account_sas(permissions="r",services="bf", start=Sys.Date(), expiry=Sys.Date() + 31)

my_blob_endpoint<-rdevstor1$get_blob_endpoint()
l_endp_sas <- AzureStor::storage_endpoint(storage_endpoint_url_resolved, sas=my_sas)
adm_cont <- AzureStor::storage_container(l_endp_sas, container_name)

test_url<-paste0(storage_endpoint_url_resolved,"/",container_name,"/iris.json?",my_sas)
test_download<-jsonlite::fromJSON(test_url)
testthat::expect_equal(test_download,jsonlite::fromJSON(jsonlite::toJSON(iris)))

Blob/ADLS interop broken

list_blobs(cont) with a hns-enabled account results in

Error in rbind(deparse.level, ...) :
  numbers of columns of arguments do not match

Examples request (creating a CRAN repository)

Is it possible to write an internal package repository to Azure storage and then read it from there (without mounting the storage)? The example could use the miniCRAN or drat package for creating the repositories.
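
A minimal sketch of one possible approach, using miniCRAN to build the repository locally and AzureStor to host it in a blob container; all names and URLs are placeholders, and the container is assumed to allow public (anonymous) read access:

library(miniCRAN)
library(AzureStor)

# build a source repository for a package and its dependencies in a local folder
pkgs <- pkgDep("mypackage", suggests=FALSE)
dir.create("~/myrepo")
makeRepo(pkgs, path="~/myrepo", type="source")

# upload the repository tree to a blob container
endp <- blob_endpoint("https://mystorage.blob.core.windows.net", key="access_key")
cont <- storage_container(endp, "minicran")
storage_multiupload(cont, "~/myrepo/*", recursive=TRUE)

# consumers can then point install.packages at the container URL
install.packages("mypackage", repos="https://mystorage.blob.core.windows.net/minicran")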

Define storage endpoint using key vault

I am writing an application in a databricks notebook using R-language. The code must be capable to look into storage containers located in a data lake gen2 in order to list all their contents.

So far I am doing the following for development

endpoint <- AzureStor::adls_endpoint(endpoint = "https://<myStorageName>.dfs.core.windows.net", key = "<myStorageKey>")
storage_containers <- AzureStor::list_storage_containers(endpoint)

which works perfectly for me. This way I see all the containers in the data lake and can access them.

However, in order to go into production I must remove the key from the code, which I hope is possible in principle using Key Vault, since reading in a file works for me like this:

%py spark.conf.set("fs.azure.account.key.<myStorageName>.dfs.core.windows.net", dbutils.secrets.get(scope = "myScopeName", key = "mySecret") )

SparkR::read.df(path = "abfss://<myContainerName>@<myStorageName>.dfs.core.windows.net/<myFolderName>/<myFileName>.csv", source = "csv") -> mySparkDataFrame

I started by trying to define the key vault applying the AzureKeyVault package

vault <- AzureKeyVault::key_vault(url ="https://<myKeyVaultName>.vault.azure.net")

but this call runs forever, so I must cancel it manually. I guess this is an authentication issue?!

So, what is the correct way of defining a storage endpoint using a secret stored in a key vault, or of authenticating, respectively?

Thanks in advance for the support!
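
A minimal sketch of one possible approach, assuming a service principal that has been granted access to the vault's secrets; the vault, secret and storage names below are placeholders:

library(AzureKeyVault)
library(AzureStor)

# authenticate to the vault explicitly with a service principal (avoids interactive prompts)
vault <- key_vault("https://mykeyvaultname.vault.azure.net",
                   tenant="mytenant", app="app_id", password="client_secret")

# retrieve the storage key from the vault and use it to create the endpoint
storage_key <- vault$secrets$get("my-storage-key")$value
endp <- adls_endpoint("https://mystoragename.dfs.core.windows.net", key=storage_key)
storage_containers <- list_storage_containers(endp)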

copy_url_to_storage fails with error in UseMethod

Just tried the example from the docs

copy_url_to_storage(
  container = my_container,
  src = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
  dest = "iris.csv")

For me it fails with following error:

Error in UseMethod("copy_from_url") : no applicable method for 'copy_from_url' applied to an object of class "c('blob_container', 'storage_container')"

My container object is properly created.

> class(mycontainer)
[1] "blob_container"    "storage_container"

I can call other methods on it that work, e.g. list_storage_files(mycontainer)

My current version is AzureStor_2.1.1.9000.

Allow using custom endpoint URL format to support Azurite

Currently, the call to is_endpoint_url enforces the default Azure URL format, also documented in the source code:

endpoint URL must be of the form {scheme}://{acctname}.{type}.{etc}

However, the Azure storage emulator tool Azurite does not follow this schema.

Instead of <http|https>://<account-name>.<service-name>.core.windows.net/<resource-path>, Azurite uses http://<local-machine-address>:<port>/<account-name>/<resource-path>.

Is it possible to disable the format check of the endpoint URL in case of directly calling blob_endpoint function?

Unable to download blob from container

Hi,

I am unable to download blobs from my storage account. I followed the documentation in the readme, this is my code:

I can successfully create an endpoint and get a container with

bl_endp_key <- storage_endpoint("url", "key")
cont <- storage_container(bl_endp_key, "container_name")

I can further successfully list my containers and files in the blob like this:

list_storage_containers(bl_endp_key)
list_storage_files(cont)

But when I try to download a blob, the call takes very long and fails with a connection error. It then tries to reconnect 10 times and fails every time. This is the code I use:

rawvec <- download_blob(cont, src="blob_name", dest=NULL)

My container has container access level (anonymous read access for containers and blobs). I used the Azure storage account access key to create the endpoint. I can access the same blob in python using the same storage account access key.

Do you know what I am doing wrong?

Sending md5 with uploads

The storage_upload function doesn't have a parameter to enable sending the md5 along with the file.

I can calculate the base64 encoded md5 and manually add the header "x-ms-blob-content-md5" to a PUT operation, and it works fine (similar to what I think you get using the --put-md5 option for az-copy).

Is there some other way I'm missing to do this with less effort? If not, perhaps adding an option to the storage_upload function would be nice.

Write/read directly to memory

Is there a possibility to read/write (any file type) directly into R memory from an ADLS gen2 storage account? For example: on a gen2 file storage I have a text.csv file which I would like to read with read_csv into a dataframe.

The current solution I came up with is first to download the file to a temporary local directory and then use read_csv.
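
For delimited and serialized formats, the storage_read_*/storage_write_* and storage_load_*/storage_save_* helpers described in the README do exactly this; for other file types, downloading to a raw vector avoids the temp file. A brief sketch, with the file name as a placeholder:

# read a CSV from ADLSgen2 straight into a data frame, no temp file needed
df <- storage_read_csv(cont, "text.csv")

# for arbitrary file types, download into memory as a raw vector
rawvec <- storage_download(cont, src="text.csv", dest=NULL)
df2 <- read.csv(text=rawToChar(rawvec))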

Support for multiple objects with storage_multiupload

First off, wanted to state that this package is great. Nice work.

Looking at the documentation, it appears I can upload individual objects with rawConnection and storage_upload, or upload multiple files with storage_multiupload.

Is it possible to use storage_multiupload to upload multiple objects that are in memory in R to blob/ADLS storage? If not, is this something that can be added as an enhancement, if it's technically feasible?

I know I can write the objects to temp storage and then just multi-upload the individual files, but that incurs unnecessary file I/O and can be inefficient, especially for the large objects we have in memory.

thank you very much for your time.

anand
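
Until such an enhancement exists, one possible workaround is to loop over the in-memory objects and upload each one via a connection, avoiding disk I/O. A minimal sketch, with placeholder object names:

# upload several in-memory objects without writing them to disk
objs <- list(obj1=iris, obj2=mtcars)
for(nm in names(objs))
{
    con <- rawConnection(serialize(objs[[nm]], NULL))
    storage_upload(cont, src=con, dest=paste0(nm, ".rds"))
}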

Token can expire during retry

Current retry logic fails to check expiry date of OAuth token, which means that for a long-running transfer, the token can expire during a retry. Solution is to move the token validation check inside the retry loop.

Progress bar not shown with storage_multiupload() even with options(azure_storage_progress_bar=TRUE)

Hi,

First of all, thanks again for the great R interface to Azure that you provide, and for your strong support.

I was wondering how could I display the progress bar of the following Job :

storage_multiupload(container = contAC, src=paste0(img_out_path, "/*"), dest=folder_name, use_azcopy=FALSE, recursive = TRUE)

Indeed, even though I have run options(azure_storage_progress_bar=TRUE) beforehand, nothing is printed to the prompt.
Sometimes (in an RShiny app, but not during the execution of a single command at the prompt) it says "Creating background pool", but nothing more.

Is it because I don't have enough files to send, or because my connection is too fast?
Or is it because I am not using the options() command properly?
Do you have any recommendations?

Thanks in advance for your help,
Sincerely yours

Uploading file and folder with a non-ascii char in their name

Hello there,

First of all, thanks again for the great R interface to Azure that you provide, hongooi73, and for your strong support.

It seems that the AzureStor functions do not support the uploading of files and folders which contain UTF-8 characters.

As the following screenshot shows, it is possible to upload folders and files with UTF-8 accents to a blob container; here they were uploaded with Microsoft Azure Storage Explorer:

[screenshot: portal.azure.com view of the uploaded container]

If I list the above container, I get the following output:

> AzureStor::list_storage_files(container = cont)
                                  name   size isdir
1                                  Img     NA  TRUE
2                         Img/1-B1.png 170273 FALSE
3                              Img/2-F     NA  TRUE
4                      Img/2-F/RB1.png 397330 FALSE
5                      Img/2-F/RB1.tif 519996 FALSE
6                       Img/3-R1-é.tif 215388 FALSE
7            Img/4-ùûüÿ€àâæçéèêëïîôœ—–     NA  TRUE
8   Img/4-ùûüÿ€àâæçéèêëïîôœ—–/1-B1.png 170273 FALSE
9 Img/4-ùûüÿ€àâæçéèêëïîôœ—–/3-R1-é.tif 215388 FALSE

I would like to upload the same folder to an Azure Blob container with R commands instead of Microsoft Azure Storage Explorer :

└───Img
    │   1-B1.png
    │   3-R1-é.tif
    │
    ├───2-F
    │       RB1.png
    │       RB1.tif
    │
    └───4-ùûüÿ€àâæçéèêëïîôœ—–
            1-B1.png
            3-R1-é.tiff

Thus I use the following commands to upload:

library("AzureStor", "AzureRMR", "AzureAuth")

# without AZcopy ##########

#login ####
token <- AzureAuth::get_azure_token(
  resource = sto_url,
  tenant = tenant,
  app = aad_id,
  username = az_log,
  password = az_pwd,
  auth_type = "resource_owner",
  use_cache = F
)
stopifnot(AzureAuth::is_azure_token(token))
stopifnot(token$validate())

endp_tok <- AzureStor::blob_endpoint(sto_url, token = token)
cont <- AzureStor::storage_container(endp_tok, cont_name)

# uploading ####
img_path_from <- "./Img"

# simple recursive uploading
img_to <- "ImgS"
files <- list.files(path = img_path_from, recursive = TRUE)
for (i in 1:length(files)) {
  AzureStor::storage_upload(
    container = cont,
    src = paste0(img_path_from, "/", files[i]),
    dest = paste0(img_to, "/", files[i]) ,
    use_azcopy = F
  )
}

# multiple uploading
img_to <- "ImgM"
storage_multiupload(
  container = cont,
  src = paste0(img_path_from, "/*"),
  dest = img_to,
  use_azcopy = F,
  recursive = TRUE
)

And I got the following errors :

# simple recursive uploading

  |==============================================================================| 100%
  |==============================================================================| 100%
  |==============================================================================| 100%
  |==================                                                            |  23%
  Error in process_storage_response(response, match.arg(http_status_handler),  : 
  Bad Request (HTTP 400). Failed to complete Storage Services operation. Message:
InvalidUri
The requested URI does not represent any resource on the server.
RequestId:57e7118e-401e-004e-065c-5ba0ae000000
Time:2020-07-16T10:30:23.2985328Z.


# multiple uploading

Creating background pool
Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: Bad Request (HTTP 400). Failed to complete Storage Services operation. Message:
InvalidUri
The requested URI does not represent any resource on the server.
RequestId:a31743fd-d01e-004c-0b5c-5b1e16000000
Time:2020-07-16T10:31:30.9125139Z.

Only the files without a UTF-8 accent in their path were uploaded, whether or not I used multiupload:

> # listing ####
> AzureStor::list_storage_files(container = cont)
               name   size isdir
1              ImgM     NA  TRUE
2     ImgM/1-B1.png 170273 FALSE
3          ImgM/2-F     NA  TRUE
4  ImgM/2-F/RB1.png 397330 FALSE
5  ImgM/2-F/RB1.tif 519996 FALSE
6              ImgS     NA  TRUE
7     ImgS/1-B1.png 170273 FALSE
8          ImgS/2-F     NA  TRUE
9  ImgS/2-F/RB1.png 397330 FALSE
10 ImgS/2-F/RB1.tif 519996 FALSE

I also tried with use_azcopy = T, but uploading the files with UTF-8 characters was impossible as well, as the outputs below show:

# a simple file name
# 1-B1.png

Using azcopy binary C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe
Running "C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe" copy ./Img/1-B1.png \
  "https://#############notshow############/ImgS/1-B1.png" \
  --blob-type BlockBlob --block-size-mb 16
INFO: Scanning...
INFO: AZCOPY_OAUTH_TOKEN_INFO is set.
INFO: Authenticating to destination using Azure AD
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job 6fdac609-3340-0943-4b14-9e6ee066d277 has started
Log file is located at: C:\Users\ajallais\.azcopy\6fdac609-3340-0943-4b14-9e6ee066d277.log

0.0 %, 0 Done, 0 Failed, 1 Pending, 0 Skipped, 1 Total, 


Job 6fdac609-3340-0943-4b14-9e6ee066d277 summary
Elapsed Time (Minutes): 0.0333
Number of File Transfers: 1
Number of Folder Property Transfers: 0
Total Number of Transfers: 1
Number of Transfers Completed: 1
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 170273
Final Job Status: Completed



# a more complicated path name
# 3-R1-é.tif

Running "C:\Users\ajallais\DOCUME~1\AZCOPY~1.3\azcopy.exe" copy "./Img/3-R1-é.tif" \
  "https://#############notshow############/ImgS/3-R1-é.tif" \
  --blob-type BlockBlob --block-size-mb 16
INFO: Scanning...
INFO: AZCOPY_OAUTH_TOKEN_INFO is set.
INFO: Authenticating to destination using Azure AD
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

failed to perform copy command due to error: cannot start job due to error: cannot scan the path \\?\C:\Users\ajallais\Documents\Script\GIT-Script\Script\R\TestBigFiles\7-UTF8Char\Img\3-R1-�.tif, please verify that it is a valid.

Error in processx::run(get_azcopy_path(), args, env = env, echo_cmd = !silent,  : 
  System command 'azcopy.exe' failed, exit status: 1, stdout & stderr were printed
Type .Last.error.trace to see where the error occured



> .Last.error.trace 

 Stack trace:

 1. AzureStor::storage_upload(container = cont, src = paste0(img_path_from,  ...
 2. AzureStor:::storage_upload.blob_container(container = cont, src = paste0(img_path_f ...
 3. AzureStor:::upload_blob(container, ...)
 4. AzureStor:::azcopy_upload(container, src, dest, type = type,  ...
 5. AzureStor:::call_azcopy_from_storage(container$endpoint, "copy",  ...
 6. AzureStor:::call_azcopy(..., env = auth$env)
 7. processx::run(get_azcopy_path(), args, env = env, echo_cmd = !silent,  ...
 8. throw(new_process_error(res, call = sys.call(), echo = echo,  ...

 x System command 'azcopy.exe' failed, exit status: 1, stdout & stderr were printed 

My environment parameters are the following :

  • R version 4.0.2 (2020-06-22) -- "Taking Off Again"
  • Platform: x86_64-w64-mingw32/x64 (64-bit)
  • Microsoft Windows 10 Professionnel
  • AzCopy 10.4.3
  • packageVersion('AzureStor') ‘3.2.2’
  • packageVersion('AzureRMR') ‘2.3.4’
  • packageVersion('AzureAuth') ‘1.2.4’

I would appreciate any help or suggestions,
Have a good summer holiday!

list_storage_files isdir=TRUE for files after uploading with azcopy

Hi,
First of all, thanks for the work you do on the AzureR family!
As shown in the following screenshot:

[screenshot: blob container listing]

(The above files were uploaded with the following command: storage_multiupload(cont, paste0(img_path,"*"), dest="azcopy", use_azcopy=TRUE))

When I list the different elements of the blob container with the following command:
list_storage_files(cont, info="all"), the isdir column indicates isdir=TRUE even for entries that are images.

I can manage what I want to do with the Content-Type column, but I am still wondering whether the isdir column is meant to indicate if the element is a file or a directory. If that is not the case, I think the name isdir could be reformulated; what do you think?

Sincerely yours,

download_from_url function not working

I am trying to use the download_from_url function from the AzureStor package. I keep getting the below error

Error in download_blob_internal(container, src, dest, blocksize = blocksize, : Forbidden (HTTP 403). Failed to complete Storage Services operation. Message: .

I know that it should work and the permissions are set up correctly for the storage account because I am able to download the blob using Python.

download_blob fails inside lapply.... sometimes

I am using download_blob inside lapply to download a list of blobs. When I have a large list of blobs it sometimes (not always) fails after downloading some of the blobs with the error below. Any thoughts on a better way to code this? Is there some way to avoid this? The problem is that the program hangs and does not throw an error until the Esc key is hit by the user. Thank you in advance!

lapply(unique(totag_sub$download_path),
       function(x) {
           download_blob(container = cont_image,
                         src = x,
                         dest = paste0(tagging_location, "/", x),
                         overwrite = TRUE)
       })

|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|============================================================================================================================| 100%
|=========== | 9%
Connection error, retrying (1 of 10)
|

Error in download_blob_internal(container, src, dest, blocksize = blocksize, :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:a1a2e308-901e-0062-5902-c4087b000000
Time:2020-01-05T19:57:57.9797635Z
Request date header too old: 'Sun, 05 Jan 2020 11:42:07 GMT'.

Recording Multiupload Job Summary

Hi,

First of all, thanks again for the great R interface to Azure that you provide, and for your strong support.

I was wondering how I could display, or even better record in an object (a dataframe for example), the summary of a multiupload job run with the following command:

storage_multiupload(container = contAC, src=paste0(img_out_path, "/*"), dest=folder_name, use_azcopy=FALSE, recursive = TRUE)

I know that it is possible to display such information with use_azcopy=T, but I prefer not to use it.

Thanks in advance for your help,
Sincerely yours

storage_download fails( Forbidden (HTTP 403))

hi,
When I try to get the blob storage data from the RStudio Server in my VM (CentOS 7), I get the following error message.

Error in download_blob_internal(container, src, dest, blocksize = blocksize,  : 
  Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:

Could you please tell me the solution to this problem?
Here's the code I tried

library(AzureStor, lib.loc = "/usr/lib64/R/library")
endp <- storage_endpoint("https://mystorageaccount.blob.core.windows.net", key="myaccesskey")
cont <- storage_container(endp, "mycontainer")
storage_download(cont, "myblob.csv", 
                 "myblob_local.csv",
                 overwrite = T)

.
sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] AzureStor_3.2.3

loaded via a namespace (and not attached):
 [1] httr_1.4.2      compiler_3.6.0  R6_2.4.1        AzureRMR_2.3.5 
 [5] tools_3.6.0     curl_4.3        rappdirs_0.3.1  AzureAuth_1.2.4
 [9] mime_0.9        openssl_1.4.2   askpass_1.1    

Blob storage info:

  • Performance/access tier: Standard/Hot
  • Replication: read-access geo-redundant storage (RA-GRS)
  • Account kind: StorageV2 (general purpose v2)
  • Location: Japan East, Japan West
  • Blob type: block blob

VM info:

  • OS: Linux (CentOS 7.8.2003)
  • SKU: centos-76
  • Size: Basic A2 (2 vcpus, 3.5 GiB memory)
  • Location: Japan East

upload_blob failing

I have been using upload_blob in production for months, and just 3 days ago it began to fail. I am uploading a CSV from an active R session to a blob storage container. This is not a reprex, but here is the code.
It begins to upload (rapidly at first) and then pauses for several minutes before printing "connection failure". Eventually, I get an error message. Thoughts? The CSV is only 4510 rows by 11 columns, so not a big file, and this used to run almost instantly.

# create a text connection to the tagged data to upload
con_tagged<-textConnection(object = format_csv(tagged_upload))
 # upload it to the blob container
upload_blob(cont_labels, src=con_tagged,
       dest=paste0("tagged_",round(as.numeric(Sys.time())*1000),".csv"),
       use_azcopy = F)

|======================================================== | 55%Connection error, retrying (1 of 10)
|===================== | 21%Connection error, retrying (2 of 10)
|================== | 17%Connection error, retrying (3 of 10)
|========================================== | 41%Connection error, retrying (4 of 10)
|============== | 14%Error in process_storage_response(response, match.arg(http_status_handler), :
Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:4945a495-901e-001a-479e-ed322b000000
Time:2020-02-27T18:46:45.1205751Z
Request date header too old: 'Thu, 27 Feb 2020 18:31:44 GMT'.

list_blob_containers, missing nextmarker iteration

Hi all,

I notice list_blob_containers does not iterate on the NextMarker like the list_blobs function.

I need list_blob_containers to return the entire list of containers from my account and not just the first elements.

Example :

> connection<- AzureStor::blob_endpoint(endpoint = endpoint, key = key)
> names(AzureStor::list_blob_containers(connection))
[1] "db-tab-1"             "db-tab-2"

But in reality I have more containers; I should get something like:

> connection<- AzureStor::blob_endpoint(endpoint = endpoint, key = key)
> names(AzureStor::list_blob_containers(connection))
[1] "db-tab-1"             "db-tab-2"             "db-tab-3"             "db-tab-4"

So to fix that, just include the iteration on NextMarker 😄 or did I miss something in my code? 🤔

Thank you!

Bug in AzureStor::list_adls_files()

Hi,

There appears to be a bug in AzureStor::list_adls_files() when more than 5k files exist in the filesystem. An -almost- reprex (I presumably shouldn't share our key :) ):

library(AzureStor)

endpoint_url <- "<endpoint>.dfs.core.windows.net"
filesystem_name <- "<filesystem_name>"
key = "<access_key>"

endpoint <- AzureStor::storage_endpoint(endpoint = endpoint_url, key = key)
filesystem <- AzureStor::adls_filesystem(endpoint, name = filesystem_name)

AzureStor::list_adls_files(filesystem, recursive = TRUE)

gives an error:

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

In our case, the repeat-block in list_adls_files generates an eight-column dataframe in the first loop, with names:

> names(out)
[1] "contentLength" "etag"          "group"         "isDirectory"   "lastModified"  "name"          "owner"        
[8] "permissions"

In the second loop, the content of out has only seven columns, missing the "isDirectory" column:

> names(out2)
[1] "contentLength" "etag"          "group"         "lastModified"  "name"          "owner"         "permissions"  

list_storage_files recursive = FALSE does not work

I am trying to list the folders within a specific container using the following code

AzureStor::list_storage_files(
  containers[["data"]],
  dir = "path/to/folder",
  recursive = FALSE,
  info = "name"
)

However, the recursive = FALSE option seems to have no effect and returns the file paths for all files in each of the folders (i.e. as if recursive = TRUE were set).

Looking into this, I cannot see where recursive is used within list_blobs() even though it is an option to that function.

Exploration of sub-folders/directories in a file system of a data lake

I am writing R code in an Azure Databricks notebook that is supposed to explore a data lake, since the latter is supplied by many (partially unknown) sources. Using AzureStor I started with the following code:

AzureStor::adls_endpoint(endpoint = "https://<myDLName>.dfs.core.windows.net", key = myAccessKey) -> endP
AzureStor::list_storage_containers(endP) -> storage_containers
paste0("https://", myDLName,".dfs.core.windows.net/", names(storage_containers)[1] ) -> path
AzureStor::adls_filesystem(path, key = myAccessKey) -> myFileSys
AzureStor::list_adls_files(myFileSys, "/") -> myFiles

which outputs an R data.frame named "myFiles" with columns like "name" and "isDirectory", which I really appreciate. Now, if "isDirectory" is TRUE for some row of the "myFiles" data.frame, I would like my code to continue the exploration, i.e. look into that directory, list the objects in there and so on, for example by extending the endpoint URL with "/myFolder".

At the end of the day, the task is to read into the session all files (mostly .csv) that are present in an unknown folder structure. How can that be achieved using AzureStor?
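
One way to do this without walking the tree manually is to list the filesystem recursively and then read each non-directory entry. A minimal sketch building on the code above, using the name and isDirectory columns returned by list_adls_files:

# list everything under the root in one call, then read the CSVs into a named list
myFiles <- AzureStor::list_adls_files(myFileSys, "/", recursive=TRUE)
csv_names <- myFiles$name[!myFiles$isDirectory & grepl("\\.csv$", myFiles$name)]
dfs <- lapply(csv_names, function(f) AzureStor::storage_read_csv(myFileSys, f))
names(dfs) <- csv_names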

Support for append/edit content to existing files

Hi there!

First of all, thanks for this great work on the AzureR packages; I'm actively using AzureStor and it works great. Looking at the documentation, there are no details about editing or appending content to a file already located in a storage container.

My current use case is saving logs in Azure Storage. Is it possible to consider this enhancement? Otherwise, is there any workaround for this use case in the current AzureStor?

Thank you! 🤓

Yor Castaño
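
Append blobs, described in the Appending section of the README above, cover the log-writing case (the blob cannot be edited in place, but new data can be appended to it). A brief sketch with placeholder file names:

# create the log as an append blob, then append subsequent batches to it
storage_upload(cont, src="log_batch1.csv", dest="app.log", type="AppendBlob")
storage_upload(cont, src="log_batch2.csv", dest="app.log", type="AppendBlob", append=TRUE)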

How to use with SAS?

Hi,
I'm not exactly sure how to work with a SAS key using this package.

ep <- storage_endpoint("https://xxx.blob.core.windows.net/", sas=my_key) 

list_storage_containers(ep)
#Error in do_storage_call(endpoint$url, "/", options = list(comp = "list"),  : 
#  Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
#AuthenticationFailed
#Server failed to authenticate the request. Make sure the value of Authorization header is formed #correctly including the signature.
#RequestId:b7275fcc-f01e-0020-613b-9bb78c000000
#Time:2019-11-14T22:29:55.5202907Z
#The specified signed resource is not allowed for the this resource level.

(cont=storage_container(ep, the_container_name))
#Azure blob container 'the_container_name'
#URL: https://xxx.blob.core.windows.net/the_container_name
#Access key: <none supplied>
#Azure Active Directory access token: <none supplied>
#Account shared access signature: <hidden>
#Storage API version: 2018-03-28

list_storage_files(container = cont)
#Error in do_storage_call(endp$url, path, options = options, headers = headers,  : 
#  Forbidden (HTTP 403). Failed to complete Storage Services operation. Message:
#AuthenticationFailed
#Server failed to authenticate the request. Make sure the value of Authorization header is formed #correctly including the signature.
#RequestId:b7275fdd-f01e-0020-703b-9bb78c000000
#Time:2019-11-14T22:29:55.5332998Z
#Signature did not match. String to sign used was 
#
#
#/xxx/the_container_name
#sas_id.

I also attached the access rights of the two SAS keys I tried; none works.

[screenshot: access rights of the two SAS keys]

Endpoint print function is missing a forward slash

I just tested it for blob, but I assume the issue may also apply to other endpoint types. You seem to be missing a forward slash in the print function print.blob_container <- function(x, ...) of the blob endpoint object in AzureStor/R/blob_client_funcs.R.

The function returns as follows

> cat(sprintf("URL: %s\n", paste0("https://myblob.blob.core.windows.net", "mycontainer")))
URL: https://myblob.blob.core.windows.netmycontainer

when it should return

URL: https://myblob.blob.core.windows.net/mycontainer

I could try a pull request, but I have never done that before on an open source GitHub R project, so I'm not sure how helpful that would be (it would also take some time getting used to the conventions and such). Let me know.

'AzureStor' 2.0.1 is being loaded, but >= 3.0.0 is required

> devtools::install_github("Azure/AzureQstor")
Downloading GitHub repo Azure/AzureQstor@master
from URL https://api.github.com/repos/Azure/AzureQstor/zipball/master
Installing AzureQstor
"C:/PROGRA~1/R/R-34~1.3PA/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL  \
  "C:/Users/Hanieh_local/AppData/Local/Temp/RtmpExcszx/devtools6f58179f2f93/Azure-AzureQstor-cb830d7"  \
  --library="C:/Users/Hanieh_local/Documents/R/win-library/3.4" --install-tests 

* installing *source* package 'AzureQstor' ...
** R
** tests
** preparing package for lazy loading
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace 'AzureStor' 2.0.1 is being loaded, but >= 3.0.0 is required
ERROR: lazy loading failed for package 'AzureQstor'
* removing 'C:/Users/Hanieh_local/Documents/R/win-library/3.4/AzureQstor'
In R CMD INSTALL
Installation failed: Command failed (1)

AzureStor fails to install in Azure Compute Instance

I created an Azure workspace and a compute instance; after the compute instance was created, I launched RStudio from the column named Application URI. RStudio loads, but when I run install.packages("AzureStor") the package is not installed and the following error is displayed:

Using PKG_CFLAGS=
Using PKG_LIBS=-lxml2
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libxml-2.0 was not found. Try installing:

  • deb: libxml2-dev (Debian, Ubuntu, etc)
  • rpm: libxml2-devel (Fedora, CentOS, RHEL)
  • csw: libxml2_dev (Solaris)
    If libxml-2.0 is already installed, check that 'pkg-config' is in your
    PATH and PKG_CONFIG_PATH contains a libxml-2.0.pc file. If pkg-config
    is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
    R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'

ERROR: configuration failed for package ‘xml2’

  • removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/xml2’
    Warning in install.packages :
    installation of package ‘xml2’ had non-zero exit status
    ERROR: dependency ‘xml2’ is not available for package ‘roxygen2’
  • removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/roxygen2’
    Warning in install.packages :
    installation of package ‘roxygen2’ had non-zero exit status
    ERROR: dependency ‘xml2’ is not available for package ‘rversions’
  • removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/rversions’
    Warning in install.packages :
    installation of package ‘rversions’ had non-zero exit status
    ERROR: dependency ‘xml2’ is not available for package ‘AzureStor’
  • removing ‘/home/azureuser/R/x86_64-pc-linux-gnu-library/3.6/AzureStor’
    Warning in install.packages :
    installation of package ‘AzureStor’ had non-zero exit status

Creation-Time is not part of the properties, but is expected in list_blobs

Using an earlier version of AzureStor, we were able to successfully upload files into the blob storage and retrieve them.
After the update to version 3.0.0 the list_storage_files function fails with the following error message:

Error in as.POSIXct.default(df$`Creation-Time`, format = "%a, %d %b %Y %H:%M:%S", : do not know how to convert 'df$`Creation-Time`' to class “POSIXct”

After some investigation, we have found this issue to be caused by line 332 in: https://github.com/Azure/AzureStor/blob/36317e84a81e3f6496324d3efc6faf56c65e233e/R/blob_client_funcs.R

In our case the Creation-Time is not part of the properties returned on this line:
lst <- res$Blobs

Our initial thought was to delete the files and upload them again using AzureStor 3.0.0; unfortunately, this didn't solve the issue at hand.

Queue storage package available

Hi. I'm doing some batch processing, and after each batch I would like the R process to write into a queue so that other parts of the pipeline can pick up the work.

Is there any time-frame for azure storage queue support?

ADLSgen1

Good morning,

What package do you recommend for accessing ADLSgen1?

Thank you

list_adls_files limited to 5000 results

What it says on the tin: list_adls_files returns only the first 5000 files when used on a directory with more than 5000 files.

Since I do not know of any public data lakes with a large number of files I cannot give a reproducible example, but the following fails when used on an ADLS gen2 data lake filesystem I have access to:

library(AzureAuth)
library(AzureStor)

token <- get_azure_token(resource="https://storage.azure.com",
                         tenant=TENANT_ID,
                         app=SERVICE_PROV_ID,
                         password=SECRET)

lake <- storage_endpoint("https://STORAGE_ACCOUNT.dfs.core.windows.net", token=token)

fs <- storage_container(lake, "FILESYSTEM_NAME")

files <- list_adls_files(fs, "PATH_TO_DIR_WITH_+5000_FILES", "name", FALSE)

length(files) == 5000

results in TRUE.

This results in storage_multidownload silently(!) downloading fewer files than expected.
