Comments (5)
Well, it's on the to-do list now, but my current focus for AzureR is on getting the Table storage/CosmosDB package done. So I wouldn't expect any major changes for AzureStor in the short term.
from azurestor.
Hi, thanks for the comments!
This is a rather tricky task, mostly because of how to design the interface. Note that there is no UI to upload a single object directly to storage either: if you have, say, a data frame, you have to serialise it to a connection object or file first, and then upload that.
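As a sketch of that serialise-then-upload workflow (the endpoint URL, key, and container name below are placeholders, not real resources):

```r
library(AzureStor)

# Placeholder endpoint and credentials -- substitute your own
endp <- storage_endpoint("https://mystorage.blob.core.windows.net", key = "mykey")
cont <- storage_container(endp, "mycontainer")

# Serialise a data frame to a raw vector, wrap it in a connection, upload
con <- rawConnection(serialize(mtcars, NULL))
storage_upload(cont, src = con, dest = "mtcars.rds")
```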
I'm also not convinced of the utility of multiple uploads of objects. Even if you create a tempfile on disk, your Internet connection is much more likely to be the bottleneck. The exception would be if you are doing wholly within-Azure transfers (eg from a VM to a storage account) and everything is in the same region, but then your absolute transfer times are going to be pretty good in any case.
As for object size, you're limited by available memory so there's only so much you can upload at once anyway.
Feel free to continue iterating on this, I'd be interested in knowing more about your particular use case if you believe this is warranted.
Hi, thanks for the quick response & sorry for the delay in responding.
At our firm, we run little to nothing on our laptops. All VMs (including the one hosting RStudio Pro) run on Azure, so yes, we are looking at the case of within-Azure transfers where network bandwidth is not a problem. (In my experience this is fairly typical of large enterprises with a single cloud provider.)
We prefer to keep our disk sizes small on our VMs and so our /tmp storage is not that big. However, we have fairly sizable RAM, clock speeds and several cores. We spin VMs up & down as needed. So memory & CPU utilization are not typical bottlenecks but I prefer to avoid unnecessary disk I/O.
The use case is being able to take objects that we already have in memory (for example, we generate several covariance matrices using mclapply over all the cores) and then upload them to Azure blob storage in parallel.
Currently, we have to either
- write all of them to /tmp (if we add space) resulting in unnecessary disk I/O and then use storage_multiupload
- write some sort of apply loop ourselves and use storage_upload with rawConnection for each object
It would instead be beneficial to have something already built into your package, similar to storage_multiupload for parallel uploads, but taking a collection (a list?) of objects that we provide, instead of file paths.
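The second option above (an apply loop over storage_upload with rawConnection) can be sketched roughly as follows; the container object `cont` and the named list of objects `mats` are assumed to already exist:

```r
library(AzureStor)
library(parallel)

# `cont`: an existing blob container object
# `mats`: a named list of in-memory objects, e.g. covariance matrices
upload_one <- function(nm) {
    con <- rawConnection(serialize(mats[[nm]], NULL))
    storage_upload(cont, src = con, dest = paste0(nm, ".rds"))
}

# Fork one worker per object (capped at the core count)
mclapply(names(mats), upload_one,
         mc.cores = min(length(mats), detectCores()))
```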
Hope that clarifies things. Let me know your thoughts.
Thanks
While you may have small disks on the VMs, I'd still be surprised if you don't have far more disk space than memory, so that shouldn't be a constraint. Note that even if /tmp is a limited filesystem, R's tempfile() lets you choose the directory, so you could write your files to, say, tempfile(tmpdir="~") to write to your home directory. Or if you have a data disk mounted, you can write there.
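For reference, the tempfile() argument that controls the location is tmpdir; a minimal sketch of redirecting temp files to the home directory:

```r
# Create a temp file under the home directory instead of /tmp
f <- tempfile(tmpdir = "~", fileext = ".rds")
saveRDS(mtcars, f)

# ...upload f with storage_upload()/storage_multiupload(), then clean up
unlink(f)
```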
Similarly, the network may be fast for within-Azure transfers, but assuming you're using SSDs, writing to local storage should still be much faster. So any slowdown from writing to a tempfile should not be a major factor. In particular, any blob upload involves at least 2 API calls with associated latency, so there is a lower limit to how fast things can get.
If you are using mclapply to parallelise your computations, you can also insert the storage_upload as part of the call, rather than doing it separately after the compute is finished. That would save having to wait on the slowest job.
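That suggestion might look like the following sketch, where each worker uploads its result as soon as it is computed; `cont` (a blob container) and `datasets` (a list of data frames) are assumed to exist:

```r
library(AzureStor)
library(parallel)

# Compute each covariance matrix and upload it inside the same worker,
# so uploads overlap with the remaining computations.
results <- mclapply(seq_along(datasets), function(i) {
    m <- cov(datasets[[i]])                        # the compute step
    con <- rawConnection(serialize(m, NULL))
    storage_upload(cont, src = con, dest = sprintf("cov_%02d.rds", i))
    m                                              # keep the result too
}, mc.cores = detectCores())
```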
Yes, I am familiar with all of your points and know of all those possibilities. My only suggestion was to have the package do all this (parallel loads et al.) instead of the user implementing it themselves. If you don't think this is something that will be implemented any time soon, or at all, I can just build something out myself.
Thanks for listening. I really like this package and thanks for your efforts on it.