
Comments (12)

matt-dray commented on June 30, 2024

Some thoughts:

  1. No two RAP journeys are the same. Sure, packrat may suck for a behemoth publication, but it's probably okay for helping publish a few RAPped tables or a very short doc with few package requirements.
  2. The 'packrat problem' can be helped (but not solved) by following the tinyverse approach as per #81.
  3. With R in mind, what other non-Docker solutions do we have? miniCRAN as per #84, but should we be thinking about RStudio Package Manager?
  4. As we all know, RAP is language- and tooling-agnostic, and the companion can't provide total coverage of every option. This might mean an explanation of some approaches (e.g. packrat for R, or virtualenv and requirements.txt for Python) but a tutorial or how-to on the One True Way (e.g. Docker) – emphasised words as per the Divio blog.

from rap-website.

mammykins commented on June 30, 2024

On @RobinL's final point: reducing package dependencies also reduces the maintenance burden and the exposure to security vulnerabilities that have yet to be discovered in those dependencies. Some strategies are discussed here: https://gds-way.cloudapps.digital/standards/tracking-dependencies.html

This may or may not be an issue depending on the context of the specific RAP use case.


mammykins commented on June 30, 2024

Also, there is a rung up on the ladder from no dependency management before you get to packrat and docker:
devtools::session_info()
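
A minimal sketch of that rung, assuming base R's sessionInfo() is an acceptable stand-in when devtools isn't installed. The file name session_info.txt is hypothetical, not something from this thread:

```r
# Record the R version and the loaded package versions next to the outputs,
# so a later reader can at least see what the analysis was run with.
# sessionInfo() is base R; devtools::session_info() gives a fuller listing
# if devtools is available.
info_file <- "session_info.txt"  # hypothetical file name
writeLines(capture.output(print(sessionInfo())), info_file)
```

This records versions but does not restore them, which is why it sits below packrat and Docker on the ladder.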


alexander-newton commented on June 30, 2024

CRAN only publishes binary packages for the 'current' version of a particular package, so once you attempt to install an older version of a package (as you might do if you've frozen the R package dependencies for a project and then try to transfer that project to a separate machine) you'll need to install from source, which requires build tools.

Packrat could potentially download and store package binaries within a project for re-use, but unfortunately this is not done right now.

This is all well and good on personal machines, but build tools are usually locked down on government systems (for good reason!). As a result, it can be impossible to return to older states.


TimTaylor commented on June 30, 2024

I've only used packrat a couple of times and always found it a bit clunky, but I did manage to get it working on quite a locked-down machine. Similarly, I've only used Docker a couple of times, but for true reproducibility, saving the final image somewhere is probably easiest. I think a lot of it comes down to proportionality (e.g. for the RAP companion I think using packrat or Docker is overkill anyway, but I appreciate that for RAP projects it may be of greater benefit). We have done a piece of work where we combined a Dockerfile with a packrat collection, but the packrat file is so massive that I'm unconvinced of the benefit versus just saving the resultant image. Perhaps someone who has thought about it more will chip in.


RobinL commented on June 30, 2024

TL;DR: Docker (or equivalent) is probably the most robust tool for reproducibility if used correctly. However, it's also frequently unavailable, easy to use incorrectly, and difficult to understand for less technical colleagues. If you have control over your environment, packrat is probably the easiest tool to gain reproducibility. The best solution may be a combination of Docker and packrat, with the user doing dependency management in packrat, and reproducibility coming from the fact that their RStudio environment is running in a Dockerised container (they don't need to know about this). Another possible solution is just to force users to use a specific Dockerised RStudio with no ability to install new packages.

More details

Our objective is to be able to return to an old project, or run someone else's project, and get the same result. Overall, I've found no workflow that is quick and easy. Various solutions work but are quite technical (less technical users find them confusing, time consuming, or both).

Note that I mainly have experience using packrat on RStudio Server, run from rocker/rstudio. I also have experience of running packrat on Mac OS X with a local install of RStudio, and on Windows.

No dependency management
Using no dependency management is extremely bad news. Attempting to get others' code working, even code written only 6 months before, can be an absolute nightmare. I'm completely convinced that, despite the flaws of the tools, some form of dependency management is absolutely critical.

Packrat
Overall I've found that it's fairly common to run into packrat problems. Often these are failures of packrat::restore, or failures to create a working packrat.lock file. Error messages are difficult to decipher and often don't tell you the root cause of the problem, sending you on a wild goose chase. I have spent hours and hours wrestling with packrat. The 'acid test' is basically whether you can build (packrat::restore) your project in a new Docker environment. If it works, then you're fairly safe... ish. If you don't test this, then your ability to packrat::restore may be contingent on some operating-system-level dependency that isn't tracked by packrat.
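
A minimal sketch of the 'acid test' described above, assuming packrat is installed in the clean environment. The helper name acid_test is hypothetical; packrat::init, packrat::snapshot and packrat::restore are the standard packrat calls:

```r
# On the original machine, inside the project directory:
#   packrat::init()      # create the private library and packrat/packrat.lock
#   packrat::snapshot()  # record the package versions currently in use

# In a clean environment (for example a fresh container), rebuild the
# project library from packrat/packrat.lock. Hypothetical helper name.
acid_test <- function(project_dir = ".") {
  lockfile <- file.path(project_dir, "packrat", "packrat.lock")
  if (!file.exists(lockfile)) {
    stop("No packrat.lock found in ", project_dir)
  }
  packrat::restore(project = project_dir)
}
```

If the restore only succeeds on the machine where the lockfile was created, that points to a system-level dependency that packrat is not tracking.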

One of the most frustrating things I've found with packrat is the need to install packages afresh into each project. This is a particular problem on Linux-based systems, where packages need to be recompiled every time. This can take upwards of an hour for a big project. We have a solution for this problem here, but this solution is fairly specific to our setup (running RStudio from the rocker/rstudio Docker image).

There doesn't seem to be an easy way of using your global package install directory as a 'cache'. This is an option in packrat.opts, but it doesn't work the way you may think it should. This doesn't make much sense to me: if you have a specific version of dplyr installed globally (which takes 10 minutes plus to compile anew), why isn't there an easy way to symlink the existing install rather than recompiling?

Overall the problem seems to be that packrat is too strict. This is discussed in detail by the devs and others here.

Docker
Some people don't bother with packrat. It seems that a common pattern is to just run install.packages statements in your Dockerfile. You can then use the Docker cache to prevent having to recompile all the time.

If you have access to Docker, including a Docker repository that you trust to back up your built images indefinitely, this is probably a reasonable solution. A huge gotcha, though, is that if you rebuild the Docker image then you might get a different result. It seems fairly common to just use install.packages("dplyr") or whatever in the Dockerfile, which of course will pull the latest version. You may say, well, let's use remotes to pull a specific version. But now you're basically writing a packrat.lock file, and I'm not sure how this will deal with updates to dplyr's dependencies. It's probably not as reproducible as it looks unless you're very diligent in cataloguing all your Docker builds.
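
A sketch of the 'use remotes to pull a specific version' idea, assuming the remotes package is installed. The helper name install_pinned, the version number and the repository URL are illustrative and not taken from this thread:

```r
# Pin a package to an exact version instead of taking the latest from CRAN.
# Note that this pins dplyr only: its dependencies are still resolved to
# whatever versions the repository currently serves, which is the gap
# described above.
pinned <- c(dplyr = "0.8.3")  # example version, chosen for illustration

install_pinned <- function(versions, repos = "https://cloud.r-project.org") {
  for (pkg in names(versions)) {
    remotes::install_version(pkg, version = versions[[pkg]], repos = repos)
  }
  invisible(versions)
}

# Usage (needs network access, and build tools for source installs):
#   install_pinned(pinned)
```

Keeping the pinned versions in one named vector makes it obvious that this is a hand-written lockfile that covers only the top-level packages.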

Once you start to worry about this versioning issue, then you probably want packrat anyway, because it will handle cataloguing the specific version of dplyr and all its dependencies. So I feel like to get full reproducibility here you need a Dockerfile that performs a packrat::restore within it.

Our 'solution'
At the MoJ, we're deploying RStudio to users from this Dockerfile. The user therefore doesn't realise they're using Docker, but that doesn't matter: they're guaranteed that their computing environment is fully specified and under version control. They then use packrat for reproducibility. They get fast Linux-based package installs using our custom CRAN proxy.

Having said all of this, I have spent countless hours struggling with packrat, and it's one of my least favourite tools. Here's a recent tweet with various opinions from the community.

Another possible option
Given all of these troubles, I wonder whether a more extreme solution may turn out to be better for users: give them a 'reproducibility' Docker build of RStudio when they're doing certain projects like RAPs. This would be a totally fixed computing environment, where the list of packages is predetermined and changes only very infrequently (annually?). If I were to start again with RAP, I think this might actually be my preferred approach.

Other stuff
I've slowly learned that it's almost always best to reduce the number of package dependencies to the minimum possible. The promise of re-using others' code comes with a big dependency management cost. So others' packages should be used judiciously, and each one should be treated as a 'cost'. I don't agree with this tweet, but I do think it contains a grain of truth.


TimTaylor commented on June 30, 2024

It is also important to understand what "reproducibility" is. Reproducibility ensures that the results you achieved can be achieved by others (a good thing when we are publishing). However, using packrat or Docker images for a piece of analysis that you perform each year is not necessarily ideal if bugs in your dependencies have been fixed in the interim. Whilst unit tests make it easier to check for issues when reproducing tables, it is trickier when you move towards more algorithmic projects (e.g. using regression or optimization algorithms). Adding in the aforementioned security issues, my current leaning would first be a package (with CI to see if/when it breaks) combined with session info from when the analysis was run. If you want to archive it for reproducibility, build it in Docker and make the image available.


matt-dray commented on June 30, 2024

Is renv going to be the solution for R? Keep an eye on it.

The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors.
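
For comparison with the packrat workflow, a minimal sketch of the equivalent renv steps, assuming renv is installed. The helper name restore_from_lockfile is hypothetical; renv::init, renv::snapshot and renv::restore are renv's documented entry points:

```r
# In the project directory on the original machine:
#   renv::init()      # set up a project library and write renv.lock
#   renv::snapshot()  # update renv.lock after installing or upgrading packages

# On another machine, reinstall the versions recorded in renv.lock.
# Hypothetical helper name.
restore_from_lockfile <- function(project_dir = ".") {
  lockfile <- file.path(project_dir, "renv.lock")
  if (!file.exists(lockfile)) {
    stop("No renv.lock found in ", project_dir)
  }
  renv::restore(project = project_dir, prompt = FALSE)
}
```

Whether this avoids the source-compilation problems raised earlier in the thread depends on the repository in use, so it is worth running the same clean-environment test as for packrat.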


alexander-newton commented on June 30, 2024

There also seem to be issues installing packages that need compilation under packrat. For instance, at the MoJ, we can install devtools only when packrat is disabled. Unsure of the cause.


RobinL commented on June 30, 2024

ps @alexander-newton - if you're having issues with devtools installation, raise an issue here :-)


ivyleavedtoadflax commented on June 30, 2024

There is also checkpoint, which may be a good solution for some. One downside, as @RobinL points out, is that it can take an age on Linux. I used checkpoint in the original RAP, and it works more smoothly than packrat, but it lacks the fine-grained control of dependencies that you get from packrat, even if packrat is rather temperamental.

Frankly, this lack of easy dependency management in the R ecosystem is a huge pain, and I expect someone will solve it at some point with a recipe-based system similar to Python's pip.

In the long run I think that containerisation is a sensible solution, and one could envisage a container relating to each publication: here is one from the first RAP: https://github.com/DCMSstats/eesectorsdocker -- I used checkpoint to manage dependencies here.


ivyleavedtoadflax commented on June 30, 2024

"I expect someone will solve it at some point with a recipe based system similar to Python's pip."

I saw this at a UseR event today; looks like someone already did it: https://github.com/trinker/pacman/blob/master/R/p_install_version.R

