Comments (7)

MilesMcBain commented on August 10, 2024

Hi all, thanks for this thread. It has me thinking deeply about my philosophy for data analysis projects and how it fits with all of this.

Some high level thoughts:

Regarding reproducibility, I always like to consider it in two aspects: 1. being the mechanical reproduction of the analysis result from analysis inputs, 2. being the reproduction of the knowledge that led to the creation of the analysis code.

These two don’t automatically come together, although there is much interplay between them. In some ways, 1. is ‘easier’ because it is a domain more easily spanned by automated tools. We can’t be asleep at the wheel though; we still have to select and structure our dependencies judiciously. And the automated tools have some wicked edge cases that we need to be able to handle.

Some people might claim 2. is the domain of prose, and that mandates literate programming, but I tend to disagree. It is possible to communicate a great deal of domain knowledge in code, such that it is illuminating beyond the mere mechanical number crunching. To do this well, the author needs to make use of certain styles and structures that produce code with layers of domain-specific abstraction a reader can traverse up and down as they build their understanding of the project. Functional programming style, coupled with a dependency graph as in {targets}, are useful tools in this regard.
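To make that idea concrete, here is a minimal sketch of what such a layered {targets} plan could look like (the data file and function names are hypothetical; the point is that each target is named after the domain concept it produces, not the mechanics that produce it):

```r
# _targets.R -- a hypothetical plan with domain-specific abstraction layers
library(targets)
tar_source()  # loads R/ functions such as clean_admissions(), fit_survival_model()

list(
  tar_target(raw_admissions, read_admissions("data/admissions.csv")),
  tar_target(admissions, clean_admissions(raw_admissions)),
  tar_target(survival_model, fit_survival_model(admissions)),
  tar_target(model_summary, summarise_model(survival_model))
)
```

A reader can traverse this graph top-down ("what is computed from what") before ever opening a function body.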

Almost certainly there is benefit to packaging some of these abstractions along the way.

One thing I’ve been thinking on recently is that if you do 2. really well, it can help you be more resilient to problems with 1. If it’s very clear what code is doing and the layers of abstraction are built from sufficiently decoupled software, it’s not a project-killing task to swap out one dependency for another or to patch a hole left by a dependency yourself. On the other hand, if you do 2. poorly, and your code sets in solid concrete, you’re pretty much entirely reliant on no failure happening in 1.

On Literate programming then, I’m yet to see a really compelling example of wrapping all of an analysis project’s code in (R|Q)md. When done poorly this can even go so far as to produce something less intelligible than plain code or plain prose. E.g. the case where an attempt is made to ship a lot of low level ‘nuts and bolts’ code in a monolithic (R|Q)md document where the prose is trying to sit at a more overview level of detail. The code and text can fight for the narrative rather than complementing each other, and the linear structure makes code a chore to navigate (illiterate programming?).

I don’t see a lot being added to a {targets} plan by wrapping it in (R|Q)md. That’s not to say a high level document that discusses the plan is of no value. I’m just saying I don’t think you’d discuss it in code form - it’d probably be better to take the graphical representation and discuss that? You’d need some way of selecting sub-graphs for display.

Individual targets themselves are possibly another story; I have never done it, but I can see how it might be helpful. Particularly if the literately programmed targets were all cross-referenced into some kind of linked document - kind of like a {pkgdown} website - to me that seems a more natural way to explore an analysis graph than a linear document.

‘Assertive programming’ is a topic that might be missing from the book. I think of it as a kind of dual of unit testing. Unit testing is for more generally applicable packaged code. But when you have functions in your analysis pipeline that operate on a very specific kind of input data, unit testing becomes kind of nonsensical because you’re left to dream up endless variations of your input dataset that may never occur. It’s a bit easier to flip the effort to validating the assumptions you have about your input and output data, which you can do in the pipeline functions themselves rather than separate unit testing ones. This is nice because it ensures the validation is performed in the pipeline run, and so is backed by the same reproducibility guarantees.
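As a sketch of what this could look like (the column names and function are hypothetical), the validation lives inside the pipeline function itself rather than in a separate test file:

```r
# Assertive programming sketch: validate assumptions about the input and
# output data inside the pipeline function, so the checks run (and are
# cached) with the pipeline itself.
clean_admissions <- function(raw) {
  # Assumptions about the input data
  stopifnot(
    is.data.frame(raw),
    all(c("patient_id", "admit_date") %in% names(raw)),
    !anyNA(raw$patient_id)
  )
  out <- raw[!duplicated(raw$patient_id), ]
  # Assumptions about the output contract
  stopifnot(nrow(out) > 0, anyDuplicated(out$patient_id) == 0)
  out
}
```

If a data update ever breaks one of these assumptions, the pipeline fails loudly at the offending target instead of silently producing a wrong result downstream.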

Responding to some things in the thread:

  • I think it’s great to have {renv} and potentially some other options discussed, but maybe not with {targets}. I would think it is more appropriately discussed in the same chapter as Docker, with {renv} sitting near the start of a continuum of reproducibility measures that ends at Docker. How far you go along the continuum depends on your reproducibility goal.
  • There are still a lot of people who find git intimidating, and there is still potential for some things to go badly for a project if git is used in the wrong way. I once had a colleague who assured me they knew how to use git, then proceeded to use a repo like their personal Dropbox folder. Perhaps the details of git usage can be basically waved away, but some detail about good git workflow could be incorporated. For example:
    • The branching model to use. IMHO trunk-based development works much better than gitflow for analysis teams.
    • Version number discipline. Why you always bump the version number when making changes to your packages.
    • Why keeping commits small and confined to just one target at a time if possible is useful when tracing problems with a pipeline.

That’s all I have for now; I’ll be back with some targets-specific thoughts.

from rap4all.

b-rodrigues commented on August 10, 2024

Hey guys, hope you had a nice winter break! I've just pushed a quarto template to the repository with the structure we've discussed. I hope I didn't forget anything. Would you kindly take a look? I also copied some of your comments into the different chapters/sections and hope I didn't misunderstand.

While doing this, I was also thinking about the project itself. I guess we should have the readers follow along with a project that they would work on from start to finish? What do you think? This way, we could "simulate" the experience of conducting this project with our proposed approach.

statnmap commented on August 10, 2024
  • You can inflate Qmd files with fusen. For now, there is no option for a qmd flat template by default; it will be an Rmd file, for which you need to change the extension. But we can imagine proposing qmd flat files; indeed, only the YAML header would change a little. I opened an issue for this: ThinkR-open/fusen#175
  • PROPRE or RAP is more a philosophy than a tool. It is for people to define the way they want to interact and what they want to automate: to detect where in their process they want to use their time and brains on more interesting things than copy-paste and emails. So I see it more as the introduction to this kind of book, or the main thread of the story. Why would you want to take time to make things reproducible? What does it mean for you?
    Hence, I would recommend git + literate programming for sure. I would recommend thinking about packaging for reliable, documented work, and about the separation between tools and analyses. I'd probably recommend {targets} to save time, but also to force thinking about the processes involved.
    But using Docker or {renv}, I do not see as a general recommendation. It is more of a choice that needs to be made depending on the context. Having developers dedicated to maintaining the tools could be enough.

b-rodrigues commented on August 10, 2024

Thanks for the clarifications Sébastien!

So overall, you would agree with the proposed outline, if I understand you correctly. I think that making the PROPRE approach the main thread would probably be a good idea.

Regarding Docker and {renv} though, I would still suggest that we include them, somehow, because in my opinion we also want the pipeline to be reproducible (or "future-proof", in a sense). The way I did it in my ebook for the uni course was to illustrate the problems that using neither renv nor Docker would cause in the future. "Luckily", when we tried to rerun the demo pipeline from William Landau (https://github.com/wlandau/targets-minimal) in class, we had mixed results; some students managed, some didn’t, and what was interesting was that the operating system the students were running didn’t seem to matter! So this was a nice way to introduce them to Docker as a way of getting something to always run.
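For reference, the basic {renv} loop that addresses this is small (a sketch using the standard renv functions, run interactively from the project root):

```r
# Record and later restore the project's exact package library with {renv}.
renv::init()      # create a project-local library and an renv.lock file
# ...develop the pipeline, installing packages as needed...
renv::snapshot()  # pin the exact package versions in renv.lock
# Later, on a fresh clone or another machine:
renv::restore()   # reinstall exactly the recorded versions
```

This pins package versions but not the OS or system libraries, which is where Docker picks up.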

Looking forward to hearing what @dmi3kno and @MilesMcBain think of this.

dmi3kno commented on August 10, 2024
  • Chapter 1: Functional programming

    • It seems to me that {fusen} and {targets} require two different approaches to functional programming, reflecting @statnmap's separation of tools from analyses. So I would love this chapter (not sure whether it needs to be Chapter 1 or 2) to highlight this separation. This mindset is very important to put forward early in the book.
    • Perhaps the first chapter should generally be about reproducibility and what we mean by reproducible, replicable, robust, etc.
  • Chapter 2: Git

    • Much has been written about Git and there's really very little we can say that is unique to R. So I am not completely sure it deserves a chapter. Version control is crucial, GitHub/GitLab is essential for collaboration. Period. Here are cool resources to get started.
  • Chapter 3: Writing up the analysis with Quarto.

    • Again, Quarto is such a vast landscape to cover, but if we focus on one particular aspect of it, flat documents rendered to HTML, then we spare ourselves the trouble of covering LaTeX or Reveal.js oddities.
    • We need to introduce Quarto (as Rmarkdown 2.0) to discuss fusen, so some minimal introduction with generous referencing to the official documentation should do the job.
  • Chapter 4: Packaging the tools (with fusen)

    • So in the {fusen} workflow this is happening after some analysis has been written up and the author realized: "Ah! there are some things to package here for future reuse and testing" (the tools/analysis separation).
    • We need to convey important concepts regarding the required components for packages (e.g. documentation, examples, etc), and how they are mapped to the comments in the documents. We will describe something like "thoughtful commenting with {fusen}", because with careful and deliberate commenting you will not have to repeat yourself (DRY principle).
    • I think meaningful examples should be included here, which become unit tests in fusen. It is part of packaging the tools, ideally inseparable from it, because otherwise there will be very little incentive to write good tests.
  • Chapter 5: Targets

    • I think a good segue here is: everything that is not a package should be a target (with very well-motivated exceptions). I have never written a Target Markdown document that is also foldable by fusen, but I think it is an interesting challenge. I would love @MilesMcBain's thoughts on this, and maybe we will have to come back to Will, but I did a README file which was a Target Markdown document (and could very well be a foldable fusen document as well), while my analysis package only did tar_read() and tar_load().
    • There are some gotchas with globals and I would love to discuss how much of that could/should be packaged. I have a feeling that packaging a large portion of global functions solves some of the issues I experienced with, for example, creating the initializing functions for my MCMC pipelines.
  • Chapter 6: Environment management with renv and Docker.

    • I think all modern R users need to be aware of renv. If you don't know what renv is for, you will.
    • Docker is somehow the next step beyond renv: it does what renv does, but in a clean compute environment. Depending on the nature of the computation, it might be more or less difficult to replicate the computational environment exactly (e.g. use of external OS-specific libraries), so that is something we need to get across. When is Docker needed? One thing we need to keep in mind is that the starting point for Docker for Python users is very different from ours: they don't have CRAN and the dependency management... well, you know it.
  • Chapter 7: Continuous integration.

    • This is, again, very useful knowledge to have these days. If not for the analysis itself, at least setting up the analysis website under CI is a useful skill to have.
    • This is potentially a very deep topic, but R users would probably just need to be aware of the name Jim Hester and know of the {usethis} and {actions} packages. And where to find help when something goes wrong (and it does).
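The tar_read()/tar_load() pattern mentioned under Chapter 5 can be sketched as follows (target names are hypothetical; the point is that the README or report only reads finished targets and never recomputes anything itself):

```r
# In a report or README chunk: pull results out of the completed pipeline.
library(targets)

tar_make()                         # bring the pipeline up to date
model <- tar_read(survival_model)  # read one target's value by name
tar_load(model_summary)            # or bind a target straight into the session
```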

statnmap commented on August 10, 2024

I might rephrase what Miles just said, as I need to make it clear in my head. 😃
I think we agree on the content and on what such a data analysis workflow could look like.
There seem to be two ways of presenting such a workflow:

  • What does the complete project look like at the end, and what tools do I need?
  • How do I start working on it?

What does the complete project look like at the end, and what tools do I need?

If what I want to achieve is a data analysis report, then I personally see the complete project as:

  • A fully documented and tested package
  • A {targets} workflow that uses the functions included in the package. The workflow itself could be included and tested in the package
  • The final report, which presents some outputs of the {targets} workflow along with the analysis of the results

[attached photo: hand-drawn diagram]

How do I start working on it?

Depending on your ability to foresee everything that will come, I'd follow Dmytro in saying that you probably won't start with writing functions in a package.
My approach usually is:

  • Explore the data in an Rmd; write down what you are looking for and how.
  • At some point, you'll want to:
    • Extract the code inside chunks, gather it as functions, and move it out of the Rmd. And as we say functions, we say package and HTML website.
    • Link the different parts of your chunks to show what depends on what, and thus start thinking about a {targets} workflow, which can be developed in the package.

I do not know if you start by thinking about the network or about gathering functions. It probably depends on the type of analysis, how big the workflow is, ...

  • I imagine the original Rmd becoming the future vignette of the package, because it explains why and what to use. And this does not prevent including some {targets} workflow in it. Whether it is one big workflow or multiple small ones in this case, I do not know.
  • Finally, you'll want to write your report, the place where you think about the results of your data analyses. This report's structure could be similar to the vignette / original Rmd, as you write your story in the same order.
    • You can run the complete {targets} workflow prior to the Rmd, or in the first chunk, and then call intermediate outputs in your report to illustrate your analysis (tables, figures, ...).
    • You can re-run the final report when some data is updated. This does not have to be the case for the vignette of the package.

[attached photo: hand-drawn diagram]

Concerning reproducibility

I guess that the package is Miles's option 2, where you make the workflow reproducible / reusable thanks to package docs and tests.
And the report lives in option 1, which you'd want to complement with {renv}, Docker, ..., to make the analysis results reproducible.
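Further along that continuum, a minimal Dockerfile could combine a pinned R version with the renv.lock file (a sketch, assuming a Rocker base image and an renv.lock produced by renv::snapshot()):

```dockerfile
# Freeze the OS and R version, then restore the package library from renv.lock
FROM rocker/r-ver:4.2.2

WORKDIR /project

# Restore the exact package versions recorded by renv::snapshot()
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Copy the analysis and run the pipeline
COPY . .
CMD ["R", "-e", "targets::tar_make()"]
```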

b-rodrigues commented on August 10, 2024

BTW, the book is up:

https://b-rodrigues.github.io/rap4all/
