Comments (78)

ewels commented on July 28, 2024

Paging the @bioconda elite team: @johanneskoester / @daler / @dpryan79 / @epruesse (and any others who are interested)..

This thread is a little different from the usual for you guys, but I would really appreciate your input here. The discussion above is pretty long, so here's my summary:

  • nf-core collects @nextflow-io pipelines, see https://nf-co.re - we lean heavily on bioconda for the software dependencies and most of us are small-time contributors. Quite a lot of how we operate the @nf-core github organisation is inspired by what we've seen you guys do with @bioconda
  • The latest version of @nextflow-io now officially supports a new DSL2 syntax, allowing easy imports of partial pipelines (modules) from other locations. We would like to create some kind of repository for tool wrappers (processes in Nextflow jargon, aka rules in Snakemake). The idea is that pipeline developers can then easily import these into their workflows.
  • Each module, or tool wrapper, is essentially a small text file which describes the process along with details of a container and basic config flags. The repo will also have tests built in, but these don't necessarily need to be included in the pipelines that use them.
  • This is developers only - we want users to just have flat pipeline repos where all of this is handled
  • Whatever solution we come up with, we will probably wrap the commands in nf-core tools anyway to make life simpler for developers

The above thread has a discussion about how to handle the logistics of managing this new repository, with a few different ideas of how we could build it. They are:

  1. Just have this entire repo included as a git submodule in every pipeline
  2. Write custom code in the nf-core/tools helper package which copies flat files out of here and into a pipeline, with a metadata file to track which files were included at which commit. Lint tests would check that they are not edited in the pipeline repo.
  3. Use npm (node package manager) to handle the imports. We ideally want everything in one repo for all modules, instead of having one per wrapper, but apparently the new GitHub npm registry can handle that.
  4. Use conda to manage this with a custom @nf-core channel.

Of course the final point is where we would like your input! @grst has put together a draft PR where he has mocked up a back-end we could use for managing this which leans heavily on bioconda-utils - see #14.

What do you guys think?

  1. Is conda massive overkill for packaging a handful of small text files?
  2. Should we avoid relying on bioconda-utils, or is that a good idea?
  3. If it were you, which of the above approaches would you use?

We have a bit of a range of opinions here - I'm a bit wary of premature optimisation and making things more complex than they need to be. But others above are equally wary of us reinventing the wheel when there are tools which already do some of what we need.

Thank you in advance!

pditommaso commented on July 28, 2024

Awesome, I think all the pieces are there for a pilot module!

daler commented on July 28, 2024

(pinging @mbargull and @bgruening as well)

Interesting discussion!

Is conda massive overkill for packaging a handful of small text files?

It sounds like those small text files will not themselves be complex but will be versioned and will be dependent on each other. That's where the complexity of the problem lies, and for that I don't think conda is overkill.

Should we avoid relying on bioconda-utils, or is that a good idea?

bioconda-utils was originally written to be general, but I'm not actually aware of anyone else using it. It may not work for you as a drop-in replacement because there are decisions we made there that are specific to how we organize things, but there are certainly useful ideas/methodology/code in there that you could re-purpose.

If it were you, which of the above approaches would you use?

To me the GitHub npm registry is still an unknown so if it were me I'd probably dig a little to understand how that works and balance that against conda. But I would probably use conda/npm over git submodule or a custom solution. Conda/npm gains you searchability, dependency tracking, versioning, and hosting -- all of which would be hard to develop from scratch or to implement in a submodule context.

Just my 2 cents, curious to see what others think.

grst commented on July 28, 2024

I have the feeling that this is going a bit in circles and maybe you guys have to figure that out at the Hackathon while having a few beers ;)
Maybe we also need to move both approaches (i.e. simplistic vs. package manager) forward a bit to see how it goes. I put together a proof-of-concept for conda, maybe someone else can create something similar for the other approach?

Some more thoughts:

  • Like @drpatelh, I don't really like the idea of using different versions of the same tool within a single pipeline. Maybe it's better if people are forced to update a module if they hit a version conflict instead of using different, potentially outdated versions in their pipeline?
  • What would be your strategy for versioning modules with the simplistic approach? I mentioned this earlier, but it has not been picked up so far:

I don't think a git repository is good for keeping track of "released versions". Yes, there are tags and releases, but we want individual releases for each module. For this to work, we would at least require some external system that links a certain version number of a module to the corresponding commit hash.

  • How do the approaches compare if we think big, i.e. 1000s of modules, 100s of versions, 100s of PRs per week, sub-sub-sub-sub workflows etc.? Bioconda/conda-forge prove that it's manageable using their build system. What are the challenges we would face creating our own?

  • Using the simplistic build system, how would sub-sub-sub-... workflows resolve? If all sub-workflows are self-contained it would certainly become a mess. (Think of a pipeline predicting targets for personalized cancer vaccinations that depends on a neoantigen prediction pipeline, which depends on a fusion neoantigen prediction pipeline, which depends on a variant calling workflow, which depends on a DNA-seq workflow, which depends on a quality control workflow, which depends on the fastqc module.) ... But maybe it really doesn't matter that much.

Finally, answering @junjun-zhang's question:

What I am not sure about is whether Conda supports this nested packaging, or whether it's good practice in Conda.

I believe it would be possible by tweaking the recipe file, but it's not considered good practice, or at least, I've never seen it before. After all, the whole point of conda is to resolve all dependencies such that there are no conflicts.

pditommaso commented on July 28, 2024

Since there are so many package managers out there, is there nothing that could be used to manage NF module assets? I was even thinking of using npm. Would that be so crazy?

ewels commented on July 28, 2024

Yes this was my initial thought as well. But I still see two main drawbacks:

  • We will have to have a separate repository for every tool / process we want to import
  • We have a lot of pipeline curators! So the extra dependencies are still a little irritating.. (though not too bad, and if commands are wrapped then I agree it's not a big deal)

maxulysse commented on July 28, 2024

I like the conda idea, I think that it'll fit well within the bioinfo community as well

pditommaso commented on July 28, 2024

Important: my suggestion is to use Conda (or npm, etc) to allow the pipeline developer to fetch one or more specific modules/versions and include them in the pipeline project.

The pipeline user is not expected to have any interaction with the package manager, since they will get the modules along with the pipeline using the usual nextflow pull/run commands.

I think it's clear, but just to make sure we all agree on this.

pditommaso commented on July 28, 2024

+1 to have a dedicated nf-core channel on Conda (!)

mbargull commented on July 28, 2024

I'm pretty much in alignment with what @daler wrote regarding those questions.

If you need something that handles dependencies of versioned packages/workflows, then right, using a multi-platform package manager surely makes sense. (As for npm I can't comment either due to lack of knowledge.)

You can of course use bioconda-utils, but yes, there might be some Bioconda-specific quirks in it. Personally, I'm not against someone else using bioconda-utils and generally in favor of making tools more universally usable (if pragmatically feasible, that is ;) ). But please let us know if you encounter issues then -- happy to collaborate!

My general advice would be to explore the options you see and just try out what seems to make sense to you. Conda with or without bioconda-utils may be one of those. However, maybe try to have/keep a general concept that is not too dependent on the build and packaging system. If you find that you want to change either of those due to some new (or previously not thought of) conditions, you wouldn't want your system to rely on their peculiarities from the get-go.

As for concrete advice, I can't give any yet because I haven't read this discussion yet and thus have no overall picture of what your requirements might be.
(Though, if you want to go with Conda packages, I would advise adding a certain prefix -- at least nf-, or rather something more distinct -- to the package names if you want your users to be able to install software and workflow packages side by side in environments. Otherwise you might get undesirable name clashes. Meaning: Just try to plan ahead ;).)

drpatelh commented on July 28, 2024

I'm slightly confused about the practicality of all of this 🤔 Given the sub-workflow scenario, we could potentially be using different versions of the same tool in a given workflow? How would you even begin to conventionally write that up in a paper? Yes, the pipeline in its entirety is reproducible, and I understand that version control is important and shouldn't be compromised, but surely there is a way in which we can instruct individual modules to use particular versions of software? I also understand that module files may become redundant with individual tool updates, but this seems to be a more practical aspect to put under version control. I'm not suggesting I have the answers, but maybe we need to be thinking about this differently?

Is it plausible to have full version control between the main script > sub-workflow > individual module file whilst maintaining a level of abstraction as to which software container is going to be used? Or maybe I've misunderstood and need to :homer_disappear:

ewels commented on July 28, 2024

WIP nf-core tools command for importing module files into a pipeline as a draft PR here: nf-core/tools#579

piotr-faba-ardigen commented on July 28, 2024

To add something from myself. I just tested both implemented approaches:

  1. conda by @grst implemented in #14
  2. nf-core modules by @ewels implement in nf-core/tools#579

It seems that both quickly allow us to achieve the goal. That is, modules end up in the right directory structure.

I like nf-core modules more, as it does not introduce an additional dependency.

On the other hand, for now conda has the appealing feature that the channel where the modules are hosted is customizable, making it useful for a more general use case. That said, it does not mean that nf-core modules could not be expanded to take the repository it pulls from as an argument. However, going in that direction could easily grow the code, as various repository providers (GitHub, GitLab etc.) have different APIs. Nextflow itself only supports the most popular ones in this manner: https://www.nextflow.io/docs/latest/sharing.html .

I may be missing other important points, but at this moment it seems to me that the choice between the two falls into the category of personal preference.

apeltzer commented on July 28, 2024

Pretty much in line with what @piotr-faba-ardigen just wrote - I tested both locally here (okay, in a VM to be fair) and both do seem to work. The second point that Piotr argued is also strikingly important to me: e.g. we cannot share all the modules we have, but we would like to be able to have multiple "channels" of modules - if we can add some flexibility to nf-core modules that allows this, that would be super cool. That way users can rely on the hopefully big repository of nf-core/modules in the future, BUT also use and rely on their own modules too.

We could even add some linting in the future to check where exactly the modules came from, but as long as we follow kind of a hierarchical model similar to bioconda, conda-forge, anaconda etc., it should be fine to adopt the concept for modules here too.

I'd also like to keep things as simple as possible, although benefitting from experiences at bioconda is a good idea :-)

ewels commented on July 28, 2024

A possible problem with including nf-core/modules is that you will need to update all modules altogether; that could still be a strategy.

Yes - I think that having everything at one commit is a sacrifice worth making for simplicity though. In fact I'd prefer to enforce this as otherwise stuff could get very confusing quickly.. 😝

junjun-zhang commented on July 28, 2024

A possible problem with including nf-core/modules is that you will need to update all modules altogether; that could still be a strategy.

if you want to control the version of each module independently you should include each of them as a separate subtree.

We are taking a different approach to importing remote modules that addresses the above concern and does allow us to version control each module independently. Here are the modules, and here is how these modules are imported into the pipeline repository; basically they are materialized locally (you need to run something similar to npm install to import/sync module files).

Since the module files are local, the same as other normal files in the pipeline's git repo, once sync'd and committed to git there is nothing additional needed to make the pipeline work.

ewels commented on July 28, 2024

The more I think about this, the more I think that we should copy the approach of npm / bioconda but build this functionality into the nf-core tools package. This is already a dependency, so it doesn't add any complexity for developers, and it means that we have complete control of how we want the system to work.

This is of course less good as a general nextflow (not nf-core) option, but I think that maybe that is ok for now.

pditommaso commented on July 28, 2024

Though having an ad-hoc nf-core package manager tool would surely streamline the experience for the final user, I would suggest resisting the temptation to create yet another package manager and related specification (metafiles? how to manage releases? version numbers, etc).

Maybe a compromise could be to implement a wrapper over an existing package manager to simplify/hide the interaction for the final user, and at the same time rely on a well-established package managing foundation.

I don't think the external dependency on conda/npm/etc. is so critical, because the module files would in any case be included in the GH project repository. Therefore the pipeline user would not need to use any third-party package manager. It would only be required by the pipeline curator when updating/syncing the deps.

olgabot commented on July 28, 2024

Conda makes a lot of sense since many people (not including me) have submitted bioconda recipes and there's already some tooling there we can use

pditommaso commented on July 28, 2024

Still not sure that Conda can stage artefacts in the project directory, instead of using its own managed directory. Does anybody know about that?

grst commented on July 28, 2024

I have been experimenting with it. I think conda would be feasible.

  • Installing into the project directory is possible with conda create -p ./modules <package>.
  • Every module could have a bioconda recipe named nextflow-XXX

Here is an example meta.yaml recipe to build the fastqc module from nf-core/modules:

package:
  name: nextflow-fastqc 
  version: "0.0.1"

build: 
  script: mkdir $PREFIX/nextflow && cp -R tools/fastqc $PREFIX/nextflow

source:
  url:
    - https://github.com/nf-core/modules/archive/master.zip 

Build the package:

conda build nextflow-fastqc

Install the package in the ./modules directory:

conda create -p ./modules grst::nextflow-fastqc

The ./modules directory now looks like this:

modules
├── conda-meta
│   ├── history
│   └── nextflow-fastqc-0.0.1-0.json
└── nextflow
    └── fastqc
        ├── main.nf
        ├── meta.yml
        └── test
            ├── main.nf
            └── nextflow.config

Installing multiple nextflow-xxx packages would be no problem, and conda would take care of versions.

grst commented on July 28, 2024

What's the rationale for not having the user download the packages? To me that would feel like the cleaner solution... And the conda command could probably easily be wrapped into nf-core download or even nextflow pull.

I agree that a dedicated conda channel is probably the best solution. The advantage I see with bioconda is that we could take advantage of their already established bot system for automated builds.

ewels commented on July 28, 2024

What's the rationale for not having the user download the packages?

Currently the only dependencies are Java + Nextflow, plus some kind of software. Building this fetch in adds a dependency on conda for all users. It also complicates things for running offline (this could be helped by custom tools such as nf-core download, as you say).

By keeping this to developers only and keeping all required nextflow source code in the repo, we add no dependencies and no extra complexity. All systems continue to work as they are, nothing changes.

In contrast, the only advantage I can see of getting the user to pull the wrappers is that the version control history is cleaner. In my opinion that's a fairly minor thing and much less important than ease of use.

ewels commented on July 28, 2024

I'm still slightly skeptical about how much work is needed to build the kind of infrastructure that bioconda has to manage the automation of packaging for conda. But I agree that this seems like a nice path forward.

To play devil's advocate, I think that there's still an argument for making our own custom system:

  • Bioconda has to host big assets with non-negligible file sizes for multiple platforms, we don't - we just have a handful of very short text files (curl from github is no problem)
  • Conda is good at handling nested dependencies, which is super tricky. I don't think that our nextflow wrappers will ever have the same kind of dependency network (?), so we probably don't need this functionality.
  • We will potentially need to write and maintain a lot of code to handle the maintenance of the anaconda cloud channel, in the form of CI scripts and packages. Like, seriously, the simplicity of adding a bioconda package totally does not represent the complexity of the back end that powers it.

I think the only viable alternative is to write something new in the nf-core package. I know that this isn't popular as it's making yet another packaging tool. But it could be a hell of a lot simpler and easier to write / manage:

  • We have a metadata file in the pipeline that tracks each imported file and the git hash it comes from in the modules repo (see the sketch after this list)
  • nf-core modules install / update etc. searches the GitHub modules repo, pulls the one small text file, saves it, and updates the metadata file.
  • nf-core lint deletes all local module code and pulls again according to the metadata file. Any diffs will indicate that the module code has been tampered with and is dirty.
  • If we want to be able to support tool versions as well as just git commit hashes, we could easily automate a metadata file in the modules repo that lists all tools and all versions that they have ever had, with corresponding repo hashes. This would just be a case of looping through commits and watching when the tool metadata file updates the version number.
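
To make the first three bullets concrete, here is a minimal sketch in Python (all file names, URL patterns and the metadata layout are hypothetical illustrations, not the actual nf-core/tools implementation):

import hashlib
import json
import urllib.request
from pathlib import Path

# Hypothetical raw-file URL pattern for the nf-core/modules repo
RAW = "https://raw.githubusercontent.com/nf-core/modules/{sha}/tools/{tool}/main.nf"
META = Path("modules.json")  # hypothetical per-pipeline metadata file

def install(tool, sha):
    # Fetch one module file pinned to a commit and record it in the metadata file
    text = urllib.request.urlopen(RAW.format(sha=sha, tool=tool)).read()
    dest = Path("modules/tools") / tool / "main.nf"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(text)
    meta = json.loads(META.read_text()) if META.exists() else {}
    meta[tool] = {"sha": sha, "md5": hashlib.md5(text).hexdigest()}
    META.write_text(json.dumps(meta, indent=2))

def lint():
    # Re-fetch every module at its recorded commit; any diff means local edits
    for tool, entry in json.loads(META.read_text()).items():
        remote = urllib.request.urlopen(RAW.format(sha=entry["sha"], tool=tool)).read()
        local = (Path("modules/tools") / tool / "main.nf").read_bytes()
        if hashlib.md5(remote).hexdigest() != hashlib.md5(local).hexdigest():
            print(f"Lint failure: {tool} differs from upstream at {entry['sha']}")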

In contrast, I think I could probably write this code in a morning (or two). We would have no new dependencies (local, or web-packaging wise, e.g. anaconda cloud). By mimicking other tools, the familiarity in functionality and usage would be comparable, so devs wouldn't really have to learn anything new. It would also probably be easier to test and lint.

I'm being a little provocative here deliberately as I think this is a really important decision. I would appreciate counter-arguments, especially pointing out any concrete advantages that using conda / alternatives have over the new hand-coded option described above.

grst commented on July 28, 2024

@ewels, I can see your point and it might indeed be overkill.

Here are some points in favor of conda:

Bioconda has to host big assets with non-negligible file sizes for multiple platforms, we don't - we just have a handful of very short text files (curl from github is no problem)

Arguably, the modules won't become big, but they could consist of several files (e.g. helper processes, scripts in a bin folder, ...). This can, of course, be handled by a custom script but with conda it would work with no additional effort.

Conda is good at handling nested dependencies, which is super tricky. I don't think that our nextflow wrappers will ever have the same kind of dependency network (?), so we probably don't need this functionality.

Is that really the case? I would envisage larger modules ("sub-workflows") to exist that depend on basic modules (e.g. a DNA-seq subworkflow that depends on some QC and alignment modules. The DNA-seq module could then become part of, e.g., variant-calling pipeline).

Especially as nf-core/modules grows, this could become more and more tricky, and conda is proven to handle that well.

We will potentially need to write and maintain a lot of code to handle the maintenance of the anaconda cloud channel, in the form of CI scripts and packages. Like, seriously, the simplicity of adding a bioconda package totally does not represent the complexity of the back end that powers it.

I don't think it's that bad. A minimal working solution would run conda build on each recipe that was modified. I could probably implement that also in "a morning (or two)" in GitHub Actions.
I'm not saying that it can't become more complicated because, as always, there will be some caveats, but this is at least as true for a homebrewed solution.

We have a metadata file in the pipeline that tracks each imported file and the git hash it comes from in the modules repo

I don't think a git repository is good for keeping track of "released versions". Yes, there are tags and releases, but we want individual releases for each module. For this to work, we would at least require some external system that links a certain version number of a module to the corresponding commit hash.

ewels commented on July 28, 2024

For subworkflows - I guess I envisioned workflows always having their own repository, and this only ever being for processes. But yes, if we're generalising beyond the strict confines of nf-core then this is of course a possibility and a potentially powerful tool.

I am definitely encouraged by your worked example above @grst - do you think you could do a draft PR to start to sketch out the build and push to anaconda cloud? Let me know if you'd like me to add you to the nf-core anaconda cloud organisation.

Even if we use conda, I think it could still be good to wrap the conda commands in nf-core tools as I could imagine people (me) missing the critical -p flag on a regular basis..

For this to work, we would at least require some external system that links a certain version number of a module to the corresponding commit hash.

We kind of do this on the nf-core website with pipelines already, but it's more for convenience only as GitHub is the real store of this information in the pipeline releases.

ewels commented on July 28, 2024

@pditommaso - what do you think about building some of this functionality into nextflow itself? We can already do nextflow pull to maintain a cache of workflows. We could conceivably wrap the kind of behaviour described above in to nextflow too - this may be a way to avoid committing the imported process code in to the workflow git repository (would need some thought for running offline however).

pditommaso commented on July 28, 2024

what do you think about building some of this functionality into nextflow itself?

Maybe in the future, but surely not in the next 12 months. I agree a wrapper over another tool could be convenient to simplify the quick start for novice users.

At the same time, I see the benefits of adopting a well-known package manager, because they have already solved many of these problems (versioning, checksum verification, dependency tree management). This is the classic problem that looks simple at the beginning but soon escalates into something much more complex. Moreover, adopting an established platform allows you to benefit from the existing ecosystem. For example, GitHub could host npm packages.

Last thing: I'm not getting what big assets you are referring to in this statement

Bioconda has to host big assets with non-negligible file sizes for multiple platforms, we don't - we just have a handful of very short text files (curl from github is no problem)

I think a Conda package for a NF module would only require a yaml metadata file and the tar of the module itself.

junjun-zhang commented on July 28, 2024

It appears conda is becoming the consensus here; it seems a sensible choice to me.

@ewels and @grst, I think whether to use a separate repository for conda recipes is a valid question. Maybe for now it makes sense to maintain all source code of nf-core modules under the nf-core/modules repo, but in the long run I could see module source code being maintained by contributing nf-core community members themselves as it grows. I really like how Conda-Forge has the staged-recipes repo that is dedicated to on-boarding new packages (whose source code is not maintained by Conda-Forge). For groups like us at ICGC ARGO, it's more likely that we'd maintain our nextflow modules in our own repos, but submit the corresponding conda recipes to nf-core. This way the modules can still be importable (via the Anaconda Cloud nf-core channel) by anyone in the community.

Regarding sub-workflows (or sub-modules) - basically a module (not a NF process but a NF DSL2 workflow) that depends on another module - @ewels I think it could soon become a need; at least we already have such a use case. For now, we just copy & paste the same sub-workflow under different main workflows. It would be nice if there were an elegant solution; I think conda sets us on the right path to eventually support sub-workflows as reusable modules, and many other great features.

grst commented on July 28, 2024

I'll try to build a prototype by, hopefully, end of next week.
Ideally, I can re-use parts of the bioconda build system.

For the prototype I'll just go with

  • recipes in the main repo and
  • a single yaml file for conda and metadata.

Changing that structure later shouldn't change anything fundamental to the build system.

pditommaso commented on July 28, 2024

Once I made the comment, I realised as well that resolving the module path against a common directory would open the door to possible module version conflicts.

Also, this structure can already be achieved in the current implementation just using the following idiom:

include foo from "$baseDir/X/Y"
include bar from "$baseDir/P/Q"

That said, I agree with Phil that each sub-workflow should bring its own modules, preventing in this way any potential conflict.

At the same time, I share the view of JunJun that in the long run complex pipelines could become a mess and a flat structure could be easier to maintain.

Now, regarding the problem of version conflicts: this is exactly what these tools (conda, npm, etc) are designed for. Therefore I think how conflicts should be managed (and the resulting directory structure) should be delegated to the tool that is chosen to handle the packaging of the modules.

junjun-zhang commented on July 28, 2024

@grst your points are well taken! I like the idea of building quick PoCs and thinking big. Not that we have to do the big things at the beginning, but it's definitely beneficial to plan for them.

What would be your strategy for versioning modules with the simplistic approach? I mentioned this earlier, but it has not been picked up so far

how about this: #2 (comment) ? To be honest, I like it a lot; it's super simple and gets the job done! I was afraid it might seem such a hack to others, but as soon as I saw that it's also used by the Go language - big relief.

ewels commented on July 28, 2024

I've just started working on a simple proof of concept PR for a simplistic copy-from-github method. I'll link it here when there is something to show 👍

Maybe it's better if people are forced to update a module if they hit a version conflict instead of using different, potentially outdated versions in their pipeline?

I think this is probably the only way to manage this problem, and we came up with the exact same idea earlier today. Basically, the linting checks that there are no duplicate tools with different versions in a pipeline. If there are, the linting fails and the author has to change imports / upstream pipelines until this is resolved. Except it was pointed out that there may be some edge cases where different versions are required, so it would probably be a warning instead of a hard failure.

For versioning I think we can use the metadata yaml files that come with the import, plus some kind of extension of the version calls that we currently have..?
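
A rough sketch of how that duplicate-version lint could work, assuming (hypothetically) that each imported module ships a meta.yml declaring name and version fields:

from collections import defaultdict
from pathlib import Path

import yaml  # third-party: pyyaml

def check_duplicate_versions(modules_dir="modules"):
    # Collect every (tool, version) pair declared by imported module metadata
    versions = defaultdict(set)
    for meta_file in Path(modules_dir).rglob("meta.yml"):
        info = yaml.safe_load(meta_file.read_text())
        versions[info["name"]].add(str(info.get("version")))
    # Warn (rather than hard-fail) when the same tool appears at several versions
    for tool, seen in versions.items():
        if len(seen) > 1:
            print(f"WARNING: {tool} imported at multiple versions: {sorted(seen)}")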

Thinking big is good, but only if it doesn't come at the expense of adding lots of complexity that may slow down growth - it's a balance! 😄 I'm a little encouraged by the idea that we hope to wrap whatever solution we go for in nf-core tools subcommands, so we can switch the back end at a later date without affecting the developer flow much / at all.

ewels commented on July 28, 2024

Ok, after a little further discussion on the nf-core slack in various channels and again just now with @grst, I think we should wrap this up.

Let's go for the basic home made nf-core modules approach for now - in the future we can always revise this and just update that command to use the conda backend instead if we want to. This way we keep things simple and lightweight for now so that we can get moving and start to build stuff 🚀

I've moved my initial proof-of-concept code onto a module-import branch on nf-core/tools with this draft PR to monitor its status: nf-core/tools#615 - feel free to dig in and make some PRs to that branch with missing functionality. When we have a basic set in place we can merge it / release.

Thanks all for an excellent discussion!

drpatelh commented on July 28, 2024

A naive implementation is mentioned here #3, and follows the current procedure we use for nf-core/configs.

Excuse my simple brain, but would the git submodule approach allow us to also dynamically obtain the latest hash for the modules repo, so we don't have to update it manually in nextflow.config (or point out that it needs to be updated as part of the linting)? For example, when we release a pipeline the hash will need to remain static to use those particular versions of the module files.

We will definitely need more fine-grained control over versioning for modules compared to the configs repo.

ewels commented on July 28, 2024

No, we will still have to manually update it in nextflow.config for the remote loading to be done (step 2 in my workflow above, if not cloned recursively). This is as in your example in #3.

Prepping an example now, will make more sense when you see it hopefully 👍

ewels commented on July 28, 2024

Example pipeline repo: https://github.com/nf-core/dsl2pipelinetest

To be combined with code in #9

ewels commented on July 28, 2024

Ok, so discussion with @pditommaso on gitter refers to this [link]

it's not a good idea
put the modules as a subtree in the project
but in a nutshell, it's like having the complete copy in your working tree
but still you have the ability to sync with the remote one

So the suggestion is that we never load remote files here - we just always include the entire nf-core/modules repo in every pipeline.

Pros:

  • No change to current behaviour - git clone and nextflow pull is super simple
  • Will work offline without any issues

Downsides:

  • Repos could get pretty big, if nf-core/modules gets big
  • ..?

I personally like this 😉

pditommaso commented on July 28, 2024

A possible problem with including nf-core/modules is that you will need to update all modules altogether; that could still be a strategy.

if you want to control the version of each module independently you should include each of them as a separate subtree.

ewels commented on July 28, 2024

So one major downside - with git submodules you have a nice .gitmodules file that explicitly shows what commit hash you currently have:

[submodule "modules"]
	path = modules
	url = https://github.com/nf-core/modules.git
Subproject commit a88b867498d783a84ec659017ed94ee2acaaa22b

With git subtree everything is in one repo, so it's much more difficult to inspect which commit the modules repo is at. I think the only way to do it is by inspecting the commit messages in git log...

ewels commented on July 28, 2024

A downside of submodules is that people using git clone will hit problems, as they have to use --recursive. Nextflow can handle this with nextflow pull, so that's fine. The GitHub zip file download probably also won't work (used by nf-core download as well as being a visible button on the repo web pages).

apeltzer commented on July 28, 2024

Can we find a way to resolve the latter issue some other way with nf-core download? E.g. by locally doing a recursive git pull / using submodules, packaging that up in nf-core download and providing it to the user to copy over?

ewels commented on July 28, 2024

Yes, it's pretty easy to fix with nf-core download by refactoring that code 👍 - but the download button on GitHub still won't work. Fairly minor problem though. I think I'm most keen on the submodules now, with the more explicit and traceable lock on the modules remote repo. I think it will just be too easy to mess stuff up with the subtree 😟

apeltzer commented on July 28, 2024

I agree - I don't care too much about the GitHub download button either, as we provide a proper alternative and can document that as well 👍

ewels commented on July 28, 2024

@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo..

ewels commented on July 28, 2024

Hi @junjun-zhang,

This is definitely an interesting idea.. So this would involve creating a new subcommand for the nf-core helper tool I guess, which would manage the copying / updating of process files. I guess we could also add lint tests to ensure that the code for the processes is not modified away from the upstream repository at all. It would certainly make the pipelines easier to read / understand too..

Phil

apeltzer commented on July 28, 2024

Yes, a very interesting approach!
The only drawback I see with it is that we need to use a separate tool/method to handle this. A user that only uses nextflow on a cluster that e.g. cannot be configured by the user might have some issues, as such an installation involves talking to IT / HPC admins first... which the plain usage of Nextflow does not require. Other than that I can only echo the comment from Phil - it would make a lot of standard things easier ...

ewels commented on July 28, 2024

No, I'm not sure that this is correct - here the nf-core helper command would only be needed by pipeline developers when editing the pipeline. The process code would be flat files within the pipeline repository, so nothing special for the end user (in fact, even less than using git submodules).

apeltzer commented on July 28, 2024

Ok, I guess I didn't entirely understand it before - after reading again, I think I understand it now. I think the nf-core tools extension is the way to go then. Fully flexible & we can expect developers to be able to do this when doing the dev work - for users it doesn't interfere at all 👍

ewels commented on July 28, 2024

Following the logic of the npm way of doing things, I guess we could then have a meta information file with the details of where each process comes from.. e.g. a processes.json that has the name, version, repo, hash & path for each package that has been pulled in (maybe we don't need all of these fields? Maybe I'm missing some?).
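
For illustration, such a processes.json could look something like this (fields straight from the list above; the values are invented, reusing the fastqc example and commit hash mentioned elsewhere in this thread):

{
  "processes": [
    {
      "name": "fastqc",
      "version": "0.0.1",
      "repo": "nf-core/modules",
      "hash": "a88b867498d783a84ec659017ed94ee2acaaa22b",
      "path": "tools/fastqc"
    }
  ]
}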

@junjun-zhang how do you handle the management of these imported files?

junjun-zhang commented on July 28, 2024

@ewels what you mentioned is possible; one way or the other, that information is useful to keep for dependency management. I am fairly new to Nextflow and still trying to learn more, so I'm just sharing my own thoughts here. What we are experimenting with is something quick and simple that supports well one of the most important features - explicit declaration of module dependencies down to specific versions. This is to fulfill the ultimate goal of reproducible / version controlled pipeline builds. At this point, our plan is to write a simple script (likely in Python) to detect dependencies on remote modules by searching for lines starting with include "./modules/raw.githubusercontent.com/xxxxx" in the Nextflow pipeline code, then fetch the contents and store them under the local modules folder. Of course, this is very preliminary and basic; locking module content down with a git commit hash etc. would be a great future improvement.
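
A minimal sketch of what such a script could look like, purely as an illustration of the idea (the include pattern and local paths follow the comment above; everything else is hypothetical):

import re
import urllib.request
from pathlib import Path

# Match includes such as: include ... "./modules/raw.githubusercontent.com/xxxxx"
INCLUDE_RE = re.compile(r'include\s.*"\./(modules/raw\.githubusercontent\.com/\S+?)"')

def sync(pipeline_script="main.nf"):
    # Find remote-module includes and materialise them under ./modules
    for line in Path(pipeline_script).read_text().splitlines():
        match = INCLUDE_RE.search(line)
        if not match:
            continue
        local = Path(match.group(1) + ".nf")  # local path mirrors the URL under ./modules/
        remote = "https://" + match.group(1).split("modules/", 1)[1] + ".nf"
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(urllib.request.urlopen(remote).read())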

Dependency management is an important feature for any programming language. The Go language initially did not have good support for it, and numerous solutions were developed by the Go community until the introduction of Go Modules. Some blogs might be interesting to read: here, here and here. I am not suggesting we take the same approach as Go Modules, but it's certainly a great source of inspiration. Ultimately, I think it's up to the Nextflow language to choose its own official approach for dependency management. For that, I'd like to hear what others think, particularly @pditommaso

antunderwood commented on July 28, 2024

@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo..

I have found subrepo (a wrapper around git subtree) a more transparent way of dealing with modules, particularly since the files pulled in via subrepo are not links

junjun-zhang commented on July 28, 2024

Since there are so many package managers out there, is there nothing that could be used to manage NF module assets? I was even thinking of using npm. Would that be so crazy?

That seems a bold idea; I don't know npm well enough to comment further. Leveraging existing solutions is definitely plausible. Might conda be another possible option? Here is how Conda describes itself:

Package, dependency and environment management for any languageβ€”Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

Nextflow could possibly be added to the above list?

pditommaso commented on July 28, 2024

Not an expert either, but my understanding is that npm allows managing any asset irrespective of the prog lang.

Conda would be even better since it is very well known in the bioinfo community too; however, I'm not sure it allows copying the module files into the project directory as we need. I think it's designed to keep them in a conda central directory, which would not work for our use case. But, I repeat, I'm not 100% sure about this point.

ewels commented on July 28, 2024

I feel that conda might be a little bit confusing, as all of these tools can be installed via other conda channels. I can imagine doing conda install bwa and it copying a few nextflow files somewhere random. Also, as you say, I think conda always keeps stuff in its own environment directories. npm install works in the current working / project directory though, which is probably more what we want.

ewels commented on July 28, 2024

One downside of npm is that each package needs to be in its own git repository. This isn't necessarily all bad (we've discussed doing this before anyway). On the plus side, we can publish within a nextflow / nf-core scope, which would make the installation names pretty clear.

drpatelh commented on July 28, 2024

I realise it has its advantages, but I'm not too keen on having a separate repository for each module because:

  • We will most likely have 100s-1000s of modules
  • It will make things more difficult to find, track, update and maintain

I'm not entirely sure how we could make modules fit within the Conda ecosystem, so if anyone has any ideas as to a more formal implementation that would be useful to hear 👍

maxulysse commented on July 28, 2024

I think the way bioconda handles everything in one repo is proof that it can be done.

maxulysse commented on July 28, 2024

That's actually a good point.

I'm guessing it should be possible to use Conda to copy files from GitHub.

I just think that conda would be easier to use than npm, and it has the advantage of already being used by a majority of bioinformaticians.

Hypothetically, I would see a modules.yml file like that:

name: nf-core-sarek-modules-3.0
channels:
  - nf-core-modules
dependencies:
  - bwa=0.7.17
  - gatk4-spark=4.1.4.1
  - tabix=0.2.6
  - samtools=1.9
  - nf-core-header=0.2
  - nf-core-sarek=3.0

and with a command similar to conda create modules.yml I would get all my modules with the right versions in the current directory.

But I am no expert at all neither with Conda, nor with npm...

EDIT: Now that I have seen @grst's reply, I'm getting more and more convinced it could be possible

ewels commented on July 28, 2024

Nice! I didn't know about the -p flag for conda, thanks @grst - I was on the same page as @pditommaso and didn't think it would be possible.

Two thoughts:

  • I don't think that we will be able to / should host these wrapper with bioconda
    • It will be very confusing for people to do conda search bwa and get a combination of tool installations and wrappers.
    • It's a comparatively small community of people who will need to use conda here, only pipeline developers
    • Bioconda is for packaged software, which this is not
  • I think that we should prefix package names, even in a custom channel
    • eg. nf-bwa will help to keep the package name as unique as possible, and obvious that this is a wrapper and not the tool itself.

I think we probably want to go down the route of making a new conda channel, see docs. Hopefully we can still get this hosted on anaconda.org for free, maybe under an nf-core namespace.

And yes - to confirm, the idea would be that devs run these conda commands and then the imported files are kept under version control with the rest of the pipeline code. Feels a little dirty, but I think it has to be this way.

Phil

ewels commented on July 28, 2024

I made an nf-core organisation on anaconda cloud: https://anaconda.org/nf-core so it's there if we want it.

drpatelh commented on July 28, 2024

I agree that both Conda and a custom package manager are viable options 👍 and that it may be overkill and possibly a lot of time (which we don't have) to get everything set up properly on the Conda back-end.

@ewels would the custom option work for developers that want to use our modules in the general Nextflow community? I feel this is quite an important point. If we are going through this much effort to get it right, it should work for everyone.

grst commented on July 28, 2024

I am definitely encouraged by your worked example above @grst - do you think you could do a draft PR to start to sketch out the build and push to anaconda cloud? Let me know if you'd like me to add you to the nf-core anaconda cloud organisation.

If conda is the way to go now, I could try to put something together.

In that case a question is whether we want to

  • have conda recipes in a different repository (e.g. nextflow-modules-recipes)
  • automatically generate conda recipes from the yaml documentation. (No boilerplate code, but no possibility for customizing the build process should that ever become relevant).
  • turn the yaml module documentation into a conda recipe (see #1 (comment)) (a bit of conda-boilerplate, but full flexibility)

ewels commented on July 28, 2024

@pditommaso:

For example, GitHub could host npm packages.

Ooh, interesting thought! Especially as it looks like the GitHub npm registry doesn't have the same requirement as the main npm registry of having one repository per package: GitHub npm docs. So that becomes a viable option again.

I think a Conda package for a NF module would only require a yaml metadata file and the tar of the module itself.

Yup, I think we are saying the same thing. My point was that regular bioconda software packages have assets and that we don't.

@grst:

I think it would be cool if you could put something together. I think that's the only way that this will move forward now - if we start sketching out functioning skeletons for one or more options. Conda seems like the most viable to me right now.

have conda recipes in a different repository (e.g. nextflow-modules-recipes)

I don't really understand your question here? My thinking was that each recipe would be in a directory of this repository (nf-core/modules).

conda recipes yaml + meta / docs

I started writing that we could probably build this into the conda yaml, then saw that your final bullet point was suggesting this! Yes, I think it's probably a better idea, instead of using a new custom yaml format..

junjun-zhang commented on July 28, 2024

Just to share our minimal work to support two key features: 1) import and localize remote NF modules; 2) remote modules are individually versioned.

This works well so far, and there are plans for future improvements, including possible adoption of the Conda-based solution discussed here.

bgruening commented on July 28, 2024

I can only second what @mbargull and @daler said. Conda is probably the best solution at this time. Just think about a proper namespace and how you pin your packages. bioconda-utils should do the trick, just disable a lot of features that you will not need and it should be quick and reliable for your purpose.

ewels commented on July 28, 2024

Brilliant, thanks all! Plenty to chew on there, much appreciated.

I was thinking more about the dependencies thing today after reading your comments. As was mentioned, it's the management of dependencies which is the difficult problem here. However, going over it again I'm not sure that we need to worry about this...

Because we are in the slightly strange position of explicitly wanting to include the source files into the repository, I'm not sure that we ever will have a dependency tree / any dependencies.

To explain - this topic was originally raised here: #8 (comment) :

I would envisage larger modules ("sub-workflows") to exist that depend on basic modules (e.g. a DNA-seq subworkflow that depends on some QC and alignment modules. The DNA-seq module could then become part of, e.g., variant-calling pipeline).

However, if sub-workflows work in the same way as pipelines in that they include the basic modules within subdirectories, then they essentially have no dependencies. If you import the workflow, you get everything it needs in one go.

This then simplifies the entire problem a lot. Then we're back to just having to track a directory of files with an associated version / git hash, which is a pretty simple problem.

junjun-zhang commented on July 28, 2024

@ewels, good thinking! If I understand you correctly, you are talking about something conceptually like the following structure for a workflow with sub-workflows:

my_wf/
├── main.nf
└── modules/
    ├── mod_3.nf
    ├── mod_4.nf
    ├── modules/
    │   ├── mod_1.nf
    │   └── mod_2.nf
    ├── sub-wf_1.nf
    └── sub-wf_2.nf

my_wf uses modules mod_3 and mod_4, and sub-workflows sub-wf_1 and sub-wf_2. sub-wf_1 uses mod_1, and sub-wf_2 uses mod_2. This way, the importing NF script can always write the include statement with this pattern: include MOD_1 from './modules/mod_1'.

That could work, but I was thinking all of the modules were supposed to be at the same level, with no hierarchical structure (obviously it does not have to be this way).

One thing clear to me is that if we go with this, workflows / modules that are intended to be sharable should have unique names, unlike now where main.nf is used for pretty much everything.

ewels commented on July 28, 2024

Yup, exactly. Though I would err towards using a directory for each tool instead of a file, with each file called main.nf. Then the whole directory can be grabbed and we have a bit more flexibility for the code organisation.

What I had in mind was something like this:

my_wf/
├── main.nf
└── modules
    ├── tools
    │   ├── tool_1
    │   │   └── main.nf
    │   └── tool_2
    │       └── main.nf
    └── workflows
        └── sub_wf_1
            ├── main.nf
            └── modules
                └── tools
                    ├── tool_3
                    │   └── main.nf
                    └── tool_4
                        └── main.nf

Note that in the above, I envision sub-workflows being fully-fledged pipelines in themselves, so they would mirror the structure of the pipeline importing them.

pditommaso commented on July 28, 2024

This is an interesting point. In the current form, included paths are resolved against the including script path, but this could result in duplicating the same module when it is imported by two different sub-workflows, which is not good.

I'm starting to think the included path should be resolved against the main project directory.

junjun-zhang commented on July 28, 2024

I'm starting to think the included path should be resolved against the main project directory.

+1 for that. I am actually leaning more towards a flat structure; it not only avoids double importing, but is also much simpler.

Maybe Nextflow could amend its module include mechanism a bit. Include paths starting with . would continue to be interpreted as relative paths. Include paths starting with a character other than . or / (/ is actually not allowed) would have two paths to search sequentially: 1) the same path as the current including script, which is equivalent to ./; 2) a user-configured path under the main project directory, such as "${workflow.projectDir}/modules", or we could introduce a new runtime metadata variable: moduleDir. Pinging @pditommaso for his comments on this.
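
To make the proposed lookup order concrete, a sketch in Python (this only illustrates the resolution rules described above; it is not Nextflow internals):

from pathlib import Path

def resolve_include(name: str, script_dir: Path, project_dir: Path) -> Path:
    # Paths starting with "." keep the current relative-path behaviour
    if name.startswith("."):
        return (script_dir / name).resolve()
    # 1) search relative to the including script (equivalent to ./)
    candidate = script_dir / name
    if candidate.exists():
        return candidate
    # 2) fall back to the configured project-level modules directory
    return project_dir / "modules" / name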

my_wf
├── main.nf
└── modules
    ├── sub_wf_1
    │   ├── helper_process.nf  // I imagine some sort of helper may be necessary in some situations
    │   └── main.nf
    ├── sub_wf_2
    │   └── main.nf
    ├── tool_1
    │   └── main.nf
    ├── tool_2
    │   └── main.nf
    └── tool_3
        └── main.nf

For the above example, the include statements may look like:

  • In my_wf/main.nf: include "sub_wf_1/main"; include "sub_wf_2/main"; include "tool_1/main"; include "tool_2/main"
  • In my_wf/modules/sub_wf_1/main.nf: include "helper_process"; include "tool_1/main"
  • In my_wf/modules/sub_wf_2/main.nf: include "tool_3/main"

Note that tool_1 is included in both my_wf and sub_wf_1

ewels commented on July 28, 2024

I disagree - I think that double importing the same module is a necessary evil. If we avoid double-importing, we lose control over the versioning. Then sub-workflows could end up using different versions of imported processes depending on which workflow imports them. This is bad for reproducibility and in the worst case scenario will break stuff.

The flip side is that a pipeline that uses multiple sub-workflows could run different versions of the same tool. This could be confusing, but I think that this is safer: it shouldn't break anything.

Does it really matter to double import processes?

junjun-zhang commented on July 28, 2024

I thought a lot about versioning and being able to include specific versions. How about versioning all installed modules using their corresponding folder names?

my_wf
├── main.nf
└── modules
    ├── sub_wf_1.v3
    │   ├── helper_process.nf
    │   └── main.nf
    ├── sub_wf_2.v2
    │   └── main.nf
    ├── tool_1 -> tool_1.v3  // can even use symlink pointing to latest version if one really wants always latest in some legit cases, I don't personally like this though
    ├── tool_1.v3
    │   └── main.nf
    ├── tool_1.v2
    │   └── main.nf
    ├── tool_2.v4
    │   └── main.nf
    └── tool_3.v2
        └── main.nf

Then include statements would be:

  • In my_wf/main.nf: include "sub_wf_1.v3/main"; include "sub_wf_2.v2/main"; include "tool_1/main"; include "tool_2.v4/main"
  • In my_wf/modules/sub_wf_1.v3/main.nf: include "./helper_process"; include "tool_1.v2/main"
  • In my_wf/modules/sub_wf_2.v2/main.nf: include "tool_3.v2/main"

Note that my_wf includes the latest version (v3) of tool_1, while sub_wf_1.v3 includes v2 of it. Now they all live in harmony.

I very much agree that reproducibility is one of the top priorities in bioinformatics pipelines; all include statements should be explicit about which version is being imported. Like a Conda environment file: if one wants the environment to be reproducible, all dependencies should specify exact versions.

On the other point: there is no difference between importing tools and importing sub-workflows, which I believe is a good thing. From the importing script's (my_wf/main.nf) point of view, it's all the same - including and making use of the imported modules, regardless of whether it's a tool (DSL2 process) or a sub-workflow (DSL2 workflow).

junjun-zhang commented on July 28, 2024

That's great - I didn't know include foo from "$baseDir/X/Y" works in 20.01; it didn't work in previous versions.

Regarding the possibility that sub-workflows may depend on different versions of the same tool, I think this is a feature that needs to be supported, meaning that both versions need to be brought into the main workflow. I am not aware whether Conda or other package managers support installing different versions of the same package at the same time.

ewels commented on July 28, 2024

I'm still struggling to understand why all tools need to be on the same level - it seems like a huge amount of complexity to add and I can't see any advantages. If we just copy them in with a subworkflow then we don't have to do any management at all and the entire situation remains very simple..

At the same time, I share the view of JunJun that in the long run, complex pipelines, could become a mess and a flat structure could be easier to maintain.

I don't see how this would be the case though, as developers will never edit the imported code. So sure, the final workflow could in theory end up with a complex file tree, but it doesn't really matter, as the developer will never need to look into those files. If they want to edit anything in the subworkflow, they edit it in that repository, where the file tree is simple and consistent with all other pipelines..

junjun-zhang commented on July 28, 2024

@ewels you've got a point. There is one question where I am not clear what others think. Given the following example, sub_wf_1 uses tool_1 and tool_2. It's clear tool_1 and tool_2 can be versioned, packaged and registered in, say, the nf-core Conda channel. The question is: when we package sub_wf_1, will tool_1 and tool_2 be packaged together?

sub_wf_1
    ├── main.nf
    └── modules
        ├── tool_1
        │   └── main.nf
        └── tool_2
            └── main.nf

If it's a yes, then it's something like an Uber JAR in the Java world. When sub_wf_1 is imported, it will have tool_1 and tool_2 included as well; the importing script has no way to exclude them. In that case, only the nested structure you proposed will work.

If it's a no, only sub_wf_1/main.nf will be packaged. It's still possible to create the nested structure with some extra work. What I am not sure about is whether Conda supports this nested packaging, or whether it's good practice in Conda. Or maybe this is where custom tools like nf-core tools come into play?

ewels commented on July 28, 2024

The question is: when we package sub_wf_1, will tool_1 and tool_2 be packaged together?

I think the answer is yes, going way back up to this comment at the top of this thread, where we established that all workflows should include hard copies of their imported modules in the same repository.

junjun-zhang commented on July 28, 2024

all workflows should include hard copies of their imported modules in the same repository

For the workflow repo, yes, it should include its own code and all of its dependent modules. I think we are on the same page there.

However, what goes into the workflow package and gets uploaded to the registry server (like Anaconda) is a separate question. It could be all included, or just the workflow code only. For the latter, when installing the workflow, it first pulls down the workflow code, then fetches its dependencies from individual module packages. Both should be possible; I'm just not sure which is the more sensible choice for us.
