nats-queue-worker's Introduction

Notice

NATS Streaming was deprecated in June 2023 by Synadia, and will receive no more updates, including for critical security issues.

Migrate to OpenFaaS Standard for NATS JetStream.

queue-worker (Community Edition) for NATS Streaming

The queue-worker (Community Edition) processes asynchronous function invocation requests. You can read more about this in the async documentation.

Usage

Screenshots from keynote / video - find out more over at https://www.openfaas.com/

Configuration

Parameter                Description                                                                 Default
write_debug              Print verbose logs                                                          false
faas_gateway_address     DNS name of the gateway service                                             gateway
faas_gateway_port        Port of the gateway service                                                 8080
faas_max_reconnect       Maximum number of reconnection attempts when the NATS connection is lost    120
faas_nats_address        Host at which NATS Streaming can be reached                                 nats
faas_nats_port           Port at which NATS Streaming can be reached                                 4222
faas_nats_cluster_name   Name of the target NATS Streaming cluster                                   faas-cluster
faas_reconnect_delay     Delay between retries when connecting to NATS                               2s
faas_print_body          Print the body of the function invocation                                   false
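
For reference, this is roughly how those settings could be read in Go. This is a minimal sketch only: the helper and struct names below are illustrative and do not mirror the queue-worker's actual readconfig.go.

package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// getEnv returns the value of key, or fallback when the variable is unset.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

// queueConfig mirrors the table above; field names are illustrative.
type queueConfig struct {
	GatewayAddress string
	GatewayPort    int
	MaxReconnect   int
	NATSAddress    string
	NATSPort       int
	ClusterName    string
	ReconnectDelay time.Duration
	WriteDebug     bool
	PrintBody      bool
}

func loadConfig() queueConfig {
	gatewayPort, _ := strconv.Atoi(getEnv("faas_gateway_port", "8080"))
	natsPort, _ := strconv.Atoi(getEnv("faas_nats_port", "4222"))
	maxReconnect, _ := strconv.Atoi(getEnv("faas_max_reconnect", "120"))
	reconnectDelay, _ := time.ParseDuration(getEnv("faas_reconnect_delay", "2s"))

	return queueConfig{
		GatewayAddress: getEnv("faas_gateway_address", "gateway"),
		GatewayPort:    gatewayPort,
		MaxReconnect:   maxReconnect,
		NATSAddress:    getEnv("faas_nats_address", "nats"),
		NATSPort:       natsPort,
		ClusterName:    getEnv("faas_nats_cluster_name", "faas-cluster"),
		ReconnectDelay: reconnectDelay,
		WriteDebug:     getEnv("write_debug", "false") == "true",
		PrintBody:      getEnv("faas_print_body", "false") == "true",
	}
}

func main() {
	fmt.Printf("%+v\n", loadConfig())
}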

nats-queue-worker's People

Contributors

aklyachkin, alexellis, amalkh5, andeplane, bmcustodio, cpanato, dependabot[bot], dmrub, ericstoekl, ewilde, imumesh18, ivanayov, jpauwels, kwojcicki, matevzmihalic, matthiashanel, rdimitrov, seandongx, stefanprodan, vaputa, viveksyngh, welteki

nats-queue-worker's Issues

Nats cluster has fixed cluster ID "faas-cluster"

Hello,

I'm trying OpenFaaS for my new project using Nomad with Consul, following Nick Jackson's documentation for the faas-nomad provider.
My environment was already set up with NATS and NATS Streaming clusters using the default cluster ID "test-cluster", but there was no way for the OpenFaaS gateway to pick it up. I figured out the cause was a hard-coded cluster ID in the source code.

I renamed my cluster ID to "faas-cluster" and everything worked fine, but to keep my own setup intact I had to create a separate NATS Streaming cluster just for OpenFaaS.

It would be nice if there were a way to pass the cluster ID in the gateway args rather than creating a dedicated cluster.

Current Behaviour

Nats cluster must have an id of "faas-cluster"

Possible Solution

Pass an optional argument faas_nats_id to the gateway arguments
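
For illustration, a sketch of how such a setting could be consumed, assuming the NATS Streaming Go client (stan) and an existing client ID. The variable name faas_nats_id is the one proposed above; note that the configuration table at the top of this page now lists faas_nats_cluster_name for this purpose.

// connectSTAN connects to NATS Streaming with a cluster ID taken from the
// environment, falling back to today's hard-coded default.
func connectSTAN(clientID, natsURL string) (stan.Conn, error) {
	clusterID := os.Getenv("faas_nats_id") // proposed setting, see above
	if clusterID == "" {
		clusterID = "faas-cluster"
	}
	return stan.Connect(clusterID, clientID, stan.NatsURL(natsURL))
}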

queue-worker crashes if callback happens and function call failed

If the function call ends up returning a 503, nats-queue-worker crashes with panic: runtime error: invalid memory address or nil pointer dereference when trying to perform the callback if X-Callback-Url is set.

Expected Behaviour

It doesn't crash.

Current Behaviour

It crashes and pod dies.

Possible Solution

See #89

Steps to Reproduce (for bugs)

  1. Deploy OpenFaaS in GCP using marketplace (default timeout settings).
  2. Create a function with the following handler, based on the python3 template
import time
def handle(req):
    t = 60
    time.sleep(t)
    print(f"Slept for {t} seconds")
  3. Deploy the function using default timeout settings: faas up -f sleep.yml
  4. Do an async function call with a callback to e.g. https://requestbin.com, watch the kubernetes logs for the queue worker, and wait for the following error
    queueworker_crash.txt

Context

It started crashing for some functions which I thought had failed for another reason.

Your Environment

Kubernetes

[Feature Request] Asynchronous Concurrency Limiting

My actions before raising this issue

Expected Behaviour

If you set max_inflight on an async function's of-watchdog, you might naturally assume that a queue worker will handle the case when it tries to invoke a function on a pod that has reached its max_inflight, and would therefore send the request to a different pod, if possible.

Current Behaviour

If you use a sync function, set of-watchdog.max_inflight on an of-watchdog pod, and invoke that function above that amount (assuming 1 pod), you will, as expected, get 429s letting you know that you’ve sent more requests than the pod is configured to work on concurrently.

However, the same behavior exists when using async functions: if you set queue.max_inflight=5 and of-watchdog.max_inflight=1, and the queue worker attempts to send a request to a pod that already has 1 invocation in progress, it receives the same 429 and forwards it to the gateway as your official response. I propose that a queue worker would instead retry on 429s, or use a similar retry mechanism (as a 429 is a completely valid response for functions at the moment).

Possible Solution

Having a queue worker retry to a different pod (if possible), potentially with incremental backoff.

I’m also open to a completely different mechanism, and willing to implement.

Context

Having a queue worker retry if it attempts to invoke a pod that’s already hit its maximum concurrent invocations would probably be the last technical hurdle for me to get async autoscaling working well for “long-running” functions, and therefore I'd say this issue is related to openfaas/faas#657.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ): 0.12.21
  • Docker version docker version (e.g. Docker 17.0.05 ): 20.10.2
  • Are you using Kubernetes or faasd? kubernetes
  • Operating System and version (e.g. Linux, Windows, MacOS): MacOS/Linux
  • Code example or link to GitHub repo or gist to reproduce problem:
  • Other diagnostic information / logs from troubleshooting guide

Auth needed for gateway calls to /system/async-report

Expected Behaviour

The call to /system/async-report needs to be decorated with basic auth credentials.

Current Behaviour

It is currently open, which is why no changes were needed, but this is invalid because someone could discover the gateway and post false statistics to this endpoint.

Possible Solution

  • Update docker-compose/helm/yaml to add the basic auth username/password to this component
  • Update the HTTP call to /system/async-report to pass those secrets

Steps to Reproduce (for bugs)

  1. Deploy OpenFaaS with auth
  2. Post to gateway:port/system/async-report

Context

Found whilst doing a deeper code review on the faas/server entrypoint

Gateway crashes on startup if there is a new instance connected to Nats

On rolling updates or scale up operations the Gateway will crash with the following message:

2017/11/22 22:45:18 Binding to external function provider: http://faas-netesd.openfaas:8080/
2017/11/22 22:45:18 Async enabled: Using NATS Streaming.
2017/11/22 22:45:18 Opening connection to nats://nats.openfaas:4222
2017/11/22 22:45:18 stan: clientID already registered

Dockerfile broken for armhf

A user emailed me to let me know that the latest image for armhf wasn't pushed up to the Hub. I've just run another build and saw this error:

docker build --build-arg http_proxy="" --build-arg https_proxy="" -t openfaas/queue-worker:0.5.2-armhf . -f Dockerfile.armhf
Sending build context to Docker daemon  5.449MB
Step 1/18 : FROM golang:1.9.7-alpine as golang
 ---> 04b7973586e8
Step 2/18 : WORKDIR /go/src/github.com/openfaas/nats-queue-worker
 ---> Using cache
 ---> fa184adc442f
Step 3/18 : COPY vendor     vendor
 ---> Using cache
 ---> 0189eb5714b0
Step 4/18 : COPY handler    handler
 ---> Using cache
 ---> c766896111b9
Step 5/18 : COPY main.go  .
 ---> Using cache
 ---> bf7e017dced1
Step 6/18 : COPY readconfig.go .
 ---> Using cache
 ---> 3b5807be63e5
Step 7/18 : COPY readconfig_test.go .
 ---> Using cache
 ---> e9fa50b00ea4
Step 8/18 : RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
 ---> Running in 49ffbaea8abb
# github.com/openfaas/nats-queue-worker
./main.go:280:8: undefined: AddBasicAuth
The command '/bin/sh -c CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .' returned a non-zero code: 2
Makefile:12: recipe for target 'ci-armhf-build' failed
make: *** [ci-armhf-build] Error 2

I suspect this is to do with the latest changes in #36 and #35.

Fix properties used by queue worker and add concurrent function invocation

Expected Behaviour

The queue worker uses NATS Streaming as its queue. Because of that, some options don't quite fit:

Current Behaviour

  • durable name is optional
  • durable is created with StartWithLastReceived
  • because of the blocking nature of the callback
    • Max in flight is not as useful.
    • Messages are processed sequentially.

Possible Solution

  • durable name can be derived from the subject name
  • durable is created with DeliverAllAvailable
  • process max in flight many messages concurrently
    • this would require specifying manual ack as well (see the sketch below).
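
A rough sketch of such a subscription with the NATS Streaming Go client is shown below. The queueWorker type, its fields and handleMessage are placeholders, not the current code.

// subscribe creates a durable queue subscription whose durable name is derived
// from the subject, delivers all available messages, and processes up to
// MaxInflight messages concurrently with manual acknowledgements.
func (q *queueWorker) subscribe(subject, qgroup string) (stan.Subscription, error) {
	durable := q.durable
	if durable == "" {
		durable = "faas-" + subject // derive the durable name from the subject
	}

	return q.conn.QueueSubscribe(subject, qgroup,
		func(msg *stan.Msg) {
			// Fan out so that up to MaxInflight messages are in progress at once.
			go func() {
				q.handleMessage(msg)
				msg.Ack() // manual ack once the invocation has completed
			}()
		},
		stan.DurableName(durable),
		stan.DeliverAllAvailable(), // instead of StartWithLastReceived
		stan.MaxInflight(q.maxInflight),
		stan.AckWait(q.ackWait),
		stan.SetManualAckMode(), // required when acking after processing
	)
}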

Your Environment

This is a result of code inspection.

Feature: Return HTTP status code and other meta-data via X-Callback-Url

Expected Behaviour

The HTTP status code of the function being executed, and potentially any other useful meta-data, should be returned to the X-Callback-Url as an HTTP header.

Current Behaviour

No context about pass/failure is given, and it appears that if the process returns a bad HTTP code we may get an empty body.

Possible Solution

This is not a refactoring exercise, so it shouldn't restructure the code unnecessarily or make unrelated changes.

This should populate a header such as X-Status-Code: 400 or similar.
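
A sketch of what that could look like when the queue-worker posts the result to the callback URL. The header name X-Status-Code comes from the suggestion above and the helper name is illustrative (uses net/http, bytes and strconv):

// buildCallbackRequest posts the function's response body to the X-Callback-Url
// and carries the function's HTTP status in a proposed header.
func buildCallbackRequest(callbackURL string, functionStatus int, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, callbackURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Status-Code", strconv.Itoa(functionStatus))
	return req, nil
}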

Steps to Reproduce (for bugs)

  1. Create a classic function that exits with non-zero
  2. Or create a function with of-watchdog that returns non-200

Context

Confusing user experience for failed executions

SIGSEGV in queue-worker when remove/deploy (redeploy) function that is being called

I have a service that is trying to call asynchronous functions at the same time those functions are being re-deployed.

Expected Behaviour

I would expect a caught error rather than seg fault or null pointer ref.

Current Behaviour

The queue-worker seems to have a fatal crash. Good news is, once the functions are re-deployed, everything does go back to normal.

Post http://ping.openfaas-fn.svc.cluster.local:8080/: dial tcp 10.102.218.40:8080: i/o timeout
Callback to: http://gateway.openfaas:8080/function/task-complete
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x69b878]

goroutine 19 [running]:
main.postResult(0xc42005da40, 0x0, 0x0, 0x0, 0x0, 0xc42040cc80, 0x33, 0xc4204071d0, 0x24, 0x0, ...)
    /go/src/github.com/openfaas/nats-queue-worker/main.go:239 +0xb8
main.main.func1(0xc4204099e0)
    /go/src/github.com/openfaas/nats-queue-worker/main.go:122 +0x8ba
github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming.(*conn).processMsg(0xc4200e6380, 0xc42040e280)
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming/stan.go:751 +0x26f
github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming.(*conn).(github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming.processMsg)-fm(0xc42040e280)
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats-streaming/sub.go:228 +0x34
github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats.(*Conn).waitForMsgs(0xc42009e500, 0xc420152240)
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats/nats.go:1652 +0x24a
created by github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats.(*Conn).subscribe
    /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/nats-io/go-nats/nats.go:2374 +0x4de
rpc error: code = Unknown desc = Error: No such container: e99f5b51bcca8ca95b02291e051c9c2b1320e8c69ca7086d350f273cd63209a8

Steps to Reproduce (for bugs)

  1. create a function foo
  2. create a service that calls the foo function asynchronously every few seconds
  3. tail the worker-queue logs
  4. now, faas remove then faas deploy the function foo

Context

Implementing a jobs system. If any of those jobs needs to be re-deployed, I imagine in production having to remove and deploy.

Potentially there is a more graceful way to remove and deploy, but I imagine we still wouldn't want something as core as the nats-queue-worker to have a fatal crash in this situation, given the fact that other functions that need to be called async may still be deployed, and it would be hard to notify external services and tell them to stop calling the functions during a (re)deployment.

Your Environment

Docker for mac (k8s)

Feature: add healthcheck

Expected Behaviour

Healthcheck over HTTP or an exec probe which can be used by Kubernetes to check readiness and health

Current Behaviour

N/a

Possible Solution

Please suggest one of the options above, or see how other projects are doing this and report back.
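
For the HTTP option, a lightweight endpoint would be enough for Kubernetes liveness and readiness probes. A minimal sketch with an illustrative port and path; a fuller check could also inspect the state of the NATS connection:

package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})
	// Port is illustrative; the queue-worker does not currently serve HTTP.
	log.Fatal(http.ListenAndServe(":8080", nil))
}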

Context

A health-check can help with robustness.

Feature request - don't log message contents

The queue-worker logs all incoming request message data, and thus could possibly expose data not intended to be logged routinely.

Expected Behaviour

Do not log the incoming request body unless debugging is switched on, as at https://github.com/openfaas/nats-queue-worker/blob/master/main.go#L76.

Current Behaviour

All incoming request messages are logged here: https://github.com/openfaas/nats-queue-worker/blob/master/main.go#L59

Possible Solution

Guard https://github.com/openfaas/nats-queue-worker/blob/master/main.go#L59 with a debug flag as it has been done at https://github.com/openfaas/nats-queue-worker/blob/master/main.go#L76

	if config.DebugPrintBody {
		log.Printf("[#%d] Received on [%s]: '%s'\n", i, msg.Subject, msg)
	}

Steps to Reproduce (for bugs)

Context

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ):

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • Link to your project or a code example to reproduce issue:

Upgrade nats-streaming client

Expected Behaviour

Use a current version of the nats streaming client.

Current Behaviour

A while ago the repository for the NATS Streaming client was moved from:
https://github.com/nats-io/go-nats-streaming
to:
https://github.com/nats-io/stan.go
As a result, nats-queue-worker will not pick up bug fixes and improvements.

Possible Solution

Rename the package in go files, Gopkg.toml and update the vendor directory accordingly.

Context

On February 22nd we had contact about some behavior of the OpenFaaS NATS connector.
One suggestion was to bump the version of the NATS Streaming server.
I recently realized that, due to the repository move, nats-queue-worker is locked in to an old version.

I'd be more than happy to contribute this change.

X-Duration-Seconds header is not set on function timeout

If the original function invocation takes longer than the function's combined read_timeout and write_timeout, the X-Duration-Seconds header is not set on the callback.

Expected Behaviour

X-Duration-Seconds should be set on the callback function

Steps to Reproduce (for bugs)

  1. Create a function a and make it sleep for 10 seconds
  2. Create a callback function b that prints the X-Duration-Seconds header
  3. Invoke function a, i.e. curl http://127.0.0.1:8080/async-function/a -d 'hello' -H 'X-Callback-Url: http://gateway:8080/function/b'

Async invocation callback follows 30x redirects but loses data on the way

When asynchronous function execution completes, the data is sent back to the callback URL. However, if that callback URL issues a 30x redirect (from HTTP to HTTPS or similar), the data is first posted to the HTTP URL, the server responds with a 30x redirect to HTTPS, and another POST request is made to the HTTPS URL, but the function output data is missing.

Expected Behaviour

  1. Do not follow 30x redirects - it is the user's responsibility to provide a correct data collector URL that does not redirect elsewhere... OR
  2. Follow 30x redirects and make sure the function output is posted to redirect destination as well

Current Behaviour

Client posts back results, gets 30x redirect from server and follows redirect passing all relevant headers, but does not include data (body) in that second request

Possible Solution

Same as expected behaviour above

Steps to Reproduce (for bugs)

  1. Deploy something like requestbin on HTTPS with redirect from HTTP to HTTPS
  2. Invoke function asynchronously providing HTTP url to requestbin as callback URL
  3. Requestbin will see post with all headers but no body (function output)
  4. If callback URL is switched to HTTPS to avoid redirect, all data gets captured correctly and available in requestbin

Context

It took a long time to debug and figure out what happens there and why async functions don't return data correctly.
This was caused by requestbin offering an HTTP URL while the server running requestbin was forcing a redirect to HTTPS - found by accident.

Your Environment

faasd v0.9.6 - default install and default docker-compose.yml

Clarify queue-worker durable queue implementation and delivery semantics

Before we start working on DLQ (#81) and ACK options (#80) it would be good to get a better understanding of the current code regarding the durable queue subscription.

  1. Durable queue seems to be optional (default off) as per environment configuration (see recent fix #76) - why is durable subscription not default/always on? Isn't that the desired behavior in async mode?
  2. var durable, even if unset, will be passed to q.conn.QueueSubscribe() with stan.DurableName(q.durable) - what is the STAN behavior if one passes an empty string here? STAN code doesn't check, so I am not 100% sure about the expected behavior (especially in combination with the queue-worker unsubscribe call during shutdown)
  3. q.startOption is set to stan.StartWithLastReceived() - related to the unclear (at least to me) behavior described in (2), the durable subscription semantics are described as "Note that once a queue group is formed, a member's start position is ignored" (redundant?)
  4. stan.AckWait(q.ackWait) (30s) is overwritten by faas-netes [1]; we should be consistent here
  5. ack, ack-wait and persistency (durability) behavior (in-memory) should be documented so users understand the "at most once delivery" semantics for the async implementation (no retries for sync gateway fn calls/callbacks if specified) - with one exception (see next)
  6. if the invoked (sync) function runs longer than ack-wait (30/60s default depending on environment), a message could be delivered multiple times when running multiple queue-workers - receivers should account for idempotency (clarify message handling in docs)
  7. returning "202 accepted" by the gateway for async routes underlines the "at most once delivery semantics" (e.g. edge case where message is not persisted and gateway crashes) - would a "201 created" be more aligned with what users expect (especially when considering "at least once" semantics with higher durability?); also see related issue [2]
  8. several different timeout handlers are at play (http client for gateway/callback invocations, STAN ack-wait handling) - more consistency or guidelines in docs
  9. with the upcoming architectural changes in NATS (JetStream) how does this affect the queue-worker implementation (tracking item to keep this in mind when making changes)

cc/ @bmcstdio

[1]
https://github.com/openfaas/faas-netes/blob/41c33f9f7c29e8276bd01387f78d6f0cff847890/yaml/queueworker-dep.yml#L50

[2]
openfaas/faas#1298

Expected Behaviour

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ):

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • Link to your project or a code example to reproduce issue:

[Feature Request] CRD for Queue/Queue Worker

Expected Behaviour

Right now faas-netes exposes up a handy little CRD for Functions.

At the moment, I find myself copying a live deployment of a queue-worker and turning that into a helm template, which is slightly dangerous because it means that users running more than the default queue all end up with somewhat bespoke setups.

It would be nice to be able to define a Queue as a CRD similar to a Function, which would allow things like the Pro queue worker and standard queue worker to be bundled under one resource type.

That, and it would be nice to see things like queue capacity in k9s by simply typing :queue, but that's another story.

Current Behaviour

There is not a CRD for this right now.

Possible Solution

Steps to Reproduce (for bugs)

Context

Not having to DIY my own queue worker deployments, which can potentially cause drift from OpenFaaS and may increase the need for troubleshooting if a DIY deployment is using a version with different behavior than the default queue worker.

Also, it makes it possible (perhaps likely) to have different functions using different queue workers that all have different images, because the image tag wasn't exposed in a template or the like.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ):
    0.13.13

  • Docker version docker version (e.g. Docker 17.0.05 ):
    20.10.8

  • What version and distribution of Kubernetes are you using? kubectl version
    server v1.21.3
    client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS):
    MacOS

  • Link to your project or a code example to reproduce issue:

  • What network driver are you using and what CIDR? i.e. Weave net / Flannel

ARM64 Docker build failing on Rock64

Expected Behaviour

Build should pass

Current Behaviour

docker build --build-arg http_proxy="" --build-arg https_proxy="" -t openfaas/queue-worker:0.5.6-arm64 . -f Dockerfile.arm64
Sending build context to Docker daemon  5.473MB
Step 1/19 : FROM golang:1.10-alpine as golang
 ---> 9b358c7fcf94
Step 2/19 : WORKDIR /go/src/github.com/openfaas/nats-queue-worker
 ---> Using cache
 ---> 5c38a19911f4
Step 3/19 : COPY vendor     vendor
 ---> Using cache
 ---> 1b688da519ea
Step 4/19 : COPY handler    handler
 ---> Using cache
 ---> 51cd8aa2edb2
Step 5/19 : COPY main.go  .
 ---> Using cache
 ---> fbf565b66600
Step 6/19 : COPY readconfig.go .
 ---> Using cache
 ---> 6f2fe52e627c
Step 7/19 : COPY readconfig_test.go .
 ---> Using cache
 ---> 38a76238ab59
Step 8/19 : COPY auth.go .
 ---> Using cache
 ---> f9e76bd42930
Step 9/19 : RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
 ---> Running in cc8e4e8e5ec3
main.go:21:2: cannot find package "github.com/openfaas/nats-queue-worker/nats" in any of:
        /go/src/github.com/openfaas/nats-queue-worker/vendor/github.com/openfaas/nats-queue-worker/nats (vendor tree)
        /usr/local/go/src/github.com/openfaas/nats-queue-worker/nats (from $GOROOT)
        /go/src/github.com/openfaas/nats-queue-worker/nats (from $GOPATH)
Makefile:24: recipe for target 'ci-arm64-build' failed
Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:17:01 2018
  OS/Arch:          linux/arm64
  Experimental:     false

I'm using publish-arm.sh from openfaas/faas/contrib

Alex

Durable subscriptions?

Expected Behaviour

Invocations made while no workers are running are eventually performed (i.e. after there's at least one worker running again).

Current Behaviour

Only the very last invocation made while no workers are running is actually performed; the rest are lost.

Possible Solution

Support durable queue subscriptions, which should be just a matter of setting durable to a non-empty value.

Steps to Reproduce (for bugs)

  1. Run one or more instances of the worker.
  2. Invoke a function asynchronously, and observe it being invoked.
  3. Scale the number of workers down to 0.
  4. Wait for some time, and observe (via the NATS monitoring endpoint) that all subscriptions have been removed.
  5. Invoke a function asynchronously two or more times.
  6. Scale the number of workers back to 1.
  7. Observe that the function is invoked only once, and with whatever values it was invoked with last.

Context

N/A

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ):

N/A

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?

Kubernetes (OpenFaaS Operator)

  • Operating System and version (e.g. Linux, Windows, MacOS):

N/A

  • Link to your project or a code example to reproduce issue:

N/A

Custom Ack timeout

Description

Ack time-out appears to be set to 30 seconds. This means async jobs should complete within 30 seconds or they will be retried.

Workarounds

The synchronous timeout can be upped and used instead.

Solution

We should look at options, but we can:

  • Ack early and continue ( functions/queue-worker:ack_early )
  • Make the timeout custom

Thanks @blu3gui7ar for notifying.

Expose nats monitoring endpoints to prometheus

Expected Behaviour

Expose stats like pending messages to prometheus and grafana

Current Behaviour

These are currently not exported

Possible Solution

  • Open the monitoring endpoint on the streaming server
  • start the nats-prometheus-exporter
  • enable scraping of the produced metrics

Context

This is a result of #61

Show built version and SHA on start-up

Expected Behaviour

Users should be able to identify versions from logs

Current Behaviour

You can inspect the Docker image

Possible Solution

See the logs of faas-netes and use the same approach
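
A common approach is to stamp the version information into the binary at build time with -ldflags; a minimal sketch of that shape (variable names are illustrative):

package main

import "log"

// Overridden at build time, e.g.
//   go build -ldflags "-X main.Version=0.12.0 -X main.GitCommit=$(git rev-parse HEAD)"
var (
	Version   = "dev"
	GitCommit = "unknown"
)

func main() {
	log.Printf("Queue-worker version: %s\tcommit: %s", Version, GitCommit)
	// ... start the queue-worker as usual
}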

Steps to Reproduce (for bugs)

  1. arkade install openfaas
  2. kubectl logs -n openfaas deploy/queue-worker

Context

Better consistency and logging.

TLSInsecure option required for callbacks

Cognite have requested a "TLSInsecure" option for callbacks due to incorrectly configured services within their network that cannot be changed.

I don't mind adding this change as long as the default is off since this is also present in the CLI.

Per-function ack_wait

In a previous conversation @alexellis and I discussed some items related to the queue worker, one of which being to verify whether or not the queue worker ack_waits for multiple functions using 1 "global" setting, or on a per-function basis.

Expected Behaviour

When discussing multiple functions being listened to at the same time on a single queue worker, we discussed potentially preferred behavior, in the effort to have a single queue worker, and have it autoscale to meet demand, rather than having a static replica count and different wait times per queue.

Given the following:

  • 1 queue worker with an ack_wait of 3m15s and max_inflight of 2
  • 1 function sleep1 with duration of 1m and write_timeout of 1m5s
  • 1 function sleep2 with duration of 3m and write_timeout of 3m5s

We assume a kubernetes environment or environment with a similar orchestration layer and pattern to kubernetes, and we assume the event triggering the pod is a graceful shutdown command, such as a Node draining for maintenance and scheduling resources on a different Node.

Expecting events with rough timing; the sections in the format [duration] are the general timings from the start of this example timeline

  • queue worker is subscribed to channel(s) [0s]
  • sleep1 is invoked via gateway and sent to nats [0s]
  • sleep2 is invoked via gateway and sent to nats [0s]
  • queue worker receives a message from nats for sleep1 [0s]
  • queue worker receives a message from nats for sleep2 [0s]
  • queue worker begins function invocation for sleep1 call [0s]
  • queue worker begins function invocation for sleep2 call [0s]
  • queue worker receives SIGTERM (via drain), a new queue worker is scheduled to replace it [5s]
    • we assume that graceful shutdown does not occur here, either because it's not currently implemented, or because something unexpected happens. Why it doesn't ack is out of scope of this issue, but we can assume it is sent a SIGKILL for this example.
  • new queue worker comes online, subscribes [7s]
  • sleep1 invocation completes, is not acknowledged [1m]
  • new queue worker receives a message from nats for sleep1 [1m5s]
  • new queue worker invokes sleep1 [1m5s]
  • sleep1 invocation completes, is handled by queue worker [2m5s]
  • sleep2 invocation completes, is not acknowledged [3m]
  • new queue worker receives a message from nats for sleep2 [3m5s]
  • new queue worker invokes sleep2 [3m5s]
  • sleep2 invocation completes, is handled by queue worker [6m5s]

Current Behaviour

An example of this timing with the same settings and format as above, functional (non-timing) differences in bold italics:

  • queue worker is subscribed to channel (note: only 1 channel) [0s]
  • sleep1 is invoked via gateway and sent to nats [0s]
  • sleep2 is invoked via gateway and sent to nats [0s]
  • queue worker receives a message from nats for sleep1 [0s]
  • queue worker receives a message from nats for sleep2 [0s]
  • queue worker begins function invocation for sleep1 call [0s]
  • queue worker begins function invocation for sleep2 call [0s]
  • queue worker receives SIGTERM (via drain), a new queue worker is scheduled to replace it [5s]
    • we assume that graceful shutdown does not occur here, either because it's not currently implemented, or because something unexpected happens. Why it doesn't ack is out of scope of this issue, but we can assume it is sent a SIGKILL for this example.
  • new queue worker comes online, subscribes [7s]
  • sleep1 invocation completes, is not acknowledged [1m]
  • sleep2 invocation completes, is not acknowledged [3m]
    • note that in the previous example the second invocation of sleep1 had already completed and been handled by this point
  • new queue worker receives a message from nats for sleep1 [3m15s]
  • new queue worker receives a message from nats for sleep2 [3m15s]
  • new queue worker invokes sleep1 [3m15s]
  • new queue worker invokes sleep2 [3m15s]
  • sleep1 invocation completes, is handled by queue worker [4m15s]
  • sleep2 invocation completes, is handled by queue worker [6m15s]

The major differences from the above:

  • sleep1's result takes 4m15s to finally come through, vs 2m5s in the previous example
  • sleep2's result takes 6m15s to finally come through, vs 6m5s in the previous example
  • in the previous example, ack_wait for the queue worker itself becomes functionally irrelevant (graceful shutdown is a different issue)

Possible Solution

As we had discussed previously, it would likely be advantageous to have different ack_wait times per function, and instead have a single queue worker that simply has no ack_wait of its own, and rather only knows about a graceful shutdown duration, which the user would have to configure in advance, knowing the ack_wait of their environment's longest running function.

The difference in this implementation would likely be to have multiple subscriptions with different AckWait periods in the queue worker, which may require more channels, rather than the current implementation, which only listens to 1 channel (based on what I see in the environment variables, specifically the variable faas_nats_channel).
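
A sketch of that shape, with one subscription per function channel and its own AckWait; the ackWaits map and the channel naming are hypothetical:

// subscribePerFunction creates one durable queue subscription per function
// channel, each with its own redelivery window (AckWait).
func subscribePerFunction(conn stan.Conn, ackWaits map[string]time.Duration, handler stan.MsgHandler) error {
	for channel, wait := range ackWaits {
		// e.g. "faas-request-sleep1" -> 1m5s, "faas-request-sleep2" -> 3m5s
		_, err := conn.QueueSubscribe(channel, "faas", handler,
			stan.DurableName(channel),
			stan.AckWait(wait),
			stan.SetManualAckMode(),
		)
		if err != nil {
			return err
		}
	}
	return nil
}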

The issue

The big wrench in this discussion is the max_inflight for a particular function's queue. For example, let's say I have a function with a concurrency limit of 100 (watchdog.max_inflight) and a maximum pod count of 10 (com.openfaas.scale.max). From those values, you can presume that that function's queue should not have more than 1000 (queueWorker.max_inflight) being attempted at once, because otherwise you'd be trying to send invocations to a function that would not be able to handle the request because all pods are busy.

The questions that occur to me, which effectively prevent this solution from actually working as expected, are:

how does a queue worker know how many maximum in-flight invocations it should be able to send to a function?

I would say that this could be calculated by watchdog.max_inflight * com.openfaas.scale.max. The queue worker would then potentially not need its own max_inflight, and instead be able to be autoscaled based on cpu/memory or a custom metric.

how does a queue worker know how many are already in-flight by other replicas?

I would say that this doesn't have to be perfectly immediate between pods, and you could potentially accomplish this with an external lookup (metrics or some such), and as long as it's able to prevent an endless flood of 429s, it should be fine with not being immediate (at least, for the pro queue worker, which can retry on 429s from pods).

Steps to Reproduce (for bugs)

Context

  • We would like to be able to have 1 queue worker handle multiple functions with different timings for retries (nats redelivery).
  • We would like the queue workers to be able to understand the realistic maximum number of invocations for a particular function, so as to not hit busy pods.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ):
    0.13.13

  • Docker version docker version (e.g. Docker 17.0.05 ):
    20.10.8

  • What version and distribution of Kubernetes are you using? kubectl version
    server v1.21.3
    client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS):
    MacOS

  • Link to your project or a code example to reproduce issue:

  • What network driver are you using and what CIDR? i.e. Weave net / Flannel

Support scale_from_zero

Expected Behaviour

In the gateway, when a function is invoked which has zero replicas, it will request that the function is scaled up and then block the HTTP request until the function is ready. The queue-worker should mirror this behaviour.

Current Behaviour

The queue-worker calls functions directly over HTTP, so if a function (deployment) has 0/0 replicas it will return an error - HTTP 502 or similar.

Direct function call: https://github.com/openfaas/nats-queue-worker/blob/master/main.go#L119

Possible Solution

Option 1:

Make use of the gateway code for maintaining a list of services and replicas, then call the existing gateway API to scale. Code can be re-used from exported packages.

Option 2:

Have the queue-worker always call the gateway instead of the functions directly (see the sketch after these options). Adds latency and means the async timeout will be reduced to the sync route timeout value.

Option 3:

Have the gateway scale deployments to minReplicas whenever it gets an async work-item queued.
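
To illustrate Option 2, the queue-worker would build its invocation URL against the gateway, which already handles scale-from-zero on the synchronous path, instead of the function's own service. A sketch only; the names and the DNS suffix are assumptions based on log lines elsewhere on this page:

// functionURL picks between invoking via the gateway and calling the
// function's service directly.
func functionURL(gatewayInvoke bool, functionName, gatewayAddress string, gatewayPort int, functionSuffix string) string {
	if gatewayInvoke {
		// The gateway can scale the function from zero before proxying the call.
		return fmt.Sprintf("http://%s:%d/function/%s", gatewayAddress, gatewayPort, functionName)
	}
	// Direct call, e.g. http://figlet.openfaas-fn.svc.cluster.local:8080/
	return fmt.Sprintf("http://%s%s:8080/", functionName, functionSuffix)
}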

Steps to Reproduce (for bugs)

  1. Enable scale from zero
  2. Deploy figlet from store
  3. Scale figlet to zero with kubectl/docker
  4. Invoke figlet via gateway and see it scale
  5. Repeat process, but this time invoke via async route.

Context

scale_from_zero is an important feature for multi-tenant environments

Posting function statistics to the gateway returns http status 401

The queue worker fails to post reports back to the gateway when gateway_invoke is false and basic_auth is set to "1".

Expected Behaviour

Posting reports back to the gateway should return http status 202.

Current Behaviour

When gateway_invoke is false and the basic_auth env variable is set to "1" posting a report back to the gateway fails with http status 401.

queue-worker logs:

[#1] Invoking: figlet with 6 bytes, via: http://figlet.openfaas-fn.svc.cluster.local:8080/
[#1] Invoked: figlet [200] in 0.016794s
[#1] figlet returned 138 bytes
[#1] Posting report for figlet, status: 401

Possible Solution

The postReport function reads the value of the basic_auth env variable but only treats the string "true" as truthy, while ReadConfig accepts both "1" and "true" as truthy.

nats-queue-worker/main.go

Lines 318 to 320 in b5f165b

if os.Getenv("basic_auth") == "true" && credentials != nil {
request.SetBasicAuth(credentials.User, credentials.Password)
}

Two possible solutions:

  1. Pass config as an argument to the postReport function and use the value of the BasicAuth field in the above mentioned conditional.
  2. Skip the check of the basic_auth value since credentials will always be nil if config.BasicAuth is false.
    var credentials *auth.BasicAuthCredentials
    var err error

    if config.BasicAuth {
        log.Printf("Loading basic authentication credentials")
        credentials, err = LoadCredentials()
        if err != nil {
            log.Printf("Error with LoadCredentials: %s ", err.Error())
        }
    }
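
For comparison, the first suggestion would make postReport reuse the flag that ReadConfig has already parsed, so that "1" and "true" behave the same everywhere. A sketch of the changed conditional only:

// Instead of re-reading the raw environment variable:
//   if os.Getenv("basic_auth") == "true" && credentials != nil { ... }
// pass the parsed config into postReport and use its BasicAuth field:
if config.BasicAuth && credentials != nil {
	request.SetBasicAuth(credentials.User, credentials.Password)
}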

Steps to Reproduce (for bugs)

  1. Deploy openfaas on Kubernetes using arkade.
  2. Update the queue-worker deployment and change the following env variables.
- name: gateway_invoke
  value: "false"
- name: basic_auth
  value: "1"
  3. Deploy a function and invoke it async

Context

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ): 20.10.9

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)? Kubernetes

  • Operating System and version (e.g. Linux, Windows, MacOS): Linux

  • Link to your project or a code example to reproduce issue:

[Feature request] Dead letter queue for NATS

My actions before raising this issue

Using async invocation it seems there's no way to tell whether the invocation eventually succeeded. Failure could be caused by API issues, functions being deleted/not accepting connections (SIGTERM), event payload issues causing exceptions or simple app logic bugs within the function.

For async invocation this is usually handled with a dead letter queue (DLQ). I could not find any mention of DLQ support in OpenFaaS/NATS (STAN). How is this dealt with today? Is it a concern at all? Does STAN automatically redrive failed invocations? If so, how many until it gives up?

Expected Behaviour

Failure during async function invocation should be trackable, if possible using DLQ where events can be inspected and potentially redriven.

Current Behaviour

Tested async invocation via faas-cli and a connector using connector-sdk where the subscribed function does not exist (anymore). There was no error reported, leaving the caller believing that the invocation would eventually succeed (even though 202 technically does not give a guarantee, so introspection capabilities would be generally useful in a 202 setup).

A workaround seems to be to provide callbacks where the error status can be introspected. Not sure if this is always possible (CLI) or desired.

Details see here: openfaas/faas#1298

Possible Solution

Implement a DLQ capability. Are there already metrics exposed for failed async function invocations?

Steps to Reproduce (for bugs)

Simply call faas-cli -a (or curl) on a non-existing function.

Context

I sense potential consistency issues (no error reported while the function was not executed at all) leading to hard-to-debug issues. Also, malformed payloads and application logic bugs could be hidden by the current implementation (if my understanding of the issue is correct and complete).

Queue Worker does not gracefully shut down

In a previous conversation @alexellis and I discussed some items related to the queue worker, one of which being to verify whether or not the queue worker gracefully shuts down, or if it just abandons its work.

Expected Behaviour

The behavior we discussed that we desired was that the queue worker attempts to gracefully shut down by:

  • stop subscribing to new messages from NATS
  • finish up all invocations it had started, as normal
  • when no longer working on any invocations, exit 0

An example of this timing for a sleep function with the following config:

  • sleep duration of 30s
  • [x]_timeouts of 1m
  • queue worker with an ack_wait of 1m5s

We assume a kubernetes environment or environment with a similar orchestration layer and pattern to kubernetes, and we assume the event triggering the pod is a graceful shutdown command, such as a Node draining for maintenance and scheduling resources on a different Node.

Expecting events with rough timing; the sections in the format [duration] are the general timings from the start of this example timeline:

  • async function is invoked via gateway and sent to nats [0s]
  • queue worker receives a message from nats subscription [0s]
  • queue worker invokes sleep function, which is configured to sleep for 30 seconds [0s]
  • queue worker receives SIGTERM(via drain), a new queue worker is scheduled to replace it [5s]
  • queue worker stops subscribing to new messages [5s]
  • new queue worker comes online, subscribes [7s]
  • function invocation completes [30s]
  • queue worker receives response and handles as normal [30s]
  • queue worker notices it has no more invocations to wait for, and exits with a status code of 0 [30s]
  • queue worker pod is removed [30s]

Current Behaviour

Currently the queue worker exits immediately; I don't even see a log line such as "received SIGTERM" or the like.
Once the queue-worker comes back online, nats eventually sends the message again.

An example of this timing with the same settings and format as above, functional (non-timing) differences in bold italics:

  • async function is invoked via gateway and sent to nats [0s]
  • queue worker receives a message from nats subscription [0s]
  • queue worker invokes sleep function, which is configured to sleep for 30 seconds [0s]
  • queue worker receives SIGTERM(via drain), a new queue worker is scheduled to replace it [5s]
  • queue worker immediately exits [5s]
  • new queue worker comes online, subscribes [7s]
  • function invocation completes, but is not handled by anything [30s]
  • original invocation is considered a "miss" and resent by nats [1m5s]
  • new queue worker receives a message from nats [1m5s]
  • new queue worker invokes sleep function again, which is configured to sleep for 30 seconds [1m5s]
  • second function invocation completes [1m35s]
  • new queue worker receives response and handles as normal [1m35s]

The two major differences from the above:

  • 2 function invocations occurred
  • the overall time was extended by the ack_wait duration, meaning a process that should take 30s instead takes 1m35s (function duration + ack_wait duration)

Possible Solution
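
A minimal sketch of the shutdown sequence described under Expected Behaviour, assuming manual acks, a durable subscription and a sync.WaitGroup that tracks in-flight invocations (names are illustrative). Note that sub.Close() keeps a durable subscription's position, unlike Unsubscribe():

// runUntilTerminated blocks until SIGTERM/SIGINT, stops taking new messages,
// waits for in-flight invocations, then exits cleanly.
func runUntilTerminated(conn stan.Conn, sub stan.Subscription, inFlight *sync.WaitGroup) {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	log.Println("SIGTERM received: no longer accepting new messages")

	// Stop receiving new messages while keeping the durable subscription state.
	if err := sub.Close(); err != nil {
		log.Printf("error closing subscription: %v", err)
	}

	// Wait for invocations that are already running to finish and be acked.
	inFlight.Wait()

	conn.Close()
	os.Exit(0)
}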

Steps to Reproduce (for bugs)

Context

We are interested in the timing of jobs, as well as not duplicating function invocations. If graceful shutdown were implemented, we could expect certain invocations not to have to wait for the full ack_wait duration before the function is attempted again.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ):
    0.13.13

  • Docker version docker version (e.g. Docker 17.0.05 ):
    20.10.8

  • What version and distribution of Kubernetes are you using? kubectl version
    server v1.21.3
    client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS):
    MacOS

  • Link to your project or a code example to reproduce issue:

  • What network driver are you using and what CIDR? i.e. Weave net / Flannel

Dynamic max_inflight

Expected Behaviour

I have a function which runs ffmpeg to convert a video. It's CPU bound, so I've used HPAv2 to autoscale the pods running my function. This works great, but since my functions take a while to finish, I'm using the queue worker to do async processing.

My issue is that I cannot dynamically set max_inflight. Ideally I would like each pod running my function to process n tasks at once; in my case, n would be set to 1. If I set max_inflight to 1, even though my autoscaler would bring up a 2nd pod, it would never be used in parallel, since the queue worker only schedules one invocation at a time. If I set max_inflight to a higher value, I risk invoking the function multiple times before my autoscaler can kick in, and my long-running function will be using the same pod for multiple tasks.

Ideally I would like max_inflight to mirror n*pod_count. I could scale the queue workers themselves, but they would have to perfectly mirror the count of function pods at any given moment. Is there some way to tell the queue worker to always have the same max_inflight as the number of instances of my function? Without this, autoscaling OpenFaaS is really limited to non-async functions.

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

Context

I'm trying to autoscale my openfaas function, which I call asynchronously.

Your Environment

  • Kubernetes 1.24

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?

  • Kubernetes FaaS-netes with CRD

  • Operating System and version (e.g. Linux, Windows, MacOS):

  • Linux

  • Link to your project or a code example to reproduce issue:

Make NATS Streaming message input/response optional

Because func_queue-worker writes output to the log, it is visible when you run docker service logs func_queue-worker.

This means that if you are returning binary data from your function, the logs will be useless.

I think that debug logs of output data should be suppressed by default (similar to write_debug = false).

Feature: ppc64le support

Expected Behaviour

nats-queue-worker compiles on Linux/ppc64le

Current Behaviour

nats-queue-worker doesn't compile on Linux/ppc64le

Possible Solution

merge request will come

Context

trying to deploy openfaas on ICP on RHEL/ppc64le

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ):
    18.06.3-ce
  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?
    Kubernetes
  • Operating System and version (e.g. Linux, Windows, MacOS):
    RHEL 7/ppc64le

Support a manual acknowledgement mode.

Expected Behaviour

From what I understand from reading the code, and from my tests, if a function invocation fails for some reason (networking, ...) or returns a non-2xx status code, the invocation won't be replayed because (a) messages are automatically acknowledged upon being received, and (b) there's no re-queue logic in the error handling bits (here and here).

Current Behaviour

The queue worker, for the most part, ignores whether the function invocation was successful or not, and does not perform retries.

Possible Solution

It would be good to have a (possibly opt-in) manual acknowledgement mode, possibly coupled with a customisable AckWait timeout, in which networking errors or 5xx status codes would cause an invocation not to be acknowledged, and hence retried (i.e. a redelivery would occur), while 4xx errors would cause an invocation to be acknowledged (causing the invocation not to be tried again, as it would most probably never succeed).

Things that might lend themselves to discussion are what to do with 3xx status codes, and whether to retry each invocation only a predefined number of times (and how to keep track of that).
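
As a sketch of those semantics, the subscription would use stan.SetManualAckMode() and the handler would only acknowledge outcomes that should not be redelivered; invokeFunction is a placeholder for the HTTP call to the function:

func handle(msg *stan.Msg) {
	statusCode, err := invokeFunction(msg.Data)

	switch {
	case err != nil:
		// Network error: do not ack, so STAN redelivers after AckWait.
		log.Printf("invocation failed, leaving message for redelivery: %v", err)
	case statusCode >= 500:
		// Server-side failure: also leave the message for redelivery.
		log.Printf("got %d, leaving message for redelivery", statusCode)
	default:
		// 2xx and 4xx: retrying would not help, acknowledge the message.
		if ackErr := msg.Ack(); ackErr != nil {
			log.Printf("failed to ack message: %v", ackErr)
		}
	}
}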

Steps to Reproduce (for bugs)

  1. N/A
  2. N/A
  3. N/A
  4. N/A

Context

I am trying to understand whether the queue worker could easily be used in a scenario in which function invocations must be retried (possibly up to a predefined number of times) in case they fail.

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ):

N/A

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?

N/A

  • Operating System and version (e.g. Linux, Windows, MacOS):

N/A

  • Link to your project or a code example to reproduce issue:

N/A

Allow alternate port for gateway to invoke functions

Expected Behaviour

When using OpenFaaS on Nomad I'd expect to be able to override the gateway port based on my configuration.

Current Behaviour

The queue worker has a hard-coded value of :8080 as the gateway port. https://github.com/openfaas/nats-queue-worker/blob/master/main.go#L80

Possible Solution

Allow the gateway port to be configurable or pass and use the information from the async request message itself.

Context

I cannot set up the queue worker in my Nomad-based OpenFaaS environment, since my gateway is not listening on port 8080 and there's no default overlay networking.

max_inflight env var is ignored

Hey,
I have an OpenFaaS cluster set up and running with k8s and I am using async function invocation.
As it is written in the docs (https://docs.openfaas.com/deployment/troubleshooting/) it is possible to configure parallel executions for each queue worker.

As it is written in https://github.com/openfaas/nats-queue-worker/blob/master/readconfig.go, there is an environment variable max_inflight to control the number of parallel tasks (correct me if I am mistaken). I have set it using the helm chart as shown here:
https://github.com/avielb/faas-netes/commit/878a6b9ddb2eb2d94c2aafb861d319b4dc778062

I can confirm the variable is set in the container, but still, when I execute 3 async function calls in a row they are executed one by one and not in parallel.

Expected Behaviour

Async functions should be executed in parallel with 1 queue worker.

Current Behaviour

Invocations are processed serially, one by one, ignoring the max_inflight env var.

Possible Solution

Steps to Reproduce (for bugs)

  1. Deploy k8s with Helm
  2. Deploy the helm chart using my fork: https://github.com/avielb/faas-netes/
    cd chart/openfaas && helm install --debug ./ -f values.yaml -n openfaas
  3. Deploy a long-running function

Context

Your Environment

  • Docker version docker version (e.g. Docker 17.0.05 ): Docker Desktop 18.09.1 and EKS

  • Are you using Docker Swarm or Kubernetes (FaaS-netes)? FaaS-netes indeed

  • Operating System and version (e.g. Linux, Windows, MacOS): MacOS and EKS with AWS linux

  • Link to your project or a code example to reproduce issue: https://github.com/avielb/faas-netes/

Move the queue-worker to Go modules

Task

Move the queue-worker to Go modules

Dep is now deprecated.

I would suggest that a previous contributor to openfaas takes this.

I'd be most comfortable if @matthiashanel had time to look into it given his recent contributions and understanding of the NATS work upstream.

[Research] Retries for certain HTTP codes

This issue is to gather research and opinions on how to tackle retries for certain HTTP codes.

Expected Behaviour

If a function returns certain errors like 429 (too busy) (as can be set by max_inflight in the function's watchdog), then the queue-worker could retry the request a number of times.

Current Behaviour

The failure will be logged, but not retried.

It does seem like retries will be made implicitly if the function invocation takes longer than the "ack window".

So if a function takes 2m to finish, and the ack window is 30s, that will be retried, possibly indefinitely in our current implementation.

Possible Solution

I'd like to gather some use-cases and requests from users on how they expect this to work.

Context

@matthiashanel also has some suggestions on how the new NATS JetStream project could help with this use-case.

@andeplane recently told me about a custom fork / patch that retries up to 5 times whenever a 429 error is received with an exponential back-off. My concern with an exponential backoff with the current implementation is that it effectively shortens the ack window and could cause undefined results.
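
For discussion, a sketch of such a patch that also tries to address the ack-window concern by never backing off past the remaining window; numbers and names are illustrative:

// retryOn429 retries the invocation on 429 with exponential back-off, up to
// maxRetries times, but stops early rather than sleep beyond the ack window.
func retryOn429(invoke func() (int, error), maxRetries int, ackWindow time.Duration) (int, error) {
	deadline := time.Now().Add(ackWindow)
	delay := time.Second

	var status int
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		status, err = invoke()
		if err != nil || status != http.StatusTooManyRequests {
			return status, err
		}
		if time.Now().Add(delay).After(deadline) {
			break // backing off any further would outlive the ack window
		}
		time.Sleep(delay)
		delay *= 2
	}
	return status, err
}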

The Linkerd team also caution about automatic retries in their documentation, citing the risk of cascading failure: "How Retries Can Go Wrong", "Choosing a maximum number of retry attempts is a guessing game", "Systems configured this way are vulnerable to retry storms" -> https://linkerd.io/2/features/retries-and-timeouts/

The team discuss a "retry budget" - should we look into this?

Should individual functions be able to express an annotation with retry data? I.e. a backoff for processing an image may be valid at 2, 4, 8 seconds, but retrying a Tweet because Twitter's API has rate-limited us for 4 hours, will clearly not work.

What happens if we cannot retry a function call like in the Twitter example above? Where does the message go, how is this persisted? See also (call for a dead-letter queue) in #81

Finally, if we do start retrying, that metadata seems key to operational tuning of the system and auto-scaling, should this be exposed via Prometheus metrics and a HTTP /metrics endpoint?

Update docker Go version to 1.11.13

Allow message to be verified

Feature: Non-repudiation for queue-worker callbacks

Suggested by: Ed Wilde @ewilde

We can use HMAC or RSA and HMAC together to sign messages when we use the X-Callback-Url. This means that receivers of the callback messages can verify the sender as the queue worker vs. some bad actor that discovered the URL.
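
A minimal sketch of the HMAC variant, signing the callback body with a shared secret so the receiver can verify the sender. The header name is an assumption, not an agreed convention (uses crypto/hmac, crypto/sha256 and encoding/hex):

// signCallback attaches an HMAC-SHA256 signature of the body to the callback request.
func signCallback(req *http.Request, body, secret []byte) {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	req.Header.Set("X-Queue-Worker-Signature", "sha256="+hex.EncodeToString(mac.Sum(nil)))
}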

Provide original function name in callback

The callback currently contains the following meta-data from the original invocation:

Header               Description
X-Call-Id            The original function call's tracing UUID
X-Duration-Seconds   Time taken in seconds to execute the original function call
X-Function-Status    HTTP status code returned by the original function call

I am proposing that we add the original function name as well

Header            Description
X-Function-Name   The original function name that was invoked

Use case

With this extra piece of meta-data people can design their own back-off / retry / eventual consistency systems, without the need to modify any core OpenFaaS components

Consider this design now enabled by the extra field

Path needs transformation before invoking via gateway_invoke

The path for a service needs transformation before invoking via gateway_invoke

Expected Behaviour

If env is invoked as per:

curl gateway/async-function/env -d "test"

The path received by the function should be /, whatever the value of gateway_invoke - true or false.

Current Behaviour

When gateway_invoke is true and direct_functions is set to false, the whole path is prepended:

curl gateway/async-function/env -d "test"

The function receives a request for /async-function/env3 instead of /

Possible Solution

There are several areas where changes could be made.
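
For illustration only, one possible shape of the rewrite, stripping the async routing prefix before the request reaches the function (whether this belongs in the gateway or the queue-worker is part of the discussion):

// rewriteAsyncPath strips the async routing prefix so the function sees the
// same path ("/") regardless of gateway_invoke.
func rewriteAsyncPath(path, functionName string) string {
	prefix := "/async-function/" + functionName
	trimmed := strings.TrimPrefix(path, prefix)
	if trimmed == path {
		return path // prefix not present, leave untouched
	}
	if trimmed == "" {
		return "/"
	}
	return trimmed
}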

Steps to Reproduce (for bugs)

  1. Deploy with --set gateway.directFunctions=false and --set queueWorker.gatewayInvoke=true
  2. Deploy faas-cli deploy --name env --image functions/alpine:latest --fprocess=env
  3. Invoke, and use an async receiver URL to view the result i.e. run nc -l 8888 on your laptop and then add the address in the header: -H "X-Callback-Url: http://192.168.0.28:8888" to your invoke

You'll then see this incorrect path:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
fprocess=env
HOME=/home/app
Http_Accept=*/*
Http_X_Call_Id=b3aa50ae-a31d-4eb9-9987-915ef97f0ca8
Http_X_Start_Time=1577136342524082049
Http_X_Callback_Url=http://192.168.0.28:8888
Http_X_Forwarded_For=172.19.0.1:47744
Http_X_Forwarded_Host=gateway:8080
Http_User_Agent=curl/7.47.0
Http_Content_Length=0
Http_Accept_Encoding=gzip
Http_Content_Type=application/x-www-form-urlencoded
Http_Method=POST
Http_ContentLength=0
Http_Path=/async-function/env3
Http_Host=172.19.0.78:8080

Now set deploy with --set gateway.directFunctions=true and --set queueWorker.gatewayInvoke=false

Perform the same invocation, and you should see the Path as Http_Path=/

Context

This will break some functions and services; for instance, figlet did not seem to work with this change.

Pinging @LucasRoesler @ewilde
