cloudfoundry / loggregator-release
Cloud Native Logging
License: Apache License 2.0
We have a number of users who would like to use TLS mutual authentication with their syslog drains, but Doppler doesn't currently support that use case. Would you be willing to accept a pull request to add the support?
Our proposal is to use the credentials map from the user-provided service definition to hold the client certificate, private key, and (optionally) a certificate chain for the drain. When these properties are provided, we will carry them into Doppler and use them to set up the tls.Config properly for the handshake, mutual auth, and certificate verification.
This would require an updated syslog_drain_urls API in the cloud controller and some changes to the syslog drain binder, the TLS writer, the HTTPS writer, and some ancillary plumbing.
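A minimal sketch of what building the drain's tls.Config from those credentials could look like. The function name and shape are ours for illustration, not loggregator's actual API:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// buildDrainTLSConfig sketches how the proposed credentials map fields
// (client certificate, private key, and an optional CA chain) could be
// turned into a tls.Config for mutual TLS with a drain.
func buildDrainTLSConfig(certPEM, keyPEM, caPEM []byte) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}
	cfg := &tls.Config{
		// Presented to the drain during the mutual-auth handshake.
		Certificates: []tls.Certificate{cert},
	}
	if len(caPEM) > 0 {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("invalid CA chain PEM")
		}
		// Used to verify the drain's server certificate.
		cfg.RootCAs = pool
	}
	return cfg, nil
}

func main() {
	_, err := buildDrainTLSConfig([]byte("not a cert"), []byte("not a key"), nil)
	fmt.Println("error for invalid PEM:", err != nil)
}
```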
Loggregatorlib used to build under Windows, but with the recent change in 80d688e we seem to have dropped the ability to compile, since we didn't carry across the fake Windows implementation from https://github.com/cloudfoundry/loggregatorlib/blob/2b68a3305af889947f0f8c11d0aca4817621f983/cfcomponent/logging_windows.go
Maybe we need to get CI for this project on Windows, since the Garden Windows team depends on it.
Hi, I downloaded loggregator and tried to build it, but the build fails with the following messages:
zcy@ubuntu:~/go/go_workspace/src/loggregator/loggregator$ go build
../store/app_service_store_watcher.go:77: cannot use event.Node (type *storeadapter.StoreNode) as type storeadapter.StoreNode in argument to appServiceFromStoreNode
../store/app_service_store_watcher.go:79: cannot use event.Node (type *storeadapter.StoreNode) as type storeadapter.StoreNode in argument to w.deleteEvent
../store/app_service_store_watcher.go:81: cannot use event.Node (type *storeadapter.StoreNode) as type storeadapter.StoreNode in argument to w.deleteEvent
Is something wrong here?
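For what it's worth, these errors are a pointer/value mismatch: `event.Node` is a `*storeadapter.StoreNode`, but the functions take a `StoreNode` value. A stand-in illustration with a dummy type (not the real storeadapter package):

```go
package main

import "fmt"

// StoreNode stands in for storeadapter.StoreNode to show the mismatch.
type StoreNode struct{ Key string }

// Takes a StoreNode value, like appServiceFromStoreNode in the error message.
func appServiceFromStoreNode(n StoreNode) string { return n.Key }

func main() {
	node := &StoreNode{Key: "/loggregator/services/org/app"}
	// appServiceFromStoreNode(node) // would not compile: *StoreNode vs StoreNode
	fmt.Println(appServiceFromStoreNode(*node)) // dereferencing fixes the mismatch
}
```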
Hi,
I'm running a v205 CF deployment with ~350 application instances.
I just found that the doppler process on the doppler node and the metron_agent process on the gorouter node used too much memory (90.9% and 13.3% respectively), as shown below.
Actually, the memory usage of metron_agent was also >90%; I restarted it yesterday, and after just 14 hours the usage went back up to 13.3%.
There are two doppler nodes and three gorouter nodes.
doppler:
status Running
monitoring status Monitored
pid 21648
parent pid 1
uptime 2d 11h 10m
children 0
memory kilobytes 14948028
memory kilobytes total 14948028
memory percent 90.9%
memory percent total 90.9%
cpu percent 12.5%
cpu percent total 12.5%
metron_agent:
status Running
monitoring status Monitored
pid 28995
parent pid 1
uptime 14h 10m
children 2
memory kilobytes 2195608
memory kilobytes total 2195608
memory percent 13.3%
memory percent total 13.3%
cpu percent 6.8%
cpu percent total 6.8%
running CF 210.2
The /var/vcap/sys/log/loggregator_trafficcontroller/loggregator_trafficcontroller.stderr.log file has frequent occurrences of this type of snippet:
2015/06/16 18:16:13 http: panic serving :: runtime error:
invalid memory address or nil pointer dereference
goroutine 1614338 [running]:
net/http.func·011()
/usr/local/go/src/pkg/net/http/server.go:1100 +0xb7
runtime.panic(0x7c1a40, 0xa4a993)
/usr/local/go/src/pkg/runtime/panic.c:248 +0x18d
github.com/cloudfoundry/loggregatorlib/server/handlers.(*httpHandler).ServeHTTP(0xc209a61e30,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12bad0)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/
github.com/cloudfoundry/loggregatorlib/server/handlers/http_handler.go:29
+0x2f5
trafficcontroller/dopplerproxy.(*Proxy).serveWithDoppler(0xc2080ac0d0,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12bad0, 0xc20aaacd6b, 0xa,
0xc20aaacd46, 0x24, 0x0, 0x12a05f200, ...)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:181
+0x152
trafficcontroller/dopplerproxy.(*Proxy).serveAppLogs(0xc2080ac0d0,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12bad0)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:170
+0x849
trafficcontroller/dopplerproxy.(*Proxy).ServeHTTP(0xc2080ac0d0,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12b930)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:86
+0x4d9
net/http.serverHandler.ServeHTTP(0xc2080046c0, 0x7f5171a3b288,
0xc20abf5ae0, 0xc20b12b930)
/usr/local/go/src/pkg/net/http/server.go:1673 +0x19f
net/http.(*conn).serve(0xc20a8c6f00)
/usr/local/go/src/pkg/net/http/server.go:1174 +0xa7e
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1721 +0x313
2015/06/22 11:28:58 http: panic serving :: runtime error:
invalid memory address or nil pointer dereference
goroutine 2034572 [running]:
net/http.func·011()
/usr/local/go/src/pkg/net/http/server.go:1100 +0xb7
runtime.panic(0x7c1a40, 0xa4a993)
/usr/local/go/src/pkg/runtime/panic.c:248 +0x18d
github.com/cloudfoundry/loggregatorlib/server/handlers.(*httpHandler).ServeHTTP(0xc20bf2d820,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9c70)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/
github.com/cloudfoundry/loggregatorlib/server/handlers/http_handler.go:29
+0x2f5
trafficcontroller/dopplerproxy.(*Proxy).serveWithDoppler(0xc2080ac0d0,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9c70, 0xc20c96e1eb, 0xa,
0xc20c96e1c6, 0x24, 0x0, 0x12a05f200, ...)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:181
+0x152
trafficcontroller/dopplerproxy.(*Proxy).serveAppLogs(0xc2080ac0d0,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9c70)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:170
+0x849
trafficcontroller/dopplerproxy.(*Proxy).ServeHTTP(0xc2080ac0d0,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9ad0)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:86
+0x4d9
net/http.serverHandler.ServeHTTP(0xc2080046c0, 0x7f5171a3b288,
0xc20d33a000, 0xc20c8c9ad0)
/usr/local/go/src/pkg/net/http/server.go:1673 +0x19f
net/http.(*conn).serve(0xc20c9cc680)
/usr/local/go/src/pkg/net/http/server.go:1174 +0xa7e
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1721 +0x313
At the moment we are running load tests against our Cloud Foundry deployment. During the tests our system produces around 12000 logs/sec. We have configured syslog_daemon_config to forward all logs to ELK. At some point we stop getting any logs, and only running monit restart metron_agent on each Cloud Foundry instance resumes log forwarding.
I believe the bottleneck is somewhere between metron_agent and doppler. But as I understand it, with the syslog_daemon_config property enabled, metron_agent only adds an rsyslog template to the /etc/rsyslog.d folder, and all component logs must be forwarded via the rsyslog daemon. So it is not clear to me how monit restart metron_agent can fix the log forwarding issue.
Hi, using syslog_drain_binder I noticed it does not report the instance number (as the CLI does for me):
159 <14>1 2015-11-09T12:03:52.628411+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.mem=1507328 #timestamp=Mon Nov 09 12:03:52 UTC 2015
170 <14>1 2015-11-09T12:03:52.628524+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.heap.committed=1507328 #timestamp=Mon Nov 09 12:03:52 UTC 2015
164 <14>1 2015-11-09T12:03:52.628477+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.mem.free=1113654 #timestamp=Mon Nov 09 12:03:52 UTC 2015
161 <14>1 2015-11-09T12:03:52.628492+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.processors=32 #timestamp=Mon Nov 09 12:03:52 UTC 2015
The cf CLI reports this as e.g. [APP/0].
Is this a bug, or are there some knobs to tune this?
How do I use loggregator in CF? CF has been deployed on one machine, but I didn't see any of the loggregator components (dea_agent, loggregator, trafficcontroller).
We added doppler.firehose to the uaa.clients.cf.scope as the README suggested, but we were still getting unauthorized errors until we added uaa.clients.doppler.{id,secret,authorities}, which was not mentioned in the README.
If an application starts and immediately crashes/quits, anything written to stdout/stderr is not logged, which makes debugging quite hard. On our deployments it seems to affect anything that dies within 10-20 ms.
The workaround is to change the Procfile to include a sleep statement, e.g. web: sleep 1 && command.
I'm not sure if this is the correct place to file this issue, but I figured it was a good start.
This can be replicated by uploading a simple application that outputs something to stdout and immediately quits, such as this example.go:
package main

func main() {
	println("This will not appear in the application logs")
}
Hi,
We have a situation where our Firehose client suddenly stops receiving log messages from the Firehose, without any error indication in any of the components or in the client. We are using the standard noaa client.
It is very easy to reproduce. One needs to deploy a Cloud Foundry system v230 on bosh-lite, push an application that dumps an excessive number of log messages (e.g. 5000 msg/sec), and connect a number of Firehose clients with different subscription IDs. Around 4 clients puts a good load on Doppler and reproduces it in less than a minute. Having the clients dump a msg/sec statistic to stdout makes it easy to track.
After some thorough investigation, we narrowed down the problem to Doppler. There is a race condition in the code that, as unlikely as it may seem, occurs quite often under heavy load. It results in a go-routine waiting on a channel that is never written to.
The problem is caused by the following line of code, which replaces the output channel with a new one:
https://github.com/cloudfoundry/loggregator/blob/develop/src/truncatingbuffer/truncating_buffer.go#L101
Under load, the following line of code manages to consume the content of the old output channel and blocks on it before having the chance to pick up the new one:
https://github.com/cloudfoundry/loggregator/blob/develop/src/doppler/sinks/websocket/websocket_sink.go#L77
It seems unlikely as a race condition, because it requires that the websocket_sink be able to loop through the whole output channel buffer before the following block finishes, but it happens:
https://github.com/cloudfoundry/loggregator/blob/develop/src/truncatingbuffer/truncating_buffer.go#L90-L98
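A stripped-down sketch of the race (illustrative only, not the actual loggregator code): the consumer captures the current output channel, the buffer then swaps in a new one, and new messages land on the new channel while the consumer keeps waiting on the abandoned one.

```go
package main

import "fmt"

func main() {
	output := make(chan string, 1)

	consumerCh := output          // websocket_sink grabs the old channel...
	output = make(chan string, 1) // ...truncating_buffer replaces it
	output <- "message"           // writes now land on the new channel only

	// The consumer's receive on the old channel can never succeed.
	select {
	case m := <-consumerCh:
		fmt.Println("received:", m)
	default:
		fmt.Println("consumer would block forever on the abandoned channel")
	}
}
```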
I am working on a pull request that solves the problem while changing as little code as possible. Unfortunately, the truncating_buffer is used by other packages, and it will take some effort to align those as well.
This Issue is for tracking purposes and is a good place to get any remarks / comments on the matter prior to the Pull Request. (E.g. if there is already something under way or some restriction I should be aware of).
Regards,
Momchil
How can I read the whole data stream from Doppler? For example, how do I read all the data from the router?
We are seeing this issue on the traffic controller when we try to push a new app. Any ideas what could be happening?
{"timestamp":1400165945.950070858,"process_id":1557,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the client - unexpected EOF","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":91,"method":"trafficcontroller.(*handler).watchKeepAlive"}
{"timestamp":1400165945.951148272,"process_id":1557,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the server - EOF - 10.240.6.40:8080","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":77,"method":"trafficcontroller.(*handler).proxyConnectionTo"}
On this same push request we are seeing the following on loggregator's log:
{"timestamp":1400165975.950389862,"process_id":1546,"source":"loggregator","log_level":"info","message":"SinkManager: Sink with channel 10.240.6.45:58622 and identifier %!s(MISSING) requested closing. Closed it.","data":null,"file":"/var/vcap/data/compile/loggregator/loggregator/src/loggregator/sinkserver/sinkmanager/sink_manager.go","line":142,"method":"loggregator/sinkserver/sinkmanager.(*SinkManager).unregisterSink"}
We checked, and UDP is working on port 3456 on both the loggregator traffic controller and loggregator.
The description and corresponding image in https://github.com/cloudfoundry/loggregator#architecture state (highlights are mine)
*Doppler*: Responsible for gathering logs from the Metron agents, storing them in temporary buffers, *and forwarding logs to 3rd party syslog drains.*
However, what I gather from the syslog configuration of the metron agents, especially the forwarding configuration, is that each component forwards its logs to a configured external syslog drain.
Is my understanding correct that architectural description and implementation are different in this case? Or is it that the configuration is actually done on each machine where the metron agent is installed, but only takes effect on doppler?
Thanks!
Since the switch to Go 1.6, you are using the Go-based DNS resolver. This resolver has a few known issues and doesn't work well in many situations with Cloud Foundry. Most of the issues have been fixed in Go 1.7.
For the time being, most if not all Go-based components have been using the C-based DNS resolver by adding export GODEBUG=netdns=cgo to their start scripts.
You can see examples of this in the HM9000 and Diego scripts.
TruncatingBuffer drops the whole buffer when full. This seems needlessly fragile: a small change in workload (one extra message) has a far more destructive impact than it needs to.
A less fragile implementation would discard the newest messages as they arrive. This approach would be better than discarding those at the front of the queue, as the older messages may have information that 'leads up to' the blockage.
An antifragile solution would dynamically resize the buffer when it approaches capacity, up to a larger and more sensible limit.
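The "discard the newest messages" alternative can be sketched with a bounded channel that drops only the incoming message when full. Names and sizes are illustrative, not loggregator's API:

```go
package main

import "fmt"

// boundedBuffer keeps older messages and drops only the newest one when
// full, instead of flushing the entire buffer.
type boundedBuffer struct {
	ch      chan string
	dropped int
}

func newBoundedBuffer(size int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan string, size)}
}

func (b *boundedBuffer) Write(msg string) {
	select {
	case b.ch <- msg: // room available: keep the message
	default:
		b.dropped++ // full: drop only this newest message
	}
}

func main() {
	b := newBoundedBuffer(2)
	for _, m := range []string{"one", "two", "three"} {
		b.Write(m)
	}
	fmt.Println(len(b.ch), b.dropped) // 2 1
}
```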
When I retrieve container metrics from the traffic controller through the API endpoint /apps/APP_ID/containermetrics:
Test with incorrect token...
* Trying 10.244.0.34...
* Connected to doppler.bosh-lite.com (10.244.0.34) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.bosh-lite.com
> GET /apps/63afa04a-c3aa-453a-aac4-c684a963b91e/containermetrics HTTP/1.1
> Host: doppler.bosh-lite.com
> User-Agent: curl/7.43.0
> Accept: */*
> Authorization: bearer non-exist-token
>
< HTTP/1.1 401 Unauthorized
< Content-Length: 52
< Content-Type: text/plain; charset=utf-8
< Date: Mon, 20 Jun 2016 21:54:10 GMT
< Www-Authenticate: Basic
< X-Vcap-Request-Id: 648208c5-d7d0-4ce5-5448-7d74d7c87615
<
* Connection #0 to host doppler.bosh-lite.com left intact
You are not authorized. Error: Invalid authorization
Test with a valid token but a non-existent app ID...
* Trying 10.244.0.34...
* Connected to doppler.bosh-lite.com (10.244.0.34) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.bosh-lite.com
> GET /apps/not-exist-id/containermetrics HTTP/1.1
> Host: doppler.bosh-lite.com
> User-Agent: curl/7.43.0
> Accept: */*
> Authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImxlZ2FjeS10b2tlbi1rZXkiLCJ0eXAiOiJKV1QifQ.eyJqdGkiOiIyYzBhMzQ4YWVmMjY0MGQ4OTQyMWE3MDBhNGFkNTc2MiIsInN1YiI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsInNjb3BlIjpbInJvdXRpbmcucm91dGVyX2dyb3Vwcy5yZWFkIiwiY2xvdWRfY29udHJvbGxlci5yZWFkIiwicGFzc3dvcmQud3JpdGUiLCJjbG91ZF9jb250cm9sbGVyLndyaXRlIiwib3BlbmlkIiwiZG9wcGxlci5maXJlaG9zZSIsInNjaW0ud3JpdGUiLCJzY2ltLnJlYWQiLCJjbG91ZF9jb250cm9sbGVyLmFkbWluIiwidWFhLnVzZXIiXSwiY2xpZW50X2lkIjoiY2YiLCJjaWQiOiJjZiIsImF6cCI6ImNmIiwiZ3JhbnRfdHlwZSI6InBhc3N3b3JkIiwidXNlcl9pZCI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsIm9yaWdpbiI6InVhYSIsInVzZXJfbmFtZSI6ImFkbWluIiwiZW1haWwiOiJhZG1pbiIsInJldl9zaWciOiJmMTZlOGQ1YiIsImlhdCI6MTQ2NjQ1OTY1MSwiZXhwIjoxNDY2NDYwMjUxLCJpc3MiOiJodHRwczovL3VhYS5ib3NoLWxpdGUuY29tL29hdXRoL3Rva2VuIiwiemlkIjoidWFhIiwiYXVkIjpbInNjaW0iLCJjbG91ZF9jb250cm9sbGVyIiwicGFzc3dvcmQiLCJjZiIsInVhYSIsIm9wZW5pZCIsImRvcHBsZXIiLCJyb3V0aW5nLnJvdXRlcl9ncm91cHMiXX0.NMdSU-4ZyKHn-_g2uwVM_PCW4xnrYyi2P2WOZ1B6TxxGTN9QAjRhZ2NcqNSdx-hxJySfXbuBZ9tP7U5s6hTV6Ng58J9ADIwc4qh8twHulra8nFJgJHa_1bOGKcENaNv5SkuJ77inxyd9okJEvIBPseopLWKks5LB4wTXNiCG76I
>
< HTTP/1.1 401 Unauthorized
< Content-Length: 52
< Content-Type: text/plain; charset=utf-8
< Date: Mon, 20 Jun 2016 21:54:11 GMT
< Www-Authenticate: Basic
< X-Vcap-Request-Id: 0f82e985-5d14-49a2-5958-432125166f86
<
* Connection #0 to host doppler.bosh-lite.com left intact
You are not authorized. Error: Invalid authorization
Test with a valid token and an existing app ID...
* Trying 10.244.0.34...
* Connected to doppler.bosh-lite.com (10.244.0.34) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.bosh-lite.com
> GET /apps/63afa04a-c3aa-453a-aac4-c684a963b91e/containermetrics HTTP/1.1
> Host: doppler.bosh-lite.com
> User-Agent: curl/7.43.0
> Accept: */*
> Authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImxlZ2FjeS10b2tlbi1rZXkiLCJ0eXAiOiJKV1QifQ.eyJqdGkiOiIyYzBhMzQ4YWVmMjY0MGQ4OTQyMWE3MDBhNGFkNTc2MiIsInN1YiI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsInNjb3BlIjpbInJvdXRpbmcucm91dGVyX2dyb3Vwcy5yZWFkIiwiY2xvdWRfY29udHJvbGxlci5yZWFkIiwicGFzc3dvcmQud3JpdGUiLCJjbG91ZF9jb250cm9sbGVyLndyaXRlIiwib3BlbmlkIiwiZG9wcGxlci5maXJlaG9zZSIsInNjaW0ud3JpdGUiLCJzY2ltLnJlYWQiLCJjbG91ZF9jb250cm9sbGVyLmFkbWluIiwidWFhLnVzZXIiXSwiY2xpZW50X2lkIjoiY2YiLCJjaWQiOiJjZiIsImF6cCI6ImNmIiwiZ3JhbnRfdHlwZSI6InBhc3N3b3JkIiwidXNlcl9pZCI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsIm9yaWdpbiI6InVhYSIsInVzZXJfbmFtZSI6ImFkbWluIiwiZW1haWwiOiJhZG1pbiIsInJldl9zaWciOiJmMTZlOGQ1YiIsImlhdCI6MTQ2NjQ1OTY1MCwiZXhwIjoxNDY2NDYwMjUwLCJpc3MiOiJodHRwczovL3VhYS5ib3NoLWxpdGUuY29tL29hdXRoL3Rva2VuIiwiemlkIjoidWFhIiwiYXVkIjpbInNjaW0iLCJjbG91ZF9jb250cm9sbGVyIiwicGFzc3dvcmQiLCJjZiIsInVhYSIsIm9wZW5pZCIsImRvcHBsZXIiLCJyb3V0aW5nLnJvdXRlcl9ncm91cHMiXX0.BB_oAdNlpNv7VQYfyUNPNyJncGGmYbxTd6mfsjEe9ruHkMm3gLeJOmdEQ9U_CYsyUiUPXI3N8FPvTOcbDegjyWwmaEg_5nYpByGzz7xsj4rkxwwcsF_iio3qYwPPLoBWebqgbPCu3CG9w8_15qVp5tmzsiaYBjwr4796Xm1h4Tg
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: multipart/x-protobuf; boundary=3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
< Date: Mon, 20 Jun 2016 21:54:10 GMT
< X-Vcap-Request-Id: 8fe343bf-1f90-4e9a-6776-8ff034149828
<
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
DEA 0??̀????b8
$63afa04a-c3aa-453a-aac4-c684a963b91e3?k1??? ???T(j cf-wardenr runner_z1z0?
10.244.0.26
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
DEA 0??ߊ????b8
$63afa04a-c3aa-453a-aac4-c684a963b91eIK7.??? ???O(j cf-wardenr runner_z1z0?
10.244.0.26
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
DEA 0·??????b8
$63afa04a-c3aa-453a-aac4-c684a963b91e?Ix!?? ???R(j cf-wardenr runner_z1z0?
10.244.0.26
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1--
I'm probably not the only person confused by the renaming of loggregator to doppler. Is "loggregator" still a thing, and "doppler" something else?
Should this repo be renamed? Is the traffic controller now the "doppler traffic controller"? Are all instances of "loggregator" (e.g. in the diagrams) now "doppler"? Or is there still a "loggregator" concept?
A GitHub search for "cloudfoundry doppler" returns nothing, which didn't help me figure out the new differences :)
As described in,
https://groups.google.com/a/cloudfoundry.org/forum/#!topic/vcap-dev/n5SimQSbkT0
In my deployment, I'm seeing:
Error while reading from socket ERR, /var/vcap/data/warden/depot/17ilr01935u/jobs/10, EOF
in the logs on the DEA whenever a new app is started up. The rest of my loggregator config appears to be working normally; that is, I've tested TCP connections between the various components and they are able to communicate.
More details (including my BOSH manifest) can be found in the forum.
When viewing the loggregator stream during application startup I don't see my application startup logs until the application is "started".
To reproduce:
Note:
What makes this issue worse is if the application takes even longer to start loggregator appears to only get the last 70-80 seconds worth of logs. Anything before never appears in loggregator.
Can I change it in a config file, e.g. to error or warn?
Customer comments:
Hello,
Can we reconfigure the loggregator buffer from 100 to a higher number, to avoid frequently seeing "log message output too high. we've dropped 100 messages"? We have gone through the explanation at http://support.run.pivotal.io/entries/79609435-Troubleshooting-dropped-or-missing-log-messages, but we believe a higher value may also occasionally benefit Java heap dumps, especially in development spaces.
In the loggregator project I only see dea_agent, loggregator, and trafficcontroller, so I think loggregator can only collect the logs of apps on the DEA. Am I right?
(I'm not sure this is the correct place to ask; if not, please tell me where to post.)
When a CF user runs create-user-provided-service to configure a log forwarding service, they can specify only one endpoint (like syslog://logs.example.com).
Some log services (e.g., Splunk) provide multiple endpoints for HA reasons. Currently a CF user cannot specify multiple endpoints, which means that when something goes wrong with one endpoint, logs will not be forwarded.
Do you have a plan to support multiple endpoints (load balancing)?
Everything that wants to use metron has to have a network.apps property defined on each job. "apps" doesn't make sense for a generic release (I'm not even sure the name makes a ton of sense for cf-release). There is a way to get the default network (if one is specified as default) or to pick the first one otherwise. See:
(Relevant story in BOSH backlog: https://www.pivotaltracker.com/story/show/79158540)
In SinkManager#SendSyslogErrorToLoggregator only one argument is provided. If it contains '%' characters, this results in error messages and missing expected output in the logs.
When in many cases it is not.
Should it perhaps use cc.srv_api_uri instead? That, I believe, includes the preferred protocol and doesn't guess at the host name.
The user-provided syslog-tls and https sinks are configured to ignore certificate validation whenever the ssl.skip_cert_verify property is set in the BOSH deployment. Within cf-release, that property is intended to be set when the haproxy or ELB is configured with a self-signed certificate, and is needed to allow communication within the deployment in that scenario. Generally speaking, that deployment configuration should not impact the certificate verification performed when an application talks to a third-party syslog drain.
Certificate verification for syslog sinks should really have its own control point that is separate and distinct from the deployment's ssl.skip_cert_verify
property. Ideally this is something the end user could control when defining the user provided service.
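A sketch of what a per-drain control point could look like: the verification flag comes from the user-provided service definition rather than the deployment-wide property. The drain struct and field names are hypothetical:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// drain models a user-provided syslog drain; SkipCertVerify would be set
// by the end user when defining the service (hypothetical field).
type drain struct {
	URL            string
	SkipCertVerify bool
}

// tlsConfigFor decides verification per drain, not per deployment.
func tlsConfigFor(d drain) *tls.Config {
	return &tls.Config{InsecureSkipVerify: d.SkipCertVerify}
}

func main() {
	trusted := drain{URL: "syslog-tls://logs.example.com:6514"}
	selfSigned := drain{URL: "syslog-tls://internal.example.com:6514", SkipCertVerify: true}

	fmt.Println(tlsConfigFor(trusted).InsecureSkipVerify)    // false
	fmt.Println(tlsConfigFor(selfSigned).InsecureSkipVerify) // true
}
```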
Hi all, this is not a bug report, just a request.
I'd like to try out Loggregator in my CF installation, can you guys publish a sample of the BOSH deployment YML you are using to deploy Loggregator on run.pivotal.io?
Thanks
Troy.
loggregator (LGR) sometimes filters messages with the following warning:
Log message output too high. We've dropped 100 messages
It occurs when the application is logging a lot of messages in a short time.
That may be appropriate in the event the application is logging GB/sec for several seconds or minutes.
But sometimes the application just logs 200 lines at a time, then stops or goes back to a moderate logging rate.
It may not be appropriate to just drop those messages, as the log messages being dropped are important.
Why is LGR using a static threshold (100 messages) instead of a dynamic or configurable threshold?
Is that possible to configure it on a per application basis? (running in a public cloud)
Would it be possible to change LGR so that it truncates the buffer only when the syslog endpoint fails to keep up with the current rate?
Or, would it be possible to enable the flood protection only when the flood lasts a 'long' time? For instance, if the application is logging 300 lines/sec for 10 seconds, only drop after the 3rd or 4th second, so that we get the content of the flood for the first 2 or 3 seconds. Also, it could send the last buffer when the flood stops, so that we get the 10th second of logs.
How can we discriminate the log messages of each app instance within the same app?
I have read the code carefully, and I find that we cannot tell which app instance a log message comes from.
Doppler's README.md (https://github.com/cloudfoundry/loggregator/blob/develop/src/doppler/README.md) has a description below:
In a redundant CloudFoundry setup, Loggregator can be configured to survive zone failures. Log messages from non-affected zones will still make it to the end user. On AWS, availability zones could be used as redundancy zones. The following is an example of a multi zone setup with two zones.
Two points: the referenced multizone diagram (![Loggregator Diagram](../../docs/loggregator_multizone.png)) should be added. Or else, why don't we delete the description from Doppler's README.md?
This is inspired by this open issue on bosh-release:
cloudfoundry/bosh#1206
The problem appears only when a new VM is provisioned and the job was not deployed before (a new deployment).
Director task 57
Started preparing deployment > Preparing deployment. Done (00:00:03)
Error 100: Unable to render instance groups for deployment. Errors are:
- Unable to render jobs for instance group 'stats_z1'. Errors are:
- Unable to render templates for job 'metron_agent'. Errors are:
- Error filling in template 'syslog_forwarder.conf.erb' (line 44: undefined method `strip' for nil:NilClass)
Removing the stripping of the retrieved IP should solve the issue. The way it is done in the nats template does not cause any issue:
https://github.com/cloudfoundry/cf-release/blob/v235/jobs/nats/templates/nats.conf.erb
<%
def network_config
networks = spec.networks.marshal_dump
_, network = networks.find do |_name, network_spec|
network_spec.default
end
if !network
_, network = networks.first
end
if !network
raise "Could not determine IP via network spec: #{networks}"
end
network
end
def discover_external_ip
network_config.ip
end
def discover_external_hostname
network_config.dns_record_name
end
%>
<% self_ip = discover_external_ip %>
syslog_forwarder.conf.erb contains legacy configuration options which are difficult to get right. The problem is that legacy-style options (for example queue settings) apply to the next action defined after those options and do not apply to the rest. While for static configuration this might be okay, for dynamic configuration (our case) it is very easy to break the expected behavior.
Therefore the following way of configuration:
$ActionResumeRetryCount ...
$ActionQueueType ...
should be replaced by:
action(... action.resumeRetryCount="..." queue.type="...")
More information on the official web site: http://www.rsyslog.com/doc/v8-stable/configuration/action/index.html
Statements modify the next action(s) that is/are defined via legacy syntax after the respective statement. Actions defined via the action() object are not affected by the legacy statements listed here. Use the action() object properties instead.
Tracing Cloud Foundry in a cloud environment can become a challenge, as the end-to-end transaction for a call/message is hard to find in the logs and even harder to stitch into the transaction flow. OpenStack is therefore working on a "global transaction" ID which will allow you to trace and find the logs of each transaction in your cloud environment:
https://blueprints.launchpad.net/nova/+spec/cross-service-request-id
As CF grows and the integration with OpenStack becomes more important, this would be an interesting feature to add in a future loggregator version, or in the log-creation process of each CF component.
Hi!
It seems that MaxMessageSize in the metron agent is hardcoded to 4k. Unfortunately this size is too small, and we would like to change it without forking the release. The problem we're facing is the following: in a CF installation, DEAs send messages that go to the nats_stream_forwarder. Each message contains information about all applications currently running on that DEA (including current state, timestamp, etc.). The nats_stream_forwarder then escapes the JSON that the DEA has sent and logs it. Unfortunately, a DEA running 12 apps produces a log message bigger than 4k, so if there is an ELK stack in the picture, this results in error messages in the log parsers, because these logs are not valid JSON.
To summarise: if the MaxMessageSize property were configurable, everyone would be able to set an appropriate log message size so that all messages can be logged properly, avoiding unnecessary error messages.
Regards,
Martina
Hi,
when do you plan to add support for BOSH zones in this release? A lot of jobs here use zones and currently require manual configuration, meaning you have to create a definition of a job per zone just to be able to manually specify a different zone. This creates excessive bloat in deployment manifests...
We're running release-169. Though I believe this is still an issue in latest since not much has changed in the agent since then.
We've been chasing a bad problem where applications instances seemingly randomly lock up in cloud foundry but don't crash. Causing outages until the application is actively restarted.
We finally were able to duplicate the issue. It appears that when there is an error in logging_stream.go, the agent stops listening on the stream; however, the application unknowingly continues to write to the stream, seemingly into some buffer somewhere. Once this buffer fills up, the application (a Java application in this case) blocks while attempting to write more to the stream. This eventually causes all application threads attempting to write to the stream to freeze, and the application is essentially hung, blocked attempting to write to the stream.
I imagine there are several error situations that could cause this frozen behavior. The one that we were able to duplicate happens when an application attempts to write more than 64K in one line to the stream. When this happens:
{"timestamp":1400852198.287148476,"process_id":20212,"source":"deaagent","log_level":"info","message":"Error while reading from socket OUT, /var/vcap/data/warden/depot/17och3r06s6/jobs/740, bufio.Scanner: token too long","data":null,"file":"/var/vcap/data/compile/dea_logging_agent/loggregator/src/deaagent/loggingstream/logging_stream.go","line":68,"method":"deaagent/loggingstream.func·001"}
Since this scenario is caused by logging too much to the stream, I imagine there are other scenarios that could also cause an error in the same place. It would be great if this code were much more defensive and careful about not causing this hung situation. Perhaps trigger an application instance crash instead of just returning?
Mike
I'd like to provide some custom syslog config and still use the metron agent. Currently, though, the metron agent takes over all syslog forwarding config of vcap* messages, since its config is numbered 00 and contains :programname, startswith, "vcap." ~
Any interest in accepting a PR that either makes discarding all vcap.* messages optional, changes the syslog_forwarder config to something like 05 instead of 00, or breaks the syslog forwarder config out into a separate job, so I can simply replace that job with my custom one?
I'd be happy to submit a PR for any of the above. Thoughts?
RSyslog supports failover by listing additional syslog servers to use when the primary fails, but Cloud Foundry only allows specifying a single server.
This means the log collector must provide redundancy on its own.
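For reference, this is the kind of failover rsyslog itself supports but a single Cloud Foundry drain URL cannot express; the hostnames are placeholders:

```
# Send everything to the primary drain; only use the fallback
# action when the previous (primary) action is suspended.
*.* @@primary.example.com:514
$ActionExecOnlyWhenPreviousIsSuspended on
& @@secondary.example.com:514
$ActionExecOnlyWhenPreviousIsSuspended off
```

Supporting multiple drain URLs per binding (or failover semantics in the drain) would let operators get the same behavior without running their own forwarding layer.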
With cloudfoundry/loggregator@8325aa7 the collectorregistrar was removed from the TrafficController. I didn't find the concrete commit, but I assume there is a similar change for Doppler as well.
These registration messages are not only used by the collector to check /varz endpoints, but also to check /healthz.
There seems to be no replacement for the /healthz endpoint registration, so we no longer get healthy metrics for the Doppler and TrafficController components. Can we get the healthy metric back, please?
Hi,
It seems that Loggregator opens new connections to all the Dopplers for each new /firehose/<subscription> connection.
A new connector.Connect method is called for each new connection:
https://github.com/cloudfoundry/loggregator/blob/develop/src/trafficcontroller/dopplerproxy/doppler_proxy.go#L163
which in turn creates its own Doppler connection list:
https://github.com/cloudfoundry/loggregator/blob/develop/src/trafficcontroller/channel_group_connector/channel_group_connector.go#L42
This means that given C firehose clients, L loggregator instances, and D doppler instances, you end up with roughly C x D / L outbound connections from each Loggregator and C inbound connections to each Doppler.
Is there a reason those Doppler connections are not reused? Reuse would result in D outbound connections from each Loggregator and L inbound connections to each Doppler. Assuming that clients usually outnumber Loggregators, this should give better numbers. You would have to do some log mapping / duplication in Loggregator, which could increase the overhead there, but I have seen the CPU load on Doppler peak much quicker than on Loggregator right now, so that could balance the load.
Regards,
Momchil
As part of the CloudFoundry -> Logsearch integration I'm working on - cloudfoundry-community/logsearch-for-cloudfoundry#21 - I'd like the Log Analysis dashboards to include human-readable names for log events.
Doppler tags all logs with a unique cf_app_id UUID, e.g.:
{"cf_app_id":"afc7d161-b5d5-44f2-a2c8-63870ff8f2db","level":"info","message_type":"OUT","msg":"goexample.apps.54.183.203.97.xip.io - [01/04/2015:10:22:50 +0000] \"GET / HTTP/1.1\" 200 7 \"-\" \"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)\" 10.10.2.6:60637 x_forwarded_for:\"46.165.195.139, 10.10.2.6\" vcap_request_id:9f598459-f756-4ea4-58e5-e9f212ff328f response_time:0.002463550 app_id:afc7d161-b5d5-44f2-a2c8-63870ff8f2db\n","source_instance":"0","source_type":"RTR","time":"2015-04-01T10:22:50Z"}
UUIDs, however, make for crappy dashboards.
Are there any plans to include additional information alongside the cf_app_id with each doppler firehose log event? Specifically:
cf_app_name
cf_space_id and cf_space_name
cf_org_id and cf_org_name
Thanks!
The ExponentialRetryStrategy defaults to a delay of 1ms.
If I'm reading the code correctly, its use in SyslogSink means that each iteration of the loop starts with a 1ms delay. If that's the case, then surely the loop should be arranged so that it only delays after a connection has failed, instead of delaying up front?
I haven't been able to find the code that deals with incoming messages, but if it is not throttled in the same manner, that would explain why the TruncatingBuffer fills so easily.
We've noticed that our metron agents have many panics throughout the day, at random times, with the following error:
panic: No enabled dopplers available, check your manifest to make sure you have dopplers listening for the following protocols [udp]
After looking into it, we noticed that the panics often correlated with etcd re-elections, which were happening frequently due to slow disk I/O on the etcd servers. Adjusting the election timeout has helped a lot, but we still see occasional panics. Metron agents seem very fragile now if they have any interruption with etcd, or if the dopplers fail to update their entries for some reason.
Hi All,
I have deployed CF170 and am trying to connect a syslog drain to my loggregator. The end result is that the loggregator passes a few messages and then falls over.
Before binding the drain I can run the nyet tests against my CF; after binding the drain and pushing some logs as described below, the nyet tests fail.
https://github.com/FreightTrain/play-cf-env
nc -l 1514
cf cups syslog -l syslog://ip-of-drain-machine:1514
monit restart all
On the loggregator:
[root@logstash-logger-01 ~]# nc -l 1514
205 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value akka.actor.deployment.default.routees.paths as String
181 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value akka.event-handlers as String
177 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value akka.extensions as String
186 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value play.akka.event-handlers as String
168 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Rendering time: Fri May 23 15:41:39 UTC 2014
413 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [RTR] - - troy-test.paas.dev.col.tx.cpgpaas.net - [23/05/2014:15:41:39 +0000] "GET / HTTP/1.1" 200 33469 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0" 192.168.80.26:52908 vcap_request_id:4dd535a398530b89fc06791f6fde0d8a response_time:0.049732475 app_id:e62e36d5-a2ab-481d-bfc3-3994b8066770
{"timestamp":1400859682.107874155,"process_id":6843,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the server - unexpected EOF - 192.168.80.53:8080","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":77,"method":"trafficcontroller.(*handler).proxyConnectionTo"}
{"timestamp":1400859682.108934164,"process_id":6843,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the client - read tcp 192.168.80.29:21971: use of closed network connection","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":91,"method":"trafficcontroller.(*handler).watchKeepAlive"}