cloudfoundry / loggregator-release
Cloud Native Logging
License: Apache License 2.0
We have a number of users who would like to use TLS mutual authentication with their syslog drains, but Doppler doesn't currently support that use case. Would you be willing to accept a pull request to add the support?
Our proposal is to use the credentials map from the user-provided service definition to hold the client certificate, private key, and (optionally) a certificate chain for the drain. When these properties are provided, we will carry them into Doppler and use them to set up the tls.Config properly for the handshake, mutual auth, and certificate verification.
This would require an updated syslog_drain_urls API in the cloud controller and some changes to the syslog drain binder, the TLS writer, the HTTPS writer, and some ancillary plumbing.
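A minimal sketch of what building the drain's tls.Config from those credentials could look like. The function name and shape are ours for illustration, not loggregator's actual API:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// buildDrainTLSConfig sketches how the proposed credentials map fields
// (client certificate, private key, and an optional CA chain) could be
// turned into a tls.Config for mutual TLS with a drain.
func buildDrainTLSConfig(certPEM, keyPEM, caPEM []byte) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}
	cfg := &tls.Config{
		// Presented to the drain during the mutual-auth handshake.
		Certificates: []tls.Certificate{cert},
	}
	if len(caPEM) > 0 {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("invalid CA chain PEM")
		}
		// Used to verify the drain's server certificate.
		cfg.RootCAs = pool
	}
	return cfg, nil
}

func main() {
	_, err := buildDrainTLSConfig([]byte("not a cert"), []byte("not a key"), nil)
	fmt.Println("error for invalid PEM:", err != nil)
}
```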
Loggregatorlib used to build under Windows, but with the recent change in 80d688e we seem to have dropped the ability to compile, since we didn't carry across the fake Windows implementation from https://github.com/cloudfoundry/loggregatorlib/blob/2b68a3305af889947f0f8c11d0aca4817621f983/cfcomponent/logging_windows.go
Maybe we need to get CI for this project on Windows, since the Garden Windows team depends on it.
Hi, I downloaded loggregator and tried to build it, but the build fails with the following messages:
zcy@ubuntu:~/go/go_workspace/src/loggregator/loggregator$ go build
../store/app_service_store_watcher.go:77: cannot use event.Node (type *storeadapter.StoreNode) as type storeadapter.StoreNode in argument to appServiceFromStoreNode
../store/app_service_store_watcher.go:79: cannot use event.Node (type *storeadapter.StoreNode) as type storeadapter.StoreNode in argument to w.deleteEvent
../store/app_service_store_watcher.go:81: cannot use event.Node (type *storeadapter.StoreNode) as type storeadapter.StoreNode in argument to w.deleteEvent
Is something wrong here?
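For what it's worth, these errors are a pointer/value mismatch: `event.Node` is a `*storeadapter.StoreNode`, but the functions take a `StoreNode` value. A stand-in illustration with a dummy type (not the real storeadapter package):

```go
package main

import "fmt"

// StoreNode stands in for storeadapter.StoreNode to show the mismatch.
type StoreNode struct{ Key string }

// Takes a StoreNode value, like appServiceFromStoreNode in the error message.
func appServiceFromStoreNode(n StoreNode) string { return n.Key }

func main() {
	node := &StoreNode{Key: "/loggregator/services/org/app"}
	// appServiceFromStoreNode(node) // would not compile: *StoreNode vs StoreNode
	fmt.Println(appServiceFromStoreNode(*node)) // dereferencing fixes the mismatch
}
```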
Hi,
I'm running a v205 CF deployment with ~350 application instances.
I just found that the doppler process on the doppler node and the metron_agent process on the gorouter node used too much memory (90.9% and 13.3% respectively), as shown below.
Actually, the memory usage of metron_agent was also >90%; I restarted it yesterday, and after just 14 hours the usage went back up to 13.3%.
There are two doppler nodes and three gorouter nodes.
doppler:
status Running
monitoring status Monitored
pid 21648
parent pid 1
uptime 2d 11h 10m
children 0
memory kilobytes 14948028
memory kilobytes total 14948028
memory percent 90.9%
memory percent total 90.9%
cpu percent 12.5%
cpu percent total 12.5%
metron_agent:
status Running
monitoring status Monitored
pid 28995
parent pid 1
uptime 14h 10m
children 2
memory kilobytes 2195608
memory kilobytes total 2195608
memory percent 13.3%
memory percent total 13.3%
cpu percent 6.8%
cpu percent total 6.8%
running CF 210.2
The /var/vcap/sys/log/loggregator_trafficcontroller/loggregator_trafficcontroller.stderr.log file has frequent occurrences of this type of snippet:
2015/06/16 18:16:13 http: panic serving :: runtime error:
invalid memory address or nil pointer dereference
goroutine 1614338 [running]:
net/http.func·011()
/usr/local/go/src/pkg/net/http/server.go:1100 +0xb7
runtime.panic(0x7c1a40, 0xa4a993)
/usr/local/go/src/pkg/runtime/panic.c:248 +0x18d
github.com/cloudfoundry/loggregatorlib/server/handlers.(*httpHandler).ServeHTTP(0xc209a61e30,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12bad0)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/
github.com/cloudfoundry/loggregatorlib/server/handlers/http_handler.go:29
+0x2f5
trafficcontroller/dopplerproxy.(*Proxy).serveWithDoppler(0xc2080ac0d0,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12bad0, 0xc20aaacd6b, 0xa,
0xc20aaacd46, 0x24, 0x0, 0x12a05f200, ...)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:181
+0x152
trafficcontroller/dopplerproxy.(*Proxy).serveAppLogs(0xc2080ac0d0,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12bad0)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:170
+0x849
trafficcontroller/dopplerproxy.(*Proxy).ServeHTTP(0xc2080ac0d0,
0x7f5171a3b288, 0xc20abf5ae0, 0xc20b12b930)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:86
+0x4d9
net/http.serverHandler.ServeHTTP(0xc2080046c0, 0x7f5171a3b288,
0xc20abf5ae0, 0xc20b12b930)
/usr/local/go/src/pkg/net/http/server.go:1673 +0x19f
net/http.(*conn).serve(0xc20a8c6f00)
/usr/local/go/src/pkg/net/http/server.go:1174 +0xa7e
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1721 +0x313
2015/06/22 11:28:58 http: panic serving :: runtime error:
invalid memory address or nil pointer dereference
goroutine 2034572 [running]:
net/http.func·011()
/usr/local/go/src/pkg/net/http/server.go:1100 +0xb7
runtime.panic(0x7c1a40, 0xa4a993)
/usr/local/go/src/pkg/runtime/panic.c:248 +0x18d
github.com/cloudfoundry/loggregatorlib/server/handlers.(*httpHandler).ServeHTTP(0xc20bf2d820,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9c70)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/
github.com/cloudfoundry/loggregatorlib/server/handlers/http_handler.go:29
+0x2f5
trafficcontroller/dopplerproxy.(*Proxy).serveWithDoppler(0xc2080ac0d0,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9c70, 0xc20c96e1eb, 0xa,
0xc20c96e1c6, 0x24, 0x0, 0x12a05f200, ...)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:181
+0x152
trafficcontroller/dopplerproxy.(*Proxy).serveAppLogs(0xc2080ac0d0,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9c70)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:170
+0x849
trafficcontroller/dopplerproxy.(*Proxy).ServeHTTP(0xc2080ac0d0,
0x7f5171a3b288, 0xc20d33a000, 0xc20c8c9ad0)
/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/dopplerproxy/doppler_proxy.go:86
+0x4d9
net/http.serverHandler.ServeHTTP(0xc2080046c0, 0x7f5171a3b288,
0xc20d33a000, 0xc20c8c9ad0)
/usr/local/go/src/pkg/net/http/server.go:1673 +0x19f
net/http.(*conn).serve(0xc20c9cc680)
/usr/local/go/src/pkg/net/http/server.go:1174 +0xa7e
created by net/http.(*Server).Serve
/usr/local/go/src/pkg/net/http/server.go:1721 +0x313
At the moment we are running load tests against our Cloud Foundry deployment. During the tests our system produces around 12000 logs/sec. We have configured syslog_daemon_config to forward all logs to ELK. At some point we stop getting any logs, and only running monit restart metron_agent on each Cloud Foundry instance resumes log forwarding.
I believe the bottleneck is somewhere between metron_agent and doppler. But as I understand it, with the syslog_daemon_config property enabled, metron_agent only adds an rsyslog template to the /etc/rsyslog.d folder, and all component logs must be forwarded via the rsyslog daemon. So it is not clear to me how monit restart metron_agent can fix the log forwarding issue.
Hi, using syslog_drain_binder I noticed it does not report the instance number (as the CLI does for me):
159 <14>1 2015-11-09T12:03:52.628411+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.mem=1507328 #timestamp=Mon Nov 09 12:03:52 UTC 2015
170 <14>1 2015-11-09T12:03:52.628524+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.heap.committed=1507328 #timestamp=Mon Nov 09 12:03:52 UTC 2015
164 <14>1 2015-11-09T12:03:52.628477+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.mem.free=1113654 #timestamp=Mon Nov 09 12:03:52 UTC 2015
161 <14>1 2015-11-09T12:03:52.628492+00:00 loggregator b563cf61-a5ba-4540-9734-7ee0e4c5add4 [APP] - - sun.java.processors=32 #timestamp=Mon Nov 09 12:03:52 UTC 2015
The cf CLI reports this as e.g. [APP/0].
Is this a bug, or are there some knobs to tune this?
How do I use loggregator in CF? CF has been deployed on one machine, but I didn't see any of the loggregator components (dea_agent, loggregator, trafficcontroller).
We added doppler.firehose to the uaa.clients.cf.scope as the README suggested, but we were still getting unauthorized errors until we added uaa.clients.doppler.{id,secret,authorities}, which was not mentioned in the README.
If an application starts and immediately crashes/quits, anything written to stdout/stderr is not logged, which makes debugging quite hard. On our deployments it seems to affect anything that dies within 10-20 ms.
The workaround is to change the Procfile to include a sleep statement, e.g. web: sleep 1 && command.
I'm not sure if this is the correct place to file this issue, but I figured it was a good start.
This can be replicated by uploading a simple application that outputs something to stdout and immediately quits, such as this example.go:
package main

func main() {
	println("This will not appear in the application logs")
}
Hi,
We have a situation where our Firehose client suddenly stops receiving log messages from the Firehose, without any error indication in any of the components or in the client. We are using the standard noaa client.
It is very easy to reproduce. One needs to deploy a Cloud Foundry system v230 on bosh-lite, push an application that dumps an excessive number of log messages (e.g. 5000 msg/sec), and connect a number of Firehose clients with different subscription IDs. Around 4 clients puts a good load on Doppler and reproduces it in less than a minute. Having the clients dump a msg/sec statistic to stdout makes it easy to track.
After some thorough investigation, we narrowed down the problem to Doppler. There is a race condition in the code that, as unlikely as it may seem, occurs quite often under heavy load. It results in a go-routine waiting on a channel that is never written to.
The problem is caused by the following line of code, which replaces the output channel with a new one:
https://github.com/cloudfoundry/loggregator/blob/develop/src/truncatingbuffer/truncating_buffer.go#L101
Under load, the following line of code manages to consume the content of the old output channel and blocks on it before having the chance to pick up the new one:
https://github.com/cloudfoundry/loggregator/blob/develop/src/doppler/sinks/websocket/websocket_sink.go#L77
It seems unlikely as a race condition, because it requires that the websocket_sink be able to loop through the whole output channel buffer before the following block finishes, but it happens:
https://github.com/cloudfoundry/loggregator/blob/develop/src/truncatingbuffer/truncating_buffer.go#L90-L98
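A stripped-down sketch of the race (illustrative only, not the actual loggregator code): the consumer captures the current output channel, the buffer then swaps in a new one, and new messages land on the new channel while the consumer keeps waiting on the abandoned one.

```go
package main

import "fmt"

func main() {
	output := make(chan string, 1)

	consumerCh := output          // websocket_sink grabs the old channel...
	output = make(chan string, 1) // ...truncating_buffer replaces it
	output <- "message"           // writes now land on the new channel only

	// The consumer's receive on the old channel can never succeed.
	select {
	case m := <-consumerCh:
		fmt.Println("received:", m)
	default:
		fmt.Println("consumer would block forever on the abandoned channel")
	}
}
```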
I am working on a pull request that solves the problem while changing as little code as possible. Unfortunately, the truncating_buffer is used by other packages, and it will take some effort to align those as well.
This Issue is for tracking purposes and is a good place to get any remarks / comments on the matter prior to the Pull Request. (E.g. if there is already something under way or some restriction I should be aware of).
Regards,
Momchil
How can I read the whole data stream from Doppler? For example, how do I read all the data from the router?
We are seeing this issue on the traffic controller when we try to push a new app. Any ideas what could be happening?
{"timestamp":1400165945.950070858,"process_id":1557,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the client - unexpected EOF","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":91,"method":"trafficcontroller.(*handler).watchKeepAlive"}
{"timestamp":1400165945.951148272,"process_id":1557,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the server - EOF - 10.240.6.40:8080","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":77,"method":"trafficcontroller.(*handler).proxyConnectionTo"}
On this same push request we are seeing the following on loggregator's log:
{"timestamp":1400165975.950389862,"process_id":1546,"source":"loggregator","log_level":"info","message":"SinkManager: Sink with channel 10.240.6.45:58622 and identifier %!s(MISSING) requested closing. Closed it.","data":null,"file":"/var/vcap/data/compile/loggregator/loggregator/src/loggregator/sinkserver/sinkmanager/sink_manager.go","line":142,"method":"loggregator/sinkserver/sinkmanager.(*SinkManager).unregisterSink"}
We checked, and UDP is working on port 3456 on both the loggregator traffic controller and loggregator.
The description and corresponding image in https://github.com/cloudfoundry/loggregator#architecture state (highlights are mine)
*Doppler*: Responsible for gathering logs from the Metron agents, storing them in temporary buffers, *and forwarding logs to 3rd party syslog drains.*
However, what I gather from the syslog configuration of the metron agents, especially the forwarding configuration, is that each component forwards its logs to a configured external syslog drain.
Is my understanding correct that architectural description and implementation are different in this case? Or is it that the configuration is actually done on each machine where the metron agent is installed, but only takes effect on doppler?
Thanks!
Since the switch to Go 1.6, you are using the Go-based DNS resolver. This resolver has a few known issues and doesn't work well in many situations with Cloud Foundry. Most of the issues have been fixed in Go 1.7.
For the time being, most if not all Go-based components have been using the C-based DNS resolver by adding export GODEBUG=netdns=cgo to their start scripts.
You can see examples of this in the HM9000 and Diego scripts.
TruncatingBuffer drops the whole buffer when full. This seems needlessly fragile: a small change in workload (one extra message) has a far more destructive impact than it needs to.
A less fragile implementation would discard the newest messages as they arrive. This approach would be better than discarding those at the front of the queue, as the older messages may have information that 'leads up to' the blockage.
An antifragile solution would dynamically resize the buffer when it approaches capacity, up to a larger and more sensible limit.
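The "discard the newest messages" alternative can be sketched with a bounded channel that drops only the incoming message when full. Names and sizes are illustrative, not loggregator's API:

```go
package main

import "fmt"

// boundedBuffer keeps older messages and drops only the newest one when
// full, instead of flushing the entire buffer.
type boundedBuffer struct {
	ch      chan string
	dropped int
}

func newBoundedBuffer(size int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan string, size)}
}

func (b *boundedBuffer) Write(msg string) {
	select {
	case b.ch <- msg: // room available: keep the message
	default:
		b.dropped++ // full: drop only this newest message
	}
}

func main() {
	b := newBoundedBuffer(2)
	for _, m := range []string{"one", "two", "three"} {
		b.Write(m)
	}
	fmt.Println(len(b.ch), b.dropped) // 2 1
}
```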
When I retrieve container metrics from the traffic controller through the API endpoint /apps/APP_ID/containermetrics:
Test with incorrect token...
* Trying 10.244.0.34...
* Connected to doppler.bosh-lite.com (10.244.0.34) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.bosh-lite.com
> GET /apps/63afa04a-c3aa-453a-aac4-c684a963b91e/containermetrics HTTP/1.1
> Host: doppler.bosh-lite.com
> User-Agent: curl/7.43.0
> Accept: */*
> Authorization: bearer non-exist-token
>
< HTTP/1.1 401 Unauthorized
< Content-Length: 52
< Content-Type: text/plain; charset=utf-8
< Date: Mon, 20 Jun 2016 21:54:10 GMT
< Www-Authenticate: Basic
< X-Vcap-Request-Id: 648208c5-d7d0-4ce5-5448-7d74d7c87615
<
* Connection #0 to host doppler.bosh-lite.com left intact
You are not authorized. Error: Invalid authorization
Test with a valid token but a non-existent app ID...
* Trying 10.244.0.34...
* Connected to doppler.bosh-lite.com (10.244.0.34) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.bosh-lite.com
> GET /apps/not-exist-id/containermetrics HTTP/1.1
> Host: doppler.bosh-lite.com
> User-Agent: curl/7.43.0
> Accept: */*
> Authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImxlZ2FjeS10b2tlbi1rZXkiLCJ0eXAiOiJKV1QifQ.eyJqdGkiOiIyYzBhMzQ4YWVmMjY0MGQ4OTQyMWE3MDBhNGFkNTc2MiIsInN1YiI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsInNjb3BlIjpbInJvdXRpbmcucm91dGVyX2dyb3Vwcy5yZWFkIiwiY2xvdWRfY29udHJvbGxlci5yZWFkIiwicGFzc3dvcmQud3JpdGUiLCJjbG91ZF9jb250cm9sbGVyLndyaXRlIiwib3BlbmlkIiwiZG9wcGxlci5maXJlaG9zZSIsInNjaW0ud3JpdGUiLCJzY2ltLnJlYWQiLCJjbG91ZF9jb250cm9sbGVyLmFkbWluIiwidWFhLnVzZXIiXSwiY2xpZW50X2lkIjoiY2YiLCJjaWQiOiJjZiIsImF6cCI6ImNmIiwiZ3JhbnRfdHlwZSI6InBhc3N3b3JkIiwidXNlcl9pZCI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsIm9yaWdpbiI6InVhYSIsInVzZXJfbmFtZSI6ImFkbWluIiwiZW1haWwiOiJhZG1pbiIsInJldl9zaWciOiJmMTZlOGQ1YiIsImlhdCI6MTQ2NjQ1OTY1MSwiZXhwIjoxNDY2NDYwMjUxLCJpc3MiOiJodHRwczovL3VhYS5ib3NoLWxpdGUuY29tL29hdXRoL3Rva2VuIiwiemlkIjoidWFhIiwiYXVkIjpbInNjaW0iLCJjbG91ZF9jb250cm9sbGVyIiwicGFzc3dvcmQiLCJjZiIsInVhYSIsIm9wZW5pZCIsImRvcHBsZXIiLCJyb3V0aW5nLnJvdXRlcl9ncm91cHMiXX0.NMdSU-4ZyKHn-_g2uwVM_PCW4xnrYyi2P2WOZ1B6TxxGTN9QAjRhZ2NcqNSdx-hxJySfXbuBZ9tP7U5s6hTV6Ng58J9ADIwc4qh8twHulra8nFJgJHa_1bOGKcENaNv5SkuJ77inxyd9okJEvIBPseopLWKks5LB4wTXNiCG76I
>
< HTTP/1.1 401 Unauthorized
< Content-Length: 52
< Content-Type: text/plain; charset=utf-8
< Date: Mon, 20 Jun 2016 21:54:11 GMT
< Www-Authenticate: Basic
< X-Vcap-Request-Id: 0f82e985-5d14-49a2-5958-432125166f86
<
* Connection #0 to host doppler.bosh-lite.com left intact
You are not authorized. Error: Invalid authorization
Test with a valid token and an existing app ID...
* Trying 10.244.0.34...
* Connected to doppler.bosh-lite.com (10.244.0.34) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.bosh-lite.com
> GET /apps/63afa04a-c3aa-453a-aac4-c684a963b91e/containermetrics HTTP/1.1
> Host: doppler.bosh-lite.com
> User-Agent: curl/7.43.0
> Accept: */*
> Authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImxlZ2FjeS10b2tlbi1rZXkiLCJ0eXAiOiJKV1QifQ.eyJqdGkiOiIyYzBhMzQ4YWVmMjY0MGQ4OTQyMWE3MDBhNGFkNTc2MiIsInN1YiI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsInNjb3BlIjpbInJvdXRpbmcucm91dGVyX2dyb3Vwcy5yZWFkIiwiY2xvdWRfY29udHJvbGxlci5yZWFkIiwicGFzc3dvcmQud3JpdGUiLCJjbG91ZF9jb250cm9sbGVyLndyaXRlIiwib3BlbmlkIiwiZG9wcGxlci5maXJlaG9zZSIsInNjaW0ud3JpdGUiLCJzY2ltLnJlYWQiLCJjbG91ZF9jb250cm9sbGVyLmFkbWluIiwidWFhLnVzZXIiXSwiY2xpZW50X2lkIjoiY2YiLCJjaWQiOiJjZiIsImF6cCI6ImNmIiwiZ3JhbnRfdHlwZSI6InBhc3N3b3JkIiwidXNlcl9pZCI6IjllODkwYzk5LTJhZjctNDc4Mi05NmFkLTIyNWNhMGVhMGI2MiIsIm9yaWdpbiI6InVhYSIsInVzZXJfbmFtZSI6ImFkbWluIiwiZW1haWwiOiJhZG1pbiIsInJldl9zaWciOiJmMTZlOGQ1YiIsImlhdCI6MTQ2NjQ1OTY1MCwiZXhwIjoxNDY2NDYwMjUwLCJpc3MiOiJodHRwczovL3VhYS5ib3NoLWxpdGUuY29tL29hdXRoL3Rva2VuIiwiemlkIjoidWFhIiwiYXVkIjpbInNjaW0iLCJjbG91ZF9jb250cm9sbGVyIiwicGFzc3dvcmQiLCJjZiIsInVhYSIsIm9wZW5pZCIsImRvcHBsZXIiLCJyb3V0aW5nLnJvdXRlcl9ncm91cHMiXX0.BB_oAdNlpNv7VQYfyUNPNyJncGGmYbxTd6mfsjEe9ruHkMm3gLeJOmdEQ9U_CYsyUiUPXI3N8FPvTOcbDegjyWwmaEg_5nYpByGzz7xsj4rkxwwcsF_iio3qYwPPLoBWebqgbPCu3CG9w8_15qVp5tmzsiaYBjwr4796Xm1h4Tg
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: multipart/x-protobuf; boundary=3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
< Date: Mon, 20 Jun 2016 21:54:10 GMT
< X-Vcap-Request-Id: 8fe343bf-1f90-4e9a-6776-8ff034149828
<
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
DEA 0??̀????b8
$63afa04a-c3aa-453a-aac4-c684a963b91e3?k1??? ???T(j cf-wardenr runner_z1z0?
10.244.0.26
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
DEA 0??ߊ????b8
$63afa04a-c3aa-453a-aac4-c684a963b91eIK7.??? ???O(j cf-wardenr runner_z1z0?
10.244.0.26
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1
DEA 0·??????b8
$63afa04a-c3aa-453a-aac4-c684a963b91e?Ix!?? ???R(j cf-wardenr runner_z1z0?
10.244.0.26
--3739652dcf4198b311c9c29b345f7be0ab4f8acc659273f97415ebb916e1--
I'm probably not the only person confused by the renaming of loggregator to doppler. Is "loggregator" still a thing, and "doppler" something else?
Should this repo be renamed? Is the traffic controller now the "doppler traffic controller"? Are all instances of "loggregator" (e.g. in the diagrams) now "doppler"? Or is there still a "loggregator" concept?
A GitHub search for "cloudfoundry doppler" returns nothing, which didn't help me figure out the new differences :)
As described in,
https://groups.google.com/a/cloudfoundry.org/forum/#!topic/vcap-dev/n5SimQSbkT0
In my deployment, I'm seeing:
Error while reading from socket ERR, /var/vcap/data/warden/depot/17ilr01935u/jobs/10, EOF
in the logs on the DEA whenever a new app is started up. The rest of my loggregator config appears to be working normally; that is, I've tested TCP connections between the various components and they are able to communicate.
More details (including my BOSH manifest) can be found in the forum.
When viewing the loggregator stream during application startup I don't see my application startup logs until the application is "started".
To reproduce:
Note:
What makes this issue worse is if the application takes even longer to start loggregator appears to only get the last 70-80 seconds worth of logs. Anything before never appears in loggregator.
Can I change it in a config file, e.g. to error or warn?
Customer comments:
Hello,
Can we reconfigure the loggregator buffer from 100 to a higher number, to avoid frequently seeing "log message output too high. we've dropped 100 messages"? We have gone through the explanation at http://support.run.pivotal.io/entries/79609435-Troubleshooting-dropped-or-missing-log-messages, but we believe a higher value may also occasionally benefit Java heap dumps, especially in development spaces.
In the loggregator project I only see dea_agent, loggregator, and trafficcontroller, so I think loggregator can only collect the logs of apps on the DEA. Am I right?
(I'm not sure this is the correct place to ask; if not, please tell me where to post.)
When a CF user runs create-user-provided-service to configure a log forwarding service, they can specify only one endpoint (like syslog://logs.example.com).
Some log services (e.g., Splunk) provide multiple endpoints for HA reasons. Currently a CF user cannot specify multiple endpoints, which means that when something goes wrong with one endpoint, logs will not be forwarded.
Do you have a plan to support multiple endpoints (load balancing)?
Everything that wants to use metron has to have a network.apps property defined on each job. "apps" doesn't make sense for a generic release (I'm not even sure the name makes a ton of sense for cf-release). There is a way to get the default network (if one is specified as default) or to pick the first one otherwise. See:
(Relevant story in BOSH backlog: https://www.pivotaltracker.com/story/show/79158540)
In SinkManager#SendSyslogErrorToLoggregator only one argument is provided. If it contains '%' characters, this results in error messages and missing expected output in the logs.
When in many cases it is not.
Should it perhaps use cc.srv_api_uri instead? That, I believe, includes the preferred protocol and doesn't guess at the host name.
The user-provided syslog-tls and https sinks are configured to ignore certificate validation whenever the ssl.skip_cert_verify property is set in the BOSH deployment. Within cf-release, that property is intended to be set when the haproxy or ELB is configured with a self-signed certificate, and is needed to allow communication within the deployment in that scenario. Generally speaking, that deployment configuration should not impact the certificate verification performed when an application talks to a third-party syslog drain.
Certificate verification for syslog sinks should really have its own control point that is separate and distinct from the deployment's ssl.skip_cert_verify
property. Ideally this is something the end user could control when defining the user provided service.
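A sketch of what a per-drain control point could look like: the verification flag comes from the user-provided service definition rather than the deployment-wide property. The drain struct and field names are hypothetical:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// drain models a user-provided syslog drain; SkipCertVerify would be set
// by the end user when defining the service (hypothetical field).
type drain struct {
	URL            string
	SkipCertVerify bool
}

// tlsConfigFor decides verification per drain, not per deployment.
func tlsConfigFor(d drain) *tls.Config {
	return &tls.Config{InsecureSkipVerify: d.SkipCertVerify}
}

func main() {
	trusted := drain{URL: "syslog-tls://logs.example.com:6514"}
	selfSigned := drain{URL: "syslog-tls://internal.example.com:6514", SkipCertVerify: true}

	fmt.Println(tlsConfigFor(trusted).InsecureSkipVerify)    // false
	fmt.Println(tlsConfigFor(selfSigned).InsecureSkipVerify) // true
}
```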
Hi all, this is not a bug report, just a request.
I'd like to try out Loggregator in my CF installation, can you guys publish a sample of the BOSH deployment YML you are using to deploy Loggregator on run.pivotal.io?
Thanks
Troy.
loggregator (LGR) sometimes filters messages with the following warning:
Log message output too high. We've dropped 100 messages
It occurs when the application is logging a lot of messages in a short time.
That may be appropriate in the event the application is logging GB/sec for several seconds or minutes.
But sometimes the application just logs 200 lines at a time, then stops or goes back to a moderate logging rate.
It may not be appropriate to just drop those messages, as the log messages being dropped are important.
Why is LGR using a static threshold (100 messages) instead of a dynamic or configurable threshold?
Is that possible to configure it on a per application basis? (running in a public cloud)
Would it be possible to change LGR so that it truncates the buffer only when the syslog endpoint fails to keep up with the current rate?
Or, would it be possible to enable the flood protection only when the flood lasts a 'long' time? For instance, if the application is logging 300 lines/sec for 10 seconds, only drop after the 3rd or 4th second, so that we get the content of the flood for the first 2 or 3 seconds. Also, it could send the last buffer when the flood stops, so that we get the 10th second of logs.
How can we discriminate the log messages of each app instance within the same app?
I have read the code carefully, and I find that we cannot tell which app instance a log message comes from.
Doppler's README.md (https://github.com/cloudfoundry/loggregator/blob/develop/src/doppler/README.md) has a description below:
In a redundant CloudFoundry setup, Loggregator can be configured to survive zone failures. Log messages from non-affected zones will still make it to the end user. On AWS, availability zones could be used as redundancy zones. The following is an example of a multi zone setup with two zones.
Two points: the referenced multizone diagram (![Loggregator Diagram](../../docs/loggregator_multizone.png)) should be added. Or else, why don't we delete the description from Doppler's README.md?
This is inspired by this open issue on bosh-release:
cloudfoundry/bosh#1206
The problem appears only when a new VM is provisioned and the job was not deployed before (a new deployment).
Director task 57
Started preparing deployment > Preparing deployment. Done (00:00:03)
Error 100: Unable to render instance groups for deployment. Errors are:
- Unable to render jobs for instance group 'stats_z1'. Errors are:
- Unable to render templates for job 'metron_agent'. Errors are:
- Error filling in template 'syslog_forwarder.conf.erb' (line 44: undefined method `strip' for nil:NilClass)
Removing the stripping of the retrieved IP should solve the issue. The way it is done in the nats template does not cause any issue:
https://github.com/cloudfoundry/cf-release/blob/v235/jobs/nats/templates/nats.conf.erb
<%
def network_config
networks = spec.networks.marshal_dump
_, network = networks.find do |_name, network_spec|
network_spec.default
end
if !network
_, network = networks.first
end
if !network
raise "Could not determine IP via network spec: #{networks}"
end
network
end
def discover_external_ip
network_config.ip
end
def discover_external_hostname
network_config.dns_record_name
end
%>
<% self_ip = discover_external_ip %>
syslog_forwarder.conf.erb contains legacy configuration options which are difficult to get right. The problem is that legacy-style options (for example queue settings) apply to the next action defined after those options and do not apply to the rest. While for static configuration this might be okay, for dynamic configuration (our case) it is very easy to break the expected behavior.
Therefore the following way of configuration:
$ActionResumeRetryCount ...
$ActionQueueType ...
should be replaced by:
action(... action.resumeRetryCount="..." queue.type="...")
More information on the official web site: http://www.rsyslog.com/doc/v8-stable/configuration/action/index.html
Statements modify the next action(s) that is/are defined via legacy syntax after the respective statement. Actions defined via the action() object are not affected by the legacy statements listed here. Use the action() object properties instead.
Tracing Cloud Foundry in a cloud environment can become a challenge, as the end-to-end transaction for a call/message is hard to find in the logs and even harder to stitch into the transaction flow. OpenStack is therefore working on a "global transaction" ID which will allow you to trace and find the logs of each transaction in your cloud environment:
https://blueprints.launchpad.net/nova/+spec/cross-service-request-id
As CF grows and the integration with OpenStack becomes more important, this would be an interesting feature to add in a future loggregator version, or in the log-creation process of each CF component.
Hi!
It seems that MaxMessageSize in the metron agent is hardcoded to 4k. Unfortunately this size is too small, and we would like to change it without forking the release. The problem we're facing is the following: in a CF installation, DEAs send messages that go to the nats_stream_forwarder. Each message contains information about all applications currently running on that DEA (including current state, timestamp, etc.). The nats_stream_forwarder then escapes the JSON that the DEA has sent and logs it. Unfortunately, a DEA running 12 apps produces a log message bigger than 4k, so if there is an ELK stack in the picture, this results in error messages in the log parsers, because these logs are not valid JSON.
To summarise: if the MaxMessageSize property were configurable, everyone would be able to set an appropriate log message size so that all messages can be logged properly, avoiding unnecessary error messages.
Regards,
Martina
Hi,
when do you plan to add support for BOSH zones in this release? A lot of jobs here use zones and currently require manual configuration, meaning you have to create a definition of a job per zone just to be able to manually specify a different zone. This creates excessive bloat in deployment manifests...
We're running release-169. Though I believe this is still an issue in latest since not much has changed in the agent since then.
We've been chasing a bad problem where applications instances seemingly randomly lock up in cloud foundry but don't crash. Causing outages until the application is actively restarted.
We finally were able to duplicate the issue. It appears that when there is an error in logging_stream.go, the agent stops listening on the stream; however, the application unknowingly continues to write to the stream, seemingly into some buffer somewhere. Once this buffer fills up, the application (a Java application in this case) blocks while attempting to write more to the stream. This eventually causes all application threads attempting to write to the stream to freeze, and the application is essentially hung, blocked attempting to write to the stream.
I imagine there are several error situations that could cause this frozen behavior. The one that we were able to duplicate happens when an application attempts to write more than 64K in one line to the stream. When this happens:
{"timestamp":1400852198.287148476,"process_id":20212,"source":"deaagent","log_level":"info","message":"Error while reading from socket OUT, /var/vcap/data/warden/depot/17och3r06s6/jobs/740, bufio.Scanner: token too long","data":null,"file":"/var/vcap/data/compile/dea_logging_agent/loggregator/src/deaagent/loggingstream/logging_stream.go","line":68,"method":"deaagent/loggingstream.func·001"}
Since this scenario is caused by logging too much to the stream, I imagine there are other scenarios that could also cause an error in the same place. It would be great if this code were much more defensive and careful about not causing this hung situation. Perhaps trigger an application instance crash instead of just returning?
Mike
I'd like to provide some custom syslog config and still use the metron agent. Currently, though, the metron agent takes over all syslog forwarding config of vcap* messages, since its config is numbered 00 and contains :programname, startswith, "vcap." ~
Any interest in accepting a PR that either makes discarding all vcap.* messages optional, changes the syslog_forwarder config to something like 05 instead of 00, or breaks the syslog forwarder config out into a separate job, so I can simply replace that job with my custom one?
I'd be happy to submit a PR for any of the above. Thoughts?
RSyslog supports failover by listing additional syslog servers to use when the primary fails, but Cloud Foundry only allows specifying a single server.
This means the log collector must provide redundancy on its own.
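For reference, this is the kind of failover rsyslog itself supports but a single Cloud Foundry drain URL cannot express; the hostnames are placeholders:

```
# Send everything to the primary drain; only use the fallback
# action when the previous (primary) action is suspended.
*.* @@primary.example.com:514
$ActionExecOnlyWhenPreviousIsSuspended on
& @@secondary.example.com:514
$ActionExecOnlyWhenPreviousIsSuspended off
```

Supporting multiple drain URLs per binding (or failover semantics in the drain) would let operators get the same behavior without running their own forwarding layer.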
With cloudfoundry/loggregator@8325aa7 the collectorregistrar was removed from the TrafficController. I didn't find the concrete commit, but I assume there is a similar change for Doppler as well.
These registration messages are not only used by the collector to check /varz endpoints, but also to check /healthz.
There seems to be no replacement for the /healthz endpoint registration, so we no longer get healthy metrics for the Doppler and TrafficController components. Can we get the healthy metric back, please?
Hi,
It seems that Loggregator opens new connections to all the Dopplers for each new /firehose/<subscription> connection.
A new connector.Connect method is called for each new connection:
https://github.com/cloudfoundry/loggregator/blob/develop/src/trafficcontroller/dopplerproxy/doppler_proxy.go#L163
which in turn creates its own Doppler connection list:
https://github.com/cloudfoundry/loggregator/blob/develop/src/trafficcontroller/channel_group_connector/channel_group_connector.go#L42
This means that given C firehose clients, L loggregator instances, and D doppler instances, you end up with roughly C x D / L outbound connections from each Loggregator and C inbound connections to each Doppler.
Is there a reason those Doppler connections are not reused? Reuse would result in D outbound connections from each Loggregator and L inbound connections to each Doppler. Assuming that clients usually outnumber Loggregators, this should give better numbers. You would have to do some log mapping / duplication in Loggregator, which could increase the overhead there, but I have seen the CPU load on Doppler peak much quicker than on Loggregator right now, so that could balance the load.
Regards,
Momchil
As part of the CloudFoundry -> Logsearch integration I'm working on - cloudfoundry-community/logsearch-for-cloudfoundry#21 - I'd like the Log Analysis dashboards to include human-readable names for log events.
Doppler tags all logs with a unique cf_app_id UUID, e.g.:
{"cf_app_id":"afc7d161-b5d5-44f2-a2c8-63870ff8f2db","level":"info","message_type":"OUT","msg":"goexample.apps.54.183.203.97.xip.io - [01/04/2015:10:22:50 +0000] \"GET / HTTP/1.1\" 200 7 \"-\" \"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)\" 10.10.2.6:60637 x_forwarded_for:\"46.165.195.139, 10.10.2.6\" vcap_request_id:9f598459-f756-4ea4-58e5-e9f212ff328f response_time:0.002463550 app_id:afc7d161-b5d5-44f2-a2c8-63870ff8f2db\n","source_instance":"0","source_type":"RTR","time":"2015-04-01T10:22:50Z"}
UUIDs, however, make for crappy dashboards.
Are there any plans to include additional information alongside the cf_app_id with each doppler firehose log event? Specifically:
cf_app_name
cf_space_id and cf_space_name
cf_org_id and cf_org_name
Thanks!
The ExponentialRetryStrategy defaults to a delay of 1ms.
If I'm reading the code correctly, its use in SyslogSink means that each iteration of the loop starts with a 1ms delay. If that's the case, then surely the loop should be arranged so that it only delays after a connection has failed, instead of delaying up front?
I haven't been able to find the code that deals with incoming messages, but if it is not throttled in the same manner, that would explain why the TruncatingBuffer fills so easily.
We've noticed that our metron agents have many panics throughout the day, at random times, with the following error:
panic: No enabled dopplers available, check your manifest to make sure you have dopplers listening for the following protocols [udp]
After looking into it, we noticed that the panics often correlated with etcd re-elections, which were happening frequently due to slow disk I/O on the etcd servers. Adjusting the election timeout has helped a lot, but we still see occasional panics. Metron agents seem very fragile now if they have any interruption with etcd, or if the dopplers fail to update their entries for some reason.
Hi All,
I have deployed CF170 and am trying to connect a syslog drain to my loggregator. The end result is that the loggregator passes a few messages and then falls over.
Before binding the drain I can run the nyet tests against my CF; after binding the drain and pushing some logs as described below, the nyet tests fail.
https://github.com/FreightTrain/play-cf-env
nc -l 1514
cf cups syslog -l syslog://ip-of-drain-machine:1514
monit restart all
On the loggregator:
[root@logstash-logger-01 ~]# nc -l 1514
205 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value akka.actor.deployment.default.routees.paths as String
181 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value akka.event-handlers as String
177 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value akka.extensions as String
186 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Could not read config value play.akka.event-handlers as String
168 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [App/0] - - [info] application - Rendering time: Fri May 23 15:41:39 UTC 2014
413 <14>1 2014-05-23T15:41:39+00:00 loggregator e62e36d5-a2ab-481d-bfc3-3994b8066770 [RTR] - - troy-test.paas.dev.col.tx.cpgpaas.net - [23/05/2014:15:41:39 +0000] "GET / HTTP/1.1" 200 33469 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0" 192.168.80.26:52908 vcap_request_id:4dd535a398530b89fc06791f6fde0d8a response_time:0.049732475 app_id:e62e36d5-a2ab-481d-bfc3-3994b8066770
{"timestamp":1400859682.107874155,"process_id":6843,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the server - unexpected EOF - 192.168.80.53:8080","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":77,"method":"trafficcontroller.(*handler).proxyConnectionTo"}
{"timestamp":1400859682.108934164,"process_id":6843,"source":"loggregator trafficcontroller","log_level":"error","message":"Output Proxy: Error reading from the client - read tcp 192.168.80.29:21971: use of closed network connection","data":null,"file":"/var/vcap/data/compile/loggregator_trafficcontroller/loggregator/src/trafficcontroller/proxy_handler.go","line":91,"method":"trafficcontroller.(*handler).watchKeepAlive"}