Comments (20)

camerondavison commented on August 28, 2024

After restarting it, everything looks rather normal again. I tried to do a kill -QUIT to get a stack trace, but the trace itself was printing incredibly slowly, and then I think my systemd definitions killed it and restarted it in the middle, so I did not get anything.

Before I restarted it, I ran strace -f -p $PID, and it was only printing:

Process 25359 attached with 7 threads
[pid  5361] futex(0xc2085680d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23093] futex(0xc20837c458, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25364] futex(0xc208021258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25363] futex(0xd18520, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25359] futex(0xd0f378, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631679, 472159297}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631679, 472233554}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543852, 511458926}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0)   = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631680, 423690670}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631741, 549970095}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631741, 816163981}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543914, 227795643}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0)   = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631742, 140082541}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 334995979}) = 0
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 335140475}) = 0
[pid 25362] clock_gettime(CLOCK_REALTIME, {1431543975, 609309540}) = 0
[pid 25362] epoll_wait(5, {}, 128, 0)   = 0
[pid 25362] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 25362] clock_gettime(CLOCK_MONOTONIC, {1631803, 521924470}) = 0
[pid 25362] futex(0xd0e7a0, FUTEX_WAIT, 0, {60, 0}

Once I restarted it, it was printing the expected open/read of the file descriptors every 5 seconds when my Prometheus app polled it for data.

juliusv commented on August 28, 2024

Strange! Could you try to get the output of http://<host:port>/debug/pprof/goroutine?debug=2 next time this happens?
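
For reference, that endpoint is served by Go's net/http/pprof package (visible in the stack trace later in this thread), which registers its handlers on http.DefaultServeMux as a side effect of being imported. A minimal sketch of how a server ends up exposing it; the port is just an example:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // With the pprof handlers registered, /debug/pprof/goroutine?debug=2
    // returns full stack traces for every goroutine in the process.
    log.Fatal(http.ListenAndServe(":7301", nil))
}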

camerondavison commented on August 28, 2024

No good. I cannot complete that curl; it just hangs.

camerondavison commented on August 28, 2024

I was able to send it a SIGQUIT this time. I'm still waiting on the output, but here is what I have so far:
https://gist.github.com/a86c6f7964/f170cabd622f64c76d2e

juliusv commented on August 28, 2024

@a86c6f7964 We've been staring at this for a short while and are still not sure exactly what's going on, beyond some goroutines being slow or blocked for long times. Hanging goroutines could be some deadlock bug, but that wouldn't really explain the high I/O. Is the machine swapping when this happens? Do you happen to have process and/or machine memory metrics? (Actually, you should have both up until the point of total unresponsiveness, if there was a Prometheus scraping that node exporter.)

juliusv commented on August 28, 2024

Interesting metrics to look at for the node-exporter process itself, which should be in the Prometheus that scraped it, up until the point when the node exporter became unscrapable:

  • process_goroutines
  • process_resident_memory_bytes and process_virtual_memory_bytes
  • process_cpu_seconds_total (the rate over that)
  • process_open_fds

...and machine-related metrics that would be interesting for the host the node exporter ran on (see the query sketch after this list):

  • node_memory_MemFree
  • node_memory_Cached
  • node_memory_Buffers
  • node_cpu (rate over this one)
  • node_load1
  • node_disk_io_now (in-progress disk IOs)
  • node_network_receive_bytes
  • maybe more
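
If the Prometheus server that scraped the exporter is still reachable, these metrics can also be pulled programmatically over its HTTP API rather than eyeballed in the graph UI. A minimal sketch using the Go client library; the server address, job label, and 5-minute rate window are assumptions for illustration:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Address of the Prometheus server that scraped the node exporter (assumed).
    client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
    if err != nil {
        log.Fatal(err)
    }
    promAPI := v1.NewAPI(client)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // One of the process metrics from the list above, as a rate over 5 minutes.
    result, warnings, err := promAPI.Query(ctx, `rate(process_cpu_seconds_total{job="node"}[5m])`, time.Now())
    if err != nil {
        log.Fatal(err)
    }
    if len(warnings) > 0 {
        log.Println("warnings:", warnings)
    }
    fmt.Println(result)
}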

camerondavison commented on August 28, 2024

Okay, I updated the gist with the full stack trace (it took almost 8 hours to generate 😮).

My Prometheus only keeps 4 hours' worth of data (disk I/O got really slow when I gave it more; we are running on an AWS t2.medium machine), so those will be kind of hard to get at.

camerondavison commented on August 28, 2024

The usual load on this machine is next to nothing. I have the node exporter set at a memory limit of 50M in my systemd unit (do you think that it needs more?)

Looking through the logs, it looks like something happened around 'I0518 18:36:07', where things started getting slower:

I0518 18:35:17.344805   30572 node_exporter.go:98] OK: stat success after 0.000160s.
I0518 18:35:22.345006   30572 node_exporter.go:98] OK: stat success after 0.000162s.
I0518 18:35:27.347632   30572 node_exporter.go:98] OK: stat success after 0.000157s.
I0518 18:35:32.344479   30572 node_exporter.go:98] OK: stat success after 0.000167s.
I0518 18:35:37.346816   30572 node_exporter.go:98] OK: stat success after 0.000159s.
I0518 18:35:42.345793   30572 node_exporter.go:98] OK: stat success after 0.000169s.
I0518 18:35:47.345620   30572 node_exporter.go:98] OK: stat success after 0.000162s.
I0518 18:35:52.348205   30572 node_exporter.go:98] OK: stat success after 0.000185s.
I0518 18:35:57.344761   30572 node_exporter.go:98] OK: stat success after 0.000172s.
I0518 18:36:02.345716   30572 node_exporter.go:98] OK: stat success after 0.000157s.
I0518 18:36:07.598528   30572 node_exporter.go:98] OK: stat success after 0.010153s.
I0518 18:36:12.833998   30572 node_exporter.go:98] OK: stat success after 0.276153s.
I0518 18:36:17.852887   30572 node_exporter.go:98] OK: stat success after 0.251907s.
I0518 18:36:22.695873   30572 node_exporter.go:98] OK: stat success after 0.014766s.
I0518 18:36:27.657642   30572 node_exporter.go:98] OK: stat success after 0.016069s.
I0518 18:36:32.658599   30572 node_exporter.go:98] OK: stat success after 0.016960s.
I0518 18:36:37.653351   30572 node_exporter.go:98] OK: stat success after 0.199047s.
I0518 18:36:42.614556   30572 node_exporter.go:98] OK: stat success after 0.006177s.

Looking closer at that time range, I cannot tell what would have caused the slowdown. Things get really bad here:

I0518 20:35:56.787647   30572 node_exporter.go:98] OK: netdev success after 0.000339s.
I0518 20:35:56.787674   30572 node_exporter.go:98] OK: time success after 0.000003s.
I0518 20:35:56.787895   30572 node_exporter.go:98] OK: netstat success after 0.000201s.
I0518 20:35:56.788058   30572 node_exporter.go:98] OK: diskstats success after 0.000118s.
I0518 20:35:53.814827   30572 node_exporter.go:98] OK: time success after 0.000011s.
I0518 20:35:25.796216   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:29.001793   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:52.398907   30572 node_exporter.go:98] OK: time success after 0.000008s.
I0518 20:35:22.792778   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:24.086548   30572 node_exporter.go:98] OK: time success after 0.000007s.
I0518 20:35:24.527268   30572 node_exporter.go:98] OK: time success after 0.000006s.
I0518 20:35:50.266417   30572 node_exporter.go:98] OK: filesystem success after 0.000287s.
I0518 20:35:25.796643   30572 node_exporter.go:98] OK: netdev success after 0.000371s.
I0518 20:46:53.967890   30572 node_exporter.go:98] OK: loadavg success after 84.566580s.
I0518 20:35:21.660582   30572 node_exporter.go:98] OK: netdev success after 0.000411s.
I0518 20:53:51.171998   30572 node_exporter.go:98] OK: loadavg success after 977.176057s.
I0518 20:46:54.267224   30572 node_exporter.go:98] OK: loadavg success after 40.160530s.
I0518 20:48:04.293429   30572 node_exporter.go:98] OK: loadavg success after 657.562015s.
I0518 20:47:58.196109   30572 node_exporter.go:98] OK: loadavg success after 699.850948s.
I0518 20:35:50.456721   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:21.493450   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:47:57.299693   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:48.280044   30572 node_exporter.go:98] OK: time success after 0.000007s.
I0518 20:35:20.950916   30572 node_exporter.go:98] OK: time success after 0.000006s.
I0518 20:47:58.196097   30572 node_exporter.go:98] OK: netstat success after 699.830764s.
I0518 20:51:29.740396   30572 node_exporter.go:98] OK: netstat success after 828.727282s.
I0518 20:48:35.090167   30572 node_exporter.go:98] OK: loadavg success after 657.785694s.
I0518 20:53:21.403094   30572 node_exporter.go:98] OK: attributes success after 863.295223s.
I0518 21:09:17.576817   30572 node_exporter.go:98] OK: netstat success after 1889.755354s.
I0518 20:47:57.788156   30572 node_exporter.go:98] OK: loadavg success after 754.159794s.
I0518 20:35:25.386415   30572 node_exporter.go:98] OK: filesystem success after 0.000293s.
I0518 20:58:41.445939   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:49.117876   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:47:56.418723   30572 node_exporter.go:98] OK: time success after 270.894575s.
I0518 20:48:49.139520   30572 node_exporter.go:98] OK: time success after 0.000012s.
I0518 20:52:05.805535   30572 node_exporter.go:98] OK: loadavg success after 698.312665s.
I0518 20:50:58.722561   30572 node_exporter.go:98] OK: loadavg success after 699.930957s.
I0518 20:49:21.125827   30572 node_exporter.go:98] OK: loadavg success after 700.511116s.
I0518 21:00:37.710201   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 20:35:19.374094   30572 node_exporter.go:98] OK: textfile success after 0.000001s.
I0518 21:02:47.132333   30572 node_exporter.go:98] OK: textfile success after 0.102544s.
I0518 21:02:36.343913   30572 node_exporter.go:98] OK: netstat success after 1433.712689s.
I0518 20:49:43.494054   30572 node_exporter.go:98] OK: loadavg success after 697.772662s.
I0518 21:01:17.995762   30572 node_exporter.go:98] OK: time success after 342.470954s.

I would imagine that something grabs a lock and then a whole bunch of goroutines pile up? Maybe that then causes the memory to go really high, and then maybe it starts swapping? I will look into swapping for the systemd unit.
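
The pile-up hypothesis is easy to picture: if one collection holds a lock (or just blocks on slow I/O) while scrapes keep arriving, every later scrape's handler goroutine queues up behind it and the backlog's memory grows. A toy illustration of that failure mode, not node_exporter's actual code; the mutex and the sleep stand in for whatever is actually blocking:

package main

import (
    "fmt"
    "log"
    "net/http"
    "runtime"
    "sync"
    "time"
)

var mu sync.Mutex // stands in for whatever shared state the collectors serialize on

func slowCollect() {
    mu.Lock()
    defer mu.Unlock()
    time.Sleep(time.Minute) // one collection gets stuck here...
}

func main() {
    http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        slowCollect() // ...and every subsequent scrape's goroutine queues up behind it
        fmt.Fprintln(w, "ok")
    })
    go func() {
        for range time.Tick(5 * time.Second) {
            // The backlog (and its memory) grows while the scraper keeps polling.
            fmt.Println("goroutines:", runtime.NumGoroutine())
        }
    }()
    log.Fatal(http.ListenAndServe(":7301", nil))
}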

juliusv commented on August 28, 2024

Yes, that could be the problem. A couple of us have stared at this a bit, but no conclusion yet :-/

camerondavison commented on August 28, 2024

Just FYI, I ran:
curl localhost:7301/debug/pprof/goroutine?debug=2

goroutine 2153562 [running]:
runtime/pprof.writeGoroutineStacks(0x7f1ba1fa80d8, 0xc2087579a0, 0x0, 0x0)
    /usr/local/go/src/runtime/pprof/pprof.go:511 +0x8d
runtime/pprof.writeGoroutine(0x7f1ba1fa80d8, 0xc2087579a0, 0x2, 0x0, 0x0)
    /usr/local/go/src/runtime/pprof/pprof.go:500 +0x4f
runtime/pprof.(*Profile).WriteTo(0xced2c0, 0x7f1ba1fa80d8, 0xc2087579a0, 0x2, 0x0, 0x0)
    /usr/local/go/src/runtime/pprof/pprof.go:229 +0xd5
net/http/pprof.handler.ServeHTTP(0xc2082076f1, 0x9, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/pprof/pprof.go:169 +0x35f
net/http/pprof.Index(0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/pprof/pprof.go:181 +0x15e
net/http.HandlerFunc.ServeHTTP(0x9a4068, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/server.go:1265 +0x41
net/http.(*ServeMux).ServeHTTP(0xc20803ab10, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/server.go:1541 +0x17d
net/http.serverHandler.ServeHTTP(0xc20805a600, 0x7f1ba1f85030, 0xc2087579a0, 0xc208173380)
    /usr/local/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc208757900)
    /usr/local/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:1751 +0x35e

goroutine 1 [IO wait]:
net.(*pollDesc).Wait(0xc2080d0840, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2080d0840, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc2080d07e0, 0x0, 0x7f1ba1f82dc0, 0xc2085c4ac0)
    /usr/local/go/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc20804c1e8, 0x48f73e, 0x0, 0x0)
    /usr/local/go/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc20804c1e8, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1976 +0x4c
net/http.(*Server).Serve(0xc20805a600, 0x7f1ba1f84e78, 0xc20804c1e8, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServe(0xc20805a600, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1718 +0x154
net/http.ListenAndServe(0x7fff614d5f8e, 0x5, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/http/server.go:1808 +0xba
main.main()
    /home/bjoern/gocode/src/github.com/prometheus/node_exporter/node_exporter.go:185 +0x73e

goroutine 7 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0xd0e7c0)
    /home/bjoern/gocode/src/github.com/prometheus/node_exporter/.build/gopath/src/github.com/golang/glog/glog.go:879 +0x78
created by github.com/golang/glog.init·1
    /home/bjoern/gocode/src/github.com/prometheus/node_exporter/.build/gopath/src/github.com/golang/glog/glog.go:410 +0x2a7

goroutine 17 [syscall, 11494 minutes, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 8 [syscall, 11494 minutes]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/local/go/src/os/signal/signal_unix.go:27 +0x35

It is currently using 100MB of memory, whereas the consul_exporters I am running are using only about 20MB. I feel like things are slowing down as well; scrape_duration_seconds is steadily climbing and is now at 2.2 seconds.

camerondavison commented on August 28, 2024

Hmm, process_resident_memory_bytes says about 12MB, and so does ps, but cAdvisor is telling me that it is using all 100MB allocated for that cgroup.

lsof -p 3662

COMMAND    PID USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
node_expo 3662 root  cwd    DIR              202,9     4096        2 /
node_expo 3662 root  rtd    DIR              202,9     4096        2 /
node_expo 3662 root  txt    REG              202,9 10594835       30 /var/lib/prometheus/node_exporter
node_expo 3662 root  mem    REG              202,3    51152    33867 /usr/lib64/libnss_files-2.19.so
node_expo 3662 root  mem    REG              202,3  1787096    33884 /usr/lib64/libc-2.19.so
node_expo 3662 root  mem    REG              202,3   109288    33865 /usr/lib64/libpthread-2.19.so
node_expo 3662 root  mem    REG              202,3   144720    33863 /usr/lib64/ld-2.19.so
node_expo 3662 root    0r   CHR                1,3      0t0     1036 /dev/null
node_expo 3662 root    1u  unix 0xffff88006ecfde80      0t0 31659765 socket
node_expo 3662 root    2u  unix 0xffff88006ecfde80      0t0 31659765 socket
node_expo 3662 root    3u   REG               0,30 92607463 31659776 /tmp/node_exporter.ip-172-31-17-127.root.log.INFO.20150520-162710.3662
node_expo 3662 root    4u  IPv6           31659778      0t0      TCP *:7301 (LISTEN)
node_expo 3662 root    5u  0000               0,10        0     6638 anon_inode

camerondavison commented on August 28, 2024

Okay, crazy thing here: I truncated the log file,
echo > /tmp/node_exporter.ip-172-31-17-127.root.log.INFO.20150520-162710.3662
and immediately got back all of the memory.

Also, scrape_duration_seconds went right back down to 0.1.
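
That behavior would fit the memory cgroup charging the page cache of the glog file under /tmp to the unit, on top of the process's own RSS; truncating the file drops those pages and the charge along with them. A minimal sketch for checking the split, assuming cgroup v1 paths as on a 2015-era host:

package main

import (
    "fmt"
    "os"
)

func main() {
    // cgroup v1 paths (an assumption; under systemd the per-unit files live
    // below /sys/fs/cgroup/memory/system.slice/<unit>/ rather than the root).
    for _, p := range []string{
        "/sys/fs/cgroup/memory/memory.usage_in_bytes", // total charged to the cgroup (rss + cache + ...)
        "/sys/fs/cgroup/memory/memory.stat",           // breakdown: compare the "cache" and "rss" lines
    } {
        b, err := os.ReadFile(p)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            continue
        }
        fmt.Printf("--- %s ---\n%s", p, b)
    }
}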

camerondavison commented on August 28, 2024

I am adding -logtostderr=true, and we shall see in a week whether this is solved.

beorn7 commented on August 28, 2024

Perhaps a problem with your /tmp? Disk full? Or /tmp on ramdisk (and also full)?

camerondavison commented on August 28, 2024

systemd was essentially saying that the log file seemed to be part of the memory accounting for the cgroup (not sure if glog is doing something like mlocking the log file or not). I am unable to reproduce it, but /tmp is mounted on an 8GB partition and the file was only 90MB or so, so it was not full. My unit file just looks like this:

[Unit]
Description=Prometheus Node Exporter Service
[Service]
TimeoutStartSec=10m
MemoryLimit=100M
Restart=always
RestartSec=30
ExecStart=-/var/lib/prometheus/node_exporter\
 -web.listen-address=:7301

juliusv commented on August 28, 2024

Oh... we should switch node_exporter (like everything else) from glog to https://github.com/prometheus/log, which only logs to stderr anyway. I'll do that, and I hope you can already confirm that the issue is gone when logging to stderr!
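
A minimal sketch of the direction of that fix, using only the standard library (the actual change in #79 swaps glog for the prometheus/log package; this just shows the log-to-stderr idea):

package main

import (
    "log"
    "os"
)

func main() {
    // glog buffers and writes log files under /tmp by default; writing straight
    // to stderr leaves no on-disk log file for the cgroup to charge as page cache.
    logger := log.New(os.Stderr, "node_exporter: ", log.LstdFlags)
    logger.Println("OK: stat success after 0.000160s.")
}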

juliusv commented on August 28, 2024

#79 to switch node_exporter to prometheus/log.

juliusv commented on August 28, 2024

Has this been resolved by the logging change?

camerondavison commented on August 28, 2024

It has not been a week yet, but I am now seeing the memory stay very steady at < 20MB. I would say that was likely the problem. Thanks for working with me on this.

juliusv commented on August 28, 2024

Great to hear, thanks!
