While running storjlabs/storagenode:alpha
(hash f5421c8370b1
latest at time of writing), the node was up for about 6 days when I noticed the node was OFFLINE
.
Storage Node Dashboard
======================
ID XXXXXXXXXXX
Status OFFLINE
Uptime 157h3m55s
Available Used
Bandwidth 10.0 TB 1.5 MB
Disk 8.0 TB 4.2 GB
Bootstrap bootstrap.storj.io:8888
Internal :28967
External XXXXXXXXXXX:28967
Neighborhood Size 0
I did a log dump and some questioning on the community support portal and I was told that the bootstrap.storj.io
satellite has some issues over the last couple days which was confirmed in the logs.
I believe the storagenode program got into a state where it neither died (and would have been restarted), nor reestablished connection to the satellite (got stuck in a strange state).
Here is the timeline I pieced together from the logs that necessitated a docker container restart:
2019-03-11T21:44:25 Things are running fine. Pieces are flying back and forth:
2019-03-11T21:44:25.636Z INFO piecestore Successfully stored {"Piece ID": "HN88Rt1UKMERYwhXLCPmKoR9xMxFE4VYmVEPWoQszjRt"}
2019-03-11T21:44:26.184Z INFO piecestore Successfully retrieved {"Piece ID": "3qKLQd7CnqHKFZ3UFZPeUkLbnb9EQ7kcvTKV5BwSEeHMbftQbgfqQLS5Q5N74cFuw6yAreGFD86QbwWgvZ9pFcYr", "Allocated": 2351104, "Retrieved": 2351104}
2019-03-11T22:01:39.401Z INFO piecestore Successfully retrieved Stats...
2019-03-11T22:01:39.465Z INFO agreements Sending agreements to satellite {"number of agreements": 6, "satellite id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2019-03-11T22:02:21 my node can't connect to the satellite (34.80.85.84), but it must have SOME connectivity based on the following message after the error:
2019-03-11T22:02:21.148Z WARN agreements Agreementsender could not dial satellite {"error": "transport error: connection error: desc = \"transport: error while dialing: dial tcp 34.80.85.84:7777: connect: connection refused\"", "errorVerbose": "transport error: connection error: desc = \"transport: error while dialing: dial tcp 34.80.85.84:7777: connect: connection refused\"\n\tstorj.io/storj/pkg/transport.(*Transport).DialNode:82\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).SendAgreementsToSatellite:75\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).Run:54\n\tstorj.io/storj/storagenode.(*Peer).Run.func3:201\n\tstorj.io/storj/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-03-11T22:02:21.148Z INFO agreements Sending agreements to satellite {"number of agreements": 1574, "satellite id": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2019-03-11T22:07:25.470Z INFO agreements Sending agreements to satellite {"number of agreements": 1, "satellite id": "1Vrf9xmmHw6KaVFMcfR2YPt8YpVVoQZGTUJyjYc6CajeYrAqrB"}
2019-03-11T23:01:35.410Z INFO agreements Sending agreements to satellite {"number of agreements": 6, "satellite id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2019-03-11T23:01:35 Top of the hour comes and it tries to phone home again only to get the same error:
2019-03-11T23:01:35.432Z INFO piecestore Successfully retrieved Stats...
2019-03-11T23:02:16.056Z WARN agreements Agreementsender could not dial satellite {"error": "transport error: connection error: desc = \"transport: error while dialing: dial tcp 34.80.85.84:7777: connect: connection refused\"", "errorVerbose": "transport error: connection error: desc = \"transport: error while dialing: dial tcp 34.80.85.84:7777: connect: connection refused\"\n\tstorj.io/storj/pkg/transport.(*Transport).DialNode:82\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).SendAgreementsToSatellite:75\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).Run:54\n\tstorj.io/storj/storagenode.(*Peer).Run.func3:201\n\tstorj.io/storj/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-03-12T00:01:33 Another hour ticks by and another phone home attempt, only this time instead of Agreementsender could not dial satellite
it says Agreementsender could not find satellite
:
2019-03-12T00:01:33.927Z INFO piecestore Successfully retrieved Stats...
2019-03-12T00:01:33.928Z INFO agreements Sending agreements to satellite {"number of agreements": 6, "satellite id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2019-03-12T00:01:34.000Z WARN agreements Agreementsender could not find satellite {"error": "node not found", "errorVerbose": "node not found\n\tstorj.io/storj/pkg/kademlia.(*Kademlia).lookup:259\n\tstorj.io/storj/pkg/kademlia.(*Kademlia).FindNode:216\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).SendAgreementsToSatellite:69\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).Run:54\n\tstorj.io/storj/storagenode.(*Peer).Run.func3:201\n\tstorj.io/storj/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-03-12T01:01:33 Another hour, another phone home. Error repeated.
2019-03-12T01:01:33.932Z INFO piecestore Successfully retrieved Stats...
2019-03-12T01:01:33.935Z INFO agreements Sending agreements to satellite {"number of agreements": 6, "satellite id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2019-03-12T01:01:33.956Z WARN agreements Agreementsender could not find satellite {"error": "node not found", "errorVerbose": "node not found\n\tstorj.io/storj/pkg/kademlia.(*Kademlia).lookup:259\n\tstorj.io/storj/pkg/kademlia.(*Kademlia).FindNode:216\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).SendAgreementsToSatellite:69\n\tstorj.io/storj/pkg/piecestore/psserver/agreementsender.(*AgreementSender).Run:54\n\tstorj.io/storj/storagenode.(*Peer).Run.func3:201\n\tstorj.io/storj/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-03-12T01:06:56 I come home to find my dashboard reporting OFFLINE
so I terminate the container and let it restart after which it is ONLINE
and everything is working again.
2019-03-12T01:06:56.143Z INFO Got a signal from the OS: "terminated"
2019-03-12T01:07:37.823Z INFO Configuration loaded from: /app/config/config.yaml
Extra info on container running:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
storjlabs/storagenode alpha f5421c8370b1 10 days ago 30MB
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00cdbe654317 storjlabs/storagenode:alpha "./entrypoint.sh" 6 days ago Up About an hour 0.0.0.0:28967->28967/tcp storagenode