raft's Introduction

raft

raft is a Go library that manages a replicated log and can be used with an FSM to manage replicated state machines. It is a library for providing consensus.

The use cases for such a library are far-reaching: replicated state machines are a key component of many distributed systems, and they enable building Consistent, Partition-Tolerant (CP) systems with limited fault tolerance as well.

Building

If you wish to build raft you'll need Go version 1.16+ installed.

Please check your installation with:

go version

Documentation

For complete documentation, see the associated Godoc.

To prevent complications with cgo, the primary backend MDBStore is in a separate repository, called raft-mdb. That is the recommended implementation for the LogStore and StableStore.

A pure Go backend using Bbolt is also available called raft-boltdb. It can also be used as a LogStore and StableStore.
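
As an illustration, wiring the BoltDB backend in as both stores might look like this (the data directory path is a placeholder):

package main

import (
    "log"
    "path/filepath"

    raftboltdb "github.com/hashicorp/raft-boltdb"
)

func main() {
    dataDir := "/var/lib/myapp" // placeholder data directory

    // A single BoltStore satisfies both raft.LogStore and raft.StableStore.
    store, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft.db"))
    if err != nil {
        log.Fatalf("failed to open bolt store: %v", err)
    }
    defer store.Close()

    // Pass store as both the LogStore and StableStore arguments to raft.NewRaft.
    _ = store
}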

Community Contributed Examples

Tagged Releases

As of September 2017, HashiCorp will start using tags for this library to clearly indicate major version updates. We recommend you vendor your application's dependency on this library.

  • v0.1.0 is the original stable version of the library that was in main and has been maintained with no breaking API changes. This was in use by Consul prior to version 0.7.0.

  • v1.0.0 takes the changes that were staged in the library-v2-stage-one branch. This version manages server identities using a UUID, so introduces some breaking API changes. It also versions the Raft protocol, and requires some special steps when interoperating with Raft servers running older versions of the library (see the detailed comment in config.go about version compatibility). You can reference hashicorp/consul#2222 for an idea of what was required to port Consul to these new interfaces.

    This version includes some new features as well, including non-voting servers, a new address provider abstraction in the transport layer, and more resilient snapshots.

Protocol

raft is based on "Raft: In Search of an Understandable Consensus Algorithm".

A high level overview of the Raft protocol is described below, but for details please read the full Raft paper followed by the raft source. Any questions about the raft protocol should be sent to the raft-dev mailing list.

Protocol Description

Raft nodes are always in one of three states: follower, candidate, or leader. All nodes initially start out as followers. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state, nodes request votes from their peers. If a candidate receives a quorum of votes, it is promoted to leader. The leader must accept new log entries and replicate them to all the other followers. In addition, if stale reads are not acceptable, all queries must also be performed on the leader.

Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry, which is an opaque binary blob to Raft. The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific, and is implemented using an interface.

An obvious question relates to the unbounded nature of a replicated log. Raft provides a mechanism by which the current state is snapshotted and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time and then remove all the logs that were used to reach that state. This is performed automatically without user intervention, prevents unbounded disk usage, and minimizes time spent replaying logs.
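
To make the FSM abstraction concrete, here is a minimal sketch; the counterFSM application logic is invented for illustration, while the raft.FSM, raft.FSMSnapshot, and raft.SnapshotSink interfaces are the library's.

package example

import (
    "encoding/json"
    "io"
    "sync"

    "github.com/hashicorp/raft"
)

// counterFSM is a hypothetical FSM: each committed entry's Data is treated
// as a key whose count is incremented. Apply must be deterministic, since
// every node reaches its state by replaying the same log.
type counterFSM struct {
    mu     sync.Mutex
    counts map[string]uint64
}

// Apply is invoked once a log entry is committed.
func (f *counterFSM) Apply(l *raft.Log) interface{} {
    f.mu.Lock()
    defer f.mu.Unlock()
    if f.counts == nil {
        f.counts = make(map[string]uint64)
    }
    key := string(l.Data)
    f.counts[key]++
    return f.counts[key] // surfaced to the caller via ApplyFuture.Response()
}

// Snapshot captures the state at a point in time so older logs can be
// compacted away; restoring it must yield the same state as a log replay.
func (f *counterFSM) Snapshot() (raft.FSMSnapshot, error) {
    f.mu.Lock()
    defer f.mu.Unlock()
    copied := make(map[string]uint64, len(f.counts))
    for k, v := range f.counts {
        copied[k] = v
    }
    return &counterSnapshot{counts: copied}, nil
}

// Restore replaces the FSM state from a snapshot stream.
func (f *counterFSM) Restore(rc io.ReadCloser) error {
    defer rc.Close()
    counts := make(map[string]uint64)
    if err := json.NewDecoder(rc).Decode(&counts); err != nil {
        return err
    }
    f.mu.Lock()
    f.counts = counts
    f.mu.Unlock()
    return nil
}

type counterSnapshot struct {
    counts map[string]uint64
}

// Persist writes the captured state to the sink: Cancel on failure,
// Close on success.
func (s *counterSnapshot) Persist(sink raft.SnapshotSink) error {
    if err := json.NewEncoder(sink).Encode(s.counts); err != nil {
        sink.Cancel()
        return err
    }
    return sink.Close()
}

func (s *counterSnapshot) Release() {}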

Lastly, there is the issue of updating the peer set when new servers are joining or existing servers are leaving. As long as a quorum of nodes is available, this is not an issue as Raft provides mechanisms to dynamically update the peer set. If a quorum of nodes is unavailable, then this becomes a very challenging issue. For example, suppose there are only 2 peers, A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add, or remove a node, or commit any additional log entries. This results in unavailability. At this point, manual intervention would be required to remove either A or B, and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to either run 3 or 5 raft servers. This maximizes availability without greatly sacrificing performance.
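
The arithmetic behind these recommendations, as a small worked sketch:

package example

// quorumSize returns the votes needed to commit in a cluster of n voting
// servers, and the number of failures such a cluster can tolerate.
func quorumSize(n int) (needed, tolerated int) {
    needed = n/2 + 1 // a strict majority
    tolerated = (n - 1) / 2
    return
}

// quorumSize(3) == (2, 1): a 3-node cluster survives one failure.
// quorumSize(5) == (3, 2): a 5-node cluster survives two failures.
// quorumSize(4) == (3, 1): an even size adds load without extra tolerance.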

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership, committing a log entry requires a single round trip to a majority of the cluster. Thus, performance is bound by disk I/O and network latency.

raft's People

Contributors

abligh, alvin-huang, armon, banks, briankassouf, dependabot[bot], dhiaayachi, dnephin, freddygv, freeekanayaka, hanshasselberg, hridoyroy, jmurret, jsoref, kisunji, kyhavlov, mkeeler, mpalmi, ncabatoff, ongardie, preetapan, rboyer, robxu9, ryanuber, schmichael, schristoff, sean-, slackpad, stapelberg, superfell


raft's Issues

A reference implementation?

Hi,

For now, the only way to learn how to use this library is to dig into the Consul code. Could you provide a basic reference implementation, like goraft does (https://github.com/goraft/raftd), to kickstart the development of new projects using your raft implementation?

Thanks!

How to run it?

I am new to Go and need to implement a Go version of a Raft client. I got your project and compiled it, but cannot find any file to run. I assumed it could be raft.go, but got this error back:

C:\go_workspace\raft>go run raft.go
go run: cannot run non-main package

Thanks for your help.

Why does config.SnapshotInterval exist?

Given that we have config.SnapshotThreshold, I don't see why the snapshot interval is configurable. Could we not set it to 1s fixed (as an example)? Given that shouldSnapshot() only does two reads (to get the last snapshot index from the StableStore and the last log index from the LogStore), I don't think it's too expensive to do once a second...?

Could you please explain the rationale behind this or confirm my understanding, in which case I'll send a cleanup pull request?
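
For context, the two knobs sit side by side in raft.Config; a sketch of setting them (the values are illustrative, not recommendations):

package example

import (
    "time"

    "github.com/hashicorp/raft"
)

func snapshotTuning() *raft.Config {
    c := raft.DefaultConfig()
    // How often we check whether a snapshot should be taken.
    c.SnapshotInterval = 30 * time.Second
    // How many new log entries must have accumulated before the check
    // actually triggers a snapshot.
    c.SnapshotThreshold = 16384
    return c
}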

Removing the leader can lead to different logs applied to the remaining nodes

While doing some stress testing we ran into this issue. If you do a RemovePeer of a node that happens to be the leader at the time and is also working on other outstanding Apply'd entries, the remaining nodes in the cluster can end up seeing a mismatch in the entries that are applied to their FSMs. In particular, some nodes will see one entry applied while other nodes never see that entry (depending on the entry, this can lead to serious data issues).

The entry in question always appears to be an inflight entry, namely the log entry after the RemovePeer log entry. E.g. in this run, log entry 3827 was the RemovePeer entry; node 0's FSM saw this sequence of applies (index: data):

  1. 3825: 7945C0282719297129F2A5F85B89E9D55373600FA459F8BA7BFD9D443488635A67
  2. 3826: BF921247A73A9AFDAD5D2A37D34C4C49636C6353D0173AE6D29682C11685FD6E07
  3. 3829: A74CC536489C8AFD48EAA6A5FB2063614E16E15EA13754B362A8A448A7A49CBB75
  4. 3830: 6E98BE940FF0CFD71960DBA7C9B3674D18BA97ECAACC8B80DF7211F9702AFF77D7

but node 3's FSM saw this sequence of applies:

  1. 3825: 7945C0282719297129F2A5F85B89E9D55373600FA459F8BA7BFD9D443488635A67
  2. 3826: BF921247A73A9AFDAD5D2A37D34C4C49636C6353D0173AE6D29682C11685FD6E07
  3. 3828: 7340F10D4B683B02B08A37077C90CC4B39A7812C4A87EF986190679031DDDB9526
  4. 3829: A74CC536489C8AFD48EAA6A5FB2063614E16E15EA13754B362A8A448A7A49CBB75
  5. 3830: 6E98BE940FF0CFD71960DBA7C9B3674D18BA97ECAACC8B80DF7211F9702AFF77D7

Note the extra log entry at 3828 that node 3 saw but node 0 didn't. (Node 1, the leader, was the node removed.)

In addition, the client request that created log entry 3828 was given a success result.

We added some instrumentation of AppendEntry RPC calls to our transport, and from that log we see this sequence of AE's from the leader at this time
node from -> to, first/last index in message, leaderCommitIndex
rl_1 -> rl_2 3823 - 3824 : 3820
rl_1 -> rl_4 3825 - 3826 : 3822
rl_1 -> rl_2 3825 - 3826 : 3822
rl_1 -> rl_0 3821 - 3824 : 3820
rl_1 -> rl_3 3825 - 3827 : 3822
rl_1 -> rl_0 3825 - 3826 : 3822
rl_1 -> rl_2 3827 - 3827 : 3822
rl_1 -> rl_0 3827 - 3827 : 3822
rl_1 -> rl_4 3827 - 3827 : 3828
rl_1 -> rl_3 3828 - 3828 : 3828
rl_1 -> rl_2 0 - 0 : 3828
rl_1 -> rl_0 0 - 0 : 3828
rl_1 -> rl_3 0 - 0 : 3828

Everything looks normal (the commit index slightly lagging the log entries, as you'd expect) until you get to the RemovePeer entry. Note how the leaderCommitIndex jumps to 3828, even though that entry is only ever sent to one node.

Digging through the code, what appears to be happening is that membership changes get applied in 2 steps: raft.go line 926 explicitly makes an additional call to processLog with precommit=true. In processLog, line 1161, if we're removing ourselves, r.peers is set to nil. Regular log processing then continues, with dispatchLogs being called. At line 1074, the quorum policy for the log entries is created with applyLog.policy = newMajorityQuorum(len(r.peers) + 1); however, at this point r.peers is nil, so the quorum size is based on 1 and not the actual cluster size. As this log entry proceeds to the inflight list, it meets its quorum needs and is immediately committed.

The leader then proceeds with the regular call to processLog, which completes the removal of the leader and tells the replication routines to only replicate up to the RemovePeer entry. This exacerbates the problem: based on timing, only some small subset of the other nodes have the log entry after the RemovePeer (but based on the leaderCommitIndex, any node that does have the next log entry will apply it to its FSM).

[I'll be attaching a pull request to this issue that updates the RemoveLeader test to demonstrate this issue].

It seems like in the case where a node is being removed (including itself), processLog shouldn't update r.peers unless precommit is false. [But I need to walk through that some more.]

Logging destinations

The raft type takes a log.Logger for logging, but other parts (transport, snapshots) take an io.Writer instead.

It'd be useful if everything took a log.Logger, or even better, if there were a simple interface that defined the logging methods so users could provide their own implementation for their favorite logging destination. [A simple interface here would also make it easier to route logging to testing.T.Log during tests.]
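
A minimal interface along the lines suggested might look like this (purely illustrative, not the library's API):

package example

import "testing"

// Logger is a hypothetical pluggable logging destination with just the
// method the library would need.
type Logger interface {
    Printf(format string, args ...interface{})
}

// testLogger adapts *testing.T to Logger, routing output to t.Logf.
type testLogger struct{ t *testing.T }

func (l testLogger) Printf(format string, args ...interface{}) {
    l.t.Logf(format, args...)
}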

Very slow recovery from partitions

Under certain conditions, recovery for a follower after a partition can be very slow. What we see is that the follower rejects an AppendEntries call due to a term mismatch, and the leader replication loop treats this as a failure and applies the back-off timer to the loop. This means the next request to the follower, with the next lower index, waits up to 10 seconds, which makes recovery for the follower extremely slow.

The relevant log entries show
[lp_0 is the follower]
lp_0:18:11:27.965237 [WARN] raft: Previous log term mis-match: ours: 1 remote: 3
lp_0:18:11:38.208667 [WARN] raft: Previous log term mis-match: ours: 1 remote: 3
lp_0:18:11:48.450560 [WARN] raft: Previous log term mis-match: ours: 1 remote: 3

[lp_2 is the leader]
lp_2:18:11:27.965322 [WARN] raft: AppendEntries to lp_0 rejected, sending older logs (next: 374)
lp_2:18:11:38.209203 [WARN] raft: AppendEntries to lp_0 rejected, sending older logs (next: 373)
lp_2:18:11:48.450627 [WARN] raft: AppendEntries to lp_0 rejected, sending older logs (next: 372)

You can see here that each attempt waits ~10 seconds.

One potential fix would be for (*Raft).replicateTo() to reset failures to 0 in this case, e.g. in https://github.com/hashicorp/raft/blob/master/replication.go line 184, change s.failures++ to s.failures = 0.

Looking at the other ways appendEntries can return in this state, they don't appear to need the back-off on the next call either.

A more complex fix would be for the AppendEntriesResponse to get a new field to indicate if the retry delay can be skipped.
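
To illustrate the mechanism at issue (a generic sketch of capped exponential back-off, not the library's actual code): a log-mismatch rejection is progress, since nextIndex walks back one entry per attempt, so the proposal amounts to resetting the failure count that feeds this back-off instead of incrementing it.

package example

import "time"

// replicationBackoff returns the wait before the next attempt after
// `failures` consecutive failures, doubling from base up to max. With a
// ~10s cap, bumping failures on every log-mismatch rejection makes
// walking nextIndex back through a diverged log take ~10s per entry.
func replicationBackoff(base, max time.Duration, failures int) time.Duration {
    if failures == 0 {
        return 0
    }
    d := base
    for i := 1; i < failures; i++ {
        d *= 2
        if d >= max {
            return max
        }
    }
    return d
}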

Race Detector

There are 40+ race conditions when running the test suite:

$ go test -v -race

Here's the full output: https://gist.github.com/benbjohnson/e0fe56aed92c63b03d95


I'm getting similar numbers for both:

  • Mac OS X 10.8.5 - go version go1.2 darwin/amd64
  • Ubuntu 12.04.1 LTS (Vagrant) - go version go1.2.1 linux/amd64

I'll try looking at it later and send in a PR but I just wanted to log an issue so it doesn't get forgotten.

Panic getting logs after restoring from an older snapshot

https://gist.github.com/fahadullah/6089c60bf6c57da0143b

When we restore a snapshot, the lastApplied index is set to the snapshot index. If we can't restore from the latest snapshot, the lastApplied index is set to an old value, X. Since we had a successful snapshot afterwards, the log doesn't have entries from X+1 to the last index, and it panics looking up the missing entries.

I would have assumed we would move the applied index forward or fail to restart completely in this case instead of failing later.

Various raft tests fail occasionally

Background: I am testing a new transport, and to get it to work it needs to pass all tests. Unfortunately, some tests fail occasionally even with the existing InmemoryTransport.

Running

for i in `seq 1 50` ; do go test -run TestRaft_Voting ; done

I see the occasional failure, like:

--- FAIL: TestRaft_Voting (0.09s)
    raft_test.go:189: [WARN] Fully Connecting
    raft_test.go:93: [INFO] raft: Node at efecfb3b-81c1-a503-46c3-05fbf14221b6 [Follower] entering Follower state
    raft_test.go:93: [INFO] raft: Node at e6ba3ba4-a326-f7d0-e260-9d1cc55719d0 [Follower] entering Follower state
    raft_test.go:93: [INFO] raft: Node at 2696da81-b3ca-69e1-b09e-ef3f91fed6ea [Follower] entering Follower state
    raft_test.go:93: [WARN] raft: Heartbeat timeout reached, starting election
    raft_test.go:93: [INFO] raft: Node at e6ba3ba4-a326-f7d0-e260-9d1cc55719d0 [Candidate] entering Candidate state
    raft_test.go:93: [DEBUG] raft: Votes needed: 2
    raft_test.go:93: [DEBUG] raft: Vote granted from e6ba3ba4-a326-f7d0-e260-9d1cc55719d0. Tally: 1
    raft_test.go:93: [DEBUG] raft: Vote granted from efecfb3b-81c1-a503-46c3-05fbf14221b6. Tally: 2
    raft_test.go:93: [INFO] raft: Election won. Tally: 2
    raft_test.go:93: [INFO] raft: Node at e6ba3ba4-a326-f7d0-e260-9d1cc55719d0 [Leader] entering Leader state
    raft_test.go:93: [WARN] raft: Heartbeat timeout reached, starting election
    raft_test.go:93: [INFO] raft: Node at 2696da81-b3ca-69e1-b09e-ef3f91fed6ea [Candidate] entering Candidate state
    raft_test.go:93: [DEBUG] raft: Votes needed: 2
    raft_test.go:93: [DEBUG] raft: Vote granted from 2696da81-b3ca-69e1-b09e-ef3f91fed6ea. Tally: 1
    raft_test.go:93: [ERR] raft: peer 2696da81-b3ca-69e1-b09e-ef3f91fed6ea has newer term, stopping replication
    raft_test.go:93: [INFO] raft: pipelining replication to peer efecfb3b-81c1-a503-46c3-05fbf14221b6
    raft_test.go:93: [DEBUG] raft: Node e6ba3ba4-a326-f7d0-e260-9d1cc55719d0 updated peer set (2): [e6ba3ba4-a326-f7d0-e260-9d1cc55719d0 efecfb3b-81c1-a503-46c3-05fbf14221b6 2696da81-b3ca-69e1-b09e-ef3f91fed6ea]
    raft_test.go:93: [WARN] raft: Rejecting vote request from 2696da81-b3ca-69e1-b09e-ef3f91fed6ea since we have a leader: e6ba3ba4-a326-f7d0-e260-9d1cc55719d0
    raft_test.go:93: [INFO] raft: Node at e6ba3ba4-a326-f7d0-e260-9d1cc55719d0 [Follower] entering Follower state
    raft_test.go:93: [INFO] raft: Node at 2696da81-b3ca-69e1-b09e-ef3f91fed6ea [Follower] entering Follower state
    raft_test.go:1661: expected vote not to be granted, but was {Term:42 Peers:[146 218 0 36 101 102 101 99 102 98 51 98 45 56 49 99 49 45 97 53 48 51 45 52 54 99 51 45 48 53 102 98 102 49 52 50 50 49 98 54 218 0 36 101 54 98 97 51 98 97 52 45 97 51 50 54 45 102 55 100 48 45 101 50 54 48 45 57 100 49 99 99 53 53 55 49 57 100 48] Granted:true}
    raft_test.go:93: [INFO] raft: aborting pipeline replication to peer efecfb3b-81c1-a503-46c3-05fbf14221b6
FAIL
exit status 1
FAIL    github.com/hashicorp/raft   0.101s

raft.Snapshot() snapshots to 0 if there are no new log entries

I've noticed that under certain conditions, if you force a snapshot to be created via raft.Snapshot(), it reports that it has snapshotted to 0, e.g.:

10:07:45 graphr3 | 2015/08/05 10:07:45 [INFO] raft: Starting snapshot up to 0
10:07:45 graphr3 | 2015/08/05 10:07:45 [INFO] snapshot: Creating new snapshot at /media/data/code/graphr/data/3/snapshots/0-0-1438794465718.tmp
10:07:45 graphr3 | 2015/08/05 10:07:45 [INFO] snapshot: reaping snapshot /media/data/code/graphr/data/3/snapshots/0-0-1438794465718
10:07:45 graphr3 | 2015/08/05 10:07:45 [INFO] raft: Compacting logs from 385002 to 0
10:07:45 graphr3 | 2015/08/05 10:07:45 [INFO] raft: Snapshot to 0 complete

The easiest way to reproduce this is to start a node, force a snapshot, stop the node, restart it, and then force a snapshot again. It seems like something doesn't get initialized until a new log entry has been created.

Graceful leader removal

I am attempting to gracefully stop a daemon that is participating in Raft consensus by calling RemovePeer and then exiting, but in the case where that daemon is the leader, this has the effect of removing all other peers rather than just the leader.

I see there is a stepDown channel (triggered when the leader has a stale term), but that just transitions the node to a follower, leaving the daemon in a position where it may get elected once again.

What I want is a way to remove the leader and leave the rest of the cluster in a state where it will elect a new leader without requiring the current leader to vote.

Is this possible? I am happy to make changes if someone can point me in the right direction.

remove peer, re-add later fails to connect

Having a weird issue, assuming I'm doing something wrong...

I have a cluster. If I remove a peer (and actually remove it from the peer list), the raft lib works as expected: it stops trying to connect to the peer and everything.
If I re-add that peer to the cluster later on, the peer is successfully added; however, the leader is unable to connect to it.

In the leader logs I get this over and over: [DEBUG] raft: Failed to contact 172.17.0.5:2377 in <time>
and: Failed to heartbeat to 172.17.0.5:2377: read tcp 172.17.0.3:34570->172.17.0.5:2377: i/o timeout

In the re-added peer's log I get this over and over: raft-net: 172.17.0.5:2380 accepted connection from: 172.17.0.3:34570

Reading TCP dump, it seems like the peer is sending back a 0 byte response on each connection attempt.

_Note that the port numbers of course change after each connection attempt._

When the peer is simply disconnected and re-connected (not actually removed from the peer list), everything syncs back up as expected.

This is not a problem with a specific node; it happens on any node that I remove and then re-add.

This is happening with both my own peer store and stream layer implementations, as well as the standard JSON peer store and TCP transport/stream layer.

It also happens if I remove all the existing state on the re-added node (prior to adding, after removing).
It also continues to happen when a new leader is elected (on the new leader).

Peer should restore from logs

Peer changes should be bound to the log; they should be atomic.

If Raft crashes between DeleteRange and StoreLogs in appendEntries, or crashes before newRaft finishes, the peers stored in the PeerStore are not consistent with the log, which may affect leader election.

Add example

This implementation looks like it would be good to use in many different apps, but I would like a skeleton example so I can understand the API. There's an example in hraftd, but I think it's a bit too complicated to give a quick idea of what's happening.
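
A rough single-node skeleton, for illustration: the addresses, paths, and the no-op FSM are placeholders, and the Godoc should be checked for current signatures (this sketch assumes the v1.0-era API with BootstrapCluster).

package main

import (
    "io"
    "log"
    "net"
    "os"
    "time"

    "github.com/hashicorp/raft"
)

// noopFSM is a placeholder FSM that ignores all committed entries.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil } // stub only
func (noopFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

func main() {
    bind := "127.0.0.1:7000" // placeholder address

    config := raft.DefaultConfig()
    config.LocalID = raft.ServerID("node1")

    addr, err := net.ResolveTCPAddr("tcp", bind)
    if err != nil {
        log.Fatal(err)
    }
    transport, err := raft.NewTCPTransport(bind, addr, 3, 10*time.Second, os.Stderr)
    if err != nil {
        log.Fatal(err)
    }

    snapshots, err := raft.NewFileSnapshotStore("data", 2, os.Stderr)
    if err != nil {
        log.Fatal(err)
    }

    // In-memory stores keep the sketch dependency-free; production code
    // would use raft-mdb or raft-boltdb here.
    logStore := raft.NewInmemStore()
    stableStore := raft.NewInmemStore()

    r, err := raft.NewRaft(config, noopFSM{}, logStore, stableStore, snapshots, transport)
    if err != nil {
        log.Fatal(err)
    }

    // Bootstrap a fresh single-node cluster so it can elect itself leader.
    f := r.BootstrapCluster(raft.Configuration{
        Servers: []raft.Server{
            {ID: config.LocalID, Address: transport.LocalAddr()},
        },
    })
    if err := f.Error(); err != nil {
        log.Fatal(err)
    }
}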

Cleanup Meta Ticket

Here is the list of issues, grouped together where they might make sense as a single PR.

State Races

  • Raft state should use locks for any fields that are accessed together (e.g. index and term)

Multi-Row Fetching

  • Replace single row lookups with multi row lookups (LogStore / LogCache) (look at cases around log truncation)
  • Verify the current term has not changed when preparing/processing the AppendEntries message #136

Follower Replication:

  • replicateTo should verify leadership is current during looping
  • Check for any hot loops that do not break on stopCh

Change Inflight Tracking

  • Remove majorityQuorum
  • Inflight tracker should map Node -> Last Commit Index (match index)
  • Votes should be ignored from peers that are not part of peer set
  • precommit may not be necessary with new inflight (likely will be cleaned up via #117)

Improve Membership Tracking

  • Peer changes should have separate channel and do not pipeline (we don't want more than one peer change in flight at a time) #117
  • Peers.json should track index and any AddPeer or RemovePeer are ignored from older indexes - #117

Crashes / Restart Issues

  • Panic with old snapshots #85
  • TrailingLogs set to 0 with restart bug #86

New Tests

  • Config change under loss of quorum: #127
  • Setup cluster with {A, B, C, D}
  • Assume leader is A
  • Partition {C, D}
  • Remove {B}
  • Test should fail to remove B (quorum cannot be reached)

/cc: @superfell @ongardie @sean- @ryanuber @slackpad

Disregard RequestVote RPCs when they believe a current leader exists

This is an update to match the latest draft of the Raft paper, as added at the end of Section 6:

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

This helps to solve the problem of a previously removed node causing an availability outage by constantly restarting the elections.

What is CommitTimeout for?

CommitTimeout is set to 50ms by default, and I notice that the leader sends RPCs to all followers in that interval. config.go says "Time without an Apply() operation before we heartbeat to ensure a timely commit", but I am having trouble finding the necessity of such a timeout/forced AppendEntries RPC in the Raft paper.

Could you clarify what this is for, please? Is it only a performance optimization? What is the recommended value (is there a rule of thumb)? And what will happen if I set it to e.g. 10s while not changing any of the other timeouts?

Thanks!

Node doesn't vote for current leader

Part of the voting rules is that a node won't vote for another node if it thinks there's an active leader. However, it applies this rule even when the node asking for the vote is the very node the voter thinks is the leader, e.g. here node lp_0 refuses to vote for lp_3 because it thinks lp_3 is the leader:

lp_0:17:03:33.954809 [WARN] raft: Rejecting vote from lp_3 since we have a leader: lp_3

It seems like if the node asking for the vote is the node we think is the leader, we should go ahead and vote for it.
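
A sketch of the suggested rule (illustrative, not the library source):

package example

// rejectForExistingLeader sketches the proposed fix: a node that believes
// `leader` is current should still grant a vote when the candidate *is*
// that leader, instead of refusing as the current code does.
func rejectForExistingLeader(leader, candidate string) bool {
    return leader != "" && candidate != leader
}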

The correctness of pipeline mode

Hi,

I browsed the code and learned there is a pipeline mode for propagating log entries from the leader to the followers. I don't quite understand how it works or why it is correct, and the paper doesn't cover it. Could anyone elaborate on it? Thanks.

Occasional failures with go test -race

Occasionally I am seeing failures in the tests if race detection is turned on (go build -race, then go test -race). I don't mean races detected (though I see that too sometimes); I mean the test says 'FAIL' at the end of it.

Example:
https://gist.github.com/abligh/b22c1b7623a1ef02d8f7

Sometimes I see both a race detection and a failure, e.g:
https://gist.github.com/abligh/3167c49f1aab24bce844

Sometimes I see just a race detection log, e.g.:
https://gist.github.com/abligh/cb9c1a675dbbb213b655

I believe the race is a real race; it's examining r.peers without any protection. It would be nice if there were a thread-safe way to read peers anyway. This is the only race I've seen detected and is probably worth fixing.

However, my real concern is the random failures. These appear to happen in multiple tests (and not the one with the race). It looks to me like a timing problem of some sort, triggered by race detection making something slower. In particular, I don't know whether the timing issue is in the tests or the main code.

I'm running today's master on OS X on a 2013 MacBook Pro.

Decouple VerifyLeadership from disk writes

Currently, the AppendEntries call that is used to verify leadership may be blocked behind an AppendEntries call that is writing to disk. This can cause reads to stall.

seeing lots of DuplicateVotes and elections with no outcome

I'm seeing issues where, if I start 3 nodes at the same time (e.g. all on the same host using goreman or similar), it takes a long time to get an election that actually elects a leader. I see this pattern repeated:

10:12:26 n3 | 2015/07/22 10:12:26 [WARN] raft: Election timeout reached, restarting election
10:12:26 n3 | 2015/07/22 10:12:26 [INFO] raft: Node at 127.0.0.1:5003 [Candidate] entering Candidate state
10:12:26 n1 | 2015/07/22 10:12:26 [WARN] raft: Election timeout reached, restarting election
10:12:26 n1 | 2015/07/22 10:12:26 [INFO] raft: Node at 127.0.0.1:5001 [Candidate] entering Candidate state
10:12:26 n2 | 2015/07/22 10:12:26 [WARN] raft: Election timeout reached, restarting election
10:12:26 n2 | 2015/07/22 10:12:26 [INFO] raft: Node at 127.0.0.1:5002 [Candidate] entering Candidate state
10:12:26 n1 | 2015/07/22 10:12:26 [DEBUG] raft: Votes needed: 2
10:12:26 n1 | 2015/07/22 10:12:26 [DEBUG] raft: Vote granted. Tally: 1
10:12:26 n3 | 2015/07/22 10:12:26 [DEBUG] raft: Votes needed: 2
10:12:26 n3 | 2015/07/22 10:12:26 [DEBUG] raft: Vote granted. Tally: 1
10:12:26 n3 | 2015/07/22 10:12:26 [INFO] raft: Duplicate RequestVote for same term: 77
10:12:26 n1 | 2015/07/22 10:12:26 [INFO] raft: Duplicate RequestVote for same term: 77
10:12:26 n2 | 2015/07/22 10:12:26 [DEBUG] raft: Votes needed: 2
10:12:26 n2 | 2015/07/22 10:12:26 [DEBUG] raft: Vote granted. Tally: 1
10:12:26 n1 | 2015/07/22 10:12:26 [INFO] raft: Duplicate RequestVote for same term: 77
10:12:26 n3 | 2015/07/22 10:12:26 [INFO] raft: Duplicate RequestVote for same term: 77
10:12:26 n2 | 2015/07/22 10:12:26 [INFO] raft: Duplicate RequestVote for same term: 77
10:12:26 n2 | 2015/07/22 10:12:26 [INFO] raft: Duplicate RequestVote for same term: 77
10:12:28 n3 | 2015/07/22 10:12:28 [WARN] raft: Election timeout reached, restarting election

This pattern repeated for 40 terms before a leader was finally elected.

I was able to repro this with a simple 3-node config/wrapper with the bolt store and a no-op FSM. See this gist for the code:
https://gist.github.com/superfell/1c8f81df0ccd955ed3fb

This uses the default config as returned by raft.DefaultConfig(); perhaps there's not enough variability in the election timeout?

A set of newer servers can force data loss on an older leader

Going through old notes and creating an issue for something one of our customers saw. Unfortunately I can't share the logs, but here's the conversation describing what happened:

Seems like there is probably some way to get a minority to thrash doing leader election over and over until their term takes over, and they can then force data loss on the minority that actually has data.

The problem seems to be that quorum is formed on nodes which aren't aware of the data on another node, so the reconciliation once the new leader is elected bombs the data.

The two sides actually try to reconcile to avoid this, but it turns out to be possible to get things to line up just right and nuke the data. When a new entry is appended, we send {PreviousTerm, PreviousIndex} and the client verifies that before committing new data. The problem is that getting a collision on {PreviousTerm, PreviousIndex} is easy if the terms are low-cardinality and the index is monotonic. We probably want something like {PreviousTerm, PreviousIndex, PreviousMD5}, or even SHA-1, just to ensure that a collision is not accidental.

I think what is happening is that there were 3 nodes and 2 went away, so 1 "old" node has the data. Then 2 "new" nodes come up and fight to become leader for a long time (the old node ruins things by rejecting everything due to its newer data). Eventually the "new" nodes drive up the PreviousTerm and PreviousIndex until they collide with the old node. Then the old node just nukes its data and everything "syncs".

Sanity check configuration

Currently we just assume the configuration is mostly sane. We should do a better job of validating the configuration.

Verify that the stop channel always gets closed when we remove a peer

We were looking at LogRemovePeer and thought there might be a case where the peers list doesn't have an entry for a peer, so the current logic would skip closing the stop channel.

@sean- attempted a fix in #64, but it was failing tests and we weren't sure the fix was actually necessary. Go through this code again and double-check that things are OK.

Add read timeouts to incoming RPC commands

We need to apply read timeouts to the incoming RPC commands. There is one particular case, install snapshot, where the reader is passed through into the state FSM. A blocking read is bad; we should guard against both the read timing out and the enqueue timing out.
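
A common pattern for this (illustrative, standard library only) is to arm a deadline on the underlying connection before each read:

package example

import (
    "net"
    "time"
)

// readWithTimeout guards a blocking read: if the peer stalls (e.g. midway
// through streaming a snapshot), the read fails with a timeout error
// instead of wedging the RPC handler forever.
func readWithTimeout(conn net.Conn, buf []byte, timeout time.Duration) (int, error) {
    if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
        return 0, err
    }
    return conn.Read(buf)
}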

How to get followers from leader ?

Hi,

I got a basic raft server running.

It seems raft provides no way to retrieve the []net.Addr of the followers from the leader. I want to reach the followers from the leader (ideally without having to deal with whether they are up or not, since raft already handles this). What is the best way to do that?

Should I read from the PeerStore and contact the servers myself? (It seems that Raft.peers contains all peers, not only followers.)

Any thoughts?
Thanks!

Tests are unstable

Hello!

I have packaged hashicorp/raft for Debian. (Available in unstable here)

Sometimes, the tests fail. You may see the logs here

The error we are getting is

--- FAIL: TestRaft_LeaderFail (0.33s)
    raft_test.go:511: expected new leader

And some other times:

--- FAIL: TestRaft_LeaderLeaseExpire (0.18s)
    raft_test.go:1292: bad: 69f816ea-8214-d42c-015a-e4320f2ebac8

The tests pass about 50% of the time.

Is this a known issue?

Thanks,

vote requests denied incorrectly

The log comparison in the RequestVote request handler is incorrect. The current code, https://github.com/hashicorp/raft/blob/057b893/raft.go#L1463-L1475, rejects the request if the voter's last log term is greater than the candidate's OR if the voter's last log index is greater than the candidate's.

The Raft paper describes the intended comparison (emphasis added):

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs _end with the same term_, then whichever log is longer is more up-to-date.

I believe this can lead to stuck clusters where no leader can be elected. For example:

S1: [1,1,1,1]
S2: [1,1,1,1]
S3: [1,1,2]

Suppose S2 fails, but S1 and S3 are still operational. S1 will not vote for S3, and S3 will not vote for S1.
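
The paper's comparison, as a sketch of the intended rule (not the library's code):

package example

// candidateUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: a later last term wins outright; equal last
// terms fall back to comparing log length.
func candidateUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex uint64) bool {
    if candLastTerm != voterLastTerm {
        return candLastTerm > voterLastTerm
    }
    return candLastIndex >= voterLastIndex
}

// In the example above, S3's last term (2) beats S1's (1), so S1 should
// grant S3's request even though S1's log is longer.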

Add a cluster ID type construct to avoid mixing clusters

We currently use the encryption key to protect against merging of clusters in Serf/Consul, but this does not protect the Raft or RPC layers (TLS with verification turned on still protects against random outsiders interfering, but doesn't protect against most operator error). We should introduce a ClusterID type concept to help prevent merging of clusters due to operator error.

There were two ideas that came from @sean- while discussing this at HashiConf:

  • Create a way to pin to a specific list of peer IPs.
  • Implement a "trust on first use" scheme:

If the opportunity presents itself, is there any chance of a TOFU symmetric key that could be introduced for the Consul Raft traffic? I'm a huge fan of NaCl, and a TOFU exchange using just NaCl's box asymmetric crypto to establish a key, then using Salsa for the remainder of the relationship between peers until a rekey operation is performed, would be epic and would harden Consul considerably. See https://www.godoc.org/golang.org/x/crypto/nacl/box and https://www.godoc.org/golang.org/x/crypto/salsa20.
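
For flavor, the primitives being proposed (a sketch of the key-establishment half only; all protocol details beyond the nacl/box calls are assumptions):

package example

import (
    "crypto/rand"

    "golang.org/x/crypto/nacl/box"
)

// generateIdentity creates the long-lived keypair a peer would present on
// first contact; under TOFU the remote side pins this public key and
// rejects future sessions where it changes.
func generateIdentity() (pub, priv *[32]byte, err error) {
    return box.GenerateKey(rand.Reader)
}

// seal encrypts one message to a pinned peer with NaCl box (Curve25519 +
// XSalsa20-Poly1305). The nonce must never repeat for a given key pair.
func seal(msg []byte, nonce *[24]byte, peerPub, priv *[32]byte) []byte {
    return box.Seal(nil, msg, nonce, peerPub, priv)
}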

Benchmarking

@armon

I was curious how the pipelining was performing, so I tried implementing a purely synthetic benchmark, but I hit an issue. I may very well just be missing something, though.

I'm trying to just pump as many commands through as possible:

func BenchmarkRaftTripleNode(b *testing.B) {
    b.ReportAllocs()

    // Make the cluster
    c := MakeCluster(3, b, nil)
    defer c.Close()

    // Should be one leader
    leader := c.Leader()
    c.EnsureLeader(b, leader.localAddr)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        cmd := make([]byte, 8)
        binary.BigEndian.PutUint64(cmd, uint64(i))

        future := leader.Apply(cmd, 0)
        if err := future.Error(); err != nil {
            b.Fatalf("err: %d: %v", i, err)
        }
    }
}

If I run it with the default -benchtime 1s then it works and I get 21,754 cmds/sec:

$ go test -v -run=XXX -bench=. -benchmem -benchtime=1s
PASS
BenchmarkRaftTripleNode 2014/08/29 10:31:01 [WARN] Fully Connecting
2014/08/29 10:31:01 [INFO] raft: Node at 00e35516-db0d-762c-23f7-714f51fc78e7 [Follower] entering Follower state
2014/08/29 10:31:01 [INFO] raft: Node at e7a92a8e-39c0-e5fd-5424-36ff682db8f6 [Follower] entering Follower state
2014/08/29 10:31:01 [INFO] raft: Node at 1d5fb5d9-2d32-1abf-c637-87439561cfdf [Follower] entering Follower state
2014/08/29 10:31:01 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:31:01 [INFO] raft: Node at e7a92a8e-39c0-e5fd-5424-36ff682db8f6 [Candidate] entering Candidate state
2014/08/29 10:31:01 [DEBUG] raft: Votes needed: 2
2014/08/29 10:31:01 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:31:01 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:31:01 [INFO] raft: Election won. Tally: 2
2014/08/29 10:31:01 [INFO] raft: Node at e7a92a8e-39c0-e5fd-5424-36ff682db8f6 [Leader] entering Leader state
2014/08/29 10:31:01 [INFO] raft: pipelining replication to peer 00e35516-db0d-762c-23f7-714f51fc78e7
2014/08/29 10:31:01 [INFO] raft: pipelining replication to peer 1d5fb5d9-2d32-1abf-c637-87439561cfdf
2014/08/29 10:31:01 [DEBUG] raft: Node e7a92a8e-39c0-e5fd-5424-36ff682db8f6 updated peer set (2): [e7a92a8e-39c0-e5fd-5424-36ff682db8f6 00e35516-db0d-762c-23f7-714f51fc78e7 1d5fb5d9-2d32-1abf-c637-87439561cfdf]
2014/08/29 10:31:01 [DEBUG] raft: Node 00e35516-db0d-762c-23f7-714f51fc78e7 updated peer set (2): [e7a92a8e-39c0-e5fd-5424-36ff682db8f6 00e35516-db0d-762c-23f7-714f51fc78e7 1d5fb5d9-2d32-1abf-c637-87439561cfdf]
2014/08/29 10:31:01 [DEBUG] raft: Node 1d5fb5d9-2d32-1abf-c637-87439561cfdf updated peer set (2): [e7a92a8e-39c0-e5fd-5424-36ff682db8f6 00e35516-db0d-762c-23f7-714f51fc78e7 1d5fb5d9-2d32-1abf-c637-87439561cfdf]
2014/08/29 10:31:01 [INFO] raft: aborting pipeline replication to peer 1d5fb5d9-2d32-1abf-c637-87439561cfdf
2014/08/29 10:31:01 [INFO] raft: aborting pipeline replication to peer 00e35516-db0d-762c-23f7-714f51fc78e7
2014/08/29 10:31:01 [WARN] Fully Connecting
2014/08/29 10:31:01 [INFO] raft: Node at 2f89195d-9ef2-e61b-f17c-ef8cbde7a980 [Follower] entering Follower state
2014/08/29 10:31:01 [INFO] raft: Node at 235fe4e8-0ccf-107b-fe52-d5a28a88ed24 [Follower] entering Follower state
2014/08/29 10:31:01 [INFO] raft: Node at d8a0b93f-fb76-c63c-5fdd-6b91748ac76e [Follower] entering Follower state
2014/08/29 10:31:01 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:31:01 [INFO] raft: Node at 235fe4e8-0ccf-107b-fe52-d5a28a88ed24 [Candidate] entering Candidate state
2014/08/29 10:31:01 [DEBUG] raft: Votes needed: 2
2014/08/29 10:31:01 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:31:01 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:31:01 [INFO] raft: Election won. Tally: 2
2014/08/29 10:31:01 [INFO] raft: Node at 235fe4e8-0ccf-107b-fe52-d5a28a88ed24 [Leader] entering Leader state
2014/08/29 10:31:01 [INFO] raft: pipelining replication to peer 2f89195d-9ef2-e61b-f17c-ef8cbde7a980
2014/08/29 10:31:01 [INFO] raft: pipelining replication to peer d8a0b93f-fb76-c63c-5fdd-6b91748ac76e
2014/08/29 10:31:01 [DEBUG] raft: Node 235fe4e8-0ccf-107b-fe52-d5a28a88ed24 updated peer set (2): [235fe4e8-0ccf-107b-fe52-d5a28a88ed24 2f89195d-9ef2-e61b-f17c-ef8cbde7a980 d8a0b93f-fb76-c63c-5fdd-6b91748ac76e]
2014/08/29 10:31:01 [DEBUG] raft: Node 2f89195d-9ef2-e61b-f17c-ef8cbde7a980 updated peer set (2): [235fe4e8-0ccf-107b-fe52-d5a28a88ed24 2f89195d-9ef2-e61b-f17c-ef8cbde7a980 d8a0b93f-fb76-c63c-5fdd-6b91748ac76e]
2014/08/29 10:31:01 [DEBUG] raft: Node d8a0b93f-fb76-c63c-5fdd-6b91748ac76e updated peer set (2): [235fe4e8-0ccf-107b-fe52-d5a28a88ed24 2f89195d-9ef2-e61b-f17c-ef8cbde7a980 d8a0b93f-fb76-c63c-5fdd-6b91748ac76e]
2014/08/29 10:31:01 [INFO] raft: aborting pipeline replication to peer d8a0b93f-fb76-c63c-5fdd-6b91748ac76e
2014/08/29 10:31:01 [INFO] raft: aborting pipeline replication to peer 2f89195d-9ef2-e61b-f17c-ef8cbde7a980
2014/08/29 10:31:01 [WARN] Fully Connecting
2014/08/29 10:31:01 [INFO] raft: Node at 71eeb349-c628-c18b-efa7-90b8b7bdf1d2 [Follower] entering Follower state
2014/08/29 10:31:01 [INFO] raft: Node at 75ec9f18-fa26-986a-d0b5-ff17f675e167 [Follower] entering Follower state
2014/08/29 10:31:01 [INFO] raft: Node at 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb [Follower] entering Follower state
2014/08/29 10:31:01 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:31:01 [INFO] raft: Node at 71eeb349-c628-c18b-efa7-90b8b7bdf1d2 [Candidate] entering Candidate state
2014/08/29 10:31:01 [DEBUG] raft: Votes needed: 2
2014/08/29 10:31:01 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:31:01 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:31:01 [INFO] raft: Election won. Tally: 2
2014/08/29 10:31:01 [INFO] raft: Node at 71eeb349-c628-c18b-efa7-90b8b7bdf1d2 [Leader] entering Leader state
2014/08/29 10:31:01 [INFO] raft: pipelining replication to peer 75ec9f18-fa26-986a-d0b5-ff17f675e167
2014/08/29 10:31:01 [INFO] raft: pipelining replication to peer 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb
2014/08/29 10:31:01 [DEBUG] raft: Node 71eeb349-c628-c18b-efa7-90b8b7bdf1d2 updated peer set (2): [71eeb349-c628-c18b-efa7-90b8b7bdf1d2 75ec9f18-fa26-986a-d0b5-ff17f675e167 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb]
2014/08/29 10:31:01 [DEBUG] raft: Node 75ec9f18-fa26-986a-d0b5-ff17f675e167 updated peer set (2): [71eeb349-c628-c18b-efa7-90b8b7bdf1d2 75ec9f18-fa26-986a-d0b5-ff17f675e167 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb]
2014/08/29 10:31:01 [DEBUG] raft: Node 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb updated peer set (2): [71eeb349-c628-c18b-efa7-90b8b7bdf1d2 75ec9f18-fa26-986a-d0b5-ff17f675e167 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb]
2014/08/29 10:31:02 [INFO] raft: aborting pipeline replication to peer 75ec9f18-fa26-986a-d0b5-ff17f675e167
2014/08/29 10:31:02 [INFO] raft: aborting pipeline replication to peer 09d5f23b-e44b-a6c7-5e73-44c8afdcbcfb
2014/08/29 10:31:02 [WARN] Fully Connecting
2014/08/29 10:31:02 [INFO] raft: Node at ad190190-8c9b-4c2b-a801-373e9386fd69 [Follower] entering Follower state
2014/08/29 10:31:02 [INFO] raft: Node at b3683bfd-6844-414b-6f5d-81d365b77088 [Follower] entering Follower state
2014/08/29 10:31:02 [INFO] raft: Node at d9504e37-00cb-6f40-467c-5ce5dc503835 [Follower] entering Follower state
2014/08/29 10:31:02 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:31:02 [INFO] raft: Node at b3683bfd-6844-414b-6f5d-81d365b77088 [Candidate] entering Candidate state
2014/08/29 10:31:02 [DEBUG] raft: Votes needed: 2
2014/08/29 10:31:02 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:31:02 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:31:02 [INFO] raft: Election won. Tally: 2
2014/08/29 10:31:02 [INFO] raft: Node at b3683bfd-6844-414b-6f5d-81d365b77088 [Leader] entering Leader state
2014/08/29 10:31:02 [INFO] raft: pipelining replication to peer ad190190-8c9b-4c2b-a801-373e9386fd69
2014/08/29 10:31:02 [INFO] raft: pipelining replication to peer d9504e37-00cb-6f40-467c-5ce5dc503835
2014/08/29 10:31:02 [DEBUG] raft: Node b3683bfd-6844-414b-6f5d-81d365b77088 updated peer set (2): [b3683bfd-6844-414b-6f5d-81d365b77088 ad190190-8c9b-4c2b-a801-373e9386fd69 d9504e37-00cb-6f40-467c-5ce5dc503835]
2014/08/29 10:31:02 [DEBUG] raft: Node ad190190-8c9b-4c2b-a801-373e9386fd69 updated peer set (2): [b3683bfd-6844-414b-6f5d-81d365b77088 ad190190-8c9b-4c2b-a801-373e9386fd69 d9504e37-00cb-6f40-467c-5ce5dc503835]
2014/08/29 10:31:02 [DEBUG] raft: Node d9504e37-00cb-6f40-467c-5ce5dc503835 updated peer set (2): [b3683bfd-6844-414b-6f5d-81d365b77088 ad190190-8c9b-4c2b-a801-373e9386fd69 d9504e37-00cb-6f40-467c-5ce5dc503835]
2014/08/29 10:31:04 [INFO] raft: aborting pipeline replication to peer ad190190-8c9b-4c2b-a801-373e9386fd69
2014/08/29 10:31:04 [INFO] raft: aborting pipeline replication to peer d9504e37-00cb-6f40-467c-5ce5dc503835
   50000         45967 ns/op       13183 B/op         84 allocs/op
ok      github.com/hashicorp/raft   3.019s

If I run it with a larger -benchtime of 2s or higher, then I get a timeout around the 60,000th command:

$ go test -v -run=XXX -bench=. -benchmem -benchtime=2s 
PASS
BenchmarkRaftTripleNode 2014/08/29 10:33:48 [WARN] Fully Connecting
2014/08/29 10:33:48 [INFO] raft: Node at fb268452-07fe-410a-8e7b-1e9a1e07100d [Follower] entering Follower state
2014/08/29 10:33:48 [INFO] raft: Node at bc16fa48-e344-d477-5af4-72be88532b91 [Follower] entering Follower state
2014/08/29 10:33:48 [INFO] raft: Node at 6196b181-e364-d199-e906-a3ac1aa257b5 [Follower] entering Follower state
2014/08/29 10:33:48 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:33:48 [INFO] raft: Node at bc16fa48-e344-d477-5af4-72be88532b91 [Candidate] entering Candidate state
2014/08/29 10:33:48 [DEBUG] raft: Votes needed: 2
2014/08/29 10:33:48 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:33:48 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:33:48 [INFO] raft: Election won. Tally: 2
2014/08/29 10:33:48 [INFO] raft: Node at bc16fa48-e344-d477-5af4-72be88532b91 [Leader] entering Leader state
2014/08/29 10:33:48 [INFO] raft: pipelining replication to peer fb268452-07fe-410a-8e7b-1e9a1e07100d
2014/08/29 10:33:48 [INFO] raft: pipelining replication to peer 6196b181-e364-d199-e906-a3ac1aa257b5
2014/08/29 10:33:48 [DEBUG] raft: Node bc16fa48-e344-d477-5af4-72be88532b91 updated peer set (2): [bc16fa48-e344-d477-5af4-72be88532b91 fb268452-07fe-410a-8e7b-1e9a1e07100d 6196b181-e364-d199-e906-a3ac1aa257b5]
2014/08/29 10:33:48 [DEBUG] raft: Node fb268452-07fe-410a-8e7b-1e9a1e07100d updated peer set (2): [bc16fa48-e344-d477-5af4-72be88532b91 fb268452-07fe-410a-8e7b-1e9a1e07100d 6196b181-e364-d199-e906-a3ac1aa257b5]
2014/08/29 10:33:48 [DEBUG] raft: Node 6196b181-e364-d199-e906-a3ac1aa257b5 updated peer set (2): [bc16fa48-e344-d477-5af4-72be88532b91 fb268452-07fe-410a-8e7b-1e9a1e07100d 6196b181-e364-d199-e906-a3ac1aa257b5]
2014/08/29 10:33:48 [INFO] raft: aborting pipeline replication to peer fb268452-07fe-410a-8e7b-1e9a1e07100d
2014/08/29 10:33:48 [INFO] raft: aborting pipeline replication to peer 6196b181-e364-d199-e906-a3ac1aa257b5
2014/08/29 10:33:48 [WARN] Fully Connecting
2014/08/29 10:33:48 [INFO] raft: Node at 70d07722-a168-ca29-7987-5cafc20426e2 [Follower] entering Follower state
2014/08/29 10:33:48 [INFO] raft: Node at 40e4889a-3eae-c654-0849-cf7dffe7f27d [Follower] entering Follower state
2014/08/29 10:33:48 [INFO] raft: Node at 419c7388-5d50-0669-938d-7798bbbf6721 [Follower] entering Follower state
2014/08/29 10:33:48 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:33:48 [INFO] raft: Node at 40e4889a-3eae-c654-0849-cf7dffe7f27d [Candidate] entering Candidate state
2014/08/29 10:33:48 [DEBUG] raft: Votes needed: 2
2014/08/29 10:33:48 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:33:48 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:33:48 [INFO] raft: Election won. Tally: 2
2014/08/29 10:33:48 [INFO] raft: Node at 40e4889a-3eae-c654-0849-cf7dffe7f27d [Leader] entering Leader state
2014/08/29 10:33:48 [INFO] raft: pipelining replication to peer 70d07722-a168-ca29-7987-5cafc20426e2
2014/08/29 10:33:48 [INFO] raft: pipelining replication to peer 419c7388-5d50-0669-938d-7798bbbf6721
2014/08/29 10:33:48 [DEBUG] raft: Node 40e4889a-3eae-c654-0849-cf7dffe7f27d updated peer set (2): [40e4889a-3eae-c654-0849-cf7dffe7f27d 70d07722-a168-ca29-7987-5cafc20426e2 419c7388-5d50-0669-938d-7798bbbf6721]
2014/08/29 10:33:48 [DEBUG] raft: Node 70d07722-a168-ca29-7987-5cafc20426e2 updated peer set (2): [40e4889a-3eae-c654-0849-cf7dffe7f27d 70d07722-a168-ca29-7987-5cafc20426e2 419c7388-5d50-0669-938d-7798bbbf6721]
2014/08/29 10:33:48 [DEBUG] raft: Node 419c7388-5d50-0669-938d-7798bbbf6721 updated peer set (2): [40e4889a-3eae-c654-0849-cf7dffe7f27d 70d07722-a168-ca29-7987-5cafc20426e2 419c7388-5d50-0669-938d-7798bbbf6721]
2014/08/29 10:33:48 [INFO] raft: aborting pipeline replication to peer 70d07722-a168-ca29-7987-5cafc20426e2
2014/08/29 10:33:48 [INFO] raft: aborting pipeline replication to peer 419c7388-5d50-0669-938d-7798bbbf6721
2014/08/29 10:33:48 [WARN] Fully Connecting
2014/08/29 10:33:48 [INFO] raft: Node at 0b48649f-ca20-193c-59b0-5f3bb9383832 [Follower] entering Follower state
2014/08/29 10:33:48 [INFO] raft: Node at ae46447b-d128-812b-6521-43392de18855 [Follower] entering Follower state
2014/08/29 10:33:48 [INFO] raft: Node at 4ecf766e-b364-d73f-3acc-4ac7c276bb48 [Follower] entering Follower state
2014/08/29 10:33:48 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:33:48 [INFO] raft: Node at 0b48649f-ca20-193c-59b0-5f3bb9383832 [Candidate] entering Candidate state
2014/08/29 10:33:48 [DEBUG] raft: Votes needed: 2
2014/08/29 10:33:48 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:33:48 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:33:48 [INFO] raft: Election won. Tally: 2
2014/08/29 10:33:48 [INFO] raft: Node at 0b48649f-ca20-193c-59b0-5f3bb9383832 [Leader] entering Leader state
2014/08/29 10:33:48 [INFO] raft: pipelining replication to peer ae46447b-d128-812b-6521-43392de18855
2014/08/29 10:33:48 [INFO] raft: pipelining replication to peer 4ecf766e-b364-d73f-3acc-4ac7c276bb48
2014/08/29 10:33:48 [DEBUG] raft: Node 0b48649f-ca20-193c-59b0-5f3bb9383832 updated peer set (2): [0b48649f-ca20-193c-59b0-5f3bb9383832 ae46447b-d128-812b-6521-43392de18855 4ecf766e-b364-d73f-3acc-4ac7c276bb48]
2014/08/29 10:33:48 [DEBUG] raft: Node ae46447b-d128-812b-6521-43392de18855 updated peer set (2): [0b48649f-ca20-193c-59b0-5f3bb9383832 ae46447b-d128-812b-6521-43392de18855 4ecf766e-b364-d73f-3acc-4ac7c276bb48]
2014/08/29 10:33:48 [DEBUG] raft: Node 4ecf766e-b364-d73f-3acc-4ac7c276bb48 updated peer set (2): [0b48649f-ca20-193c-59b0-5f3bb9383832 ae46447b-d128-812b-6521-43392de18855 4ecf766e-b364-d73f-3acc-4ac7c276bb48]
2014/08/29 10:33:49 [INFO] raft: aborting pipeline replication to peer ae46447b-d128-812b-6521-43392de18855
2014/08/29 10:33:49 [INFO] raft: aborting pipeline replication to peer 4ecf766e-b364-d73f-3acc-4ac7c276bb48
2014/08/29 10:33:49 [WARN] Fully Connecting
2014/08/29 10:33:49 [INFO] raft: Node at 153d6df7-c30f-9597-f6cb-d71d26f8a7e2 [Follower] entering Follower state
2014/08/29 10:33:49 [INFO] raft: Node at e25757a9-890c-d4df-2015-31fd80786594 [Follower] entering Follower state
2014/08/29 10:33:49 [INFO] raft: Node at ef6fe31f-7daf-55e0-e6e9-b8e86464ff66 [Follower] entering Follower state
2014/08/29 10:33:49 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:33:49 [INFO] raft: Node at e25757a9-890c-d4df-2015-31fd80786594 [Candidate] entering Candidate state
2014/08/29 10:33:49 [DEBUG] raft: Votes needed: 2
2014/08/29 10:33:49 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:33:49 [DEBUG] raft: Vote granted. Tally: 2
2014/08/29 10:33:49 [INFO] raft: Election won. Tally: 2
2014/08/29 10:33:49 [INFO] raft: Node at e25757a9-890c-d4df-2015-31fd80786594 [Leader] entering Leader state
2014/08/29 10:33:49 [INFO] raft: pipelining replication to peer 153d6df7-c30f-9597-f6cb-d71d26f8a7e2
2014/08/29 10:33:49 [INFO] raft: pipelining replication to peer ef6fe31f-7daf-55e0-e6e9-b8e86464ff66
2014/08/29 10:33:49 [DEBUG] raft: Node e25757a9-890c-d4df-2015-31fd80786594 updated peer set (2): [e25757a9-890c-d4df-2015-31fd80786594 153d6df7-c30f-9597-f6cb-d71d26f8a7e2 ef6fe31f-7daf-55e0-e6e9-b8e86464ff66]
2014/08/29 10:33:49 [DEBUG] raft: Node 153d6df7-c30f-9597-f6cb-d71d26f8a7e2 updated peer set (2): [e25757a9-890c-d4df-2015-31fd80786594 153d6df7-c30f-9597-f6cb-d71d26f8a7e2 ef6fe31f-7daf-55e0-e6e9-b8e86464ff66]
2014/08/29 10:33:49 [DEBUG] raft: Node ef6fe31f-7daf-55e0-e6e9-b8e86464ff66 updated peer set (2): [e25757a9-890c-d4df-2015-31fd80786594 153d6df7-c30f-9597-f6cb-d71d26f8a7e2 ef6fe31f-7daf-55e0-e6e9-b8e86464ff66]
2014/08/29 10:33:51 [WARN] raft: Failed to contact 153d6df7-c30f-9597-f6cb-d71d26f8a7e2 in 54.326937ms
2014/08/29 10:33:51 [WARN] raft: Failed to contact ef6fe31f-7daf-55e0-e6e9-b8e86464ff66 in 54.325185ms
2014/08/29 10:33:51 [WARN] raft: Failed to contact quorum of nodes, stepping down
2014/08/29 10:33:51 [INFO] raft: Node at e25757a9-890c-d4df-2015-31fd80786594 [Follower] entering Follower state
2014/08/29 10:33:51 [WARN] raft: Heartbeat timeout reached, starting election
2014/08/29 10:33:51 [INFO] raft: Node at ef6fe31f-7daf-55e0-e6e9-b8e86464ff66 [Candidate] entering Candidate state
2014/08/29 10:33:51 [DEBUG] raft: Votes needed: 2
2014/08/29 10:33:51 [DEBUG] raft: Vote granted. Tally: 1
2014/08/29 10:33:51 [ERR] raft: peer ef6fe31f-7daf-55e0-e6e9-b8e86464ff66 has newer term, stopping replication
2014/08/29 10:33:51 [WARN] raft: Rejecting vote from ef6fe31f-7daf-55e0-e6e9-b8e86464ff66 since we have a leader: e25757a9-890c-d4df-2015-31fd80786594
2014/08/29 10:33:51 [INFO] raft: aborting pipeline replication to peer 153d6df7-c30f-9597-f6cb-d71d26f8a7e2
2014/08/29 10:33:51 [INFO] raft: aborting pipeline replication to peer ef6fe31f-7daf-55e0-e6e9-b8e86464ff66
2014/08/29 10:33:51 [ERR] raft: Failed to make RequestVote RPC to e25757a9-890c-d4df-2015-31fd80786594: command timed out
--- FAIL: BenchmarkRaftTripleNode
    raft_test.go:1408: err: 63147: node is not the leader
ok      github.com/hashicorp/raft   3.260s

I tried adjusting the heartbeat and election timeouts but I keep hitting replication issues around the 60,000th command.

Does my benchmark code look correct? I was trying to base it off the TestRaft_TripleNode() test.

Thanks!
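
One aside, under the same assumptions as the code above: waiting on each future before issuing the next Apply serializes the benchmark, so the pipeline never holds more than one command in flight. A variant that keeps a window of applies outstanding may exercise pipelining better; a sketch:

func BenchmarkRaftTripleNodePipelined(b *testing.B) {
    b.ReportAllocs()

    // Make the cluster (same helpers as the benchmark above)
    c := MakeCluster(3, b, nil)
    defer c.Close()
    leader := c.Leader()
    c.EnsureLeader(b, leader.localAddr)

    b.ResetTimer()
    const window = 64 // arbitrary number of in-flight commands
    futures := make([]interface{ Error() error }, 0, window)
    for i := 0; i < b.N; i++ {
        cmd := make([]byte, 8)
        binary.BigEndian.PutUint64(cmd, uint64(i))
        futures = append(futures, leader.Apply(cmd, 0))
        if len(futures) == window {
            for _, f := range futures {
                if err := f.Error(); err != nil {
                    b.Fatalf("err: %v", err)
                }
            }
            futures = futures[:0]
        }
    }
    // Drain whatever is still in flight.
    for _, f := range futures {
        if err := f.Error(); err != nil {
            b.Fatalf("err: %v", err)
        }
    }
}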

Inconsistency in closing pipelines

I am a little confused about whether and when pipelines should be closed by the transport.

net_transport.go maintains no record of open pipelines, which means netPipeline.Close() is only called by raft. It's not obvious to me that this is always called when raft is shut down, though I suppose it may be.

inmem_transport.go maintains a slice of pipelines (pipelines). The main purpose of this appears to be to call Close() on each of the pipelines when a peer is Disconnect()-ed. Fair enough, but when a pipeline is closed by raft itself (i.e. balancing the AppendEntriesPipeline call), it is not removed from the slice, meaning the slice will grow indefinitely.

I think both behaviours are wrong:

  • inmem_transport.go should be removing the entries from the slice of pipelines when Close() is called, however Close() is called.
  • net_transport.go should in theory be tracking pipelines so that the Close() method could close them; I suspect in the general case raft has already cleared these up.

Could someone confirm please?

Include voter in vote debug logging

The current logging around voting/elections typically looks like this:

16:57:19.062392 [INFO] raft: Node at sv_4 [Candidate] entering Candidate state
16:57:19.062889 [DEBUG] raft: Votes needed: 3
16:57:19.062901 [DEBUG] raft: Vote granted. Tally: 1
16:57:19.063332 [DEBUG] raft: Vote granted. Tally: 2
16:57:19.063595 [DEBUG] raft: Vote granted. Tally: 3
16:57:19.063632 [INFO] raft: Election won. Tally: 3

It'd be useful if the "Vote granted" lines included the name of the node that granted the vote.

TestFileSS_BadPerm fails on win32

TestFileSS_BadPerm fails on win32:

--- FAIL: TestFileSS_BadPerm (0.00s)
        file_snapshot_test.go:249: should fail to use root

This patch can be applied:

diff --git a/file_snapshot_test.go b/file_snapshot_test.go
index e5336e5..44dbf22 100644
--- a/file_snapshot_test.go
+++ b/file_snapshot_test.go
@@ -5,6 +5,7 @@ import (
        "io"
        "io/ioutil"
        "os"
+       "runtime"
        "testing"
 )

@@ -243,6 +244,9 @@ func TestFileSS_Retention(t *testing.T) {
 }

 func TestFileSS_BadPerm(t *testing.T) {
+       if "windows" == runtime.GOOS {
+               t.Skip("skip it on the windows.")
+       }
        // Should fail
        _, err := NewFileSnapshotStore("/", 3, nil)
        if err == nil {

go get github.com/hashicorp/raft fails

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.1 LTS
Release:    14.04
Codename:   trusty
$ go version
go version go1.2.1 linux/amd64
$ go get github.com/hashicorp/raft
# github.com/prometheus/client_golang/model
/home/tmp/src/github.com/prometheus/client_golang/model/signature.go:32: undefined: sync.Pool

Not 2-phase in member change?

Hi,

I took a look at the source code. It seems that the leader simply applies a peer change (add or remove) to its local peers. That is not the 2-phase approach to membership change described in the Raft paper. Am I missing something?
