cassandra-reaper's Issues

Run multiple repairs for same token ring in parallel

Currently, Reaper runs only one repair process at a time per cluster repair run. Make it possible to have more than one process running in parallel against the same token ring, as long as they target different data (replicas of the same data count as the same data).

How many of these parallel repairs can run depends on the write replication factor and the configured replica placement strategy. For example, in a single-site cluster with six nodes, a write replication factor of three, and the simple replica placement strategy, you can have two processes running in parallel on opposite sides of the token ring.
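The underlying constraint can be sketched as follows (a minimal illustration assuming the replica set of each candidate segment is already known; this is not Reaper's implementation):

    import java.util.Collections;
    import java.util.Set;

    // Two segments can be repaired at the same time only if their replica sets do not
    // overlap; with six nodes, RF 3 and SimpleStrategy this allows at most 6 / 3 = 2
    // concurrent repairs.
    static boolean canRunInParallel(Set<String> replicasOfSegmentA, Set<String> replicasOfSegmentB) {
      return Collections.disjoint(replicasOfSegmentA, replicasOfSegmentB);
    }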

Handle "repair finished" message not reaching all peers

We recently ran into a case of a hanging repair run. Some segments kept getting postponed indefinitely, because an involved node reported that it was participating in a repair. We got the repair session's hash from that node's log. Other nodes' logs reported this session as finished, but that node's did not. So apparently, the cross-node communication within Cassandra had failed there.

Reaper, on the other hand, was notified that the repair was done, so it moved along to the remaining segments. But segments within that node's range were blocked by SegmentRunner::canRepair, because according to the node a repair was still underway.

Potential fix: when SegmentRunner::canRepair discovers a node that's already busy with a repair, compare with Reaper's storage to determine whether it really should have a repair ongoing. If not, use JmxProxy::cancelAllRepairs to clear that node's state.
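A rough sketch of that fix (JmxProxy::cancelAllRepairs is the call named above; the other method names are illustrative, not Reaper's actual API):

    // If the node claims to be repairing but Reaper's storage has no running segment
    // involving it, the node is most likely stuck on a stale session, so clear it.
    boolean nodeBusy = hostProxy.isRepairRunning();                                         // illustrative check
    boolean reaperHasRunningSegment = storage.hasRunningSegmentOnHost(hostProxy.getHost()); // illustrative check
    if (nodeBusy && !reaperHasRunningSegment) {
      hostProxy.cancelAllRepairs();
    }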

Does it support Cassandra 3.11?

I'm getting the following issue while using it with Cassandra 3.11:

WARN [2017-07-27 04:16:44,669] [testcluster:1:355] c.s.r.s.SegmentRunner - SegmentRunner declined to repair segment 355 because of an error collecting information from one of the hosts (127.0.0.1): {}
java.lang.reflect.UndeclaredThrowableException: null
at com.sun.proxy.$Proxy51.getPendingTasks(Unknown Source) ~[na:1.8.0-internal]
at com.spotify.reaper.cassandra.JmxProxy.getPendingCompactions(JmxProxy.java:259) ~[cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.canRepair(SegmentRunner.java:262) [cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:178) [cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:96) [cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
Caused by: javax.management.AttributeNotFoundException: No such attribute: PendingTasks
at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:93) ~[na:1.8.0-internal]
at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:218) ~[na:1.8.0-internal]
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:659) ~[na:1.8.0-internal]
at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:690) ~[na:1.8.0-internal]
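The attribute being read simply does not exist on that MBean in Cassandra 3.11, as the AttributeNotFoundException shows. One possible workaround, sketched here as an assumption rather than the project's actual fix, is to read the equivalent value from Cassandra's Compaction metrics MBean over an already-established JMX connection:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;

    // Fallback sketch: newer Cassandra versions expose pending compactions through the
    // Compaction metrics MBean, which can be read with a plain JMX attribute lookup.
    static int getPendingCompactions(MBeanServerConnection connection) throws Exception {
      ObjectName pendingTasks =
          new ObjectName("org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
      return ((Number) connection.getAttribute(pendingTasks, "Value")).intValue();
    }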

Store one "last event" per repair runner currentlyRunningSegments slot

Since we started doing parallel repairs, the "last event" portion of repair runs has become a lot less informative. While most threads may be idle waiting for a repair segment to finish, one or more threads are usually trying to repair but postponing for various reasons. The result is that "last event" usually says "Postponed due to ...", which gives the impression that things aren't moving forward when they are.

I propose that we store one message per thread. That will help us get an overview of current activity, and see when everything is truly stuck.
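A minimal sketch of the proposal (the field and method names are illustrative, not existing Reaper code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.stream.Collectors;

    // One message per currentlyRunningSegments slot instead of a single string per run.
    private final ConcurrentMap<Integer, String> lastEventPerSlot = new ConcurrentHashMap<>();

    void reportEvent(int slot, String message) {
      lastEventPerSlot.put(slot, message);
    }

    String summarizeLastEvents() {
      // Joined view for the API/UI, so idle, repairing and postponing slots are all visible.
      return lastEventPerSlot.entrySet().stream()
          .map(e -> "slot " + e.getKey() + ": " + e.getValue())
          .collect(Collectors.joining("; "));
    }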

Make it easier to deploy Reaper database schema

Currently the database schema is defined in a single SQL file, which can be run on a database host as an admin user to create the database. Make it easier, or at least well documented, how to install the schema and set up a Postgres database for running the Reaper service.

Entire repair run moves to ERROR state on exceptions in JMX methods

I've had several repair runs fail due to transient errors (usually involving a node going up or down, but not always). The exceptions are thrown by JMX remote calls. For example:

ERROR [2015-03-30 22:37:34,539] com.spotify.reaper.service.RepairRunner: RepairRun FAILURE
ERROR [2015-03-30 22:37:34,540] com.spotify.reaper.service.RepairRunner: java.lang.reflect.UndeclaredThrowableException
ERROR [2015-03-30 22:37:34,540] com.spotify.reaper.service.RepairRunner: [com.sun.proxy.$Proxy59.forceTerminateAllRepairSessions(Unknown Source), com.spotify.reaper.cassandra.JmxProxy.cancelAllRepairs(JmxProxy.java:265), com.spotify.reaper.service.SegmentRunner.abort(SegmentRunner.java:89), com.spotify.reaper.service.SegmentRunner.abort(SegmentRunner.java:212), com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:150), com.spotify.reaper.service.SegmentRunner.triggerRepair(SegmentRunner.java:70), com.spotify.reaper.service.RepairRunner.repairSegment(RepairRunner.java:202), com.spotify.reaper.service.RepairRunner.startNextSegment(RepairRunner.java:156), com.spotify.reaper.service.RepairRunner.run(RepairRunner.java:89), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471), java.util.concurrent.FutureTask.run(FutureTask.java:262), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615), java.lang.Thread.run(Thread.java:745)]

and

ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: RepairRun FAILURE
ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: java.lang.reflect.UndeclaredThrowableException
ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: [com.sun.proxy.$Proxy60.getPendingTasks(Unknown Source), com.spotify.reaper.cassandra.JmxProxy.getPendingCompactions(JmxProxy.java:232), com.spotify.reaper.service.SegmentRunner.canRepair(SegmentRunner.java:177), com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:105), com.spotify.reaper.service.SegmentRunner.triggerRepair(SegmentRunner.java:70), com.spotify.reaper.service.RepairRunner.repairSegment(RepairRunner.java:202), com.spotify.reaper.service.RepairRunner.startNextSegment(RepairRunner.java:156), com.spotify.reaper.service.RepairRunner.run(RepairRunner.java:89), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471), java.util.concurrent.FutureTask.run(FutureTask.java:262), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615), java.lang.Thread.run(Thread.java:745)]

I've also seen an exception (don't have the logs, sorry) thrown by tokenRangeToEndpoint.

Following the style of ed21152, I wrote a patch that catches RuntimeException in a couple of places in SegmentRunner.canRepair(), roughly along the lines of the sketch below.
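A minimal sketch of that approach (the threshold constant and surrounding structure are illustrative, not the actual patch):

    // Treat transient JMX failures as "cannot repair right now" and postpone the segment,
    // instead of letting the exception propagate and fail the entire repair run.
    try {
      if (coordinator.getPendingCompactions() > MAX_PENDING_COMPACTIONS) {  // MAX_PENDING_COMPACTIONS is illustrative
        return false;  // postpone: coordinator too busy
      }
      // ... other JMX-backed checks against the replica nodes ...
      return true;
    } catch (RuntimeException e) {
      LOG.warn("Transient JMX failure while checking segment {}, postponing", segmentId, e);
      return false;  // postpone and retry later
    }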

Are you interested in a pull request, a patch file, or something else?

Run repair for limited token range within a cluster

Currently it is only possible to run a repair process for the whole token ring in a Cassandra cluster. Make it possible to define a token range for a repair run, i.e. repair only part of a token ring per repair run.

SchedulingManager throwing exceptions and not starting any new repairs

I discovered that our production Reaper has not started any scheduled repairs for nearly 5 days. Many schedules now have a next_activation date in the past. The log shows this exception being thrown repeatedly:

com.spotify.reaper.ReaperException: failed to generate repair segments for cluster "insert_cluster_name_here"
at com.spotify.reaper.resources.CommonTools.generateSegments(CommonTools.java:113) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.resources.CommonTools.registerRepairRun(CommonTools.java:60) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.service.SchedulingManager.startNewRunForUnit(SchedulingManager.java:160) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.service.SchedulingManager.manageSchedule(SchedulingManager.java:133) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.service.SchedulingManager.run(SchedulingManager.java:82) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at java.util.TimerThread.mainLoop(Timer.java:555) [na:1.7.0_60]
at java.util.TimerThread.run(Timer.java:505) [na:1.7.0_60]

Repairs time out in an error state when there is nothing to repair

The source code seems to ignore a return value of 0 from triggerRepair in JmxProxy. This value indicates a "nothing to repair" scenario. However, it is handled as a regular commandId by SegmentRunner, so the task eventually ends up in an error state after a timeout.

I've tried to fix this issue with the following code. However, tests still fail due to a wrong repair state, and I can't really figure out why.

        commandId = coordinator.triggerRepair(segment.getStartToken(), segment.getEndToken(),
                                              keyspace, repairRun.getRepairParallelism(),
                                              repairUnit.getColumnFamilies());

        // triggerRepair returns 0 when the coordinator has nothing to repair for this range;
        // mark the segment DONE immediately instead of waiting for a status that never comes.
        if (commandId == 0) {
          LOG.info("Nothing to repair for keyspace {}", keyspace);
          context.storage.updateRepairSegment(segment.with()
                  .coordinatorHost(coordinator.getHost())
                  .state(RepairSegment.State.DONE)
                  .build(segmentId));
          segmentRunners.remove(segment.getId());
          return;
        }

ArithmeticException: division by zero

  public static int getPossibleParallelRepairsCount(Map<List<String>, List<String>> ranges)
      throws ReaperException {
    if (ranges.size() == 0) {
      String msg = "Repairing 0-sized cluster.";
      LOG.error(msg);
      throw new ReaperException(msg);
    }
    return ranges.size() / ranges.values().iterator().next().size();
  }

My ranges map contains no values (no endpoints); how would that be possible? I'm not super familiar with Cassandra's internals.

{[2611504485045026489, 2615130133280966460]=[], [7274155616532593021, 7278291118132873602]=[], [3054106705702366443, 3055273262771481433]=[], [8609404532022392496, 8636088058183964814]=[], [-2623657254929043697, -2612593662847064200]=[], [-5221194959218228745, -5194047772672100208]=[], [2745819238607856147, 2770130583011817306]=[], [-7999645183218416099, -7996023586490762505]=[], [6508128789230567670, 6528741294189409016]=[], [5226309577100597733, 5244671631025045787]=[], [3959201707837294381, 3966256120497110936]=[], [-5644009797308341302, -5625436891321622136]=[], [3783559696387918540, 3802651562781519779]=[], [-8083608256777565735, -8075625381541615307]=[], [-4592600165800377201, -4591126960720786341]=[], [6963379035523023862, 6966105798152398875]=[], [-2873661126563962000, -2865006698360587614]=[], [-3020441752010504375, -3011810835899829272]=[], [-6700462238301430269, -6699675242730251345]=[], [5303188487510411273, 5315618109505838630]=[], [1730432212229141980, 1775046580520098842]=[], [-7837449111407677059, -7835533857396313744]=[], [3608126377440011913, 3631763644187782196]=[], [1707987676098870457, 1715278190568603790]=[], [1262100321431177247, 1282693436558994280]=[], [-4238320399166945720, -4231722489526295728]=[], [-3527036946169677816, -3517969055276438922]=[], [186748954390194875, 212371013041908784]=[], [-3054876837884326554, -3047569376577333773]=[], [-1603482665970338357, -1586132657391656566]=[], [6482788915242085216, 6491980926278450301]=[], [-109575287580986427, -95112322885815284]=[], [-7835533857396313744, -7813710847714650905]=[], [5515671905275458960, 5528493110189420233]=[], [-8332208069486211233, -8322828368529916204]=[], [870287520923088409, 906880799052678148]=[], [-5906296738643732553, -5898911102407369534]=[], [-1154549854382614947, -1152130983607815360]=[], [-853787615013518331, -853561760424292496]=[], [8740747575121493913, 8744383075276610422]=[], [8536024164660715472, 8581577728470934773]=[], [-6099801613660131827, -6094779650796445105]=[], [-4815445404098203546, -4760596029286127984]=[], [793625834421362534, 802063842358511807]=[], [-6978727080305525807, -6960767413033219609]=[], [2653949753285750519, 2716129961675479772]=[], [-115625167908946349, -109575287580986427]=[], [-942808367997680743, -939340085718116907]=[], [-1191905990458365409, -1186950452358740285]=[], [-4548275129798136721, -4521880798213419919]=[], [2652653296481014382, 2653949753285750519]=[], [4820701567469045791, 4831307925540474322]=[], [-6882225436192255342, -6846736775610151994]=[], [2582680286900566733, 2588891414424736172]=[], [-8475287941706663919, -8469715540705797889]=[], [4702944307435693455, 4719031281134601860]=[], [1505166145408500241, 1529215559298962360]=[], [7558259524976519622, 7589259310450757101]=[], [2457068377656749494, 2518963151239571349]=[], [-662620843226419540, -648816152089556760]=[], [1811685704180907092, 1866983848131886172]=[], [906880799052678148, 921207736693901268]=[], [-1256188405096295761, -1251839831017881262]=[], [3506651699717477204, 3509005924571210529]=[], [5904060532436192607, 5922942276556243371]=[], [8335880026770499616, 8335917059173530463]=[], [6998566955207500818, 7015763263441566094]=[], [4307399684438933050, 4322999041213705495]=[], [-6609092260989527881, -6600126709078558353]=[], [8059322415385003331, 8076732515624608701]=[], [-4062915327184017076, -4046699511518079622]=[], [4191269640571890854, 4203853163925490084]=[], [-9080249770355568928, -9063200956908944635]=[], [-631814065810069940, -623688715584221616]=[], [-7660869200973058086, 
-7646624639571456699]=[], [-2962854858252458557, -2953824875336680415]=[], [-6418668755306402216, -6351235315188280096]=[], [-5647892745631415975, -5644009797308341302]=[], [-1603843621011863418, -1603482665970338357]=[], [1876417635429492045, 1893155383920697859]=[], [-4924490251279821248, -4923877863142114892]=[], [-8361462506805035968, -8353650372536375934]=[], [4250325287347841321, 4267170985874019945]=[], [4612134641152533621, 4619610064126383476]=[], [-7868555664379111545, -7868542706126328701]=[], [6119031588241292905, 6146086992145105572]=[], [-505646614850450365, -486102891606279657]=[], [-1462878341231497146, -1452615416022512616]=[], [2408268696531021204, 2410981699084074123]=[], [943801435354016476, 948414470885543748]=[], [8990666948923936077, 9003094987575547333]=[], [-7061838253575465319, -7039630780308410046]=[], [1253117655404763727, 1256318725157009403]=[], [5179330839568694295, 5184672662900303094]=[], [-6724106150333340926, -6724041062160340709]=[], [3513139307656289127, 3516441224538621500]=[], [3465708221199635625, 3498036162642871789]=[], [6528741294189409016, 6548302136885399016]=[], [-8899566728133865186, -8899290714192025948]=[], [1083594821371023613, 1106697761559329266]=[], [1715278190568603790, 1726516607993845481]=[], [3461205081703141761, 3465708221199635625]=[], [7721136384482281784, 7734576158135262662]=[], [8026743345873120493, 8027977896055985531]=[], [-8812148917309903670, -8784559319761068530]=[], [-8075625381541615307, -8060266246506302962]=[], [-1975033874708932800, -1973718664427524439]=[], [-7962717502982383030, -7943598187509914760]=[], [5399640582801687217, 5403279252363194250]=[], [-1702832184853833674, -1680506190129601621]=[], [8764807191570283161, 8770000187598680596]=[], [332438817315668276, 336852403120333418]=[], [7389816819762951376, 7393420826698086416]=[], [3605548927754165806, 3608126377440011913]=[], [9187798577141093902, 9195965453675475249]=[], [-5690861705648700032, -5690790857306008532]=[], [1805803197128714699, 1811685704180907092]=[], [2029787531917813950, 2033101617298840319]=[], [-6344652383117552975, -6339892976214761973]=[], [-4295978430947398498, -4260725644201221871]=[], [-7034644126950607670, -6978727080305525807]=[], [1617298226732141143, 1618134943813191261]=[], [5131847942196517727, 5157044652716166141]=[], [7050998661204678717, 7103162083982322747]=[], [2043171264515475411, 2066280524359667545]=[], [3814908392648448902, 3860074099310004349]=[], [3325598551729899505, 3340943865480076041]=[], [4501961940135937352, 4521399429357646230]=[], [6014335608779734798, 6046124449921644903]=[], [3972022488430069138, 3973220941796485922]=[], [8264295159130053667, 8282361411225597475]=[], [-2890551887746088603, -2888566595526026688]=[], [3658292606441184720, 3680960737037990642]=[], [-4231722489526295728, -4189902401525590704]=[], [-6296225424064419225, -6285252089317113781]=[], [-2222934870101806198, -2205673561894215716]=[], [-3865575152262825661, -3856228553905702310]=[], [4270742711007831608, 4273862105801589540]=[], [-4564124595870376695, -4563287228328016503]=[], [-3283165459366861359, -3249340825117023169]=[], [3173046240608825950, 3174003194432480352]=[], [-2373847377672482858, -2363951691242349559]=[], [6730248520168442926, 6731197724093043690]=[], [-7292576966841004560, -7232581208743552463]=[], [3874958026732137083, 3932257392466547857]=[], [1256318725157009403, 1262100321431177247]=[], [-3678958354173338857, -3673526562917969902]=[], [924559986490197309, 943801435354016476]=[], [-623688715584221616, 
-612918450642994690]=[], [-486102891606279657, -465960252674737558]=[], [-699815469576284311, -695833686799280768]=[], [777698140801981474, 778004333013543619]=[], [3303490870760989266, 3325598551729899505]=[], [-8269965932649251183, -8201761003212298481]=[], [-4255215278728230818, -4244858298720462958]=[], [-8894646344406370330, -8884473028078832553]=[], [-3089837964917273827, -3066399449684943945]=[], [-6135846716971543667, -6128769424372044523]=[], [-84663362273789934, -83261333287482341]=[], [-3215395696353078846, -3189012426503304845]=[], [8917676462718890876, 8930630902813048632]=[], [3564320927054922271, 3572776171717641655]=[], [6224960937593489714, 6247778619083953506]=[], [-6507463319982213959, -6506195827697823008]=[], [-6699675242730251345, -6690832607240602953]=[], [-1832217519835529077, -1825310243407932211]=[], [-6464821491112310015, -6418668755306402216]=[], [2606693422494481711, 2611504485045026489]=[], [-2439829940274324050, -2436274371622551964]=[], [-2012241667722995678, -2008977440493730003]=[], [-5157751780714774909, -5117504227621037998]=[], [5806561163628451265, 5831839821834325998]=[], [2554658082322697082, 2569663300344021975]=[], [4089715211574107038, 4090484081282824278]=[], [5846467236833918221, 5854253811140274593]=[], [-8857880558602091425, -8845794254891955023]=[], [-4625544162326249603, -4611754786393053725]=[], [-8420968949991206146, -8420397026677105115]=[], [8185491851540318761, 8202339695520218706]=[], [8076732515624608701, 8089944725458638741]=[], [-6781323620485024869, -6780344315699292376]=[], [-7405726271750241303, -7357685494216847049]=[], [-900751575004035776, -882725617777444514]=[], [4145686518343249194, 4168063025588369973]=[], [-8834976298699024779, -8817582153227691279]=[], [7374532903989328575, 7375715483635666668]=[], [-7872839247241010043, -7868555664379111545]=[], [-5022448406419808087, -4953993286810209089]=[], [-6489371667136489558, -6482138593431599090]=[], [5554776409926992721, 5557462748327930275]=[], [-2312294363823709719, -2306712041928587974]=[], [-6114955094118052322, -6105011368485412949]=[], [5008764345769755334, 5044343553409740146]=[], [-5997392631295530911, -5994488142340287316]=[], [7852341141241434681, 7852749725824012540]=[], [-4873240827374594919, -4829473744604379967]=[], [-8817582153227691279, -8812148917309903670]=[], [4267170985874019945, 4270742711007831608]=[], [-3496490170220558376, -3479880017117824004]=[], [-2449231487716426942, -2446868073202183462]=[], [4800983365297636960, 4801980097566042631]=[], [2217312117454628534, 2232928036564433036]=[], [4791556334204487072, 4800983365297636960]=[], [-524430890406018839, -516164566468219378]=[], [-2128115202370700523, -2121966682156037184]=[], [-4010443936225520423, -3982890185210045501]=[], [-2250150729707074213, -2239012480829371690]=[], [155208214722633207, 183029690912963010]=[], [-4698430307027007341, -4671341321902941796]=[], [7355598039...
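The division by zero happens because ranges.values().iterator().next().size() is 0 when the endpoint lists are empty, as in the map above. A defensive guard could look like this (a sketch only, not the project's actual fix):

    import java.util.List;
    import java.util.Map;

    // Fail fast with a clear message when the token ranges carry no replica endpoints,
    // instead of dividing by zero.
    public static int getPossibleParallelRepairsCount(Map<List<String>, List<String>> ranges)
        throws ReaperException {
      if (ranges.isEmpty()) {
        throw new ReaperException("Repairing 0-sized cluster.");
      }
      int endpointsPerRange = ranges.values().iterator().next().size();
      if (endpointsPerRange == 0) {
        throw new ReaperException("Token ranges have no replica endpoints; the ring description from the cluster looks incomplete.");
      }
      return ranges.size() / endpointsPerRange;
    }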

Automatically set up scheduled repairs for each cluster

We've added support for automatically setting up scheduled repairs for the keyspaces in a cluster, in our local copy. Let me know if this is a feature you'd be interested in, so I can submit the changes.

Features:

  • Scheduled repairs are automatically created for all keyspaces within a cluster when the cluster is added.
  • Cluster keyspaces are periodically checked for changes, so that the scheduled repairs stay in sync with the cluster's keyspaces, i.e. scheduled repairs are added for new keyspaces and removed for deleted keyspaces (see the sketch after this list).
  • Configuration options are available to turn auto scheduling on or off, to specify when the scheduled repairs will be activated, and to set how often the keyspace sync is done.
  • Scheduled repairs are not created for keyspaces with no tables.
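A rough sketch of the periodic keyspace sync described above (this is not the submitted code; the storage and proxy method names are illustrative):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Periodically reconcile the cluster's keyspaces with the existing repair schedules.
    ScheduledExecutorService syncExecutor = Executors.newSingleThreadScheduledExecutor();
    syncExecutor.scheduleAtFixedRate(() -> {
      Set<String> liveKeyspaces = jmxProxy.getKeyspaces();                // keyspaces currently in the cluster (illustrative call)
      Set<String> scheduledKeyspaces = storage.getScheduledKeyspaces();   // keyspaces that already have a schedule (illustrative call)

      Set<String> toAdd = new HashSet<>(liveKeyspaces);
      toAdd.removeAll(scheduledKeyspaces);
      toAdd.forEach(storage::addRepairSchedule);        // new keyspace -> create a schedule

      Set<String> toRemove = new HashSet<>(scheduledKeyspaces);
      toRemove.removeAll(liveKeyspaces);
      toRemove.forEach(storage::deleteRepairSchedule);  // dropped keyspace -> remove its schedule
    }, 0, syncPeriodMinutes, TimeUnit.MINUTES);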

Reaper status shows RUNNING, though the repair is successfully completed

Reaper status shows as RUNNING, though the repair is successfully completed.

Cassandra Version: 2.0.15

A couple of questions here:

  1. I tried running repairs multiple times for different keyspaces; the repair state still shows "RUNNING" after more than 48 hours, even though the repair is completed.
  2. The repair always runs only on the node "10.16.3.162", which is a seed node. How can we make sure the repair continues on the next nodes once this one is completed?
    FYI, the cluster was added via its seed node:
    ./bin/spreaper add-cluster 10.16.3.162

[root@test cassandra-reaper]# ./bin/spreaper list-runs

Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues

------------------------------------------------------------------------------

Listing repair runs

Found 1 repair runs

[
{
"cause": "manual spreaper run",
"cluster_name": "test",
"column_families": [],
"creation_time": "2015-12-08T06:19:39Z",
"duration": null,
"end_time": null,
"estimated_time_of_arrival": "2016-04-07T11:54:22Z",
"id": 1,
"intensity": 0.900,
"keyspace_name": "cache",
"last_event": "Triggered repair of segment 1 via host 10.16.3.162",
"owner": "root",
"pause_time": null,
"repair_parallelism": "DATACENTER_AWARE",
"segments_repaired": 1,
"start_time": "2015-12-08T06:19:40Z",
"state": "RUNNING",
"total_segments": 301
}
]

Any help on this is appreciated.

Add reasons for RepairRun ending in ERROR in lastEvent field

When you see that a repair run ended in ERROR, it's never clear why that happened; you have to dig through the logs to try to figure out the cause. The lastEvent field in RepairRun provides no helpful information when an error has occurred. The reason for the failure should be stored there. It could be a stack trace or something shorter and more to the point.
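A minimal sketch of the idea (the builder methods are illustrative, mirroring the builder style used elsewhere in the codebase):

    // When moving a run to ERROR, also record a short reason in lastEvent so the
    // REST API and UI can show it without digging through the logs.
    private void failRepairRun(RepairRun run, Throwable cause) {
      context.storage.updateRepairRun(run.with()
          .runState(RepairRun.RunState.ERROR)
          .lastEvent("Failed: " + cause.getMessage())
          .build(run.getId()));
    }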

Reaper pausing on too many compactions (> 20)

Hi,
On our Cassandra cluster, I see the message below a number of times while Reaper is running.

SegmentRunner - SegmentRunner declined to repair segment 556369 because of too many pending compactions (> 20) on host.

But when I run "nodetool compactionstats" on that node, most of the time I see only one compaction task running, even though the pending tasks show more than 20.
Now, I understand that "pending tasks" is only an estimate, not the actual number of tasks pending.

So my question is: why should Reaper wait on the pending-tasks estimate, even though fewer than 20 compactions are actually running on that node?

Thanks

Kishore

Server not starting when using database as storage

Normally when storageType is set to memory it works fine:

INFO   [2015-12-17 10:07:57,164] [main] i.d.a.AssetsBundle - Registering AssetBundle with name: assets for path /webui/*
INFO   [2015-12-17 10:07:57,197] [main] c.s.r.ReaperApplication - initializing runner thread pool with 15 threads
INFO   [2015-12-17 10:07:57,202] [main] c.s.r.ReaperApplication - initializing storage of type: memory
INFO   [2015-12-17 10:07:57,204] [main] c.s.r.ReaperApplication - no JMX connection factory given in context, creating default
INFO   [2015-12-17 10:07:57,208] [main] c.s.r.ReaperApplication - creating and registering health checks
INFO   [2015-12-17 10:07:57,208] [main] c.s.r.ReaperApplication - creating resources and registering endpoints
INFO   [2015-12-17 10:07:58,214] [main] c.s.r.s.SchedulingManager - Starting new SchedulingManager instance
INFO   [2015-12-17 10:07:58,215] [main] c.s.r.ReaperApplication - resuming pending repair runs
INFO   [2015-12-17 10:07:58,228] [main] i.d.s.ServerFactory - Starting cassandra-reaper
_________                                          .___               __________
\_   ___ \_____    ______ ___________    ____    __| _/___________    \______   \ ____ _____  ______   ___________
/    \  \/\__  \  /  ___//  ___/\__  \  /    \  / __ |\_  __ \__  \    |       _// __ \\__  \ \____ \_/ __ \_  __ \
\     \____/ __ \_\___ \ \___ \  / __ \|   |  \/ /_/ | |  | \// __ \_  |    |   \  ___/ / __ \|  |_> >  ___/|  | \/
 \______  (____  /____  >____  >(____  /___|  /\____ | |__|  (____  /  |____|_  /\___  >____  /   __/ \___  >__|
        \/     \/     \/     \/      \/     \/      \/            \/          \/     \/     \/|__|        \/

INFO   [2015-12-17 10:07:58,302] [main] o.e.j.s.SetUIDListener - Opened application@5a9d6f02{HTTP/1.1}{0.0.0.0:8666}
INFO   [2015-12-17 10:07:58,302] [main] o.e.j.s.SetUIDListener - Opened admin@362045c0{HTTP/1.1}{0.0.0.0:8667}
INFO   [2015-12-17 10:07:58,305] [main] o.e.j.s.Server - jetty-9.0.z-SNAPSHOT
INFO   [2015-12-17 10:07:58,394] [main] c.s.j.s.i.a.WebApplicationImpl - Initiating Jersey application, version 'Jersey: 1.18.1 02/19/2014 03:28 AM'
INFO   [2015-12-17 10:07:58,470] [main] i.d.j.DropwizardResourceConfig - The following paths were found for the configured resources:

    GET     /ping (com.spotify.reaper.resources.PingResource)
    DELETE  /cluster/{cluster_name} (com.spotify.reaper.resources.ClusterResource)
    GET     /cluster (com.spotify.reaper.resources.ClusterResource)
    GET     /cluster/{cluster_name} (com.spotify.reaper.resources.ClusterResource)
    POST    /cluster (com.spotify.reaper.resources.ClusterResource)
    DELETE  /repair_run/{id} (com.spotify.reaper.resources.RepairRunResource)
    GET     /repair_run (com.spotify.reaper.resources.RepairRunResource)
    GET     /repair_run/cluster/{cluster_name} (com.spotify.reaper.resources.RepairRunResource)
    GET     /repair_run/{id} (com.spotify.reaper.resources.RepairRunResource)
    POST    /repair_run (com.spotify.reaper.resources.RepairRunResource)
    PUT     /repair_run/{id} (com.spotify.reaper.resources.RepairRunResource)
    DELETE  /repair_schedule/{id} (com.spotify.reaper.resources.RepairScheduleResource)
    GET     /repair_schedule (com.spotify.reaper.resources.RepairScheduleResource)
    GET     /repair_schedule/cluster/{cluster_name} (com.spotify.reaper.resources.RepairScheduleResource)
    GET     /repair_schedule/{id} (com.spotify.reaper.resources.RepairScheduleResource)
    POST    /repair_schedule (com.spotify.reaper.resources.RepairScheduleResource)
    PUT     /repair_schedule/{id} (com.spotify.reaper.resources.RepairScheduleResource)

But when using the following configuration:

segmentCount: 200
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9
#scheduleDaysBetween: 7
#daysToExpireAfterDone: 2
repairRunThreadCount: 15
hangingRepairTimeoutMins: 30
storageType: database
enableCrossOrigin: true
incrementalRepair: false

logging:
  level: DEBUG
  loggers:
    io.dropwizard: DEBUG
    org.eclipse.jetty: DEBUG
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"
    - type: file
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"
      currentLogFilename: /var/log/cassandra_reaper/cassandra_reaper.log
      archivedLogFilenamePattern: /var/log/cassandra_reaper/cassandra_reaper.%d.log.gz

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8666
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8667
      bindHost: 0.0.0.0

database:
  driverClass: org.postgresql.Driver
  user: reaper
  password: my_secret_password
  url: jdbc:postgresql://localhost/reaper_db

jmxAuth:
    username: controlRole
    password: XXX

Then debug logs show:

DEBUG  [2015-12-17 10:09:05,918] [main] o.e.j.u.log - Logging to Logger[org.eclipse.jetty.util.log] via org.eclipse.jetty.util.log.Slf4jLog
DEBUG  [2015-12-17 10:09:05,927] [main] o.e.j.u.c.ContainerLifeCycle - i.d.j.MutableServletContextHandler@1fc32e4f{/,null,null} added {org.eclipse.jetty.servlet.ServletHandler@2f67b837,AUTO}
DEBUG  [2015-12-17 10:09:05,928] [main] o.e.j.u.c.ContainerLifeCycle - i.d.j.MutableServletContextHandler@6cce16f4{/,null,null} added {org.eclipse.jetty.servlet.ServletHandler@7efaad5a,AUTO}
DEBUG  [2015-12-17 10:09:05,941] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@7efaad5a added {tasks@6907b8e==io.dropwizard.servlets.tasks.TaskServlet,1,true,AUTO}
DEBUG  [2015-12-17 10:09:05,943] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@7efaad5a added {[/tasks/*]=>tasks,POJO}
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - filterNameMap={}
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - pathFilters=null
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - servletFilterMap=null
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - servletPathMap={/tasks/*=tasks@6907b8e==io.dropwizard.servlets.tasks.TaskServlet,1,true}
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - servletNameMap={tasks=tasks@6907b8e==io.dropwizard.servlets.tasks.TaskServlet,1,true}
INFO   [2015-12-17 10:09:05,955] [main] i.d.a.AssetsBundle - Registering AssetBundle with name: assets for path /webui/*
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true,AUTO}
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {[/webui/*]=>assets,POJO}
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.s.ServletHandler - filterNameMap={}
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.s.ServletHandler - pathFilters=null
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.s.ServletHandler - servletFilterMap=null
DEBUG  [2015-12-17 10:09:05,986] [main] o.e.j.s.ServletHandler - servletPathMap={/webui/*=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:05,986] [main] o.e.j.s.ServletHandler - servletNameMap={assets=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - repairIntensity: 0.9
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - incrementalRepair:false
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - repairRunThreadCount: 15
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - segmentCount: 200
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - repairParallelism: DATACENTER_AWARE
DEBUG  [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - hangingRepairTimeoutMins: 30
DEBUG  [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - jmxPorts: null
DEBUG  [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - adding signal handler for SIGHUP
INFO   [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - initializing runner thread pool with 15 threads
INFO   [2015-12-17 10:09:05,993] [main] c.s.r.ReaperApplication - initializing storage of type: database
INFO   [2015-12-17 10:09:06,035] [main] c.s.r.ReaperApplication - no JMX connection factory given in context, creating default
DEBUG  [2015-12-17 10:09:06,037] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {crossOriginRequests,AUTO}
DEBUG  [2015-12-17 10:09:06,040] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {[/*]/[]==31=>crossOriginRequests,POJO}
DEBUG  [2015-12-17 10:09:06,041] [main] o.e.j.s.ServletHandler - filterNameMap={crossOriginRequests=crossOriginRequests}
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - pathFilters=[[/*]/[]==31=>crossOriginRequests]
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - servletFilterMap={}
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - servletPathMap={/webui/*=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - servletNameMap={assets=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:06,042] [main] c.s.r.ReaperApplication - using specified JMX credentials for authentication
INFO   [2015-12-17 10:09:06,042] [main] c.s.r.ReaperApplication - creating and registering health checks
INFO   [2015-12-17 10:09:06,043] [main] c.s.r.ReaperApplication - creating resources and registering endpoints
INFO   [2015-12-17 10:09:07,049] [main] c.s.r.s.SchedulingManager - Starting new SchedulingManager instance
INFO   [2015-12-17 10:09:07,050] [main] c.s.r.ReaperApplication - resuming pending repair runs

And the server does not start.

resume repairs automatically after errors

The repair often ends up in an ERROR state if nodes are down or restarted. Sometimes the message is just "Exception: null". After this happens, the repair must be resumed manually with spreaper. It would be preferable if it resumed automatically, perhaps after some delay.
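One way this could look (a sketch only; the RUNNING and ERROR states match those visible elsewhere in this document, everything else is illustrative):

    // After a run ends in ERROR, schedule a delayed attempt to put it back into RUNNING
    // so the repair runner picks it up again, instead of waiting for a manual resume.
    scheduler.schedule(() -> {
      RepairRun run = storage.getRepairRun(runId);
      if (run.getRunState() == RepairRun.RunState.ERROR) {
        storage.updateRepairRun(run.with()
            .runState(RepairRun.RunState.RUNNING)
            .build(run.getId()));
      }
    }, retryDelayMinutes, TimeUnit.MINUTES);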

Executing SegmentRunner failed: null

Just trying to launch this tool for the very first time:

./spreaper add-cluster 10.210.3.221
./spreaper repair productioncluster sync --tables user_quota

and I'm getting only:

INFO   [2015-12-03 13:47:03,600] [productioncluster:1:3184] c.s.r.s.SegmentRunner - It is ok to repair segment '3184' on repair run with id '1' 
INFO   [2015-12-03 13:47:03,727] [productioncluster:1:3184] c.s.r.c.JmxProxy - Triggering repair of range (5191522389173550088,5192411715538110057] for keyspace "sync" on host 10.195.15.167, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:04,097] [productioncluster:1:3184] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:04,897] [productioncluster:1:3202] c.s.r.s.SegmentRunner - It is ok to repair segment '3202' on repair run with id '1' 
INFO   [2015-12-03 13:47:05,138] [productioncluster:1:3202] c.s.r.c.JmxProxy - Triggering repair of range (5287223852807787957,5288653625584332214] for keyspace "sync" on host 10.195.15.176, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:05,502] [productioncluster:1:3202] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:05,930] [productioncluster:1:3232] c.s.r.s.SegmentRunner - It is ok to repair segment '3232' on repair run with id '1' 
INFO   [2015-12-03 13:47:06,002] [productioncluster:1:3232] c.s.r.c.JmxProxy - Triggering repair of range (5392661295720643869,5396401862543039721] for keyspace "sync" on host 10.210.3.221, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:06,103] [productioncluster:1:3232] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:06,464] [productioncluster:1:3238] c.s.r.s.SegmentRunner - It is ok to repair segment '3238' on repair run with id '1' 
INFO   [2015-12-03 13:47:06,518] [productioncluster:1:3238] c.s.r.c.JmxProxy - Triggering repair of range (5428508491334977628,5429554845014064640] for keyspace "sync" on host 10.210.3.117, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:06,601] [productioncluster:1:3238] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:06,665] [productioncluster:1:3244] c.s.r.s.SegmentRunner - It is ok to repair segment '3244' on repair run with id '1' 
INFO   [2015-12-03 13:47:06,748] [productioncluster:1:3244] c.s.r.c.JmxProxy - Triggering repair of range (5447912260606081125,5452254409804456671] for keyspace "sync" on host 10.210.3.162, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:06,836] [productioncluster:1:3244] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 

I'm using C* 2.1.11 in two datacenters with 8 nodes each, and cassandra-reaper 0.2.4.

ping works fine:

mlowicki:bin mlowicki$ ./spreaper ping
# Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues
# ------------------------------------------------------------------------------
# Sending PING to Reaper...
# [Reply] Cassandra Reaper ping resource: PONG
# Cassandra Reaper is answering in: localhost:8080

cluster seems to be registered correctly:

mlowicki:bin mlowicki$ ./spreaper list-clusters
# Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues
# ------------------------------------------------------------------------------
# Listing all registered Cassandra clusters
# Found 1 clusters:
productioncluster

configuration file:

segmentCount: 200 
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9 
scheduleDaysBetween: 7
#daysToExpireAfterDone: 2
repairRunThreadCount: 15
hangingRepairTimeoutMins: 30
storageType: memory
enableCrossOrigin: true

logging:
  level: INFO
  loggers:
    io.dropwizard: INFO
    org.eclipse.jetty: INFO
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8081
      bindHost: 0.0.0.0

database:
  driverClass: org.postgresql.Driver
  user: pg-user
  password: pg-pass
  url: jdbc:postgresql://db.example.com/db-prod

jmxAuth:
    username: controlRole
    password: XXX

JDK version conflict

The Debian build package requires default-jdk, which resolves to JDK 6, while the Maven project requires JDK 7.

Support for older cassandra versions

Does it work with Cassandra 1.2.19? I tried building for Cassandra 1.2.19 by changing the pom.xml file, but there are many compilation errors:

[ERROR] /root/cassandra-reaper/src/main/java/com/spotify/reaper/core/RepairRun.java:[16,35] package org.apache.cassandra.repair does not exist
[ERROR] /root/cassandra-reaper/src/main/java/com/spotify/reaper/core/RepairRun.java:[40,17] cannot find symbol
symbol: class RepairParallelism
location: class com.spotify.reaper.core.RepairRun
[ERROR] /root/cassandra-reaper/src/main/java/com/spotify/reaper/core/RepairRun.java:[111,10] cannot find symbol
symbol: class RepairParallelism

location: class com.spotify.reaper.core.RepairRun

Keep Reaper state inside Cassandra

Instead of relying on an outside data store, like Postgres, it would make sense to keep the state inside Cassandra, in a separate keyspace.

No coordinators for range

I'm trying to understand how I ran into this issue, "No coordinators for range (8285692895160687807,8293377386900436865]", since after restarting Reaper it worked for the exact same range.

Create a GUI for using Reaper

Currently there is only the REST API and a CLI tool for using and configuring Reaper. It would be better for a wider audience to provide a simple GUI that exposes Reaper's functionality.

Loop on only 3 segments

Hello,

Cassandra version: 2.0.10
3 nodes.

I found that Reaper keeps looping over only 3 segments: 1, 68, and 135.

Log:

INFO [2015-10-27 14:47:51,225] [dw-17 - POST /repair_run?tables=Standard1&clusterName=spotify&keyspace=Keyspace1&owner=aeljami&cause=manual+spreaper+run] c.s.r.s.SegmentGenerator - Dividing token range [-9223372036854775808,-3074457345618258603) into 67 segments

INFO [2015-10-27 14:47:51,228] [dw-17 - POST /repair_run?tables=Standard1&clusterName=spotify&keyspace=Keyspace1&owner=aeljami&cause=manual+spreaper+run] c.s.r.s.SegmentGenerator - Dividing token range [-3074457345618258603,3074457345618258602) into 67 segments

INFO [2015-10-27 14:47:51,230] [dw-17 - POST /repair_run?tables=Standard1&clusterName=spotify&keyspace=Keyspace1&owner=aeljami&cause=manual+spreaper+run] c.s.r.s.SegmentGenerator - Dividing token range [3074457345618258602,-9223372036854775808) into 67 segments

-----Segment 1----
INFO [2015-10-27 14:47:51,462] [spotify:1:1] c.s.r.s.SegmentRunner - It is ok to repair segment '1' on repair run with id '1'
INFO [2015-10-27 14:47:51,468] [spotify:1:1] c.s.r.c.JmxProxy - Triggering repair of range (-9223372036854775808,-9131597190716917343] for keyspace "Keyspace1" on host 127.0.0.2, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:47:51,486] [spotify:1:1] c.s.r.s.SegmentRunner - Repair for segment 1 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:47:51,488] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1854 on segment 1 returned with state RUNNING
INFO [2015-10-27 14:47:51,489] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1854 on segment 1 has been cancelled while running
INFO [2015-10-27 14:47:51,489] [spotify:1:1] c.s.r.s.SegmentRunner - Postponing segment 1
INFO [2015-10-27 14:47:51,489] [spotify:1:1] c.s.r.s.SegmentRunner - Aborting repair on segment with id 1 on coordinator 127.0.0.2

-----Segment 68----

INFO [2015-10-27 14:47:51,519] [spotify:1:68] c.s.r.s.SegmentRunner - It is ok to repair segment '68' on repair run with id '1'
INFO [2015-10-27 14:47:51,521] [spotify:1:68] c.s.r.c.JmxProxy - Triggering repair of range (-3074457345618258603,-2982682499480400138] for keyspace "Keyspace1" on host 127.0.0.3, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:47:51,523] [spotify:1:68] c.s.r.s.SegmentRunner - Repair for segment 68 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1855 on segment 68 returned with state RUNNING
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1855 on segment 68 has been cancelled while running
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Postponing segment 68
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Aborting repair on segment with id 68 on coordinator 127.0.0.3

-----Segment 135----

INFO [2015-10-27 14:48:21,548] [spotify:1:135] c.s.r.c.JmxProxy - Triggering repair of range (3074457345618258602,3166232191756117067] for keyspace "Keyspace1" on host 127.0.0.1, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:48:21,551] [spotify:1:135] c.s.r.s.SegmentRunner - Repair for segment 135 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:48:21,560] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1859 on segment 135 returned with state RUNNING
INFO [2015-10-27 14:48:21,560] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1859 on segment 135 has been cancelled while running
INFO [2015-10-27 14:48:21,560] [spotify:1:135] c.s.r.s.SegmentRunner - Postponing segment 135
INFO [2015-10-27 14:48:21,561] [spotify:1:135] c.s.r.s.SegmentRunner - Aborting repair on segment with id 135 on coordinator 127.0.0.1

-----Segment 1----
INFO [2015-10-27 14:49:51,464] [spotify:1:1] c.s.r.c.JmxProxy - Triggering repair of range (-9223372036854775808,-9131597190716917343] for keyspace "Keyspace1" on host 127.0.0.2, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:49:51,465] [spotify:1:1] c.s.r.s.SegmentRunner - Repair for segment 1 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:49:51,471] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1866 on segment 1 returned with state RUNNING
INFO [2015-10-27 14:49:51,471] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1866 on segment 1 has been cancelled while running
INFO [2015-10-27 14:49:51,471] [spotify:1:1] c.s.r.s.SegmentRunner - Postponing segment 1
INFO [2015-10-27 14:49:51,472] [spotify:1:1] c.s.r.s.SegmentRunner - Aborting repair on segment with id 1 on coordinator 127.0.0.2

-----Segment 68----

INFO [2015-10-27 14:49:51,500] [spotify:1:68] c.s.r.c.JmxProxy - Triggering repair of range (-3074457345618258603,-2982682499480400138] for keyspace "Keyspace1" on host 127.0.0.3, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:49:51,501] [spotify:1:68] c.s.r.s.SegmentRunner - Repair for segment 68 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:49:51,508] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1867 on segment 68 returned with state RUNNING
INFO [2015-10-27 14:49:51,509] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1867 on segment 68 has been cancelled while running
INFO [2015-10-27 14:49:51,509] [spotify:1:68] c.s.r.s.SegmentRunner - Postponing segment 68
INFO [2015-10-27 14:49:51,509] [spotify:1:68] c.s.r.s.SegmentRunner - Aborting repair on segment with id 68 on coordinator 127.0.0.3

-----Segment 135----
INFO [2015-10-27 14:49:51,536] [spotify:1:135] c.s.r.c.JmxProxy - Triggering repair of range (3074457345618258602,3166232191756117067] for keyspace "Keyspace1" on host 127.0.0.1, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:49:51,538] [spotify:1:135] c.s.r.s.SegmentRunner - Repair for segment 135 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1868 on segment 135 returned with state RUNNING
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1868 on segment 135 has been cancelled while running
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Postponing segment 135
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Aborting repair on segment with id 135 on coordinator 127.0.0.1

This goes on endlessly; no other segments are repaired.
Any idea?

Greets,

Track Cassandra's repair hash in RepairSegment

Cassandra gives each repair command a unique hash, which is not the same as the repair number that we already track. It might be convenient to have this hash in the repair segment, so that you can easily find the right repair command in the logs on all nodes involved in the repair.

Example: repair #2b41cfb0-a93d-11e4-b542-f707d752ca5f
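A small sketch of what that could look like (the repairSessionId builder method is hypothetical, following the segment builder style shown earlier in this document):

    // Store the session hash reported by Cassandra next to the command number, so the
    // matching "repair #<hash>" lines can be grepped on every node involved in the repair.
    context.storage.updateRepairSegment(segment.with()
        .coordinatorHost(coordinator.getHost())
        .repairSessionId("2b41cfb0-a93d-11e4-b542-f707d752ca5f")  // hypothetical field; value taken from the example above
        .build(segmentId));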
