cassandra-reaper's Issues

Run multiple repairs for same token ring in parallel

Currently, Reaper runs only one repair process at a time per cluster repair run. Make it possible to have more than one process running in parallel against the same token ring, as long as they target different data (replicas of the same data count as the same data).

How many of these parallel repairs can run depends on the write replication factor and the configured replica placement strategy. For example, in a single-site cluster with six nodes, a write replication factor of three, and the simple replica placement strategy, you can have two processes running in parallel on opposite sides of the token ring.
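The underlying constraint can be sketched as follows (a minimal illustration assuming the replica set of each candidate segment is already known; this is not Reaper's implementation):

    import java.util.Collections;
    import java.util.Set;

    // Two segments can be repaired at the same time only if their replica sets do not
    // overlap; with six nodes, RF 3 and SimpleStrategy this allows at most 6 / 3 = 2
    // concurrent repairs.
    static boolean canRunInParallel(Set<String> replicasOfSegmentA, Set<String> replicasOfSegmentB) {
      return Collections.disjoint(replicasOfSegmentA, replicasOfSegmentB);
    }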

Handle "repair finished" message not reaching all peers

We recently ran into a case of a hanging repair run. Some segments kept getting postponed indefinitely, because an involved node reported that it was participating in a repair. We got the repair session's hash from that node's log. Other nodes' logs reported this session as finished, but that node's did not. So apparently, the cross-node communication within Cassandra had failed there.

Reaper, on the other hand, was notified that the repair was done, so it moved along to the remaining segments. But segments within that node's range were blocked by SegmentRunner::canRepair, because according to the node a repair was still underway.

Potential fix: when SegmentRunner::canRepair discovers a node that's already busy with a repair, compare with Reaper's storage to determine whether it really should have a repair ongoing. If not, use JmxProxy::cancelAllRepairs to clear that node's state.
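A rough sketch of that fix (JmxProxy::cancelAllRepairs is the call named above; the other method names are illustrative, not Reaper's actual API):

    // If the node claims to be repairing but Reaper's storage has no running segment
    // involving it, the node is most likely stuck on a stale session, so clear it.
    boolean nodeBusy = hostProxy.isRepairRunning();                                         // illustrative check
    boolean reaperHasRunningSegment = storage.hasRunningSegmentOnHost(hostProxy.getHost()); // illustrative check
    if (nodeBusy && !reaperHasRunningSegment) {
      hostProxy.cancelAllRepairs();
    }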

Does it support Cassandra 3.11?

I'm getting the following issue while using it with Cassandra 3.11:

WARN [2017-07-27 04:16:44,669] [testcluster:1:355] c.s.r.s.SegmentRunner - SegmentRunner declined to repair segment 355 because of an error collecting information from one of the hosts (127.0.0.1): {}
java.lang.reflect.UndeclaredThrowableException: null
at com.sun.proxy.$Proxy51.getPendingTasks(Unknown Source) ~[na:1.8.0-internal]
at com.spotify.reaper.cassandra.JmxProxy.getPendingCompactions(JmxProxy.java:259) ~[cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.canRepair(SegmentRunner.java:262) [cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:178) [cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:96) [cassandra-reaper-1.1.1-SNAPSHOT.jar:1.1.1-SNAPSHOT]
Caused by: javax.management.AttributeNotFoundException: No such attribute: PendingTasks
at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:93) ~[na:1.8.0-internal]
at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:218) ~[na:1.8.0-internal]
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:659) ~[na:1.8.0-internal]
at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:690) ~[na:1.8.0-internal]
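The attribute being read simply does not exist on that MBean in Cassandra 3.11, as the AttributeNotFoundException shows. One possible workaround, sketched here as an assumption rather than the project's actual fix, is to read the equivalent value from Cassandra's Compaction metrics MBean over an already-established JMX connection:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;

    // Fallback sketch: newer Cassandra versions expose pending compactions through the
    // Compaction metrics MBean, which can be read with a plain JMX attribute lookup.
    static int getPendingCompactions(MBeanServerConnection connection) throws Exception {
      ObjectName pendingTasks =
          new ObjectName("org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
      return ((Number) connection.getAttribute(pendingTasks, "Value")).intValue();
    }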

Store one "last event" per repair runner currentlyRunningSegments slot

Since we started doing parallel repairs, the "last event" portion of repair runs has become a lot less informative. While most threads may be idle waiting for a repair segment to finish, one or more threads are usually trying to repair but postponing for various reasons. The result is that "last event" usually says "Postponed due to ...", which gives the impression that things aren't moving forward when they are.

I propose that we store one message per thread. That will help us get an overview of current activity, and see when everything is truly stuck.
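A minimal sketch of the proposal (the field and method names are illustrative, not existing Reaper code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.stream.Collectors;

    // One message per currentlyRunningSegments slot instead of a single string per run.
    private final ConcurrentMap<Integer, String> lastEventPerSlot = new ConcurrentHashMap<>();

    void reportEvent(int slot, String message) {
      lastEventPerSlot.put(slot, message);
    }

    String summarizeLastEvents() {
      // Joined view for the API/UI, so idle, repairing and postponing slots are all visible.
      return lastEventPerSlot.entrySet().stream()
          .map(e -> "slot " + e.getKey() + ": " + e.getValue())
          .collect(Collectors.joining("; "));
    }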

Make it easier to deploy Reaper database schema

Currently the database schema is defined in a single SQL file, which can be run on a database host as an admin user to create the database. Make it easier, or at least well documented, how to install the schema and set up a Postgres database for running the Reaper service.

Entire repair run moves to ERROR state on exceptions in JMX methods

I've had several repair runs fail due to transient errors (usually involving a node going up or down, but not always). The exceptions are thrown by JMX remote calls. For example:

ERROR [2015-03-30 22:37:34,539] com.spotify.reaper.service.RepairRunner: RepairRun FAILURE
ERROR [2015-03-30 22:37:34,540] com.spotify.reaper.service.RepairRunner: java.lang.reflect.UndeclaredThrowableException
ERROR [2015-03-30 22:37:34,540] com.spotify.reaper.service.RepairRunner: [com.sun.proxy.$Proxy59.forceTerminateAllRepairSessions(Unknown Source), com.spotify.reaper.cassandra.JmxProxy.cancelAllRepairs(JmxProxy.java:265), com.spotify.reaper.service.SegmentRunner.abort(SegmentRunner.java:89), com.spotify.reaper.service.SegmentRunner.abort(SegmentRunner.java:212), com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:150), com.spotify.reaper.service.SegmentRunner.triggerRepair(SegmentRunner.java:70), com.spotify.reaper.service.RepairRunner.repairSegment(RepairRunner.java:202), com.spotify.reaper.service.RepairRunner.startNextSegment(RepairRunner.java:156), com.spotify.reaper.service.RepairRunner.run(RepairRunner.java:89), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471), java.util.concurrent.FutureTask.run(FutureTask.java:262), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615), java.lang.Thread.run(Thread.java:745)]

and

ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: RepairRun FAILURE
ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: java.lang.reflect.UndeclaredThrowableException
ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: [com.sun.proxy.$Proxy60.getPendingTasks(Unknown Source), com.spotify.reaper.cassandra.JmxProxy.getPendingCompactions(JmxProxy.java:232), com.spotify.reaper.service.SegmentRunner.canRepair(SegmentRunner.java:177), com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:105), com.spotify.reaper.service.SegmentRunner.triggerRepair(SegmentRunner.java:70), com.spotify.reaper.service.RepairRunner.repairSegment(RepairRunner.java:202), com.spotify.reaper.service.RepairRunner.startNextSegment(RepairRunner.java:156), com.spotify.reaper.service.RepairRunner.run(RepairRunner.java:89), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471), java.util.concurrent.FutureTask.run(FutureTask.java:262), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615), java.lang.Thread.run(Thread.java:745)]

I've also seen an exception (don't have the logs, sorry) thrown by tokenRangeToEndpoint.

Following the style of ed21152, I wrote a patch that catches RuntimeException in a couple of places in SegmentRunner.canRepair(), roughly along the lines of the sketch below.
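A minimal sketch of that approach (the threshold constant and surrounding structure are illustrative, not the actual patch):

    // Treat transient JMX failures as "cannot repair right now" and postpone the segment,
    // instead of letting the exception propagate and fail the entire repair run.
    try {
      if (coordinator.getPendingCompactions() > MAX_PENDING_COMPACTIONS) {  // MAX_PENDING_COMPACTIONS is illustrative
        return false;  // postpone: coordinator too busy
      }
      // ... other JMX-backed checks against the replica nodes ...
      return true;
    } catch (RuntimeException e) {
      LOG.warn("Transient JMX failure while checking segment {}, postponing", segmentId, e);
      return false;  // postpone and retry later
    }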

Are you interested in a pull request, a patch file, or something else?

Run repair for limited token range within a cluster

Currently it is only possible to run a repair process for the whole token ring in a Cassandra cluster. Make it possible to define a token range for a repair run, i.e. repair only part of a token ring per repair run.

SchedulingManager throwing exceptions and not starting any new repairs

I discovered that our production Reaper has not started any scheduled repairs for nearly 5 days. Many schedules now have a next_activation date in the past. The log shows this exception being thrown repeatedly:

com.spotify.reaper.ReaperException: failed to generate repair segments for cluster "insert_cluster_name_here"
at com.spotify.reaper.resources.CommonTools.generateSegments(CommonTools.java:113) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.resources.CommonTools.registerRepairRun(CommonTools.java:60) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.service.SchedulingManager.startNewRunForUnit(SchedulingManager.java:160) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.service.SchedulingManager.manageSchedule(SchedulingManager.java:133) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at com.spotify.reaper.service.SchedulingManager.run(SchedulingManager.java:82) ~[cassandra-reaper-0.2.1-SNAPSHOT.jar:0.2.1-SNAPSHOT]
at java.util.TimerThread.mainLoop(Timer.java:555) [na:1.7.0_60]
at java.util.TimerThread.run(Timer.java:505) [na:1.7.0_60]

Repairs time out in an error state when there is nothing to repair

The source code seems to ignore a return value of 0 from triggerRepair in JmxProxy. This value indicates a "nothing to repair" scenario. However, it is handled as a regular commandId by SegmentRunner, so the task eventually ends up in an error state after a timeout.

I've tried to fix this issue with the following code. However, tests still fail due to a wrong repair state, and I can't really figure out why.

        commandId = coordinator.triggerRepair(segment.getStartToken(), segment.getEndToken(),
                                              keyspace, repairRun.getRepairParallelism(),
                                              repairUnit.getColumnFamilies());

        // triggerRepair returns 0 when the coordinator has nothing to repair for this range;
        // mark the segment DONE immediately instead of waiting for a status that never comes.
        if (commandId == 0) {
          LOG.info("Nothing to repair for keyspace {}", keyspace);
          context.storage.updateRepairSegment(segment.with()
                  .coordinatorHost(coordinator.getHost())
                  .state(RepairSegment.State.DONE)
                  .build(segmentId));
          segmentRunners.remove(segment.getId());
          return;
        }

ArithmeticException: division by zero

  public static int getPossibleParallelRepairsCount(Map<List<String>, List<String>> ranges)
      throws ReaperException {
    if (ranges.size() == 0) {
      String msg = "Repairing 0-sized cluster.";
      LOG.error(msg);
      throw new ReaperException(msg);
    }
    return ranges.size() / ranges.values().iterator().next().size();
  }

My ranges map contains no values (no endpoints); how would that be possible? I'm not super familiar with Cassandra's internals.

{[2611504485045026489, 2615130133280966460]=[], [7274155616532593021, 7278291118132873602]=[], [3054106705702366443, 3055273262771481433]=[], [8609404532022392496, 8636088058183964814]=[], [-2623657254929043697, -2612593662847064200]=[], [-5221194959218228745, -5194047772672100208]=[], [2745819238607856147, 2770130583011817306]=[], [-7999645183218416099, -7996023586490762505]=[], [6508128789230567670, 6528741294189409016]=[], [5226309577100597733, 5244671631025045787]=[], [3959201707837294381, 3966256120497110936]=[], [-5644009797308341302, -5625436891321622136]=[], [3783559696387918540, 3802651562781519779]=[], [-8083608256777565735, -8075625381541615307]=[], [-4592600165800377201, -4591126960720786341]=[], [6963379035523023862, 6966105798152398875]=[], [-2873661126563962000, -2865006698360587614]=[], [-3020441752010504375, -3011810835899829272]=[], [-6700462238301430269, -6699675242730251345]=[], [5303188487510411273, 5315618109505838630]=[], [1730432212229141980, 1775046580520098842]=[], [-7837449111407677059, -7835533857396313744]=[], [3608126377440011913, 3631763644187782196]=[], [1707987676098870457, 1715278190568603790]=[], [1262100321431177247, 1282693436558994280]=[], [-4238320399166945720, -4231722489526295728]=[], [-3527036946169677816, -3517969055276438922]=[], [186748954390194875, 212371013041908784]=[], [-3054876837884326554, -3047569376577333773]=[], [-1603482665970338357, -1586132657391656566]=[], [6482788915242085216, 6491980926278450301]=[], [-109575287580986427, -95112322885815284]=[], [-7835533857396313744, -7813710847714650905]=[], [5515671905275458960, 5528493110189420233]=[], [-8332208069486211233, -8322828368529916204]=[], [870287520923088409, 906880799052678148]=[], [-5906296738643732553, -5898911102407369534]=[], [-1154549854382614947, -1152130983607815360]=[], [-853787615013518331, -853561760424292496]=[], [8740747575121493913, 8744383075276610422]=[], [8536024164660715472, 8581577728470934773]=[], [-6099801613660131827, -6094779650796445105]=[], [-4815445404098203546, -4760596029286127984]=[], [793625834421362534, 802063842358511807]=[], [-6978727080305525807, -6960767413033219609]=[], [2653949753285750519, 2716129961675479772]=[], [-115625167908946349, -109575287580986427]=[], [-942808367997680743, -939340085718116907]=[], [-1191905990458365409, -1186950452358740285]=[], [-4548275129798136721, -4521880798213419919]=[], [2652653296481014382, 2653949753285750519]=[], [4820701567469045791, 4831307925540474322]=[], [-6882225436192255342, -6846736775610151994]=[], [2582680286900566733, 2588891414424736172]=[], [-8475287941706663919, -8469715540705797889]=[], [4702944307435693455, 4719031281134601860]=[], [1505166145408500241, 1529215559298962360]=[], [7558259524976519622, 7589259310450757101]=[], [2457068377656749494, 2518963151239571349]=[], [-662620843226419540, -648816152089556760]=[], [1811685704180907092, 1866983848131886172]=[], [906880799052678148, 921207736693901268]=[], [-1256188405096295761, -1251839831017881262]=[], [3506651699717477204, 3509005924571210529]=[], [5904060532436192607, 5922942276556243371]=[], [8335880026770499616, 8335917059173530463]=[], [6998566955207500818, 7015763263441566094]=[], [4307399684438933050, 4322999041213705495]=[], [-6609092260989527881, -6600126709078558353]=[], [8059322415385003331, 8076732515624608701]=[], [-4062915327184017076, -4046699511518079622]=[], [4191269640571890854, 4203853163925490084]=[], [-9080249770355568928, -9063200956908944635]=[], [-631814065810069940, -623688715584221616]=[], [-7660869200973058086, 
-7646624639571456699]=[], [-2962854858252458557, -2953824875336680415]=[], [-6418668755306402216, -6351235315188280096]=[], [-5647892745631415975, -5644009797308341302]=[], [-1603843621011863418, -1603482665970338357]=[], [1876417635429492045, 1893155383920697859]=[], [-4924490251279821248, -4923877863142114892]=[], [-8361462506805035968, -8353650372536375934]=[], [4250325287347841321, 4267170985874019945]=[], [4612134641152533621, 4619610064126383476]=[], [-7868555664379111545, -7868542706126328701]=[], [6119031588241292905, 6146086992145105572]=[], [-505646614850450365, -486102891606279657]=[], [-1462878341231497146, -1452615416022512616]=[], [2408268696531021204, 2410981699084074123]=[], [943801435354016476, 948414470885543748]=[], [8990666948923936077, 9003094987575547333]=[], [-7061838253575465319, -7039630780308410046]=[], [1253117655404763727, 1256318725157009403]=[], [5179330839568694295, 5184672662900303094]=[], [-6724106150333340926, -6724041062160340709]=[], [3513139307656289127, 3516441224538621500]=[], [3465708221199635625, 3498036162642871789]=[], [6528741294189409016, 6548302136885399016]=[], [-8899566728133865186, -8899290714192025948]=[], [1083594821371023613, 1106697761559329266]=[], [1715278190568603790, 1726516607993845481]=[], [3461205081703141761, 3465708221199635625]=[], [7721136384482281784, 7734576158135262662]=[], [8026743345873120493, 8027977896055985531]=[], [-8812148917309903670, -8784559319761068530]=[], [-8075625381541615307, -8060266246506302962]=[], [-1975033874708932800, -1973718664427524439]=[], [-7962717502982383030, -7943598187509914760]=[], [5399640582801687217, 5403279252363194250]=[], [-1702832184853833674, -1680506190129601621]=[], [8764807191570283161, 8770000187598680596]=[], [332438817315668276, 336852403120333418]=[], [7389816819762951376, 7393420826698086416]=[], [3605548927754165806, 3608126377440011913]=[], [9187798577141093902, 9195965453675475249]=[], [-5690861705648700032, -5690790857306008532]=[], [1805803197128714699, 1811685704180907092]=[], [2029787531917813950, 2033101617298840319]=[], [-6344652383117552975, -6339892976214761973]=[], [-4295978430947398498, -4260725644201221871]=[], [-7034644126950607670, -6978727080305525807]=[], [1617298226732141143, 1618134943813191261]=[], [5131847942196517727, 5157044652716166141]=[], [7050998661204678717, 7103162083982322747]=[], [2043171264515475411, 2066280524359667545]=[], [3814908392648448902, 3860074099310004349]=[], [3325598551729899505, 3340943865480076041]=[], [4501961940135937352, 4521399429357646230]=[], [6014335608779734798, 6046124449921644903]=[], [3972022488430069138, 3973220941796485922]=[], [8264295159130053667, 8282361411225597475]=[], [-2890551887746088603, -2888566595526026688]=[], [3658292606441184720, 3680960737037990642]=[], [-4231722489526295728, -4189902401525590704]=[], [-6296225424064419225, -6285252089317113781]=[], [-2222934870101806198, -2205673561894215716]=[], [-3865575152262825661, -3856228553905702310]=[], [4270742711007831608, 4273862105801589540]=[], [-4564124595870376695, -4563287228328016503]=[], [-3283165459366861359, -3249340825117023169]=[], [3173046240608825950, 3174003194432480352]=[], [-2373847377672482858, -2363951691242349559]=[], [6730248520168442926, 6731197724093043690]=[], [-7292576966841004560, -7232581208743552463]=[], [3874958026732137083, 3932257392466547857]=[], [1256318725157009403, 1262100321431177247]=[], [-3678958354173338857, -3673526562917969902]=[], [924559986490197309, 943801435354016476]=[], [-623688715584221616, 
-612918450642994690]=[], [-486102891606279657, -465960252674737558]=[], [-699815469576284311, -695833686799280768]=[], [777698140801981474, 778004333013543619]=[], [3303490870760989266, 3325598551729899505]=[], [-8269965932649251183, -8201761003212298481]=[], [-4255215278728230818, -4244858298720462958]=[], [-8894646344406370330, -8884473028078832553]=[], [-3089837964917273827, -3066399449684943945]=[], [-6135846716971543667, -6128769424372044523]=[], [-84663362273789934, -83261333287482341]=[], [-3215395696353078846, -3189012426503304845]=[], [8917676462718890876, 8930630902813048632]=[], [3564320927054922271, 3572776171717641655]=[], [6224960937593489714, 6247778619083953506]=[], [-6507463319982213959, -6506195827697823008]=[], [-6699675242730251345, -6690832607240602953]=[], [-1832217519835529077, -1825310243407932211]=[], [-6464821491112310015, -6418668755306402216]=[], [2606693422494481711, 2611504485045026489]=[], [-2439829940274324050, -2436274371622551964]=[], [-2012241667722995678, -2008977440493730003]=[], [-5157751780714774909, -5117504227621037998]=[], [5806561163628451265, 5831839821834325998]=[], [2554658082322697082, 2569663300344021975]=[], [4089715211574107038, 4090484081282824278]=[], [5846467236833918221, 5854253811140274593]=[], [-8857880558602091425, -8845794254891955023]=[], [-4625544162326249603, -4611754786393053725]=[], [-8420968949991206146, -8420397026677105115]=[], [8185491851540318761, 8202339695520218706]=[], [8076732515624608701, 8089944725458638741]=[], [-6781323620485024869, -6780344315699292376]=[], [-7405726271750241303, -7357685494216847049]=[], [-900751575004035776, -882725617777444514]=[], [4145686518343249194, 4168063025588369973]=[], [-8834976298699024779, -8817582153227691279]=[], [7374532903989328575, 7375715483635666668]=[], [-7872839247241010043, -7868555664379111545]=[], [-5022448406419808087, -4953993286810209089]=[], [-6489371667136489558, -6482138593431599090]=[], [5554776409926992721, 5557462748327930275]=[], [-2312294363823709719, -2306712041928587974]=[], [-6114955094118052322, -6105011368485412949]=[], [5008764345769755334, 5044343553409740146]=[], [-5997392631295530911, -5994488142340287316]=[], [7852341141241434681, 7852749725824012540]=[], [-4873240827374594919, -4829473744604379967]=[], [-8817582153227691279, -8812148917309903670]=[], [4267170985874019945, 4270742711007831608]=[], [-3496490170220558376, -3479880017117824004]=[], [-2449231487716426942, -2446868073202183462]=[], [4800983365297636960, 4801980097566042631]=[], [2217312117454628534, 2232928036564433036]=[], [4791556334204487072, 4800983365297636960]=[], [-524430890406018839, -516164566468219378]=[], [-2128115202370700523, -2121966682156037184]=[], [-4010443936225520423, -3982890185210045501]=[], [-2250150729707074213, -2239012480829371690]=[], [155208214722633207, 183029690912963010]=[], [-4698430307027007341, -4671341321902941796]=[], [7355598039...
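The division by zero happens because ranges.values().iterator().next().size() is 0 when the endpoint lists are empty, as in the map above. A defensive guard could look like this (a sketch only, not the project's actual fix):

    import java.util.List;
    import java.util.Map;

    // Fail fast with a clear message when the token ranges carry no replica endpoints,
    // instead of dividing by zero.
    public static int getPossibleParallelRepairsCount(Map<List<String>, List<String>> ranges)
        throws ReaperException {
      if (ranges.isEmpty()) {
        throw new ReaperException("Repairing 0-sized cluster.");
      }
      int endpointsPerRange = ranges.values().iterator().next().size();
      if (endpointsPerRange == 0) {
        throw new ReaperException("Token ranges have no replica endpoints; the ring description from the cluster looks incomplete.");
      }
      return ranges.size() / endpointsPerRange;
    }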

Automatically set up scheduled repairs for each cluster

We've added support for automatically setting up scheduled repairs for the keyspaces in a cluster, in our local copy. Let me know if this is a feature you'd be interested in, so I can submit the changes.

Features:

  • Scheduled repairs are automatically created for all keyspaces within a cluster when the cluster is added.
  • Cluster keyspaces are periodically checked for changes, so that the scheduled repairs stay in sync with the cluster's keyspaces, i.e. scheduled repairs are added for new keyspaces and removed for deleted keyspaces (see the sketch after this list).
  • Configuration options are available to turn auto scheduling on or off, to specify when the scheduled repairs will be activated, and to set how often the keyspace sync is done.
  • Scheduled repairs are not created for keyspaces with no tables.
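A rough sketch of the periodic keyspace sync described above (this is not the submitted code; the storage and proxy method names are illustrative):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Periodically reconcile the cluster's keyspaces with the existing repair schedules.
    ScheduledExecutorService syncExecutor = Executors.newSingleThreadScheduledExecutor();
    syncExecutor.scheduleAtFixedRate(() -> {
      Set<String> liveKeyspaces = jmxProxy.getKeyspaces();                // keyspaces currently in the cluster (illustrative call)
      Set<String> scheduledKeyspaces = storage.getScheduledKeyspaces();   // keyspaces that already have a schedule (illustrative call)

      Set<String> toAdd = new HashSet<>(liveKeyspaces);
      toAdd.removeAll(scheduledKeyspaces);
      toAdd.forEach(storage::addRepairSchedule);        // new keyspace -> create a schedule

      Set<String> toRemove = new HashSet<>(scheduledKeyspaces);
      toRemove.removeAll(liveKeyspaces);
      toRemove.forEach(storage::deleteRepairSchedule);  // dropped keyspace -> remove its schedule
    }, 0, syncPeriodMinutes, TimeUnit.MINUTES);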

Reaper status shows RUNNING, though the repair is successfully completed

Reaper status shows as RUNNING, though the repair is successfully completed.

Cassandra Version: 2.0.15

A couple of questions here:

  1. I tried running repairs multiple times for different keyspaces; the repair state still shows "RUNNING" after more than 48 hours, even though the repair is completed.
  2. The repair always runs only on the node "10.16.3.162", which is a seed node. How can we make sure the repair continues on the next nodes once this one is completed?
    FYI, the cluster was added via its seed node:
    ./bin/spreaper add-cluster 10.16.3.162

[root@test cassandra-reaper]# ./bin/spreaper list-runs

Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues

------------------------------------------------------------------------------

Listing repair runs

Found 1 repair runs

[
{
"cause": "manual spreaper run",
"cluster_name": "test",
"column_families": [],
"creation_time": "2015-12-08T06:19:39Z",
"duration": null,
"end_time": null,
"estimated_time_of_arrival": "2016-04-07T11:54:22Z",
"id": 1,
"intensity": 0.900,
"keyspace_name": "cache",
"last_event": "Triggered repair of segment 1 via host 10.16.3.162",
"owner": "root",
"pause_time": null,
"repair_parallelism": "DATACENTER_AWARE",
"segments_repaired": 1,
"start_time": "2015-12-08T06:19:40Z",
"state": "RUNNING",
"total_segments": 301
}
]

Any help on this is appreciated.

Add reasons for RepairRun ending in ERROR in lastEvent field

When you see that a repair run ended in ERROR, it's never clear why that happened; you have to dig through the logs to try to figure out the cause. The lastEvent field in RepairRun provides no helpful information when an error has occurred. The reason for the failure should be stored there. It could be a stack trace or something shorter and more to the point.
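A minimal sketch of the idea (the builder methods are illustrative, mirroring the builder style used elsewhere in the codebase):

    // When moving a run to ERROR, also record a short reason in lastEvent so the
    // REST API and UI can show it without digging through the logs.
    private void failRepairRun(RepairRun run, Throwable cause) {
      context.storage.updateRepairRun(run.with()
          .runState(RepairRun.RunState.ERROR)
          .lastEvent("Failed: " + cause.getMessage())
          .build(run.getId()));
    }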

Reaper pausing on too many compactions (> 20)

Hi,
On our Cassandra cluster, I see the message below a number of times while Reaper is running.

SegmentRunner - SegmentRunner declined to repair segment 556369 because of too many pending compactions (> 20) on host.

But when I run "nodetool compactionstats" on that node, most of the time I see only one compaction task running, even though the pending tasks show more than 20.
Now, I understand that "pending tasks" is only an estimate, not the actual number of tasks pending.

So my question is: why should Reaper wait on the pending-tasks estimate, even though fewer than 20 compactions are actually running on that node?

Thanks

Kishore

Server not starting when using database as storage

Normally when storageType is set to memory it works fine:

INFO   [2015-12-17 10:07:57,164] [main] i.d.a.AssetsBundle - Registering AssetBundle with name: assets for path /webui/*
INFO   [2015-12-17 10:07:57,197] [main] c.s.r.ReaperApplication - initializing runner thread pool with 15 threads
INFO   [2015-12-17 10:07:57,202] [main] c.s.r.ReaperApplication - initializing storage of type: memory
INFO   [2015-12-17 10:07:57,204] [main] c.s.r.ReaperApplication - no JMX connection factory given in context, creating default
INFO   [2015-12-17 10:07:57,208] [main] c.s.r.ReaperApplication - creating and registering health checks
INFO   [2015-12-17 10:07:57,208] [main] c.s.r.ReaperApplication - creating resources and registering endpoints
INFO   [2015-12-17 10:07:58,214] [main] c.s.r.s.SchedulingManager - Starting new SchedulingManager instance
INFO   [2015-12-17 10:07:58,215] [main] c.s.r.ReaperApplication - resuming pending repair runs
INFO   [2015-12-17 10:07:58,228] [main] i.d.s.ServerFactory - Starting cassandra-reaper
_________                                          .___               __________
\_   ___ \_____    ______ ___________    ____    __| _/___________    \______   \ ____ _____  ______   ___________
/    \  \/\__  \  /  ___//  ___/\__  \  /    \  / __ |\_  __ \__  \    |       _// __ \\__  \ \____ \_/ __ \_  __ \
\     \____/ __ \_\___ \ \___ \  / __ \|   |  \/ /_/ | |  | \// __ \_  |    |   \  ___/ / __ \|  |_> >  ___/|  | \/
 \______  (____  /____  >____  >(____  /___|  /\____ | |__|  (____  /  |____|_  /\___  >____  /   __/ \___  >__|
        \/     \/     \/     \/      \/     \/      \/            \/          \/     \/     \/|__|        \/

INFO   [2015-12-17 10:07:58,302] [main] o.e.j.s.SetUIDListener - Opened application@5a9d6f02{HTTP/1.1}{0.0.0.0:8666}
INFO   [2015-12-17 10:07:58,302] [main] o.e.j.s.SetUIDListener - Opened admin@362045c0{HTTP/1.1}{0.0.0.0:8667}
INFO   [2015-12-17 10:07:58,305] [main] o.e.j.s.Server - jetty-9.0.z-SNAPSHOT
INFO   [2015-12-17 10:07:58,394] [main] c.s.j.s.i.a.WebApplicationImpl - Initiating Jersey application, version 'Jersey: 1.18.1 02/19/2014 03:28 AM'
INFO   [2015-12-17 10:07:58,470] [main] i.d.j.DropwizardResourceConfig - The following paths were found for the configured resources:

    GET     /ping (com.spotify.reaper.resources.PingResource)
    DELETE  /cluster/{cluster_name} (com.spotify.reaper.resources.ClusterResource)
    GET     /cluster (com.spotify.reaper.resources.ClusterResource)
    GET     /cluster/{cluster_name} (com.spotify.reaper.resources.ClusterResource)
    POST    /cluster (com.spotify.reaper.resources.ClusterResource)
    DELETE  /repair_run/{id} (com.spotify.reaper.resources.RepairRunResource)
    GET     /repair_run (com.spotify.reaper.resources.RepairRunResource)
    GET     /repair_run/cluster/{cluster_name} (com.spotify.reaper.resources.RepairRunResource)
    GET     /repair_run/{id} (com.spotify.reaper.resources.RepairRunResource)
    POST    /repair_run (com.spotify.reaper.resources.RepairRunResource)
    PUT     /repair_run/{id} (com.spotify.reaper.resources.RepairRunResource)
    DELETE  /repair_schedule/{id} (com.spotify.reaper.resources.RepairScheduleResource)
    GET     /repair_schedule (com.spotify.reaper.resources.RepairScheduleResource)
    GET     /repair_schedule/cluster/{cluster_name} (com.spotify.reaper.resources.RepairScheduleResource)
    GET     /repair_schedule/{id} (com.spotify.reaper.resources.RepairScheduleResource)
    POST    /repair_schedule (com.spotify.reaper.resources.RepairScheduleResource)
    PUT     /repair_schedule/{id} (com.spotify.reaper.resources.RepairScheduleResource)

But when using the following configuration:

segmentCount: 200
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9
#scheduleDaysBetween: 7
#daysToExpireAfterDone: 2
repairRunThreadCount: 15
hangingRepairTimeoutMins: 30
storageType: database
enableCrossOrigin: true
incrementalRepair: false

logging:
  level: DEBUG
  loggers:
    io.dropwizard: DEBUG
    org.eclipse.jetty: DEBUG
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"
    - type: file
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"
      currentLogFilename: /var/log/cassandra_reaper/cassandra_reaper.log
      archivedLogFilenamePattern: /var/log/cassandra_reaper/cassandra_reaper.%d.log.gz

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8666
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8667
      bindHost: 0.0.0.0

database:
  driverClass: org.postgresql.Driver
  user: reaper
  password: my_secret_password
  url: jdbc:postgresql://localhost/reaper_db

jmxAuth:
    username: controlRole
    password: XXX

Then debug logs show:

DEBUG  [2015-12-17 10:09:05,918] [main] o.e.j.u.log - Logging to Logger[org.eclipse.jetty.util.log] via org.eclipse.jetty.util.log.Slf4jLog
DEBUG  [2015-12-17 10:09:05,927] [main] o.e.j.u.c.ContainerLifeCycle - i.d.j.MutableServletContextHandler@1fc32e4f{/,null,null} added {org.eclipse.jetty.servlet.ServletHandler@2f67b837,AUTO}
DEBUG  [2015-12-17 10:09:05,928] [main] o.e.j.u.c.ContainerLifeCycle - i.d.j.MutableServletContextHandler@6cce16f4{/,null,null} added {org.eclipse.jetty.servlet.ServletHandler@7efaad5a,AUTO}
DEBUG  [2015-12-17 10:09:05,941] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@7efaad5a added {tasks@6907b8e==io.dropwizard.servlets.tasks.TaskServlet,1,true,AUTO}
DEBUG  [2015-12-17 10:09:05,943] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@7efaad5a added {[/tasks/*]=>tasks,POJO}
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - filterNameMap={}
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - pathFilters=null
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - servletFilterMap=null
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - servletPathMap={/tasks/*=tasks@6907b8e==io.dropwizard.servlets.tasks.TaskServlet,1,true}
DEBUG  [2015-12-17 10:09:05,946] [main] o.e.j.s.ServletHandler - servletNameMap={tasks=tasks@6907b8e==io.dropwizard.servlets.tasks.TaskServlet,1,true}
INFO   [2015-12-17 10:09:05,955] [main] i.d.a.AssetsBundle - Registering AssetBundle with name: assets for path /webui/*
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true,AUTO}
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {[/webui/*]=>assets,POJO}
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.s.ServletHandler - filterNameMap={}
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.s.ServletHandler - pathFilters=null
DEBUG  [2015-12-17 10:09:05,985] [main] o.e.j.s.ServletHandler - servletFilterMap=null
DEBUG  [2015-12-17 10:09:05,986] [main] o.e.j.s.ServletHandler - servletPathMap={/webui/*=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:05,986] [main] o.e.j.s.ServletHandler - servletNameMap={assets=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - repairIntensity: 0.9
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - incrementalRepair:false
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - repairRunThreadCount: 15
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - segmentCount: 200
DEBUG  [2015-12-17 10:09:05,986] [main] c.s.r.ReaperApplication - repairParallelism: DATACENTER_AWARE
DEBUG  [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - hangingRepairTimeoutMins: 30
DEBUG  [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - jmxPorts: null
DEBUG  [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - adding signal handler for SIGHUP
INFO   [2015-12-17 10:09:05,987] [main] c.s.r.ReaperApplication - initializing runner thread pool with 15 threads
INFO   [2015-12-17 10:09:05,993] [main] c.s.r.ReaperApplication - initializing storage of type: database
INFO   [2015-12-17 10:09:06,035] [main] c.s.r.ReaperApplication - no JMX connection factory given in context, creating default
DEBUG  [2015-12-17 10:09:06,037] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {crossOriginRequests,AUTO}
DEBUG  [2015-12-17 10:09:06,040] [main] o.e.j.u.c.ContainerLifeCycle - org.eclipse.jetty.servlet.ServletHandler@2f67b837 added {[/*]/[]==31=>crossOriginRequests,POJO}
DEBUG  [2015-12-17 10:09:06,041] [main] o.e.j.s.ServletHandler - filterNameMap={crossOriginRequests=crossOriginRequests}
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - pathFilters=[[/*]/[]==31=>crossOriginRequests]
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - servletFilterMap={}
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - servletPathMap={/webui/*=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:06,042] [main] o.e.j.s.ServletHandler - servletNameMap={assets=assets@ac107383==io.dropwizard.servlets.assets.AssetServlet,1,true}
DEBUG  [2015-12-17 10:09:06,042] [main] c.s.r.ReaperApplication - using specified JMX credentials for authentication
INFO   [2015-12-17 10:09:06,042] [main] c.s.r.ReaperApplication - creating and registering health checks
INFO   [2015-12-17 10:09:06,043] [main] c.s.r.ReaperApplication - creating resources and registering endpoints
INFO   [2015-12-17 10:09:07,049] [main] c.s.r.s.SchedulingManager - Starting new SchedulingManager instance
INFO   [2015-12-17 10:09:07,050] [main] c.s.r.ReaperApplication - resuming pending repair runs

And the server does not start.

resume repairs automatically after errors

The repair often ends up in an ERROR state if nodes are down or restarted. Sometimes the message is just "Exception: null". After this happens, the repair must be resumed manually with spreaper. It would be preferable if it resumed automatically, perhaps after some delay.
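One way this could look (a sketch only; the RUNNING and ERROR states match those visible elsewhere in this document, everything else is illustrative):

    // After a run ends in ERROR, schedule a delayed attempt to put it back into RUNNING
    // so the repair runner picks it up again, instead of waiting for a manual resume.
    scheduler.schedule(() -> {
      RepairRun run = storage.getRepairRun(runId);
      if (run.getRunState() == RepairRun.RunState.ERROR) {
        storage.updateRepairRun(run.with()
            .runState(RepairRun.RunState.RUNNING)
            .build(run.getId()));
      }
    }, retryDelayMinutes, TimeUnit.MINUTES);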

Executing SegmentRunner failed: null

Just trying to launch this tool for the very first time:

./spreaper add-cluster 10.210.3.221
./spreaper repair productioncluster sync --tables user_quota

and I'm getting only:

INFO   [2015-12-03 13:47:03,600] [productioncluster:1:3184] c.s.r.s.SegmentRunner - It is ok to repair segment '3184' on repair run with id '1' 
INFO   [2015-12-03 13:47:03,727] [productioncluster:1:3184] c.s.r.c.JmxProxy - Triggering repair of range (5191522389173550088,5192411715538110057] for keyspace "sync" on host 10.195.15.167, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:04,097] [productioncluster:1:3184] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:04,897] [productioncluster:1:3202] c.s.r.s.SegmentRunner - It is ok to repair segment '3202' on repair run with id '1' 
INFO   [2015-12-03 13:47:05,138] [productioncluster:1:3202] c.s.r.c.JmxProxy - Triggering repair of range (5287223852807787957,5288653625584332214] for keyspace "sync" on host 10.195.15.176, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:05,502] [productioncluster:1:3202] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:05,930] [productioncluster:1:3232] c.s.r.s.SegmentRunner - It is ok to repair segment '3232' on repair run with id '1' 
INFO   [2015-12-03 13:47:06,002] [productioncluster:1:3232] c.s.r.c.JmxProxy - Triggering repair of range (5392661295720643869,5396401862543039721] for keyspace "sync" on host 10.210.3.221, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:06,103] [productioncluster:1:3232] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:06,464] [productioncluster:1:3238] c.s.r.s.SegmentRunner - It is ok to repair segment '3238' on repair run with id '1' 
INFO   [2015-12-03 13:47:06,518] [productioncluster:1:3238] c.s.r.c.JmxProxy - Triggering repair of range (5428508491334977628,5429554845014064640] for keyspace "sync" on host 10.210.3.117, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:06,601] [productioncluster:1:3238] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 
INFO   [2015-12-03 13:47:06,665] [productioncluster:1:3244] c.s.r.s.SegmentRunner - It is ok to repair segment '3244' on repair run with id '1' 
INFO   [2015-12-03 13:47:06,748] [productioncluster:1:3244] c.s.r.c.JmxProxy - Triggering repair of range (5447912260606081125,5452254409804456671] for keyspace "sync" on host 10.210.3.162, with repair parallelism DATACENTER_AWARE, in cluster with Cassandra version '2.1.11' (can use DATACENTER_AWARE 'true'), for column families: [user_quota] 
ERROR  [2015-12-03 13:47:06,836] [productioncluster:1:3244] c.s.r.s.RepairRunner - Executing SegmentRunner failed: null 

I'm using C* 2.1.11 in two datacenters with 8 nodes each, and cassandra-reaper 0.2.4.

ping works fine:

mlowicki:bin mlowicki$ ./spreaper ping
# Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues
# ------------------------------------------------------------------------------
# Sending PING to Reaper...
# [Reply] Cassandra Reaper ping resource: PONG
# Cassandra Reaper is answering in: localhost:8080

cluster seems to be registered correctly:

mlowicki:bin mlowicki$ ./spreaper list-clusters
# Report improvements/bugs at https://github.com/spotify/cassandra-reaper/issues
# ------------------------------------------------------------------------------
# Listing all registered Cassandra clusters
# Found 1 clusters:
productioncluster

configuration file:

segmentCount: 200 
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9 
scheduleDaysBetween: 7
#daysToExpireAfterDone: 2
repairRunThreadCount: 15
hangingRepairTimeoutMins: 30
storageType: memory
enableCrossOrigin: true

logging:
  level: INFO
  loggers:
    io.dropwizard: INFO
    org.eclipse.jetty: INFO
  appenders:
    - type: console
      logFormat: "%-6level [%d] [%t] %logger{5} - %msg %n"

server:
  type: default
  applicationConnectors:
    - type: http
      port: 8080
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 8081
      bindHost: 0.0.0.0

database:
  driverClass: org.postgresql.Driver
  user: pg-user
  password: pg-pass
  url: jdbc:postgresql://db.example.com/db-prod

jmxAuth:
    username: controlRole
    password: XXX

JDK version conflict

The Debian build package requires default-jdk, which resolves to JDK 6, while the Maven project requires JDK 7.

Support for older cassandra versions

Does it work with Cassandra 1.2.19? I tried building for Cassandra 1.2.19 by changing the pom.xml file, but there are many compilation errors:

[ERROR] /root/cassandra-reaper/src/main/java/com/spotify/reaper/core/RepairRun.java:[16,35] package org.apache.cassandra.repair does not exist
[ERROR] /root/cassandra-reaper/src/main/java/com/spotify/reaper/core/RepairRun.java:[40,17] cannot find symbol
symbol: class RepairParallelism
location: class com.spotify.reaper.core.RepairRun
[ERROR] /root/cassandra-reaper/src/main/java/com/spotify/reaper/core/RepairRun.java:[111,10] cannot find symbol
symbol: class RepairParallelism

location: class com.spotify.reaper.core.RepairRun

Keep Reaper state inside Cassandra

Instead of relying on an outside data store, like Postgres, it would make sense to keep the state inside Cassandra, in a separate keyspace.

No coordinators for range

I'm trying to understand how I ran into this issue, "No coordinators for range (8285692895160687807,8293377386900436865]", since after restarting Reaper it worked for the exact same range.

Create a GUI for using Reaper

Currently there is only the REST API and a CLI tool for using and configuring Reaper. It would be better for a wider audience to provide a simple GUI that exposes Reaper's functionality.

Loop on only 3 segments

Hello,

Cassandra version: 2.0.10
3 nodes.

I found that Reaper keeps looping over only 3 segments: 1, 68, and 135.

Log:

INFO [2015-10-27 14:47:51,225] [dw-17 - POST /repair_run?tables=Standard1&clusterName=spotify&keyspace=Keyspace1&owner=aeljami&cause=manual+spreaper+run] c.s.r.s.SegmentGenerator - Dividing token range [-9223372036854775808,-3074457345618258603) into 67 segments

INFO [2015-10-27 14:47:51,228] [dw-17 - POST /repair_run?tables=Standard1&clusterName=spotify&keyspace=Keyspace1&owner=aeljami&cause=manual+spreaper+run] c.s.r.s.SegmentGenerator - Dividing token range [-3074457345618258603,3074457345618258602) into 67 segments

INFO [2015-10-27 14:47:51,230] [dw-17 - POST /repair_run?tables=Standard1&clusterName=spotify&keyspace=Keyspace1&owner=aeljami&cause=manual+spreaper+run] c.s.r.s.SegmentGenerator - Dividing token range [3074457345618258602,-9223372036854775808) into 67 segments

-----Segment 1----
INFO [2015-10-27 14:47:51,462] [spotify:1:1] c.s.r.s.SegmentRunner - It is ok to repair segment '1' on repair run with id '1'
INFO [2015-10-27 14:47:51,468] [spotify:1:1] c.s.r.c.JmxProxy - Triggering repair of range (-9223372036854775808,-9131597190716917343] for keyspace "Keyspace1" on host 127.0.0.2, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:47:51,486] [spotify:1:1] c.s.r.s.SegmentRunner - Repair for segment 1 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:47:51,488] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1854 on segment 1 returned with state RUNNING
INFO [2015-10-27 14:47:51,489] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1854 on segment 1 has been cancelled while running
INFO [2015-10-27 14:47:51,489] [spotify:1:1] c.s.r.s.SegmentRunner - Postponing segment 1
INFO [2015-10-27 14:47:51,489] [spotify:1:1] c.s.r.s.SegmentRunner - Aborting repair on segment with id 1 on coordinator 127.0.0.2

-----Segment 68----

INFO [2015-10-27 14:47:51,519] [spotify:1:68] c.s.r.s.SegmentRunner - It is ok to repair segment '68' on repair run with id '1'
INFO [2015-10-27 14:47:51,521] [spotify:1:68] c.s.r.c.JmxProxy - Triggering repair of range (-3074457345618258603,-2982682499480400138] for keyspace "Keyspace1" on host 127.0.0.3, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:47:51,523] [spotify:1:68] c.s.r.s.SegmentRunner - Repair for segment 68 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1855 on segment 68 returned with state RUNNING
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1855 on segment 68 has been cancelled while running
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Postponing segment 68
INFO [2015-10-27 14:47:51,533] [spotify:1:68] c.s.r.s.SegmentRunner - Aborting repair on segment with id 68 on coordinator 127.0.0.3

-----Segment 135----

INFO [2015-10-27 14:48:21,548] [spotify:1:135] c.s.r.c.JmxProxy - Triggering repair of range (3074457345618258602,3166232191756117067] for keyspace "Keyspace1" on host 127.0.0.1, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:48:21,551] [spotify:1:135] c.s.r.s.SegmentRunner - Repair for segment 135 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:48:21,560] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1859 on segment 135 returned with state RUNNING
INFO [2015-10-27 14:48:21,560] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1859 on segment 135 has been cancelled while running
INFO [2015-10-27 14:48:21,560] [spotify:1:135] c.s.r.s.SegmentRunner - Postponing segment 135
INFO [2015-10-27 14:48:21,561] [spotify:1:135] c.s.r.s.SegmentRunner - Aborting repair on segment with id 135 on coordinator 127.0.0.1

-----Segment 1----
INFO [2015-10-27 14:49:51,464] [spotify:1:1] c.s.r.c.JmxProxy - Triggering repair of range (-9223372036854775808,-9131597190716917343] for keyspace "Keyspace1" on host 127.0.0.2, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:49:51,465] [spotify:1:1] c.s.r.s.SegmentRunner - Repair for segment 1 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:49:51,471] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1866 on segment 1 returned with state RUNNING
INFO [2015-10-27 14:49:51,471] [spotify:1:1] c.s.r.s.SegmentRunner - Repair command 1866 on segment 1 has been cancelled while running
INFO [2015-10-27 14:49:51,471] [spotify:1:1] c.s.r.s.SegmentRunner - Postponing segment 1
INFO [2015-10-27 14:49:51,472] [spotify:1:1] c.s.r.s.SegmentRunner - Aborting repair on segment with id 1 on coordinator 127.0.0.2

-----Segment 68----

INFO [2015-10-27 14:49:51,500] [spotify:1:68] c.s.r.c.JmxProxy - Triggering repair of range (-3074457345618258603,-2982682499480400138] for keyspace "Keyspace1" on host 127.0.0.3, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:49:51,501] [spotify:1:68] c.s.r.s.SegmentRunner - Repair for segment 68 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:49:51,508] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1867 on segment 68 returned with state RUNNING
INFO [2015-10-27 14:49:51,509] [spotify:1:68] c.s.r.s.SegmentRunner - Repair command 1867 on segment 68 has been cancelled while running
INFO [2015-10-27 14:49:51,509] [spotify:1:68] c.s.r.s.SegmentRunner - Postponing segment 68
INFO [2015-10-27 14:49:51,509] [spotify:1:68] c.s.r.s.SegmentRunner - Aborting repair on segment with id 68 on coordinator 127.0.0.3

-----Segment 135----
INFO [2015-10-27 14:49:51,536] [spotify:1:135] c.s.r.c.JmxProxy - Triggering repair of range (3074457345618258602,3166232191756117067] for keyspace "Keyspace1" on host 127.0.0.1, with repair parallelism SEQUENTIAL, in cluster with Cassandra version '2.0.10' (can use DATACENTER_AWARE 'false'), for column families: [Standard1]
INFO [2015-10-27 14:49:51,538] [spotify:1:135] c.s.r.s.SegmentRunner - Repair for segment 135 started, status wait will timeout in 1800000 millis
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1868 on segment 135 returned with state RUNNING
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Repair command 1868 on segment 135 has been cancelled while running
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Postponing segment 135
INFO [2015-10-27 14:49:51,545] [spotify:1:135] c.s.r.s.SegmentRunner - Aborting repair on segment with id 135 on coordinator 127.0.0.1

This goes on endlessly; no other segments are repaired.
Any idea?

Greets,

Track Cassandra's repair hash in RepairSegment

Cassandra gives each repair command a unique hash, which is not the same as the repair number that we already track. It might be convenient to have this hash in the repair segment, so that you can easily find the right repair command in the logs on all nodes involved in the repair.

Example: repair #2b41cfb0-a93d-11e4-b542-f707d752ca5f
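A small sketch of what that could look like (the repairSessionId builder method is hypothetical, following the segment builder style shown earlier in this document):

    // Store the session hash reported by Cassandra next to the command number, so the
    // matching "repair #<hash>" lines can be grepped on every node involved in the repair.
    context.storage.updateRepairSegment(segment.with()
        .coordinatorHost(coordinator.getHost())
        .repairSessionId("2b41cfb0-a93d-11e4-b542-f707d752ca5f")  // hypothetical field; value taken from the example above
        .build(segmentId));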
