
hyperbahn's People

Contributors

albertyw, andrewdeandrade, blampe, dansimau, jcorbin, kriskowal, lopter, lupie, malandrew, mranney, raynos, rf, shannili, strongliang


hyperbahn's Issues

uncaught exception: out of bounds access

RangeError: Trying to access beyond buffer length
    at checkOffset (buffer.js:582:11)
    at Buffer.readUInt16BE (buffer.js:602:5)
    at BufferRW.readLazyFrameFrom [as readFrom] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/node_modules/tchannel/v2/lazy_frame.js:90:23)
    at fromBufferResult (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/node_modules/bufrw/interface.js:120:18)
    at ReadMachine.seek (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/node_modules/bufrw/stream/read_machine.js:116:15)
    at ReadMachine.handleChunk (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/node_modules/bufrw/stream/read_machine.js:67:28)
    at Socket.onSocketChunk (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/node_modules/tchannel/connection.js:451:29)
    at Socket.emit (events.js:95:17)
    at SocketInspector.inspectEmit (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/clients/socket-inspector.js:132:23)
    at Socket.boundInspectEmit [as emit] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/clients/socket-inspector.js:51:14)
    at Socket.<anonymous> (_stream_readable.js:764:14)
    at Socket.emit (events.js:92:17)
    at SocketInspector.inspectEmit (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/clients/socket-inspector.js:132:23)
    at Socket.boundInspectEmit [as emit] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000111-v1/node_modules/hyperbahn/clients/socket-inspector.js:51:14)
    at emitReadable_ (_stream_readable.js:426:10)
    at emitReadable (_stream_readable.js:422:5)
    at readableAddChunk (_stream_readable.js:165:9)
    at Socket.Readable.push (_stream_readable.js:127:10)
    at TCP.onread (net.js:528:21)

Tests to write for hyperbahn-client

  • advertise with hyperbahn + error frame
  • advertise with hyperbahn + error frame + no hardFail
  • advertise with invalid serviceName
  • advertise with invalid host port
  • advertise with unexpected hyperbahn failure
  • advertise with invalid serviceName + no hardFail
  • advertise with invalid host port + no hardFail
  • advertise with unexpected hyperbahn failure + no hardFail
  • advertise in a loop
  • calling advertise() after destroy
  • calling getClientSubChannel() after destroy

Feature: [Service Discovery] Having a fallback router

It would be nice to have a fallback router.

You could configure a local hyperbahn worker to re-route all traffic that it does not have local peers for to some other cluster.

It would be nice to configure this:

  • per service
  • globally.
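
A hedged sketch of what such a configuration might look like; the key names and values below are purely illustrative, not existing hyperbahn options:

```javascript
// Illustrative config shape only; these keys are not real hyperbahn options.
var fallbackRouterConfig = {
    // Global fallback: re-route traffic for unknown services to this cluster.
    'fallback.defaultCluster': ['10.0.0.1:21300', '10.0.0.2:21300'],

    // Per-service overrides take precedence over the global fallback.
    'fallback.perService': {
        'moe': ['10.1.0.1:21300']
    }
};
```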

BUG: When reaping peers we get a null pointer sad.

TypeError: Cannot read property 'connections' of null
    at ServiceDispatchHandler.reapPeers (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000072-v1/node_modules/hyperbahn/service-proxy.js:819:29)
    at reapPeers [as _onTimeout] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000072-v1/node_modules/hyperbahn/service-proxy.js:110:14)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

Advertisement Timeout SLA

{
  "log": "{\"error\":{\"stack\":\"TchannelRequestTimeoutError: request timed out after 500ms (limit was 500ms)\\n    at Object.createError [as RequestTimeoutError] (/home/udocker/chronotrigger/node_modules/tchannel/node_modules/error/typed.js:31:22)\\n    at V2OutRequest.onTimeout (/home/udocker/chronotrigger/node_modules/tchannel/out_request.js:555:31)\\n    at TimeHeap.callExpiredTimeouts (/home/udocker/chronotrigger/node_modules/tchannel/time_heap.js:169:14)\\n    at TimeHeap.drainExpired (/home/udocker/chronotrigger/node_modules/tchannel/time_heap.js:160:14)\\n    at TimeHeap.onTimeout (/home/udocker/chronotrigger/node_modules/tchannel/time_heap.js:144:10)\\n    at onTimeout [as _onTimeout] (/home/udocker/chronotrigger/node_modules/tchannel/time_heap.js:135:14)\\n    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)\",\"type\":\"tchannel.request.timeout\",\"message\":\"request timed out after 500ms (limit was 500ms)\",\"id\":871,\"start\":1443038570760,\"elapsed\":500,\"timeout\":500,\"logical\":true,\"name\":\"TchannelRequestTimeoutError\",\"fullType\":\"tchannel.request.timeout\"},\"serviceName\":\"chronotrigger\",\"level\":\"error\",\"message\":\"HyperbahnClient: advertisement failure, marking server as sick\"}\n",
  "stream": "stderr",
  "time": "2015-09-23T20:02:51.260517901Z"
}

@Raynos @jcorbin Deployment Process results in hyperbahn timeouts

Peer reaper needs to be amortized

The peer reaper appears to be taking a huge amount of time to run, which is causing a spike in event loop lag, causing workers to be marked as dead, causing a bunch of ring changes and a spike in latency across the board.

The peer reaper needs to be spread across event loop ticks.
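
One way to sketch this, assuming a flat list of peers and a per-peer reap function (both names hypothetical), is to process a small batch per setImmediate turn so the event loop can breathe between batches:

```javascript
// Hypothetical sketch: amortize a long peer sweep across event loop ticks
// by reaping a small batch per setImmediate turn instead of one big loop.
function reapPeersAmortized(peers, reapOne, done, batchSize) {
    batchSize = batchSize || 25;
    var i = 0;

    function nextBatch() {
        var end = Math.min(i + batchSize, peers.length);
        for (; i < end; i++) {
            reapOne(peers[i]);
        }
        if (i < peers.length) {
            setImmediate(nextBatch); // yield back to the event loop
        } else {
            done();
        }
    }

    nextBatch();
}
```

Batch size would need tuning against real reap cost; the point is only that no single tick does the whole sweep.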

Circuit breaker too aggressive during service restart

If a service gets redeployed, it will stop and start a bunch of workers.

For each worker that's restarted it will ECONNRESET the socket and kill some inflight RPCs.

The circuit breaker immediately kicks in and thinks there is something wrong with the service.

We should ensure that we can do a rolling restart of all instances of a service without the circuit breaker penalizing all the callers.

cc: @kriskowal @jcorbin @rf

membershipChanged -> ringChanged

Evidently, ringChanged is a subset of membershipChanged (membership only grows), and affinity would only change in response to a ringChanged event. We should carefully consider using ringChanged instead to avoid unnecessary affinity updates in service proxy.

Hyperbahn exception

TypeError: Cannot read property 'ops' of null
    at ServiceDispatchHandler.rateLimit (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000066-v1/node_modules/hyperbahn/service-proxy.js:202:23)
    at ServiceDispatchHandler.handleRequest (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000066-v1/node_modules/hyperbahn/service-proxy.js:172:41)
    at TChannelSelfConnection.runHandler (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000066-v1/node_modules/hyperbahn/node_modules/tchannel/connection_base.js:207:26)
    at runHandler (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000066-v1/node_modules/hyperbahn/node_modules/tchannel/connection_base.js:162:14)
    at process._tickDomainCallback (node.js:463:13)

Hyperbahn lazy relay errors

1. Cannot read property 'length' of undefined (316 occurrences)
2. Object #<LazyRelayOutReq> has no method 'onTimeout' (178)
3. Cannot call method 'cancel' of null (4)
4. Object #<LazyRelayInReq> has no method 'onTimeout' (2)

cc @jcorbin

npm install failed.

I ran the command npm install and got the following errors:

[email protected] install /home/hjianhao/nodeprj/hyperbahn/node_modules/farmhash
node-gyp rebuild

make: Entering directory '/home/hjianhao/nodeprj/hyperbahn/node_modules/farmhash/build'
CXX(target) Release/obj.target/farmhash/src/upstream/farmhash.o
CXX(target) Release/obj.target/farmhash/src/bindings.o
In file included from ../../nan/nan_new.h:190:0,
                 from ../../nan/nan.h:74,
                 from ../src/bindings.cc:4:
../../nan/nan_implementation_12_inl.h: In static member function ‘static NanIntern::FactoryBase<v8::Signature>::return_t NanIntern::Factory<v8::Signature>::New(NanIntern::Factory<v8::Signature>::FTH, int, NanIntern::Factory<v8::Signature>::FTH*)’:
../../nan/nan_implementation_12_inl.h:181:76: error: no matching function for call to ‘v8::Signature::New(v8::Isolate*, NanIntern::Factory<v8::Signature>::FTH&, int&, NanIntern::Factory<v8::Signature>::FTH*&)’
 return v8::Signature::New(v8::Isolate::GetCurrent(), receiver, argc, argv);
                                                                           ^
../../nan/nan_implementation_12_inl.h:181:76: note: candidate is:
In file included from /home/hjianhao/.node-gyp/6.0.0/include/node/node.h:42:0,
                 from ../src/bindings.cc:2:
/home/hjianhao/.node-gyp/6.0.0/include/node/v8.h:4798:27: note: static v8::Local<v8::Signature> v8::Signature::New(v8::Isolate*, v8::Local<v8::FunctionTemplate>)
 static Local<Signature> New(
                         ^
/home/hjianhao/.node-gyp/6.0.0/include/node/v8.h:4798:27: note: candidate expects 2 arguments, 4 provided
In file included from ../src/bindings.cc:4:0:
../../nan/nan.h: At global scope:
../../nan/nan.h:165:25: error: redefinition of ‘template<class T> v8::Local<T> _NanEnsureLocal(v8::Local<T>)’
 NAN_INLINE v8::Local<T> _NanEnsureLocal(v8::Local<T> val) {
                         ^
../../nan/nan.h:160:25: note: ‘template<class T> v8::Local<T> _NanEnsureLocal(v8::Handle<T>)’ previously declared here
 NAN_INLINE v8::Local<T> _NanEnsureLocal(v8::Handle<T> val) {
                         ^
../../nan/nan.h:369:20: error: variable or field ‘NanAddGCEpilogueCallback’ declared void
   v8::Isolate::GCEpilogueCallback callback
                ^
../../nan/nan.h:369:7: error: ‘GCEpilogueCallback’ is not a member of ‘v8::Isolate’
My node.js and npm versions are as follows:

npm ERR! Linux 3.19.0-16-generic
npm ERR! argv "/home/hjianhao/dev/node-v6.0.0-linux-x64/bin/node" "/home/hjianhao/dev/node/bin/npm" "install"
npm ERR! node v6.0.0
npm ERR! npm v3.8.6
npm ERR! code ELIFECYCLE

Upgrade to Node 5

Questionable:

  • probably faster. At least for peer choice there was a clear speed improvement
  • Map/Set, probably faster over object/array for applicable work

Pros:

  • Support for IRHydra for more better optimizations
  • Better supported in the community; especially security upgrades

Cons:

  • Supporting TChannel in 2 different node versions
  • optimizations for Node 5 may not be applicable / may even slow down Node 0.10
  • needs working heap, core & flamegraph tooling
  • cannot use ES2015 features whilst support node0.10

TBD:

  • Better heap analysis tooling ?
  • Better GC analysis tooling ?

Measure per frame relay overhead

Our current relay latency stat only measures "peer selection + call req frame forwarding" latency.

It would be good to have a frame forwarding latency stat for each of the 4 frame types, one that does not include peer selection overhead.

That would let us notice whether we become "slower" when people turn on large responses, since today we have zero relay latency stats for cont frames.
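
As a sketch, per-frame-type forwarding stats might look like the following. The stat key shapes and the `statsd` emitter are assumptions; the frame type codes are the ones from the TChannel v2 spec:

```javascript
// Sketch only: time a frame forward per frame type, excluding peer selection.
// Stat key shapes and the `statsd` emitter are illustrative assumptions.
var FRAME_NAMES = {
    0x03: 'call-request',       // call req
    0x04: 'call-response',      // call res
    0x13: 'call-request-cont',  // call req continue
    0x14: 'call-response-cont'  // call res continue
};

function timeFrameForward(statsd, frameType, forward) {
    var start = Date.now();
    forward(); // the actual frame forwarding work
    var elapsed = Date.now() - start;
    statsd.timing(
        'relay.forward-latency.' + (FRAME_NAMES[frameType] || 'unknown'),
        elapsed
    );
}
```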

Hyperbahn Lazy Disable Error


TypeError: Object #<LazyRelayOutReq> has no method 'emitResponse'
    at TChannelConnection.onCallResponse (.../tchannel/connection.js:362:9)
    at DefinedEvent.onCallResponse [as listener] (.../tchannel/connection.js:211:14)
    at DefinedEvent.emit (.../tchannel/lib/event_emitter.js:86:14)
    at TChannelV2Handler.handleCallResponse (.../tchannel/v2/handler.js:408:40)
    at TChannelV2Handler.handleEagerFrame (.../tchannel/v2/handler.js:186:25)
    at TChannelConnection.handleReadFrame (.../tchannel/connection.js:338:18)
    at ReadMachine.handleReadFrame [as emit] (.../tchannel/connection.js:203:14)
    at ReadMachine.seek (.../tchannel/node_modules/bufrw/stream/read_machine.js:120:14)
    at ReadMachine.handleChunk (.../tchannel/node_modules/bufrw/stream/read_machine.js:67:28)
    at Socket.onSocketChunk (.../tchannel/connection.js:120:29)
    at Socket.emit (events.js:95:17)
    at Socket.<anonymous> (_stream_readable.js:764:14)
    at Socket.emit (events.js:92:17)
    at emitReadable_ (_stream_readable.js:426:10)
    at emitReadable (_stream_readable.js:422:5)
    at readableAddChunk (_stream_readable.js:165:9)
    at Socket.Readable.push (_stream_readable.js:127:10)
    at TCP.onread (net.js:528:21)

Peer selection not selecting moe instances with both incoming and outgoing connections

While introspecting moe in dca1, we found that a single Hyperbahn node (hyperbahn09-dca1) was not sending any calls to a specific instance of moe. Upon debugging, the only difference between this node and other nodes was that this node had both "in" and "out" connections.

  "10.25.13.4:54135": {
    "connected": {
      "in": true,
      "out": true
    },
    "identified": {
      "in": true,
      "out": true
    },
    "serviceNames": [
      "moe"
    ]
  },
  "10.26.46.21:43397": {
    "connected": {
      "in": false,
      "out": true
    },
    "identified": {
      "in": false,
      "out": true
    },
    "serviceNames": [
      "moe"
    ]
  }

This might be a bug in the peer selection logic.

Fix logger warning debt

We need to get our logger warnings down so we can alert on them:

  • reduce "forwarding error frame" to info(). We need to add new logs for all places where hyperbahn calls sendErrorFrame() we should also add owner service to sendErrorFrame().
  • fix bug with "mismatched conn.onReqDone "
  • reduce "stale tombstone" by fixing bug.
  • downgrade "circuit became unhealthy" but consider circuit breaker alerts.
  • Fix bug with "destroying due to init timeout"
  • Get better at "resetting connection" and "Got a connection error" logs.

When handling a request we get a null pointer sad.

TypeError: Cannot read property 'ops' of null
    at ServiceDispatchHandler.handleRequest (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000070-v1/node_modules/hyperbahn/service-proxy.js:230:27)
    at TChannelSelfConnection.runHandler (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000070-v1/node_modules/hyperbahn/node_modules/tchannel/connection_base.js:207:26)
    at runHandler (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000070-v1/node_modules/hyperbahn/node_modules/tchannel/connection_base.js:162:14)
    at process._tickDomainCallback (node.js:463:13)

Audit requests for assumed existence of req.connection

We have observed uncaught exceptions in production due to the assumption that requests have a connection. Specifically, in ServiceProxy rate limiting we need to re-evaluate two places where this assumption is made.

Tests to write for hyperbahn

  • forwarding a 80kb payload
  • calling connections for a non-existent service
  • calling connections with exit nodes down
  • register two services on one hostPort fails
  • send register to entry who is also exit
  • creating a logger
  • forwarding a call response not ok
  • forwarding an error frame from a service
  • sending a message to autobahn with empty service
  • sending an unknown arg1 to autobahn for service autobahn
  • calling the repl_port endpoint
  • sending corrupted json to autobahn arg2
  • sending corrupted json to autobahn arg3
  • decrement k by a set value
  • advertising during a restart does not error

Disruption Endpoints

We'd like a set of endpoints for creating disruptions between services:

  • block % of traffic
  • delay traffic for n ms
  • eject % of instances
  • manually trigger circuit breaker (sortof done)
  • manually trigger rate limiter
  • fake advertise service X as Y

I propose we use the service name hyperbahn with thrift endpoints in the HyperbahnDisruption service. For example, HyperbahnDisruption::block_traffic_v1, HyperbahnDisruption::delay_traffic_v1, etc.

Remote config change logs from test cluster

I am using the test cluster to run tchannel tests through hyperbahn. The test cluster is spewing the following:

AUTOBAHN INFO: [remote-config] config file changed ~ { changedKeys: 
   [ 'update:rateLimiting.enabled',
     'update:rateLimiting.exemptServices',
     'update:rateLimiting.rpsLimitForServiceName',
     'update:rateLimiting.totalRpsLimit' ],
  newConfig: 
   { 'rateLimiting.enabled': true,
     'rateLimiting.exemptServices': [ 'sam', 'robert' ],
     'rateLimiting.rpsLimitForServiceName': { nancy: 111, bill: 60, summer: 66 },
     'rateLimiting.totalRpsLimit': 1201 } }
AUTOBAHN INFO: [remote-config] config file changed ~ { changedKeys: 
   [ 'update:rateLimiting.enabled',
     'update:rateLimiting.exemptServices',
     'update:rateLimiting.rpsLimitForServiceName',
     'update:rateLimiting.totalRpsLimit' ],
  newConfig: 
   { 'rateLimiting.enabled': true,
     'rateLimiting.exemptServices': [ 'sam', 'robert' ],
     'rateLimiting.rpsLimitForServiceName': { nancy: 111, bill: 60, summer: 66 },
     'rateLimiting.totalRpsLimit': 1201 } }

cc @Raynos

Black hole for 2x rate limit

The TChannel protocol specifies that clients should retry if they receive a busy frame. The rate limiter produces busy frames. There are two kinds of busy: ephemeral busyness, where a node is temporarily busy but other peers are available; and systemic busyness, where all peers are backlogged. The rate limiter indicates the latter but induces the behavior for the former.

We could instead drop requests once a service receives 2x the rate limit. In practice, the normal rate limiter induces 5x the normal volume by causing some number of retries under different retry policies. A 2x black-hole rate limiter would mitigate retry storms caused by the 1x rate limit. This would effectively encourage folks to use a "retry once" policy to keep under the 2x rate limit, so they get notified immediately when they hit the rate limiter.
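
The two-tier behavior could be summarized as a small decision function; the names and thresholds below are illustrative, not hyperbahn's actual rate limiter:

```javascript
// Illustrative decision: under the limit forward; between 1x and 2x return a
// busy frame (clients may retry); above 2x drop the request on the floor so
// retries cannot amplify the overload.
function rateLimitDecision(currentRps, rpsLimit) {
    if (currentRps <= rpsLimit) return 'forward';
    if (currentRps <= 2 * rpsLimit) return 'busy';
    return 'drop';
}
```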

Documentation for setting up a hyperbahn router

Forgive me if I have completely missed the underlying concepts here, as I have yet to fully understand how the Hyperbahn routing mesh works. It appears that Hyperbahn runs on a number of servers which form a consistent hash ring, much like ringpop, to handle or forward requests to the server that knows how to reach the servers advertising a given serviceName.

The problem I am having is twofold. First, there are several references to Hyperbahn routers, yet little to no documentation or examples detailing what is required to set up a Hyperbahn router that other services can advertise on (such as this in the docs: https://github.com/uber/tchannel-node/blob/master/docs/GUIDE.md#setting-up-hyperbahn). Second, and this is very much my own problem, I am not entirely clear on how TChannel/Ringpop/Hyperbahn are related. From what I understand, TChannel is the transport mechanism used by Ringpop to handle or proxy requests to the correct app instance and return the response. Hyperbahn is built upon Ringpop and extends it to maintain a list of ip:ports corresponding to a given serviceName. Based on the documentation, to use a service on Hyperbahn one instantiates a server that advertises with a serviceName on Hyperbahn and makes requests to other serviceNames using the getClientChannel() method (i.e. Hyperbahn cannot be a client alone; it must also advertise a service).

What I would be extremely grateful for is for someone to point me to documentation/examples that demonstrate making a request over Hyperbahn and receiving the response, setting up a (barebones) Hyperbahn router, and a little detail on how tchannel subchannels and Hyperbahn interact: given that hyperbahn takes the root tchannel and advertises a service, how does one call different tchannel subchannels over Hyperbahn, or is this not possible/intended? And finally, for someone to tell me if what I have said is utterly wrong and why. Whoever clarifies this for me shall have a beer bought for them in the event that we meet someday :)

reapSinglePeer has bugs

TypeError: Object.keys called on non-object
    at Function.keys (native)
    at ServiceDispatchHandler.reapSinglePeer (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000080-v1/node_modules/hyperbahn/service-proxy.js:912:31)
    at nextPeer (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000080-v1/node_modules/hyperbahn/service-proxy.js:876:14)
    at Object.deferNextPeer [as _onImmediate] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000080-v1/node_modules/hyperbahn/service-proxy.js:880:13)
    at processImmediate [as _immediateCallback] (timers.js:345:15)

round robin peer selection

From a flame graph I've observed that some workers / services are really struggling with peer selection.

[flame graph image]

We should implement a round-robin (or random) peer selection strategy and add a flipr knob so we can change the peer selection strategy per serviceName. We already have boolean logic to enable / disable the peer heap per serviceName.

Round-robin peer selection will reduce CPU utilization and slightly degrade load balancing by increasing variance.

If round robin is too involved we can also just implement random peer selection.
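
A minimal round-robin sketch, with illustrative names rather than hyperbahn's actual peer list API:

```javascript
// Hypothetical sketch: O(1) round-robin peer choice as an alternative to
// heap-based peer selection. Names are illustrative, not hyperbahn's API.
function RoundRobinPeerList(peers) {
    this.peers = peers;
    this.index = 0;
}

RoundRobinPeerList.prototype.choosePeer = function choosePeer() {
    if (this.peers.length === 0) {
        return null;
    }
    var peer = this.peers[this.index];
    this.index = (this.index + 1) % this.peers.length;
    return peer;
};
```

Each call is a constant-time array index plus an increment, versus a heap operation, which is where the CPU savings would come from.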

Mitigate 100% CPU edge case under ringpop churn

When the ring churns we get a large amount of ringpop ring changes.

This causes updateServiceChannels() to get called at a high frequency.

This method is incredibly CPU intensive.

To avoid high CPU usage under high frequency ringpop changes we should cap the maximum number of updateServiceChannels() calls per minute.

By doing so we add a small amount of staleness, but that's a far better trade-off than being unavailable and burning a lot of CPU.

Right here ( https://github.com/uber/hyperbahn/blob/master/service-proxy.js#L282-L284 ), instead of scheduling an updateServiceChannels() immediately, we should schedule one in the future (5 seconds or something?) and check how many times we've updated in the last minute; maybe cap it at 12 updates per minute?
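
A rough sketch of the capped scheduling, using illustrative names and the defaults suggested above (5 second deferral, 12 updates per minute):

```javascript
// Hypothetical sketch: coalesce update requests and allow at most
// `maxPerMinute` actual updateServiceChannels() runs, deferring the rest.
function UpdateLimiter(update, maxPerMinute, deferMs) {
    this.update = update;                 // e.g. updateServiceChannels
    this.maxPerMinute = maxPerMinute || 12;
    this.deferMs = deferMs === undefined ? 5000 : deferMs;
    this.recent = [];                     // timestamps of recent updates
    this.pending = false;
}

UpdateLimiter.prototype.request = function request() {
    if (this.pending) return;             // a run is already scheduled; coalesce
    this.pending = true;
    var self = this;
    setTimeout(function onTimer() {
        self.pending = false;
        var now = Date.now();
        // keep only timestamps from the last minute
        self.recent = self.recent.filter(function (t) {
            return now - t < 60 * 1000;
        });
        if (self.recent.length >= self.maxPerMinute) {
            self.request();               // over budget this minute; defer again
            return;
        }
        self.recent.push(now);
        self.update();
    }, this.deferMs);
};
```

High-frequency ring changes then collapse into at most one scheduled update at a time, and at most twelve runs per minute.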

For ad requests, max wait for relay-ad

Ad requests fan out a relay-ad request to each of the k affine Hyperbahn relays for the advertiser. If any of those requests fail, typically due to a timeout, the ad response fails. As k increases, the probability of failure grows, and it is a frequent occurrence in production that failures cause "storms" of relay-ad requests, bogging down the network and driving "ad" retries.

To mitigate this problem, let’s consider capping the amount of time an ad will wait for relay-ad responses, and always provide a successful response containing all of the relays that successfully responded, unless there were none, or there were too few (at discretion of assignee).

I got started on this in my working copy, but it's not even close to done. This just points at where the code change probably ought to happen:

https://github.com/uber/hyperbahn/compare/best-effort-ad

The logic should go like this: for about 200ms (out of the 500ms budget allotted to ad requests), we should wait for relay-ad responses. If all of the relays respond before the 200ms deadline, respond fast with the whole list. If the 200ms expires, respond immediately. Maybe if there are too few (arbitrary) responses, return an error, maybe forward the last error. Otherwise, return a success ad response with whatever relay-ad successes were obtained.

Use timers.setTimeout instead of global setTimeout. We may need to use the time heap since we have a lot of timers, but if simple works in prod, the feature is done.
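
The 200ms-budget logic above could be sketched roughly as follows; `sendRelayAd`, the callback shapes, and the thresholds are illustrative stand-ins for hyperbahn's real internals:

```javascript
// Hypothetical sketch: fan out relay-ad, wait up to `waitMs`, then answer
// with whatever succeeded, erroring only when fewer than `minOk` responded.
function bestEffortRelayAd(relays, sendRelayAd, waitMs, minOk, callback) {
    var ok = [];
    var outstanding = relays.length;
    var finished = false;
    var timer = setTimeout(finish, waitMs);  // deadline, e.g. 200ms

    relays.forEach(function each(relay) {
        sendRelayAd(relay, function onResponse(err) {
            outstanding--;
            if (!err) ok.push(relay);
            if (outstanding === 0) finish(); // everyone answered early
        });
    });

    function finish() {
        if (finished) return;
        finished = true;
        clearTimeout(timer);
        if (ok.length < minOk) {
            callback(new Error('too few relay-ad responses'));
        } else {
            callback(null, ok);
        }
    }
}
```

The "forward the last error" and "too few" policies from above would slot into `finish()`.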

Dual write tchannel stats for ringpop

The ringpop sub channel needs to have both global & per worker tchannel stats

This is the only way to debug slow instances.

If we turn on per worker stats ONLY for ringpop the increased cardinality should be low.
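
A rough sketch of dual-writing a stat, assuming a statsd-style emitter; the key shapes are assumptions, not hyperbahn's real stat names:

```javascript
// Illustrative sketch: emit each ringpop stat twice, once globally and once
// keyed by the worker's hostPort, so slow instances stand out.
function dualEmit(statsd, hostPort, key, value) {
    statsd.timing('tchannel.' + key, value);                            // global
    statsd.timing('tchannel.per-worker.' + hostPort + '.' + key, value); // per worker
}
```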

Hyperbahn received fatal exception

A Hyperbahn node received a fatal exception. Here is the stack trace:

TchannelResponseAlreadyDoneError: cannot send error frame, response already done in state: 3
    at Object.createError [as ResponseAlreadyDone] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/node_modules/error/typed.js:31:22)
    at V2OutResponse.sendError (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/out_response.js:209:43)
    at TChannelV2Handler.onReqError (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/v2/handler.js:908:17)
    at DefinedEvent.onReqError [as listener] (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/v2/handler.js:91:14)
    at DefinedEvent.emit (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/lib/event_emitter.js:86:14)
    at TChannelConnection.buildResponse (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/connection_base.js:211:24)
    at buildResponse (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/connection_base.js:202:21)
    at onArg23 (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/request-handler.js:57:19)
    at compatCall (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/streaming_in_request.js:106:9)
    at end (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/node_modules/run-parallel/index.js:16:15)
    at done (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/node_modules/run-parallel/index.js:20:10)
    at each (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/node_modules/run-parallel/index.js:26:7)
    at signalReady (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/ready-signal/index.js:21:26)
    at StreamArg.finish (/var/cache/udeploy/r/hyperbahn/sjc1-produ-0000000057-v1/node_modules/hyperbahn/node_modules/tchannel/argstream.js:257:9)
    at StreamArg.emit (events.js:117:20)
    at _stream_readable.js:943:16
    at process._tickDomainCallback (node.js:463:13)

`ad` validation

Currently we do no validation in the ad endpoint handler, so invalid host ports like [::]:2382 can make it through. We need to validate service names and host ports to make sure they're valid.
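
A minimal validation sketch, assuming we only want to accept IPv4 host:port strings and non-empty alphanumeric service names; the exact rules here are illustrative, not hyperbahn's decided policy:

```javascript
// Illustrative validation: accept only dotted-quad IPv4 host:port with a
// non-zero port, so values like "[::]:2382" are rejected.
var HOST_PORT_RE = /^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d+)$/;

function isValidHostPort(hostPort) {
    var match = HOST_PORT_RE.exec(hostPort);
    if (!match) return false;
    var octetsOk = match[1].split('.').every(function (octet) {
        return parseInt(octet, 10) <= 255;
    });
    var port = parseInt(match[2], 10);
    return octetsOk && port > 0 && port <= 65535;
}

function isValidServiceName(serviceName) {
    return typeof serviceName === 'string' &&
        serviceName.length > 0 &&
        /^[a-zA-Z0-9_-]+$/.test(serviceName);
}
```

The ad handler would reject the request with a bad-request error frame when either check fails.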
