
swarm's Introduction

Description

Swarm is a framework for the creation of asynchronous, distributed client/server systems. Swarm is built on top of ocean.

A Tale of Two Protocols

The code in swarm is currently in transition. There exist two parallel client/server architectures in the repo: a new architecture (dubbed "neo") -- located in the src/swarm/neo package -- and a legacy architecture -- located in the other packages of src/swarm. The neo protocol is being introduced in stages, progressively adding features to the core client and server code over a series of releases.

When the legacy protocol is no longer in active use, it will be deprecated and eventually removed.

User Documentation

An overview of the features of the legacy and neo client architectures can be found here:

  • Legacy client documentation.
  • Neo client documentation.

Neo Support in Clients

The neo client functionality is implemented in such a way that it can be added to existing legacy clients. Thus, the functionality of _both_ protocols can be accessed through a single client instance. (Likewise, the servers are able to handle requests of both types, handling the two protocols on different ports.)

While the neo architecture is being developed, the legacy protocol and the associated client features remain unchanged -- indeed, there is no interaction between the neo functionality and the legacy functionality of the client, except at the system level (e.g. the allocation of file descriptors, etc).

Developer Documentation

Architectural overviews of the neo client/protocol/server:

Neo protocol overview.

Example

A simple example of how to construct a client and node using the neo protocol can be found here.

Build / Use

Dependencies

Dependency Version
ocean v4.0.x
makd v2.1.x
turtle v9.0.1

The following libraries are required (for an up-to-date list, take a look at the Build.mak file, in the $O/%unittests target):

  • -lglib-2.0
  • -lebtree
  • -llzo2
  • -lgcrypt
  • -lgpg-error
  • -lrt

Please note that ebtree is not the vanilla upstream version. We created our own fork of it to be able to write D bindings more easily. You can find the needed ebtree library at https://github.com/sociomantic-tsunami/ebtree/releases (look only for the v6.0.socioX releases; some pre-built Ubuntu packages are provided).

If you plan to use the provided Makefile (you need it to run the tests), you also need to check out the submodules with git submodule update --init. This will fetch the Makd project (https://github.com/sociomantic-tsunami/makd) in submodules/makd.

Versioning

swarm's versioning follows Neptune.

This means that the major version is increased for breaking changes, the minor version is increased for feature releases, and the patch version is increased for bug fixes that don't cause breaking changes.

Support Guarantees

  • Major branch development period: 6 months
  • Maintained minor versions: 2 most recent

Maintained Major Branches

Major Initial release date Supported until
v6.x.x v6.0.0: 04/06/2019 TBD

Contributing

See the guide for contributing to Neptune-versioned libraries.

swarm's People

Contributors

david-eckardt-sociomantic, don-clugston-sociomantic, geod24, jens-mueller-sociomantic, joseph-wakeling-frequenz, joseph-wakeling-sociomantic, leandro-lucarella-sociomantic, mathias-baumann-sociomantic, mathias-lang-sociomantic, matthias-wende-sociomantic, mihails-strasuns-sociomantic, nemanja-boric-sociomantic, tiyash-basu-sociomantic

swarm's Issues

Balance received message handling between connections

The current connection (read) handling works as follows:

  1. All connections waiting to read are registered with epoll.
  2. epoll_wait reports that one or more connections have received something.
  3. For each connection that has received something:
    a. Check if at least one full message has been received. If not, re-register with epoll.
    b. If one or more full messages have been received, iterate over them, passing them to the appropriate request handlers.

The problem is at 3b. Given the various input buffers that are in use, it's possible for hundreds (even thousands -- in experiments we have seen this) of messages on a connection to have been received and now be ready for handling. So, step 3b could actually take quite some time to handle, leaving other connections and requests starved in the meantime.

This is not ideal asynchronous behaviour -- the assumption is that handling an event is quick, allowing other epoll clients to get their share of attention.

One idea to fix this problem would be to do the following:

  1. All connections waiting to read are registered with epoll.
  2. epoll_wait reports that one or more connections have received something.
  3. For each connection that has received something:
    a. Check if at least one full message has been received. If not, re-register with epoll.
    b. If one or more full messages have been received, set a flag indicating that they are waiting to be handled.
  4. Iterate over all connections which are flagged as having messages ready to be handled, handling one message per connection until all have been handled. In this way, the handling of messages is balanced between connections.
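The two-phase scheme above can be sketched as follows. This is a minimal C illustration with made-up types (swarm's actual connection handling is D code built on epoll); the point is only the balancing loop in step 4:

```c
#include <assert.h>
#include <string.h>

enum { MAX_CONNS = 4, MAX_MSGS = 16 };

/* A connection with a FIFO of fully received, not-yet-handled messages. */
typedef struct {
    int msgs[MAX_MSGS];
    int head, count;
    int flagged; /* set in step 3b when full messages are waiting */
} Conn;

/* Record the handling order so it can be inspected. */
static int handled[MAX_CONNS * MAX_MSGS];
static int n_handled;

static void handle(int msg) { handled[n_handled++] = msg; }

/* Step 4: one message per flagged connection per pass, repeating until
   every queue is drained, so no single connection can starve the rest. */
static void drain_balanced(Conn *conns, int n)
{
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n; i++) {
            Conn *c = &conns[i];
            if (!c->flagged || c->count == 0)
                continue;
            handle(c->msgs[c->head++]);
            c->count--;
            if (c->count == 0)
                c->flagged = 0;
            progress = 1;
        }
    }
}
```

With one connection holding three messages and another holding one, the handling order interleaves them instead of draining the first queue completely.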

D2: Automatically convert tags and push them

To move forward with D2 we need to provide automatically converted libraries for projects that are built only for D2.

Every time a new tag is pushed, we have to convert it to D2, tag the converted version with +d2 build metadata appended (so tag v2.3.4 --- D2 conversion ---> v2.3.4+d2) and push it back (making sure the new tag is not converted again!).

This capability will probably be added to beaver, but at some point we need to use that feature from beaver here.

Dismantle core node framework

The node framework in swarm.core provides many useful components for building nodes. Currently, they are all glued together in a very rigid way. There is essentially one class (ChannelsNodeBase) which all nodes must derive from, in order to be able to access the full range of features. As the different types of nodes diverge in their needs, however, this approach is becoming less useful. A nicer approach would be to separate out each of the components so that they can be assembled as required.

Components:

  • Select listener for incoming connections (legacy protocol).
  • Select listener for incoming connections (neo protocol).
  • Select listener for unix socket connections.
  • Per-request stats tracking.
  • Per-action stats tracking.
  • Global I/O stats tracking (bytes sent/received).
  • Node stats logging (this is actually a separate module, but it hooks into the information interfaces INodeInfo and IChannelsNodeInfo).
  • Maintaining a set of storage channels (including a pool of recycled channel instances).

It's interesting that the majority of the really interconnected stuff is for stats tracking / logging. Probably a better approach would be to have separate stats instances for different things (per-request, global I/O, etc), each with a log() method.

Pass request parameters to notifier as const

Once a request has been started and the parameters handed over from the application to the swarm client, they should not change. The notifier should not be able to modify the request parameters. This can be enforced by passing them as const to the delegate, in D2.

Wrong Listeners iteration if removing listener

The following situation happens in a swarm neo application, but could happen with classic swarm, too:

  • A channel with two listeners is removed.
  • IListeners.trigger is called with Listener.Code.Finish.
  • IListeners.trigger_ iterates over ListenerSet, which iterates over ListenerSet.listeners, a dynamic array of the registered listeners.
  • Iterate over listener = ListenerSet.listeners[0], call listener.trigger.
  • listener.trigger suspends the currently running fiber A.
  • Another fiber B removes this listener, calling ListenerSet.remove.
  • ListenerSet.remove moves the removed listener to the end of the array. Because there are only two listeners registered it swaps the two listeners in the array.
  • ListenerSet.remove returns.
  • Fiber B is suspended, fiber A is resumed.
  • The iteration continues with listener = ListenerSet.listeners[1]. Because ListenerSet.remove swapped the two array elements this is now the same object as in the first iteration cycle, the listener that was just unregistered.

Result: one listener is triggered a second time after it has been removed, and the other is not triggered at all. Even without a fiber race condition, this could happen if Listener.trigger calls ListenerSet.remove.

I'm working on a bug fix to allow for removing a listener during an iteration.
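One way to make removal during iteration safe is to replace swap-removal with a tombstone that is only compacted after the iteration finishes. The following C sketch uses illustrative names (not ListenerSet's actual API):

```c
#include <assert.h>
#include <string.h>

/* A set where remove() during trigger() is safe: removal during iteration
   leaves a NULL tombstone instead of swapping elements, and compaction is
   deferred until the iteration has finished. */
typedef struct {
    void *items[8];
    int len;
    int iterating;
} Set;

static void set_remove(Set *s, void *item)
{
    for (int i = 0; i < s->len; i++) {
        if (s->items[i] == item) {
            if (s->iterating)
                s->items[i] = NULL;              /* tombstone */
            else
                s->items[i] = s->items[--s->len]; /* swap-remove is fine here */
            return;
        }
    }
}

/* Shift surviving items down over the tombstones. */
static void set_compact(Set *s)
{
    int j = 0;
    for (int i = 0; i < s->len; i++)
        if (s->items[i] != NULL)
            s->items[j++] = s->items[i];
    s->len = j;
}

/* Trigger every item exactly once; callbacks may remove items. */
static void set_trigger(Set *s, void (*cb)(Set *, void *))
{
    s->iterating = 1;
    for (int i = 0; i < s->len; i++)
        if (s->items[i] != NULL)
            cb(s, s->items[i]);
    s->iterating = 0;
    set_compact(s);
}

/* A callback that removes itself, as in the bug report scenario. */
static void trigger_and_remove(Set *s, void *item)
{
    (*(int *)item)++;
    set_remove(s, item);
}
```

With the original swap-remove behaviour, removing the first of two listeners mid-iteration either skips the second or triggers the removed one again; with tombstones, each listener is triggered exactly once.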

Double copy of user-specified params

#6 changed the way in which the user-specified params for a request are serialized, allowing for a fully const API and fully mutable internals. This introduces an intermediary buffer into which the params are serialized, before being serialized again (i.e. the array is copied) into the request context. This double copy is inefficient and should be optimized at some point.

Automated node discovery

Currently, all of our distributed systems rely on the clients having a fully specified list of the addresses of all of the nodes -- the nodes definition file. Even though the set of nodes is mostly static, this approach is still somewhat rigid. When the set of nodes does change, a lot of manual work is involved to update the nodes definition files of all clients.

A futuristic idea that we've contemplated for some time: how can clients connect to a distributed set of nodes without requiring a fully specified list of their addresses?

One proposal would be to use multicast to allow clients to discover a set of nodes. The basic principle is that the nodes would register themselves as a member of a certain multicast group. The clients can then send a message to that multicast group and it will automatically be routed to all registered nodes.

Systems for this already exist, which might be useful, for example:
http://www.dns-sd.org/
http://www.multicastdns.org/
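As a rough illustration of the multicast idea, a client would prepare a UDP socket and send a discovery probe to a well-known group address; nodes that joined the group (via IP_ADD_MEMBERSHIP) would reply unicast with their own address. The group address and port below are made up for illustration and are not part of swarm:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Prepare a UDP socket for sending a discovery probe to a multicast
   group. Returns the socket fd, or -1 on error. */
static int make_discovery_socket(struct sockaddr_in *group)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* Keep probes on the local network segment. */
    unsigned char ttl = 1;
    if (setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof ttl) < 0) {
        close(fd);
        return -1;
    }

    memset(group, 0, sizeof *group);
    group->sin_family = AF_INET;
    group->sin_port = htons(4242);                        /* hypothetical port */
    inet_pton(AF_INET, "239.255.42.42", &group->sin_addr); /* hypothetical group */
    return fd;
}
```

The client would then sendto() the probe to `*group` and collect the unicast replies, building the node set dynamically instead of reading a nodes definition file.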

Add sanity check / limit on message size

Currently, it is possible to send messages of any size via the neo protocol. With the message header parity checking, it is unlikely that a super long message length could be received in error (i.e. junk data being sent), but it would be perfectly valid to start sending a 1 GB message.

The concern is that the message length requires MessageSender to allocate a buffer big enough to store the whole message. It may be sensible to add a sane limit here, say 1 MB.
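A sanity check along these lines could be as simple as the following C sketch, where MAX_MESSAGE_SIZE is the hypothetical 1 MB cap suggested above, checked against the length parsed from the message header before any receive buffer is allocated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap, per the 1 MB suggestion above. */
#define MAX_MESSAGE_SIZE (1u << 20)

/* Validate a message length from a parsed header before allocating a
   buffer for the body. A zero-length or over-limit message is treated
   as a protocol error. */
static bool message_length_sane(uint32_t len)
{
    return len > 0 && len <= MAX_MESSAGE_SIZE;
}
```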

It must be possible to disable output buffering via TCP cork

dd6c928 enabled TCP cork output buffering on all neo connections. In real-world use cases, where traffic is expected to be high, this is a sensible default. However, in testing scenarios, where traffic is usually low, the 200ms delay in sending non-full packets is crippling.

For testing, we need to be able to disable output buffering -- all existing tests have been written assuming it is not present.
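On Linux, corking is controlled per-socket via setsockopt, so a disable switch could boil down to simply not setting (or explicitly clearing) the option on each neo connection's socket. A C sketch (swarm's actual socket handling lives in ocean's D wrappers):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Enable or disable TCP_CORK on a TCP socket (Linux-specific option).
   Returns 0 on success, -1 on error. */
static int set_tcp_cork(int fd, int enabled)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &enabled, sizeof enabled);
}
```

Clearing TCP_CORK flushes any partially filled segment immediately, which is exactly what low-traffic test scenarios need; setting TCP_NODELAY instead is an alternative for tests that want every write sent at once.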

Investigate GC behaviour with serialized references in request args

(#55)

Note that if you use the ocean contiguous serializer it doesn't matter, because the whole serialized array is always marked as NO_SCAN right now (https://github.com/sociomantic-tsunami/ocean/blob/v2.x.x/src/ocean/util/serialize/contiguous/Serializer.d#L210).

This means that any reference serialized into a request's context will not be GC scanned. Thus, if this is the only reference to an object, it may be collected.

We'll need to look into usages of the neo clients, but this is probably not what users are expecting.

Ensure that pointer-types are aligned when serializing request args

The GC scans will only find aligned addresses so if data that potentially contains a pointer is not on an aligned address, the pointer could be lost and the referenced object collected.

This is specifically the case for the legacy context union which is serialized on a 4-byte boundary (because the command code came first) at https://github.com/sociomantic/swarm/blob/master/swarm/core/client/request/params/IRequestParams.d#L326. This also applies to pointers serialized in the neo client.

The serialization should be enhanced so that it makes sure that pointer types are always aligned.

One idea was to do it similarly to the ConfigFiller, which scans a class and its super classes for all the members. Templates could be used to add information on how to serialize a certain member (e.g. DeepCopy or ShallowCopy).
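Whatever form the serializer enhancement takes, the core of it is padding the write offset so that pointer-sized fields land on pointer-aligned addresses, which a conservative GC scan can then see. A C sketch of the alignment helper:

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `align`, which must be a
   power of two. Serializing a pointer-typed field at
   align_up(offset, sizeof(void*)) keeps it visible to aligned GC scans. */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}
```

For the legacy context union described above, the serializer would pad the 4-byte offset after the command code to an 8-byte boundary before writing the union.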

Improve performance of RecordStream

The TODO in the function below needs to be addressed:

/***************************************************************************

    Helper function to read from a stream into the provided buffer until the
    specified separator character is encountered. The separator is *not*
    appended to the output buffer.

    Template params:
        C = output buffer array element type (must be a one byte type)

    Params:
        stream = stream to read from
        sep = separator to read until
        dst = buffer to write read data into

    TODO: this function could be more efficiently implemented by reading
    larger chunks from the stream, then splitting them by \n. However, this
    would make the module more complicated and, until we have evidence of it
    being necessary, I'd prefer to keep the code simple.

***************************************************************************/

private static void readUntil ( C )
    ( InputStream stream, char sep, ref C[] dst )
{
    static assert(C.sizeof == char.sizeof);

    char c;
    SimpleSerializer.read(stream, c);
    while ( c != sep )
    {
        dst ~= cast(C)c;
        SimpleSerializer.read(stream, c);
    }
}
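The chunked approach the TODO describes might look roughly like the following C sketch, which scans an in-memory buffer standing in for the stream; a real implementation would refill the buffer from the InputStream whenever the separator is not found in the bytes read so far:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A window of already-read stream data. */
typedef struct {
    const char *data;
    size_t len, pos;
} Buffered;

/* Copy bytes up to (not including) `sep` into dst (NUL-terminated).
   Returns the number of bytes consumed including the separator, or -1
   if the separator is not in the buffered data (caller should refill)
   or dst is too small. */
static ptrdiff_t read_until(Buffered *in, char sep, char *dst, size_t cap)
{
    const char *start = in->data + in->pos;
    const char *hit = memchr(start, sep, in->len - in->pos);
    if (hit == NULL)
        return -1;

    size_t n = (size_t)(hit - start);
    if (n >= cap)
        return -1;

    memcpy(dst, start, n);
    dst[n] = '\0';
    in->pos += n + 1; /* skip the separator too */
    return (ptrdiff_t)(n + 1);
}
```

The one-syscall-per-record pattern of the current code becomes one memchr per record over data read in large chunks, which is the efficiency win the TODO is after.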

Weighted round-robin connection iteration

With a plain round-robin algorithm, if one node is responding slowly, pending requests can stack up in the client. A common pattern, in this situation, would be for the client to detect this and suspend its input (a stream from a DMQ, for example).

One technique which could improve this situation is to have each node selected by the round-robin algorithm based on some weighting or priority. When a connection responds slowly, its weighting is reduced, meaning that fewer requests will be sent to it. (If all connections respond slowly, all of their weightings will go down, meaning that the balance of requests won't change.)

How exactly the client should measure the responsiveness of each connection is an open question.

This is not as simple as it seems, however. We also need to take into account the side-effects of such user-facing load-balancing. One consequence of more requests being sent to certain nodes is that those nodes may end up storing more data than others, which is not ideal and leads to further imbalance (e.g. data returned to query requests). This will require more thought.
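One concrete candidate for the selection step is the "smooth" weighted round-robin algorithm used by nginx for upstream selection, which spreads picks evenly rather than bursting on the heaviest node. A C sketch (the weights themselves would come from whatever responsiveness measure is chosen):

```c
#include <assert.h>

typedef struct {
    int weight;  /* reduced when the connection responds slowly */
    int current; /* internal state of the smooth WRR algorithm */
} Conn;

/* Smooth weighted round-robin: each step, every connection's current
   value grows by its weight; the largest is picked and decremented by
   the total weight. Over `total` steps, each connection is picked
   `weight` times, interleaved rather than in runs. */
static int next_conn(Conn *c, int n)
{
    int total = 0, best = 0;
    for (int i = 0; i < n; i++) {
        c[i].current += c[i].weight;
        total += c[i].weight;
        if (c[i].current > c[best].current)
            best = i;
    }
    c[best].current -= total;
    return best;
}
```

With weights {5, 1, 1}, seven successive picks distribute 5/1/1 across the three connections, so a slow (down-weighted) connection still receives some requests rather than none.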

Client active requests and disconnection

The node cancels all requests in a connection's RequestSet when the connection is shut down:

    // super.shutdownImpl will only shutdown requests registered for
    // receiving or sending. There may also be requests that are registered
    // for neither sending nor receiving which need to be shutdown.
    this.request_set.shutdownAll(e);

(There is no way that the requests can continue operating after the connection is shut down.)

The client should have similar logic, but it is more complicated as the client just has a single RequestSet which is shared by all connections. Thus, the client would need to cancel all requests that are associated with the connection which has been shut down. Most of the time, these will be registered with the connection for sending or receiving, so are easy to spot. However, it's also possible for a request to be in a state where it's waiting for some other event to occur (e.g. the user to call resume() on its controller). There is currently no way to detect requests in this state and cancel them properly if a connection is shut down.

How exactly this would need to work varies by request type:

  • For all-nodes requests this is obvious: each request-on-conn is tied to a single connection.
  • Single-node and round-robin request-on-conns may switch between connections over their lifetime. We'd have to track which requests were using which connection in order to be able to cancel them.

Could the suspendable request controller work while some nodes are initialising?

Currently, the request controller may only be used once all known nodes have either started handling the request or are not connected. This could cause problems, if some nodes start handling the request (sending data to the client, say) while others are still initialising. The request being handled on some nodes may cause the client to want to suspend it, but this would not be possible until all nodes are initialised.

It may be possible to rework the controller logic to allow control messages to be sent to any node which is initialised, ignoring any that are still initialising.

Reassess swarm versioning / maintenance policy

Swarm no longer contains concrete client classes and is thus no longer directly used by applications. It is only used indirectly via other libraries which implement concrete clients.

Thus, the set of versions of swarm that need to be maintained is defined solely by the needs of the dependent libraries. Keeping a strict, 6-month maintenance period for past major versions may not be necessary.

Error delegate passed to ConnectionSetupParams is overwritten in Node

The API is unclear on this point. It appears that the user (the implementer of a Node-derived class) has the option of passing an error delegate which will be passed through to the constructor of each connection handler in the pool. However, this delegate is ignored and overwritten with the Node's own delegate:

    conn_setup_params.error_dg = &this.error;

The error_callback() method can then be called at some later time to set the user's delegate again.

It's all a bit convoluted and looks like it could be tidied up.

Add request controller helpers

The neo suspendable requests all share a significant chunk of code relating to the controller:

  • The IController interface.
  • The Controller class which implements this interface.
  • The shared working data which is required to manage controlling all-nodes requests.

DelayedSuspender isn't safe to use with RequestEventDispatcher

DelayedSuspender directly suspends and resumes a request fiber. This is fine for use directly with a request-on-conn fiber, but is risky with multi-fiber requests. The reason is that, in order for a fiber to be properly killed (== cleanly deallocated, if allocated on the stack), it must be registered with the RequestEventDispatcher, which handles the events from the request-on-conn, including any exceptions thrown. An exception is re-thrown in each fiber registered with the RequestEventDispatcher, ensuring all are killed.

Add counter and logging of protocol errors

A protocol error that occurs while handling a request represents a serious error. These should not go unnoticed. If we write them to the stats log, then we can set up alerting in the live system to notify maintainers.

Add config class (struct?) for neo ctor parameters

For use with the class filler in ocean.

Construction of a client will then look like:

auto client = new Client(epoll, legacy_config, neo_config);
auto client_legacy_only = new Client(epoll, legacy_config);

Unify RequestOnConn.EventDispatcher methods for sending / receiving / yielding

Currently there are only methods to:

  1. Yield and handle events.
  2. Yield, receive, and handle events.

We also need:

  3. Yield, send, and handle events.
  4. Yield, send, receive, and handle events.

The combinations are getting silly. All these methods should be refactored into a single method which waits for one of the specified types of events and returns which event occurred.
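The unified method's API could take a mask of event kinds and return the one that fired, along these lines. This C sketch only shows the interface shape; the stub stands in for the actual fiber suspend/resume machinery:

```c
#include <assert.h>

/* Event kinds a request-on-conn fiber can wait for. */
typedef enum {
    EV_NONE     = 0,
    EV_YIELD    = 1 << 0, /* resumed after yielding */
    EV_SENT     = 1 << 1, /* pending payload fully sent */
    EV_RECEIVED = 1 << 2, /* a message arrived */
} Event;

/* Single wait entry point: the caller passes a mask of the events it is
   prepared to handle and gets back the one that occurred. `pending`
   simulates whichever event the dispatcher delivers first; events the
   caller did not ask for are not delivered. */
static Event wait_for(unsigned mask, Event pending)
{
    return (pending & mask) ? pending : EV_NONE;
}
```

Callers then switch on the returned event instead of picking among four near-identical methods, and new combinations cost nothing.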

How to handle exceptions thrown by user notifiers?

Exceptions thrown by a user notifier are currently not explicitly handled and result in unexpected behaviour in the client:

  1. Exceptions thrown by the connection notifier are handled here. As the comment (Exceptions thrown by this.outer.connect() indicate that the connection could not be established (can include protocol or authentication errors). We simply exit the fiber method, in this case.) indicates, exceptions caught here cause the connection process to be aborted. This is absolutely not what we want :D
  2. Exceptions thrown by a request notifier are not handled by the request's handle function and will be thrown out into the request-on-conn fiber method. They're not handled there either, so will simply kill the fiber.

Assess optimal record batch size

@leandro-lucarella-sociomantic:

...if the current size of batched responses is 64K and that buffer then gets compressed to something smaller, we might want to experiment with a slightly larger buffer, one for which the average compressed size would be around 64K, which is the maximum TCP packet size. That could optimize the throughput for batched responses, fitting as much data as possible in a single TCP packet (although if the batching and compression are fast, the kernel should take care of this packing). Anyway, just a wild idea :P

@mathias-baumann-sociomantic:

I am not sure about that 64k number as highest possible TCP packet size. According to some sources the lower layers of TCP already have smaller limits so that the theoretical TCP packet size limit can't be reached.

@leandro-lucarella-sociomantic:

Well, yeah, usually the MTU (Maximum Transmission Unit) is 1500 bytes for ethernet, but if you use a large TCP packet at least you are not repeating the IP/TCP headers (like 20 bytes at least) for each ethernet fragment, at the risk of having to resend all the ethernet fragments for a particular big TCP packet if only one fragment gets lost.

But yeah, it's a good reminder that this probably needs to be tested with care, as basing it only on theory probably wouldn't be enough; it's too dependent on the environment (like how often ethernet fragments are lost) to reach any conclusions from a purely theoretical POV.

Consider ways to implement connection priority

In a live system, some applications which access a swarm cluster are super important and need a response as soon as possible, while others are of low importance and do not require an immediate response. Currently, the nodes handle all requests equally, which can lead to bursts of requests from low-importance apps overwhelming the normal requests from high-importance apps.

It should be possible to implement a priority system, whereby requests from certain clients are handled with greater priority than requests from others.

Connection I/O and send queue stats iterators in client should also pass connection

Currently, the iteration delegate is just passed the stats for each connection. It would seem convenient to also pass some identifier for the connection, e.g. the remote address/port.

Auth-based, per-request permissions system

Now we have a proper system of authenticating clients, it would be possible to add a permissions system, such that certain requests may only be initiated by certain clients.

An example would be a very destructive operation, like a request which deletes a whole storage channel, which the average client has no business performing.
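A minimal shape for such a permissions check, in C with made-up permission bits and client names (swarm's authentication layer would supply the per-client mask):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative permission bits; real ones would map to request classes. */
enum {
    PERM_READ    = 1u << 0,
    PERM_WRITE   = 1u << 1,
    PERM_DESTROY = 1u << 2, /* e.g. removing a whole storage channel */
};

typedef struct {
    const char *client_name; /* established during authentication */
    uint32_t perms;
} Client;

/* Checked by the node when a request arrives, before handling begins. */
static bool may_perform(const Client *c, uint32_t required)
{
    return (c->perms & required) == required;
}
```

A request that deletes a channel would require PERM_DESTROY, which only administrative clients would be granted; everyone else would receive a permission-denied response.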

Advanced request handling framework

With the current (neo) request handling framework, we've been able to implement some more complex requests than in the past. The most complex class of requests that we have at the moment is those which stream a series of records from the nodes and can be suspended, resumed, or stopped on demand from the client.

The problem with the current approach, though, is that each time the request-on-conn fiber is suspended, there are potentially multiple reasons why it can be resumed. This adds a lot of complex corner cases -- at every step of the request, you have to be thinking about "what happens if X at this point?".

And there are ideas for yet more complex requests, which would be extremely difficult to implement in the current framework.

Going forward, we need to find a simpler/cleaner way of expressing more complex request behaviour.

Investigate using a single fiber per connection

The current structure has separate sender and receiver fibers per connection. This leads to a more complicated structure, especially where error handling is concerned (the state of the two fibers must be synchronised, when an error occurs in one of them).

Combining the sending and receiving logic into a single fiber may prove simpler, overall.

Add neo client config struct for use with ocean's config reader

Something like:

public class NeoConfig
{
    public Required!(istring) nodes_file;

    public Required!(istring) credentials_file;
}

The constructor should automatically call neo.addNodes on the nodes file specified in the config. Usage would look like:

this.client = new Client(epoll, legacyconfig, neoconfig);

Numerical channel identifiers, instead of strings?

We're hashing everything in the system, except the channel names. The channel names are constants, but are being sent as human-readable text over the network for virtually every single request. If we're ever looking to optimise network bandwidth, sending the channel names as some kind of numerical identifier would be worth investigating.
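For example, a fixed-width identifier could be derived from the channel name with a cheap hash such as FNV-1a, so the wire protocol carries 8 bytes instead of a variable-length string. A C sketch (both ends would need to agree on the mapping, or exchange it once at connection setup, since the hash is not reversible):

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit FNV-1a hash of a channel name: a cheap, stable numerical
   channel identifier for the wire protocol. */
static uint64_t channel_id(const char *name)
{
    uint64_t h = 1469598103934665603ull; /* FNV offset basis */
    for (; *name != '\0'; name++) {
        h ^= (unsigned char)*name;
        h *= 1099511628211ull;           /* FNV prime */
    }
    return h;
}
```

Collisions are theoretically possible with any hash, so the nodes would still need the name-to-id mapping to detect them; alternatively, ids could simply be assigned sequentially when a channel is created.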

Rename `started` notification to `ready_to_control`

Discussion with users has revealed that the name `started` is misleading. They expect this to occur strictly before data is received (which is not the case).

The notification is usually described as:

    /// All known nodes have either started handling the request or are not
    /// currently connected. The request may now be suspended / resumed /
    /// stopped, via the controller.

The second sentence is the part that is actually important, from the user's point of view.

Suggest renaming to ready_to_control.

This field is referenced here, and renaming it will cause a breaking change:

https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/client/mixins/AllNodesRequestCore.d#L596
