frc971 / 971-Robot-Code (License: Apache License 2.0)
When merging configs, if the application names match, we concatenate the node lists. If there are duplicates, we preserve them. We should instead deduplicate them.
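A minimal sketch of the proposed deduplication (MergeNodeLists is a hypothetical helper, not the actual merge code): keep the first occurrence of each node and drop later duplicates while preserving order.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: concatenate two node lists, dropping duplicates
// while preserving the order of first appearance.
std::vector<std::string> MergeNodeLists(const std::vector<std::string>& a,
                                        const std::vector<std::string>& b) {
  std::vector<std::string> merged;
  std::unordered_set<std::string> seen;
  for (const auto& list : {a, b}) {
    for (const auto& node : list) {
      if (seen.insert(node).second) {
        merged.push_back(node);
      }
    }
  }
  return merged;
}
```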
If the OOM killer kills a process, all its handles get leaked, because the OOM killer doesn't run the robust-futex cleanup that would free the messages.
It actually looks like /proc/pid/stat has a start time for the thread. When opening the queue, we could look for any PIDs which are nonzero, see if they exist, and then check that the start time matches. If it does, it's pretty much guaranteed to be the same process, and is actually running.
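A sketch of that liveness check, assuming we record the owner's start time alongside its PID when it takes the queue slot (ProcStartTime and SameProcessStillRunning are hypothetical helpers). Field 22 of /proc/&lt;pid&gt;/stat is "starttime", in clock ticks since boot.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Reads field 22 ("starttime") from /proc/<pid>/stat.
// Returns 0 if the process no longer exists or the file can't be parsed.
uint64_t ProcStartTime(int pid) {
  std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
  std::string contents((std::istreambuf_iterator<char>(stat)),
                       std::istreambuf_iterator<char>());
  // The comm field (field 2) can itself contain spaces and parens, so
  // parse starting after the last closing ')'.
  const size_t close = contents.rfind(')');
  if (close == std::string::npos) return 0;
  std::istringstream rest(contents.substr(close + 2));
  std::string field;
  // Skip fields 3..21; field 22 is starttime.
  for (int i = 3; i <= 21; ++i) rest >> field;
  uint64_t starttime = 0;
  rest >> starttime;
  return starttime;
}

// The slot owner is stale if the PID is gone or the start time changed
// (meaning the PID was recycled by a different process).
bool SameProcessStillRunning(int pid, uint64_t recorded_starttime) {
  const uint64_t now = ProcStartTime(pid);
  return now != 0 && now == recorded_starttime;
}
```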
Create one or both of:
When we hit https://github.com/frc971/971-Robot-Code/blob/master/aos/events/simulated_event_loop.cc#L1463 it typically seems to be people creating EventLoops outside of LogReader::OnStart. Confirm the nature of the issue here and improve the error message (and/or consider changing when EventLoop creation is even allowed) so that users know what to do when they see that error.
Currently, messages sent on shared memory channels are timestamped prior to actually being "sent" (https://github.com/frc971/971-Robot-Code/blob/master/aos/ipc_lib/lockless_queue.cc#L1028). While we should have the guarantees in place to ensure that messages do not get sent with out-of-order timestamps on a given channel, the current ordering does mean that a process listening on multiple channels could plausibly observe messages across channels out-of-order (which isn't supposed to happen).
If we instead figured out an appropriate way to have the first observer of a message timestamp it (this "observer" would be the first of any fetchers or watchers, as well as something that would run immediately after the send actually went through), then because of how the event processing happens on the listening side, it should no longer be possible to observe out of order events.
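One way the "first observer stamps the message" idea could work is a compare-and-swap on a per-slot atomic timestamp; whichever side touches the message first installs the time, and every later observer reads the installed value, so all observers agree. This is an illustrative sketch, not the lockless_queue implementation (MessageSlot and ObserveTimestamp are hypothetical names).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-message slot carrying an atomic timestamp,
// initialized to 0 meaning "not yet stamped".
struct MessageSlot {
  std::atomic<int64_t> monotonic_sent_time_ns{0};
};

// First observer (fetcher, watcher, or post-send hook) wins the CAS and
// installs its clock reading; everyone else uses the installed value.
int64_t ObserveTimestamp(MessageSlot* slot, int64_t now_ns) {
  int64_t expected = 0;
  if (slot->monotonic_sent_time_ns.compare_exchange_strong(expected, now_ns)) {
    return now_ns;  // We were first; our clock reading wins.
  }
  return expected;  // Someone beat us; use their timestamp.
}
```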
The current API is cumbersome, confusing, and prone to human error.
And a rule that creates a test to confirm that, for a given config:
--skip_missing_forwarding_entries
) -- note that this should be configurable, because you may only care about this for a few nodes.

Sometimes it is helpful to only process the latest message with a watcher, or to disable the "die when you get behind" behavior. Add configuration support to watchers to enable all of this.
People consistently seem to struggle with identifying failures associated with the JSON->flatbuffers code. Some of this may be that the messages tend to not stand out much in the program output. Part of it is also that failures of the code tend to show up as segfaults rather than some sort of more coherent error message.
Provide a method that returns, for a given node/application combination, all the valid ways to refer to a given channel.
I've seen evidence that we don't enforce that channel names start with /. We should fix that.
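The missing check could be as simple as this (IsValidChannelName is a hypothetical name for wherever config validation lives):

```cpp
#include <string_view>

// Hypothetical validation helper: channel names must be absolute,
// i.e. non-empty and starting with '/'.
bool IsValidChannelName(std::string_view name) {
  return !name.empty() && name.front() == '/';
}
```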
We have a watcher which blocks forever (different bug). This makes it so the event loop isn't able to process signals.
RT signals queue up. RLIMIT_SIGPENDING is the limit per process, which defaults to somewhere around 7k of them.
Once this happens, we are unable to wake up any process, and everything gets triggered off timers. We need a way to not accumulate signals forever.
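The budget in question can be inspected with getrlimit(2); once rlim_cur RT signals are pending for the process, further sigqueue() calls fail with EAGAIN and wakeups are silently lost. A small sketch:

```cpp
#include <sys/resource.h>

// Returns the soft limit on queued (pending) signals for this process,
// or -1 on error. This is the RLIMIT_SIGPENDING budget that fills up
// when a watcher blocks forever and stops draining signals.
long PendingSignalLimit() {
  struct rlimit limit;
  if (getrlimit(RLIMIT_SIGPENDING, &limit) != 0) return -1;
  return static_cast<long>(limit.rlim_cur);
}
```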
aos_starter is designed to allow it to take an executable name instead of just the application name. But if there are multiple applications with the same executable, the behavior is ambiguous.
Guided by Austin-- I was looking at the following code out of y2022_roborio.json:
{
  "name": "/drivetrain",
  "type": "frc971.control_loops.drivetrain.Output",
  "source_node": "roborio",
  "frequency": 400,
  "max_size": 80,
  "num_senders": 2,
  "logger": "LOCAL_AND_REMOTE_LOGGER",
  "logger_nodes": ["imu"],
  "destination_nodes": [
    {
      "name": "imu",
      "priority": 5,
      "timestamp_logger": "LOCAL_AND_REMOTE_LOGGER",
      "timestamp_logger_nodes": ["imu"],
      "time_to_live": 5000000
    }
  ]
},
The timestamp_logger_nodes should be the source node, "roborio", rather than "imu" as it currently is. Our JSON creation/merging shouldn't accept choices other than the source_node.
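For reference, the corrected connection would look like this (only timestamp_logger_nodes changes, to match the source node):

```json
{
  "name": "imu",
  "priority": 5,
  "timestamp_logger": "LOCAL_AND_REMOTE_LOGGER",
  "timestamp_logger_nodes": ["roborio"],
  "time_to_live": 5000000
}
```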
cargo_raze__pcre fails to build. Digging in, the key lines in bazel-out/k8-opt/bin/external/cargo_raze__pcre/pcre_foreign_cc/CMake.log are:
-- Found BZip2: /usr/lib/x86_64-linux-gnu/libbz2.so (found version "1.0.8")
-- Looking for BZ2_bzCompressInit
-- Looking for BZ2_bzCompressInit - not found
-- Found ZLIB: /dev/shm/bazel-sandbox.f856700afb979694adb03c9da2b7c0b9dacd7824b0655f8fe6ce9ad53a8e0492/linux-sandbox/1656/execroot/org_frc971/bazel-out/k8-opt/bin/external/cargo_raze__pcre/pcre.ext_build_deps/lib/libzlib.a (found version "1.2.11")
-- Found Readline: /usr/include
-- Could not find OPTIONAL package Editline
We then explode with:
[ 83%] Building C object CMakeFiles/pcregrep.dir/pcregrep.c.o
/dev/shm/bazel-sandbox.f856700afb979694adb03c9da2b7c0b9dacd7824b0655f8fe6ce9ad53a8e0492/linux-sandbox/1656/execroot/org_frc971/external/cargo_raze__pcre/pcregrep.c:69:10: fatal error: 'bzlib.h' file not found
#include <bzlib.h>
         ^~~~~~~~~
1 error generated.
make[2]: *** [CMakeFiles/pcregrep.dir/build.make:76: CMakeFiles/pcregrep.dir/pcregrep.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:116: CMakeFiles/pcregrep.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
Somehow, it looks like cmake is leaking outside the sandbox and finding things on the host when being built. Hmmm.
Create some sort of bridge between ROS nodes and AOS, to accommodate people who want to be able to experiment with both.
Exactly how this will look is not entirely clear, but the goal should be to be relatively easy to get working, rather than focusing on getting every single possible feature working.
I thought you may be interested that I contributed a new DARE solver to WPILib that's 2.8x faster than SLICOT, the solver you use now, based on roboRIO benchmarking on a 5 state, 2 input differential drive LQR problem.
The gist of the algorithm from DOI 10.1080/00207170410001714988 is:
#include <Eigen/Cholesky>
#include <Eigen/Core>
#include <Eigen/LU>

template <int States, int Inputs>
Eigen::Matrix<double, States, States> DARE(
    const Eigen::Matrix<double, States, States>& A,
    const Eigen::Matrix<double, States, Inputs>& B,
    const Eigen::Matrix<double, States, States>& Q,
    const Eigen::Matrix<double, Inputs, Inputs>& R) {
  // [1] E. K.-W. Chu, H.-Y. Fan, W.-W. Lin & C.-S. Wang
  //     "Structure-Preserving Algorithms for Periodic Discrete-Time
  //     Algebraic Riccati Equations",
  //     International Journal of Control, 77:8, 767-788, 2004.
  //     DOI: 10.1080/00207170410001714988
  //
  // Implements the SDA algorithm on p. 5 of [1] (initial A, G, H are from (4)).
  using StateMatrix = Eigen::Matrix<double, States, States>;

  StateMatrix A_k = A;
  StateMatrix G_k = B * R.llt().solve(B.transpose());
  StateMatrix H_k;
  StateMatrix H_k1 = Q;

  do {
    H_k = H_k1;
    StateMatrix W = StateMatrix::Identity() + G_k * H_k;
    auto W_solver = W.lu();
    StateMatrix V_1 = W_solver.solve(A_k);
    // Solve V₂Wᵀ = Gₖ for V₂:
    //
    //   V₂Wᵀ = Gₖ
    //   (V₂Wᵀ)ᵀ = Gₖᵀ
    //   WV₂ᵀ = Gₖᵀ
    //   V₂ᵀ = W.solve(Gₖᵀ)
    //   V₂ = W.solve(Gₖᵀ)ᵀ
    StateMatrix V_2 = W_solver.solve(G_k.transpose()).transpose();

    G_k += A_k * V_2 * A_k.transpose();
    H_k1 = H_k + V_1.transpose() * H_k * A_k;
    A_k *= V_1;
  } while ((H_k1 - H_k).norm() > 1e-10 * H_k1.norm());

  return H_k1;
}
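As a sanity check on what the iteration converges to, the scalar case collapses to the fixed point x = a²rx/(r + b²x) + q, which can be iterated directly (a toy sketch for intuition, not the solver above; ScalarDARE is an illustrative name):

```cpp
#include <cmath>

// Scalar DARE: x = a^2*x - a^2*b^2*x^2/(r + b^2*x) + q
//            = a^2*r*x / (r + b^2*x) + q.
// Plain fixed-point iteration; converges for the stabilizable scalar case.
double ScalarDARE(double a, double b, double q, double r) {
  double x = q;
  for (int i = 0; i < 1000; ++i) {
    const double next = a * a * r * x / (r + b * b * x) + q;
    if (std::abs(next - x) < 1e-12) return next;
    x = next;
  }
  return x;
}
```

With a = b = q = r = 1 the fixed point satisfies x² - x - 1 = 0, i.e. the golden ratio, which makes a convenient correctness check.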
The preconditions necessary for convergence are:
The paper proves convergence under weaker conditions, but it seems to involve solving a generalized eigenvalue problem with the QZ algorithm. SLICOT and Drake use that to solve the whole problem, so it seemed too expensive to bother attempting.
The precondition checks turned out to be 50-60% of the total algorithm runtime, so WPILib exposed a function that skips them if the user knows they'll be satisfied. This would be a good candidate for your Kalman filter error covariance init code, since a comment in there mentioned it didn't use the DARE solver because it had unnecessary checks.
Here's WPILib's impl, which supports static sizing (for performance) and dynamic sizing (for JNI) and throws exceptions on precondition violations.
https://github.com/wpilibsuite/allwpilib/blob/main/wpimath/src/main/native/include/frc/DARE.h
I'd recommend std::expected instead of exceptions for your use case.
If you have /camera -> /pi1/camera on pi1, aos_dump /camera complains and doesn't respect the map.
This is a bit subtle, since we really would like aos_dump to be able to subscribe to any channel for debugging. But, it would also be nice to have aos_dump properly respect remaps.
Hey,
I'm Gabor from team 114 and we're looking at your arm feedforward code to try to implement it in our robot. It's in the y2018/control_loops/python/arm_trajectory.py file. Can you please explain a few things to us?

What are the G1 and G2 constants? I think this is the gear ratio of the motors, but I'm not 100% sure.

The following questions are not essential, but if you have time, we'd appreciate answers:

What are the K2 and K4 matrices? I understand what the other matrices do, but K2 and K4 are both multiplied by omega, which would imply that both are used to relate torque to omega. But I don't understand why you would need two matrices for the same thing. Can you please clarify what the exact purpose of these matrices is?

@AustinSchuh @platipus25 I'm pinging you because it seems that you are the contributors to this file, so I assume you know the implementation details.
Thanks in advance,
Gabor and team 114
If we have a system with nodes a, b, and logger, and are only running a logger on logger, then currently it is not possible to log timestamps for messages sent between a and b. So if there is a channel that is forwarded from a to both b and logger, then if you attempt to replay a log from the perspective of b you will not get any messages replayed on that channel, since the logger didn't log the timestamps for the message arrivals on b, so it doesn't know when the messages actually arrived :(.

This means being able to specify arbitrary nodes in the timestamp_logger_nodes field for a connection, see:
971-Robot-Code/aos/configuration.fbs
Lines 28 to 32 in e7c7e58
@yimmy13 observed that by using const aos::Node*s to identify nodes we tend to create issues in log replay (namely, the node pointer was unstable across calls to RemapLoggedChannel*). Can we make this less confusing to developers?
Allow specifying flags as part of the starter RPC definition. Unsure exactly how this should manage override of flags specified in the AOS config. This is very helpful when wanting to start applications in the same environment as they would see under starterd but with slightly changed flags.
Jim was having trouble copying TargetEstimates. Not sure why.
Edit:
https://github.com/frc971/971-Robot-Code/blob/master/aos/flatbuffer_merge.cc#L682
This would be incredibly handy for debugging what is happening when something goes wrong, to see what else is happening in the system. 1 Hz is plenty (as we dredge through /proc to see how things are going), or even a lower frequency.
We should also put the scheduler + affinity + priority in that report too, along with memory usage.
LogReader::Register will skip any setting up of callbacks when a node has no log files with data in them. This leads to LogReader getting stuck indefinitely when calling Run() in this scenario.
We currently drop replayed messages if they get sent too fast when replaying a log. This is itself somewhat dubious behavior, but also creates an issue where if you create a log of the replay, then you can get errors like
F0402 12:20:39.001713 8892 logfile_utils.cc:1506] Check failed: result.timestamp == monotonic_remote_time ({.boot=0, .time=160.591261903sec} vs. {.boot=0, .time=160.623109070sec}) : Queue index matches, but timestamp doesn't. Please investigate!
It would be super helpful to see what went wrong in the log. We don't always have easy access to stdout/err of starterd.
Add a check to logger to ensure that applications in log replay aren't sending on channels that are also getting replayed (i.e., all the relevant channels from the log are remapped).
Should pretty much just be a matter of doing this TODO:
971-Robot-Code/aos/events/logging/log_reader.h
Lines 409 to 413 in 890c249
If multiple messages all configured to be forwarded are sent at the same time in simulation (in the same callback), they don't get forwarded.
I noticed you were using the initializer_list version. We added an overload that takes a vector to make things cleaner.
An EventLoop implementation which creates multiple EventLoops that are all scheduled via one underlying EventLoop would allow combining multiple applications in a single process. They must all be on the same node, and can communicate with each other and the outside world as normal. This would provide a similar API to SimulatedEventLoopFactory which can create new EventLoops on demand.
Some tricky things to keep in mind:
To reproduce:
$ bazel build //documentation/tutorials:create-a-new-autonomous
INFO: Analyzed target //documentation/tutorials:create-a-new-autonomous (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /home/steple/source/971-Robot-Code/documentation/tutorials/BUILD:14:13: Executing genrule //documentation/tutorials:create-a-new-autonomous failed: (Exit 127): bash failed: error executing command (from target //documentation/tutorials:create-a-new-autonomous) /bin/bash -c ... (remaining 1 argument skipped)
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
external/pandoc/usr/bin/pandoc: error while loading shared libraries: libpcre.so.3: cannot open shared object file: No such file or directory
Target //documentation/tutorials:create-a-new-autonomous failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.881s, Critical Path: 0.01s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
sudo apt install libpcre3 will fix this on Ubuntu 23.10.
The current flatbuffers C++ API requires building up messages inside-out. It should be possible to use the C++ stack for this instead, to allow the C++ code to build up messages from the outside in (which is often more natural).
More concretely, this means generating C++ classes for each flatbuffer struct which hold values for each field (and track which ones are set). Nested structs should be contained in their parent objects, because a major use case is calling functions which return a nested object, and there's no other easy place to store these objects. Then, these objects can do a depth-first traversal of the C++ object graph to actually write the flatbuffer (aka each object writes out its children, tracking the resulting offsets in local variables, then writes out itself and returns the offset to its parent).
I think writing the buffer should be done in an offset_t Write method (or similar), which is passed the FlatBufferBuilder. The top-level will normally be called via a templated wrapper type that holds onto the fbb, with a destructor which calls Write on the top-level object and then Finish, to keep everything in outside-in order.
An alternative would be holding the fbb in each object, and then having their destructors write it out, but that means more stack space and makes it impossible to decide to skip writing it out later (for example, build up a sub object and then realize it's not actually needed, without taking up any space in the final buffer). Doing it in the destructor also means the parent object has to keep track of where the offsets to all its children are coming from.
Need to think through handling arrays of primitives. These can be big and variable-sized, neither of which interacts well with putting them on the stack. At the same time, allocating them immediately unlike other objects makes for a confusing API. Maybe provide APIs for both?
Handling arrays of objects is tricky. ArrayWriter<T> StartArray(int max_size) would work well for many cases, with a convenience void CreateArray(span<const T>) when the temporary storage is managed externally. However, there's no place to stash C++ pointers to the intermediate objects. The flatbuffers array needs to be placed after those objects in the buffer, and C++ pointers are larger than offsets on 64-bit platforms. Forcing the user to allocate that array externally goes against making this API nice and easy to use, but that's the best I can think of right now. void CreateArray(span<const T*>), with an extra level of indirection, could be handy but also looks like a big foot-gun with dangling references.
Do we need to manage shared subobjects? Writing them out redundantly is easy, but not helpful for space efficiency. Maybe use a bit in the bitmask to track whether it's been written out, and make a union for all the variable storage which gets overwritten to the offset it was written to?
This will increase stack usage, which may be undesirable. It's probably worth using a bitmask in the generated code to track which fields are set, rather than using std::optional or a separate bool for each one.
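That bitmask pattern might look roughly like this in generated code (all names here are hypothetical; the real generator would also emit the Write method discussed above):

```cpp
#include <cstdint>

// Hypothetical generated object for a table with two scalar fields.
// A single bitmask tracks which fields are set, instead of paying for a
// std::optional or separate bool per field.
class TargetPoseObject {
 public:
  void set_x(double x) { x_ = x; set_fields_ |= kXSet; }
  void set_y(double y) { y_ = y; set_fields_ |= kYSet; }
  bool has_x() const { return (set_fields_ & kXSet) != 0; }
  bool has_y() const { return (set_fields_ & kYSet) != 0; }
  double x() const { return x_; }
  double y() const { return y_; }

 private:
  static constexpr uint32_t kXSet = 1u << 0;
  static constexpr uint32_t kYSet = 1u << 1;
  double x_ = 0;
  double y_ = 0;
  uint32_t set_fields_ = 0;  // One bit per field.
};
```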
Copying sub-objects could become expensive. It should be possible to structure this so that RVO (return value optimization) constructs the sub-objects in place for the common case of a function returning an entire sub-object.
These classes end up looking similar to the existing TableT, but without storing data in the objects. Do we want to expose reading the fields (and checking if they're set) and/or building them from a const Table*?