This issue represents an opportunity for discussion of RFD 119 Routing Between Fabric Networks.

RFD 119 Routing Between Fabric Networks: Discussion

tritondatacenter commented on August 10, 2024

RFD 119 Routing Between Fabric Networks: Discussion

from rfd.

Comments (17)

rjloura commented on August 10, 2024

When attaching to the second class of networks, NAPI will select an IP address from the source network, save it in the "overlay_router" field, and use it as the next-hop address in the "routes" object. This IP address will be mapped in Portolan to a special MAC address recognized by the overlay devices. (See the Overlay Changes section for more here.)

Does this represent a special entry in the existing portolan_vnet_mac_ip table/bucket, with the existing schema? Or are we looking to create a new table/bucket, or modify the existing one?

rjloura commented on August 10, 2024

In the future "remote" may be used to indicate other kinds of remote networks, possibly reachable through some kind of authenticated tunnel.

Instead of making remote a boolean, do we want to make it a type (e.g. xdc, tunnel)? Or perhaps a sub-object like so:

# sdc-napi /networks/410fc93e-957a-4344-9112-ec17d5a946b5 | json -H
{
    "uuid": "410fc93e-957a-4344-9112-ec17d5a946b5",
    "remote": {
        "type": "xdc",
        "uuid": "025133ae-d107-47ab-aa08-27bb5e16e699",
        "dc": "us-east-3"
    },
    "subnet": "10.0.34.0/24",
    "fabric": true,
    "vnet_id": 56634,
    "vlan_id": 23
}

danmcd commented on August 10, 2024

Note: RFD 130 was published to spell out different types of remote networks.

danmcd commented on August 10, 2024

Note that when a network appears in an "attached_networks" array, then it will also contain its own mirroring "attached_networks" entry, to guarantee that two networks are always mutually routable and help prevent users from accidentally configuring a network to pass traffic in one direction but forgetting to do so in the other.

Do remote networks have attached_networks in them? If so, what if the remote network is not Triton?

To get around these, we will use a special MAC address to determine whether we need to inspect the destination IP address (which we can then use to find the UL3 information), whether we need to rewrite the VL2 information, and what VNET identifier to use. We will also need to change the source MAC address to match the special MAC address being used on the destination fabric network.

Earlier you mention the overlay_router. The overlay router's MAC address is this special MAC address; that's how we know off-link vs. on-link in overlay and varpd. Should that connection be spelled out?

jasonbking commented on August 10, 2024

Specifically, on each subnet an IP is allocated for the router, and ARP requests for this IP will return the MAC of the overlay router. Within the overlay module, an additional flag (OVERLAY_ENTRY_F_ROUTER) will be created. During outbound processing, when the target is looked up (by VL2 dest MAC), if the resulting overlay_target_entry_t has the OVERLAY_ENTRY_F_ROUTER flag set, that indicates the packet should be routed. A VL2->UL3 lookup request for this MAC from overlay causes varpd to return IN6ADDR_ANY for the UL3 address. This should keep things from looking too exotic from the instance's perspective.
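
As a rough illustration of the outbound check described here, the following is a minimal sketch, not the overlay module's actual code. The struct, the function names, and the flag's numeric value are all hypothetical; only the flag name OVERLAY_ENTRY_F_ROUTER comes from this comment, and a linear scan stands in for whatever hash lookup overlay really uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHERADDRL 6
/* Hypothetical bit value; the discussion only names the flag. */
#define OVERLAY_ENTRY_F_ROUTER 0x04

/* Simplified, hypothetical stand-in for overlay_target_entry_t. */
typedef struct sketch_target_entry {
    uint8_t  ste_mac[ETHERADDRL];   /* VL2 destination MAC */
    uint16_t ste_flags;
} sketch_target_entry_t;

typedef enum {
    SKETCH_MISS,    /* no entry yet; would need to ask varpd */
    SKETCH_SWITCH,  /* ordinary on-link target */
    SKETCH_ROUTE    /* dest MAC is the router MAC; route the packet */
} sketch_action_t;

/* Linear scan by VL2 destination MAC, for illustration only. */
static const sketch_target_entry_t *
sketch_lookup(const sketch_target_entry_t *tbl, size_t n, const uint8_t *mac)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(tbl[i].ste_mac, mac, ETHERADDRL) == 0)
            return (&tbl[i]);
    }
    return (NULL);
}

/* Decide what to do with an outbound frame based on its dest MAC. */
static sketch_action_t
sketch_classify(const sketch_target_entry_t *tbl, size_t n,
    const uint8_t *dmac)
{
    const sketch_target_entry_t *e = sketch_lookup(tbl, n, dmac);

    if (e == NULL)
        return (SKETCH_MISS);
    if (e->ste_flags & OVERLAY_ENTRY_F_ROUTER)
        return (SKETCH_ROUTE);
    return (SKETCH_SWITCH);
}
```

The point of the sketch is only that the routing decision hangs off the entry found by destination MAC, so the instance sees an ordinary next-hop router.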

melloc commented on August 10, 2024

@rjloura I like the "remote" as the type of remote instead of a boolean. The object is nice, too; we would just need to figure out how searching on its subfields would work. Maybe something like:

# sdc-napi '/networks?remote.type=fabric&remote.dc=us-east-3'

Does this represent a special entry in the existing portolan_vnet_mac_ip table/bucket, with the existing schema? Or are we looking to create a new table/bucket, or modify the existing one?

This will probably have to be something special, given that we'll probably need to mark it in some way for Portolan. We'll want to figure out what our migration scheme for the new bucket layouts is. I think we can probably do something similar to how NAPI upgrades its other buckets today, but maybe not.

Earlier you mention the overlay_router. The overlay router's MAC address is this special MAC address... that's how we know off-link vs. on-link in overlay&varpd. That connection should be spelled out?

Yes, I'll try to make this clearer.

In the current proposed scheme, there is one router MAC address for a VNET ID.
Under this scheme, we need to inspect and use the sender's IP address in order
to determine what route to use, since there might be multiple routes matching
the destination. An alternative approach would be to assign a MAC address per
network and use that for disambiguation. We had discussed doing so initially
but were uncertain of how to propagate them out.

@rmustacc suggested having things work using the following steps:

1. When NAPI sets up the IP-to-MAC mapping for Portolan, it marks the MAC
   address as a router MAC.
2. The guest instance on a fabric will need to ARP for the MAC address of the
   next-hop IP, as usual.
3. When varpd does the VL3 request, the response from Portolan will flag the
   MAC as a router MAC address. (We were planning on filling in all zeroes for
   the underlay address information in this case, but we should probably add a
   bitfield for IP/MAC properties that we can extend in the future.)
4. varpd can then add the MAC address to the overlay device as a new router
   MAC if it's not already there.
5. varpd could possibly kick off a bulk request for all destination routes for
   that MAC address.
6. When overlay(5) sees a packet arrive for a router MAC address heading to a
   destination IP address that it doesn't recognize, then it can ask varpd to
   find it the route for that MAC address and destination IP (instead of the
   source IP and destination IP).
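
The final lookup in the steps above, keyed on router MAC plus destination IP, could be sketched roughly like this. It is only an illustration under stated assumptions: the struct and function names are hypothetical, addresses are IPv4 in host byte order for brevity, and longest-prefix match is assumed as the tie-breaker, which the steps do not specify.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHERADDRL 6

/* Hypothetical route record; none of these names come from the RFD. */
typedef struct sketch_route {
    uint8_t  sr_router_mac[ETHERADDRL]; /* router MAC the route hangs off */
    uint32_t sr_dst;        /* destination subnet, host byte order */
    uint8_t  sr_prefixlen;  /* 0 through 32 */
    uint32_t sr_vnet_id;    /* VNET ID of the destination fabric */
    uint16_t sr_vlan_id;    /* VLAN ID on the destination fabric */
} sketch_route_t;

static uint32_t
sketch_mask(uint8_t prefixlen)
{
    /* A 32-bit shift by 32 is undefined, so handle /0 explicitly. */
    if (prefixlen == 0)
        return (0);
    return ((uint32_t)(UINT32_MAX << (32 - prefixlen)));
}

/*
 * Find the most specific route for dst_ip among the routes that belong
 * to router_mac. Because the MAC identifies the source network, the
 * sender's IP address is not needed to disambiguate.
 */
static const sketch_route_t *
sketch_route_lookup(const sketch_route_t *routes, size_t n,
    const uint8_t *router_mac, uint32_t dst_ip)
{
    const sketch_route_t *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const sketch_route_t *r = &routes[i];
        uint32_t mask = sketch_mask(r->sr_prefixlen);

        if (memcmp(r->sr_router_mac, router_mac, ETHERADDRL) != 0)
            continue;
        if ((dst_ip & mask) != (r->sr_dst & mask))
            continue;
        if (best == NULL || r->sr_prefixlen > best->sr_prefixlen)
            best = r;
    }
    return (best);
}
```

With one router MAC per network, two networks on the same VNET ID can each hold a route to the same destination subnet without ambiguity, since the lookup never mixes routes from different MACs.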

danmcd commented on August 10, 2024

varpd could possibly kick off a bulk request for all destination routes for
that MAC address.

My first question would've been where do the answers get stored, until I read this:

When overlay(5) sees a packet arrive for a router MAC address heading to a
destination IP address that it doesn't recognize, then it can ask varpd to
find it the route for that MAC address and destination IP (instead of the
source IP and destination IP).

which suggests varpd is the correct place to cache this information.

This way the state in overlay shrinks drastically; the only real changes appear in overlay_targ_lookup_t and in overlay_targ_resp_t. For lookups, the destination IP needs to show up in an `otl_l3req` case. In a response, the overlay's target point, or an alternative specifically for off-link targets, needs to know a bit more to make a transmitted packet palatable for a target (remote VLAN, remote vnetid, src MAC, etc.).

jasonbking commented on August 10, 2024

A problem with this that I mentioned in MM, though, is that it seems like one could subvert the attachment policy for an instance using static ARP entries.

danmcd commented on August 10, 2024

Let me dive deep and see if I can imagine an actual attack:

Consider a next-hop-IP x.y.z.N has blessed-MAC 0a:0b:0c:0d:0e:0f, which has reachability to victim-net. A rogue-VM on net a.b.c.Q can do route add x.y.z.0/24 -interface a.b.c.Q; route add victim-net x.y.z.N; arp -s x.y.z.N a:b:c:d:e:f. From then on out, the rogue-VM can attempt to reach victim-net.

1.) If a.b.c.Q is on a different vnetid, won't its overlay AND varpd state be distinct from an actual x.y.z.0/24 attachment, and therefore there would be no state (and portolan would reject based on vnetid)?

2.) If a.b.c.Q is the same-customer-but-rogue, could the source MAC be included in the lookup? (Or is that theoretically forgeable too using vnics?)

What's your threat model here?

danmcd commented on August 10, 2024

Regarding question 2: Zones cannot change their MAC address. I'm not sure about bhyve/KVM instances yet, however.

jasonbking commented on August 10, 2024

I don't think you'd be able to get a reply, but the packet would still be received if overlay already has the destination information -- we don't do a varpd lookup on every packet (nor do I think we would want to).

The other thing I had mentioned was wondering whether the source MAC could be used to determine the source fabric, and how reliable that would be. AFAIK, we don't currently allow MAC spoofing anywhere. @papertigers mentioned that there was discussion about possibly allowing it for running VRRP or similar. However, I'm not sure how permissive that would be: would it still be restricted to just the MAC(s) that could move around, or would it turn off all checking and allow an arbitrary MAC address to be set?

What I was thinking about was if you have something like:

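#include <stdint.h>
#include <netinet/in.h>    /* struct in6_addr */

typedef struct overlay_fabric_t {
    struct in6_addr ofb_ip;    /* fabric address, paired with ofb_prefixlen */
    uint64_t ofb_vid;          /* presumably the fabric's VNET ID */
    uint32_t ofb_dcid;         /* presumably the datacenter ID */
    uint16_t ofb_vlan;         /* VLAN ID */
    uint8_t ofb_prefixlen;     /* prefix length for ofb_ip */
} overlay_fabric_t;

typedef struct overlay_fabric_entry_t {
    /* bookkeeping stuff for attachment, pointers to overlay_dev_t, etc */
    overlay_fabric_t ofe_fabric;
} overlay_fabric_entry_t;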

There'll need to be a way for varpd to send these to overlay (likely new ioctls that add and remove them, along with their attachment information).

Then each overlay_target_entry_t can add a field for its VL3 IP as well as a pointer to the correct overlay_fabric_entry_t for that target. Then each overlay_target_entry_t is hashed based on VL2 MAC and <fabric, VL3 IP>. If we had an overlay_target_entry_t for the source VL2 MAC (if a customer has multiple instances on the same CN, this is likely to happen anyway), we could do a lookup on the src VL2 MAC of the packet to determine the source fabric.
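
To make the two lookups concrete, here is a minimal sketch under heavy assumptions. The types and functions are hypothetical stand-ins (not overlay_target_entry_t or overlay_fabric_entry_t themselves), and linear scans replace the hashing described above. It shows a source-fabric lookup by source VL2 MAC and a target lookup by <fabric, VL3 IP>.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>

/* Hypothetical, trimmed-down mirror of the proposed overlay_fabric_t. */
typedef struct sketch_fabric {
    struct in6_addr sf_ip;
    uint64_t sf_vid;
    uint32_t sf_dcid;
    uint16_t sf_vlan;
    uint8_t  sf_prefixlen;
} sketch_fabric_t;

/* Hypothetical target entry carrying its VL3 IP and owning fabric. */
typedef struct sketch_target {
    uint8_t st_mac[6];                  /* VL2 MAC */
    struct in6_addr st_vl3;             /* VL3 IP */
    const sketch_fabric_t *st_fabric;   /* fabric this target lives on */
} sketch_target_t;

/* Determine a packet's source fabric from its source VL2 MAC. */
static const sketch_fabric_t *
sketch_src_fabric(const sketch_target_t *tbl, size_t n, const uint8_t *smac)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(tbl[i].st_mac, smac, sizeof (tbl[i].st_mac)) == 0)
            return (tbl[i].st_fabric);
    }
    return (NULL);  /* unknown sender; caller must fall back or drop */
}

/* Find a target by <fabric, VL3 IP>; the same IP may exist on two fabrics. */
static const sketch_target_t *
sketch_find_by_ip(const sketch_target_t *tbl, size_t n,
    const sketch_fabric_t *fab, const struct in6_addr *ip)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].st_fabric == fab &&
            memcmp(&tbl[i].st_vl3, ip, sizeof (*ip)) == 0)
            return (&tbl[i]);
    }
    return (NULL);
}
```

Whether the source MAC can be trusted for this purpose is exactly the open question in this thread, so the sketch only illustrates the mechanics, not a security guarantee.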

danmcd commented on August 10, 2024

I don't think you'd be able to get a reply, but the packet would still be received if overlay already has the destination information -- we don't do a varpd lookup on every packet (nor do I think we would want to).

Exploiting an existing destination only works for rogue-VM-same-user, as I understand it. Different user means different vnetid, and different state in overlay.

As for the new structures you propose, are those for varpd or for overlay?

jasonbking commented on August 10, 2024

Today it can't happen cross customer, but if we ever did support cross-customer routing, then it becomes a similar concern. Even for one customer though, I think it could still be a problem. If $BIG_CUSTOMER has multiple groups under the same account and saw traffic arrive on instances from things that shouldn't ever be able to reach them -- that's bound to raise some eyebrows.

The structures would be for overlay, though overlay_fabric_t might get shared with varpd for exchanging information about fabrics.

danmcd commented on August 10, 2024

Would source-MAC checking/enforcement solve the $BIG_CUSTOMER problem? (And can a KVM or bhyve instance change their NIC's MAC w/o droppage?)

papertigers commented on August 10, 2024

When a user informs napi that they want to attach networks to their fabric, the rfd currently outlines the following:

  "attached_networks": [
    "f4104070-df1e-4c4a-891c-58951abd72e8",
    "103b4f01-b8bc-42a5-886a-0a680da22d20",
    "b1963383-6b1a-4025-b73d-a7fb43ff7624"
  ],

It seems that we will need more than just a uuid here. Perhaps a dcid + uuid would be better suited. If a user happens to get unlucky and napi uses the same uuid for different fabrics in different DCs, then we need a way to clue napi in on which network it is they are actually trying to add. It also probably makes sense to tell napi the DC and network uuid anyways so that it can create a local network that maps to the remote network more easily.

jasonbking commented on August 10, 2024

Yes, and I don't know :)

Fundamentally, the problem is we get an mblk_t and have to try to determine which fabric it originates from so we can determine the correct destination. Because this could have security implications (in that we don't want packets to be sent to things that shouldn't be able to receive them -- even if the receiver cannot reply), it seems like we'd like something that could not be subverted by root in the instance.

danmcd commented on August 10, 2024

@papertigers you missed the level-of-indirection provided in the nascent remote-network-object described in 119's text. It maps <dcid, that-dc's-uuid> to a my-dc-UUID.
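
A minimal sketch of that indirection, purely for illustration: the struct and function below are hypothetical, NAPI itself is not written in C, and the numeric DC IDs are made up. The only idea taken from the thread is that the pair <dcid, that DC's UUID> resolves to a local UUID, so a UUID collision across datacenters is harmless.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_UUID_LEN 37  /* 36 characters plus the terminating NUL */

/* Hypothetical remote-network record: <dcid, remote UUID> -> local UUID. */
typedef struct sketch_remote_net {
    uint32_t srn_dcid;
    char srn_remote_uuid[SKETCH_UUID_LEN];
    char srn_local_uuid[SKETCH_UUID_LEN];
} sketch_remote_net_t;

/*
 * Resolve the local UUID that stands in for a network in another DC.
 * Both the DC ID and the remote UUID must match, so two DCs that
 * happen to share a UUID still map to distinct local networks.
 */
static const char *
sketch_local_uuid(const sketch_remote_net_t *tbl, size_t n, uint32_t dcid,
    const char *remote_uuid)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].srn_dcid == dcid &&
            strcmp(tbl[i].srn_remote_uuid, remote_uuid) == 0)
            return (tbl[i].srn_local_uuid);
    }
    return (NULL);
}
```

Under this scheme an "attached_networks" array can keep holding plain local UUIDs, since each one already implies a specific DC and remote network.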
