Add VXLAN/EVPN support with flood list management#504

Open
rjarry wants to merge 12 commits into DPDK:main from rjarry:vxlan

@rjarry rjarry commented Feb 14, 2026

Add VXLAN interface type with encapsulation/decapsulation datapath nodes. Each VXLAN interface maintains a per-VNI flood list of remote VTEPs used for BUM traffic ingress replication.

The flood list API is transport-agnostic, designed to accommodate future SRv6 EVPN support. VXLAN VTEP is the first registered flood type. A dispatch layer routes add/del/list operations to type-specific callbacks.

FRR integration is wired up for bridge interfaces, VXLAN interfaces, FDB entries and flood lists. This enables BGP EVPN type-2 (MAC/IP) and type-3 (IMET) route exchange with remote PEs.

Also fix interface running state not being set on creation. This prevented FRR from seeing logical interfaces as operationally up.

Summary by CodeRabbit

  • New Features

    • Added L2 bridge interface support with member management and configuration options.
    • Introduced MAC forwarding database with dynamic learning and configurable aging.
    • Added VXLAN flood VTEP management for overlay networks.
    • Implemented L2 datapath with bridge and VXLAN packet forwarding.
    • Added CLI commands for bridge, FDB, and VXLAN management.
    • Integrated MAC learning events with FRR zebra dplane.
  • Tests

    • Added integration tests for bridge, VXLAN, and EVPN/VXLAN workflows.

@rjarry rjarry marked this pull request as draft February 14, 2026 00:06
coderabbitai bot commented Feb 14, 2026

Walkthrough

This pull request adds comprehensive L2 switching capabilities including bridge and VXLAN interface types. New infrastructure includes: a dedicated L2 module with bridge management, FDB learning/aging, and flood VTEP tracking; datapath nodes for bridge ingress classification and output flooding; VXLAN encapsulation/decapsulation; integration with FRR zebra for MAC learning notifications; CLI commands for bridge, FDB, and flood management; a control queue draining mechanism for deferred event notification; and extended event types for FDB and flood operations. UDP port aliasing for VXLAN was added to the L4 layer. Supporting DPDK patches address RCU hash safety during overwrites and defer queue failures. Smoke tests validate bridge-based L2 forwarding and EVPN/VXLAN overlays.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@modules/infra/control/group_nexthop.c`:
- Line 152: The call to rte_rcu_qsbr_synchronize(gr_datapath_rcu(),
rte_lcore_id()) is using rte_lcore_id() from a control thread that is not
registered as a QSBR reader; replace the second argument with
RTE_QSBR_THRID_INVALID so the call becomes
rte_rcu_qsbr_synchronize(gr_datapath_rcu(), RTE_QSBR_THRID_INVALID) whenever
invoked from control-plane threads (same change for any other control-plane
calls that pass rte_lcore_id()); ensure only datapath reader threads keep using
their registered thread IDs (registration happens via
rte_rcu_qsbr_thread_register in the datapath main loop).

In `@modules/l2/cli/vxlan.c`:
- Around line 73-77: arg_vrf currently returns 0 when the user omits the
ENCAP_VRF argument, but the code treats 0 as success and unconditionally sets
GR_VXLAN_SET_ENCAP_VRF, causing encap_vrf to be overwritten; fix by storing the
arg_vrf return value (e.g. int ret = arg_vrf(c, p, "ENCAP_VRF",
&vxlan->encap_vrf_id)), return on ret < 0, and only set set_attrs |=
GR_VXLAN_SET_ENCAP_VRF when ret > 0 (meaning the user actually supplied
ENCAP_VRF), leaving vxlan->encap_vrf_id untouched when the argument is absent.

In `@modules/l2/control/bridge.c`:
- Around line 60-77: bridge_detach_member currently resets member->mode to
GR_IFACE_MODE_VRF but leaves member->vrf_id as GR_VRF_ID_UNDEF; update
bridge_detach_member to restore the member's VRF by calling
vrf_default_get_or_create() and assigning the returned vrf id to member->vrf_id
and incrementing its refcount via vrf_incref (mirroring bridge_fini behavior),
then set member->mode = GR_IFACE_MODE_VRF so the detached iface has a valid VRF.

In `@modules/l2/control/vxlan.c`:
- Around line 281-287: The vtep_flood_del function mutates the shared
flood_vteps array in-place (swap-and-decrement) without RCU protection, causing
a data-race with datapath readers; change vtep_flood_del to follow the
copy-on-write + RCU pattern used by vtep_flood_add: allocate a new flood_vteps
buffer, copy entries from the old array excluding entry->vtep.addr (preserving
order if add does), set the new pointer and updated n_flood_vteps atomically
(using the same RCU/atomic swap helper used by vtep_flood_add), schedule the old
buffer to be freed after the RCU grace period, and keep the
gr_event_push(GR_EVENT_FLOOD_DEL, entry) call; reference vtep_flood_del,
vtep_flood_add, flood_vteps, n_flood_vteps, and gr_event_push when making the
change.
- Around line 50-83: The delete uses cur->encap_vrf_id after it was overwritten,
so rte_hash_del_key is built with the new encap_vrf_id instead of the old one;
fix by capturing the old encap_vrf_id (and old vni if needed) before mutating
cur (e.g., read old_vrf = cur->encap_vrf_id and build cur_key from old_vrf and
cur->vni) or postpone assigning cur->encap_vrf_id until after the hash
delete/add sequence; update the code around cur->encap_vrf_id, cur_key,
rte_hash_del_key, next_key and rte_hash_add_key_data accordingly so the deletion
targets the original {old_vni, old_vrf}.

In `@modules/l2/datapath/vxlan_output.c`:
- Around line 75-79: vxlan_output currently assigns ip_output_mbuf_data(m)->nh =
fib4_lookup(...) without checking for NULL and always sends packets to
IP_OUTPUT; change vxlan_output to check the result of fib4_lookup (the value
stored in ip_output_mbuf_data(m)->nh) and if it is NULL enqueue the packet to
the BAD_NEXTHOP edge (the declared but unused BAD_NEXTHOP path) instead of
forwarding to IP_OUTPUT, otherwise continue to set edge = IP_OUTPUT and enqueue
as before; update the enqueue logic around rte_node_enqueue_x1(graph, node,
edge, m) so the chosen edge reflects this NULL-check.
🧹 Nitpick comments (3)
frr/if_grout.c (1)

369-378: Variable add shadows outer bool add on line 356.

struct gr_fdb_add_req *add (line 370) shadows the bool add declared at line 356. This works correctly due to block scoping, but it's a latent maintenance trap — a future refactor could easily reference the wrong add.

Proposed fix — rename inner variable
 	if (add) {
-		struct gr_fdb_add_req *add = req;
-		add->exist_ok = true;
-		add->fdb.iface_id = ifindex_frr_to_grout(dplane_ctx_get_ifindex(ctx));
-		add->fdb.bridge_id = ifindex_frr_to_grout(dplane_ctx_mac_get_br_ifindex(ctx));
-		add->fdb.vlan_id = dplane_ctx_mac_get_vlan(ctx);
-		add->fdb.flags = dplane_ctx_mac_get_dp_static(ctx) ? GR_FDB_F_STATIC : 0;
-		memcpy(&add->fdb.mac, dplane_ctx_mac_get_addr(ctx), sizeof(add->fdb.mac));
-		add->fdb.vtep = dplane_ctx_mac_get_vtep_ip(ctx)->s_addr;
+		struct gr_fdb_add_req *add_req = req;
+		add_req->exist_ok = true;
+		add_req->fdb.iface_id = ifindex_frr_to_grout(dplane_ctx_get_ifindex(ctx));
+		add_req->fdb.bridge_id = ifindex_frr_to_grout(dplane_ctx_mac_get_br_ifindex(ctx));
+		add_req->fdb.vlan_id = dplane_ctx_mac_get_vlan(ctx);
+		add_req->fdb.flags = dplane_ctx_mac_get_dp_static(ctx) ? GR_FDB_F_STATIC : 0;
+		memcpy(&add_req->fdb.mac, dplane_ctx_mac_get_addr(ctx), sizeof(add_req->fdb.mac));
+		add_req->fdb.vtep = dplane_ctx_mac_get_vtep_ip(ctx)->s_addr;
 		req_type = GR_FDB_ADD;
modules/l2/api/gr_l2.h (1)

44-49: Bit 36 skipped in VXLAN reconfiguration flags.

GR_VXLAN_SET_LOCAL is bit 35, GR_VXLAN_SET_MAC jumps to bit 37. Bit 36 is unused. If intentional (reserved for a future attribute), no problem. If a typo, it won't cause a bug now but could cause confusion later.

modules/l2/control/fdb.c (1)

329-346: Redundant fdb_max_entries assignment.

Line 342 sets fdb_max_entries = req->max_entries, but fdb_reconfig (line 79) already does the same assignment. Harmless, but the duplicate write could be removed.

@rjarry rjarry force-pushed the vxlan branch 4 times, most recently from ca74f20 to 34d418a Compare February 14, 2026 21:14
@rjarry rjarry marked this pull request as ready for review February 14, 2026 21:14
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main/control_queue.c (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Missing parentheses around macro definition.

CONTROL_QUEUE_SIZE expands unsafely in expressions due to operator precedence. While current usages happen to be fine, this is a latent bug waiting to bite.

Proposed fix
-#define CONTROL_QUEUE_SIZE RTE_GRAPH_BURST_SIZE * 4
+#define CONTROL_QUEUE_SIZE (RTE_GRAPH_BURST_SIZE * 4)
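
To make the precedence hazard concrete, here is a small standalone illustration. `BURST` is a stand-in value rather than the real RTE_GRAPH_BURST_SIZE, and the `QUEUE_SIZE_*` names and the two helper functions are hypothetical.

```c
#include <assert.h>

#define BURST 256                 // stand-in for RTE_GRAPH_BURST_SIZE
#define QUEUE_SIZE_BAD BURST * 4  // expands textually, without grouping
#define QUEUE_SIZE_OK (BURST * 4) // always evaluated as one unit

// The same source expression yields different results at compile time.
static_assert(4096 / QUEUE_SIZE_BAD == 64, "parsed as (4096 / 256) * 4");
static_assert(4096 / QUEUE_SIZE_OK == 4, "parsed as 4096 / (256 * 4)");

// 4096 / QUEUE_SIZE_BAD expands to 4096 / 256 * 4.
int slots_bad(void) {
	return 4096 / QUEUE_SIZE_BAD;
}

// 4096 / QUEUE_SIZE_OK expands to 4096 / (256 * 4).
int slots_ok(void) {
	return 4096 / QUEUE_SIZE_OK;
}
```

Any use of the unparenthesized macro as the right-hand operand of `/`, `%` or `-` would be affected in the same way.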
🤖 Fix all issues with AI agents
In `@frr/if_grout.c`:
- Around line 369-378: The local pointer declaration inside the if (add) block
shadows the outer bool add; change the pointer name (e.g., rename "struct
gr_fdb_add_req *add = req;" to "struct gr_fdb_add_req *add_req = req;" or
similar) and update all references in that block (fields like add->exist_ok,
add->fdb.* and any further uses) to the new identifier to avoid shadowing and
future fragility in the function that contains the if (add) check.

In `@main/event.c`:
- Around line 42-44: control_queue_push failures currently drop events silently;
update the error path in the block where control_queue_push(notify_subscribers,
(void *)obj, ev_type) < 0 to record and surface the failure: increment a
persistent error metric/counter (e.g., control_events_dropped) and emit a single
WARNING log on first occurrence (or throttled warnings thereafter) that includes
ev_type and a pointer/identifier for obj so operators can detect lost events;
ensure the new metric and warning are used wherever notify_subscribers events
are pushed so dropped events are observable.

In `@modules/l2/control/fdb.c`:
- Around line 24-37: The fdb hash is created without the lock-free concurrency
extra flags, which allows concurrent writers from datapath lcores and control
plane to corrupt the table; in fdb_reconfig set the rte_hash_parameters
extra_flags to include RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF |
RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT before calling rte_hash_create (same
pattern used by vxlan_hash) so rte_hash_create will enable lock-free RW
concurrency for fdb_hash.

In `@modules/l4/l4_input_local.c`:
- Around line 40-50: l4_input_alias_port currently overwrites udp_edges[alias]
and increments udp_refcounts even if alias already points to a different edge;
change the function to first check udp_edges[alias] and only allow aliasing when
the slot is unused or already points to the same edge as udp_edges[port].
Concretely, in l4_input_alias_port check if udp_edges[alias] != UNUSED &&
udp_edges[alias] != udp_edges[port] and return an error (e.g.,
errno_set(EADDRINUSE)) to avoid clobbering another edge; if the alias slot is
unused, set udp_edges[alias] = udp_edges[port] and increment
udp_refcounts[alias]; if it already equals the same edge, treat as idempotent
(optionally increment refcount or leave as-is per existing refcount semantics)
so l4_input_unalias_port can correctly restore state. Ensure the MANAGEMENT
check on the source port remains.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@modules/l2/cli/fdb.c`:
- Around line 118-126: The build fails because scols_* functions (used in
scols_new_table, scols_table_new_column, scols_table_set_column_separator, etc.)
are undeclared when NEED_SCOLS_LINE_SPRINTF is defined; to fix, ensure
<libsmartcols.h> is always included so those declarations are available—either
modify gr_table.h to include <libsmartcols.h> unconditionally (remove the
conditional that skips it when NEED_SCOLS_LINE_SPRINTF is defined) or add a
direct `#include` <libsmartcols.h> at the top of the affected CLI files (e.g.,
modules/l2/cli/fdb.c) so scols_new_table, scols_table_new_column,
scols_table_set_column_separator, and related symbols are declared.

In `@modules/l2/cli/vxlan.c`:
- Around line 73-78: The code sets GR_VXLAN_SET_ENCAP_VRF even when ENCAP_VRF is
omitted because the else is paired with the combined condition; fix by only
setting set_attrs when arg_str(p, "ENCAP_VRF") is present and arg_vrf succeeds:
change the logic so you first check if arg_str(p, "ENCAP_VRF") != NULL, then
call arg_vrf(c, p, "ENCAP_VRF", &vxlan->encap_vrf_id) and return 0 on failure,
and only after a successful arg_vrf call set GR_VXLAN_SET_ENCAP_VRF on set_attrs
(referencing arg_str, arg_vrf, vxlan->encap_vrf_id, and GR_VXLAN_SET_ENCAP_VRF).

In `@modules/l2/control/flood.c`:
- Around line 70-85: In flood_list, don't rely on errno after calling ops->list;
instead capture the return value (e.g. int ret = ops->list(...)), check if ret <
0 and propagate the error consistently (return api_out(-ret, 0, NULL)) like
flood_add/flood_del do; update the loop in flood_list to use the ret variable
and return -ret when ops->list fails or document that ops->list must set errno
if you prefer that contract.
- Around line 30-42: Validate the untrusted index before indexing the
flood_types array: in flood_add (and similarly in flood_del) check that
req->entry.type is >= 0 and < ARRAY_DIM(flood_types) and return
api_out(EAFNOSUPPORT, 0, NULL) if it is out of range; do this before reading
flood_types[req->entry.type] and before accessing ops->add/ops->del so you avoid
any out-of-bounds access.

In `@modules/l2/control/vxlan.c`:
- Around line 250-261: The pointer/count update is racy on weakly-ordered CPUs:
after allocating and filling vteps you assign vxlan->flood_vteps = vteps and
then increment vxlan->n_flood_vteps, which can be seen by readers out-of-order
on ARM; insert a release memory barrier between those two stores by calling
rte_atomic_thread_fence(rte_memory_order_release) immediately after setting
vxlan->flood_vteps = vteps and before doing vxlan->n_flood_vteps++ so readers
cannot observe the new count without seeing the new pointer (keep the existing
rte_rcu_qsbr_synchronize and rte_free usage unchanged).

In `@modules/l2/datapath/vxlan_input.c`:
- Around line 79-85: rte_pktmbuf_adj can return NULL on insufficient headroom
but the result is ignored; modify the logic around rte_pktmbuf_adj in
vxlan_input.c to check its return value and, when NULL, set edge = NO_HEADROOM
(or jump to the existing NO_HEADROOM handling) instead of proceeding to use
iface_mbuf_data and IFACE_INPUT; ensure you do not call iface_mbuf_data or touch
the mbuf after a failed rte_pktmbuf_adj and keep existing fields (vlan_id, vtep,
iface) assignment only on success.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frr/zebra_dplane_grout.c (1)

219-225: ⚠️ Potential issue | 🟠 Major

Add GR_IFACE_TYPE_BRIDGE and GR_IFACE_TYPE_VXLAN to startup interface sync.

The types[] array in grout_sync_ifaces() omits bridge and VXLAN interfaces. When FRR starts or reconnects after these interfaces already exist in GROUT, FRR won't discover them during the initial startup sync—only runtime events will inform FRR of their existence. Since bridges and VXLAN tunnels are typically created before FRR starts, this gap causes FRR to miss them entirely until the next modification event.

Both types are fully handled by grout_link_change() and simply need to be added to the sync list. Suggest adding BRIDGE before VXLAN (since VXLAN may depend on bridge) and both before or alongside VLAN:

 	static const gr_iface_type_t types[] = {
 		GR_IFACE_TYPE_VRF,
+		GR_IFACE_TYPE_BRIDGE,
 		GR_IFACE_TYPE_BOND,
 		GR_IFACE_TYPE_IPIP,
 		GR_IFACE_TYPE_PORT,
+		GR_IFACE_TYPE_VXLAN,
 		GR_IFACE_TYPE_VLAN,
 	};
🤖 Fix all issues with AI agents
In `@frr/if_grout.c`:
- Around line 327-353: grout_fdb_change is missing setting the bridge ifindex on
the dplane context, so the handler grout_add_del_mac later reads an
uninitialized bridge via dplane_ctx_mac_get_br_ifindex(ctx); fix this by calling
dplane_ctx_mac_set_br_ifindex(ctx, ifindex_grout_to_frr(fdb->bridge_id)) (using
the existing ifindex_grout_to_frr helper) before enqueueing the context in
dplane_provider_enqueue_to_zebra(ctx) so the bridge_id is correctly propagated.

In `@modules/l2/control/vxlan.c`:
- Around line 67-83: The block mutates VRF refcounts and cur->encap_vrf_id then
deletes the old vxlan_hash key, but returns on later failures (EADDRINUSE,
ERANGE, or rte_hash_add_key_data) without rolling back, leaking refcounts and
removing the only hash entry; to fix, perform all validations first (check
next->vni range and rte_hash_lookup for next_key) before touching VRF refcounts
or calling rte_hash_del_key, or if you must change state early, add rollback
paths that on any subsequent error: restore the original cur->encap_vrf_id,
decrement the new VRF and increment the old VRF accordingly, and re-add the old
vxlan_hash entry (using rte_hash_add_key_data) before returning; key symbols:
GR_VXLAN_SET_ENCAP_VRF, cur->encap_vrf_id, vxlan_hash, rte_hash_del_key,
rte_hash_add_key_data, rte_hash_lookup.
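
The "validate first, then mutate" ordering suggested for the VXLAN reconfiguration path can be sketched as below. This is only an illustration under simplifying assumptions: a small linear table (`tunnels`) replaces the rte_hash table, all names are hypothetical, and VRF refcounting is omitted. In the real code the hash insertion itself can still fail, so a rollback path may remain necessary there.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VNI_MAX 0xffffffu // VNIs are 24 bits wide
#define SLOTS 8

// Toy stand-in for the (VNI, encap VRF) hash table.
struct tunnel {
	bool used;
	uint32_t vni;
	uint16_t vrf_id;
};
struct tunnel tunnels[SLOTS];

struct tunnel *tunnel_lookup(uint32_t vni, uint16_t vrf_id) {
	for (size_t i = 0; i < SLOTS; i++) {
		if (tunnels[i].used && tunnels[i].vni == vni && tunnels[i].vrf_id == vrf_id)
			return &tunnels[i];
	}
	return NULL;
}

struct tunnel *tunnel_add(uint32_t vni, uint16_t vrf_id) {
	if (vni > VNI_MAX || tunnel_lookup(vni, vrf_id) != NULL)
		return NULL;
	for (size_t i = 0; i < SLOTS; i++) {
		if (!tunnels[i].used) {
			tunnels[i] = (struct tunnel) {.used = true, .vni = vni, .vrf_id = vrf_id};
			return &tunnels[i];
		}
	}
	return NULL; // table full
}

int tunnel_rekey(struct tunnel *cur, uint32_t vni, uint16_t vrf_id) {
	struct tunnel *other;

	assert(cur != NULL && cur->used);

	// Phase 1: validation only, no side effects.
	if (vni > VNI_MAX)
		return -ERANGE;
	other = tunnel_lookup(vni, vrf_id);
	if (other != NULL && other != cur)
		return -EADDRINUSE;

	// Phase 2: commit. Nothing below can fail in this toy table.
	cur->vni = vni;
	cur->vrf_id = vrf_id;
	return 0;
}
```

A rejected request leaves the entry reachable under its original key, which is the property the review comment asks for.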

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@modules/l2/control/gr_l2_control.h`:
- Around line 73-87: The VNI lookup mismatch on little-endian systems is caused
by using different encodings: insertion uses vxlan_encode_vni() but control-path
lookups use rte_cpu_to_be_32(), producing different keys; fix by changing the
control-path lookup to use vxlan_encode_vni() (or consistently use
vxlan_encode_vni()/vxlan_decode_vni() everywhere) so keys match insertion/lookup
semantics—update the lookup sites that currently call rte_cpu_to_be_32() to call
vxlan_encode_vni() (or refactor insertion to use rte_cpu_to_be_32() if you
prefer the other convention), ensuring the same encoding function is used in
vxlan_encode_vni, vxlan_decode_vni, hash insertion, and control-path lookup
functions.

In `@modules/l2/control/vxlan.c`:
- Around line 86-100: The code currently ignores return values from
l4_input_unalias_port and l4_input_alias_port inside the GR_VXLAN_SET_DST_PORT
branch; update the logic to check their returns and handle failures: call
l4_input_unalias_port and if it fails log the error and abort updating
cur->dst_port (or return the error), then call l4_input_alias_port and if it
fails revert any prior unalias/alias changes (restore previous alias state), log
the error and return a failure code instead of setting cur->dst_port;
specifically modify the block handling set_attrs & GR_VXLAN_SET_DST_PORT around
variables next->dst_port and cur->dst_port to only assign cur->dst_port after
successful l4_input_alias_port / l4_input_unalias_port calls and propagate the
error (handle EADDRNOTAVAIL/EADDRINUSE) to the caller.
🧹 Nitpick comments (1)
modules/l2/control/fdb.c (1)

339-347: Redundant assignment of fdb_max_entries.

fdb_reconfig() already sets fdb_max_entries = max_entries at line 81. Line 346 sets it again. Not a bug, but worth noting.

@rjarry rjarry force-pushed the vxlan branch 5 times, most recently from a62fe2b to cd21aba Compare February 17, 2026 16:03
@rjarry rjarry force-pushed the vxlan branch 5 times, most recently from fde2085 to 86aa07b Compare February 20, 2026 09:20
Introduce the VXLAN interface type for the L2 module. A VXLAN
interface carries a VNI (VXLAN Network Identifier), a local VTEP
address used as the outer IP source, an encapsulation VRF for
underlay routing, and a configurable UDP destination port (default
4789).

VXLAN interfaces are keyed by (VNI, encap_vrf_id) in a lockfree
RCU-protected hash table so that the datapath can resolve incoming
tunneled packets to the correct interface without locks.

VXLAN interfaces are intended to be attached to a bridge domain.
All L2 traffic entering the bridge is forwarded transparently over
the VXLAN tunnel. The local VTEP address must already be configured
in the encapsulation VRF.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
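
As an illustration of the (VNI, encap_vrf_id) keying described in the commit message above, the sketch below converts between the 24-bit VNI and its position in the VXLAN header (RFC 7348 places it in the upper three bytes of the second 32-bit word) and builds a lookup key from the decoded value. The helper names and the 64-bit key layout are assumptions made for this example, not the PR's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define VNI_MAX 0xffffffu // 24-bit VNI

// Write the VNI into the 4-byte "VNI + reserved" word of a VXLAN header.
// Byte-wise stores keep the result independent of host endianness.
void vni_to_wire(uint32_t vni, uint8_t wire[4]) {
	assert(vni <= VNI_MAX);
	wire[0] = (vni >> 16) & 0xff;
	wire[1] = (vni >> 8) & 0xff;
	wire[2] = vni & 0xff;
	wire[3] = 0; // reserved
}

// Recover the host-order VNI from the same 4-byte word.
uint32_t vni_from_wire(const uint8_t wire[4]) {
	return ((uint32_t)wire[0] << 16) | ((uint32_t)wire[1] << 8) | wire[2];
}

// Hypothetical lookup key: both the control plane (insert) and the datapath
// (lookup) must build it from the same host-order VNI.
uint64_t vxlan_key(uint32_t vni, uint16_t vrf_id) {
	assert(vni <= VNI_MAX);
	return ((uint64_t)vrf_id << 32) | vni;
}
```

Whatever encoding is chosen, using one helper on both the insertion and the lookup side avoids keys that differ only in byte order.
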
VXLAN uses UDP port 4789 by default but allows configuring a custom
destination port per interface. Allow the control plane to register
additional UDP ports at runtime as aliases for an already registered
port, reusing the same datapath edge.

Use reference counting so that multiple interfaces sharing the same
non-default port do not interfere with each other during teardown.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
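
A reference-counted alias table of the kind described in the commit message above might be sketched as follows. The table names, the `EDGE_UNUSED` value and the error codes are assumptions for illustration; the PR's l4_input_alias_port and l4_input_unalias_port may behave differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define EDGE_UNUSED 0 // hypothetical "no handler" edge value

// Hypothetical per-port tables, indexed by host-order UDP port.
uint8_t port_edge[65536];
uint16_t port_refs[65536];

// Bind `port` to its own datapath edge.
int port_register(uint16_t port, uint8_t edge) {
	assert(edge != EDGE_UNUSED);
	if (port_edge[port] != EDGE_UNUSED)
		return -EADDRINUSE;
	port_edge[port] = edge;
	port_refs[port] = 1;
	return 0;
}

// Make `alias` reuse the edge of the already registered `port`.
int port_alias(uint16_t alias, uint16_t port) {
	uint8_t edge = port_edge[port];

	if (edge == EDGE_UNUSED)
		return -EADDRNOTAVAIL; // nothing to alias to
	if (port_edge[alias] != EDGE_UNUSED && port_edge[alias] != edge)
		return -EADDRINUSE; // would clobber another handler
	assert(port_refs[alias] < UINT16_MAX);
	port_edge[alias] = edge;
	port_refs[alias]++;
	return 0;
}

// Drop one reference; the slot is released when the last user goes away.
int port_unalias(uint16_t alias) {
	if (port_refs[alias] == 0)
		return -ENOENT;
	if (--port_refs[alias] == 0)
		port_edge[alias] = EDGE_UNUSED;
	return 0;
}
```

With this shape, two interfaces that share the same non-default port each take a reference, and the alias only disappears when the second one is torn down.
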
Wire up the VXLAN interface's configurable destination port to the
L4 input node. When a non-default port is configured, register it
as an alias for the standard VXLAN port (4789) so that the datapath
delivers matching UDP packets to the vxlan_input node.

Unregister the alias when the port changes or the interface is
destroyed.

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Introduce a transport-agnostic flood list framework for BUM traffic
(Broadcast, Unknown unicast, Multicast). In EVPN, each PE maintains
a flooding list built from IMET routes (RFC 8365, RFC 9572). The
entries in this list differ depending on the overlay encapsulation:
VXLAN uses a remote VTEP IPv4 address and a VNI, while SRv6 would
use a 128-bit SID.

The API defines a gr_flood_entry structure with a type discriminant
and a union, allowing future encapsulation types (e.g. SRv6 SIDs)
to be added without changing the API request types. A dispatch
layer in control/flood.c routes add/del/list operations to
type-specific callbacks registered at init time.

Implement the VXLAN VTEP flood type (GR_FLOOD_T_VTEP). Each VXLAN
interface maintains a per-VNI array of remote VTEP addresses used
by the vxlan_flood datapath node for ingress replication. The array
is replaced atomically with an RCU synchronization barrier so that
the datapath never sees a partially updated list.

CLI commands are exposed under "flood vtep add/del/show". Add a new
generated grcli-flood(1) man page.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
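
The tagged-union entry and dispatch layer described in the commit message above could take roughly the following shape. All identifiers here are hypothetical simplifications (the real gr_flood_entry layout is not shown in this conversation), and the VTEP backend is a toy counter. The range check on the type field reflects the review comment about validating the index before use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { FLOOD_T_VTEP = 1, FLOOD_T_COUNT };

// Hypothetical entry: a type discriminant followed by per-type data.
struct flood_entry {
	uint8_t type; // untrusted when it comes from an API request
	uint16_t vrf_id;
	union {
		struct {
			uint32_t vni;
			uint32_t addr;
		} vtep;
		// a future SRv6 member (128-bit SID) would be added here
	};
};

struct flood_ops {
	int (*add)(const struct flood_entry *);
	int (*del)(const struct flood_entry *);
};

const struct flood_ops *flood_types[FLOOD_T_COUNT];

void flood_type_register(uint8_t type, const struct flood_ops *ops) {
	assert(type < FLOOD_T_COUNT);
	flood_types[type] = ops;
}

// Bounds-check the discriminant before it is used as an array index.
const struct flood_ops *flood_ops_get(uint8_t type) {
	return type < FLOOD_T_COUNT ? flood_types[type] : NULL;
}

int flood_add(const struct flood_entry *e) {
	const struct flood_ops *ops = flood_ops_get(e->type);
	if (ops == NULL || ops->add == NULL)
		return -EAFNOSUPPORT;
	return ops->add(e);
}

int flood_del(const struct flood_entry *e) {
	const struct flood_ops *ops = flood_ops_get(e->type);
	if (ops == NULL || ops->del == NULL)
		return -EAFNOSUPPORT;
	return ops->del(e);
}

// Toy VTEP backend: only counts entries.
unsigned vtep_entries;
int vtep_add(const struct flood_entry *e) {
	(void)e;
	vtep_entries++;
	return 0;
}
int vtep_del(const struct flood_entry *e) {
	(void)e;
	if (vtep_entries == 0)
		return -ENOENT;
	vtep_entries--;
	return 0;
}
const struct flood_ops vtep_ops = {.add = vtep_add, .del = vtep_del};
```

Adding another encapsulation would then mean one new union member, one new type value and one more flood_type_register call, without touching the request handlers.
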
In a VXLAN overlay, the bridge needs to know which remote VTEP to use
when sending unicast frames to a learned MAC address. Add a VTEP IPv4
address field to FDB entries so that known unicast traffic can be sent
directly to the correct tunnel endpoint instead of being flooded to all
VTEPs.

When bridge_input learns a MAC address from a VXLAN member interface, it
records the source VTEP from the decapsulated packet's outer IP header.
When forwarding to a known destination, the stored VTEP address is
passed to the output path via the mbuf private data so that vxlan_output
can build the correct outer header.

Only set the VTEP field when the source interface is actually a VXLAN
type to avoid storing uninitialized data from other packet paths
(control plane, local bridge traffic).

Signed-off-by: Robin Jarry <rjarry@redhat.com>
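
The learning rule from the commit message above (record the source VTEP only for VXLAN members) can be pictured with this small sketch. `struct fdb_entry`, `enum iface_type` and `fdb_learn` are hypothetical reductions of the real structures, which carry more fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum iface_type { IFACE_T_PORT, IFACE_T_VXLAN };

struct fdb_entry {
	uint8_t mac[6];
	uint16_t iface_id;
	uint32_t vtep; // remote VTEP IPv4 address, 0 when not applicable
};

// Fill an FDB entry from a received frame. `src_vtep` is only meaningful
// when the frame was decapsulated by a VXLAN interface; for every other
// source it may hold stale data and is therefore ignored.
void fdb_learn(struct fdb_entry *e, const uint8_t mac[6], uint16_t iface_id,
	enum iface_type src_type, uint32_t src_vtep) {
	assert(e != NULL && mac != NULL);
	memcpy(e->mac, mac, sizeof(e->mac));
	e->iface_id = iface_id;
	e->vtep = src_type == IFACE_T_VXLAN ? src_vtep : 0;
}
```

On the forwarding side, a non-zero `vtep` would be handed to the output path so that known unicast goes to a single tunnel endpoint.
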
Add three datapath nodes for VXLAN packet processing.

vxlan_input decapsulates incoming UDP/4789 packets. It strips the
outer UDP and VXLAN headers, resolves the inner VNI to a VXLAN
interface via the RCU-protected hash table, records the source VTEP
from the outer IP header into the mbuf private data, and forwards
the inner Ethernet frame to iface_input for bridge processing.

vxlan_output encapsulates outgoing frames for a known destination
VTEP. It prepends a pre-built IP/UDP/VXLAN header template
initialized by the control plane, fills in the per-packet fields
(destination VTEP, UDP length, IP length, checksum), and hashes the
inner flow to select an ephemeral source port for underlay ECMP
(RFC 7348 Section 5). The FIB lookup for the outer IP uses the
encapsulation VRF, not the bridge domain.

vxlan_flood handles BUM traffic by replicating the frame to every
VTEP in the flood list via ingress replication. The original mbuf
is sent to the first VTEP and clones are created for the rest.

The bridge_flood node is updated to steer VXLAN member traffic
through vxlan_flood instead of direct iface_output.

Signed-off-by: Robin Jarry <rjarry@redhat.com>
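
The source port selection mentioned for vxlan_output can be illustrated as below. RFC 7348 recommends deriving the outer UDP source port from a hash of inner packet fields and keeping it in the dynamic/private range 49152-65535. The FNV-1a hash and both function names are placeholders chosen for this sketch; the conversation does not say which hash or range the PR uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPORT_MIN 49152u   // start of the dynamic/private port range
#define SPORT_COUNT 16384u // 49152..65535 inclusive, a power of two

// FNV-1a over the inner frame's first bytes; a placeholder for whatever
// flow hash the datapath really uses.
uint32_t flow_hash(const uint8_t *data, size_t len) {
	uint32_t h = 2166136261u;
	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

// Map a 32-bit flow hash onto 49152..65535.
uint16_t vxlan_src_port(uint32_t hash) {
	uint32_t folded = hash ^ (hash >> 16); // mix high bits into the low ones
	uint32_t port = SPORT_MIN + (folded & (SPORT_COUNT - 1));
	assert(port <= UINT16_MAX);
	return (uint16_t)port;
}
```

Packets of one inner flow always map to the same source port, so underlay ECMP keeps them on one path while different flows can spread across paths.
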
Set up a VXLAN overlay between grout and a Linux netns peer. Grout
runs a bridge with a VXLAN member (VNI 100) and the Linux side
mirrors the topology with a kernel VXLAN device enslaved to a Linux
bridge. Both sides have flood lists configured with each other's
VTEP address for BUM traffic replication.

The test verifies L3 connectivity over the tunnel by having the
Linux side ping the bridge address. This exercises the full path:
ARP resolution over VXLAN, FDB learning from decapsulated traffic,
and ICMP echo reply via the VXLAN output encapsulation.

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Report bridge interfaces to FRR as ZEBRA_IF_BRIDGE with their MAC
address. Tag members with ZEBRA_IF_SLAVE_BRIDGE and propagate the
bridge ifindex so that FRR can associate them with the correct master.

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Report VXLAN interfaces to FRR's zebra as ZEBRA_IF_VXLAN with the
associated L2 VNI information. This allows FRR's EVPN control
plane to discover which VNIs are locally configured and advertise
them via BGP IMET routes to remote PEs.

The VXLAN L2 info includes the VNI, the local VTEP address, and
the underlay interface index so that zebra can correlate the tunnel
with the correct underlay routing context.

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Synchronize bridge FDB entries bidirectionally between grout and FRR.
This is required for EVPN to advertise locally learned MAC addresses
via BGP type-2 routes and to install remotely learned MACs into the
bridge forwarding table.

Zebra's dplane API is asymmetric for MAC/FDB entries. In the downward
direction (zebra to dplane provider), zebra uses DPLANE_OP_MAC_INSTALL
and DPLANE_OP_MAC_DELETE to push MACs into the dataplane. In the
upward direction (dplane provider notifying zebra of learned MACs),
DPLANE_OP_NEIGH_INSTALL and DPLANE_OP_NEIGH_DELETE must be used
instead. These go through zebra_neigh_macfdb_update() which calls
zebra_vxlan_local_mac_add_update() and ultimately triggers BGP EVPN
type-2 route advertisement. By contrast, the DPLANE_OP_MAC_* result
handler (zebra_vxlan_handle_result) is a no-op. Despite the NEIGH op
name, the context payload uses the macinfo union member and is
populated with dplane_ctx_mac_set_*() accessors, exactly like zebra's
own netlink provider does in netlink_macfdb_change().

Unlike routes and nexthops which use higher-level zebra APIs that
resolve the namespace from the VRF ID, the FDB notification path
looks up interfaces via if_lookup_by_index_per_ns(ns_id, ifindex).
GROUT_NS must therefore be set on the dplane context for the
interface lookup to succeed.

Function names follow zebra's rt_netlink.c naming conventions:
grout_macfdb_change() for the upward notification path (like
netlink_macfdb_change) and grout_macfdb_update_ctx() for the
downward install path (like netlink_macfdb_update_ctx).

Self-event suppression is enabled on the FDB event subscriptions
to prevent feedback loops when FRR installs a MAC that was originally
learned by grout.

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Handle DPLANE_OP_VTEP_ADD and DPLANE_OP_VTEP_DELETE operations from
FRR's EVPN control plane. When BGP learns a remote VTEP via an IMET
route (EVPN type-3), zebra pushes the VTEP to the dataplane provider.

The grout_vxlan_flood_update_ctx() function (named after zebra's
netlink_vxlan_flood_update_ctx() in rt_netlink.c) translates these
operations into GR_FLOOD_ADD/DEL requests with GR_FLOOD_T_VTEP type.
This is a downward-only path: zebra pushes flood list entries to the
dplane provider. There is no upward notification for VTEP flood list
changes because grout does not learn VTEPs on its own; they are always
provided by FRR's BGP EVPN control plane.

This allows BGP EVPN to dynamically manage the per-VNI flood lists
used for BUM traffic ingress replication, replacing the need for
static flood list configuration via the CLI.

Signed-off-by: Robin Jarry <rjarry@redhat.com>

Set up a full EVPN/VXLAN topology between FRR+grout and a
standalone FRR+Linux peer. Each side runs a bridge with a VXLAN
member (VNI 100) and a host namespace. Both peers run iBGP with
the l2vpn evpn address-family and advertise-all-vni.

The test verifies that EVPN type-3 (IMET) routes are exchanged so
that both sides install each other's VTEP in their flood lists.
It then verifies end-to-end L2 connectivity by pinging between the
two host namespaces through the VXLAN overlay, which exercises
type-2 (MAC/IP) route advertisement and FDB synchronization.

Signed-off-by: Robin Jarry <rjarry@redhat.com>