Linux network stack inner workings

Where your packets actually go

Inside the Linux network stack.

From a frame landing in a NIC's ring buffer to recv() waking your process — and all the way back out. We follow a packet through every layer, take apart the two structures the whole stack is built on (sk_buff and struct sock), and mark exactly where eBPF — including NetWatch's kprobe — taps in.

RX & TX paths · sk_buff · struct sock · NAPI & softirqs · netfilter · TCP state machine · XDP · tc · kprobes

Five acts · ~21 stops · drive the sk_buff pointers, follow a packet up the stack, step the TCP handshake, traverse the netfilter hooks.

One packet, many layers

The stack is a pipeline you can name end to end

Every received packet climbs the same ladder; every sent one descends it. Each rung is a real subsystem with real entry points — and a place where you can observe or intercept.

  • Driver / NAPI: DMA, interrupts, polling. XDP hooks here, pre-skb.
  • Link → IP: netif_receive_skb → ip_rcv. tc-BPF & netfilter hook here.
  • Transport: tcp_v4_rcv / udp_rcv, socket lookup. kprobes (NetWatch) hook here.
  • Socket → app: receive queue, recv() wakes.
application — recv() / send()
↕ socket layer
TCP / UDP — tcp_v4_rcv · tcp_sendmsg
IP — ip_rcv · ip_queue_xmit
↕ netfilter · tc
link / driver — netif_receive_skb · ndo_start_xmit
↕ NAPI · XDP
NIC — DMA ring buffers, IRQ

Same ladder both ways: learn the RX climb (Act II), and the TX descent (Act III) is its mirror.

Learn these two and everything else is verbs on them

sk_buff carries the packet; struct sock is the connection

Almost every function in the stack takes an skb, a sock, or both. The skb is one packet in flight; the sock is the kernel side of a socket.

  • sk_buff — a packet plus metadata. Passed by pointer up and down; headers are added/stripped by sliding pointers, not copying (Act II).
  • struct sock — receive/send queues, connection state, the 4-tuple. Its sock_common head holds skc_daddr / skc_dport — the very fields NetWatch's kprobe reads.

Everything below is how these two move through the layers.

include/linux/skbuff.h (simplified)

From wire to softirq

Interrupts get you in the door; NAPI keeps the door from jamming

At line rate, one interrupt per packet would melt the CPU. The kernel takes one interrupt, then switches to polling under load — that's NAPI.

  • DMA + IRQ. The NIC copies the frame into an RX ring in memory, then raises an interrupt.
  • Top half (tiny). The IRQ handler masks further RX interrupts and calls napi_schedule() — raising NET_RX_SOFTIRQ.
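  • NAPI poll. net_rx_action runs the driver's poll(), which processes up to its budget of packets (a weight of 64 by default). When the ring drains before the budget is used up, the driver completes NAPI and re-enables IRQs.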
  • ksoftirqd. If softirqs flood, they're handed to a per-CPU kernel thread so userspace isn't starved.
NIC: DMA frame → RX ring · raise IRQ
↓ hardirq (top half)
mask IRQ · napi_schedule() · raise NET_RX_SOFTIRQ
↓ softirq
net_rx_action → driver poll() · budget = 64
↓ per packet
napi_gro_receive → up the stack

This split — fast hardirq, deferred softirq, polled under load — is why Linux can saturate a 10/40/100G link without livelock.
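The interrupt-then-poll handoff can be sketched in userland C. This is a toy model under stated assumptions, not driver code: struct toy_nic, toy_irq_handler and toy_poll are hypothetical names, and a counter stands in for the RX ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one RX queue. All names are hypothetical, not kernel API. */
struct toy_nic {
    int  ring_pending;   /* frames DMA'd into the RX ring, not yet processed */
    bool irq_enabled;    /* is the RX interrupt unmasked? */
    bool poll_scheduled; /* stands in for NET_RX_SOFTIRQ being raised */
};

/* Hard IRQ "top half": mask further RX interrupts and schedule polling. */
static void toy_irq_handler(struct toy_nic *nic)
{
    nic->irq_enabled = false;
    nic->poll_scheduled = true;
}

/* One poll pass: process at most `budget` frames and return the work done. */
static int toy_poll(struct toy_nic *nic, int budget)
{
    assert(budget > 0);
    int done = nic->ring_pending < budget ? nic->ring_pending : budget;
    nic->ring_pending -= done;
    if (done < budget) {
        /* Ring drained before the budget ran out: leave polling mode
         * and unmask the interrupt again. */
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

With 100 frames pending and a budget of 64, the first pass handles 64 and stays in polling mode; the second handles the remaining 36 and re-enables the interrupt.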

Headers without copying — just move pointers

Drive the four pointers: head · data · tail · end

An skb wraps one buffer with four pointers. Adding a header (skb_push) slides data left into the headroom; stripping one (skb_pull) slides it right. No payload is ever copied. Build a packet TX-style, then strip it RX-style.

one buffer, sliding pointers

headroom = data − head; tailroom = end − tail. Drivers reserve headroom up front so every layer can prepend its header by just moving a pointer.
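The four-pointer arithmetic can be tried outside the kernel. The following is a minimal sketch with a hypothetical struct toy_skb and toy_skb_* helpers that only mimic what skb_reserve, skb_put, skb_push and skb_pull do to the pointers; the real sk_buff has many more fields and its own allocation rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for sk_buff: one buffer, four pointers. Not the kernel struct. */
struct toy_skb {
    unsigned char buf[256];
    unsigned char *head, *data, *tail, *end;
};

static void toy_skb_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
}

/* Reserve headroom on an empty buffer (compare skb_reserve). */
static void toy_skb_reserve(struct toy_skb *skb, size_t len)
{
    assert(skb->data == skb->tail && (size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Append payload space at the tail (compare skb_put). */
static unsigned char *toy_skb_put(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->end - skb->tail) >= len);
    unsigned char *old_tail = skb->tail;
    skb->tail += len;
    return old_tail;
}

/* Prepend a header: data slides toward head (compare skb_push). */
static unsigned char *toy_skb_push(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    return skb->data;
}

/* Strip a header: data slides toward tail (compare skb_pull). */
static unsigned char *toy_skb_pull(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->tail - skb->data) >= len);
    skb->data += len;
    return skb->data;
}

static size_t toy_skb_len(const struct toy_skb *skb)      { return (size_t)(skb->tail - skb->data); }
static size_t toy_skb_headroom(const struct toy_skb *skb) { return (size_t)(skb->data - skb->head); }
static size_t toy_skb_tailroom(const struct toy_skb *skb) { return (size_t)(skb->end - skb->tail); }
```

Reserving 54 bytes, putting a 5-byte payload, then pushing 20 (TCP), 20 (IP) and 14 (Ethernet) bytes uses up the headroom exactly. Pulling the same three sizes back off returns data to the payload, which never moved.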

One packet, all the way up

From DMA to recv() — step it

Follow a single TCP segment from the wire to the application. Each step is a real kernel entry point; the data structure it touches is named as it goes.

inbound: 142.250.72.4:443 → :52344

The TX path (Act III) is this exact ladder in reverse: tcp_sendmsg → ip_queue_xmit → qdisc → ndo_start_xmit → DMA out.

How the stack decides where a packet goes

Two demuxes and a hash table

Climbing the stack is a sequence of "which handler owns this?" decisions, each a fast lookup.

  • L2 → L3 by EtherType. __netif_receive_skb_core dispatches on skb->protocol (0x0800 → ip_rcv) via the registered packet_type list.
  • L3 → L4 by protocol number. The IP header's protocol field (6 → TCP) selects tcp_v4_rcv.
  • L4 → socket by 4-tuple. __inet_lookup hashes (saddr, sport, daddr, dport) into the established table; misses fall to the listen table (a new connection).

That established hash is the hot path — every packet of every connection hits it.
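The first two demuxes amount to table lookups keyed on a header field. Here is a small userland sketch, assuming hypothetical tables (toy_l3_table, toy_l4_table) in place of the kernel's registered handler lists; the enum values only name the real entry points they stand for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* EtherType 0x0800 is IPv4; IP protocol 6 is TCP and 17 is UDP. */
enum { TOY_ETH_P_IP = 0x0800, TOY_PROTO_TCP = 6, TOY_PROTO_UDP = 17 };

/* Which real kernel entry point a toy table entry stands for. */
enum toy_rx { TOY_RX_DROP, TOY_RX_IP_RCV, TOY_RX_TCP_V4_RCV, TOY_RX_UDP_RCV };

struct toy_entry { uint16_t key; enum toy_rx handler; };

/* Hypothetical registration tables, standing in for the kernel's handler lists. */
static const struct toy_entry toy_l3_table[] = {
    { TOY_ETH_P_IP, TOY_RX_IP_RCV },
};
static const struct toy_entry toy_l4_table[] = {
    { TOY_PROTO_TCP, TOY_RX_TCP_V4_RCV },
    { TOY_PROTO_UDP, TOY_RX_UDP_RCV },
};

static enum toy_rx toy_find(const struct toy_entry *tbl, size_t n, uint16_t key)
{
    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++)
        if (tbl[i].key == key)
            return tbl[i].handler;
    return TOY_RX_DROP; /* nobody registered for this key */
}

/* Demux 1: the frame's EtherType picks the L3 handler. */
static enum toy_rx toy_demux_l3(uint16_t ethertype)
{
    return toy_find(toy_l3_table, sizeof toy_l3_table / sizeof toy_l3_table[0], ethertype);
}

/* Demux 2: the IP header's protocol field picks the L4 handler. */
static enum toy_rx toy_demux_l4(uint8_t ip_protocol)
{
    return toy_find(toy_l4_table, sizeof toy_l4_table / sizeof toy_l4_table[0], ip_protocol);
}
```

The toy registers only IPv4, TCP and UDP, so any other EtherType or protocol number falls through to TOY_RX_DROP.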

net/ipv4/tcp_ipv4.c (sketch)

The same ladder, descending

From send() down to the wire

Sending mirrors receiving. Your bytes are copied into skbs, headers are pushed on at each layer, and the packet is handed to the driver — gated by congestion control and queueing on the way.

  • tcp_sendmsg copies user data into skbs on the sk_write_queue.
  • tcp_write_xmit sends only what the congestion window (cwnd) and the peer's receive window allow; TCP's flow and congestion control are enforced here.
  • ip_queue_xmit adds the IP header, routes, runs LOCAL_OUT + POSTROUTING, resolves the next hop (ARP).
  • dev_queue_xmit hands off to the qdisc, which schedules the actual ndo_start_xmit into the NIC.
send() → tcp_sendmsg → sk_write_queue
↓ cwnd gate
tcp_write_xmit → tcp_transmit_skb (TCP hdr)
ip_queue_xmit → LOCAL_OUT · POSTROUTING · ARP
dev_queue_xmit → qdisc → ndo_start_xmit
↓ DMA
NIC → wire

Two gates the RX path doesn't have: congestion control (how much TCP may send) and the qdisc (when the link actually accepts it). Act IV opens the qdisc.
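The congestion-control gate reduces to simple arithmetic. This is an illustrative sketch, not kernel code: toy_sendable_segments is a hypothetical helper, all quantities are counted in whole segments, and the peer's receive window is assumed to be already converted to segments.

```c
#include <assert.h>
#include <stdint.h>

/* How many queued segments may leave right now (toy model).
 * cwnd      : congestion window, in segments
 * rwnd_segs : peer's advertised window, converted to segments (assumption)
 * in_flight : segments sent but not yet acknowledged
 * queued    : segments waiting on the write queue */
static uint32_t toy_sendable_segments(uint32_t cwnd, uint32_t rwnd_segs,
                                      uint32_t in_flight, uint32_t queued)
{
    assert(cwnd > 0); /* toy precondition: the window never collapses to zero */

    /* The sender is bounded by the smaller of the two windows. */
    uint32_t window = cwnd < rwnd_segs ? cwnd : rwnd_segs;
    if (in_flight >= window)
        return 0;     /* window is full: wait for ACKs */

    uint32_t room = window - in_flight;
    return queued < room ? queued : room;
}
```

With cwnd = 10, 4 segments in flight and 20 queued, 6 may be sent. The other 14 stay on the write queue until ACKs open the window.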

The connection, and how the kernel finds it

Every packet is matched to a struct sock by its 4-tuple

A struct sock is the kernel half of a socket: its queues, its state, its identity. The stack finds the right one with two hash tables.

  • Established hash (ehash) — keyed on the full 4-tuple (saddr, sport, daddr, dport). The hot path: every data packet looks up here.
  • Listen hash — keyed on local port; a SYN that misses ehash matches a listener and starts a new connection.
  • sock_common is the head: skc_daddr, skc_dport, skc_state. The exact fields NetWatch's kprobe reads.
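The two-table lookup can be sketched as follows. This is a userland toy under stated assumptions: struct toy_sock, struct toy_tables and the toy_* functions are hypothetical, the hash is a simple field mix rather than the kernel's keyed hash, and tuples are stored from the incoming packet's point of view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUCKETS 16

struct toy_tuple { uint32_t saddr, daddr; uint16_t sport, dport; };

struct toy_sock {
    struct toy_tuple tuple;  /* listeners only use tuple.dport (the local port) */
    struct toy_sock *next;   /* chain within one hash bucket */
};

struct toy_tables {
    struct toy_sock *ehash[TOY_BUCKETS]; /* established: keyed on the 4-tuple */
    struct toy_sock *lhash[TOY_BUCKETS]; /* listeners: keyed on local port */
};

/* Toy hash; the kernel uses a keyed hash, this only mixes the fields. */
static unsigned toy_hash(const struct toy_tuple *t)
{
    uint32_t h = t->saddr ^ t->daddr ^ (((uint32_t)t->sport << 16) | t->dport);
    return (h ^ (h >> 16)) % TOY_BUCKETS;
}

static int toy_same(const struct toy_tuple *a, const struct toy_tuple *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

static void toy_add_established(struct toy_tables *tb, struct toy_sock *sk)
{
    assert(sk->next == NULL);
    unsigned b = toy_hash(&sk->tuple);
    sk->next = tb->ehash[b];
    tb->ehash[b] = sk;
}

static void toy_add_listener(struct toy_tables *tb, struct toy_sock *sk)
{
    assert(sk->next == NULL);
    unsigned b = sk->tuple.dport % TOY_BUCKETS;
    sk->next = tb->lhash[b];
    tb->lhash[b] = sk;
}

/* Packet tuple: saddr/sport are the remote end, daddr/dport the local end. */
static struct toy_sock *toy_sock_lookup(const struct toy_tables *tb,
                                        const struct toy_tuple *pkt)
{
    struct toy_sock *sk;
    for (sk = tb->ehash[toy_hash(pkt)]; sk; sk = sk->next)
        if (toy_same(&sk->tuple, pkt))
            return sk;   /* hot path: an existing connection */
    for (sk = tb->lhash[pkt->dport % TOY_BUCKETS]; sk; sk = sk->next)
        if (sk->tuple.dport == pkt->dport)
            return sk;   /* ehash miss: a listener on that port */
    return NULL;         /* no socket owns this packet */
}
```

A data segment for a known connection resolves in the established table. A SYN from a new peer misses there and lands on the listener for its destination port.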
include/net/sock.h (simplified)

A connection is a walk through states

Step the handshake — and the teardown

skc_state moves through a fixed state machine, driven by segments arriving and the app calling connect/close. The current state is what ss and NetWatch report.

TCP state transitions

TIME_WAIT (2×MSL) is the famous one: the active closer lingers so that delayed segments from the old connection can't be mistaken for data on a new connection reusing the same tuple, and so it can re-ACK a retransmitted FIN.
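The state machine can be written as a single transition function. This sketch follows the classic textbook diagram with hypothetical TOY_* names. It simplifies how Linux handles listeners and leaves out resets, simultaneous open and retransmission timers.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_CLOSED, TOY_LISTEN, TOY_SYN_SENT, TOY_SYN_RECV, TOY_ESTABLISHED,
                 TOY_FIN_WAIT1, TOY_FIN_WAIT2, TOY_CLOSING, TOY_TIME_WAIT,
                 TOY_CLOSE_WAIT, TOY_LAST_ACK };

enum toy_event { TOY_EV_CONNECT, TOY_EV_LISTEN, TOY_EV_CLOSE,     /* app calls */
                 TOY_EV_RX_SYN, TOY_EV_RX_SYNACK, TOY_EV_RX_ACK,  /* segments  */
                 TOY_EV_RX_FIN, TOY_EV_TIMEOUT };                 /* 2xMSL timer */

/* Return the next state; events that do not apply leave the state unchanged. */
static enum toy_state toy_tcp_step(enum toy_state s, enum toy_event ev)
{
    switch (s) {
    case TOY_CLOSED:
        if (ev == TOY_EV_CONNECT) return TOY_SYN_SENT;   /* active open */
        if (ev == TOY_EV_LISTEN)  return TOY_LISTEN;     /* passive open */
        break;
    case TOY_LISTEN:
        if (ev == TOY_EV_RX_SYN) return TOY_SYN_RECV;
        break;
    case TOY_SYN_SENT:
        if (ev == TOY_EV_RX_SYNACK) return TOY_ESTABLISHED;
        break;
    case TOY_SYN_RECV:
        if (ev == TOY_EV_RX_ACK) return TOY_ESTABLISHED;
        break;
    case TOY_ESTABLISHED:
        if (ev == TOY_EV_CLOSE)  return TOY_FIN_WAIT1;   /* we close first */
        if (ev == TOY_EV_RX_FIN) return TOY_CLOSE_WAIT;  /* peer closes first */
        break;
    case TOY_FIN_WAIT1:
        if (ev == TOY_EV_RX_ACK) return TOY_FIN_WAIT2;
        if (ev == TOY_EV_RX_FIN) return TOY_CLOSING;     /* simultaneous close */
        break;
    case TOY_FIN_WAIT2:
        if (ev == TOY_EV_RX_FIN) return TOY_TIME_WAIT;
        break;
    case TOY_CLOSING:
        if (ev == TOY_EV_RX_ACK) return TOY_TIME_WAIT;
        break;
    case TOY_TIME_WAIT:
        if (ev == TOY_EV_TIMEOUT) return TOY_CLOSED;
        break;
    case TOY_CLOSE_WAIT:
        if (ev == TOY_EV_CLOSE) return TOY_LAST_ACK;
        break;
    case TOY_LAST_ACK:
        if (ev == TOY_EV_RX_ACK) return TOY_CLOSED;
        break;
    default:
        break;
    }
    return s;
}

/* Apply a sequence of events and return the final state. */
static enum toy_state toy_tcp_run(enum toy_state s, const enum toy_event *evs, size_t n)
{
    assert(evs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s = toy_tcp_step(s, evs[i]);
    return s;
}
```

An active opener that also closes first ends in TIME_WAIT and only reaches CLOSED after the timeout event. The passive side goes through CLOSE_WAIT and LAST_ACK and closes immediately on the final ACK.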

Where firewalling and NAT actually happen

Five hooks, one routing decision

netfilter places five hook points along the IP path; iptables/nftables rules and conntrack/NAT run at them. Whether a packet sees LOCAL_IN or FORWARD depends on the routing decision in the middle.

inbound → … → outbound

Each hook returns a verdict: ACCEPT · DROP · QUEUE (to userspace) · STOLEN · REPEAT. Perimeter firewalls apply the same hook-and-verdict idea (pfSense/OPNsense do so with FreeBSD's pf rather than netfilter), but from the perimeter they are blind to the per-process view a host tool has.
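Which hooks a packet traverses follows from where it came from and what the routing decision says. A minimal sketch, with hypothetical names (toy_hook_path, enum toy_origin) and without the loopback case, where a locally generated packet is also delivered locally:

```c
#include <assert.h>
#include <stddef.h>

enum toy_hook { TOY_PREROUTING, TOY_LOCAL_IN, TOY_FORWARD, TOY_LOCAL_OUT, TOY_POSTROUTING };
enum toy_origin { TOY_FROM_WIRE, TOY_FROM_LOCAL_PROCESS };

#define TOY_MAX_PATH 3

/* Fill path[] with the hooks a packet visits, in order; return how many. */
static size_t toy_hook_path(enum toy_origin origin, int dst_is_local,
                            enum toy_hook path[TOY_MAX_PATH])
{
    assert(path != NULL);
    size_t n = 0;

    if (origin == TOY_FROM_LOCAL_PROCESS) {
        /* Locally generated traffic: two hooks on the way out. */
        path[n++] = TOY_LOCAL_OUT;
        path[n++] = TOY_POSTROUTING;
        return n;
    }

    /* Everything arriving from the wire starts at PREROUTING. */
    path[n++] = TOY_PREROUTING;
    if (dst_is_local) {
        path[n++] = TOY_LOCAL_IN;      /* routing decision: for this host */
    } else {
        path[n++] = TOY_FORWARD;       /* routing decision: pass it on */
        path[n++] = TOY_POSTROUTING;
    }
    return n;
}
```

A forwarded packet therefore never sees LOCAL_IN or LOCAL_OUT, which is why rules written for host traffic do not apply to traffic that is routed through the machine.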

The egress gatekeeper

qdiscs decide when a packet leaves

Between dev_queue_xmit and the driver sits the queueing discipline — the kernel's traffic scheduler. Every device has a root qdisc.

  • Classless (fq_codel, the default on many modern distributions): fair queueing plus active queue management to fight bufferbloat.
  • Classful (HTB): hierarchical bandwidth shaping with rate limits, borrowing, and priorities.
  • enqueue / dequeue. Packets are enqueued; the qdisc chooses dequeue order, then sch_direct_xmit pushes to the NIC.
  • tc-BPF egress and BQL (byte queue limits) hook here too.
dev_queue_xmit → enqueue on root qdisc
↓ scheduler picks
fq_codel / HTB — shape, prioritize, drop early
↓ dequeue
sch_direct_xmit → ndo_start_xmit
NIC TX ring

This is where shaping, prioritization, and AQM live — and where a misconfigured queue adds the latency users feel as "lag."
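The enqueue/dequeue contract is easiest to see in a deliberately simple scheduler. The sketch below is a strict-priority toy with three bands and tail drop. It is neither fq_codel nor HTB, and struct toy_qdisc, toy_enqueue and toy_dequeue are hypothetical names.

```c
#include <assert.h>

#define TOY_BANDS      3
#define TOY_BAND_LIMIT 4

/* Three small ring buffers of packet ids; band 0 is the highest priority. */
struct toy_qdisc {
    int pkts[TOY_BANDS][TOY_BAND_LIMIT];
    int head[TOY_BANDS];
    int len[TOY_BANDS];
};

/* Enqueue a packet id into a band. Returns 0, or -1 if the band is full
 * and the packet is dropped (tail drop). */
static int toy_enqueue(struct toy_qdisc *q, int band, int pkt_id)
{
    assert(band >= 0 && band < TOY_BANDS);
    if (q->len[band] == TOY_BAND_LIMIT)
        return -1;
    int slot = (q->head[band] + q->len[band]) % TOY_BAND_LIMIT;
    q->pkts[band][slot] = pkt_id;
    q->len[band]++;
    return 0;
}

/* Dequeue: the scheduler, not arrival order, decides what leaves next.
 * Returns a packet id, or -1 when every band is empty. */
static int toy_dequeue(struct toy_qdisc *q)
{
    for (int b = 0; b < TOY_BANDS; b++) {
        if (q->len[b] == 0)
            continue;
        int pkt = q->pkts[b][q->head[b]];
        q->head[b] = (q->head[b] + 1) % TOY_BAND_LIMIT;
        q->len[b]--;
        return pkt;
    }
    return -1;
}
```

Packets enqueued in the order band 2, band 0, band 1 come back out as band 0, band 1, band 2. A fifth packet into a full band is dropped, and that bounded queue is what keeps latency from growing without limit.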

Stateful, because the kernel remembers flows

Connection tracking turns a packet filter into a firewall

Raw netfilter sees packets; conntrack sees flows. It records each connection's tuple and state, which is what makes stateful firewalling and NAT possible.

  • State. Every flow is NEW, ESTABLISHED, RELATED, or INVALID — so "allow replies to connections I started" becomes one rule.
  • NAT. Rewriting addresses/ports: DNAT at PREROUTING (port forwarding), SNAT/masquerade at POSTROUTING (sharing one public IP).
  • The cost. Per-flow state in the conntrack table — finite, and a real tuning knob on busy gateways.
the stateful-firewall one-liner

Your code, at the layer that matters

Six places to attach a program — pick the one that sees your event

Every layer this deck walked is also an eBPF attach point. The earlier you hook, the faster and rawer; the later, the more context.

driver → … → socket

NetWatch uses a kprobe on tcp_v4_connect — late enough to know the process and destination, early enough to catch the connection's birth. Phase 2 adds more (Act V).
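Whatever probe reads skc_daddr and skc_dport receives them in network byte order, so they have to be converted before display. The following userland sketch shows only that decoding step, once the kernel has filled the fields in. It is not NetWatch's code: struct toy_sock_common and toy_format_dest are hypothetical, and the struct mirrors just the two fields named in this deck.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Toy mirror of the two destination fields; both are in network byte order. */
struct toy_sock_common {
    uint32_t skc_daddr;  /* IPv4 destination address, big-endian */
    uint16_t skc_dport;  /* destination port, big-endian */
};

/* Write "a.b.c.d:port" into out. Returns 0 on success, -1 if it does not fit. */
static int toy_format_dest(const struct toy_sock_common *skc, char *out, size_t outlen)
{
    assert(skc != NULL && out != NULL && outlen > 0);
    memset(out, 0, outlen); /* a failed call leaves an empty string behind */

    char ip[INET_ADDRSTRLEN];
    struct in_addr addr = { .s_addr = skc->skc_daddr };
    if (inet_ntop(AF_INET, &addr, ip, sizeof ip) == NULL)
        return -1;

    /* ntohs converts the port from network to host byte order before printing. */
    int n = snprintf(out, outlen, "%s:%u", ip, (unsigned)ntohs(skc->skc_dport));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Skipping the ntohs step is a common mistake: on a little-endian host, port 443 would print as 47873.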

Why the stack barely copies anything

Shared buffers, paged data, and one big skb for many packets

Performance comes from not copying. An skb's payload can live in unmapped pages, be shared by reference, and represent many packets at once.

  • Linear + paged. A small linear part plus frags[] pointing at page-cache pages — that's how sendfile ships a file without copying it.
  • skb_clone vs skb_copy. Clone shares the data buffer (bump dataref), new skb head only; copy is a true deep copy.
  • GRO (RX) / GSO·TSO (TX). Coalesce many segments into one big skb going up; keep one big skb going down and segment at the very last step (the NIC, ideally). Per-packet cost amortized away.
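The clone-versus-copy distinction comes down to a reference count on the data area. A minimal userland sketch, assuming hypothetical types (struct toy_buf as the per-user handle, struct toy_data as the shared payload with its dataref):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_DATA_MAX 64

/* Shared data area with a reference count (compare dataref in skb_shared_info). */
struct toy_data { int dataref; size_t len; unsigned char bytes[TOY_DATA_MAX]; };

/* Per-user handle; the kernel's equivalent is the sk_buff head. */
struct toy_buf { struct toy_data *data; };

static struct toy_buf toy_buf_alloc(const void *payload, size_t len)
{
    assert(payload != NULL && len <= TOY_DATA_MAX);
    struct toy_data *d = malloc(sizeof *d);
    assert(d != NULL);
    d->dataref = 1;
    d->len = len;
    memcpy(d->bytes, payload, len);
    return (struct toy_buf){ .data = d };
}

/* Clone: a new handle on the same data, one more reference. Nothing is copied. */
static struct toy_buf toy_buf_clone(const struct toy_buf *b)
{
    b->data->dataref++;
    return (struct toy_buf){ .data = b->data };
}

/* Copy: a new handle with a private duplicate of the bytes. */
static struct toy_buf toy_buf_copy(const struct toy_buf *b)
{
    return toy_buf_alloc(b->data->bytes, b->data->len);
}

/* Drop one handle; the data is freed when the last reference goes away. */
static void toy_buf_free(struct toy_buf *b)
{
    if (--b->data->dataref == 0)
        free(b->data);
    b->data = NULL;
}
```

A write through one handle is visible through its clone but not through the copy, so shared data has to be treated as read-only by anyone who does not hold the only reference.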
skb_shared_info — at skb->end

How one kernel runs many isolated stacks

struct net — a whole network stack, instanced

A network namespace is an independent copy of the entire stack: its own interfaces, routing tables, conntrack, port space, and /proc/net. The whole networking code is namespace-aware — functions carry or derive a struct net.

  • Containers are netns. Each Docker container or Kubernetes pod gets its own network namespace, which is why two containers can both bind port 80.
  • veth pairs are virtual cables stitching namespaces together, usually via a bridge or to the host.
  • Same kernel, same code paths — just keyed by which net the packet belongs to.

This is the substrate under all of container networking — and a thing NetWatch must reason about when attributing flows on a containerized host.

struct net (host)
veth ↕
struct net (container A) — own ports, routes, iface
struct net (container B) — own ports, routes, iface

One kernel, many parallel stacks. The 4-tuple that identifies a socket is only unique within a namespace.

Reading the stack for performance

The handful of places latency and CPU actually go

The user⇄kernel copy

sendmsg/recvmsg copy data across the boundary. Zero-copy paths — sendfile, MSG_ZEROCOPY, io_uring — exist precisely to dodge it.

The established-hash lookup

Every single packet of every connection hashes its 4-tuple into ehash. It's fast, but it's per-packet and on the hottest path.

Softirq budget & RSS/RPS

RX work runs in softirq on the CPU that took the IRQ. RSS (hardware) and RPS (software) spread flows across cores so one CPU isn't the bottleneck.

GRO / GSO amortization

Coalescing turns N trips up/down the stack into one. Disable it and per-packet overhead dominates at high rates.

Bufferbloat / AQM

Oversized queues add latency under load. fq_codel and friends drop early to keep queues — and lag — short.

Cache locality

Per-CPU structures and careful skb field layout exist so the hot path touches as few cache lines as possible.

When a network "feels slow," it's usually one of these — and each is observable with the very tools (ftrace, perf, eBPF) hooked at the layers this deck walked.

Where to look — and your own way in

The map, in real files — and NetWatch's Phase-2 hooks on it

The whole deck lives in a handful of directories. Open these alongside it:

  • include/linux/skbuff.h · include/net/sock.h — the two structures.
  • net/core/dev.c — NAPI, netif_receive_skb, dev_queue_xmit.
  • net/ipv4/{ip_input,ip_output,tcp_input,tcp_output,tcp_ipv4}.c — IP & TCP.
  • net/netfilter/ · net/sched/ — hooks & qdiscs.

NetWatch's eBPF Phase 2 is three new taps — each lands at a layer you now know:

tcp_v6_connect — net/ipv6/tcp_ipv6.c (IPv6 connect)
udp_sendmsg — net/ipv4/udp.c (QUIC/UDP attribution)
inet_sock_set_state (catch short-lived flows by state change)

Each is the same question this deck answered: which function sees the event, what's populated when it fires, what's the stable key. Adding one is a real first patch.

The whole map, once more

From a NIC's ring buffer to recv() — and back

The map · Act I

A layered ladder, climbed by RX and descended by TX, built on two structures: sk_buff (the packet) and struct sock (the connection).

Receive · Act II

NIC → IRQ → NAPI poll → skb → GRO → ip_rcv → tcp_v4_rcv → socket lookup → wake recv(). Headers move by sliding pointers, never copying.

Transmit & TCP · Act III

tcp_sendmsg → cwnd → IP → qdisc → driver; and the TCP state machine from SYN_SENT to TIME_WAIT.

Control points · Act IV

netfilter's five hooks, conntrack/NAT, qdiscs, and the six eBPF tap points — XDP to cgroup — that observe or change each layer.

Depth · Act V

Shared/paged skbs and GRO/GSO; namespaces as parallel stacks; where the CPU and latency really go.

Your way in

Open net/core/dev.c and follow one packet. Then add a NetWatch Phase-2 hook — and you're reading the stack like a contributor.

The stack is big, but it's just this ladder, walked carefully.