Where your packets actually go
Inside the Linux network stack.
From a frame landing in a NIC's ring buffer to recv() waking your
process — and all the way back out. We follow a packet through every layer, take apart the two structures the
whole stack is built on (sk_buff and struct sock), and mark
exactly where eBPF — including NetWatch's kprobe — taps in.
Five acts · ~21 stops · drive the sk_buff pointers, follow a packet up the stack, step the TCP handshake, traverse the netfilter hooks.
One packet, many layers
The stack is a pipeline you can name end to end
Every received packet climbs the same ladder; every sent one descends it. Each rung is a real subsystem with real entry points — and a place where you can observe or intercept.
- Driver / NAPI: DMA, interrupts, polling. XDP hooks here, pre-skb.
- Link → IP: netif_receive_skb → ip_rcv. tc-BPF & netfilter hook here.
- Transport: tcp_v4_rcv / udp_rcv, socket lookup. kprobes (NetWatch) hook here.
- Socket → app: receive queue, recv() wakes.
Same ladder both ways. Learn the RX climb (Act II); the TX descent (Act III) is its mirror.
Learn these two and everything else is verbs on them
sk_buff carries the packet; struct sock is the connection
Almost every function in the stack takes an skb, a
sock, or both. The skb is one packet in flight; the
sock is the kernel side of a socket.
- sk_buff: a packet plus metadata. Passed by pointer up and down; headers are added/stripped by sliding pointers, not copying (Act II).
- struct sock: receive/send queues, connection state, the 4-tuple. Its sock_common head holds skc_daddr / skc_dport, the very fields NetWatch's kprobe reads.
Everything below is how these two move through the layers.
From wire to softirq
Interrupts get you in the door; NAPI keeps the door from jamming
At line rate, one interrupt per packet would melt the CPU. The kernel takes one interrupt, then switches to polling under load — that's NAPI.
- DMA + IRQ. The NIC copies the frame into an RX ring in memory, then raises an interrupt.
- Top half (tiny). The IRQ handler masks further RX interrupts and calls napi_schedule(), raising NET_RX_SOFTIRQ.
- NAPI poll. net_rx_action runs the driver's poll(), which drains up to a budget (64 by default) of packets per call; once the ring is empty, the driver re-enables IRQs.
- ksoftirqd. If softirqs flood, they're handed to a per-CPU kernel thread so userspace isn't starved.
This split — fast hardirq, deferred softirq, polled under load — is why Linux can saturate a 10/40/100G link without livelock.
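The budgeted poll can be sketched as a toy userspace model. This is an illustration only, not driver code: napi_poll, the deque standing in for the RX ring, and the 150-frame burst are all assumptions made for the example.

```python
from collections import deque

NAPI_WEIGHT = 64  # per-poll packet budget, mirroring the default named above


def napi_poll(rx_ring, budget=NAPI_WEIGHT):
    """Drain up to `budget` packets; report whether IRQs may be re-enabled."""
    done = 0
    while rx_ring and done < budget:
        rx_ring.popleft()  # stand-in for building an skb and passing it up
        done += 1
    # Using less than the full budget means the ring is empty: stop polling
    # and re-enable the interrupt. Otherwise stay scheduled and poll again.
    irq_reenabled = done < budget
    return done, irq_reenabled


ring = deque(range(150))  # a burst of 150 frames, one interrupt
rounds = []
while True:
    done, irq_on = napi_poll(ring)
    rounds.append(done)
    if irq_on:
        break

print(rounds)  # [64, 64, 22]
```

One interrupt, three polls: the burst is absorbed without taking 150 interrupts.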
Headers without copying — just move pointers
Drive the four pointers: head · data · tail · end
An skb wraps one buffer with four pointers.
Adding a header (skb_push) slides data left into the
headroom; stripping one (skb_pull) slides it right. No payload is ever copied.
Build a packet TX-style, then strip it RX-style.
one buffer, sliding pointers
headroom = data − head; tailroom = end − tail.
Drivers reserve headroom up front so every layer can prepend its header by just moving a pointer.
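A minimal sketch of the four-pointer idea, assuming a Python bytearray as the buffer and integer offsets as the pointers. The class and method names are hypothetical; they echo skb_reserve, skb_put, skb_push and skb_pull but are not the kernel API.

```python
class SkBuff:
    """Toy model: one buffer plus four integer offsets into it."""

    def __init__(self, size, headroom):
        self.buf = bytearray(size)
        self.head = 0
        self.end = size
        # Like skb_reserve(): start empty, with headroom set aside up front.
        self.data = self.tail = headroom

    def headroom(self):
        return self.data - self.head

    def tailroom(self):
        return self.end - self.tail

    def put(self, payload):
        """Append at the tail: the data area grows to the right."""
        assert len(payload) <= self.tailroom()
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header):
        """Prepend a header: data slides left into the headroom."""
        assert len(header) <= self.headroom()
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n):
        """Strip n bytes from the front: data slides right."""
        assert n <= self.tail - self.data
        self.data += n

    def contents(self):
        return bytes(self.buf[self.data:self.tail])


# TX-style: payload written once, then three headers pushed in front of it.
skb = SkBuff(size=128, headroom=14 + 20 + 20)
skb.put(b"hello")
skb.push(b"T" * 20)   # TCP header
skb.push(b"I" * 20)   # IP header
skb.push(b"E" * 14)   # Ethernet header
full_len = len(skb.contents())

# RX-style: each layer strips its own header by moving data right.
for hdr_len in (14, 20, 20):
    skb.pull(hdr_len)
```

The payload bytes are written exactly once, by put(); every later step only changes an offset.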
One packet, all the way up
From DMA to recv() — step it
Follow a single TCP segment from the wire to the application. Each step is a real kernel entry point; the data structure it touches is named as it goes.
inbound: 142.250.72.4:443 → :52344
The TX path (Act III) is this exact ladder in reverse: tcp_sendmsg →
ip_queue_xmit → qdisc → ndo_start_xmit → DMA out.
How the stack decides where a packet goes
Two demuxes and a hash table
Climbing the stack is a sequence of "which handler owns this?" decisions, each a fast lookup.
- L2 → L3 by EtherType. __netif_receive_skb_core dispatches on skb->protocol (0x0800 → ip_rcv) via the registered packet_type list.
- L3 → L4 by protocol number. The IP header's protocol field (6 → TCP) selects tcp_v4_rcv.
- L4 → socket by 4-tuple. __inet_lookup hashes (saddr, sport, daddr, dport) into the established table; misses fall to the listen table (a new connection).
That established hash is the hot path — every packet of every connection hits it.
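The first two demuxes can be sketched against raw bytes. This is a simplified model: the handler tables hold only names, the frame is hand-built for the example, and IP options, VLAN tags and validation are ignored.

```python
import struct

ETH_P_IP = 0x0800
IPPROTO_TCP, IPPROTO_UDP = 6, 17

# Handler tables keyed the way the text describes; values are just names here.
L3_HANDLERS = {ETH_P_IP: "ip_rcv"}
L4_HANDLERS = {IPPROTO_TCP: "tcp_v4_rcv", IPPROTO_UDP: "udp_rcv"}


def demux(frame):
    """Return the (L3 handler, L4 handler) names a frame would reach."""
    # Ethernet header: 6-byte dst MAC, 6-byte src MAC, 2-byte EtherType.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    l3 = L3_HANDLERS.get(ethertype)
    if l3 is None:
        return None, None
    # The IPv4 header starts at offset 14; its protocol field is byte 9.
    protocol = frame[14 + 9]
    return l3, L4_HANDLERS.get(protocol)


eth = bytes(12) + struct.pack("!H", ETH_P_IP)
ip = bytearray(20)
ip[0] = 0x45           # version 4, header length 5 words
ip[9] = IPPROTO_TCP
frame = eth + bytes(ip)

print(demux(frame))    # ('ip_rcv', 'tcp_v4_rcv')
```

Each step is a constant-time table lookup on one header field, which is why the climb stays cheap.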
The same ladder, descending
From send() down to the wire
Sending mirrors receiving. Your bytes are copied into skbs, headers
are pushed on at each layer, and the packet is handed to the driver — gated by congestion control and
queueing on the way.
- tcp_sendmsg copies user data into skbs on the sk_write_queue.
- tcp_write_xmit sends only what the congestion window (cwnd) allows; TCP's flow/congestion control lives here.
- ip_queue_xmit adds the IP header, routes, runs LOCAL_OUT + POSTROUTING, resolves the next hop (ARP).
- dev_queue_xmit hands off to the qdisc, which schedules the actual ndo_start_xmit into the NIC.
Two gates the RX path doesn't have: congestion control (how much TCP may send) and the qdisc (when the link actually accepts it). Act IV opens the qdisc.
The connection, and how the kernel finds it
Every packet is matched to a struct sock by its 4-tuple
A struct sock is the kernel half of a socket: its queues, its state,
its identity. The stack finds the right one with two hash tables.
- Established hash (ehash): keyed on the full 4-tuple (saddr, sport, daddr, dport). The hot path: every data packet looks up here.
- Listen hash: keyed on local port; a SYN that misses ehash matches a listener and starts a new connection.
- sock_common is the head: skc_daddr, skc_dport, skc_state. The exact fields NetWatch's kprobe reads.
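The two-table lookup can be modelled with two dictionaries. This is a sketch under loud assumptions: the addresses are made up, sockets are plain strings, and the real tables also handle wildcard binds, SO_REUSEPORT and namespaces, none of which appear here.

```python
ehash = {}   # (saddr, sport, daddr, dport) -> established socket
lhash = {}   # local port -> listening socket


def lookup(saddr, sport, daddr, dport):
    """Established table first; on a miss, fall back to a listener on dport."""
    sk = ehash.get((saddr, sport, daddr, dport))
    if sk is not None:
        return sk
    return lhash.get(dport)


lhash[443] = "listener:443"
pkt = ("10.0.0.7", 52344, "10.0.0.1", 443)   # hypothetical client and server

first = lookup(*pkt)                  # SYN: misses ehash, matches the listener
ehash[pkt] = "established:10.0.0.7:52344"    # handshake completes
second = lookup(*pkt)                 # later segments hit ehash directly
```

After the handshake the listener is never consulted again for this flow; only the first packet pays for the fallback.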
A connection is a walk through states
Step the handshake — and the teardown
skc_state moves through a fixed state machine,
driven by segments arriving and the app calling connect/close.
Pick a scenario and step it; the current state is what ss and NetWatch report.
TCP state transitions
TIME_WAIT (2×MSL) is the famous one — the active closer lingers so a
stray retransmitted FIN can't corrupt a new connection on the same tuple.
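The state walk fits in a transition table. This is the textbook subset only, with hypothetical event labels; simultaneous open/close, CLOSING and resets are left out, and the kernel's handling of SYN_RECV on the listener side differs in detail.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("CLOSED", "app:connect"): "SYN_SENT",
    ("SYN_SENT", "rcv:SYN+ACK"): "ESTABLISHED",
    ("LISTEN", "rcv:SYN"): "SYN_RECV",
    ("SYN_RECV", "rcv:ACK"): "ESTABLISHED",
    # Active close: this side calls close() first.
    ("ESTABLISHED", "app:close"): "FIN_WAIT1",
    ("FIN_WAIT1", "rcv:ACK"): "FIN_WAIT2",
    ("FIN_WAIT2", "rcv:FIN"): "TIME_WAIT",
    ("TIME_WAIT", "timeout:2MSL"): "CLOSED",
    # Passive close: the peer's FIN arrives first.
    ("ESTABLISHED", "rcv:FIN"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "app:close"): "LAST_ACK",
    ("LAST_ACK", "rcv:ACK"): "CLOSED",
}


def run(state, events):
    """Return the list of states visited, starting from `state`."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path


client = run("CLOSED", ["app:connect", "rcv:SYN+ACK", "app:close",
                        "rcv:ACK", "rcv:FIN", "timeout:2MSL"])
server = run("LISTEN", ["rcv:SYN", "rcv:ACK", "rcv:FIN",
                        "app:close", "rcv:ACK"])
```

Only the side that closes first passes through TIME_WAIT; the passive closer goes CLOSE_WAIT → LAST_ACK → CLOSED.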
Where firewalling and NAT actually happen
Five hooks, one routing decision
netfilter places five hook points along the IP path; iptables/nftables rules
and conntrack/NAT run at them. Whether a packet sees LOCAL_IN or
FORWARD depends on the routing decision in the middle.
inbound → … → outbound
Each hook returns a verdict: ACCEPT · DROP ·
QUEUE (to userspace) · STOLEN · REPEAT.
Perimeter firewalls do the same job with equivalent machinery (pfSense/OPNsense use FreeBSD's pf rather than netfilter), but at the perimeter they are blind to the per-process view a host tool has.
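Which hooks a packet meets follows directly from where it came from and where routing sends it. A small sketch, assuming hypothetical helper names hook_path and traverse, and ignoring loopback delivery and the QUEUE / STOLEN / REPEAT verdicts:

```python
def hook_path(origin, dst_is_local=False):
    """Hooks an IPv4 packet traverses, in order."""
    if origin == "local":                     # generated by a local socket
        return ["LOCAL_OUT", "POSTROUTING"]
    path = ["PREROUTING"]                     # arrived on an interface
    if dst_is_local:
        path.append("LOCAL_IN")               # routing decision: for this host
    else:
        path += ["FORWARD", "POSTROUTING"]    # routing decision: send onward
    return path


def traverse(hooks, rules):
    """Apply per-hook verdicts; stop at the first hook that does not ACCEPT."""
    for hook in hooks:
        verdict = rules.get(hook, "ACCEPT")
        if verdict != "ACCEPT":
            return hook, verdict
    return None, "ACCEPT"


inbound = hook_path("wire", dst_is_local=True)
forwarded = hook_path("wire", dst_is_local=False)
outbound = hook_path("local")
dropped_at = traverse(forwarded, {"FORWARD": "DROP"})
```

No single packet sees all five hooks: a forwarded packet sees three, a delivered or locally generated one sees two.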
The egress gatekeeper
qdiscs decide when a packet leaves
Between dev_queue_xmit and the driver sits the queueing
discipline — the kernel's traffic scheduler. Every device has a root qdisc.
- Classless (fq_codel, the default on many modern distributions): fair queueing plus active queue management to fight bufferbloat.
- Classful (HTB): hierarchical bandwidth shaping with rate limits, borrowing, priorities.
- enqueue / dequeue. Packets are enqueued; the qdisc chooses dequeue order, then sch_direct_xmit pushes to the NIC.
- tc-BPF egress and BQL (byte queue limits) hook here too.
This is where shaping, prioritization, and AQM live — and where a misconfigured queue adds the latency users feel as "lag."
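The enqueue/dequeue split is the whole interface: arrival order goes in, the discipline's order comes out. The sketch below is plain per-flow round robin, not fq_codel (no byte quantum, no CoDel dropping); the class name and flow keys are invented for the example.

```python
from collections import OrderedDict, deque


class RoundRobinQdisc:
    """Toy fair queue: one FIFO per flow, served one packet at a time in turn."""

    def __init__(self):
        self.flows = OrderedDict()   # flow key -> deque of packets

    def enqueue(self, flow, pkt):
        self.flows.setdefault(flow, deque()).append(pkt)

    def dequeue(self):
        if not self.flows:
            return None
        flow, queue = next(iter(self.flows.items()))
        pkt = queue.popleft()
        del self.flows[flow]
        if queue:
            self.flows[flow] = queue   # flow goes to the back of the rotation
        return pkt


q = RoundRobinQdisc()
for pkt in ("A1", "A2", "A3"):         # a bulk flow queues three packets
    q.enqueue("A", pkt)
q.enqueue("B", "B1")                   # a sparse flow arrives last

order = [q.dequeue() for _ in range(4)]
print(order)  # ['A1', 'B1', 'A2', 'A3']
```

B1 arrived last but leaves second: the sparse flow is not stuck behind the bulk flow's backlog, which is the fairness half of what fq_codel provides.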
Stateful, because the kernel remembers flows
Connection tracking turns a packet filter into a firewall
Raw netfilter sees packets; conntrack sees flows. It records each connection's tuple and state, which is what makes stateful firewalling and NAT possible.
- State. Every flow is NEW, ESTABLISHED, RELATED, or INVALID, so "allow replies to connections I started" becomes one rule.
- NAT. Rewriting addresses/ports: DNAT at PREROUTING (port forwarding), SNAT/masquerade at POSTROUTING (sharing one public IP).
- The cost. Per-flow state in the conntrack table: finite, and a real tuning knob on busy gateways.
Your code, at the layer that matters
Six places to attach a program — pick the one that sees your event
Every layer this deck walked is also an eBPF attach point. The earlier you hook, the faster and rawer; the later, the more context.
driver → … → socket
NetWatch uses a kprobe on tcp_v4_connect — late enough to know the
process and destination, early enough to catch the connection's birth. Phase 2 adds more (Act V).
Why the stack barely copies anything
Shared buffers, paged data, and one big skb for many packets
Performance comes from not copying. An skb's payload can live in
unmapped pages, be shared by reference, and represent many packets at once.
- Linear + paged. A small linear part plus frags[] pointing at page-cache pages; that's how sendfile ships a file without copying it.
- skb_clone vs skb_copy. Clone shares the data buffer (bumps dataref) and allocates only a new skb head; copy is a true deep copy.
- GRO (RX) / GSO·TSO (TX). Coalesce many segments into one big skb going up; keep one big skb going down and segment at the very last step (the NIC, ideally). Per-packet cost amortized away.
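The clone/copy distinction in miniature. The Python functions below borrow the kernel names for readability but are only a model: Data and Skb are hypothetical classes, and dataref here is a plain counter.

```python
class Data:
    """The data buffer with its reference count."""

    def __init__(self, payload):
        self.buf = bytearray(payload)
        self.dataref = 1


class Skb:
    """The skb head: per-packet metadata pointing at a data buffer."""

    def __init__(self, data):
        self.data = data


def skb_clone(skb):
    skb.data.dataref += 1             # new head, same buffer
    return Skb(skb.data)


def skb_copy(skb):
    return Skb(Data(skb.data.buf))    # new head and a private buffer


orig = Skb(Data(b"abc"))
clone = skb_clone(orig)
copy = skb_copy(orig)

orig.data.buf[0] = ord("X")           # write through the original
print(bytes(clone.data.buf), bytes(copy.data.buf))  # b'Xbc' b'abc'
```

The clone sees the write because it shares the buffer; the copy does not. That sharing is why a cloned skb's data must be treated as read-only.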
How one kernel runs many isolated stacks
struct net — a whole network stack, instanced
A network namespace is an independent copy of the entire stack: its own
interfaces, routing tables, conntrack, port space, and /proc/net. The whole
networking code is namespace-aware — functions carry or derive a struct net.
- Containers are netns. Each Docker container or Kubernetes pod gets its own by default; that's why two containers can both bind port 80.
- veth pairs are virtual cables stitching namespaces together, usually via a bridge or to the host.
- Same kernel, same code paths, just keyed by which net the packet belongs to.
This is the substrate under all of container networking — and a thing NetWatch must reason about when attributing flows on a containerized host.
One kernel, many parallel stacks. The 4-tuple that identifies a socket is only unique within a namespace.
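"Keyed by namespace" can be shown with a bind table whose key includes the namespace. A toy model only: the namespace names, owners and the bind helper are invented, and real binds also consider address, protocol and SO_REUSEPORT.

```python
bound = {}   # (netns, port) -> owner


def bind(netns, port, owner):
    """Claim a port; uniqueness is checked per namespace, not globally."""
    key = (netns, port)
    if key in bound:
        raise OSError(f"port {port} already in use in {netns}")
    bound[key] = owner


bind("ns-a", 80, "web1")
bind("ns-b", 80, "web2")        # same port, different namespace: fine

try:
    bind("ns-a", 80, "web3")    # same port, same namespace: refused
except OSError as exc:
    clash = str(exc)
```

Any tool that attributes flows by port or 4-tuple alone would conflate web1 and web2; the namespace has to be part of the key.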
Reading the stack for performance
The handful of places latency and CPU actually go
The user⇄kernel copy
sendmsg/recvmsg
copy data across the boundary. Zero-copy paths — sendfile,
MSG_ZEROCOPY, io_uring — exist precisely to dodge it.
The established-hash lookup
Every single packet of every connection hashes its
4-tuple into ehash. It's fast, but it's per-packet and on the hottest path.
Softirq budget & RSS/RPS
RX work runs in softirq on the CPU that took the IRQ. RSS (hardware) and RPS (software) spread flows across cores so one CPU isn't the bottleneck.
GRO / GSO amortization
Coalescing turns N trips up/down the stack into one. Disable it and per-packet overhead dominates at high rates.
Bufferbloat / AQM
Oversized queues add latency under load. fq_codel
and friends drop early to keep queues — and lag — short.
Cache locality
Per-CPU structures and careful skb field
layout exist so the hot path touches as few cache lines as possible.
When a network "feels slow," it's usually one of these — and each is observable with the very tools (ftrace, perf, eBPF) hooked at the layers this deck walked.
Where to look — and your own way in
The map, in real files — and NetWatch's Phase-2 hooks on it
The whole deck lives in a handful of directories. Open these alongside it:
- include/linux/skbuff.h · include/net/sock.h: the two structures.
- net/core/dev.c: NAPI, netif_receive_skb, dev_queue_xmit.
- net/ipv4/{ip_input,ip_output,tcp_input,tcp_output,tcp_ipv4}.c: IP & TCP.
- net/netfilter/ · net/sched/: hooks & qdiscs.
NetWatch's eBPF Phase 2 is three new taps — each lands at a layer you now know:
Each is the same question this deck answered: which function sees the event, what's populated when it fires, what's the stable key. Adding one is a real first patch.
The whole map, once more
From a NIC's ring buffer to recv() — and back
The map Act I
A layered ladder, climbed by RX and
descended by TX, built on two structures: sk_buff (the packet) and
struct sock (the connection).
Receive Act II
NIC → IRQ → NAPI poll → skb → GRO →
ip_rcv → tcp_v4_rcv → socket lookup → wake
recv(). Headers move by sliding pointers, never copying.
Transmit & TCP Act III
tcp_sendmsg
→ cwnd → IP → qdisc → driver; and the TCP state machine from SYN_SENT to
TIME_WAIT.
Control points Act IV
netfilter's five hooks, conntrack/NAT, qdiscs, and the six eBPF tap points — XDP to cgroup — that observe or change each layer.
Depth Act V
Shared/paged skbs and GRO/GSO; namespaces as parallel stacks; where the CPU and latency really go.
Your way in
Open net/core/dev.c and follow one packet.
Then add a NetWatch Phase-2 hook — and you're reading the stack like a contributor.
The stack is big, but it's just this ladder, walked carefully.