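I. What is sk_buff?

Definition: sk_buff (socket buffer, usually abbreviated skb) is the sole carrier of a packet as it moves through the protocol stack in the Linux kernel networking subsystem.

Essence: an skb ties the raw packet bytes to all the metadata that describes them: who sent the packet, who it is for, when it was sent or received, and which layer has parsed it so far. As the kernel-doc comment below says, struct sk_buff itself holds only the metadata; the packet bytes live in associated buffers.

    /**
     * DOC: Basic sk_buff geometry
     *
     * struct sk_buff itself is a metadata structure and does not hold any packet
     * data. All the data is held in associated buffers.
     *
     * &sk_buff.head points to the main "head buffer". The head buffer is divided
     * into two parts:
     *
     *  - data buffer, containing headers and sometimes payload;
     *    this is the part of the skb operated on by the common helpers
     *    such as skb_put() or skb_pull();
     *  - shared info (struct skb_shared_info) which holds an array of pointers
     *    to read-only data in the (page, offset, length) format.
     *
     * Optionally &skb_shared_info.frag_list may point to another skb.
     *
     * Basic diagram may look like this::
     *
     *                                  ---------------
     *                                 | sk_buff       |
     *                                  ---------------
     *     ,---------------------------  + head
     *    /          ,-----------------  + data
     *   /          /      ,-----------  + tail
     *  |          |      |            , + end
     *  |          |      |           |
     *  v          v      v           v
     *   -----------------------------------------------
     *  | headroom | data |  tailroom | skb_shared_info |
     *   -----------------------------------------------
     *                                 + [page frag]
     *                                 + [page frag]
     *                                 + [page frag]
     *                                 + [page frag]       ---------
     *                                 + frag_list    --> | sk_buff |
     *                                                     ---------
     *
     */

    /**
     * struct sk_buff - socket buffer
     * @next: Next buffer in list
     * @prev: Previous buffer in list
     * @tstamp: Time we arrived/left
     * @skb_mstamp_ns: (aka @tstamp) earliest departure time; start point
     *     for retransmit timer
     * @rbnode: RB tree node, alternative to next/prev for netem/tcp
     * @list: queue head
     * @ll_node: anchor in an llist (eg socket defer_list)
     * @sk: Socket we are owned by
     * @dev: Device we arrived on/are leaving by
     * @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL
     * @cb: Control buffer. Free for use by every layer. Put private vars here
     * @_skb_refdst: destination entry (with norefcount bit)
     * @len: Length of actual data
     * @data_len: Data length
     * @mac_len: Length of link layer header
     * @hdr_len: writable header length of cloned skb
     * @csum: Checksum (must include start/offset pair)
     * @csum_start: Offset from skb->head where checksumming should start
     * @csum_offset: Offset from csum_start where checksum should be stored
     * @priority: Packet queueing priority
     * @ignore_df: allow local fragmentation
     * @cloned: Head may be cloned (check refcnt to be sure)
     * @ip_summed: Driver fed us an IP checksum
     * @nohdr: Payload reference only, must not modify header
     * @pkt_type: Packet class
     * @fclone: skbuff clone status
     * @ipvs_property: skbuff is owned by ipvs
     * @inner_protocol_type: whether the inner protocol is
     *     ENCAP_TYPE_ETHER or ENCAP_TYPE_IPPROTO
     * @remcsum_offload: remote checksum offload is enabled
     * @offload_fwd_mark: Packet was L2-forwarded in hardware
     * @offload_l3_fwd_mark: Packet was L3-forwarded in hardware
     * @tc_skip_classify: do not classify packet. set by IFB device
     * @tc_at_ingress: used within tc_classify to distinguish in/egress
     * @redirected: packet was redirected by packet classifier
     * @from_ingress: packet was redirected from the ingress path
     * @nf_skip_egress: packet shall skip nf egress - see netfilter_netdev.h
     * @peeked: this packet has been seen already, so stats have been
     *     done for it, don't do them again
     * @nf_trace: netfilter packet trace flag
     * @protocol: Packet protocol from driver
     * @destructor: Destruct function
     * @tcp_tsorted_anchor: list structure for TCP (tp->tsorted_sent_queue)
     * @_sk_redir: socket redirection information for skmsg
     * @_nfct: Associated connection, if any (with nfctinfo bits)
     * @skb_iif: ifindex of device we arrived on
     * @tc_index: Traffic control index
     * @hash: the packet hash
     * @queue_mapping: Queue mapping for multiqueue devices
     * @head_frag: skb was allocated from page fragments,
     *     not allocated by kmalloc() or vmalloc().
     * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
     * @pp_recycle: mark the packet for recycling instead of freeing (implies
     *     page_pool support on driver)
     * @active_extensions: active extensions (skb_ext_id types)
     * @ndisc_nodetype: router type (from link layer)
     * @ooo_okay: allow the mapping of a socket to a queue to be changed
     * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
     *     ports.
     * @sw_hash: indicates hash was computed in software stack
     * @wifi_acked_valid: wifi_acked was set
     * @wifi_acked: whether frame was acked on wifi or not
     * @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
     * @encapsulation: indicates the inner headers in the skbuff are valid
     * @encap_hdr_csum: software checksum is needed
     * @csum_valid: checksum is already valid
     * @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
     * @csum_complete_sw: checksum was completed by software
     * @csum_level: indicates the number of consecutive checksums found in
     *     the packet minus one that have been verified as
     *     CHECKSUM_UNNECESSARY (max 3)
     * @unreadable: indicates that at least 1 of the fragments in this skb is
     *     unreadable.
     * @dst_pending_confirm: need to confirm neighbour
     * @decrypted: Decrypted SKB
     * @slow_gro: state present at GRO time, slower prepare step required
     * @tstamp_type: When set, skb->tstamp has the
     *     delivery_time clock base of skb->tstamp.
     * @napi_id: id of the NAPI struct this skb came from
     * @sender_cpu: (aka @napi_id) source CPU in XPS
     * @alloc_cpu: CPU which did the skb allocation.
     * @secmark: security marking
     * @mark: Generic packet mark
     * @reserved_tailroom: (aka @mark) number of bytes of free space available
     *     at the tail of an sk_buff
     * @vlan_all: vlan fields (proto & tci)
     * @vlan_proto: vlan encapsulation protocol
     * @vlan_tci: vlan tag control information
     * @inner_protocol: Protocol (encapsulation)
     * @inner_ipproto: (aka @inner_protocol) stores ipproto when
     *     skb->inner_protocol_type == ENCAP_TYPE_IPPROTO;
     * @inner_transport_header: Inner transport layer header (encapsulation)
     * @inner_network_header: Network layer header (encapsulation)
     * @inner_mac_header: Link layer header (encapsulation)
     * @transport_header: Transport layer header
     * @network_header: Network layer header
     * @mac_header: Link layer header
     * @kcov_handle: KCOV remote handle for remote coverage collection
     * @tail: Tail pointer
     * @end: End pointer
     * @head: Head of buffer
     * @data: Data head pointer
     * @truesize: Buffer size
     * @users: User count - see {datagram,tcp}.c
     * @extensions: allocated extensions, valid if active_extensions is nonzero
     */

II. What is its core purpose?

1. Eliminating memory copies (the basis of zero-copy)

From the NIC driver up to the socket layer, the packet data is stored in memory only once. Each time the stack hands the packet up one layer (MAC -> IP -> TCP), it only moves a pointer inside the skb, so that the next protocol header is "exposed" to the function that handles it.

2. Managing fragmented memory

A large packet, say 4 KB, may sit in physically non-contiguous pieces of memory. The skb's shared info area (shinfo) ties these scattered blocks together, through the frags[] array and the frag_list chain, while the rest of the stack still treats them as one complete packet.

3. Multi-core synchronization and reuse

A packet may be handled by several cores inside the kernel, for example one core receives it and another forwards it. The skb uses reference counting to make sure the memory is released only when nobody needs the packet any more, which prevents dangling pointers from crashing the system.

III. The "five big questions" you must fully understand by reading the source

These are the answers you will have to dig out of the source code in interviews, or at work when you debug crashes, packet loss, or high latency. Commit these five questions to memory.

1. Reserved space: why do head and data not point to the same place?

Source details: look at how alloc_skb() allocates the buffer and what skb_reserve() does.

What you need to understand: why does the kernel reserve headroom?

Answer: so that when the stack prepends a MAC header or a VLAN tag to the packet, it does not have to move the whole packet. It writes the bytes straight into the reserved space, which is extremely efficient.

2. Pointer moves: how do the four pointers "peel" and "wrap" a packet?

Source logic: compare the implementations of skb_push(), skb_pull(), and skb_put().

What you need to understand: when a packet passes from the router into the firewall, how does a pointer offset "logically delete" the MAC header while the memory itself is "physically untouched"?

3. Layer indexing: how do you locate the TCP header instantly?

Source details: look at the transport_header and network_header members.

What you need to understand: in modern kernels (including 5.x) these fields do not store pointers; they store 16-bit offsets relative to head. Why offsets?

Answer: a 16-bit offset takes 2 bytes where a pointer takes 8 bytes on a 64-bit system, which keeps struct sk_buff small. Offsets also stay valid when the head buffer is reallocated, and they make calculations across memory blocks easier.

4. Reference counting: the life-and-death difference between skb_clone and skb_copy

Source details: search for the implementations of __skb_clone() and pskb_copy().

What you need to understand:

Clone: copies only the management header (struct sk_buff) and shares the data area. It is very fast, but if you modify the data, everyone sees the change.

Copy: duplicates the data area as well. Slow, but safe.

Interview classic: when capturing packets for monitoring (tcpdump), which one does the kernel use, and why?

5. Checksum offload: how does hardware take load off the CPU?

Source details: the union of csum with csum_start/csum_offset, together with ip_summed.

What you need to understand: if ip_summed is set to CHECKSUM_PARTIAL, does the kernel have to compute the TCP checksum itself before handing the packet to the NIC driver?

Answer: no. It computes only the pseudo-header checksum and leaves the rest to the NIC hardware, which uses csum_start and csum_offset to find where to start and where to store the result.

IV. The life cycle of an skb

To use it well at work, remember this flow:

Birth: the NIC receives a signal -> DMA is triggered -> the driver allocates space through the alloc_skb() family (for example netdev_alloc_skb()) -> the data is filled in.

Growth: the skb enters the protocol stack through netif_receive_skb() -> the MAC header is removed with skb_pull() (in eth_type_trans(), before the IP layer sees the packet) -> the IP header is removed with skb_pull() before the packet reaches the TCP layer.

Processing: it may be forwarded to several interfaces with skb_clone(), or duplicated with skb_copy() in Netfilter so that it can be modified.

Death: the data is copied into the user-space recv buffer, or sent out by the NIC -> kfree_skb() releases the memory (consume_skb() is its counterpart for packets that were not dropped).

V. In practice: how to capture skbs at work

eBPF (a must-have at modern large companies, extremely important)

This is the hottest technique right now. You write a short piece of C-like code, inject it dynamically into the running kernel, and pull out the internal state of an skb.

Tool: pwru (Packet Where Are You).

What it does: it follows an skb through every kernel function it passes and shows its state.

The commands below use bpftrace, which is a lighter way to try the same idea:

    # Install the tool
    sudo apt update
    sudo apt install bpftrace

    sudo bpftrace -e 'kprobe:ip_rcv { printf("Packet CPU:%d, skb_ptr:%p, len:%d\n", cpu, arg0, ((struct sk_buff *)arg0)->len); }'

Output:

    Packet CPU:1, skb_ptr:0xffff9ef41cff5700, len:52
    Packet CPU:4, skb_ptr:0xffff9ef360166ae8, len:90
    Packet CPU:2, skb_ptr:0xffff9ef43a3764e8, len:309
    Packet CPU:3, skb_ptr:0xffff9ef43b2208e8, len:274
    Packet CPU:4, skb_ptr:0xffff9ef3601678e8, len:90
    Packet CPU:2, skb_ptr:0xffff9ef43a377ee8, len:257
    Packet CPU:1, skb_ptr:0xffff9ef41cff5700, len:52
    Packet CPU:3, skb_ptr:0xffff9ef43b2236e8, len:326
    Packet CPU:4, skb_ptr:0xffff9ef360166ae8, len:90
    Packet CPU:2, skb_ptr:0xffff9ef43a374ae8, len:309
    Packet CPU:0, skb_ptr:0xffff9ef453f34ae8, len:111
    Packet CPU:1, skb_ptr:0xffff9ef4d873cae8, len:109

Filtering by packet content: this has to be run in a virtual machine or on a native Ubuntu system. My WSL2 setup does not have the skbuff.h header yet, so I have not run the script below.

    sudo bpftrace -e '
    #include <linux/skbuff.h>
    #include <linux/ip.h>

    kprobe:ip_rcv
    {
        $skb = (struct sk_buff *)arg0;
        // network_header is an offset from head, not a pointer
        $iph = (struct iphdr *)($skb->head + $skb->network_header);

        // Filter: only print packets whose destination IP is 192.168.1.100.
        // daddr is in network byte order; on a little-endian host that
        // address reads as 0x6401a8c0. The raw value is compared because
        // ntop() output is meant for printing, not for string comparison.
        if ($iph->daddr == 0x6401a8c0) {
            // ntop converts the 32-bit address into printable form
            printf("Time:%ld | SRC:%s -> DST:%s | skb:%p | len:%d\n",
                   elapsed, ntop($iph->saddr), ntop($iph->daddr),
                   $skb, $skb->len);
        }
    }'

Reading the code

    void *skb_put(struct sk_buff *skb, unsigned int len)
    {
        void *tmp = skb_tail_pointer(skb);  /* remember the old tail */

        SKB_LINEAR_ASSERT(skb);             /* only valid for linear skbs */
        skb->tail += len;                   /* grow the data area at the tail */
        skb->len  += len;
        if (unlikely(skb->tail > skb->end)) /* ran past the buffer: bug */
            skb_over_panic(skb, len, __builtin_return_address(0));
        return tmp;                         /* caller writes the new bytes here */
    }

1. Clearing up a confusion: what is unlikely?

Checks such as unlikely(skb->tail > skb->end) above, or unlikely(len > skb->len) in skb_pull(), use a compiler optimization hint. unlikely() is a macro built on __builtin_expect().

How it works: the CPU fetches instructions ahead of time and predicts branches. unlikely() tells the compiler that the condition will most probably be false, for example that len normally does not exceed the length of the skb.

Result: the compiler moves the code that runs when the if is taken away from the hot path, and keeps the normal processing logic on the straight-line path that the CPU pipeline fetches most easily.

Why: on a 10-gigabit network every nanosecond is precious. Since we assume by default that the program is healthy, "exception handling" should make way for "normal forwarding".

skb_shared_info

1. What is the fragment information (skb_shared_info)?

Networking has the concept of the MTU (maximum transmission unit), usually 1500 bytes.

Problem: what if the application sends a 4000-byte packet and the data area handed out by the memory allocator (kmalloc) cannot hold it?

Solution: the kernel hangs the remaining data in an "attachment". skb_shared_info is that attachment folder. It sits right behind the data buffer, starting at the address skb->end points to. It records:

frags[]: descriptors, in (page, offset, length) form, of the other memory blocks that hold the fragments.

gso_size / gso_segs: the GSO segmentation offload fields. They tell the NIC: "I am giving you one big packet; cut it into small packets of this size yourself and send them out."

2. What is a cache line?

This is computer architecture applied inside the kernel.

Physical nature: when the CPU reads data from memory it does not read one byte at a time. It reads a whole block at once, typically 64 bytes, and this block is called a cache line.

The false sharing problem: if two cores frequently read and write different variables inside the same 64-byte block, the cache keeps being invalidated and performance collapses.

CPUs keep their caches coherent according to a protocol, usually MESI: once a core modifies even a single byte of a cache line, every copy of that entire 64-byte line held by the other cores becomes Invalid.

Scenario: suppose two variables sit next to each other in the same 64 bytes of sk_buff:

Variable A: core 1 reads it frequently.

Variable B: core 2 modifies it frequently.

Core 1 never cares about variable B, but A and B live in the same "house" (the same 64-byte cache line). Every time core 2 modifies B, the hardware tells core 1: "your copy of this line is stale, throw it away". The next time core 1 reads A it finds the line gone and has to reload it from the much slower memory.

That is false sharing: two cores doing unrelated work drag each other down because their data is physically too close together.

3. Why is it hidden "at the tail end"?

The core question: why is skb_shared_info not written directly into the definition of struct sk_buff, and reached instead through an offset, skb_end_pointer(SKB)?

Reason A: cache priority (the most important reason)

struct sk_buff is the region the CPU reads and writes most often (all the pointers and flag bits), whereas the fragment information in skb_shared_info is mainly needed when handling oversized or fragmented packets.

If the fragment information lived inside struct sk_buff, it would bloat the structure, and each cache line would hold very little useful information.

Hiding the fragment information at the end of the data buffer keeps struct sk_buff itself slim enough to stay in the CPU's L1 cache as much as possible.

1. Why must it stay in the L1 cache?

Think of the CPU as a super factory and the data as raw material.

L1 cache: the worker's pocket. Within arm's reach and extremely fast, but tiny (usually only 32 KB to 64 KB per core).

L2 cache: the worker's desk. A little slower, but it holds more.

L3 cache: the shared shelves on the shop floor. Several workers (cores) share them, and they are slower still.

Memory (RAM): the warehouse outside the factory. One trip there costs the CPU several hundred cycles of waiting, and efficiency collapses.

Reason B: uniform memory management

For every skb, large or small, the size of the sk_buff structure is fixed. The size of the data buffer, however, is dynamic (from 128 bytes to several KB).

By placing shinfo at the end of the data buffer, the kernel gets the rule "whoever allocates the data area also carries the attachment bag". The management structure, sk_buff, can then stay extremely simple.

4. The core of high-performance forwarding: GSO (Generic Segmentation Offload)

    /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to
     * segment using its current segmentation instead.
     */

Software view: the protocol stack passes one huge 64 KB packet all the way down without cutting it up at the IP layer, which saves a great deal of per-packet header processing.

Hardware view: only when the packet is about to reach the NIC driver is gso_size consulted. If the NIC supports segmentation offload, the hardware cuts the packet up in a very short time; otherwise the GSO code segments it in software at that last moment.

That is the secret of the high performance: segment as late as possible, and let the hardware do the work.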
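To make the last two ideas concrete, here is a minimal userspace sketch of the geometry described above: a single allocation holds the data buffer plus a trailing shared-info block that is reached through the end pointer, and the segmentation plan is only recorded, not executed. Every toy_ name is hypothetical and exists only for this illustration. The struct layouts are drastically simplified, the payload sits in the linear buffer (a real GSO packet usually keeps it in page fragments), and the segment count is assumed to be the payload length divided by the segment size, rounded up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Userspace toy model: names mirror the kernel's, layouts do not. */
struct toy_shared_info {
    unsigned short gso_size;  /* payload bytes per segment (e.g. the MSS) */
    unsigned short gso_segs;  /* how many segments the packet will become */
};

struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid packet bytes */
    unsigned char *tail;  /* end of the valid packet bytes */
    unsigned char *end;   /* end of the data buffer; shared info starts here */
    unsigned int len;     /* tail - data */
};

/* Like skb_shinfo(): the shared info is found via 'end', not via a member. */
static struct toy_shared_info *toy_shinfo(const struct toy_skb *skb)
{
    return (struct toy_shared_info *)skb->end;
}

/* One allocation holds the data buffer and the trailing shared info. */
static int toy_alloc(struct toy_skb *skb, size_t size)
{
    size = (size + 15) & ~(size_t)15;  /* keep 'end' suitably aligned */
    skb->head = malloc(size + sizeof(struct toy_shared_info));
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    memset(toy_shinfo(skb), 0, sizeof(struct toy_shared_info));
    return 0;
}

/* Like skb_reserve(): open up headroom in a still-empty buffer. */
static void toy_reserve(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put(): extend the data area at the tail, return the old tail. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    skb->tail += n;
    skb->len += n;
    assert(skb->tail <= skb->end);  /* the kernel would call skb_over_panic() */
    return old_tail;
}

/* Record the deferred segmentation plan: ceil(len / mss) segments. */
static void toy_set_gso(struct toy_skb *skb, unsigned short mss)
{
    struct toy_shared_info *si = toy_shinfo(skb);

    si->gso_size = mss;
    si->gso_segs = (unsigned short)((skb->len + mss - 1) / mss);
}

static void toy_free(struct toy_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
}
```

For example, after toy_alloc() with a 4096-byte buffer, toy_reserve() of 64 bytes of headroom, and toy_put() of 4000 payload bytes, calling toy_set_gso() with a segment size of 1448 records gso_segs = 3. No bytes are moved or cut at that point; the plan simply travels with the buffer in its trailing "attachment".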