A Brief Analysis of the nftables Subsystem
196082 · Slowly Getting Better

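Preface

It has been several months since my last post, mostly because I recently ran into a game I am completely hooked on: PUBG! You should come and learn my unbeatable peek-and-shotgun move. I still remember shotgunning a whole squad one after another at the third-floor corner of the blue building and getting called a cheater by them. Hahahaha!

I still have not calmed down from my latest chicken dinner either: ten kills and 1192 damage.

I have kept saying that I would start writing about fuzzing, and the plan was to read the AFL source code and take notes along the way, but once I had finished studying it I was too lazy to write anything up. The bigger reason is that I asked a colleague for advice on bug hunting, which convinced me to stop fixating on "shortcuts" like fuzzing and to start auditing code by actually reading it.

This post gives a brief introduction to the netlink communication mechanism and the nftables subsystem. Later posts will focus on reproducing vulnerabilities related to netlink and nftables and will not explain either of them again, so treat this one as the groundwork for those reproductions.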

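The Netlink communication mechanism

Netlink is a mechanism Linux provides for communication between the kernel and user-space processes. Although it is mainly used between user space and kernel space, it can also carry messages between two user-space processes. Since there are many other IPC options, Netlink is rarely used that way unless its broadcast feature is needed.

Broadly speaking, user space and kernel space can communicate in three ways: /proc, ioctl and Netlink. The first two are one-way, whereas Netlink is full duplex. The Netlink protocol is built on BSD sockets and the AF_NETLINK address family and addresses endpoints with a 32-bit port ID (formerly called the PID). Each Netlink protocol (also called a bus; the man page calls it a netlink family) is usually associated with one kernel service or component, or a group of them. For example, NETLINK_ROUTE is used to get and set routing and link information, and NETLINK_KOBJECT_UEVENT lets the kernel send notifications to the udev process in user space.

User-space data structures

The following program serves as an example for analysing the structures involved.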

#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <string.h>
#include <asm/types.h>
#include <linux/netlink.h>
#include <linux/socket.h>
#include <errno.h>
#define MAX_PAYLOAD 1024 // maximum payload size
#define NETLINK_TEST 25 // custom protocol number
int main(int argc, char* argv[])
{
int state;
struct sockaddr_nl src_addr, dest_addr;
struct nlmsghdr *nlh = NULL; // Netlink message header
struct iovec iov;
struct msghdr msg;
int sock_fd, retval;
int state_smg = 0;
// Create a socket
sock_fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_TEST);
if(sock_fd == -1){
printf("error getting socket: %s", strerror(errno));
return -1;
}
// To prepare binding
memset(&src_addr, 0, sizeof(src_addr));
src_addr.nl_family = AF_NETLINK;
src_addr.nl_pid = 100; // A: source port ID
src_addr.nl_groups = 0;
//Bind
retval = bind(sock_fd, (struct sockaddr*)&src_addr, sizeof(src_addr));
if(retval < 0){
printf("bind failed: %s", strerror(errno));
close(sock_fd);
return -1;
}
// Prepare to create the message
nlh = (struct nlmsghdr *)malloc(NLMSG_SPACE(MAX_PAYLOAD));
if(!nlh){
printf("malloc nlmsghdr error!\n");
close(sock_fd);
return -1;
}
memset(&dest_addr,0,sizeof(dest_addr));
dest_addr.nl_family = AF_NETLINK;
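dest_addr.nl_pid = 0; // B: destination port ID (0 = kernel)
dest_addr.nl_groups = 0;
memset(nlh, 0, NLMSG_SPACE(MAX_PAYLOAD)); // also zeroes nlmsg_type and nlmsg_seq
nlh->nlmsg_len = NLMSG_SPACE(MAX_PAYLOAD);
nlh->nlmsg_pid = 100; // C: source port ID
nlh->nlmsg_flags = 0;
strcpy(NLMSG_DATA(nlh),"Hello you!"); // set the message body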
iov.iov_base = (void *)nlh;
iov.iov_len = NLMSG_SPACE(MAX_PAYLOAD);
// Create message
memset(&msg, 0, sizeof(msg));
msg.msg_name = (void *)&dest_addr;
msg.msg_namelen = sizeof(dest_addr);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
//send message
printf("state_smg\n");
state_smg = sendmsg(sock_fd,&msg,0);
if(state_smg == -1)
{
printf("get error sendmsg = %s\n",strerror(errno));
}
memset(nlh,0,NLMSG_SPACE(MAX_PAYLOAD));
//receive message
printf("waiting received!\n");
while(1){
printf("In while recvmsg\n");
state = recvmsg(sock_fd, &msg, 0);
if(state<0)
{
printf("recvmsg failed: %s\n", strerror(errno));
break;
}
printf("Received message: %s\n",(char *) NLMSG_DATA(nlh));
}
close(sock_fd);
return 0;
}

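The program first creates a socket. The address family is AF_NETLINK, and the socket type is SOCK_RAW or SOCK_DGRAM, because netlink is a datagram-oriented service. The last argument, the protocol, selects which netlink family the socket uses.

It then binds an address with bind(). The second argument is in fact a struct sockaddr_nl: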

struct sockaddr_nl {
__kernel_sa_family_t nl_family; /* AF_NETLINK */
unsigned short nl_pad; /* zero */
__u32 nl_pid; /* port ID */
__u32 nl_groups; /* multicast groups mask */
};

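A quick look at this structure: the first member, nl_family, is always AF_NETLINK, and nl_pad is unused padding that is set to 0. The last two members are the ones to focus on.

nl_pid: in the Netlink specification, PID stands for Port-ID (32 bits), and its job is to uniquely identify a netlink socket endpoint. It is usually set to the PID of the current process. As mentioned earlier, Netlink supports not only user-to-kernel communication but also communication between two user-space processes or between two endpoints inside the kernel. A value of 0 generally refers to the kernel.

nl_groups: a user-space process that wants to join a multicast group has to call bind(). This field is a bitmask of the multicast groups the caller wants to join (note that it is a mask, not a group number; this field is covered in more detail later). A value of 0 means the caller does not want to join any multicast group. Each protocol in the Netlink family can support up to 32 multicast groups (because nl_groups is 32 bits wide), one bit per group.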
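To make the "mask, not group number" point concrete, here is a minimal sketch of the conversion. The helper name nl_group_mask is hypothetical (it is not part of any header), and it assumes groups are numbered from 1, so group n corresponds to bit n-1.

```c
#include <stdint.h>

/* Hypothetical helper: turn a 1-based multicast group number (1..32)
 * into the single bit that represents it in nl_groups.
 * Numbers outside that range yield 0, i.e. "no group". */
static uint32_t nl_group_mask(unsigned int group)
{
    if (group < 1 || group > 32)
        return 0;
    return UINT32_C(1) << (group - 1);
}
```

Joining groups 1 and 3 at bind() time would then mean setting nl_groups to nl_group_mask(1) | nl_group_mask(3).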

Back to the program flow: next it creates a struct nlmsghdr, the Netlink message header, defined as follows

struct nlmsghdr {
__u32 nlmsg_len; /* Length of message including header */
__u16 nlmsg_type; /* Message content */
__u16 nlmsg_flags; /* Additional flags */
__u32 nlmsg_seq; /* Sequence number */
__u32 nlmsg_pid; /* Sending process port ID */
};

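The first member is clearly the length of the whole message in bytes, including the Netlink header itself.

The second member is the message type, and the third holds additional flags describing the message. The fourth is the message sequence number, used to order messages; it resembles the sequence number in TCP (though not exactly), and in netlink it is optional rather than mandatory. The last member is the port ID of the sender: 0 for the kernel, and for a user process the ID its socket is bound to.

Now look at how much memory is requested when the nlmsghdr is allocated.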

#define NLMSG_ALIGNTO	4U
#define NLMSG_ALIGN(len) ( ((len)+NLMSG_ALIGNTO-1) & ~(NLMSG_ALIGNTO-1) )
#define NLMSG_HDRLEN ((int) NLMSG_ALIGN(sizeof(struct nlmsghdr)))
#define NLMSG_LENGTH(len) ((len)+NLMSG_ALIGN(NLMSG_HDRLEN))
#define NLMSG_SPACE(len) NLMSG_ALIGN(NLMSG_LENGTH(len))
#define NLMSG_DATA(nlh) ((void*)(((char*)nlh) + NLMSG_LENGTH(0)))

A quick calculation shows that the allocated chunk is 0x410 bytes: the user-defined payload size of 0x400 plus the 0x10-byte message header.
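The arithmetic can be checked with a small sketch that mirrors the macros quoted above as plain functions. The names nl_align, nl_length and nl_space are illustrative only, and the header size is hard-coded to 16 bytes (4 + 2 + 2 + 4 + 4) so that the snippet does not depend on kernel headers.

```c
#include <stddef.h>

/* Local stand-ins for NLMSG_ALIGNTO and NLMSG_HDRLEN. */
enum { NL_ALIGNTO = 4, NL_HDRLEN = 16 };

/* Round len up to the next multiple of 4, like NLMSG_ALIGN. */
static size_t nl_align(size_t len)
{
    return (len + NL_ALIGNTO - 1) & ~(size_t)(NL_ALIGNTO - 1);
}

/* Header plus payload, without trailing padding, like NLMSG_LENGTH. */
static size_t nl_length(size_t payload)
{
    return payload + nl_align(NL_HDRLEN);
}

/* Header plus payload, padded to the alignment, like NLMSG_SPACE. */
static size_t nl_space(size_t payload)
{
    return nl_align(nl_length(payload));
}
```

With a 1024-byte payload nl_space returns 1040, which is the 0x410 mentioned above. For a payload whose size is not a multiple of 4, nl_length and nl_space differ by the padding.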

The program then writes the message body into the data area, sets up the iov, and fills in struct msghdr.

/* Structure describing messages sent by
`sendmsg' and received by `recvmsg'. */
struct msghdr
{
void *msg_name; /* Address to send to/receive from. */
socklen_t msg_namelen; /* Length of address data. */

struct iovec *msg_iov; /* Vector of data to send/receive into. */
int msg_iovlen; /* Number of elements in the vector. */

void *msg_control; /* Ancillary data (eg BSD filedesc passing). */
socklen_t msg_controllen; /* Ancillary data buffer length. */

int msg_flags; /* Flags in received message. */
};

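This structure is not specific to Netlink; it is used by sendmsg and recvmsg in general. Briefly: msg_name points to the destination address of the packet (here the dest_addr above, whose nl_pid of 0 means the destination is the kernel). msg_iov points to the iov structure that in turn points to the actual payload, and msg_iovlen is the number of entries in msg_iov, not a length in bytes.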
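Since msg_iovlen is easy to mistake for a byte count, here is a sketch that describes one netlink message with two iovec entries, header first and payload second. The helper nl_fill_msghdr is hypothetical and only fills in the structures; it does not send anything. It assumes a Linux system with the kernel uapi headers installed.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

/* Hypothetical helper: point msg at a header buffer and a payload buffer.
 * Returns the total number of bytes described by the two iovec entries. */
static size_t nl_fill_msghdr(struct msghdr *msg, struct iovec iov[2],
                             struct sockaddr_nl *dest,
                             struct nlmsghdr *nlh,
                             void *payload, size_t payload_len)
{
    /* Destination: port ID 0 means the kernel. */
    memset(dest, 0, sizeof(*dest));
    dest->nl_family = AF_NETLINK;
    dest->nl_pid = 0;

    /* nlmsg_len covers the header and the payload. */
    memset(nlh, 0, sizeof(*nlh));
    nlh->nlmsg_len = NLMSG_LENGTH(payload_len);

    iov[0].iov_base = nlh;
    iov[0].iov_len = NLMSG_HDRLEN;
    iov[1].iov_base = payload;
    iov[1].iov_len = payload_len;

    memset(msg, 0, sizeof(*msg));
    msg->msg_name = dest;
    msg->msg_namelen = sizeof(*dest);
    msg->msg_iov = iov;
    msg->msg_iovlen = 2; /* number of iovec entries, not a byte count */

    return iov[0].iov_len + iov[1].iov_len;
}
```

The example program above uses a single iovec that covers header and payload together, so its msg_iovlen is 1 even though the buffer is 0x410 bytes long.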

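The rest of the program sends and receives messages, which I will not go into here (this section is mainly about the data structures).

First, a rough overview of the kernel-side APIs used for this communication; the detailed analysis of the functions and structures comes later.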

#include <linux/init.h>
#include <linux/module.h>
#include <linux/timer.h>
#include <linux/time.h>
#include <linux/types.h>
#include <net/sock.h>
#include <net/netlink.h>
#define NETLINK_TEST 25
#define MAX_MSGSIZE 1024
int stringlength(char *s);
int err;
struct sock *nl_sk = NULL;
int flag = 0;
// send a message back to the user-space process
void sendnlmsg(char *message, int pid)
{
struct sk_buff *skb_1;
struct nlmsghdr *nlh;
int len = NLMSG_SPACE(MAX_MSGSIZE);
int slen = 0;
if(!message || !nl_sk)
{
return ;
}
printk(KERN_ERR "pid:%d\n",pid);
skb_1 = alloc_skb(len,GFP_KERNEL);
if(!skb_1)
{
printk(KERN_ERR "my_net_link:alloc_skb error\n");
return; // do not use a NULL skb below
}
slen = stringlength(message);
nlh = nlmsg_put(skb_1,0,0,0,MAX_MSGSIZE,0);
NETLINK_CB(skb_1).pid = 0;
NETLINK_CB(skb_1).dst_group = 0;
message[slen]= '\0';
memcpy(NLMSG_DATA(nlh),message,slen+1);
printk("my_net_link:send message '%s'.\n",(char *)NLMSG_DATA(nlh));
netlink_unicast(nl_sk,skb_1,pid,MSG_DONTWAIT);
}
int stringlength(char *s)
{
int slen = 0;
for(; *s; s++)
{
slen++;
}
return slen;
}
// receive a message sent from user space
void nl_data_ready(struct sk_buff *__skb)
{
struct sk_buff *skb;
struct nlmsghdr *nlh;
char str[100];
struct completion cmpl;
printk("begin data_ready\n");
int i=10;
int pid;
skb = skb_get (__skb);
if(skb->len >= NLMSG_SPACE(0))
{
nlh = nlmsg_hdr(skb);
memcpy(str, NLMSG_DATA(nlh), sizeof(str));
str[sizeof(str) - 1] = '\0'; // make sure the copied text is terminated
printk("Message received:%s\n",str) ;
pid = nlh->nlmsg_pid;
while(i--)
{// use a completion as a delay: send a message back to user space every 3 seconds
init_completion(&cmpl);
wait_for_completion_timeout(&cmpl,3 * HZ);
sendnlmsg("I am from kernel!",pid);
}
flag = 1;
kfree_skb(skb);
}
}
// Initialize netlink
int netlink_init(void)
{
nl_sk = netlink_kernel_create(&init_net, NETLINK_TEST, 1,
nl_data_ready, NULL, THIS_MODULE);
if(!nl_sk){
printk(KERN_ERR "my_net_link: create netlink socket error.\n");
return 1;
}
printk("my_net_link_4: create netlink socket ok.\n");
return 0;
}
static void netlink_exit(void)
{
if(nl_sk != NULL){
sock_release(nl_sk->sk_socket);
}
printk("my_net_link: self module exited\n");
}
module_init(netlink_init);
module_exit(netlink_exit);
MODULE_AUTHOR("zhao_h");
MODULE_LICENSE("GPL");

The kernel-side netlink socket is created with netlink_kernel_create:

struct sock *netlink_kernel_create(struct net *net,
int unit,unsigned int groups,
void (*input)(struct sk_buff *skb),
struct mutex *cb_mutex,struct module *module);
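  • net: a network namespace. Each namespace can have its own forwarding information base, its own set of net_device objects, and so on. By default the global variable init_net is used.
  • unit: the netlink protocol type, e.g. NETLINK_TEST or NETLINK_SELINUX.
  • groups: the multicast address.
  • input: the netlink message handler defined by the kernel module. When a message arrives at this netlink socket, the input function pointer is invoked, and the caller's sendmsg returns only after this function returns.
  • cb_mutex: a mutex that protects data access.
  • module: usually THIS_MODULE.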

netlink_unicast sends a message to a single destination:

int netlink_unicast(struct sock *ssk, struct sk_buff *skb, u32 pid, int nonblock);
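  • ssk: the socket returned by netlink_kernel_create().
  • skb: holds the message. Its data field points to the netlink message to send, and the skb control block stores the message's address information; the NETLINK_CB(skb) macro makes that control block easy to set.
  • pid: the PID of the process that receives the message, i.e. the destination address. Set it to 0 if the destination is a group or the kernel.
  • nonblock: whether the call is non-blocking. If 1, the function returns immediately when no receive buffer is available; if 0, it sleeps with a timeout when no receive buffer is available.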

netlink_broadcast sends a message to multicast groups:

int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, u32 pid, u32 group, gfp_t allocation);

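The first three parameters are the same as for netlink_unicast. The group parameter selects the multicast groups that receive the message; each bit represents one group, so to send to several groups, set it to the bitwise OR of their group bits. The allocation parameter is the kernel memory allocation type, usually GFP_ATOMIC or GFP_KERNEL: GFP_ATOMIC is for atomic context (where sleeping is not allowed) and GFP_KERNEL is for non-atomic context.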

Finally, the netlink socket has to be released. The module above does this in netlink_exit() by calling sock_release(nl_sk->sk_socket).

How the kernel receives Netlink messages

Netlink initialisation

static int __net_init nfnetlink_net_init(struct net *net)
{
struct sock *nfnl;
struct netlink_kernel_cfg cfg = {
.groups = NFNLGRP_MAX,
.input = nfnetlink_rcv,
#ifdef CONFIG_MODULES
.bind = nfnetlink_bind,
#endif
};

nfnl = netlink_kernel_create(net, NETLINK_NETFILTER, &cfg);
if (!nfnl)
return -ENOMEM;
net->nfnl_stash = nfnl;
rcu_assign_pointer(net->nfnl, nfnl);
return 0;
}

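As shown, a sock is created through netlink_kernel_create and stored in the net that is being initialised. (The kernel-side Netlink socket API shown earlier turned out to come from a fairly old kernel; from here on the discussion uses kernel 5.10, to stay closer to the later vulnerability reproductions.) A set of callbacks, cfg, is registered at the same time. Its input member is nfnetlink_rcv, so whenever a netlink message arrives later, that member, i.e. the nfnetlink_rcv function, is called.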

struct sock *
__netlink_kernel_create(struct net *net, int unit, struct module *module,
struct netlink_kernel_cfg *cfg)
{
struct socket *sock;
struct sock *sk;
struct netlink_sock *nlk;
struct listeners *listeners = NULL;
struct mutex *cb_mutex = cfg ? cfg->cb_mutex : NULL;
unsigned int groups;

BUG_ON(!nl_table);

if (unit < 0 || unit >= MAX_LINKS)
return NULL;

if (sock_create_lite(PF_NETLINK, SOCK_DGRAM, unit, &sock))
return NULL;

if (__netlink_create(net, sock, cb_mutex, unit, 1) < 0)
goto out_sock_release_nosk;

sk = sock->sk;

if (!cfg || cfg->groups < 32)
groups = 32;
else
groups = cfg->groups;

listeners = kzalloc(sizeof(*listeners) + NLGRPSZ(groups), GFP_KERNEL);
if (!listeners)
goto out_sock_release;

sk->sk_data_ready = netlink_data_ready;
if (cfg && cfg->input)
nlk_sk(sk)->netlink_rcv = cfg->input;

if (netlink_insert(sk, 0))
goto out_sock_release;

nlk = nlk_sk(sk);
nlk->flags |= NETLINK_F_KERNEL_SOCKET;

netlink_table_grab();
if (!nl_table[unit].registered) {
nl_table[unit].groups = groups;
rcu_assign_pointer(nl_table[unit].listeners, listeners);
nl_table[unit].cb_mutex = cb_mutex;
nl_table[unit].module = module;
if (cfg) {
nl_table[unit].bind = cfg->bind;
nl_table[unit].unbind = cfg->unbind;
nl_table[unit].flags = cfg->flags;
if (cfg->compare)
nl_table[unit].compare = cfg->compare;
}
nl_table[unit].registered = 1;
} else {
kfree(listeners);
nl_table[unit].registered++;
}
netlink_table_ungrab();
return sk;

out_sock_release:
kfree(listeners);
netlink_kernel_release(sk);
return NULL;

out_sock_release_nosk:
sock_release(sock);
return NULL;
}
EXPORT_SYMBOL(__netlink_kernel_create);

static inline struct sock *
netlink_kernel_create(struct net *net, int unit, struct netlink_kernel_cfg *cfg)
{
return __netlink_kernel_create(net, unit, THIS_MODULE, cfg);
}

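Here the kernel sock is converted to a netlink_sock with nlk_sk, and its netlink_rcv member is set to cfg->input, i.e. the nfnetlink_rcv function from the beginning.

The function also calls __netlink_create: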

static int __netlink_create(struct net *net, struct socket *sock,
struct mutex *cb_mutex, int protocol,
int kern)
{
struct sock *sk;
struct netlink_sock *nlk;

sock->ops = &netlink_ops;

sk = sk_alloc(net, PF_NETLINK, GFP_KERNEL, &netlink_proto, kern);
if (!sk)
return -ENOMEM;

sock_init_data(sock, sk);

nlk = nlk_sk(sk);
if (cb_mutex) {
nlk->cb_mutex = cb_mutex;
} else {
nlk->cb_mutex = &nlk->cb_def_mutex;
mutex_init(nlk->cb_mutex);
lockdep_set_class_and_name(nlk->cb_mutex,
nlk_cb_mutex_keys + protocol,
nlk_cb_mutex_key_strings[protocol]);
}
init_waitqueue_head(&nlk->wait);

sk->sk_destruct = netlink_sock_destruct;
sk->sk_protocol = protocol;
return 0;
}

Note that the ops of the socket set up here is netlink_ops

static const struct proto_ops netlink_ops = {
.family = PF_NETLINK,
.owner = THIS_MODULE,
.release = netlink_release,
.bind = netlink_bind,
.connect = netlink_connect,
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname = netlink_getname,
.poll = datagram_poll,
.ioctl = netlink_ioctl,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = netlink_setsockopt,
.getsockopt = netlink_getsockopt,
.sendmsg = netlink_sendmsg,
.recvmsg = netlink_recvmsg,
.mmap = sock_no_mmap,
.sendpage = sock_no_sendpage,
};

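Request handling at the sock layer

When a user needs to perform operations such as configuring a ruleset, the request goes to the kernel through netlink. Because all subsystems share a single nfnetlink, the request has to specify both the subsystem ID and the ID of the requested operation. The main job at the sock layer is to pick the matching function based on these two IDs, extract the data, and pass it to that function.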
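As a rough illustration of how the two IDs travel together, the sketch below packs them into a single 16-bit message type, with the subsystem ID in the high byte and the operation in the low byte. This mirrors my understanding of the NFNL_SUBSYS_ID and NFNL_MSG_TYPE macros in the nfnetlink uapi header; the function names here are made up for the example.

```c
#include <stdint.h>

/* Pack a subsystem ID and an operation ID into one nlmsg_type value. */
static uint16_t nfnl_make_type(uint8_t subsys, uint8_t msg)
{
    return (uint16_t)(((unsigned int)subsys << 8) | msg);
}

/* High byte: which subsystem the request is for. */
static uint8_t nfnl_subsys_id(uint16_t type)
{
    return (uint8_t)((type & 0xff00u) >> 8);
}

/* Low byte: which operation inside that subsystem. */
static uint8_t nfnl_msg_type(uint16_t type)
{
    return (uint8_t)(type & 0x00ffu);
}
```

The receiving side only has to split nlmsg_type back into its two halves to find the subsystem and then the handler for the operation.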

long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags,
bool forbid_cmsg_compat)
{
int fput_needed, err;
struct msghdr msg_sys;
struct socket *sock;

if (forbid_cmsg_compat && (flags & MSG_CMSG_COMPAT))
return -EINVAL;

sock = sockfd_lookup_light(fd, &err, &fput_needed);
if (!sock)
goto out;

err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, 0);

fput_light(sock->file, fput_needed);
out:
return err;
}

SYSCALL_DEFINE3(sendmsg, int, fd, struct user_msghdr __user *, msg, unsigned int, flags)
{
return __sys_sendmsg(fd, msg, flags, true);
}

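User space ends by calling sendmsg to pass the message to the kernel. The system call goes straight into __sys_sendmsg, which first uses sockfd_lookup_light to find the socket structure that corresponds to the fd.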

static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
{
struct fd f = fdget(fd);
struct socket *sock;

*err = -EBADF;
if (f.file) {
sock = sock_from_file(f.file, err);
if (likely(sock)) {
*fput_needed = f.flags & FDPUT_FPUT;
return sock;
}
fdput(f);
}
return NULL;
}

It then calls ___sys_sendmsg directly. The third argument, msg_sys, has the same type name as its user-space counterpart, msghdr, but a different definition.

struct msghdr {
void *msg_name; /* ptr to socket address structure */
int msg_namelen; /* size of socket address structure */
struct iov_iter msg_iter; /* data */

/*
* Ancillary data. msg_control_user is the user buffer used for the
* recv* side when msg_control_is_user is set, msg_control is the kernel
* buffer used for all other cases.
*/
union {
void *msg_control;
void __user *msg_control_user;
};
bool msg_control_is_user : 1;
__kernel_size_t msg_controllen; /* ancillary data buffer length */
unsigned int msg_flags; /* flags on received message */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
};

The major difference is the msg_iter member, which combines msg_iov and msg_iovlen.

static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
struct msghdr *msg_sys, unsigned int flags,
struct used_address *used_address,
unsigned int allowed_msghdr_flags)
{
struct sockaddr_storage address;
struct iovec iovstack[UIO_FASTIOV], *iov = iovstack;
ssize_t err;

msg_sys->msg_name = &address;

err = sendmsg_copy_msghdr(msg_sys, msg, flags, &iov);
if (err < 0)
return err;

err = ____sys_sendmsg(sock, msg_sys, flags, used_address,
allowed_msghdr_flags);
kfree(iov);
return err;
}

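The function starts by defining an iovstack variable, whose purpose is to speed up copying the user data. It assumes that the user passes no more than UIO_FASTIOV iovec entries; if there are more, memory is allocated with kmalloc_array.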
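The "small array on the stack, heap only when needed" idea can be sketched in user space as follows. This is an analogy rather than kernel code: FAST_SLOTS stands in for UIO_FASTIOV, calloc stands in for kmalloc_array, and import_ints is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

#define FAST_SLOTS 8 /* stand-in for UIO_FASTIOV */

/* Copy n ints from src. Use the caller's on-stack array when it is large
 * enough, otherwise fall back to a heap allocation.
 * Returns NULL if the allocation fails. The caller frees the result only
 * when it differs from fast. */
static int *import_ints(const int *src, size_t n, int *fast)
{
    int *dst = fast;

    if (n > FAST_SLOTS) {
        dst = calloc(n, sizeof(*dst));
        if (dst == NULL)
            return NULL;
    }
    if (n > 0)
        memcpy(dst, src, n * sizeof(*dst));
    return dst;
}
```

The kernel follows the same ownership rule: the trailing kfree(iov) in ___sys_sendmsg is safe because import_iovec sets the pointer to NULL when the stack array was used.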

Next, look at the fifth parameter of ___sys_sendmsg, which is a struct used_address

struct used_address {
struct sockaddr_storage name;
unsigned int name_len;
};

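Its two fields hold the message's address and the length of that address. The structure is mainly used by the sendmmsg system call (which sends multiple packets to one socket address at once and can skip repeated network security checks, improving send efficiency) to remember the destination address across packets. Here it is set to NULL, meaning it is not used.

The function first calls sendmsg_copy_msghdr to copy the user-space msghdr into the kernel.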

int sendmsg_copy_msghdr(struct msghdr *msg,
struct user_msghdr __user *umsg, unsigned flags,
struct iovec **iov)
{
int err;

if (flags & MSG_CMSG_COMPAT) {
struct compat_msghdr __user *msg_compat;

msg_compat = (struct compat_msghdr __user *) umsg;
err = get_compat_msghdr(msg, msg_compat, NULL, iov);
} else {
err = copy_msghdr_from_user(msg, umsg, NULL, iov);
}
if (err < 0)
return err;

return 0;
}

As the earlier code shows, flags does not carry MSG_CMSG_COMPAT here, so execution enters copy_msghdr_from_user.

static int copy_msghdr_from_user(struct msghdr *kmsg,
struct user_msghdr __user *umsg,
struct sockaddr __user **save_addr,
struct iovec **iov)
{
struct user_msghdr msg;
ssize_t err;

err = __copy_msghdr_from_user(kmsg, umsg, save_addr, &msg.msg_iov,
&msg.msg_iovlen);
if (err)
return err;

err = import_iovec(save_addr ? READ : WRITE,
msg.msg_iov, msg.msg_iovlen,
UIO_FASTIOV, iov, &kmsg->msg_iter);
return err < 0 ? err : 0;
}

This function does two main things: __copy_msghdr_from_user copies the user-space msghdr into the kernel, and then the user-space iovec is copied into the kernel.

int __copy_msghdr_from_user(struct msghdr *kmsg,
struct user_msghdr __user *umsg,
struct sockaddr __user **save_addr,
struct iovec __user **uiov, size_t *nsegs)
{
struct user_msghdr msg;
ssize_t err;

if (copy_from_user(&msg, umsg, sizeof(*umsg)))
return -EFAULT;

kmsg->msg_control_is_user = true;
kmsg->msg_control_user = msg.msg_control;
kmsg->msg_controllen = msg.msg_controllen;
kmsg->msg_flags = msg.msg_flags;

kmsg->msg_namelen = msg.msg_namelen;
if (!msg.msg_name)
kmsg->msg_namelen = 0;

if (kmsg->msg_namelen < 0)
return -EINVAL;

if (kmsg->msg_namelen > sizeof(struct sockaddr_storage))
kmsg->msg_namelen = sizeof(struct sockaddr_storage);

if (save_addr)
*save_addr = msg.msg_name;

if (msg.msg_name && kmsg->msg_namelen) {
if (!save_addr) {
err = move_addr_to_kernel(msg.msg_name,
kmsg->msg_namelen,
kmsg->msg_name);
if (err < 0)
return err;
}
} else {
kmsg->msg_name = NULL;
kmsg->msg_namelen = 0;
}

if (msg.msg_iovlen > UIO_MAXIOV)
return -EMSGSIZE;

kmsg->msg_iocb = NULL;
*uiov = msg.msg_iov;
*nsegs = msg.msg_iovlen;
return 0;
}

The first part of __copy_msghdr_from_user assigns the members of kmsg; it then enters move_addr_to_kernel

int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr_storage *kaddr)
{
if (ulen < 0 || ulen > sizeof(struct sockaddr_storage))
return -EINVAL;
if (ulen == 0)
return 0;
if (copy_from_user(kaddr, uaddr, ulen))
return -EFAULT;
return audit_sockaddr(ulen, kaddr);
}

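This copies the address into kernel memory. The kernel buffer can be traced back to struct sockaddr_storage address; at the start of ___sys_sendmsg, and what is copied into it here is the sockaddr_nl destination address introduced earlier.

At the end, __copy_msghdr_from_user hands the iov address and count back through uiov and nsegs; after it returns, the caller invokes import_iovec.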

struct iovec *iovec_from_user(const struct iovec __user *uvec,
unsigned long nr_segs, unsigned long fast_segs,
struct iovec *fast_iov, bool compat)
{
struct iovec *iov = fast_iov;
int ret;

/*
* SuS says "The readv() function *may* fail if the iovcnt argument was
* less than or equal to 0, or greater than {IOV_MAX}. Linux has
* traditionally returned zero for zero segments, so...
*/
if (nr_segs == 0)
return iov;
if (nr_segs > UIO_MAXIOV)
return ERR_PTR(-EINVAL);
if (nr_segs > fast_segs) {
iov = kmalloc_array(nr_segs, sizeof(struct iovec), GFP_KERNEL);
if (!iov)
return ERR_PTR(-ENOMEM);
}

if (compat)
ret = copy_compat_iovec_from_user(iov, uvec, nr_segs);
else
ret = copy_iovec_from_user(iov, uvec, nr_segs);
if (ret) {
if (iov != fast_iov)
kfree(iov);
return ERR_PTR(ret);
}

return iov;
}

ssize_t __import_iovec(int type, const struct iovec __user *uvec,
unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
struct iov_iter *i, bool compat)
{
ssize_t total_len = 0;
unsigned long seg;
struct iovec *iov;

iov = iovec_from_user(uvec, nr_segs, fast_segs, *iovp, compat);
if (IS_ERR(iov)) {
*iovp = NULL;
return PTR_ERR(iov);
}

/*
* According to the Single Unix Specification we should return EINVAL if
* an element length is < 0 when cast to ssize_t or if the total length
* would overflow the ssize_t return value of the system call.
*
* Linux caps all read/write calls to MAX_RW_COUNT, and avoids the
* overflow case.
*/
for (seg = 0; seg < nr_segs; seg++) {
ssize_t len = (ssize_t)iov[seg].iov_len;

if (!access_ok(iov[seg].iov_base, len)) {
if (iov != *iovp)
kfree(iov);
*iovp = NULL;
return -EFAULT;
}

if (len > MAX_RW_COUNT - total_len) {
len = MAX_RW_COUNT - total_len;
iov[seg].iov_len = len;
}
total_len += len;
}

iov_iter_init(i, type, iov, nr_segs, total_len);
if (iov == *iovp)
*iovp = NULL;
else
*iovp = iov;
return total_len;
}

ssize_t import_iovec(int type, const struct iovec __user *uvec,
unsigned nr_segs, unsigned fast_segs,
struct iovec **iovp, struct iov_iter *i)
{
return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
in_compat_syscall());
}

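The code above initialises the iov: it copies the user iovec array into the kernel and sets up kmsg->msg_iter from it.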
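The length loop in __import_iovec is worth a closer look, because it never lets the running total exceed MAX_RW_COUNT. A user-space sketch of just that clamping step, with cap standing in for MAX_RW_COUNT and clamp_iov_total being a hypothetical name:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Sum the segment lengths without letting the total exceed cap.
 * A segment that would cross the cap is shortened in place, and any
 * segment after it ends up with length 0. */
static size_t clamp_iov_total(struct iovec *iov, size_t nr_segs, size_t cap)
{
    size_t total = 0;

    for (size_t seg = 0; seg < nr_segs; seg++) {
        size_t len = iov[seg].iov_len;

        if (len > cap - total) {
            len = cap - total;
            iov[seg].iov_len = len;
        }
        total += len;
    }
    return total;
}
```

Because total never exceeds cap, the subtraction cap - total cannot underflow, which is how the loop avoids the overflow case mentioned in the kernel comment.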

static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys,
unsigned int flags, struct used_address *used_address,
unsigned int allowed_msghdr_flags)
{
unsigned char ctl[sizeof(struct cmsghdr) + 20]
__aligned(sizeof(__kernel_size_t));
/* 20 is size of ipv6_pktinfo */
unsigned char *ctl_buf = ctl;
int ctl_len;
ssize_t err;

err = -ENOBUFS;

if (msg_sys->msg_controllen > INT_MAX)
goto out;
flags |= (msg_sys->msg_flags & allowed_msghdr_flags);
ctl_len = msg_sys->msg_controllen;
if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
err =
cmsghdr_from_user_compat_to_kern(msg_sys, sock->sk, ctl,
sizeof(ctl));
if (err)
goto out;
ctl_buf = msg_sys->msg_control;
ctl_len = msg_sys->msg_controllen;
} else if (ctl_len) {
BUILD_BUG_ON(sizeof(struct cmsghdr) !=
CMSG_ALIGN(sizeof(struct cmsghdr)));
if (ctl_len > sizeof(ctl)) {
ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
if (ctl_buf == NULL)
goto out;
}
err = -EFAULT;
if (copy_from_user(ctl_buf, msg_sys->msg_control_user, ctl_len))
goto out_freectl;
msg_sys->msg_control = ctl_buf;
msg_sys->msg_control_is_user = false;
}
msg_sys->msg_flags = flags;

if (sock->file->f_flags & O_NONBLOCK)
msg_sys->msg_flags |= MSG_DONTWAIT;
/*
* If this is sendmmsg() and current destination address is same as
* previously succeeded address, omit asking LSM's decision.
* used_address->name_len is initialized to UINT_MAX so that the first
* destination address never matches.
*/
if (used_address && msg_sys->msg_name &&
used_address->name_len == msg_sys->msg_namelen &&
!memcmp(&used_address->name, msg_sys->msg_name,
used_address->name_len)) {
err = sock_sendmsg_nosec(sock, msg_sys);
goto out_freectl;
}
err = sock_sendmsg(sock, msg_sys);
/*
* If this is sendmmsg() and sending to current destination address was
* successful, remember it.
*/
if (used_address && err >= 0) {
used_address->name_len = msg_sys->msg_namelen;
if (msg_sys->msg_name)
memcpy(&used_address->name, msg_sys->msg_name,
used_address->name_len);
}

out_freectl:
if (ctl_buf != ctl)
sock_kfree_s(sock->sk, ctl_buf, ctl_len);
out:
return err;
}

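Back in ___sys_sendmsg: right after sendmsg_copy_msghdr finishes, ____sys_sendmsg is called.

After the first half of the function has checked flags and the msghdr structure, it uses the used_address pointer to decide whether the current destination address matches the recorded one. If it does, sock_sendmsg_nosec is called; otherwise sock_sendmsg is called.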
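The address cache behind used_address can be sketched in user space like this. The struct and function names are invented for the example, and the 128-byte buffer is simply assumed to be large enough for any socket address; the one detail taken from the kernel comment is that the length starts out as UINT_MAX so that the first address never matches.

```c
#include <limits.h>
#include <string.h>

/* Illustrative stand-in for struct used_address. */
struct cached_addr {
    unsigned char name[128];
    unsigned int name_len;
};

/* Start with a length no real address can have. */
static void cached_addr_init(struct cached_addr *c)
{
    c->name_len = UINT_MAX;
}

/* Non-zero when name/len equals the remembered address. */
static int cached_addr_matches(const struct cached_addr *c,
                               const void *name, unsigned int len)
{
    return c != NULL && name != NULL &&
           len <= sizeof(c->name) &&
           c->name_len == len &&
           memcmp(c->name, name, len) == 0;
}

/* Remember the address of a send that succeeded. */
static void cached_addr_remember(struct cached_addr *c,
                                 const void *name, unsigned int len)
{
    if (len > sizeof(c->name))
        return;
    c->name_len = len;
    memcpy(c->name, name, len);
}
```

In the sendmsg path traced here the pointer is NULL, so the match is never attempted and sock_sendmsg with its security check is always taken.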

int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
int err = security_socket_sendmsg(sock, msg,
msg_data_left(msg));

return err ?: sock_sendmsg_nosec(sock, msg);
}
EXPORT_SYMBOL(sock_sendmsg);

The definition of sock_sendmsg shows that it also ends up calling sock_sendmsg_nosec; the difference is that it runs a security check first.

#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)

static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
{
int ret = INDIRECT_CALL_INET(sock->ops->sendmsg, inet6_sendmsg,
inet_sendmsg, sock, msg,
msg_data_left(msg));
BUG_ON(ret == -EIOCBQUEUED);
return ret;
}

From the Netlink initialisation discussed earlier, the sendmsg here is netlink_sendmsg

static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
{
struct sock *sk = sock->sk;
struct netlink_sock *nlk = nlk_sk(sk);
DECLARE_SOCKADDR(struct sockaddr_nl *, addr, msg->msg_name);
u32 dst_portid;
u32 dst_group;
struct sk_buff *skb;
int err;
struct scm_cookie scm;
u32 netlink_skb_flags = 0;

if (msg->msg_flags & MSG_OOB)
return -EOPNOTSUPP;

err = scm_send(sock, msg, &scm, true);
if (err < 0)
return err;

if (msg->msg_namelen) {
err = -EINVAL;
if (msg->msg_namelen < sizeof(struct sockaddr_nl))
goto out;
if (addr->nl_family != AF_NETLINK)
goto out;
dst_portid = addr->nl_pid;
dst_group = ffs(addr->nl_groups);
err = -EPERM;
if ((dst_group || dst_portid) &&
!netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
goto out;
netlink_skb_flags |= NETLINK_SKB_DST;
} else {
dst_portid = nlk->dst_portid;
dst_group = nlk->dst_group;
}

if (!nlk->bound) {
err = netlink_autobind(sock);
if (err)
goto out;
} else {
/* Ensure nlk is hashed and visible. */
smp_rmb();
}

err = -EMSGSIZE;
if (len > sk->sk_sndbuf - 32)
goto out;
err = -ENOBUFS;
skb = netlink_alloc_large_skb(len, dst_group);
if (skb == NULL)
goto out;

NETLINK_CB(skb).portid = nlk->portid;
NETLINK_CB(skb).dst_group = dst_group;
NETLINK_CB(skb).creds = scm.creds;
NETLINK_CB(skb).flags = netlink_skb_flags;

err = -EFAULT;
if (memcpy_from_msg(skb_put(skb, len), msg, len)) {
kfree_skb(skb);
goto out;
}

err = security_netlink_send(sk, skb);
if (err) {
kfree_skb(skb);
goto out;
}

if (dst_group) {
refcount_inc(&skb->users);
netlink_broadcast(sk, skb, dst_portid, dst_group, GFP_KERNEL);
}
err = netlink_unicast(sk, skb, dst_portid, msg->msg_flags & MSG_DONTWAIT);

out:
scm_destroy(&scm);
return err;
}

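The function first declares a sockaddr_nl pointer for the destination address. The first part mainly decides whether multicast is allowed and performs some validation.

It then checks whether the data to send is too long and allocates an skb with netlink_alloc_large_skb. Once the skb has been created, it is initialised.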

#define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))

The NETLINK_CB macro is used to access the cb extension field in the skb, 48 bytes in total that here hold netlink address and flag information, by casting cb to a netlink_skb_parms structure.

struct netlink_skb_parms {
struct scm_creds creds; /* Skb credentials */
__u32 portid;
__u32 dst_group;
__u32 flags;
struct sock *sk;
};

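Here portid is the ID the source socket is bound to, dst_group is the destination multicast group of the message, flags holds flag bits, and sk points to the sock structure of the source socket.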
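The control-block trick can be reproduced in a few lines of user-space C. Everything named demo_ or DEMO_ below is made up for the illustration; only the pattern (a fixed 48-byte scratch area reinterpreted as a protocol-specific struct) comes from the text above. The sketch assumes a C11 compiler for _Alignas and _Static_assert.

```c
#include <stdint.h>
#include <stdlib.h>

/* Protocol-private fields, a cut-down stand-in for netlink_skb_parms. */
struct demo_parms {
    uint32_t portid;
    uint32_t dst_group;
    uint32_t flags;
};

/* A buffer object with a generic 48-byte scratch area, like skb->cb. */
struct demo_skb {
    _Alignas(8) unsigned char cb[48];
};

/* The private struct must fit inside the scratch area. */
_Static_assert(sizeof(struct demo_parms) <= 48, "parms must fit in cb");

/* Same shape as NETLINK_CB: view the scratch area as the private struct. */
#define DEMO_CB(skb) (*(struct demo_parms *)(void *)&((skb)->cb))
```

After allocating a struct demo_skb, code can write DEMO_CB(skb).portid = 100 exactly the way netlink_sendmsg writes NETLINK_CB(skb).portid, and any other layer remains free to reinterpret the same 48 bytes differently.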

The code first stores the socket's bound portid in the skb's cb field, then sets the destination multicast group and the netlink_skb flags (NETLINK_SKB_DST is already set here).
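Note that dst_group was computed earlier as ffs(addr->nl_groups), so it is a group number derived from the mask rather than the mask itself. A small sketch using the ffs function from strings.h (the wrapper name nl_first_group is hypothetical):

```c
#include <stdint.h>
#include <strings.h>

/* ffs() returns the 1-based position of the lowest set bit, or 0 when no
 * bit is set, so an empty mask maps to "no group". */
static int nl_first_group(uint32_t groups_mask)
{
    return ffs((int)groups_mask);
}
```

One consequence is that only the lowest set bit of nl_groups matters on this path: a mask of 0x6 yields group 2, not groups 2 and 3.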

static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
{
return skb->tail;
}

static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
{
return copy_from_iter_full(data, len, &msg->msg_iter) ? 0 : -EFAULT;
}

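Next comes the key call, memcpy_from_msg, which copies the data: skb_put is called first to advance the skb->tail pointer, and then copy_from_iter_full(data, len, &msg->msg_iter) moves the data from msg->msg_iter into skb->data.

After that, security_netlink_send performs the security check. Finally, netlink_broadcast is called if a multicast group is set, and then netlink_unicast is called.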

int netlink_unicast(struct sock *ssk, struct sk_buff *skb,
u32 portid, int nonblock)
{
struct sock *sk;
int err;
long timeo;

skb = netlink_trim(skb, gfp_any());

timeo = sock_sndtimeo(ssk, nonblock);
retry:
sk = netlink_getsockbyportid(ssk, portid);
if (IS_ERR(sk)) {
kfree_skb(skb);
return PTR_ERR(sk);
}
if (netlink_is_kernel(sk))
return netlink_unicast_kernel(sk, skb, ssk);

if (sk_filter(sk, skb)) {
err = skb->len;
kfree_skb(skb);
sock_put(sk);
return err;
}

err = netlink_attachskb(sk, skb, &timeo, ssk);
if (err == 1)
goto retry;
if (err)
return err;

return netlink_sendskb(sk, skb);
}
EXPORT_SYMBOL(netlink_unicast);

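It first calls netlink_trim to trim the skb's data area. This may clone a new skb and reallocate the memory behind skb->data; if the spare space in the original skb is very small, or the memory lives in vmalloc space, none of that happens. In the scenario we are following it is the latter case, so no reallocation takes place.

Then sock_sndtimeo records the send timeout: if MSG_DONTWAIT is set the wait time is 0, otherwise it returns sk->sk_sndtimeo.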

static struct sock *netlink_getsockbyportid(struct sock *ssk, u32 portid)
{
struct sock *sock;
struct netlink_sock *nlk;

sock = netlink_lookup(sock_net(ssk), ssk->sk_protocol, portid);
if (!sock)
return ERR_PTR(-ECONNREFUSED);

/* Don't bother queuing skb if kernel socket has no input function */
nlk = nlk_sk(sock);
if (sock->sk_state == NETLINK_CONNECTED &&
nlk->dst_portid != nlk_sk(ssk)->portid) {
sock_put(sock);
return ERR_PTR(-ECONNREFUSED);
}
return sock;
}

Next, netlink_getsockbyportid looks up the destination sock structure from the destination portid and the source sock structure.

Once the sock is found, netlink_is_kernel checks whether it is a kernel netlink socket. If the destination is in kernel space, netlink_unicast_kernel is called to unicast to the kernel.

static int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb,
struct sock *ssk)
{
int ret;
struct netlink_sock *nlk = nlk_sk(sk);

ret = -ECONNREFUSED;
if (nlk->netlink_rcv != NULL) {
ret = skb->len;
netlink_skb_set_owner_r(skb, sk);
NETLINK_CB(skb).sk = ssk;
netlink_deliver_tap_kernel(sk, ssk, skb);
nlk->netlink_rcv(skb);
consume_skb(skb);
} else {
kfree_skb(skb);
}
sock_put(sk);
return ret;
}

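It checks whether the destination netlink socket has registered a netlink_rcv() receive function. If not, the packet is simply dropped; otherwise the send continues.

Inside the if branch, a few fields are set on the skb, and finally nlk->netlink_rcv is called to deliver the message to the destination netlink socket in the kernel. From the earlier analysis, this ends up in nfnetlink_rcv.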
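Stripped of reference counting and skb handling, the dispatch logic amounts to "call the registered callback, or refuse". A user-space sketch under that simplification, with all demo_ names invented for the example:

```c
#include <errno.h>
#include <stddef.h>

struct demo_msg {
    size_t len;
};

/* A socket that may or may not have a receive callback registered. */
struct demo_sock {
    void (*rcv)(struct demo_msg *msg);
};

/* Counts deliveries so the effect of the callback is observable. */
static int demo_seen;

static void demo_rcv(struct demo_msg *msg)
{
    (void)msg;
    demo_seen++;
}

/* Deliver msg to sk. Returns the message length on success, or
 * -ECONNREFUSED when no receive callback is registered. */
static int demo_unicast_kernel(struct demo_sock *sk, struct demo_msg *msg)
{
    int ret = -ECONNREFUSED;

    if (sk->rcv != NULL) {
        ret = (int)msg->len; /* read the length before handing msg over */
        sk->rcv(msg);
    }
    return ret;
}
```

The callback runs synchronously inside the sender's call, which matches the earlier remark that the caller's sendmsg only returns once the input function has returned.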

Request handling at the netlink layer

The analysis below also brings in the user-space libraries.

void set_stable_table_and_set(struct mnl_socket* nl, const char *name)
{
char * table_name = name;
char * set_name = NULL;
uint8_t family = NFPROTO_IPV4;
uint32_t set_id = 1;

// a table for the sets to be associated with
struct nftnl_table * table = nftnl_table_alloc();
nftnl_table_set_str(table, NFTNL_TABLE_NAME, table_name);
nftnl_table_set_u32(table, NFTNL_TABLE_FLAGS, 0);

struct nftnl_set * set_stable = nftnl_set_alloc();
set_name = "set_stable";
nftnl_set_set_str(set_stable, NFTNL_SET_TABLE, table_name);
nftnl_set_set_str(set_stable, NFTNL_SET_NAME, set_name);
nftnl_set_set_u32(set_stable, NFTNL_SET_KEY_LEN, 1);
nftnl_set_set_u32(set_stable, NFTNL_SET_FAMILY, family);
nftnl_set_set_u32(set_stable, NFTNL_SET_ID, set_id++);

// expressions
struct nftnl_expr * exprs[128];
int exprid = 0;

// serialize
char buf[MNL_SOCKET_BUFFER_SIZE*2];

struct mnl_nlmsg_batch * batch = mnl_nlmsg_batch_start(buf, sizeof(buf));
int seq = 0;

nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
mnl_nlmsg_batch_next(batch);

struct nlmsghdr * nlh;
int table_seq = seq;

nlh = nftnl_table_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
NFT_MSG_NEWTABLE, family, NLM_F_CREATE|NLM_F_ACK, seq++);
nftnl_table_nlmsg_build_payload(nlh, table);
mnl_nlmsg_batch_next(batch);

// add set_stable
nlh = nftnl_set_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
NFT_MSG_NEWSET, family,
NLM_F_CREATE|NLM_F_ACK, seq++);
nftnl_set_nlmsg_build_payload(nlh, set_stable);
nftnl_set_free(set_stable);
mnl_nlmsg_batch_next(batch);

nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
mnl_nlmsg_batch_next(batch);

if (nl == NULL) {
err(1, "mnl_socket_open");
}

printf("[+] setting stable %s and set\n", table_name);
if (mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
mnl_nlmsg_batch_size(batch)) < 0) {
err(1, "mnl_socket_send");
}
}

The code above is the function that the exploit in the CVE-2022-32250 reproduction article uses to send its netlink request.

static void nfnetlink_rcv(struct sk_buff *skb)
{
struct nlmsghdr *nlh = nlmsg_hdr(skb);

if (skb->len < NLMSG_HDRLEN ||
nlh->nlmsg_len < NLMSG_HDRLEN ||
skb->len < nlh->nlmsg_len)
return;

if (!netlink_net_capable(skb, CAP_NET_ADMIN)) {
netlink_ack(skb, nlh, -EPERM, NULL);
return;
}

if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN)
nfnetlink_rcv_skb_batch(skb, nlh);
else
netlink_rcv_skb(skb, nfnetlink_rcv_msg);
}

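This function mostly performs checks. It first verifies that the lengths are sane, then checks that the sender holds CAP_NET_ADMIN, and finally dispatches to a different handler depending on nlh->nlmsg_type. Let's briefly trace where this nlmsg_type comes from.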

EXPORT_SYMBOL struct mnl_nlmsg_batch *mnl_nlmsg_batch_start(void *buf,
size_t limit)
{
struct mnl_nlmsg_batch *b;

b = malloc(sizeof(struct mnl_nlmsg_batch));
if (b == NULL)
return NULL;

b->buf = buf;
b->limit = limit;
b->buflen = 0;
b->cur = buf;
b->overflow = false;

return b;
}

First, userspace allocates a batch with mnl_nlmsg_batch_start.

struct mnl_nlmsg_batch {
/* the buffer that is used to store the batch. */
void *buf;
size_t limit;
size_t buflen;
/* the current netlink message in the batch. */
void *cur;
bool overflow;
};

EXPORT_SYMBOL void *mnl_nlmsg_batch_current(struct mnl_nlmsg_batch *b)
{
return b->cur;
}

Next, mnl_nlmsg_batch_current returns cur, which is passed to nftnl_batch_begin.

static void nftnl_batch_build_hdr(char *buf, uint16_t type, uint32_t seq)
{
struct nlmsghdr *nlh;
struct nfgenmsg *nfg;

nlh = mnl_nlmsg_put_header(buf);
nlh->nlmsg_type = type;
nlh->nlmsg_flags = NLM_F_REQUEST;
nlh->nlmsg_seq = seq;

nfg = mnl_nlmsg_put_extra_header(nlh, sizeof(*nfg));
nfg->nfgen_family = AF_UNSPEC;
nfg->version = NFNETLINK_V0;
nfg->res_id = NFNL_SUBSYS_NFTABLES;
}

void nftnl_batch_begin(char *buf, uint32_t seq)
{
nftnl_batch_build_hdr(buf, NFNL_MSG_BATCH_BEGIN, seq);
}

This is where nlh->nlmsg_type is set to NFNL_MSG_BATCH_BEGIN.

struct nlmsghdr {
__u32 nlmsg_len; /* Length of message including header */
__u16 nlmsg_type; /* Message content */
__u16 nlmsg_flags; /* Additional flags */
__u32 nlmsg_seq; /* Sequence number */
__u32 nlmsg_pid; /* Sending process port ID */
};

nlmsghdr is defined as shown above.

So, given the nlh that userspace hands to the kernel, execution continues in nfnetlink_rcv_skb_batch.

static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh)
{
int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
struct nlattr *attr = (void *)nlh + min_len;
struct nlattr *cda[NFNL_BATCH_MAX + 1];
int attrlen = nlh->nlmsg_len - min_len;
struct nfgenmsg *nfgenmsg;
int msglen, err;
u32 gen_id = 0;
u16 res_id;

msglen = NLMSG_ALIGN(nlh->nlmsg_len);
if (msglen > skb->len)
msglen = skb->len;

if (skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg))
return;

err = nla_parse_deprecated(cda, NFNL_BATCH_MAX, attr, attrlen,
nfnl_batch_policy, NULL);
if (err < 0) {
netlink_ack(skb, nlh, err, NULL);
return;
}
if (cda[NFNL_BATCH_GENID])
gen_id = ntohl(nla_get_be32(cda[NFNL_BATCH_GENID]));

nfgenmsg = nlmsg_data(nlh);
skb_pull(skb, msglen);
/* Work around old nft using host byte order */
if (nfgenmsg->res_id == NFNL_SUBSYS_NFTABLES)
res_id = NFNL_SUBSYS_NFTABLES;
else
res_id = ntohs(nfgenmsg->res_id);

nfnetlink_rcv_batch(skb, nlh, res_id, gen_id);
}

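The first part does some preprocessing: it validates the arguments and sets up local variables, and at the end it inspects nfgenmsg->res_id. Switching back to userspace to see where nfgenmsg comes from, it is easy to spot that nftnl_batch_build_hdr above already fills in this structure and sets its res_id to NFNL_SUBSYS_NFTABLES.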
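The res_id handling at the end of nfnetlink_rcv_skb_batch can be sketched in isolation. This is only a model under my own assumptions: demo_decode_res_id and the DEMO_ constant are hypothetical names, and the value 10 mirrors NFNL_SUBSYS_NFTABLES.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define DEMO_SUBSYS_NFTABLES 10 /* mirrors NFNL_SUBSYS_NFTABLES */

/* Hypothetical helper: recover the subsystem ID from nfgenmsg->res_id.
 * Old nft stored it in host byte order, other senders use network order. */
static uint16_t demo_decode_res_id(uint16_t wire_res_id)
{
    if (wire_res_id == DEMO_SUBSYS_NFTABLES)
        return DEMO_SUBSYS_NFTABLES;  /* host-order value, accepted as-is */
    return ntohs(wire_res_id);        /* network-order value */
}

/* Both encodings of 10 resolve to the nftables subsystem. */
static void demo_res_id_example(void)
{
    assert(demo_decode_res_id(10) == DEMO_SUBSYS_NFTABLES);
    assert(demo_decode_res_id(htons(10)) == DEMO_SUBSYS_NFTABLES);
}
```

The point of the special case is that a batch header built with the host-order assignment shown in nftnl_batch_build_hdr is still routed to nf_tables.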

EXPORT_SYMBOL struct nlmsghdr *mnl_nlmsg_put_header(void *buf)
{
int len = MNL_ALIGN(sizeof(struct nlmsghdr));
struct nlmsghdr *nlh = buf;

memset(buf, 0, len);
nlh->nlmsg_len = len;
return nlh;
}

EXPORT_SYMBOL void *mnl_nlmsg_put_extra_header(struct nlmsghdr *nlh,
size_t size)
{
char *ptr = (char *)nlh + nlh->nlmsg_len;
size_t len = MNL_ALIGN(size);
nlh->nlmsg_len += len;
memset(ptr, 0, len);
return ptr;
}

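From this we can see that the nfgenmsg structure sits immediately after nlmsghdr (the memory layout seen earlier shows the same thing, because skb->data is copied straight from kmsg->msg_iter).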
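To make the adjacency concrete, here is a small sketch that rebuilds the two helpers on local copies of the structures. The demo_ names are mine and the structs are simplified mirrors, so treat it as an illustration of the layout rather than the libmnl code itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the two headers, so the sketch needs no kernel headers. */
struct demo_nlmsghdr {
    uint32_t nlmsg_len;
    uint16_t nlmsg_type;
    uint16_t nlmsg_flags;
    uint32_t nlmsg_seq;
    uint32_t nlmsg_pid;
};

struct demo_nfgenmsg {
    uint8_t  nfgen_family;
    uint8_t  version;
    uint16_t res_id;
};

#define DEMO_ALIGN(len) (((len) + 3) & ~(size_t)3) /* 4-byte netlink alignment */

/* Write an empty header at buf and return it (like mnl_nlmsg_put_header). */
static struct demo_nlmsghdr *demo_put_header(void *buf)
{
    struct demo_nlmsghdr *nlh = buf;

    assert(buf != NULL);
    memset(nlh, 0, DEMO_ALIGN(sizeof(*nlh)));
    nlh->nlmsg_len = (uint32_t)DEMO_ALIGN(sizeof(*nlh));
    return nlh;
}

/* Reserve size bytes right after the current end of the message. */
static void *demo_put_extra_header(struct demo_nlmsghdr *nlh, size_t size)
{
    char *ptr = (char *)nlh + nlh->nlmsg_len;

    nlh->nlmsg_len += (uint32_t)DEMO_ALIGN(size);
    memset(ptr, 0, DEMO_ALIGN(size));
    return ptr;
}
```

With a 16-byte nlmsghdr and a 4-byte nfgenmsg, the extra header lands at offset 16 and the message length becomes 20.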

static inline void *nlmsg_data(const struct nlmsghdr *nlh)
{
return (unsigned char *) nlh + NLMSG_HDRLEN;
}

The kernel likewise obtains the nfgenmsg structure by taking the address of nlh and adding the header size (NLMSG_HDRLEN).

/* No enum here, otherwise __stringify() trick of MODULE_ALIAS_NFNL_SUBSYS()
* won't work anymore */
#define NFNL_SUBSYS_NONE 0
#define NFNL_SUBSYS_CTNETLINK 1
#define NFNL_SUBSYS_CTNETLINK_EXP 2
#define NFNL_SUBSYS_QUEUE 3
#define NFNL_SUBSYS_ULOG 4
#define NFNL_SUBSYS_OSF 5
#define NFNL_SUBSYS_IPSET 6
#define NFNL_SUBSYS_ACCT 7
#define NFNL_SUBSYS_CTNETLINK_TIMEOUT 8
#define NFNL_SUBSYS_CTHELPER 9
#define NFNL_SUBSYS_NFTABLES 10
#define NFNL_SUBSYS_NFT_COMPAT 11
#define NFNL_SUBSYS_COUNT 12

These are the values of the subsystem ID macros.

static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh,
u16 subsys_id, u32 genid)
{
struct sk_buff *oskb = skb;
struct net *net = sock_net(skb->sk);
const struct nfnetlink_subsystem *ss;
const struct nfnl_callback *nc;
struct netlink_ext_ack extack;
LIST_HEAD(err_list);
u32 status;
int err;

if (subsys_id >= NFNL_SUBSYS_COUNT)
return netlink_ack(skb, nlh, -EINVAL, NULL);
replay:
status = 0;
replay_abort:
skb = netlink_skb_clone(oskb, GFP_KERNEL);
if (!skb)
return netlink_ack(oskb, nlh, -ENOMEM, NULL);

nfnl_lock(subsys_id);
ss = nfnl_dereference_protected(subsys_id);
// ...... check subsystem

nfnl_unlock(subsys_id);

while (skb->len >= nlmsg_total_size(0)) {
int msglen, type;

// ......

memset(&extack, 0, sizeof(extack));
nlh = nlmsg_hdr(skb);
err = 0;

if (nlh->nlmsg_len < NLMSG_HDRLEN ||
skb->len < nlh->nlmsg_len ||
nlmsg_len(nlh) < sizeof(struct nfgenmsg)) {
nfnl_err_reset(&err_list);
status |= NFNL_BATCH_FAILURE;
goto done;
}

/* Only requests are handled by the kernel */
if (!(nlh->nlmsg_flags & NLM_F_REQUEST)) {
err = -EINVAL;
goto ack;
}

type = nlh->nlmsg_type;
if (type == NFNL_MSG_BATCH_BEGIN) {
/* Malformed: Batch begin twice */
nfnl_err_reset(&err_list);
status |= NFNL_BATCH_FAILURE;
goto done;
} else if (type == NFNL_MSG_BATCH_END) {
status |= NFNL_BATCH_DONE;
goto done;
} else if (type < NLMSG_MIN_TYPE) {
err = -EINVAL;
goto ack;
}

/* We only accept a batch with messages for the same
* subsystem.
*/
if (NFNL_SUBSYS_ID(type) != subsys_id) {
err = -EINVAL;
goto ack;
}

nc = nfnetlink_find_client(type, ss);
if (!nc) {
err = -EINVAL;
goto ack;
}

{
int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type);
struct nlattr *cda[NFNL_MAX_ATTR_COUNT + 1];
struct nlattr *attr = (void *)nlh + min_len;
int attrlen = nlh->nlmsg_len - min_len;

/* Sanity-check NFTA_MAX_ATTR */
if (ss->cb[cb_id].attr_count > NFNL_MAX_ATTR_COUNT) {
err = -ENOMEM;
goto ack;
}

err = nla_parse_deprecated(cda,
ss->cb[cb_id].attr_count,
attr, attrlen,
ss->cb[cb_id].policy, NULL);
if (err < 0)
goto ack;

if (nc->call_batch) {
err = nc->call_batch(net, net->nfnl, skb, nlh,
(const struct nlattr **)cda,
&extack);
}

/* The lock was released to autoload some module, we
* have to abort and start from scratch using the
* original skb.
*/
if (err == -EAGAIN) {
status |= NFNL_BATCH_REPLAY;
goto done;
}
}
ack:
// ...... out
}
done:
// ...... out

nfnl_err_deliver(&err_list, oskb);
kfree_skb(skb);
module_put(ss->owner);
}

The function first validates subsys_id and then looks up the corresponding subsystem with nfnl_dereference_protected.

static const struct nfnetlink_subsystem nf_tables_subsys = {
.name = "nf_tables",
.subsys_id = NFNL_SUBSYS_NFTABLES,
.cb_count = NFT_MSG_MAX,
.cb = nf_tables_cb,
.commit = nf_tables_commit,
.abort = nf_tables_abort,
.cleanup = nf_tables_cleanup,
.valid_genid = nf_tables_valid_genid,
.owner = THIS_MODULE,
};

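The loop then runs for as long as the skb still holds at least one netlink header. For each message it checks the length of nlh and then its type. Because skb_pull was executed earlier and already skipped the batch-begin message, the type must not be NFNL_MSG_BATCH_BEGIN again here.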

static inline void *__skb_pull(struct sk_buff *skb, unsigned int len)
{
skb->len -= len;
BUG_ON(skb->len < skb->data_len);
return skb->data += len;
}

static inline void *skb_pull_inline(struct sk_buff *skb, unsigned int len)
{
return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
}

void *skb_pull(struct sk_buff *skb, unsigned int len)
{
return skb_pull_inline(skb, len);
}
EXPORT_SYMBOL(skb_pull);

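skb_pull advances data past the bytes it consumed, which is why that check exists. Next, nfnetlink_find_client is called to find the callback entry that the message targets.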

static inline const struct nfnl_callback *
nfnetlink_find_client(u16 type, const struct nfnetlink_subsystem *ss)
{
u8 cb_id = NFNL_MSG_TYPE(type);

if (cb_id >= ss->cb_count)
return NULL;

return &ss->cb[cb_id];
}

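The lookup is keyed on type and the subsystem. The subsystem is already known, so the interesting question is where type comes from. Back to userspace.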

struct nftnl_table * table = nftnl_table_alloc();
nftnl_table_set_str(table, NFTNL_TABLE_NAME, table_name);
nftnl_table_set_u32(table, NFTNL_TABLE_FLAGS, 0);
// ......
mnl_nlmsg_batch_next(batch);

struct nlmsghdr * nlh;
int table_seq = seq;
nlh = nftnl_table_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
NFT_MSG_NEWTABLE, family, NLM_F_CREATE|NLM_F_ACK, seq++);
nftnl_table_nlmsg_build_payload(nlh, table);
// ......

Let's look at how one of these nlh headers is built.

EXPORT_SYMBOL bool mnl_nlmsg_batch_next(struct mnl_nlmsg_batch *b)
{
struct nlmsghdr *nlh = b->cur;

if (b->buflen + nlh->nlmsg_len > b->limit) {
b->overflow = true;
return false;
}
b->cur = b->buf + b->buflen + nlh->nlmsg_len;
b->buflen += nlh->nlmsg_len;
return true;
}

In other words, it moves the cur pointer forward past the message that was just built.
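The cursor movement is easy to model on its own. The sketch below is a simplified stand-in, not libmnl: the demo_ names are hypothetical, and the message length is passed in as a parameter instead of being read from nlh->nlmsg_len at cur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct mnl_nlmsg_batch. */
struct demo_batch {
    char  *buf;     /* start of the caller-provided buffer */
    size_t limit;   /* capacity of buf in bytes */
    size_t buflen;  /* bytes already committed to the batch */
    char  *cur;     /* where the next message is being built */
};

static void demo_batch_start(struct demo_batch *b, void *buf, size_t limit)
{
    assert(b != NULL && buf != NULL);
    b->buf = buf;
    b->limit = limit;
    b->buflen = 0;
    b->cur = buf;
}

/* Commit the message at cur, whose total length is msg_len, and advance. */
static bool demo_batch_next(struct demo_batch *b, uint32_t msg_len)
{
    if (b->buflen + msg_len > b->limit)
        return false;              /* message would not fit in the buffer */
    b->buflen += msg_len;
    b->cur = b->buf + b->buflen;   /* next message starts right after it */
    return true;
}
```

Each build step in the exploit code therefore writes at mnl_nlmsg_batch_current and then calls mnl_nlmsg_batch_next, so the messages end up back to back in one buffer.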

struct nlmsghdr *nftnl_nlmsg_build_hdr(char *buf, uint16_t cmd, uint16_t family,
uint16_t type, uint32_t seq)
{
struct nlmsghdr *nlh;
struct nfgenmsg *nfh;

nlh = mnl_nlmsg_put_header(buf);
nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | cmd;
nlh->nlmsg_flags = NLM_F_REQUEST | type;
nlh->nlmsg_seq = seq;

nfh = mnl_nlmsg_put_extra_header(nlh, sizeof(struct nfgenmsg));
nfh->nfgen_family = family;
nfh->version = NFNETLINK_V0;
nfh->res_id = 0;

return nlh;
}
EXPORT_SYMBOL(nftnl_nlmsg_build_hdr);

#define nftnl_table_nlmsg_build_hdr nftnl_nlmsg_build_hdr

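This builds one nlh and one nfgenmsg. It also shows clearly that the low byte of the type is given by cmd, which here is NFT_MSG_NEWTABLE, while the high byte carries the subsystem ID.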
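As a quick illustration of that encoding, the following sketch packs and unpacks nlmsg_type. The helper names are mine; the masks follow the same split that NFNL_SUBSYS_ID and NFNL_MSG_TYPE use, and the numeric constants mirror the definitions quoted in this article.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SUBSYS_NFTABLES 10 /* NFNL_SUBSYS_NFTABLES */
#define DEMO_MSG_NEWTABLE     0 /* NFT_MSG_NEWTABLE */
#define DEMO_MSG_NEWSET       9 /* NFT_MSG_NEWSET */

/* High byte: subsystem ID. Low byte: command within that subsystem. */
static uint16_t demo_make_type(uint8_t subsys, uint8_t cmd)
{
    return (uint16_t)((subsys << 8) | cmd);
}

static uint8_t demo_subsys_id(uint16_t type)
{
    return (uint8_t)((type & 0xff00) >> 8);
}

static uint8_t demo_msg_type(uint16_t type)
{
    return (uint8_t)(type & 0x00ff);
}

/* Example: a NEWSET request for nf_tables carries type 0x0a09. */
static void demo_type_example(void)
{
    uint16_t type = demo_make_type(DEMO_SUBSYS_NFTABLES, DEMO_MSG_NEWSET);

    assert(type == 0x0a09);
    assert(demo_subsys_id(type) == DEMO_SUBSYS_NFTABLES);
    assert(demo_msg_type(type) == DEMO_MSG_NEWSET);
}
```

This is why the kernel can first reject a message whose subsystem byte differs from the batch's subsys_id and only afterwards use the low byte as an index into the callback table.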

static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {
[NFT_MSG_NEWTABLE] = {
.call_batch = nf_tables_newtable,
.attr_count = NFTA_TABLE_MAX,
.policy = nft_table_policy,
},
[NFT_MSG_GETTABLE] = {
.call_rcu = nf_tables_gettable,
.attr_count = NFTA_TABLE_MAX,
.policy = nft_table_policy,
},
[NFT_MSG_DELTABLE] = {
.call_batch = nf_tables_deltable,
.attr_count = NFTA_TABLE_MAX,
.policy = nft_table_policy,
},
// ......
};

The callback entries that the lookup returns have the form shown above.

enum nf_tables_msg_types {
NFT_MSG_NEWTABLE,
NFT_MSG_GETTABLE,
NFT_MSG_DELTABLE,
NFT_MSG_NEWCHAIN,
NFT_MSG_GETCHAIN,
NFT_MSG_DELCHAIN,
NFT_MSG_NEWRULE,
NFT_MSG_GETRULE,
NFT_MSG_DELRULE,
NFT_MSG_NEWSET,
NFT_MSG_GETSET,
NFT_MSG_DELSET,
NFT_MSG_NEWSETELEM,
NFT_MSG_GETSETELEM,
NFT_MSG_DELSETELEM,
NFT_MSG_NEWGEN,
NFT_MSG_GETGEN,
NFT_MSG_TRACE,
NFT_MSG_NEWOBJ,
NFT_MSG_GETOBJ,
NFT_MSG_DELOBJ,
NFT_MSG_GETOBJ_RESET,
NFT_MSG_NEWFLOWTABLE,
NFT_MSG_GETFLOWTABLE,
NFT_MSG_DELFLOWTABLE,
NFT_MSG_MAX,
};

This is the corresponding enum. Back to the kernel.

nc->call_batch(net, net->nfnl, skb, nlh,
(const struct nlattr **)cda,
&extack);

This is where the callback is finally invoked.
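Putting the lookup and the call together, the dispatch can be sketched as a bounds-checked table of function pointers. Everything named demo_ is hypothetical and the handler signature is heavily simplified; only the overall shape follows nfnetlink_find_client and the call_batch invocation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*demo_handler)(const char *payload);

struct demo_callback { demo_handler call_batch; };

struct demo_subsystem {
    uint8_t cb_count;
    const struct demo_callback *cb;
};

static int demo_newtable(const char *payload)
{
    return payload ? 0 : -EINVAL;
}

static const struct demo_callback demo_cb[3] = {
    [0] = { .call_batch = demo_newtable }, /* stands in for NFT_MSG_NEWTABLE */
    /* slots 1 and 2 have no batch handler */
};

static const struct demo_subsystem demo_ss = { .cb_count = 3, .cb = demo_cb };

/* Same shape as nfnetlink_find_client: the low byte indexes the table. */
static const struct demo_callback *
demo_find_client(uint16_t type, const struct demo_subsystem *ss)
{
    uint8_t cb_id = type & 0x00ff;

    assert(ss != NULL);
    return cb_id < ss->cb_count ? &ss->cb[cb_id] : NULL;
}

/* Returns the handler result, -EINVAL when no entry matches, and 0 when
 * the entry has no call_batch handler (it is simply not called). */
static int demo_dispatch(uint16_t type, const char *payload)
{
    const struct demo_callback *nc = demo_find_client(type, &demo_ss);

    if (!nc)
        return -EINVAL;
    return nc->call_batch ? nc->call_batch(payload) : 0;
}
```

The bounds check on cb_id is what keeps an arbitrary low byte in nlmsg_type from indexing past the end of the callback array.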

nftables operations and their kernel implementation

Configuring tables

Adding a table in nftables is simple:

nft add table ip filter

The command above adds a table.

static int nf_tables_newtable(struct net *net, struct sock *nlsk,
struct sk_buff *skb, const struct nlmsghdr *nlh,
const struct nlattr * const nla[],
struct netlink_ext_ack *extack)
{
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
u8 genmask = nft_genmask_next(net);
int family = nfmsg->nfgen_family;
const struct nlattr *attr;
struct nft_table *table;
struct nft_ctx ctx;
u32 flags = 0;
int err;

lockdep_assert_held(&net->nft.commit_mutex);
attr = nla[NFTA_TABLE_NAME];
table = nft_table_lookup(net, attr, family, genmask);
if (IS_ERR(table)) {
if (PTR_ERR(table) != -ENOENT)
return PTR_ERR(table);
} else {
if (nlh->nlmsg_flags & NLM_F_EXCL) {
NL_SET_BAD_ATTR(extack, attr);
return -EEXIST;
}
if (nlh->nlmsg_flags & NLM_F_REPLACE)
return -EOPNOTSUPP;

nft_ctx_init(&ctx, net, skb, nlh, family, table, NULL, nla);
return nf_tables_updtable(&ctx);
}

if (nla[NFTA_TABLE_FLAGS]) {
flags = ntohl(nla_get_be32(nla[NFTA_TABLE_FLAGS]));
if (flags & ~NFT_TABLE_F_DORMANT)
return -EINVAL;
}

err = -ENOMEM;
table = kzalloc(sizeof(*table), GFP_KERNEL);
if (table == NULL)
goto err_kzalloc;

table->name = nla_strdup(attr, GFP_KERNEL);
if (table->name == NULL)
goto err_strdup;

if (nla[NFTA_TABLE_USERDATA]) {
table->udata = nla_memdup(nla[NFTA_TABLE_USERDATA], GFP_KERNEL);
if (table->udata == NULL)
goto err_table_udata;

table->udlen = nla_len(nla[NFTA_TABLE_USERDATA]);
}

err = rhltable_init(&table->chains_ht, &nft_chain_ht_params);
if (err)
goto err_chain_ht;

INIT_LIST_HEAD(&table->chains);
INIT_LIST_HEAD(&table->sets);
INIT_LIST_HEAD(&table->objects);
INIT_LIST_HEAD(&table->flowtables);
table->family = family;
table->flags = flags;
table->handle = ++table_handle;

nft_ctx_init(&ctx, net, skb, nlh, family, table, NULL, nla);
err = nft_trans_table_add(&ctx, NFT_MSG_NEWTABLE);
if (err < 0)
goto err_trans;

list_add_tail_rcu(&table->list, &net->nft.tables);
return 0;
// error exit
}

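The function first calls nft_table_lookup to search for an existing table. Its arguments are net, attr, family, and genmask. net is the network namespace the request came from (init_net when the sender is in the initial namespace), and can be thought of as a global here. attr is taken from nla[NFTA_TABLE_NAME].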

/*
* <------- NLA_HDRLEN ------> <-- NLA_ALIGN(payload)-->
* +---------------------+- - -+- - - - - - - - - -+- - -+
* | Header | Pad | Payload | Pad |
* | (struct nlattr) | ing | | ing |
* +---------------------+- - -+- - - - - - - - - -+- - -+
* <-------------- nlattr->nla_len -------------->
*/

struct nlattr {
__u16 nla_len;
__u16 nla_type;
};

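As the diagram shows, an nlattr consists of three parts: header, payload, and padding. nlattr->nla_len covers the header plus the payload, not the trailing padding, and the structure itself has no member that represents the payload.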

err = nla_parse_deprecated(cda,
ss->cb[cb_id].attr_count,
attr, attrlen,
ss->cb[cb_id].policy, NULL);

Following the call chain, it is clear that nla comes from the call above in the outer function nfnetlink_rcv_batch.

int __nla_parse(struct nlattr **tb, int maxtype,
const struct nlattr *head, int len,
const struct nla_policy *policy, unsigned int validate,
struct netlink_ext_ack *extack)
{
return __nla_validate_parse(head, len, maxtype, policy, validate,
extack, tb, 0);
}
EXPORT_SYMBOL(__nla_parse);

static inline int nla_parse_deprecated(struct nlattr **tb, int maxtype,
const struct nlattr *head, int len,
const struct nla_policy *policy,
struct netlink_ext_ack *extack)
{
return __nla_parse(tb, maxtype, head, len, policy,
NL_VALIDATE_LIBERAL, extack);
}
static int __nla_validate_parse(const struct nlattr *head, int len, int maxtype,
const struct nla_policy *policy,
unsigned int validate,
struct netlink_ext_ack *extack,
struct nlattr **tb, unsigned int depth)
{
const struct nlattr *nla;
int rem;

if (depth >= MAX_POLICY_RECURSION_DEPTH) {
NL_SET_ERR_MSG(extack,
"allowed policy recursion depth exceeded");
return -EINVAL;
}

if (tb)
memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));

nla_for_each_attr(nla, head, len, rem) {
u16 type = nla_type(nla);

if (type == 0 || type > maxtype) {
if (validate & NL_VALIDATE_MAXTYPE) {
NL_SET_ERR_MSG_ATTR(extack, nla,
"Unknown attribute type");
return -EINVAL;
}
continue;
}
if (policy) {
int err = validate_nla(nla, maxtype, policy,
validate, extack, depth);

if (err < 0)
return err;
}

if (tb)
tb[type] = (struct nlattr *)nla;
}

if (unlikely(rem > 0)) {
pr_warn_ratelimited("netlink: %d bytes leftover after parsing attributes in process `%s'.\n",
rem, current->comm);
NL_SET_ERR_MSG(extack, "bytes leftover after parsing attributes");
if (validate & NL_VALIDATE_TRAILING)
return -EINVAL;
}

return 0;
}

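The analysis shows that the resulting nla array is simply attr after it has been validated: each tb[type] entry points at the attribute of that type inside the message.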
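The core of that loop can be sketched without any kernel headers. This is a reduced model under my own assumptions: demo_nla_parse is a hypothetical name, policy validation is left out, and the type is used as-is without masking any flag bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct demo_nlattr { uint16_t nla_len; uint16_t nla_type; };

#define DEMO_NLA_ALIGN(len) (((len) + 3) & ~3)
#define DEMO_NLA_HDRLEN 4

/* Walk the attribute stream in [head, head + len) and store a pointer to each
 * attribute with 0 < type <= maxtype in tb[type]. Returns the leftover bytes. */
static int demo_nla_parse(const unsigned char **tb, int maxtype,
                          const unsigned char *head, int len)
{
    int rem = len;

    assert(tb != NULL && head != NULL && maxtype >= 0);
    memset(tb, 0, sizeof(*tb) * (size_t)(maxtype + 1));
    while (rem >= DEMO_NLA_HDRLEN) {
        struct demo_nlattr hdr;

        memcpy(&hdr, head, sizeof(hdr));  /* alignment-safe header read */
        if (hdr.nla_len < DEMO_NLA_HDRLEN || hdr.nla_len > rem)
            break;                        /* malformed or truncated attribute */
        if (hdr.nla_type != 0 && hdr.nla_type <= maxtype)
            tb[hdr.nla_type] = head;      /* out-of-range types are skipped */
        head += DEMO_NLA_ALIGN(hdr.nla_len);
        rem  -= DEMO_NLA_ALIGN(hdr.nla_len);
    }
    return rem;
}
```

Nothing is copied: the table only holds pointers into the original message, which is why a callback such as nf_tables_newtable can index nla[NFTA_TABLE_NAME] directly.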

int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
struct nlattr *attr = (void *)nlh + min_len;

So what we get in the end is the payload section of the message. Now let's go back to userspace to see where it comes from.

nftnl_table_nlmsg_build_payload(nlh, table);

The userspace code contains this statement, which fills in the payload section of nlh.

void nftnl_set_nlmsg_build_payload(struct nlmsghdr *nlh, struct nftnl_set *s)
{
if (s->flags & (1 << NFTNL_SET_TABLE))
mnl_attr_put_strz(nlh, NFTA_SET_TABLE, s->table);
if (s->flags & (1 << NFTNL_SET_NAME))
mnl_attr_put_strz(nlh, NFTA_SET_NAME, s->name);
if (s->flags & (1 << NFTNL_SET_FLAGS))
mnl_attr_put_u32(nlh, NFTA_SET_FLAGS, htonl(s->set_flags));
if (s->flags & (1 << NFTNL_SET_KEY_TYPE))
mnl_attr_put_u32(nlh, NFTA_SET_KEY_TYPE, htonl(s->key_type));
if (s->flags & (1 << NFTNL_SET_KEY_LEN))
mnl_attr_put_u32(nlh, NFTA_SET_KEY_LEN, htonl(s->key_len));
/* These are only used to map matching -> action (1:1) */
if (s->flags & (1 << NFTNL_SET_DATA_TYPE))
mnl_attr_put_u32(nlh, NFTA_SET_DATA_TYPE, htonl(s->data_type));
if (s->flags & (1 << NFTNL_SET_DATA_LEN))
mnl_attr_put_u32(nlh, NFTA_SET_DATA_LEN, htonl(s->data_len));
if (s->flags & (1 << NFTNL_SET_OBJ_TYPE))
mnl_attr_put_u32(nlh, NFTA_SET_OBJ_TYPE, htonl(s->obj_type));
if (s->flags & (1 << NFTNL_SET_ID))
mnl_attr_put_u32(nlh, NFTA_SET_ID, htonl(s->id));
if (s->flags & (1 << NFTNL_SET_POLICY))
mnl_attr_put_u32(nlh, NFTA_SET_POLICY, htonl(s->policy));
if (s->flags & (1 << NFTNL_SET_DESC_SIZE))
nftnl_set_nlmsg_build_desc_payload(nlh, s);
if (s->flags & (1 << NFTNL_SET_TIMEOUT))
mnl_attr_put_u64(nlh, NFTA_SET_TIMEOUT, htobe64(s->timeout));
if (s->flags & (1 << NFTNL_SET_GC_INTERVAL))
mnl_attr_put_u32(nlh, NFTA_SET_GC_INTERVAL, htonl(s->gc_interval));
if (s->flags & (1 << NFTNL_SET_USERDATA))
mnl_attr_put(nlh, NFTA_SET_USERDATA, s->user.len, s->user.data);
}
EXPORT_SYMBOL(nftnl_set_nlmsg_build_payload);

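The listing above is the set variant, nftnl_set_nlmsg_build_payload; the table variant follows the same pattern. The function emits a different attribute for each bit set in s->flags. Let's first look at where this object comes from.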

struct nftnl_table * table = nftnl_table_alloc();
nftnl_table_set_str(table, NFTNL_TABLE_NAME, table_name);
nftnl_table_set_u32(table, NFTNL_TABLE_FLAGS, 0);

For the table, the corresponding object is created here.

int nftnl_table_set_data(struct nftnl_table *t, uint16_t attr,
const void *data, uint32_t data_len)
{
nftnl_assert_attr_exists(attr, NFTNL_TABLE_MAX);
nftnl_assert_validate(data, nftnl_table_validate, attr, data_len);

switch (attr) {
case NFTNL_TABLE_NAME:
if (t->flags & (1 << NFTNL_TABLE_NAME))
xfree(t->name);

t->name = strdup(data);
if (!t->name)
return -1;
break;
case NFTNL_TABLE_FLAGS:
t->table_flags = *((uint32_t *)data);
break;
case NFTNL_TABLE_FAMILY:
t->family = *((uint32_t *)data);
break;
case NFTNL_TABLE_USE:
t->use = *((uint32_t *)data);
break;
}
t->flags |= (1 << attr);
return 0;
}
EXPORT_SYMBOL(nftnl_table_set_data);

int nftnl_table_set_str(struct nftnl_table *t, uint16_t attr, const char *str)
{
return nftnl_table_set_data(t, attr, str, strlen(str) + 1);
}
EXPORT_SYMBOL(nftnl_table_set_str);

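Taking the string setter as an example (the other setters work the same way), the call ends up in nftnl_table_set_data, which handles each attribute according to its type, stores the value in struct nftnl_table, and records the attribute in t->flags.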
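The flags bookkeeping is the part that matters later, because the payload builder only serializes attributes whose bit is set. A minimal sketch of the idea, with a hypothetical demo_table type that keeps just a name and the bitmask:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { DEMO_TABLE_NAME = 0, DEMO_TABLE_FAMILY, DEMO_TABLE_FLAGS };

struct demo_table {
    char    *name;
    uint32_t flags;  /* bit N set means attribute N currently has a value */
};

/* Store a copy of name and mark the NAME attribute as present. */
static int demo_table_set_name(struct demo_table *t, const char *name)
{
    size_t len;

    assert(t != NULL && name != NULL);
    if (t->flags & (1u << DEMO_TABLE_NAME)) {
        free(t->name);                       /* replace a previous value */
        t->flags &= ~(1u << DEMO_TABLE_NAME);
    }
    len = strlen(name) + 1;
    t->name = malloc(len);
    if (!t->name)
        return -1;
    memcpy(t->name, name, len);
    t->flags |= 1u << DEMO_TABLE_NAME;
    return 0;
}

static bool demo_table_is_set(const struct demo_table *t, unsigned attr)
{
    return (t->flags & (1u << attr)) != 0;
}
```

An attribute that was never set keeps its bit clear, so it never reaches the netlink message at all.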

enum nftnl_table_attr {
NFTNL_TABLE_NAME = 0,
NFTNL_TABLE_FAMILY,
NFTNL_TABLE_FLAGS,
NFTNL_TABLE_USE,
__NFTNL_TABLE_MAX
};
enum nftnl_set_attr {
NFTNL_SET_TABLE,
NFTNL_SET_NAME,
NFTNL_SET_FLAGS,
NFTNL_SET_KEY_TYPE,
NFTNL_SET_KEY_LEN,
NFTNL_SET_DATA_TYPE,
NFTNL_SET_DATA_LEN,
NFTNL_SET_FAMILY,
NFTNL_SET_ID,
NFTNL_SET_POLICY,
NFTNL_SET_DESC_SIZE,
NFTNL_SET_TIMEOUT,
NFTNL_SET_GC_INTERVAL,
NFTNL_SET_USERDATA,
NFTNL_SET_OBJ_TYPE,
__NFTNL_SET_MAX
};

Back in nftnl_set_nlmsg_build_payload: given these definitions, the name attributes end up going through mnl_attr_put_strz.

EXPORT_SYMBOL void mnl_attr_put(struct nlmsghdr *nlh, uint16_t type,
size_t len, const void *data)
{
struct nlattr *attr = mnl_nlmsg_get_payload_tail(nlh);
uint16_t payload_len = MNL_ALIGN(sizeof(struct nlattr)) + len;
int pad;

attr->nla_type = type;
attr->nla_len = payload_len;
memcpy(mnl_attr_get_payload(attr), data, len);
pad = MNL_ALIGN(len) - len;
if (pad > 0)
memset(mnl_attr_get_payload(attr) + len, 0, pad);

nlh->nlmsg_len += MNL_ALIGN(payload_len);
}

EXPORT_SYMBOL void mnl_attr_put_strz(struct nlmsghdr *nlh, uint16_t type,
const char *data)
{
mnl_attr_put(nlh, type, strlen(data) + 1, data);
}

Here we can see that attr is obtained from mnl_nlmsg_get_payload_tail.

EXPORT_SYMBOL void *mnl_nlmsg_get_payload_tail(const struct nlmsghdr *nlh)
{
return (void *)nlh + MNL_ALIGN(nlh->nlmsg_len);
}

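This returns the address just past the current end of the message, that is, the tail of the payload that follows the nlh header, where the next attribute will be written.

The function called during the memcpy is: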

EXPORT_SYMBOL void *mnl_attr_get_payload(const struct nlattr *attr)
{
return (void *)attr + MNL_ATTR_HDRLEN;
}

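This returns the payload that follows the nlattr header, so there are effectively two nested payloads: the payload of the netlink message, and inside it the payload of each attribute.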
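The two layers are easier to see with concrete offsets. The sketch below appends one attribute to a message buffer in the same way; demo_attr_put is a hypothetical stand-in that takes the message length by pointer instead of an nlmsghdr, and it assumes the caller provides enough room.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ALIGN(len) (((len) + 3) & ~3)
#define DEMO_ATTR_HDRLEN 4

struct demo_nlattr { uint16_t nla_len; uint16_t nla_type; };

/* Append one attribute at the tail of a message whose current length is
 * *msg_len, then grow *msg_len by the aligned attribute size. */
static void demo_attr_put(unsigned char *msg, uint32_t *msg_len,
                          uint16_t type, uint16_t len, const void *data)
{
    unsigned char *attr;
    struct demo_nlattr hdr = { (uint16_t)(DEMO_ATTR_HDRLEN + len), type };

    assert(msg != NULL && msg_len != NULL && data != NULL);
    attr = msg + DEMO_ALIGN(*msg_len);              /* tail of message payload */
    memcpy(attr, &hdr, sizeof(hdr));
    memcpy(attr + DEMO_ATTR_HDRLEN, data, len);     /* attribute payload */
    memset(attr + DEMO_ATTR_HDRLEN + len, 0,
           (size_t)(DEMO_ALIGN(len) - len));        /* zero the padding */
    *msg_len += (uint32_t)DEMO_ALIGN(hdr.nla_len);
}
```

For a message that already holds nlmsghdr (16 bytes) and nfgenmsg (4 bytes), putting the 7-byte string "filter" (including its terminator) writes an attribute header at offset 20 with nla_len 11, the string at offset 24, one byte of padding, and leaves the message length at 32.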

static struct nft_table *nft_table_lookup(const struct net *net,
const struct nlattr *nla,
u8 family, u8 genmask)
{
struct nft_table *table;

if (nla == NULL)
return ERR_PTR(-EINVAL);

list_for_each_entry_rcu(table, &net->nft.tables, list,
lockdep_is_held(&net->nft.commit_mutex)) {
if (!nla_strcmp(nla, table->name) &&
table->family == family &&
nft_active_genmask(table, genmask))
return table;
}

return ERR_PTR(-ENOENT);
}

Back in the kernel, nla here looks roughly like the following structure:

struct my_nlattr {
__u16 nla_len;
__u16 nla_type;
unsigned char data[]; /* payload, nla_len - NLA_HDRLEN bytes */
};

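The lookup then uses family, name, and genmask to decide whether the table can be found in net. If it can, the table already exists and what remains is to update it. Here is a brief look at how the update works and what it is for.

In nfnetlink_rcv_batch, the final done branch contains: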

if (status & NFNL_BATCH_REPLAY) {
// ......
} else if (status == NFNL_BATCH_DONE) {
err = ss->commit(net, oskb);
if (err == -EAGAIN) {
status |= NFNL_BATCH_REPLAY;
goto done;
} else if (err) {
ss->abort(net, oskb, NFNL_ABORT_NONE);
netlink_ack(oskb, nlmsg_hdr(oskb), err, NULL);
}
} else {
// ......
}
if (ss->cleanup)
ss->cleanup(net);

nfnl_err_deliver(&err_list, oskb);
kfree_skb(skb);
module_put(ss->owner);

Which branch is taken depends on status.

void nftnl_batch_end(char *buf, uint32_t seq)
{
nftnl_batch_build_hdr(buf, NFNL_MSG_BATCH_END, seq);
}
EXPORT_SYMBOL(nftnl_batch_end);

In our case, a batch that ends normally is terminated with a NFNL_MSG_BATCH_END message.

type = nlh->nlmsg_type;
if (type == NFNL_MSG_BATCH_BEGIN) {
/* Malformed: Batch begin twice */
nfnl_err_reset(&err_list);
status |= NFNL_BATCH_FAILURE;
goto done;
} else if (type == NFNL_MSG_BATCH_END) {
status |= NFNL_BATCH_DONE;
goto done;
} else if (type < NLMSG_MIN_TYPE) {
err = -EINVAL;
goto ack;
}

nfnetlink_rcv_batch updates status according to the type, so a batch that ends normally reaches the status == NFNL_BATCH_DONE branch and executes ss->commit.
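The way the message types drive status can be modeled as a tiny scan over the types that follow the initial batch-begin message. This is a sketch under assumptions: the demo_ names are mine, the numeric values mirror NLMSG_MIN_TYPE and the two batch markers as I understand the uapi headers, and per-message error handling in the elided ack path is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MIN_TYPE    0x10                /* NLMSG_MIN_TYPE */
#define DEMO_BATCH_BEGIN DEMO_MIN_TYPE       /* NFNL_MSG_BATCH_BEGIN */
#define DEMO_BATCH_END   (DEMO_MIN_TYPE + 1) /* NFNL_MSG_BATCH_END */

enum { DEMO_FAILURE = 1 << 0, DEMO_DONE = 1 << 1 };

/* Return the status the loop ends with; *handled counts the messages that
 * would have been passed on to a subsystem callback. */
static unsigned demo_batch_status(const uint16_t *types, size_t n,
                                  size_t *handled)
{
    assert(types != NULL && handled != NULL);
    *handled = 0;
    for (size_t i = 0; i < n; i++) {
        if (types[i] == DEMO_BATCH_BEGIN)
            return DEMO_FAILURE;   /* malformed: batch begin seen twice */
        if (types[i] == DEMO_BATCH_END)
            return DEMO_DONE;      /* only this status leads to commit */
        (*handled)++;
    }
    return 0;                      /* no batch end: nothing is committed */
}
```

Only the exact value DEMO_DONE corresponds to the commit branch, which matches the status == NFNL_BATCH_DONE comparison in the done code.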

static int nf_tables_commit(struct net *net, struct sk_buff *skb)
{
struct nft_trans *trans, *next;
struct nft_trans_elem *te;
struct nft_chain *chain;
struct nft_table *table;
int err;

if (list_empty(&net->nft.commit_list)) {
mutex_unlock(&net->nft.commit_mutex);
return 0;
}

/* 0. Validate ruleset, otherwise roll back for error reporting. */
if (nf_tables_validate(net) < 0)
return -EAGAIN;

err = nft_flow_rule_offload_commit(net);
if (err < 0)
return err;

/* 1. Allocate space for next generation rules_gen_X[] */
list_for_each_entry_safe (trans, next, &net->nft.commit_list, list) {
int ret;

if (trans->msg_type == NFT_MSG_NEWRULE ||
trans->msg_type == NFT_MSG_DELRULE) {
chain = trans->ctx.chain;

ret = nf_tables_commit_chain_prepare(net, chain);
if (ret < 0) {
nf_tables_commit_chain_prepare_cancel(net);
return ret;
}
}
}

/* step 2. Make rules_gen_X visible to packet path */
list_for_each_entry (table, &net->nft.tables, list) {
list_for_each_entry (chain, &table->chains, list)
nf_tables_commit_chain(net, chain);
}

/*
* Bump generation counter, invalidate any dump in progress.
* Cannot fail after this point.
*/
while (++net->nft.base_seq == 0)
;

/* step 3. Start new generation, rules_gen_X now in use. */
net->nft.gencursor = nft_gencursor_next(net);

list_for_each_entry_safe (trans, next, &net->nft.commit_list, list) {
switch (trans->msg_type) {
case NFT_MSG_NEWTABLE:
if (nft_trans_table_update(trans)) {
if (!nft_trans_table_enable(trans)) {
nf_tables_table_disable(
net, trans->ctx.table);
trans->ctx.table->flags |=
NFT_TABLE_F_DORMANT;
}
} else {
nft_clear(net, trans->ctx.table);
}
nf_tables_table_notify(&trans->ctx, NFT_MSG_NEWTABLE);
nft_trans_destroy(trans);
break;
case NFT_MSG_DELTABLE:
// ......
}
}

nft_commit_notify(net, NETLINK_CB(skb).portid);
nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN);
nf_tables_commit_release(net);

return 0;
}

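The function walks net->nft.commit_list several times. The part to focus on is the last loop, which handles each trans according to its message type. For the current operation, NFT_MSG_NEWTABLE, it checks whether the trans is an update and whether the table is to be enabled; if the table is not to be enabled, it calls nf_tables_table_disable.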

static void nft_table_disable(struct net *net, struct nft_table *table, u32 cnt)
{
struct nft_chain *chain;
u32 i = 0;

list_for_each_entry (chain, &table->chains, list) {
if (!nft_is_active_next(net, chain))
continue;
if (!nft_is_base_chain(chain))
continue;

if (cnt && i++ == cnt)
break;

nf_tables_unregister_hook(net, table, chain);
}
}

static void nf_tables_table_disable(struct net *net, struct nft_table *table)
{
nft_table_disable(net, table, 0);
}

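This walks the chains of the table, skips chains that are not active in the next generation and chains that are not base chains, and unregisters the hook of every remaining chain, so those chains can no longer be triggered.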
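The filtering and the cnt limit can be checked with a small model. The demo_chain type below is hypothetical and reduces a chain to three booleans; clearing hooked stands in for nf_tables_unregister_hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_chain { bool active_next; bool is_base; bool hooked; };

/* Unhook every base chain that is active in the next generation.
 * cnt == 0 means "no limit"; otherwise stop after cnt chains. */
static uint32_t demo_table_disable(struct demo_chain *chains, size_t n,
                                   uint32_t cnt)
{
    uint32_t i = 0, unhooked = 0;

    assert(chains != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        if (!chains[k].active_next || !chains[k].is_base)
            continue;              /* inactive or non-base chain: skipped */
        if (cnt && i++ == cnt)
            break;                 /* limit reached */
        chains[k].hooked = false;  /* stands in for unregistering the hook */
        unhooked++;
    }
    return unhooked;
}
```

nf_tables_table_disable passes 0 as the limit, so every qualifying base chain of the table loses its hook.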

trans->ctx.table->flags |=
NFT_TABLE_F_DORMANT;

Finally the table is marked dormant. Back in nf_tables_newtable, the first step of the update path is:

static void nft_ctx_init(struct nft_ctx *ctx, struct net *net,
const struct sk_buff *skb, const struct nlmsghdr *nlh,
u8 family, struct nft_table *table,
struct nft_chain *chain,
const struct nlattr *const *nla)
{
ctx->net = net;
ctx->family = family;
ctx->level = 0;
ctx->table = table;
ctx->chain = chain;
ctx->nla = nla;
ctx->portid = NETLINK_CB(skb).portid;
ctx->report = nlmsg_report(nlh);
ctx->flags = nlh->nlmsg_flags;
ctx->seq = nlh->nlmsg_seq;
}

nft_ctx_init copies all of these values into ctx.

static int nf_tables_updtable(struct nft_ctx *ctx)
{
struct nft_trans *trans;
u32 flags;
int ret = 0;

if (!ctx->nla[NFTA_TABLE_FLAGS])
return 0;

flags = ntohl(nla_get_be32(ctx->nla[NFTA_TABLE_FLAGS]));
if (flags & ~NFT_TABLE_F_DORMANT)
return -EINVAL;

if (flags == ctx->table->flags)
return 0;

trans = nft_trans_alloc(ctx, NFT_MSG_NEWTABLE,
sizeof(struct nft_trans_table));
if (trans == NULL)
return -ENOMEM;

if ((flags & NFT_TABLE_F_DORMANT) &&
!(ctx->table->flags & NFT_TABLE_F_DORMANT)) {
nft_trans_table_enable(trans) = false;
} else if (!(flags & NFT_TABLE_F_DORMANT) &&
ctx->table->flags & NFT_TABLE_F_DORMANT) {
ctx->table->flags &= ~NFT_TABLE_F_DORMANT;
ret = nf_tables_table_enable(ctx->net, ctx->table);
if (ret >= 0)
nft_trans_table_enable(trans) = true;
else
ctx->table->flags |= NFT_TABLE_F_DORMANT;
}
if (ret < 0)
goto err;

nft_trans_table_update(trans) = true;
list_add_tail(&trans->list, &ctx->net->nft.commit_list);
return 0;
err:
nft_trans_destroy(trans);
return ret;
}

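Then nf_tables_updtable is called. It compares the flags passed from userspace with the flags the kernel table currently has to decide whether the table should be made dormant or enabled.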
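The decision itself boils down to a comparison of two flag words. A minimal sketch, assuming NFT_TABLE_F_DORMANT is bit 0 and using hypothetical demo_ names for the possible outcomes:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_F_DORMANT 0x1u /* NFT_TABLE_F_DORMANT */

enum demo_action { DEMO_INVALID = -1, DEMO_NOOP, DEMO_DISABLE, DEMO_ENABLE };

/* Decide what an update does, given the table's current flags and the
 * flags requested from userspace. */
static enum demo_action demo_updtable(uint32_t current, uint32_t requested)
{
    if (requested & ~DEMO_F_DORMANT)
        return DEMO_INVALID;  /* unknown flag bits are rejected */
    if (requested == current)
        return DEMO_NOOP;     /* nothing changes, no transaction is queued */
    if (requested & DEMO_F_DORMANT)
        return DEMO_DISABLE;  /* active -> dormant, applied at commit time */
    return DEMO_ENABLE;       /* dormant -> active, hooks registered now */
}

static void demo_updtable_example(void)
{
    assert(demo_updtable(0, DEMO_F_DORMANT) == DEMO_DISABLE);
    assert(demo_updtable(DEMO_F_DORMANT, 0) == DEMO_ENABLE);
}
```

Note the asymmetry: enabling happens immediately inside nf_tables_updtable, while disabling is only recorded in the transaction and carried out later by nf_tables_commit.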

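Returning to the main flow of nf_tables_newtable: if no matching table is found, a table is allocated with kzalloc, its members are initialized along with chains_ht, the hash table of its chains, and finally it is added to net->nft.tables.

Configuring chains

With the table configuration covered, configuring chains is comparatively simple.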

static int nf_tables_newchain(struct net *net, struct sock *nlsk,
struct sk_buff *skb, const struct nlmsghdr *nlh,
const struct nlattr *const nla[],
struct netlink_ext_ack *extack)
{
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
u8 genmask = nft_genmask_next(net);
int family = nfmsg->nfgen_family;
struct nft_chain *chain = NULL;
const struct nlattr *attr;
struct nft_table *table;
u8 policy = NF_ACCEPT;
struct nft_ctx ctx;
u64 handle = 0;
u32 flags = 0;

lockdep_assert_held(&net->nft.commit_mutex);

table = nft_table_lookup(net, nla[NFTA_CHAIN_TABLE], family, genmask);
if (IS_ERR(table)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_TABLE]);
return PTR_ERR(table);
}

chain = NULL;
attr = nla[NFTA_CHAIN_NAME];

if (nla[NFTA_CHAIN_HANDLE]) {
handle = be64_to_cpu(nla_get_be64(nla[NFTA_CHAIN_HANDLE]));
chain = nft_chain_lookup_byhandle(table, handle, genmask);
if (IS_ERR(chain)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_HANDLE]);
return PTR_ERR(chain);
}
attr = nla[NFTA_CHAIN_HANDLE];
} else if (nla[NFTA_CHAIN_NAME]) {
chain = nft_chain_lookup(net, table, attr, genmask);
if (IS_ERR(chain)) {
if (PTR_ERR(chain) != -ENOENT) {
NL_SET_BAD_ATTR(extack, attr);
return PTR_ERR(chain);
}
chain = NULL;
}
} else if (!nla[NFTA_CHAIN_ID]) {
return -EINVAL;
}

if (nla[NFTA_CHAIN_POLICY]) {
if (chain != NULL && !nft_is_base_chain(chain)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_POLICY]);
return -EOPNOTSUPP;
}

if (chain == NULL && nla[NFTA_CHAIN_HOOK] == NULL) {
NL_SET_BAD_ATTR(extack, nla[NFTA_CHAIN_POLICY]);
return -EOPNOTSUPP;
}

policy = ntohl(nla_get_be32(nla[NFTA_CHAIN_POLICY]));
switch (policy) {
case NF_DROP:
case NF_ACCEPT:
break;
default:
return -EINVAL;
}
}

if (nla[NFTA_CHAIN_FLAGS])
flags = ntohl(nla_get_be32(nla[NFTA_CHAIN_FLAGS]));
else if (chain)
flags = chain->flags;

if (flags & ~NFT_CHAIN_FLAGS)
return -EOPNOTSUPP;

nft_ctx_init(&ctx, net, skb, nlh, family, table, chain, nla);

if (chain != NULL) {
if (nlh->nlmsg_flags & NLM_F_EXCL) {
NL_SET_BAD_ATTR(extack, attr);
return -EEXIST;
}
if (nlh->nlmsg_flags & NLM_F_REPLACE)
return -EOPNOTSUPP;

flags |= chain->flags & NFT_CHAIN_BASE;
return nf_tables_updchain(&ctx, genmask, policy, flags, attr,
extack);
}

return nf_tables_addchain(&ctx, family, genmask, policy, flags);
}

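The function first finds the table with nft_table_lookup and returns immediately if it does not exist.

There are then two ways to find the chain: the first matches directly on the handle, and the second looks the name up in the table's hash table.

After the lookup, the code mostly validates the chain against the supplied attributes; in particular, a policy is only accepted for base chains.

If the chain already exists it is updated through nf_tables_updchain; otherwise nf_tables_addchain adds the new chain.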

static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
u8 policy, u32 flags)
{
const struct nlattr *const *nla = ctx->nla;
struct nft_table *table = ctx->table;
struct nft_base_chain *basechain;
struct nft_stats __percpu *stats;
struct net *net = ctx->net;
char name[NFT_NAME_MAXLEN];
struct nft_trans *trans;
struct nft_chain *chain;
struct nft_rule **rules;
int err;

if (table->use == UINT_MAX)
return -EOVERFLOW;

if (nla[NFTA_CHAIN_HOOK]) {
struct nft_chain_hook hook;

if (flags & NFT_CHAIN_BINDING)
return -EOPNOTSUPP;

err = nft_chain_parse_hook(net, nla, &hook, family, true);
if (err < 0)
return err;

basechain = kzalloc(sizeof(*basechain), GFP_KERNEL);
if (basechain == NULL) {
nft_chain_release_hook(&hook);
return -ENOMEM;
}
chain = &basechain->chain;

if (nla[NFTA_CHAIN_COUNTERS]) {
stats = nft_stats_alloc(nla[NFTA_CHAIN_COUNTERS]);
if (IS_ERR(stats)) {
nft_chain_release_hook(&hook);
kfree(basechain);
return PTR_ERR(stats);
}
rcu_assign_pointer(basechain->stats, stats);
static_branch_inc(&nft_counters_enabled);
}

err = nft_basechain_init(basechain, family, &hook, flags);
if (err < 0) {
nft_chain_release_hook(&hook);
kfree(basechain);
return err;
}
} else {
if (flags & NFT_CHAIN_BASE)
return -EINVAL;
if (flags & NFT_CHAIN_HW_OFFLOAD)
return -EOPNOTSUPP;

chain = kzalloc(sizeof(*chain), GFP_KERNEL);
if (chain == NULL)
return -ENOMEM;

chain->flags = flags;
}
ctx->chain = chain;

INIT_LIST_HEAD(&chain->rules);
chain->handle = nf_tables_alloc_handle(table);
chain->table = table;

if (nla[NFTA_CHAIN_NAME]) {
chain->name = nla_strdup(nla[NFTA_CHAIN_NAME], GFP_KERNEL);
} else {
if (!(flags & NFT_CHAIN_BINDING)) {
err = -EINVAL;
goto err_destroy_chain;
}

snprintf(name, sizeof(name), "__chain%llu", ++chain_id);
chain->name = kstrdup(name, GFP_KERNEL);
}

if (!chain->name) {
err = -ENOMEM;
goto err_destroy_chain;
}

if (nla[NFTA_CHAIN_USERDATA]) {
chain->udata = nla_memdup(nla[NFTA_CHAIN_USERDATA], GFP_KERNEL);
if (chain->udata == NULL) {
err = -ENOMEM;
goto err_destroy_chain;
}
chain->udlen = nla_len(nla[NFTA_CHAIN_USERDATA]);
}

rules = nf_tables_chain_alloc_rules(chain, 0);
if (!rules) {
err = -ENOMEM;
goto err_destroy_chain;
}

*rules = NULL;
rcu_assign_pointer(chain->rules_gen_0, rules);
rcu_assign_pointer(chain->rules_gen_1, rules);

err = nf_tables_register_hook(net, table, chain);
if (err < 0)
goto err_destroy_chain;

trans = nft_trans_chain_add(ctx, NFT_MSG_NEWCHAIN);
if (IS_ERR(trans)) {
err = PTR_ERR(trans);
goto err_unregister_hook;
}

nft_trans_chain_policy(trans) = NFT_CHAIN_POLICY_UNSET;
if (nft_is_base_chain(chain))
nft_trans_chain_policy(trans) = policy;

err = nft_chain_add(table, chain);
if (err < 0) {
nft_trans_destroy(trans);
goto err_unregister_hook;
}

table->use++;

return 0;
err_unregister_hook:
nf_tables_unregister_hook(net, table, chain);
err_destroy_chain:
nf_tables_chain_destroy(ctx);

return err;
}

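The function splits into two cases depending on whether userspace passed a hook (NFTA_CHAIN_HOOK): with a hook it creates a base chain, without one a regular chain.

The rest is initialization of the new chain, including allocating the rules array on the heap.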
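The base-chain branch relies on the generic chain being embedded inside the larger base-chain object (chain = &basechain->chain). Below is a small userspace sketch of that layout. All sketch_ names are hypothetical, the structures are heavily simplified, and setting the base flag at allocation time is a shortcut of the sketch.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_CHAIN_BASE 0x1u

/* Generic chain: the part every chain has. */
struct sketch_chain {
    unsigned int flags;
};

/* Base chain: hook data plus the embedded generic chain. */
struct sketch_base_chain {
    int hooknum;
    struct sketch_chain chain;
};

/* Recover the enclosing base chain from a pointer to its embedded chain. */
static struct sketch_base_chain *sketch_base_chain_of(struct sketch_chain *chain)
{
    return (struct sketch_base_chain *)
        ((char *)chain - offsetof(struct sketch_base_chain, chain));
}

/* Allocate a base chain when a hook is given, a regular chain otherwise. */
static struct sketch_chain *sketch_chain_alloc(int has_hook, int hooknum)
{
    if (has_hook) {
        struct sketch_base_chain *base = calloc(1, sizeof(*base));

        if (base == NULL)
            return NULL;
        base->hooknum = hooknum;
        base->chain.flags |= SKETCH_CHAIN_BASE;
        return &base->chain;
    }
    return calloc(1, sizeof(struct sketch_chain));
}

/* Free through the correct outer pointer, depending on the chain kind. */
static void sketch_chain_free(struct sketch_chain *chain)
{
    if (chain == NULL)
        return;
    if (chain->flags & SKETCH_CHAIN_BASE)
        free(sketch_base_chain_of(chain));
    else
        free(chain);
}
```

The point of the layout is that the rest of the code can work on a plain chain pointer and only step back to the enclosing object when it knows the chain is a base chain.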

Configuring Rules

static int nf_tables_newrule(struct net *net, struct sock *nlsk,
struct sk_buff *skb, const struct nlmsghdr *nlh,
const struct nlattr * const nla[],
struct netlink_ext_ack *extack)
{
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
u8 genmask = nft_genmask_next(net);
struct nft_expr_info *info = NULL;
int family = nfmsg->nfgen_family;
struct nft_flow_rule *flow;
struct nft_table *table;
struct nft_chain *chain;
struct nft_rule *rule, *old_rule = NULL;
struct nft_userdata *udata;
struct nft_trans *trans = NULL;
struct nft_expr *expr;
struct nft_ctx ctx;
struct nlattr *tmp;
unsigned int size, i, n, ulen = 0, usize = 0;
int err, rem;
u64 handle, pos_handle;

lockdep_assert_held(&net->nft.commit_mutex);

table = nft_table_lookup(net, nla[NFTA_RULE_TABLE], family, genmask);
if (IS_ERR(table)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_TABLE]);
return PTR_ERR(table);
}

if (nla[NFTA_RULE_CHAIN]) {
chain = nft_chain_lookup(net, table, nla[NFTA_RULE_CHAIN],
genmask);
if (IS_ERR(chain)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN]);
return PTR_ERR(chain);
}
if (nft_chain_is_bound(chain))
return -EOPNOTSUPP;

} else if (nla[NFTA_RULE_CHAIN_ID]) {
chain = nft_chain_lookup_byid(net, nla[NFTA_RULE_CHAIN_ID]);
if (IS_ERR(chain)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_CHAIN_ID]);
return PTR_ERR(chain);
}
} else {
return -EINVAL;
}

if (nla[NFTA_RULE_HANDLE]) {
handle = be64_to_cpu(nla_get_be64(nla[NFTA_RULE_HANDLE]));
rule = __nft_rule_lookup(chain, handle);
if (IS_ERR(rule)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_HANDLE]);
return PTR_ERR(rule);
}

if (nlh->nlmsg_flags & NLM_F_EXCL) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_HANDLE]);
return -EEXIST;
}
if (nlh->nlmsg_flags & NLM_F_REPLACE)
old_rule = rule;
else
return -EOPNOTSUPP;
} else {
if (!(nlh->nlmsg_flags & NLM_F_CREATE) ||
nlh->nlmsg_flags & NLM_F_REPLACE)
return -EINVAL;
handle = nf_tables_alloc_handle(table);

if (chain->use == UINT_MAX)
return -EOVERFLOW;

if (nla[NFTA_RULE_POSITION]) {
pos_handle = be64_to_cpu(nla_get_be64(nla[NFTA_RULE_POSITION]));
old_rule = __nft_rule_lookup(chain, pos_handle);
if (IS_ERR(old_rule)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_POSITION]);
return PTR_ERR(old_rule);
}
} else if (nla[NFTA_RULE_POSITION_ID]) {
old_rule = nft_rule_lookup_byid(net, nla[NFTA_RULE_POSITION_ID]);
if (IS_ERR(old_rule)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_RULE_POSITION_ID]);
return PTR_ERR(old_rule);
}
}
}

nft_ctx_init(&ctx, net, skb, nlh, family, table, chain, nla);

n = 0;
size = 0;
if (nla[NFTA_RULE_EXPRESSIONS]) {
info = kvmalloc_array(NFT_RULE_MAXEXPRS,
sizeof(struct nft_expr_info),
GFP_KERNEL);
if (!info)
return -ENOMEM;

nla_for_each_nested(tmp, nla[NFTA_RULE_EXPRESSIONS], rem) {
err = -EINVAL;
if (nla_type(tmp) != NFTA_LIST_ELEM)
goto err1;
if (n == NFT_RULE_MAXEXPRS)
goto err1;
err = nf_tables_expr_parse(&ctx, tmp, &info[n]);
if (err < 0)
goto err1;
size += info[n].ops->size;
n++;
}
}
/* Check for overflow of dlen field */
err = -EFBIG;
if (size >= 1 << 12)
goto err1;

if (nla[NFTA_RULE_USERDATA]) {
ulen = nla_len(nla[NFTA_RULE_USERDATA]);
if (ulen > 0)
usize = sizeof(struct nft_userdata) + ulen;
}

err = -ENOMEM;
rule = kzalloc(sizeof(*rule) + size + usize, GFP_KERNEL);
if (rule == NULL)
goto err1;

nft_activate_next(net, rule);

rule->handle = handle;
rule->dlen = size;
rule->udata = ulen ? 1 : 0;

if (ulen) {
udata = nft_userdata(rule);
udata->len = ulen - 1;
nla_memcpy(udata->data, nla[NFTA_RULE_USERDATA], ulen);
}

expr = nft_expr_first(rule);
for (i = 0; i < n; i++) {
err = nf_tables_newexpr(&ctx, &info[i], expr);
if (err < 0) {
NL_SET_BAD_ATTR(extack, info[i].attr);
goto err2;
}

if (info[i].ops->validate)
nft_validate_state_update(net, NFT_VALIDATE_NEED);

info[i].ops = NULL;
expr = nft_expr_next(expr);
}

if (nlh->nlmsg_flags & NLM_F_REPLACE) {
trans = nft_trans_rule_add(&ctx, NFT_MSG_NEWRULE, rule);
if (trans == NULL) {
err = -ENOMEM;
goto err2;
}
err = nft_delrule(&ctx, old_rule);
if (err < 0) {
nft_trans_destroy(trans);
goto err2;
}

list_add_tail_rcu(&rule->list, &old_rule->list);
} else {
trans = nft_trans_rule_add(&ctx, NFT_MSG_NEWRULE, rule);
if (!trans) {
err = -ENOMEM;
goto err2;
}

if (nlh->nlmsg_flags & NLM_F_APPEND) {
if (old_rule)
list_add_rcu(&rule->list, &old_rule->list);
else
list_add_tail_rcu(&rule->list, &chain->rules);
} else {
if (old_rule)
list_add_tail_rcu(&rule->list, &old_rule->list);
else
list_add_rcu(&rule->list, &chain->rules);
}
}
kvfree(info);
chain->use++;

if (net->nft.validate_state == NFT_VALIDATE_DO)
return nft_table_validate(net, table);

if (chain->flags & NFT_CHAIN_HW_OFFLOAD) {
flow = nft_flow_rule_create(net, rule);
if (IS_ERR(flow))
return PTR_ERR(flow);

nft_trans_flow_rule(trans) = flow;
}

return 0;
err2:
nf_tables_rule_release(&ctx, rule);
err1:
for (i = 0; i < n; i++) {
if (info[i].ops) {
module_put(info[i].ops->type->owner);
if (info[i].ops->type->release_ops)
info[i].ops->type->release_ops(info[i].ops);
}
}
kvfree(info);
return err;
}

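Everything before the nft_ctx_init call is about finding the existing table and chain, plus the existing rule when a handle or position is supplied.

After nft_ctx_init, if userspace set nla[NFTA_RULE_EXPRESSIONS], the sizes of all the expressions are accumulated into size. The value being added is info[n].ops->size, so let's trace where info comes from.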

static int nf_tables_expr_parse(const struct nft_ctx *ctx,
const struct nlattr *nla,
struct nft_expr_info *info)
{
const struct nft_expr_type *type;
const struct nft_expr_ops *ops;
struct nlattr *tb[NFTA_EXPR_MAX + 1];
int err;

err = nla_parse_nested_deprecated(tb, NFTA_EXPR_MAX, nla,
nft_expr_policy, NULL);
if (err < 0)
return err;

type = nft_expr_type_get(ctx->net, ctx->family, tb[NFTA_EXPR_NAME]);
if (IS_ERR(type))
return PTR_ERR(type);

if (tb[NFTA_EXPR_DATA]) {
err = nla_parse_nested_deprecated(info->tb, type->maxattr,
tb[NFTA_EXPR_DATA],
type->policy, NULL);
if (err < 0)
goto err1;
} else
memset(info->tb, 0, sizeof(info->tb[0]) * (type->maxattr + 1));

if (type->select_ops != NULL) {
ops = type->select_ops(ctx,
(const struct nlattr * const *)info->tb);
if (IS_ERR(ops)) {
err = PTR_ERR(ops);
#ifdef CONFIG_MODULES
if (err == -EAGAIN)
if (nft_expr_type_request_module(ctx->net,
ctx->family,
tb[NFTA_EXPR_NAME]) != -EAGAIN)
err = -ENOENT;
#endif
goto err1;
}
} else
ops = type->ops;

info->attr = nla;
info->ops = ops;

return 0;

err1:
module_put(type->owner);
return err;
}

As you can see, the ops assigned to info->ops is determined by type.

static const struct nft_expr_type *nft_expr_type_get(struct net *net,
u8 family,
struct nlattr *nla)
{
const struct nft_expr_type *type;

if (nla == NULL)
return ERR_PTR(-EINVAL);

type = __nft_expr_type_get(family, nla);
if (type != NULL && try_module_get(type->owner))
return type;

lockdep_nfnl_nft_mutex_not_held();
#ifdef CONFIG_MODULES
if (type == NULL) {
if (nft_expr_type_request_module(net, family, nla) == -EAGAIN)
return ERR_PTR(-EAGAIN);

if (nft_request_module(net, "nft-expr-%.*s",
nla_len(nla),
(char *)nla_data(nla)) == -EAGAIN)
return ERR_PTR(-EAGAIN);
}
#endif
return ERR_PTR(-ENOENT);
}

The type, in turn, is produced by the function above, which relies on __nft_expr_type_get for the actual search.

static const struct nft_expr_type *__nft_expr_type_get(u8 family,
struct nlattr *nla)
{
const struct nft_expr_type *type, *candidate = NULL;

list_for_each_entry(type, &nf_tables_expressions, list) {
if (!nla_strcmp(nla, type->name)) {
if (!type->family && !candidate)
candidate = type;
else if (type->family == family)
candidate = type;
}
}
return candidate;
}

The matching type is found mainly by comparing names as strings, so let's go back to userspace to see where that name comes from.
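The name-and-family matching in __nft_expr_type_get can be modeled in plain C. The sketch below assumes an array instead of the kernel's linked list, and the structure and function names are hypothetical. It keeps the same preference rule: a family-specific type wins over a generic one (family 0) with the same name.

```c
#include <stddef.h>
#include <string.h>

struct sketch_expr_type {
    const char *name;
    unsigned char family;   /* 0 means "any family" */
};

/* Return the best type for (family, name), or NULL if no name matches. */
static const struct sketch_expr_type *
sketch_type_get(const struct sketch_expr_type *types, size_t n,
                unsigned char family, const char *name)
{
    const struct sketch_expr_type *candidate = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(types[i].name, name) != 0)
            continue;
        if (!types[i].family && !candidate)
            candidate = &types[i];      /* generic fallback */
        else if (types[i].family == family)
            candidate = &types[i];      /* exact family match wins */
    }
    return candidate;
}
```

With a table holding a generic "meta" and a family-2 "meta", a request for family 2 gets the specific entry, while any other family falls back to the generic one.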

exprs[exprid] = nftnl_expr_alloc("lookup");

On the library side, nftnl_expr_alloc looks like this:

struct nftnl_expr *nftnl_expr_alloc(const char *name)
{
struct nftnl_expr *expr;
struct expr_ops *ops;

ops = nftnl_expr_ops_lookup(name);
if (ops == NULL)
return NULL;

expr = calloc(1, sizeof(struct nftnl_expr) + ops->alloc_len);
if (expr == NULL)
return NULL;

/* Manually set expression name attribute */
expr->flags |= (1 << NFTNL_EXPR_NAME);
expr->ops = ops;

return expr;
}
EXPORT_SYMBOL(nftnl_expr_alloc);

Once again, the ops is found by name.

struct expr_ops *nftnl_expr_ops_lookup(const char *name)
{
int i = 0;

while (expr_ops[i] != NULL) {
if (strcmp(expr_ops[i]->name, name) == 0)
return expr_ops[i];

i++;
}
return NULL;
}

The expr_ops array it walks is defined as follows:

static struct expr_ops *expr_ops[] = {
&expr_ops_bitwise,
&expr_ops_byteorder,
&expr_ops_cmp,
&expr_ops_counter,
&expr_ops_ct,
&expr_ops_dup,
&expr_ops_exthdr,
&expr_ops_fwd,
&expr_ops_immediate,
&expr_ops_limit,
&expr_ops_log,
&expr_ops_lookup,
&expr_ops_masq,
&expr_ops_match,
&expr_ops_meta,
&expr_ops_ng,
&expr_ops_nat,
&expr_ops_notrack,
&expr_ops_payload,
&expr_ops_range,
&expr_ops_redir,
&expr_ops_reject,
&expr_ops_rt,
&expr_ops_queue,
&expr_ops_quota,
&expr_ops_target,
&expr_ops_dynset,
&expr_ops_hash,
&expr_ops_fib,
&expr_ops_objref,
NULL,
};

Those are all the kinds of ops available here. Taking the lookup expression from the start as the example, let's return to the kernel.

static struct nft_expr_type *nft_basic_types[] = {
&nft_imm_type,
&nft_cmp_type,
&nft_lookup_type,
&nft_bitwise_type,
&nft_byteorder_type,
&nft_payload_type,
&nft_dynset_type,
&nft_range_type,
&nft_meta_type,
&nft_rt_type,
&nft_exthdr_type,
};

The kernel likewise has many expression types. Following on from the above, the type matched here is nft_lookup_type.

struct nft_expr_type nft_lookup_type __read_mostly = {
.name = "lookup",
.ops = &nft_lookup_ops,
.policy = nft_lookup_policy,
.maxattr = NFTA_LOOKUP_MAX,
.owner = THIS_MODULE,
};

Because its select_ops is NULL, the ops we end up with is nft_lookup_ops.
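The choice between select_ops and the fixed ops pointer can be reduced to a few lines. This is a simplified sketch: the sketch_ types, the single integer argument to select_ops, and the size values are all assumptions made for illustration.

```c
#include <stddef.h>

struct sketch_ops {
    unsigned int size;
};

struct sketch_type {
    /* Optional callback that picks an ops variant from the attributes. */
    const struct sketch_ops *(*select_ops)(int wide);
    /* Fixed ops, used when select_ops is NULL. */
    const struct sketch_ops *ops;
};

static const struct sketch_ops sketch_lookup_ops = { 32 };
static const struct sketch_ops sketch_narrow_ops = { 8 };
static const struct sketch_ops sketch_wide_ops   = { 16 };

static const struct sketch_ops *sketch_pick_variant(int wide)
{
    return wide ? &sketch_wide_ops : &sketch_narrow_ops;
}

/* A type with fixed ops (like "lookup") and one that selects at parse time. */
static const struct sketch_type sketch_lookup_type  = { NULL, &sketch_lookup_ops };
static const struct sketch_type sketch_variant_type = { sketch_pick_variant, NULL };

/* Same decision as in nf_tables_expr_parse: callback first, else fixed ops. */
static const struct sketch_ops *
sketch_resolve_ops(const struct sketch_type *type, int wide)
{
    if (type->select_ops != NULL)
        return type->select_ops(wide);
    return type->ops;
}
```

Resolving sketch_lookup_type always yields the same ops object, which is why the size contributed by a lookup expression is fixed.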

static const struct nft_expr_ops nft_lookup_ops = {
.type = &nft_lookup_type,
.size = NFT_EXPR_SIZE(sizeof(struct nft_lookup)),
.eval = nft_lookup_eval,
.init = nft_lookup_init,
.activate = nft_lookup_activate,
.deactivate = nft_lookup_deactivate,
.destroy = nft_lookup_destroy,
.dump = nft_lookup_dump,
.validate = nft_lookup_validate,
};

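The size is finally taken from here and added up. Back in nf_tables_newrule, the code next checks whether there is user data; if so, its length is included when the rule is allocated. After that comes the initialization of the rule itself.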
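The allocation size arithmetic can be written out as a small helper. This is a sketch only: SKETCH_RULE_HDR and SKETCH_UDATA_HDR are assumed placeholder values for sizeof(struct nft_rule) and sizeof(struct nft_userdata), and the function name is hypothetical. The 1 << 12 limit mirrors the dlen check in the listing.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_DLEN_LIMIT (1u << 12) /* total expression bytes must stay below this */
#define SKETCH_RULE_HDR   24u        /* assumed size of the rule header */
#define SKETCH_UDATA_HDR  1u         /* assumed size of the userdata header */

/*
 * Returns the number of bytes to allocate for a rule, or -EFBIG when the
 * summed expression sizes would not fit in the dlen field.
 */
static long sketch_rule_alloc_size(const unsigned int *expr_sizes, size_t n,
                                   unsigned int ulen)
{
    unsigned int size = 0, usize = 0;
    size_t i;

    for (i = 0; i < n; i++)
        size += expr_sizes[i];      /* corresponds to info[n].ops->size */

    if (size >= SKETCH_DLEN_LIMIT)
        return -EFBIG;

    if (ulen > 0)
        usize = SKETCH_UDATA_HDR + ulen;

    return (long)(SKETCH_RULE_HDR + size + usize);
}
```

So a rule with two expressions of 16 and 48 bytes and 8 bytes of user data needs header + 64 + 9 bytes under these assumptions, all in one allocation.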

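Finally it checks whether the new rule replaces an old one. If not, it decides whether to insert the rule at the end or at the beginning of the chain (or next to the rule named by a position attribute).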
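The four list operations at the end of nf_tables_newrule boil down to a small decision table. The sketch below returns the resulting position instead of touching a list; the enum and function names are hypothetical. It relies on the usual kernel list semantics, where list_add inserts right after the given node and list_add_tail inserts right before it.

```c
enum sketch_position {
    SKETCH_CHAIN_HEAD,   /* first rule in the chain */
    SKETCH_CHAIN_TAIL,   /* last rule in the chain */
    SKETCH_BEFORE_REF,   /* immediately before the reference (old) rule */
    SKETCH_AFTER_REF     /* immediately after the reference (old) rule */
};

/*
 * replace:      NLM_F_REPLACE was set (a reference rule always exists then)
 * append:       NLM_F_APPEND was set
 * has_ref_rule: a position attribute resolved to an existing rule
 */
static enum sketch_position
sketch_rule_position(int replace, int append, int has_ref_rule)
{
    if (replace)
        return SKETCH_BEFORE_REF;
    if (append)
        return has_ref_rule ? SKETCH_AFTER_REF : SKETCH_CHAIN_TAIL;
    return has_ref_rule ? SKETCH_BEFORE_REF : SKETCH_CHAIN_HEAD;
}
```

In the replace case the new rule is linked in front of the old one, and the old rule is removed through nft_delrule as part of the same transaction.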

Configuring Sets

static int nf_tables_newset(struct net *net, struct sock *nlsk,
struct sk_buff *skb, const struct nlmsghdr *nlh,
const struct nlattr * const nla[],
struct netlink_ext_ack *extack)
{
const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
u8 genmask = nft_genmask_next(net);
int family = nfmsg->nfgen_family;
const struct nft_set_ops *ops;
struct nft_expr *expr = NULL;
struct nft_table *table;
struct nft_set *set;
struct nft_ctx ctx;
char *name;
u64 size;
u64 timeout;
u32 ktype, dtype, flags, policy, gc_int, objtype;
struct nft_set_desc desc;
unsigned char *udata;
u16 udlen;
int err;
int i;

if (nla[NFTA_SET_TABLE] == NULL ||
nla[NFTA_SET_NAME] == NULL ||
nla[NFTA_SET_KEY_LEN] == NULL ||
nla[NFTA_SET_ID] == NULL)
return -EINVAL;

memset(&desc, 0, sizeof(desc));

ktype = NFT_DATA_VALUE;
if (nla[NFTA_SET_KEY_TYPE] != NULL) {
ktype = ntohl(nla_get_be32(nla[NFTA_SET_KEY_TYPE]));
if ((ktype & NFT_DATA_RESERVED_MASK) == NFT_DATA_RESERVED_MASK)
return -EINVAL;
}

desc.klen = ntohl(nla_get_be32(nla[NFTA_SET_KEY_LEN]));
if (desc.klen == 0 || desc.klen > NFT_DATA_VALUE_MAXLEN)
return -EINVAL;

flags = 0;
if (nla[NFTA_SET_FLAGS] != NULL) {
flags = ntohl(nla_get_be32(nla[NFTA_SET_FLAGS]));
if (flags & ~(NFT_SET_ANONYMOUS | NFT_SET_CONSTANT |
NFT_SET_INTERVAL | NFT_SET_TIMEOUT |
NFT_SET_MAP | NFT_SET_EVAL |
NFT_SET_OBJECT | NFT_SET_CONCAT))
return -EOPNOTSUPP;
/* Only one of these operations is supported */
if ((flags & (NFT_SET_MAP | NFT_SET_OBJECT)) ==
(NFT_SET_MAP | NFT_SET_OBJECT))
return -EOPNOTSUPP;
if ((flags & (NFT_SET_EVAL | NFT_SET_OBJECT)) ==
(NFT_SET_EVAL | NFT_SET_OBJECT))
return -EOPNOTSUPP;
}

dtype = 0;
if (nla[NFTA_SET_DATA_TYPE] != NULL) {
if (!(flags & NFT_SET_MAP))
return -EINVAL;

dtype = ntohl(nla_get_be32(nla[NFTA_SET_DATA_TYPE]));
if ((dtype & NFT_DATA_RESERVED_MASK) == NFT_DATA_RESERVED_MASK &&
dtype != NFT_DATA_VERDICT)
return -EINVAL;

if (dtype != NFT_DATA_VERDICT) {
if (nla[NFTA_SET_DATA_LEN] == NULL)
return -EINVAL;
desc.dlen = ntohl(nla_get_be32(nla[NFTA_SET_DATA_LEN]));
if (desc.dlen == 0 || desc.dlen > NFT_DATA_VALUE_MAXLEN)
return -EINVAL;
} else
desc.dlen = sizeof(struct nft_verdict);
} else if (flags & NFT_SET_MAP)
return -EINVAL;

if (nla[NFTA_SET_OBJ_TYPE] != NULL) {
if (!(flags & NFT_SET_OBJECT))
return -EINVAL;

objtype = ntohl(nla_get_be32(nla[NFTA_SET_OBJ_TYPE]));
if (objtype == NFT_OBJECT_UNSPEC ||
objtype > NFT_OBJECT_MAX)
return -EOPNOTSUPP;
} else if (flags & NFT_SET_OBJECT)
return -EINVAL;
else
objtype = NFT_OBJECT_UNSPEC;

timeout = 0;
if (nla[NFTA_SET_TIMEOUT] != NULL) {
if (!(flags & NFT_SET_TIMEOUT))
return -EINVAL;

err = nf_msecs_to_jiffies64(nla[NFTA_SET_TIMEOUT], &timeout);
if (err)
return err;
}
gc_int = 0;
if (nla[NFTA_SET_GC_INTERVAL] != NULL) {
if (!(flags & NFT_SET_TIMEOUT))
return -EINVAL;
gc_int = ntohl(nla_get_be32(nla[NFTA_SET_GC_INTERVAL]));
}

policy = NFT_SET_POL_PERFORMANCE;
if (nla[NFTA_SET_POLICY] != NULL)
policy = ntohl(nla_get_be32(nla[NFTA_SET_POLICY]));

if (nla[NFTA_SET_DESC] != NULL) {
err = nf_tables_set_desc_parse(&desc, nla[NFTA_SET_DESC]);
if (err < 0)
return err;
}

if (nla[NFTA_SET_EXPR])
desc.expr = true;

table = nft_table_lookup(net, nla[NFTA_SET_TABLE], family, genmask);
if (IS_ERR(table)) {
NL_SET_BAD_ATTR(extack, nla[NFTA_SET_TABLE]);
return PTR_ERR(table);
}

nft_ctx_init(&ctx, net, skb, nlh, family, table, NULL, nla);

set = nft_set_lookup(table, nla[NFTA_SET_NAME], genmask);
if (IS_ERR(set)) {
if (PTR_ERR(set) != -ENOENT) {
NL_SET_BAD_ATTR(extack, nla[NFTA_SET_NAME]);
return PTR_ERR(set);
}
} else {
if (nlh->nlmsg_flags & NLM_F_EXCL) {
NL_SET_BAD_ATTR(extack, nla[NFTA_SET_NAME]);
return -EEXIST;
}
if (nlh->nlmsg_flags & NLM_F_REPLACE)
return -EOPNOTSUPP;

return 0;
}

if (!(nlh->nlmsg_flags & NLM_F_CREATE))
return -ENOENT;

ops = nft_select_set_ops(&ctx, nla, &desc, policy);
if (IS_ERR(ops))
return PTR_ERR(ops);

udlen = 0;
if (nla[NFTA_SET_USERDATA])
udlen = nla_len(nla[NFTA_SET_USERDATA]);

size = 0;
if (ops->privsize != NULL)
size = ops->privsize(nla, &desc);

set = kvzalloc(sizeof(*set) + size + udlen, GFP_KERNEL);
if (!set)
return -ENOMEM;

name = nla_strdup(nla[NFTA_SET_NAME], GFP_KERNEL);
if (!name) {
err = -ENOMEM;
goto err_set_name;
}

err = nf_tables_set_alloc_name(&ctx, set, name);
kfree(name);
if (err < 0)
goto err_set_alloc_name;

if (nla[NFTA_SET_EXPR]) {
expr = nft_set_elem_expr_alloc(&ctx, set, nla[NFTA_SET_EXPR]);
if (IS_ERR(expr)) {
err = PTR_ERR(expr);
goto err_set_alloc_name;
}
}

udata = NULL;
if (udlen) {
udata = set->data + size;
nla_memcpy(udata, nla[NFTA_SET_USERDATA], udlen);
}

INIT_LIST_HEAD(&set->bindings);
set->table = table;
write_pnet(&set->net, net);
set->ops = ops;
set->ktype = ktype;
set->klen = desc.klen;
set->dtype = dtype;
set->objtype = objtype;
set->dlen = desc.dlen;
set->expr = expr;
set->flags = flags;
set->size = desc.size;
set->policy = policy;
set->udlen = udlen;
set->udata = udata;
set->timeout = timeout;
set->gc_int = gc_int;
set->handle = nf_tables_alloc_handle(table);

set->field_count = desc.field_count;
for (i = 0; i < desc.field_count; i++)
set->field_len[i] = desc.field_len[i];

err = ops->init(set, &desc, nla);
if (err < 0)
goto err_set_init;

err = nft_trans_set_add(&ctx, NFT_MSG_NEWSET, set);
if (err < 0)
goto err_set_trans;

list_add_tail_rcu(&set->list, &table->sets);
table->use++;
return 0;

err_set_trans:
ops->destroy(set);
err_set_init:
if (expr)
nft_expr_destroy(&ctx, expr);
err_set_alloc_name:
kfree(set->name);
err_set_name:
kvfree(set);
return err;
}

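As before, the code ahead of nft_ctx_init mainly validates attributes, initializes some variables, and finds the corresponding table.

It then calls nft_set_lookup to search for the set. If the set already exists the function returns right away (with -EEXIST under NLM_F_EXCL, -EOPNOTSUPP under NLM_F_REPLACE, and 0 otherwise); if it does not exist, creation continues.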
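The way the netlink message flags steer this step can be summarized in one function. This is a sketch under stated assumptions: the SKETCH_NLM_F_ constants copy the values of the corresponding NLM_F_ flags from <linux/netlink.h>, the function name is hypothetical, and the return value 1 is an invention of the sketch meaning "go on and create the set".

```c
#include <errno.h>

/* Same values as NLM_F_REPLACE, NLM_F_EXCL and NLM_F_CREATE. */
#define SKETCH_NLM_F_REPLACE 0x100u
#define SKETCH_NLM_F_EXCL    0x200u
#define SKETCH_NLM_F_CREATE  0x400u

/*
 * Returns 1 to continue with creation, 0 when the set already exists and
 * nothing needs to be done, or a negative errno.
 */
static int sketch_newset_decision(int exists, unsigned int nlmsg_flags)
{
    if (exists) {
        if (nlmsg_flags & SKETCH_NLM_F_EXCL)
            return -EEXIST;
        if (nlmsg_flags & SKETCH_NLM_F_REPLACE)
            return -EOPNOTSUPP;
        return 0;
    }

    if (!(nlmsg_flags & SKETCH_NLM_F_CREATE))
        return -ENOENT;

    return 1;
}
```

Note that a missing set is only created when NLM_F_CREATE is present; without it the request fails with -ENOENT.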

static const struct nft_set_ops *
nft_select_set_ops(const struct nft_ctx *ctx,
const struct nlattr * const nla[],
const struct nft_set_desc *desc,
enum nft_set_policies policy)
{
const struct nft_set_ops *ops, *bops;
struct nft_set_estimate est, best;
const struct nft_set_type *type;
u32 flags = 0;
int i;

lockdep_assert_held(&ctx->net->nft.commit_mutex);
lockdep_nfnl_nft_mutex_not_held();

if (nla[NFTA_SET_FLAGS] != NULL)
flags = ntohl(nla_get_be32(nla[NFTA_SET_FLAGS]));

bops = NULL;
best.size = ~0;
best.lookup = ~0;
best.space = ~0;

for (i = 0; i < ARRAY_SIZE(nft_set_types); i++) {
type = nft_set_types[i];
ops = &type->ops;

if (!nft_set_ops_candidate(type, flags))
continue;
if (!ops->estimate(desc, flags, &est))
continue;

switch (policy) {
case NFT_SET_POL_PERFORMANCE:
if (est.lookup < best.lookup)
break;
if (est.lookup == best.lookup &&
est.space < best.space)
break;
continue;
case NFT_SET_POL_MEMORY:
if (!desc->size) {
if (est.space < best.space)
break;
if (est.space == best.space &&
est.lookup < best.lookup)
break;
} else if (est.size < best.size || !bops) {
break;
}
continue;
default:
break;
}

bops = ops;
best = est;
}

if (bops != NULL)
return bops;

return ERR_PTR(-EOPNOTSUPP);
}

Creation starts by choosing the ops through the nft_select_set_ops function shown above.

static const struct nft_set_type *nft_set_types[] = {
&nft_set_hash_fast_type,
&nft_set_hash_type,
&nft_set_rhash_type,
&nft_set_bitmap_type,
&nft_set_rbtree_type,
#if defined(CONFIG_X86_64) && !defined(CONFIG_UML)
&nft_set_pipapo_avx2_type,
#endif
&nft_set_pipapo_type,
};

Those are the available set types. Once the matching ops has been found and returned, the set is initialized.
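For the default NFT_SET_POL_PERFORMANCE policy, the selection loop keeps the backend with the cheapest lookup and breaks ties on space. A minimal sketch of that comparison follows. The sketch_ structures, the usable field (standing in for the candidate and estimate checks), and the numeric cost classes are assumptions of the sketch.

```c
#include <stddef.h>

/* Cost classes: lower is better. */
struct sketch_estimate {
    unsigned int lookup;
    unsigned int space;
};

struct sketch_backend {
    const char *name;
    int usable;                    /* passed the candidate/estimate checks */
    struct sketch_estimate est;
};

/* Pick the usable backend with the best lookup cost, then the best space. */
static const struct sketch_backend *
sketch_select_backend(const struct sketch_backend *backends, size_t n)
{
    const struct sketch_backend *best = NULL;
    struct sketch_estimate best_est = { ~0u, ~0u };
    size_t i;

    for (i = 0; i < n; i++) {
        const struct sketch_estimate *est = &backends[i].est;

        if (!backends[i].usable)
            continue;

        if (est->lookup < best_est.lookup ||
            (est->lookup == best_est.lookup && est->space < best_est.space)) {
            best = &backends[i];
            best_est = *est;
        }
    }
    return best;   /* NULL maps to -EOPNOTSUPP in the kernel function */
}
```

Starting the running best at the maximum value means the first usable backend always becomes the initial candidate, just as best.lookup = ~0 does in the kernel code.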

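One difference from the earlier cases: if writing the name fails because a set with that name already exists, the memory is freed and an error is returned.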

Configuring Expressions

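Expressions show up in two places: when a rule is created and when a set is created.

The rule case is easy to follow: once the rule has been allocated, nf_tables_newexpr is called directly to initialize each expression.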

static int nf_tables_newexpr(const struct nft_ctx *ctx,
const struct nft_expr_info *info,
struct nft_expr *expr)
{
const struct nft_expr_ops *ops = info->ops;
int err;

expr->ops = ops;
if (ops->init) {
err = ops->init(ctx, expr, (const struct nlattr **)info->tb);
if (err < 0)
goto err1;
}

return 0;
err1:
expr->ops = NULL;
return err;
}

Different expression types end up in different init functions here, which I won't analyze in detail.
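The dispatch itself is an ordinary function-pointer call through the ops table. Here is a minimal userspace sketch of the same shape. The sketch_ types, the integer argument in place of the netlink attribute table, and the two example ops are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_expr;

struct sketch_expr_ops {
    int (*init)(struct sketch_expr *expr, int arg);   /* may be NULL */
};

struct sketch_expr {
    const struct sketch_expr_ops *ops;
    int value;
};

/* Example init: rejects negative arguments, stores everything else. */
static int sketch_store_init(struct sketch_expr *expr, int arg)
{
    if (arg < 0)
        return -EINVAL;
    expr->value = arg;
    return 0;
}

static const struct sketch_expr_ops sketch_store_ops  = { sketch_store_init };
static const struct sketch_expr_ops sketch_noinit_ops = { NULL };

/* Bind the ops, run the type-specific init, and unbind again on failure. */
static int sketch_newexpr(const struct sketch_expr_ops *ops,
                          struct sketch_expr *expr, int arg)
{
    int err;

    expr->ops = ops;
    if (ops->init) {
        err = ops->init(expr, arg);
        if (err < 0) {
            expr->ops = NULL;
            return err;
        }
    }
    return 0;
}
```

Clearing expr->ops on failure matters because later cleanup code uses that pointer to tell whether an expression was ever initialized.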

The other place is set creation, which may also allocate an expression.

if (nla[NFTA_SET_EXPR]) {
expr = nft_set_elem_expr_alloc(&ctx, set, nla[NFTA_SET_EXPR]);
if (IS_ERR(expr)) {
err = PTR_ERR(expr);
goto err_set_alloc_name;
}
}

If userspace passed an expression (NFTA_SET_EXPR), nft_set_elem_expr_alloc is called to create it.
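The snippet checks the result with IS_ERR and PTR_ERR, a convention that packs a negative errno into the pointer value itself. Below is a rough userspace model of that convention, assuming the top 4095 addresses are never valid pointers; the sketch_ helper names are hypothetical stand-ins for the kernel macros.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode the errno from an error pointer. */
static long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True when the pointer lies in the range reserved for error codes. */
static int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

This is why a single return value can carry either a usable object or the reason for failure, and why callers must test with IS_ERR rather than comparing against NULL.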


References:

https://wiki.nftables.org/wiki-nftables/index.php/Main_Page

https://www.cnblogs.com/xinghuo123/p/13797589.html
