
Understanding Linux Network Internals, Chapter 8: Device Registration and Initialization

2020-06-07 06:12

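Device Registration and Initialization

  • Registering and unregistering devices
  • Device unregistration
  • Reference counts
  • Enabling a device
  • Disabling a device
  • Updating the device queuing discipline state
  • Link state change detection
  • Virtual devices
  • Device registration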

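    Network devices are registered in the following situations:

    • Loading a NIC device driver
      When a NIC driver initializes, every NIC it controls is registered.
    • Inserting a hot-pluggable network device

    As covered in earlier chapters, loading a PCI device driver causes its pci_driver->probe function to run for each device it matches. The driver supplies probe, and probe is responsible for registering the device.

    Device unregistration

    The following events trigger device unregistration:

    • Unloading a NIC device driver
      This applies only to drivers loaded as modules, not to drivers built into the kernel.
    • Removing a hot-pluggable device

    Allocating the net_device structure

    Ethernet drivers allocate a struct net_device with alloc_etherdev_mqs, which calls alloc_netdev_mqs to do the actual allocation.
    The first argument of alloc_netdev_mqs is the size of the driver's private data area, which is appended to the structure so the driver can keep its own parameters and state there.
    The second argument is the device name; alloc_etherdev_mqs passes the template "eth%d".
    The setup argument is a callback that initializes some of the net_device fields.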

    /**
    * alloc_netdev_mqs - allocate network device
    * @sizeof_priv: size of private data to allocate space for
    * @name: device name format string
    * @name_assign_type: origin of device name
    * @setup: callback to initialize device
    * @txqs: the number of TX subqueues to allocate
    * @rxqs: the number of RX subqueues to allocate
    *
    * Allocates a struct net_device with private data area for driver use
    * and performs basic initialization.  Also allocates subqueue structs
    * for each queue on the device.
    */
    struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
    unsigned char name_assign_type,
    void (*setup)(struct net_device *),
    unsigned int txqs, unsigned int rxqs)

    alloc_netdev_mqs is usually called through a wrapper; Ethernet devices, for example, obtain their net_device through alloc_etherdev_mqs.

    struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs,
    unsigned int rxqs)
    {
    return alloc_netdev_mqs(sizeof_priv, "eth%d", NET_NAME_UNKNOWN,
    ether_setup, txqs, rxqs);
    }
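
    The private data area described above can be pictured as a single allocation that holds the device structure followed by the driver's data. The following user-space sketch illustrates that layout only. struct model_netdev, model_alloc_netdev, model_netdev_priv and MODEL_ALIGN are hypothetical names made up for this example, and the real kernel allocator also sets up queues and many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_ALIGN 32u   /* assumption: stands in for the kernel's alignment constant */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

/* Reduced, hypothetical stand-in for struct net_device. */
struct model_netdev {
    char name[16];
    unsigned int num_tx_queues;
    unsigned int num_rx_queues;
};

/* One allocation: the device structure, padding, then the private area. */
static struct model_netdev *model_alloc_netdev(size_t sizeof_priv,
                                               const char *name,
                                               void (*setup)(struct model_netdev *),
                                               unsigned int txqs,
                                               unsigned int rxqs)
{
    size_t total = ALIGN_UP(sizeof(struct model_netdev), MODEL_ALIGN) + sizeof_priv;
    struct model_netdev *dev;

    /* At least one queue in each direction, and the name must fit. */
    if (txqs < 1 || rxqs < 1 || strlen(name) >= sizeof(dev->name))
        return NULL;

    dev = calloc(1, total);          /* zeroed, like the kernel allocation */
    if (!dev)
        return NULL;

    strcpy(dev->name, name);         /* still a template such as "eth%d" */
    dev->num_tx_queues = txqs;
    dev->num_rx_queues = rxqs;
    if (setup)
        setup(dev);                  /* type-specific defaults */
    return dev;
}

/* The private area starts right after the padded device structure. */
static void *model_netdev_priv(struct model_netdev *dev)
{
    assert(dev != NULL);
    return (char *)dev + ALIGN_UP(sizeof(struct model_netdev), MODEL_ALIGN);
}
```

    Because the two parts share one allocation, releasing the device releases the private data with it, so a driver never frees its private area separately.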

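    NIC registration and unregistration framework

    Registering a device takes two key steps:

    1. Allocate the net_device structure with alloc_etherdev, which also initializes the parameters common to all Ethernet devices.
    2. Call register_netdev to register the device.

    Unregistering a device takes two key steps:

    1. Call unregister_netdev to unregister the device.
    2. Call free_netdev to release the allocated net_device.

    Device initialization

    When the net_device of an Ethernet device is allocated, ether_setup initializes some of its fields.
    header_ops holds the functions that operate on the L2 (link-layer) header.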

    /**
    * ether_setup - setup Ethernet network device
    * @dev: network device
    *
    * Fill in the fields of the device structure with Ethernet-generic values.
    */
    void ether_setup(struct net_device *dev)
    {
    dev->header_ops		= &eth_header_ops;
    dev->type		= ARPHRD_ETHER;
    dev->hard_header_len 	= ETH_HLEN;
    dev->min_header_len	= ETH_HLEN;
    dev->mtu		= ETH_DATA_LEN;
    dev->min_mtu		= ETH_MIN_MTU;
    dev->max_mtu		= ETH_DATA_LEN;
    dev->addr_len		= ETH_ALEN;
    dev->tx_queue_len	= DEFAULT_TX_QUEUE_LEN;
    dev->flags		= IFF_BROADCAST|IFF_MULTICAST;
    dev->priv_flags		|= IFF_TX_SKB_SHARING;
    
    eth_broadcast_addr(dev->broadcast);
    
    }

    The driver itself initializes the netdev_ops and ethtool_ops fields.
    netdev_ops holds the functions used to manage the NIC.
    ethtool_ops holds the optional device operations.

    netdev->netdev_ops = &e100_netdev_ops;
    netdev->ethtool_ops = &e100_ethtool_ops;

    Many of the function pointers mentioned above do not have to be set. Unset pointers are NULL, so callers must check them before use.

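    Organization of net_device structures

    Each net_device is linked into one global list and two hash tables.
    dev_list links all net_device structures in the kernel into a list.
    name_hlist links them into a hash table keyed by interface name.
    index_hlist links them into a hash table keyed by the interface's ifindex.
    These structures let the kernel look up a net_device in whichever way it needs.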

    struct hlist_node	name_hlist;
    struct hlist_node	index_hlist;
    struct list_head	dev_list;
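
    To illustrate why one device carries three link fields, here is a small user-space sketch with a list and two hash tables. The types, the bucket count and the hash functions are assumptions made for this example and are not the kernel's, which uses hlist/list_head with RCU and its own hashing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_SIZE 16u   /* hypothetical bucket count */

struct model_netdev {
    char name[16];
    int ifindex;
    struct model_netdev *next_dev;       /* global list, registration order */
    struct model_netdev *next_by_name;   /* chain in the name hash bucket */
    struct model_netdev *next_by_index;  /* chain in the ifindex hash bucket */
};

struct model_net {
    struct model_netdev *dev_head, *dev_tail;
    struct model_netdev *name_hash[HASH_SIZE];
    struct model_netdev *index_hash[HASH_SIZE];
};

static unsigned int hash_name(const char *name)
{
    unsigned int h = 5381;

    while (*name)
        h = h * 33u + (unsigned char)*name++;
    return h % HASH_SIZE;
}

static unsigned int hash_index(int ifindex)
{
    return (unsigned int)ifindex % HASH_SIZE;
}

/* Link one device into all three containers at once. */
static void model_list_netdevice(struct model_net *net, struct model_netdev *dev)
{
    unsigned int n = hash_name(dev->name), i = hash_index(dev->ifindex);

    assert(net != NULL && dev != NULL);

    dev->next_dev = NULL;                /* append to the global list */
    if (net->dev_tail)
        net->dev_tail->next_dev = dev;
    else
        net->dev_head = dev;
    net->dev_tail = dev;

    dev->next_by_name = net->name_hash[n];    /* push on the name bucket */
    net->name_hash[n] = dev;
    dev->next_by_index = net->index_hash[i];  /* push on the index bucket */
    net->index_hash[i] = dev;
}

static struct model_netdev *model_get_by_name(struct model_net *net, const char *name)
{
    struct model_netdev *d = net->name_hash[hash_name(name)];

    while (d && strcmp(d->name, name) != 0)
        d = d->next_by_name;
    return d;
}

static struct model_netdev *model_get_by_index(struct model_net *net, int ifindex)
{
    struct model_netdev *d = net->index_hash[hash_index(ifindex)];

    while (d && d->ifindex != ifindex)
        d = d->next_by_index;
    return d;
}
```

    Walking dev_head visits every device in registration order, while the two lookups only touch a single bucket.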

    Device state

    The fields of net_device that describe the device state:

    unsigned long		state;//@state:Generic network queuing layer state, see netdev_state_t
    unsigned int		flags;//@flags:Interface flags (a la BSD)
    enum { NETREG_UNINITIALIZED=0,
    NETREG_REGISTERED,	/* completed register_netdevice */
    NETREG_UNREGISTERING,	/* called unregister_netdevice */
    NETREG_UNREGISTERED,	/* completed unregister todo */
    NETREG_RELEASED,		/* called free_netdev */
    NETREG_DUMMY,		/* dummy device for NAPI poll */
    } reg_state:8;//Register/unregister state machine

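    Queuing discipline state

    Every network device is assigned a queuing discipline, which traffic control uses to implement QoS. The state field of net_device is one of the fields traffic control uses.
    state can carry the following flags:

    • __LINK_STATE_START
      The device is up. netif_running tests this flag.

    • __LINK_STATE_PRESENT
      The device is present. A hot-pluggable device can be removed temporarily. The flag is also cleared when the system suspends and set again when it resumes.

    • __LINK_STATE_NOCARRIER
      The NIC has no carrier; the link is down.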

    • __LINK_STATE_LINKWATCH_PENDING

    • __LINK_STATE_DORMANT

    /* These flag bits are private to the generic network queueing
    * layer; they may not be explicitly referenced by any other
    * code.
    */
    
    enum netdev_state_t {
    __LINK_STATE_START,
    __LINK_STATE_PRESENT,
    __LINK_STATE_NOCARRIER,
    __LINK_STATE_LINKWATCH_PENDING,
    __LINK_STATE_DORMANT,
    };

    Registration state

    The registration state of a network device is stored in the reg_state field.

    enum { NETREG_UNINITIALIZED=0,
    NETREG_REGISTERED,	/* completed register_netdevice */
    NETREG_UNREGISTERING,	/* called unregister_netdevice */
    NETREG_UNREGISTERED,	/* completed unregister todo */
    NETREG_RELEASED,		/* called free_netdev */
    NETREG_DUMMY,		/* dummy device for NAPI poll */
    } reg_state:8;

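    Registering and unregistering devices

    Network device drivers register and unregister devices with the kernel through register_netdev and unregister_netdev.

    Device registration

    register_netdev calls register_netdevice to do the real work.
    register_netdevice names the device with dev_get_valid_name. alloc_etherdev_mqs sets the name to the template "eth%d" when it allocates the net_device, and dev_get_valid_name replaces %d with an interface number.
    If dev->netdev_ops->ndo_init is set, that callback is invoked.
    A notification is sent on the notifier chain (NETDEV_POST_INIT in the excerpt below).
    The device is registered with sysfs.
    The registration state of the net_device is set to NETREG_REGISTERED.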

    ret = dev_get_valid_name(net, dev, dev->name);
    if (ret < 0)
    goto out;
    
    /* Init, if this function is available */
    if (dev->netdev_ops->ndo_init) {
    ret = dev->netdev_ops->ndo_init(dev);
    if (ret) {
    if (ret > 0)
    ret = -EIO;
    goto out;
    }
    }
    ...
    ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
    ret = notifier_to_errno(ret);
    if (ret)
    goto err_uninit;
    
    ret = netdev_register_kobject(dev);
    if (ret) {
    dev->reg_state = NETREG_UNREGISTERED;
    goto err_uninit;
    }
    dev->reg_state = NETREG_REGISTERED;
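
    The naming step can be pictured as filling the "%d" in the template with the lowest unit number that is not yet in use. The sketch below is a user-space illustration under that assumption. model_alloc_name, the taken[] array and the limit of 1024 units are hypothetical and do not reproduce the kernel's implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 16   /* same length as IFNAMSIZ on Linux */

/*
 * Expand a template such as "eth%d" to the lowest-numbered name that does
 * not appear in taken[]. Returns the chosen unit number, or -1 on failure.
 */
static int model_alloc_name(const char *tmpl,
                            const char *const taken[], int ntaken,
                            char out[NAME_LEN])
{
    const char *p;

    assert(tmpl != NULL && out != NULL);

    /* Accept exactly one "%d" and no other conversion. */
    p = strchr(tmpl, '%');
    if (!p || p[1] != 'd' || strchr(p + 2, '%'))
        return -1;

    for (int unit = 0; unit < 1024; unit++) {
        int used = 0;
        /* prefix + unit + suffix, built without using tmpl as a format */
        int n = snprintf(out, NAME_LEN, "%.*s%d%s",
                         (int)(p - tmpl), tmpl, unit, p + 2);

        if (n < 0 || n >= NAME_LEN)
            return -1;               /* name would not fit */

        for (int i = 0; i < ntaken; i++) {
            if (strcmp(taken[i], out) == 0) {
                used = 1;
                break;
            }
        }
        if (!used)
            return unit;             /* out now holds the final name */
    }
    return -1;
}
```

    With eth0, eth1 and eth3 already present, the template "eth%d" resolves to eth2 in this sketch, which fills the gap rather than appending after the highest number.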

    list_netdevice inserts the net_device into the global list and the two hash tables.

    list_netdevice(dev);
    static void list_netdevice(struct net_device *dev)
    {
    struct net *net = dev_net(dev);
    
    ASSERT_RTNL();
    
    write_lock_bh(&dev_base_lock);
    list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
    hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
    hlist_add_head_rcu(&dev->index_hlist,
    dev_index_hash(net, dev->ifindex));
    write_unlock_bh(&dev_base_lock);
    
    dev_base_seq_inc(net);
    }

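    dev_init_scheduler initializes the device's queuing discipline, which provides QoS. A queuing discipline defines how egress packets are enqueued on and dequeued from the egress queue, and how many packets may wait in the queue before packets start being dropped.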

    netdev_run_todo

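    register_netdevice does part of the registration work and leaves the rest to netdev_run_todo.
    Changes to net_device structures must be protected by rtnl_mutex, the Routing Netlink (RTNL) lock, which the kernel comments still call a semaphore. register_netdev therefore takes the lock with rtnl_lock_killable before calling register_netdevice and releases it afterward.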

    int register_netdev(struct net_device *dev)
    {
    int err;
    
    if (rtnl_lock_killable())
    return -EINTR;
    err = register_netdevice(dev);
    rtnl_unlock();
    return err;
    }

    rtnl_unlock calls netdev_run_todo. Why run netdev_run_todo when the lock is being released?

    void rtnl_unlock(void)
    {
    /* This fellow will unlock it for us. */
    netdev_run_todo();
    }

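    The code and comments of netdev_run_todo show that this design solves the following problems:

    • Deleting sysfs objects triggers hotplug events; doing it here avoids a deadlock with linkwatch via keventd.
    • Because the function runs without holding the RTNL lock, it can safely sleep while waiting for the net_device reference count to drop to zero. It must not return until all pending unregister events have completed.
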
    /* The sequence is:
    *
    * rtnl_lock();
    * ...
    * register_netdevice(x1);
    * register_netdevice(x2);
    * ...
    * unregister_netdevice(y1);
    * unregister_netdevice(y2);
    *      ...
    * rtnl_unlock();
    * free_netdev(y1);
    * free_netdev(y2);
    *
    * We are invoked by rtnl_unlock().
    * This allows us to deal with problems:
    * 1) We can delete sysfs objects which invoke hotplug
    *    without deadlocking with linkwatch via keventd.
    * 2) Since we run with the RTNL semaphore not held, we can sleep
    *    safely in order to wait for the netdev refcnt to drop to zero.
    *
    * We must not return until all unregister events added during
    * the interval the lock was held have been completed.
    */
    void netdev_run_todo(void)
    {
    struct list_head list;
    
    /* Snapshot list, allow later requests */
    list_replace_init(&net_todo_list, &list);
    
    __rtnl_unlock();
    
    /* Wait for rcu callbacks to finish before next phase */
    if (!list_empty(&list))
    rcu_barrier();
    
    while (!list_empty(&list)) {
    struct net_device *dev
    = list_first_entry(&list, struct net_device, todo_list);
    list_del(&dev->todo_list);
    
    if (unlikely(dev->reg_state != NETREG_UNREGISTERING)) {
    pr_err("network todo '%s' but state %d\n",
    dev->name, dev->reg_state);
    dump_stack();
    continue;
    }
    
    dev->reg_state = NETREG_UNREGISTERED;
    
    netdev_wait_allrefs(dev);
    
    /* paranoia */
    BUG_ON(netdev_refcnt_read(dev));
    BUG_ON(!list_empty(&dev->ptype_all));
    BUG_ON(!list_empty(&dev->ptype_specific));
    WARN_ON(rcu_access_pointer(dev->ip_ptr));
    WARN_ON(rcu_access_pointer(dev->ip6_ptr));
    #if IS_ENABLED(CONFIG_DECNET)
    WARN_ON(dev->dn_ptr);
    #endif
    if (dev->priv_destructor)
    dev->priv_destructor(dev);
    if (dev->needs_free_netdev)
    free_netdev(dev);
    
    /* Report a network device has been unregistered */
    rtnl_lock();
    dev_net(dev)->dev_unreg_count--;
    __rtnl_unlock();
    wake_up(&netdev_unregistering_wq);
    
    /* Free network device */
    kobject_put(&dev->dev.kobj);
    }
    }
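
    The idea of queuing work while the lock is held and finishing it as part of the unlock can be reduced to a few lines. The following user-space sketch shows only that ordering. lock_held, todo_list and the model_* functions are hypothetical, there is no real mutex or concurrency here, and the reference-count wait is represented by a comment.

```c
#include <assert.h>
#include <stddef.h>

enum reg_state { UNINITIALIZED, REGISTERED, UNREGISTERING, UNREGISTERED };

struct model_netdev {
    enum reg_state reg_state;
    struct model_netdev *todo_next;
};

static int lock_held;                    /* stand-in for rtnl_mutex */
static struct model_netdev *todo_list;   /* filled while the lock is held */

static void model_lock(void)
{
    assert(!lock_held);
    lock_held = 1;
}

/* First half of unregistration: runs under the lock, then defers the rest. */
static void model_unregister(struct model_netdev *dev)
{
    assert(lock_held);
    dev->reg_state = UNREGISTERING;
    dev->todo_next = todo_list;          /* queue for the unlock path */
    todo_list = dev;
}

/* Unlock and finish deferred work. Returns the number of devices completed. */
static int model_unlock(void)
{
    struct model_netdev *list = todo_list;   /* snapshot the queue */
    int done = 0;

    todo_list = NULL;                    /* later requests start a new queue */
    lock_held = 0;                       /* release before the slow part */

    while (list) {
        struct model_netdev *dev = list;

        list = dev->todo_next;
        /* The lock is not held here, so waiting for references may sleep. */
        dev->reg_state = UNREGISTERED;
        done++;
    }
    return done;
}
```

    Taking the snapshot before dropping the lock is what lets several unregister calls made under one lock/unlock pair be completed together afterward.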

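    Device registration status notification

    Registration, unregistration, open and close events of network devices are delivered through two channels:

    • netdev_chain
    • The Netlink RTMGRP_LINK multicast group

    netdev_chain

    Every stage of device registration and unregistration is reported through this notifier chain.
    Kernel components subscribe to and unsubscribe from the chain with register_netdevice_notifier and unregister_netdevice_notifier.
    call_netdevice_notifiers sends events on the chain. The supported events are: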

    /* netdevice notifier chain. Please remember to update netdev_cmd_to_name()
    * and the rtnetlink notification exclusion list in rtnetlink_event() when
    * adding new types.
    */
    enum netdev_cmd {
    NETDEV_UP	= 1,	/* For now you can't veto a device up/down */
    NETDEV_DOWN,
    NETDEV_REBOOT,		/* Tell a protocol stack a network interface
    detected a hardware crash and restarted
    - we can use this eg to kick tcp sessions
    once done */
    NETDEV_CHANGE,		/* Notify device state change */
    NETDEV_REGISTER,
    NETDEV_UNREGISTER,
    NETDEV_CHANGEMTU,	/* notify after mtu change happened */
    NETDEV_CHANGEADDR,
    NETDEV_GOING_DOWN,
    NETDEV_CHANGENAME,
    NETDEV_FEAT_CHANGE,
    NETDEV_BONDING_FAILOVER,
    NETDEV_PRE_UP,
    NETDEV_PRE_TYPE_CHANGE,
    NETDEV_POST_TYPE_CHANGE,
    NETDEV_POST_INIT,
    NETDEV_RELEASE,
    NETDEV_NOTIFY_PEERS,
    NETDEV_JOIN,
    NETDEV_CHANGEUPPER,
    NETDEV_RESEND_IGMP,
    NETDEV_PRECHANGEMTU,	/* notify before mtu change happened */
    NETDEV_CHANGEINFODATA,
    NETDEV_BONDING_INFO,
    NETDEV_PRECHANGEUPPER,
    NETDEV_CHANGELOWERSTATE,
    NETDEV_UDP_TUNNEL_PUSH_INFO,
    NETDEV_UDP_TUNNEL_DROP_INFO,
    NETDEV_CHANGE_TX_QUEUE_LEN,
    NETDEV_CVLAN_FILTER_PUSH_INFO,
    NETDEV_CVLAN_FILTER_DROP_INFO,
    NETDEV_SVLAN_FILTER_PUSH_INFO,
    NETDEV_SVLAN_FILTER_DROP_INFO,
    };

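    When another subsystem subscribes through register_netdevice_notifier, that function replays the registration events (and the up events of devices that are already open) for the devices that exist in the kernel to the new subscriber.
    This way a newly registered subsystem also learns the state of the devices already in the system.
    Kernel components that register with netdev_chain include:

    • Routing
    • Firewall
    • Protocol code
    • Virtual devices
    • RTnetlink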
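
    The replay behavior is easiest to see in a reduced model of a notifier chain. In the user-space sketch below, a subscriber that arrives after two devices were registered still receives one register event per device and an up event for the device that is open. All names (model_notifier, model_register_notifier, count_events and the EV_* constants) are hypothetical and only loosely mirror the kernel API.

```c
#include <assert.h>
#include <stddef.h>

enum model_event { EV_UP = 1, EV_DOWN, EV_REGISTER, EV_UNREGISTER };

struct model_netdev {
    const char *name;
    int up;                          /* nonzero if the device is open */
    struct model_netdev *next;
};

struct model_notifier {
    void (*call)(struct model_notifier *nb, enum model_event ev,
                 struct model_netdev *dev);
    struct model_notifier *next;
    int seen_register, seen_up;      /* bookkeeping for the example */
};

static struct model_netdev *dev_list;    /* devices registered so far */
static struct model_notifier *chain;     /* current subscribers */

/* Deliver one event to every subscriber. */
static void model_call_notifiers(enum model_event ev, struct model_netdev *dev)
{
    for (struct model_notifier *nb = chain; nb; nb = nb->next)
        nb->call(nb, ev, dev);
}

/* Subscribe, then replay existing devices to the new subscriber only. */
static void model_register_notifier(struct model_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);

    nb->next = chain;
    chain = nb;
    for (struct model_netdev *d = dev_list; d; d = d->next) {
        nb->call(nb, EV_REGISTER, d);
        if (d->up)
            nb->call(nb, EV_UP, d);
    }
}

static void model_register_netdev(struct model_netdev *dev)
{
    dev->next = dev_list;
    dev_list = dev;
    model_call_notifiers(EV_REGISTER, dev);
}

/* Example callback: count what the subscriber has been told. */
static void count_events(struct model_notifier *nb, enum model_event ev,
                         struct model_netdev *dev)
{
    (void)dev;
    if (ev == EV_REGISTER)
        nb->seen_register++;
    if (ev == EV_UP)
        nb->seen_up++;
}
```

    Without the replay loop, a subsystem loaded late would have to walk the device list itself to build its initial view.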

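    Device unregistration

    To unregister a device, the kernel must:

    • Close the device with dev_close.
    • Release all resources (I/O ports, IRQs).
    • Remove the net_device from the global list and the two hash tables.
    • Free the net_device structure once all references to it have been released.
    • Remove the files added under /proc and sysfs.

    The unregister_netdev function

    Like register_netdev, unregister_netdev first takes the lock with rtnl_lock.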

    void unregister_netdev(struct net_device *dev)
    {
    rtnl_lock();
    unregister_netdevice(dev);
    rtnl_unlock();
    }
    EXPORT_SYMBOL(unregister_netdev);

    unregister_netdevice is a thin wrapper around unregister_netdevice_queue, which removes the device from the kernel. The remaining work is queued with net_set_todo and completed when rtnl_unlock runs.

    /**
    *	unregister_netdevice_queue - remove device from the kernel
    *	@dev: device
    *	@head: list
    *
    *	This function shuts down a device interface and removes it
    *	from the kernel tables.
    *	If head not NULL, device is queued to be unregistered later.
    *
    *	Callers must hold the rtnl semaphore.  You may want
    *	unregister_netdev() instead of this.
    */
    void unregister_netdevice_queue(struct net_device *dev, struct list_head *head)
    {
    ASSERT_RTNL();
    
    if (head) {
    list_move_tail(&dev->unreg_list, head);
    } else {
    rollback_registered(dev);
    /* Finish processing unregister after unlock */
    net_set_todo(dev);
    }
    }

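    rollback_registered does the actual unregistration work. It puts the device on a one-entry list and hands it to rollback_registered_many: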

    static void rollback_registered_many(struct list_head *head)
    {
    struct net_device *dev, *tmp;
    LIST_HEAD(close_head);
    
    BUG_ON(dev_boot_phase);
    ASSERT_RTNL();
    
    list_for_each_entry_safe(dev, tmp, head, unreg_list) {
    /* Some devices call without registering
    * for initialization unwind. Remove those
    * devices and proceed with the remaining.
    */
    if (dev->reg_state == NETREG_UNINITIALIZED) {
    pr_debug("unregister_netdevice: device %s/%p never was registered\n",
    dev->name, dev);
    
    WARN_ON(1);
    list_del(&dev->unreg_list);
    continue;
    }
    dev->dismantle = true;
    BUG_ON(dev->reg_state != NETREG_REGISTERED);
    }
    
    /* If device is running, close it first. */
    list_for_each_entry(dev, head, unreg_list)
    list_add_tail(&dev->close_list, &close_head);
    dev_close_many(&close_head, true);
    
    list_for_each_entry(dev, head, unreg_list) {
    /* And unlink it from device chain. */
    unlist_netdevice(dev);
    
    dev->reg_state = NETREG_UNREGISTERING;
    }
    flush_all_backlogs();
    
    synchronize_net();
    
    list_for_each_entry(dev, head, unreg_list) {
    struct sk_buff *skb = NULL;
    
    /* Shutdown queueing discipline. */
    dev_shutdown(dev);
    
    dev_xdp_uninstall(dev);
    
    /* Notify protocols, that we are about to destroy
    * this device. They should clean all the things.
    */
    call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
    
    if (!dev->rtnl_link_ops ||
    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
    skb = rtmsg_ifinfo_build_skb(RTM_DELLINK, dev, ~0U, 0,
    GFP_KERNEL, NULL, 0);
    
    /*
    *	Flush the unicast and multicast chains
    */
    dev_uc_flush(dev);
    dev_mc_flush(dev);
    
    if (dev->netdev_ops->ndo_uninit)
    dev->netdev_ops->ndo_uninit(dev);
    
    if (skb)
    rtmsg_ifinfo_send(skb, dev, GFP_KERNEL);
    
    /* Notifier chain MUST detach us all upper devices. */
    WARN_ON(netdev_has_any_upper_dev(dev));
    WARN_ON(netdev_has_any_lower_dev(dev));
    
    /* Remove entries from kobject tree */
    netdev_unregister_kobject(dev);
    #ifdef CONFIG_XPS
    /* Remove XPS queueing entries */
    netif_reset_xps_queues_gt(dev, 0);
    #endif
    }
    
    synchronize_net();
    
    list_for_each_entry(dev, head, unreg_list)
    dev_put(dev);
    }

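    Reference counts

    A net_device is freed only after every reference to it has been released.
    When unregister_netdev is called the reference count may still be nonzero, so the net_device cannot be freed yet; the kernel must wait until every other part of the kernel has dropped its reference. A device that has been unregistered can no longer be used, however, so the kernel must tell all reference holders to release their references. It does so by sending an unregister notification on netdev_chain.

    As described in the previous section, rtnl_unlock calls netdev_run_todo, which in turn calls netdev_wait_allrefs. That function waits until the net_device reference count reaches 0.

    netdev_wait_allrefs

    netdev_wait_allrefs is a loop that ends when netdev_refcnt drops to 0.
    Inside the loop it resends NETDEV_UNREGISTER on netdev_chain about once per second.
    It prints a warning every 10 seconds.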

    /**
    * netdev_wait_allrefs - wait until all references are gone.
    * @dev: target net_device
    *
    * This is called when unregistering network devices.
    *
    * Any protocol or device that holds a reference should register
    * for netdevice notification, and cleanup and put back the
    * reference if they receive an UNREGISTER event.
    * We can get stuck here if buggy protocols don't correctly
    * call dev_put.
    */
    static void netdev_wait_allrefs(struct net_device *dev)
    {
    unsigned long rebroadcast_time, warning_time;
    int refcnt;
    
    linkwatch_forget_dev(dev);
    
    rebroadcast_time = warning_time = jiffies;
    refcnt = netdev_refcnt_read(dev);
    
    while (refcnt != 0) {
    if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
    rtnl_lock();
    
    /* Rebroadcast unregister notification */
    call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
    
    __rtnl_unlock();
    rcu_barrier();
    rtnl_lock();
    
    if (test_bit(__LINK_STATE_LINKWATCH_PENDING,
    &dev->state)) {
    /* We must not have linkwatch events
    * pending on unregister. If this
    * happens, we simply run the queue
    * unscheduled, resulting in a noop
    * for this device.
    */
    linkwatch_run_queue();
    }
    
    __rtnl_unlock();
    
    rebroadcast_time = jiffies;
    }
    
    msleep(250);
    
    refcnt = netdev_refcnt_read(dev);
    
    if (refcnt && time_after(jiffies, warning_time + 10 * HZ)) {
    pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
    dev->name, refcnt);
    warning_time = jiffies;
    }
    }
    }
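
    The timing of that loop can be checked with a simulated clock. The user-space sketch below keeps the same structure (poll every 250 ms, rebroadcast after more than one second, warn after more than ten seconds) but replaces jiffies, msleep and the notifier call with hypothetical stand-ins. The release_at field is an assumption that models a slow reference holder.

```c
#include <assert.h>

#define MODEL_HZ 1000L               /* ticks per simulated second */

struct model_netdev {
    int refcnt;
    long release_at;                 /* tick at which the last holder lets go */
    int rebroadcasts;                /* unregister events sent again */
    int warnings;                    /* "waiting for ... to become free" */
};

static long model_jiffies;           /* simulated clock, starts at 0 */

/* Stand-in for msleep(): advance the clock and let a late holder release. */
static void model_msleep(struct model_netdev *dev, long ms)
{
    model_jiffies += ms * MODEL_HZ / 1000;
    if (dev->refcnt > 0 && model_jiffies >= dev->release_at)
        dev->refcnt--;
}

static void model_wait_allrefs(struct model_netdev *dev)
{
    long rebroadcast_time = model_jiffies;
    long warning_time = model_jiffies;

    assert(dev->refcnt >= 0);

    while (dev->refcnt != 0) {
        /* More than one second since the last broadcast: send it again. */
        if (model_jiffies - rebroadcast_time > 1 * MODEL_HZ) {
            dev->rebroadcasts++;
            rebroadcast_time = model_jiffies;
        }

        model_msleep(dev, 250);

        /* Still referenced after more than ten seconds: complain. */
        if (dev->refcnt && model_jiffies - warning_time > 10 * MODEL_HZ) {
            dev->warnings++;
            warning_time = model_jiffies;
        }
    }
}
```

    Because the loop only polls every 250 ms and the comparison is strict, the rebroadcast interval in this model is slightly longer than one second. A holder that never drops its reference keeps the loop, and therefore the unregistration, waiting indefinitely.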

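    Enabling a device

    Once registered, a device is available, but it cannot transmit or receive packets until an application explicitly enables it. dev_open enables a device.
    Enabling a device involves the following tasks:

    • Call the relevant callback the driver registered in dev->netdev_ops (ndo_open, if defined).
    • Set __LINK_STATE_START in dev->state.
    • Set IFF_UP in dev->flags.
    • Call dev_activate to initialize the egress queuing discipline used by traffic control and start the watchdog timer. If traffic control is not configured, a default FIFO queue is assigned.
    • Send NETDEV_UP on the netdev_chain notifier chain.

    Disabling a device

    dev_close disables a network device. It roughly does the following:

    • Send NETDEV_DOWN on the netdev_chain notifier chain.
    • Call dev_deactivate_many to shut down the egress queuing discipline, so the device can no longer transmit, and stop the watchdog timer.
    • Clear __LINK_STATE_START in dev->state.
    • Clear IFF_UP in dev->flags.
    • Call dev->netdev_ops->ndo_stop if it is defined.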
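
    The flag handling in the two lists above can be summarized in a small state model. The user-space sketch below tracks only the START bit, the IFF_UP flag and a count of notifications. The queuing discipline and watchdog steps are left out, the ordering of steps is simplified, and every name (model_open, model_close, MODEL_IFF_UP and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { LINK_STATE_START, LINK_STATE_PRESENT };   /* bit numbers in state */
#define MODEL_IFF_UP 0x1u

struct model_netdev {
    unsigned long state;
    unsigned int flags;
    int (*ndo_open)(struct model_netdev *);   /* optional, may be NULL */
    int (*ndo_stop)(struct model_netdev *);   /* optional, may be NULL */
    int up_events, down_events;               /* stand-in for notifications */
};

/* Counterpart of netif_running(): is the START bit set? */
static int model_running(const struct model_netdev *dev)
{
    return (int)((dev->state >> LINK_STATE_START) & 1UL);
}

static int model_open(struct model_netdev *dev)
{
    assert(dev != NULL);

    if (dev->flags & MODEL_IFF_UP)
        return 0;                              /* already up, nothing to do */
    if (!((dev->state >> LINK_STATE_PRESENT) & 1UL))
        return -1;                             /* device is not present */

    dev->state |= 1UL << LINK_STATE_START;
    if (dev->ndo_open) {                       /* NULL check before the call */
        int ret = dev->ndo_open(dev);

        if (ret) {                             /* driver failed: undo */
            dev->state &= ~(1UL << LINK_STATE_START);
            return ret;
        }
    }
    dev->flags |= MODEL_IFF_UP;
    dev->up_events++;                          /* would be NETDEV_UP */
    return 0;
}

static void model_close(struct model_netdev *dev)
{
    assert(dev != NULL);

    if (!(dev->flags & MODEL_IFF_UP))
        return;                                /* already down */

    dev->state &= ~(1UL << LINK_STATE_START);
    if (dev->ndo_stop)
        dev->ndo_stop(dev);
    dev->flags &= ~MODEL_IFF_UP;
    dev->down_events++;                        /* would be NETDEV_DOWN */
}
```

    Opening an already open device or closing an already closed one is a no-op in this model, so no duplicate notifications are produced.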

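    Updating the device queuing discipline state

    Interaction with power management

    The suspend and resume functions of pci_driver are initialized depending on whether the kernel supports power management. When the system suspends, the kernel runs the suspend function supplied by the device driver so that the driver can act. Power management does not affect net_device->reg_state, but net_device->state must be updated.

    Suspending a device

    When a device is suspended, its suspend function handles the event. The actions include:

    • Clear __LINK_STATE_PRESENT in dev->state.
    • If the device is up, stop its egress queues so that no more packets are handed to it.

    netif_device_detach does this work: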

    /**
    * netif_device_detach - mark device as removed
    * @dev: network device
    *
    * Mark device as removed from system and therefore no longer available.
    */
    void netif_device_detach(struct net_device *dev)
    {
    if (test_and_clear_bit(__LINK_STATE_PRESENT, &dev->state) &&
    netif_running(dev)) {
    netif_tx_stop_all_queues(dev);
    }
    }

    Resuming a device

    The resume function brings the device back into operation; netif_device_attach does the work:

    void netif_device_attach(struct net_device *dev)
    {
    if (!test_and_set_bit(__LINK_STATE_PRESENT, &dev->state) &&
    netif_running(dev)) {
    netif_tx_wake_all_queues(dev);
    __netdev_watchdog_up(dev);
    }
    }

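    Link state change detection

    A NIC driver learns whether the carrier is present either from a notification raised by the NIC or by reading NIC registers. It reports the result to the kernel with netif_carrier_on and netif_carrier_off.
    The link state changes when:

    • A cable is plugged into or unplugged from the NIC.
    • The state of the device at the other end of the cable changes.

    When the driver detects loss of carrier it calls netif_carrier_off, which sets __LINK_STATE_NOCARRIER and calls linkwatch_fire_event.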

    void netif_carrier_off(struct net_device *dev)
    {
    if (!test_and_set_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
    if (dev->reg_state == NETREG_UNINITIALIZED)
    return;
    atomic_inc(&dev->carrier_down_count);
    linkwatch_fire_event(dev);
    }
    }

    When the driver detects that the link has a carrier it calls netif_carrier_on, which clears __LINK_STATE_NOCARRIER and calls linkwatch_fire_event.

    void netif_carrier_on(struct net_device *dev)
    {
    if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
    if (dev->reg_state == NETREG_UNINITIALIZED)
    return;
    atomic_inc(&dev->carrier_up_count);
    linkwatch_fire_event(dev);
    if (netif_running(dev))
    __netdev_watchdog_up(dev);
    }
    }

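    linkwatch_fire_event tests and sets __LINK_STATE_LINKWATCH_PENDING in net_device->state. If the flag was not already set, it calls linkwatch_add_event, which appends dev->link_watch_list to the tail of the lweventlist list and takes a reference on the device.
    lweventlist holds the devices whose carrier state has changed. A device appears on the list at most once no matter how many times its carrier changed, because the list links net_device structures rather than individual events.
    Once the net_device is on lweventlist, or when linkwatch_urgent_event returns true, the event is handed to a kernel workqueue for execution (keventd in older kernels).
    To keep linkwatch_event from running too often, non-urgent processing is limited to once per second.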

    static void linkwatch_add_event(struct net_device *dev)
    {
    unsigned long flags;
    
    spin_lock_irqsave(&lweventlist_lock, flags);
    if (list_empty(&dev->link_watch_list)) {
    list_add_tail(&dev->link_watch_list, &lweventlist);
    dev_hold(dev);
    }
    spin_unlock_irqrestore(&lweventlist_lock, flags);
    }
    
    void linkwatch_fire_event(struct net_device *dev)
    {
    bool urgent = linkwatch_urgent_event(dev);
    
    if (!test_and_set_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state)) {
    linkwatch_add_event(dev);
    } else if (!urgent)
    return;
    
    linkwatch_schedule_work(urgent);
    }
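
    The effect of the pending flag is that repeated carrier changes collapse into one queued entry per device. The single-threaded user-space sketch below demonstrates that behavior only. It has no locking, atomics, reference counting or urgency handling, and the names model_fire_event, model_run_queue and PENDING_BIT are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_BIT 0x1u

struct model_netdev {
    unsigned int state;
    int carrier_changes;             /* changes reported by the driver */
    int notifications;               /* times the queue processed the device */
    struct model_netdev *next;       /* link in the event list */
};

static struct model_netdev *event_head, *event_tail;

/* Returns 1 if the device was queued, 0 if it was already pending. */
static int model_fire_event(struct model_netdev *dev)
{
    assert(dev != NULL);

    dev->carrier_changes++;
    if (dev->state & PENDING_BIT)
        return 0;                    /* already on the list: coalesce */

    dev->state |= PENDING_BIT;
    dev->next = NULL;                /* append at the tail */
    if (event_tail)
        event_tail->next = dev;
    else
        event_head = dev;
    event_tail = dev;
    return 1;
}

/* Drain the list. Returns the number of devices handled. */
static int model_run_queue(void)
{
    struct model_netdev *dev = event_head;
    int handled = 0;

    event_head = event_tail = NULL;  /* take the whole list at once */
    while (dev) {
        struct model_netdev *next = dev->next;

        dev->state &= ~PENDING_BIT;  /* new events may be queued again */
        dev->notifications++;        /* one notification per device */
        handled++;
        dev = next;
    }
    return handled;
}
```

    A link that flaps three times before the queue runs therefore produces one notification in this model, and that notification reflects the state at processing time rather than each intermediate change.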

    linkwatch_schedule_work schedules the delayed work item linkwatch_work on the system workqueue, so that its handler linkwatch_event gets called.

    static DECLARE_DELAYED_WORK(linkwatch_work, linkwatch_event);
    
    static void linkwatch_schedule_work(int urgent)
    {
    unsigned long delay = linkwatch_nextevent - jiffies;
    
    if (test_bit(LW_URGENT, &linkwatch_flags))
    return;
    
    /* Minimise down-time: drop delay for up event. */
    if (urgent) {
    if (test_and_set_bit(LW_URGENT, &linkwatch_flags))
    return;
    delay = 0;
    }
    
    /* If we wrap around we'll delay it by at most HZ. */
    if (delay > HZ)
    delay = 0;
    
    /*
    * If urgent, schedule immediate execution; otherwise, don't
    * override the existing timer.
    */
    if (test_bit(LW_URGENT, &linkwatch_flags))
    mod_delayed_work(system_wq, &linkwatch_work, 0);
    else
    schedule_delayed_work(&linkwatch_work, delay);
    }

    linkwatch_event calls __linkwatch_run_queue.
    __linkwatch_run_queue calls linkwatch_do_dev for every device on lweventlist.
    linkwatch_do_dev clears __LINK_STATE_LINKWATCH_PENDING and, if the device is up, notifies netdev_chain through netdev_state_change.

    static void linkwatch_do_dev(struct net_device *dev)
    {
    /*
    * Make sure the above read is complete since it can be
    * rewritten as soon as we clear the bit below.
    */
    smp_mb__before_atomic();
    
    /* We are about to handle this device,
    * so new events can be accepted
    */
    clear_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state);
    
    rfc2863_policy(dev);
    if (dev->flags & IFF_UP) {
    if (netif_carrier_ok(dev))
    dev_activate(dev);
    else
    dev_deactivate(dev);
    
    netdev_state_change(dev);
    }
    dev_put(dev);
    }
    static void __linkwatch_run_queue(int urgent_only)
    {
    struct net_device *dev;
    LIST_HEAD(wrk);
    
    /*
    * Limit the number of linkwatch events to one
    * per second so that a runaway driver does not
    * cause a storm of messages on the netlink
    * socket.  This limit does not apply to up events
    * while the device qdisc is down.
    */
    if (!urgent_only)
    linkwatch_nextevent = jiffies + HZ;
    /* Limit wrap-around effect on delay. */
    else if (time_after(linkwatch_nextevent, jiffies + HZ))
    linkwatch_nextevent = jiffies;
    
    clear_bit(LW_URGENT, &linkwatch_flags);
    
    spin_lock_irq(&lweventlist_lock);
    list_splice_init(&lweventlist, &wrk);
    
    while (!list_empty(&wrk)) {
    
    dev = list_first_entry(&wrk, struct net_device, link_watch_list);
    list_del_init(&dev->link_watch_list);
    
    if (urgent_only && !linkwatch_urgent_event(dev)) {
    list_add_tail(&dev->link_watch_list, &lweventlist);
    continue;
    }
    spin_unlock_irq(&lweventlist_lock);
    linkwatch_do_dev(dev);
    spin_lock_irq(&lweventlist_lock);
    }
    
    if (!list_empty(&lweventlist))
    linkwatch_schedule_work(0);
    spin_unlock_irq(&lweventlist_lock);
    }
    
    static void linkwatch_event(struct work_struct *dummy)
    {
    rtnl_lock();
    __linkwatch_run_queue(time_after(linkwatch_nextevent, jiffies));
    rtnl_unlock();
    }

    Virtual devices

    Use cases for virtual devices:

    • Bonding interfaces
    • VLAN interfaces