Make /sys/class/net per net namespace objects belong to container

Submitted by Tyler Hicks on July 13, 2018, 4:05 p.m.

Details

Reviewer None
Submitted July 13, 2018, 4:05 p.m.
Last Updated July 23, 2018, 5:50 p.m.
Revision 2

Cover Letter

This is a revival of an older patch set from Dmitry Torokhov:

 https://lore.kernel.org/lkml/1471386795-32918-1-git-send-email-dmitry.torokhov@gmail.com/

My submission of v2 is here:

 https://lore.kernel.org/lkml/1531497949-1766-1-git-send-email-tyhicks@canonical.com/

Here's Dmitry's description:

 There are objects in /sys hierarchy (/sys/class/net/) that logically
 belong to a namespace/container. Unfortunately all sysfs objects start
 their life belonging to global root, and while we could change
 ownership manually, keeping tracks of all objects that come and go is
 cumbersome. It would be better if kernel created them using correct
 uid/gid from the beginning. 

 This series changes kernfs to allow creating object's with arbitrary
 uid/gid, adds get_ownership() callback to ktype structure so subsystems
 could supply their own logic (likely tied to namespace support) for
 determining ownership of kobjects, and adjusts sysfs code to make use
 of this information. Lastly net-sysfs is adjusted to make sure that
 objects in net namespace are owned by the root user from the owning
 user namespace.

 Note that we do not adjust ownership of objects moved into a new
 namespace (as when moving a network device into a container) as
 userspace can easily do it.

I'm reviving this patch set because we would like this feature for
system containers. One specific use case that we have is that libvirt is
unable to configure its bridge device inside of a system container due
to the bridge files in /sys/class/net/ being owned by init root instead
of container root. The last two patches in this set are patches that
I've added to Dmitry's original set to allow such configuration of the
bridge device.

Eric had previously provided feedback that he didn't favor these changes
affecting all layers of the stack and that most of the changes could
remain local to drivers/base/core.c. That feedback is certainly sensible
but I wanted to send out v2 of the patch set without making that large
of a change since quite a bit of time has passed and the bridge changes
in the last patch of this set shows that not all of the changes will be
local to drivers/base/core.c. I'm happy to make the changes if the
original request still stands.

* Changes since v2:
  - Added my Co-Developed-by and Signed-off-by tags to all of Dmitry's
    patches that I've modified
  - Patch 1 received build failure fixes in
    arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
  - Patch 2 was updated to drop the declaration of sysfs_add_file() from
    sysfs.h since the patch removed all other uses of the function
  - Patch 5 is a new patch that prevents tx_maxrate from being written
    to from inside of a container
    + Maybe I'm being too cautious here but the restriction can always
      be loosened up later
  - Patches 6 and 7 were updated to make net_ns_get_ownership() always
    initialize uid and gid, even when the network namespace is NULL, so
    that it isn't a dangerous function to reuse
    + Requested by Christian Brauner
  - I've looked at all sysfs attributes affected by this patch set and
    feel comfortable about the changes. There are quite a few affected
    attributes that don't have any capable()/ns_capable() checks in
    their store operations (per_bond_attrs, at91_sysfs_attrs,
    sysfs_grcan_attrs, ican3_sysfs_attrs, cdc_ncm_sysfs_attrs,
    qmi_wwan_sysfs_attrs) but I think this is acceptable. It means that
    container root, rather than specifically CAP_NET_ADMIN inside of the
    network namespace that the device belongs to, can write to those
    device attributes. It's the same situation that those devices have
    today in that init root is able to write to the attributes without
    necessarily having CAP_NET_ADMIN. I think that this should probably
    be fixed in order to be consistent with what netdev_store() does by
    verifying CAP_NET_ADMIN in the network namespace but that it doesn't
    need to happen in this patch set.

Thanks!

Tyler
  

Revisions