Thursday, 22 September 2016

The confluence of Linux capabilities and namespaces

Linux capabilities were introduced in kernel 2.2, while user namespaces only became fully usable around kernel 3.8 (a feature like this often takes many releases to go from its first introduction to a complete implementation). Two technologies that far apart in the release history came to a confluence in the patch set proposed by Serge E. Hallyn. Let me break things down for you.

A. Linux Capabilities

Capabilities came into the picture to split the power of root (EUID 0, traditionally what a program gets when it is run with sudo) into finer-grained pieces. Before that, a program running as root was all-powerful. Say I am running Wireshark: it should have only network-related powers, and if it tries to insert a kernel module (say someone exploited my Wireshark instance) it should be denied. With the EUID as the only check, this was not possible. Now I can give Wireshark the "CAP_NET_ADMIN" and "CAP_NET_RAW" capabilities (and perhaps some others related to networking), and withhold "CAP_SYS_MODULE".

B. Namespaces

Namespaces are a Linux kernel feature that isolates and virtualizes resources (PID, hostname, user ID, network, IPC, filesystem) of a collection of processes (~ source: Wikipedia). For example, two processes running on the same system can have the same PID, as long as they belong to different PID namespaces.

C. User Namespace

User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs (see credentials(7)), the root directory, keys (see keyctl(2)), and capabilities (see capabilities(7)). A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace. (~ source : man page)
 
*Permissions for namespaces of the other kinds are checked in the user namespace in which they were created. (~ source: Wikipedia)

So, in a way, user namespaces control access to the various components of a system (for example managing files, killing processes, etc.), just as user and group IDs do on a traditional system. The only difference is that, with user namespaces, the "system" is defined by the other namespaces associated with it, such as the mount namespace, the PID namespace, etc.

The starred statement above then says something like this: if a process in user namespace UN1 has the capability to kill other processes (I'll explain how capabilities intersect with user namespaces in a while; for now assume that it is run with sudo), and it tries to kill a process in another user namespace, say UN2, where UN1 and UN2 are disjoint and neither is an ancestor of the other, then this must not be allowed.

D. The init_user namespace (init_user_ns)

This is simply the root user namespace, i.e. the user namespace that is created at boot time.

E. The confluence

The intersection of the two technologies allows an unprivileged process to create a new user namespace in which it has full capabilities. Thus, in the "pseudo system" it creates, it can use Linux capabilities, but outside this "system" it is powerless. By "pseudo system" I mean a new user namespace together with the resources assigned to it through other namespaces such as PID, mount, net, etc.

F. Digging deeper

If we look at all the places in the Linux kernel where capability checks are made, we see two types of checks, depending on the operation being performed.

  • Checks with respect to the user namespace of the target process.
    For operations in which one process tries to affect another process, the capability is checked in the user namespace of the affected process: a process trying to kill another process must have "CAP_KILL" in the user namespace of the process it is trying to kill. For example, let's look at the following code.

/* kernel/signal.c:692 */
static int kill_ok_by_cred(struct task_struct *t) /* t is the task_struct of the task that is being killed */
{
  const struct cred *cred = current_cred();
  const struct cred *tcred = __task_cred(t);

  if (uid_eq(cred->euid, tcred->suid) ||
      uid_eq(cred->euid, tcred->uid) ||
      uid_eq(cred->uid, tcred->suid) ||
      uid_eq(cred->uid, tcred->uid))
    return 1;

  if (ns_capable(tcred->user_ns, CAP_KILL))
    return 1;

  return 0;
}

"kill_ok_by_cred()" is part of the permission checks performed when a process tries to kill another process. As the "ns_capable(tcred->user_ns, CAP_KILL)" call shows, the check considers not just whether the process holds the capability, but whether it holds it with respect to the user namespace of the target process.

  • (Incorrect, superseded by the next item) Checks that consider only the capability of the process, without regard to the user namespace the process is in.
    Say a process P1 is trying to install a kernel module. From the point of view of the process, it is not trying to influence a process belonging to some other user namespace, so only the capability would matter, and not the user namespace.

  • Checks that consider the capability with respect to the init_user namespace (init_user_ns).
    Some actions have system-wide consequences, for example when a process P1 tries to install a module. Such checks are made with respect to the init_user namespace. The question "is this user allowed to install a module with respect to their own namespace?" has no meaning, since the installation affects the whole system. A better question to ask is "is this user allowed to install a module with respect to the root namespace (init_user_ns)?". Since init_user_ns is the ancestor of all other user namespaces, a capability held there has system-wide scope.

    Consider the following code. The check that the process has the required capability is made by "may_init_module()", which calls "capable()", which in turn calls "ns_capable()", the same function used in the previous code in this article. Notice the difference, though: instead of the target's user namespace, the capability is now checked with respect to the "init_user" namespace.

/* kernel/module.c:3321 */
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
  int err;
  struct load_info info = { };

  err = may_init_module();
  if (err)
. . .

}

/* kernel/module.c:3136 */
static int may_init_module(void)
{
  if (!capable(CAP_SYS_MODULE) || modules_disabled)
    return -EPERM;

  return 0;
}

/* kernel/capability.c:405 */
bool capable(int cap)
{
  return ns_capable(&init_user_ns, cap);
}

Tags: linux kernel, lxc, linux namespaces, linuxfarer.blogspot.in, published, linux capabilities, init_user_ns
September 22, 2016 at 07:20PM

Wednesday, 21 September 2016

Debugging Linux kernel using virtual machine - Qemu monitor and GDB

This method of debugging the kernel uses the built-in gdb server that Qemu exposes for a running virtual machine. It is superior to just running gdb on the kernel, since the Qemu monitor provides many more options.

A. Preparation
Tell Qemu to provide a way to connect to its monitor. We will use a Unix domain socket here.

1) If you are using virt-manager (libvirt/virsh) to manage your Qemu virtual machines (else skip).
Add a Qemu monitor interface to the VM.

virsh edit <vm-name>

Make the following changes to the config file.

(i) change

<domain type='kvm' >
to
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

(ii) Add the following lines
Use a different id than monitor2 if it has already been used. Also ensure that the file "/var/lib/libvirt/qemu/<vm-name>/monitor2.sock" isn't already in use.

<qemu:commandline>
         <qemu:arg value='-chardev'/>
         <qemu:arg value='socket,id=charmonitor2,path=/var/lib/libvirt/qemu/<vm-name>/monitor2.sock,server,nowait'/>
         <qemu:arg value='-mon'/>
         <qemu:arg value='chardev=charmonitor2,id=monitor2,mode=readline'/>
</qemu:commandline>

2) If you are running Qemu directly from the command line (else skip).

Add the following to your qemu command line (everything after the "..." below).

qemu-system-x86_64 ... -chardev socket,id=charmonitor2,path=/var/lib/libvirt/qemu/<vm-name>/monitor2.sock,server,nowait -mon chardev=charmonitor2,id=monitor2,mode=readline

B. Run
Run the virtual machine.

C. Connect to Qemu monitor

(The first line below is the command to type. At the "(qemu)" prompt, the text after the prompt is what you type; the rest is output from the console.)

sudo socat - UNIX-CONNECT:/var/lib/libvirt/qemu/<vm-name>/monitor2.sock

Now you will see the following command prompt.

(qemu) 
(qemu) help info
help info
info balloon -- show balloon information
info block [-n] [-v] [device] -- show info of one block device or all block devices (-n: show named nodes; -v: show details)
info block-jobs -- show progress of ongoing block device operations
info blockstats -- show block device statistics
info capture -- show capture information
info chardev -- show the character devices
info cpus -- show infos for each CPU
info cpustats -- show CPU statistics
 . . .
info ioapic -- show io apic state
info iothreads -- show iothreads
info irq -- show the interrupts statistics (if available)
info jit -- show dynamic compiler info
info kvm -- show KVM information
info lapic -- show local apic state
info mem -- show the active virtual memory mappings
info memdev -- show memory backends
info memory-devices -- show memory devices
...
info mtree -- show memory tree
. . .


D. Start gdb server

(qemu) gdbserver
gdbserver
Waiting for gdb connection on device 'tcp::1234'
(qemu) 


E. Connect to the started gdb server from host, or from a remote machine.

gdb vmlinux
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
native_safe_halt () at ./arch/x86/include/asm/irqflags.h:50

The first line loads debugging symbols from the vmlinux file produced when compiling the kernel. A vmlinux can also be extracted from the vmlinuz file present in the "/boot" folder, though an image extracted this way usually lacks debug symbols.
As the last line of the output shows, the virtual machine is halted. To continue running the VM, enter the continue command.


(gdb) continue


Tags: linuxfarer.blogspot.in
September 22, 2016 at 03:16AM

Finding capabilities from inside Linux kernel

Problem statement -
You have to find (print) the capabilities of a process from inside the kernel.
Since you are inside the Linux kernel, you cannot use user space utilities (shell commands) to get capabilities, nor do you have the user space library calls capget() and capset(). Invoking the system calls capget() and capset() [sys_capget and sys_capset; notice that the system calls and their user space counterparts have the same names] from inside the kernel may seem convoluted.

Solution -
Use the following code.

/* Remember: we are inside the kernel. */
/* Pass pid = 0 to print the capabilities of the current process. */

/* Defined further down; declared here because print_capabilities() calls it first. */
static inline int lxc_uio_cap_get_target_pid(pid_t pid, kernel_cap_t *pEp,
  kernel_cap_t *pIp, kernel_cap_t *pPp);

int print_capabilities(pid_t pid){

kernel_cap_t pE, pI, pP; /* effective, inheritable, and permitted */
int ret=0;
__u64 *cap=NULL;
__u64 one=1;

char *capabilities[]={
"CAP_CHOWN ",
"CAP_DAC_OVERRIDE ",
"CAP_DAC_READ_SEARCH ",
"CAP_FOWNER ",
"CAP_FSETID ",
"CAP_KILL ",
"CAP_SETGID ",
"CAP_SETUID ",
"CAP_SETPCAP ",
"CAP_LINUX_IMMUTABLE ",
"CAP_NET_BIND_SERVICE",
"CAP_NET_BROADCAST ",
"CAP_NET_ADMIN ",
"CAP_NET_RAW ",
"CAP_IPC_LOCK ",
"CAP_IPC_OWNER ",
"CAP_SYS_MODULE ",
"CAP_SYS_RAWIO ",
"CAP_SYS_CHROOT ",
"CAP_SYS_PTRACE ",
"CAP_SYS_PACCT ",
"CAP_SYS_ADMIN ",
"CAP_SYS_BOOT ",
"CAP_SYS_NICE ",
"CAP_SYS_RESOURCE ",
"CAP_SYS_TIME ",
"CAP_SYS_TTY_CONFIG ",
"CAP_MKNOD ",
"CAP_LEASE ",
"CAP_AUDIT_WRITE ",
"CAP_AUDIT_CONTROL ",
"CAP_SETFCAP ",
"CAP_MAC_OVERRIDE ",
"CAP_MAC_ADMIN ",
"CAP_SYSLOG ",
"CAP_WAKE_ALARM ",
"CAP_BLOCK_SUSPEND ",
"CAP_AUDIT_READ "
};

ret=lxc_uio_cap_get_target_pid(pid, &pE, &pI, &pP);
if(ret!=0){
/* error */
printk(KERN_ALERT "[lxc-uio] Error getting capabilities\n");
return ret;
}

/* Each set is read as one 64-bit mask (this assumes a little-endian layout of the two 32-bit words); bit N is capability number N. */
printk(KERN_ALERT "Effective Capabilities\n");
for(ret=0;ret<=37;ret++){
cap=(void *)&pE.cap[0];
printk(KERN_ALERT "[sahil] %s [%d]\n",capabilities[ret],(int)((*cap>>ret) & one));
}

printk(KERN_ALERT "Permitted Capabilities\n");
for(ret=0;ret<=37;ret++){
cap=(void *)&pP.cap[0];
printk(KERN_ALERT "[sahil] %s [%d]\n",capabilities[ret],(int)((*cap>>ret) & one));
}

printk(KERN_ALERT "Inheritable Capabilities\n");
for(ret=0;ret<=37;ret++){
cap=(void *)&pI.cap[0];
printk(KERN_ALERT "[sahil] %s [%d]\n",capabilities[ret],(int)((*cap>>ret) & one));
}

return 0;
}


static inline int lxc_uio_cap_get_target_pid(pid_t pid, kernel_cap_t *pEp,
  kernel_cap_t *pIp, kernel_cap_t *pPp)
{
  /* copied from capability.c since there the function was declared as static */
  int ret;

  if (pid && (pid != task_pid_vnr(current))) {
    struct task_struct *target;

    rcu_read_lock();

    target = find_task_by_vpid(pid);
    if (!target)
      ret = -ESRCH;
    else
      ret = security_capget(target, pEp, pIp, pPp);

    rcu_read_unlock();
  } else
    ret = security_capget(current, pEp, pIp, pPp);

  return ret;
}


LXC and Linux capabilities

By default LXC drops the following capabilities from container.