git.kernelconcepts.de Git - karo-tx-linux.git/commit

cgroup: Fix task counter common ancestor logic

The task counter subsystem has been written assuming that
can_attach_task/attach_task/cancel_attach_task calls are serialized per
task.  This is true when we attach only one task but not when we attach a
whole thread group, in which case the sequence is:

for each thread
if (can_attach_task() < 0)
goto rollback

for each_thread
attach_task()

rollback:
for each thread
cancel_attach_task()

The common ancestor, searched on task_counter_attach_task(), can thus
change between each of these calls for a given task.  This breaks if some
tasks in the thread group are not in the same cgroup origin.  The uncharge
made in attach_task() or the rollback in cancel_attach_task() there would
have an erroneous propagation.

This can even break seriously is some scenario. For example there
with $PID beeing the pid of a multithread process:

mkdir /dev/cgroup/cgroup0
echo $PID > /dev/cgroup/cgroup0/cgroup.procs
echo $PID > /dev/cgroup/tasks
echo $PID > /dev/cgroup/cgroup0/cgroup.procs

On the last move, attach_task() is called on the thread leader with
the wrong common_ancestor, leading to a crash because we uncharge
a res_counter that doesn't exist:

[  149.805063] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[  149.806013] IP: [<ffffffff810a0172>] __lock_acquire+0x62/0x15d0
[  149.806013] PGD 51d38067 PUD 5119e067 PMD 0
[  149.806013] Oops: 0000 [#1] PREEMPT SMP
[  149.806013] Dumping ftrace buffer:
[  149.806013]    (ftrace buffer empty)
[  149.806013] CPU 3
[  149.806013] Modules linked in:
[  149.806013]
[  149.806013] Pid: 1111, comm: spread_thread_g Not tainted 3.1.0-rc3+ #165 FUJITSU SIEMENS AMD690VM-FMH/AMD690VM-FMH
[  149.806013] RIP: 0010:[<ffffffff810a0172>]  [<ffffffff810a0172>] __lock_acquire+0x62/0x15d0
[  149.806013] RSP: 0018:ffff880051479b38  EFLAGS: 00010046
[  149.806013] RAX: 0000000000000046 RBX: 0000000000000040 RCX: 0000000000000000
[  149.868002] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000040
[  149.868002] RBP: ffff880051479c08 R08: 0000000000000002 R09: 0000000000000001
[  149.868002] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[  149.868002] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880051f120a0
[  149.868002] FS:  00007f1e01dd7700(0000) GS:ffff880057d80000(0000) knlGS:0000000000000000
[  149.868002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  149.868002] CR2: 0000000000000040 CR3: 0000000051c59000 CR4: 00000000000006e0
[  149.868002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  149.868002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  149.868002] Process spread_thread_g (pid: 1111, threadinfo ffff880051478000, task ffff880051f120a0)
[  149.868002] Stack:
[  149.868002]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.868002]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.868002]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.868002] Call Trace:
[  149.868002]  [<ffffffff810a1d32>] lock_acquire+0xa2/0x1a0
[  149.868002]  [<ffffffff810c373c>] ? res_counter_uncharge_until+0x4c/0xb0
[  149.868002]  [<ffffffff8180802b>] _raw_spin_lock+0x3b/0x50
[  149.868002]  [<ffffffff810c373c>] ? res_counter_uncharge_until+0x4c/0xb0
[  149.868002]  [<ffffffff810c373c>] res_counter_uncharge_until+0x4c/0xb0
[  149.868002]  [<ffffffff810c26c4>] task_counter_attach_task+0x44/0x50
[  149.868002]  [<ffffffff810bffcd>] cgroup_attach_proc+0x5ad/0x720
[  149.868002]  [<ffffffff810bfa99>] ? cgroup_attach_proc+0x79/0x720
[  149.868002]  [<ffffffff810c01cf>] attach_task_by_pid+0x8f/0x220
[  149.868002]  [<ffffffff810c0230>] ? attach_task_by_pid+0xf0/0x220
[  149.868002]  [<ffffffff810c0230>] ? attach_task_by_pid+0xf0/0x220
[  149.868002]  [<ffffffff810c0388>] cgroup_procs_write+0x28/0x40
[  149.868002]  [<ffffffff810c0bd9>] cgroup_file_write+0x209/0x2f0
[  149.868002]  [<ffffffff812b8d08>] ? apparmor_file_permission+0x18/0x20
[  149.868002]  [<ffffffff8127ef43>] ? security_file_permission+0x23/0x90
[  149.868002]  [<ffffffff81157038>] vfs_write+0xc8/0x190
[  149.868002]  [<ffffffff811571f1>] sys_write+0x51/0x90
[  149.868002]  [<ffffffff818102c2>] system_call_fastpath+0x16/0x1b

To solve this, keep the original cgroup of each thread in the thread
group cached in the flex array and pass it to can_attach_task()/attach_task()
and cancel_attach_task() so that the correct common ancestor between the old
and new cgroup can be safely retrieved for each task.

This is inspired by a previous patch from Li Zefan:
"[PATCH] cgroups: don't cache common ancestor in task counter subsys".

Reported-by: Ben Blum <bblum@andrew.cmu.edu>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

author	Frederic Weisbecker <fweisbec@gmail.com>
	Wed, 16 Nov 2011 23:41:40 +0000 (10:41 +1100)
committer	Stephen Rothwell <sfr@canb.auug.org.au>
	Thu, 17 Nov 2011 02:57:17 +0000 (13:57 +1100)
commit	d409a94f540dceba4b8d2b8c4935e9e895603b6e
tree	744bf6920a6c7afd0998e849b607c3460f8dcd4b	tree \| snapshot
parent	474bfdcab49fcaa0445ade8a12a21218a41da521	commit \| diff

Documentation/cgroups/cgroups.txt		diff \| blob \| history
include/linux/cgroup.h		diff \| blob \| history
kernel/cgroup.c		diff \| blob \| history
kernel/cgroup_task_counter.c		diff \| blob \| history