Randy Dunlap [Tue, 20 Mar 2012 23:58:11 +0000 (10:58 +1100)]
ramoops: fix printk format warnings
Fix printk format warnings for phys_addr_t type variables:
drivers/char/ramoops.c:246:3: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'phys_addr_t'
drivers/char/ramoops.c:273:2: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'phys_addr_t'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Kees Cook <keescook@chromium.org> Cc: Greg Kroah-Hartman <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Tue, 20 Mar 2012 23:58:11 +0000 (10:58 +1100)]
ramoops: use pstore interface
Instead of using /dev/mem directly and forcing userspace to know (or
extract) where the platform has defined persistent memory, how many slots
it has, the sizes, etc, use the common pstore infrastructure to handle
Oops gathering and extraction. This presents a much easier to use
filesystem-based view to the memory region. This also means that any
other tools that are written to understand pstore will automatically be
able to process ramoops too.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Marco Stornelli <marco.stornelli@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cyrill Gorcunov [Tue, 20 Mar 2012 23:58:10 +0000 (10:58 +1100)]
c/r: prctl: add ability to set new mm_struct::exe_file
When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.
For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.
Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
vmas present for current process, simply because this feature is a special
to C/R and mm::num_exe_file_vmas become meaningless after that.
To minimize the amount of transition the /proc/pid/exe symlink might have,
this feature is implemented in one-shot manner. Thus once changed the
symlink can't be changed again. This should help sysadmins to monitor the
symlinks over all process running in a system.
In particular one could make a snapshot of processes and ring alarm if
there unexpected changes of /proc/pid/exe's in a system.
Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.
Cyrill Gorcunov [Tue, 20 Mar 2012 23:58:09 +0000 (10:58 +1100)]
c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
During checkpoint we dump whole process memory to a file and the dump
includes process stack memory. But among stack data itself, the stack
carries additional parameters such as command line arguments, environment
data and auxiliary vector.
So when we do restore procedure and once we've restored stack data itself
we need to setup mm_struct::arg_start/end, env_start/end, so restored
process would be able to find command line arguments and environment data
it had at checkpoint time. The same applies to auxiliary vector.
For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cyrill Gorcunov [Tue, 20 Mar 2012 23:58:09 +0000 (10:58 +1100)]
c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
We would like to have an ability to restore command line arguments and
program environment pointers but first we need to obtain them somehow.
Thus we put these values into /proc/$pid/stat. The exit_code is needed to
restore zombie tasks.
Cyrill Gorcunov [Tue, 20 Mar 2012 23:58:08 +0000 (10:58 +1100)]
syscalls, x86: add __NR_kcmp syscall
While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.
The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.
One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.
Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate. So here is it --
__NR_kcmp.
It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.
Lookups for pids are done in the caller's PID namespace only.
At moment only x86 is supported and tested.
[akpm@linux-foundation.org: fix up selftests, warnings] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Glauber Costa <glommer@parallels.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Valdis.Kletnieks@vt.edu Cc: Michal Marek <mmarek@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we do checkpoint of a task we need to know the list of children the
task, has but there is no easy and fast way to generate reverse
parent->children chain from arbitrary <pid> (while a parent pid is
provided in "PPid" field of /proc/<pid>/status).
So instead of walking over all pids in the system (creating one big
process tree in memory, just to figure out which children a task has) --
we add explicit /proc/<pid>/task/<tid>/children entry, because the kernel
already has this kind of information but it is not yet exported.
This is a first level children, not the whole process tree.
Manfred Spraul [Tue, 20 Mar 2012 23:57:57 +0000 (10:57 +1100)]
ipc/sem.c: alternatives to preempt_disable()
ipc/sem.c uses a custom wakeup scheme that relies on preempt_disable().
On -RT, this causes increased latencies and debug warnings.
The patch adds two additional schemes:
- one built around a completion - could be better for -RT kernels
- one built around a spinlock - unfortunately it's broken
- and the current one
My preferred solution would be the spinlock implementation: RT would use
premptible spinlocks, mainline normal spinlocks. Thus both get the
optimal implementation without any special code in ipc/sem.c.
Unfortunately, I don't see how it could be fixed.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Tue, 20 Mar 2012 23:57:43 +0000 (10:57 +1100)]
kmod: avoid deadlock from recursive kmod call
The system deadlocks (at least since 2.6.10) when
call_usermodehelper(UMH_WAIT_EXEC) request triggered
call_usermodehelper(UMH_WAIT_PROC) request.
This is because "khelper thread is waiting for the worker thread at
wait_for_completion() in do_fork() since the worker thread was created
with CLONE_VFORK flag" and "the worker thread cannot call complete()
because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
thread cannot start processing UMH_WAIT_PROC request because the khelper
thread is waiting for the worker thread at wait_for_completion() in
do_fork()".
In order to avoid deadlock, do not try to call wait_for_completion() in
call_usermodehelper_exec() if the worker thread was created by khelper
thread with CLONE_VFORK flag.
The easiest example to observe this deadlock is to use a corrupted
/sbin/hotplug binary (like shown below).
call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
module. do_execve("/tmp/dummy") triggers a call to
request_module("binfmt-0000") from search_binary_handler() which in turn
calls call_usermodehelper(UMH_WAIT_PROC).
There are various hooks called during do_execve() operation (e.g.
security_bprm_check(), audit_bprm(), "struct
linux_binfmt"->load_binary()). If one of such hooks triggers
UMH_WAIT_EXEC, this deadlock will happen even if /sbin/hotplug is not
corrupted.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
YanHong [Tue, 20 Mar 2012 23:57:27 +0000 (10:57 +1100)]
init/do_mounts.c: create /root if it does not exist
If someone supplies an initramfs without /root in it, and we fail to
execute rdinit, we will try to mount root device and fail, for the mount
point does not exits.
But we get error message "VFS: Cannot open root device". It's confusing.
We can give a more detailed error message, or we can go further: if /root
does not exit, create it.
Signed-off-by: YanHong <tempname2@hotmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Woody Suwalski <terraluna977@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
That set_current_state() won't work very well: the subsequent mutex_lock()
might flip the task back into TASK_RUNNING.
Attempt to put it somewhere where it might have been meant to be, and
attempt to describe why it might have been added.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kautuk Consul [Tue, 20 Mar 2012 23:49:16 +0000 (10:49 +1100)]
um/kernel/trap.c: port OOM changes to handle_page_fault()
Commit d065bd810b6deb6 ("mm: retry page fault when blocking on disk
transfer") and commit 37b23e0525 ("x86,mm: make pagefault killable")
The above commits introduced changes into the x86 pagefault handler
for making the page fault handler retryable as well as killable.
These changes reduce the mmap_sem hold time, which is crucial
during OOM killer invocation.
Port these changes to um.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:15 +0000 (10:49 +1100)]
cris: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Mikael Starvik <starvik@axis.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:13 +0000 (10:49 +1100)]
C6X: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:13 +0000 (10:49 +1100)]
mn10300: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:13 +0000 (10:49 +1100)]
m68k: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:12 +0000 (10:49 +1100)]
m32r: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kyle McMartin <kyle@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:12 +0000 (10:49 +1100)]
alpha: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check for shared signals we're about to block.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:11 +0000 (10:49 +1100)]
h8300: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:49:10 +0000 (10:49 +1100)]
frv: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Tue, 20 Mar 2012 23:48:57 +0000 (10:48 +1100)]
mm-add-extra-free-kbytes-tunable-update
All the fixes suggested by Andrew Morton. Not much of a changelog
since the patch should probably be folded into
mm-add-extra-free-kbytes-tunable.patch
Thank you for pointing these out, Andrew.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Tue, 20 Mar 2012 23:48:57 +0000 (10:48 +1100)]
mm: add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount of
memory free, by increasing the gap between each zone's min and low
watermarks.
This is useful for realtime applications that call system calls and have a
bound on the number of allocations that happen in any short time period.
In this application, extra_free_kbytes would be left at an amount equal to
or larger than than the maximum number of allocations that happen in any
burst.
It may also be useful to reduce the memory use of virtual machines
(temporarily?), in a way that does not cause memory fragmentation like
ballooning does.
Testing results from Satoru Moriya:
: I ran some sample workloads and measure memory allocation latency
: (latency of __alloc_page_nodemask()).
: The test is like following:
:
: - CPU: 1 socket, 4 core
: - Memory: 4GB
:
: - Background load:
: $ dd if=3D/dev/zero of=3D/tmp/tmp1
: $ dd if=3D/dev/zero of=3D/tmp/tmp2
: $ dd if=3D/dev/zero of=3D/tmp/tmp3
:
: - Main load:
: $ mapped-file-stream 1 $((1024 * 1024 * 640)) --(*)
:
: (*) This is made by Johannes Weiner
: https://lkml.org/lkml/2010/8/30/226
:
: It allocates/access 640MByte memory at a burst.
:
: The result is follwoing:
:
: | | extra |
: | default | kbytes |
: --------------------------------------------------------------
: min_free_kbytes | 8113 | 8113 |
: extra_free_kbytes | 0 | 640*1024 | (KB)
: --------------------------------------------------------------
: worst latency | 517.762 | 20.775 | (usec)
: --------------------------------------------------------------
: vmstat result | | |
: nr_vmscan_write | 0 | 0 |
: pgsteal_dma | 0 | 0 |
: pgsteal_dma32 | 143667 | 144882 |
: pgsteal_normal | 31486 | 27001 |
: pgsteal_movable | 0 | 0 |
: pgscan_kswapd_dma | 0 | 0 |
: pgscan_kswapd_dma32 | 138617 | 156351 |
: pgscan_kswapd_normal | 30593 | 27955 |
: pgscan_kswapd_movable | 0 | 0 |
: pgscan_direct_dma | 0 | 0 |
: pgscan_direct_dma32 | 5050 | 0 |
: pgscan_direct_normal | 896 | 0 |
: pgscan_direct_movable | 0 | 0 |
: kswapd_steal | 169207 | 171883 |
: kswapd_inodesteal | 0 | 0 |
: kswapd_low_wmark_hit_quickly | 43 | 45 |
: kswapd_high_wmark_hit_quickly | 1 | 0 |
: allocstall | 32 | 0 |
:
:
: As you can see, in the default case there were 32 direct reclaim
: (allocstal= l) and its worst latency was 517.762 usecs. This value may be
: larger if a process would sleep or issue I/O in the direct reclaim path.
: OTOH, ii the other case where I add extra free bytes, there were no direct
: reclaim and its worst latency was 20.775 usecs.
:
: In this test case, we can avoid direct reclaim and keep a latency low.
Signed-off-by: Rik van Riel<riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Tested-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit v2.6.36-5896-gd065bd8 "mm: retry page fault when blocking on
disk transfer" we usually wait in page-faults without mmap_sem held, so
all swap-token logic was broken, because it based on using
rwsem_is_locked(&mm->mmap_sem) as sign of in progress page-faults.
Add an atomic counter of in progress page-faults for mm to the mm_struct
with swap-token.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Tue, 20 Mar 2012 23:48:23 +0000 (10:48 +1100)]
fs: hardlink creation restrictions
On systems that have user-writable directories on the same partition as
system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files they
have no access to, so this change appears to follow the doctrine of "least
surprise". Additionally, this change does not violate POSIX, which states
"the implementation may require that the calling process has permission to
access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for a
while. Otherwise, the change has been undisruptive while in use in Ubuntu
for the last 1.5 years.
This patch is based on the patch in Openwall and grsecurity. I have added
a sysctl to enable the protected behavior, documentation, and an audit
notification.
Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Federica Teodori <federica.teodori@googlemail.com> Cc: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Tue, 20 Mar 2012 23:48:22 +0000 (10:48 +1100)]
fs: symlink restrictions on sticky directories
A longstanding class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw is
to cross privilege boundaries when following a given symlink (i.e. a root
process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside a
sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
- POSIX didn't consider this situation and it's not useful to follow
a broken specification at the cost of security.
- Might break unknown applications that use this feature.
- Applications that break because of the change are easy to spot and
fix. Applications that are vulnerable to symlink ToCToU by not having
the change aren't. Additionally, no applications have yet been found
that rely on this behavior.
- Applications should just use mkstemp() or O_CREATE|O_EXCL.
- True, but applications are not perfect, and new software is written
all the time that makes these mistakes; blocking this flaw at the
kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
- This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
- This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
This patch is based on the patch in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, documentation, and an audit notification.
[akpm@linux-foundation.org: move sysctl_protected_sticky_symlinks declaration into .h] Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Federica Teodori <federica.teodori@googlemail.com> Cc: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
WARNING: please, no space before tabs
#1565: FILE: drivers/net/wireless/iwlwifi/iwl-debugfs.c:91:
+^I.open = simple_open, ^I^I^I\$
WARNING: please, no space before tabs
#1574: FILE: drivers/net/wireless/iwlwifi/iwl-debugfs.c:99:
+^I.open = simple_open, ^I^I^I\$
WARNING: please, no space before tabs
#1583: FILE: drivers/net/wireless/iwlwifi/iwl-debugfs.c:110:
+^I.open = simple_open, ^I^I\$
total: 0 errors, 3 warnings, 2133 lines checked
./patches/simple_open-automatically-convert-to-simple_open.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Tue, 20 Mar 2012 23:48:21 +0000 (10:48 +1100)]
simple_open: automatically convert to simple_open()
Many users of debugfs copy the implementation of default_open() when they
want to support a custom read/write function op. This leads to a
proliferation of the default_open() implementation across the entire tree.
Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().
This replacement was done with the following semantic patch:
Julia Lawall [Tue, 20 Mar 2012 23:48:20 +0000 (10:48 +1100)]
scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
Find instances of an open-coded simple_open() and replace them with calls
to simple_open().
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reported-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Tue, 20 Mar 2012 23:48:20 +0000 (10:48 +1100)]
libfs: add simple_open()
debugfs and a few other drivers use an open-coded version of simple_open()
to pass a pointer from the file to the read/write file ops. Add support
for this simple case to libfs so that we can remove the many duplicate
copies of this simple function.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Tue, 20 Mar 2012 23:48:19 +0000 (10:48 +1100)]
brlocks/lglocks: cleanups
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.
This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)
In general the users now look more like normal function calls with
pointers, not magic macros.
The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.
[akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Kasatkin [Tue, 20 Mar 2012 23:48:18 +0000 (10:48 +1100)]
vfs: increment iversion when a file is truncated
When a file is truncated with truncate()/ftruncate() and then closed,
iversion is not updated. This patch uses ATTR_SIZE flag as an indication
to increment iversion.
Mimi said:
On fput(), i_version is used to detect and flag files that have changed
and need to be re-measured in the IMA measurement policy. When a file
is truncated with truncate()/ftruncate() and then closed, i_version is
not updated. As a result, although the file has changed, it will not be
re-measured and added to the IMA measurement list on subsequent access.
Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com> Acked-by: Mimi Zohar <zohar@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:18 +0000 (10:48 +1100)]
parisc: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Akinobu Mita [Tue, 20 Mar 2012 23:48:17 +0000 (10:48 +1100)]
ocfs2: use bitmap_weight()
Use bitmap_weight() instead of reinventing the wheel.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Tue, 20 Mar 2012 23:48:16 +0000 (10:48 +1100)]
ocfs2: use find_last_bit()
We already have find_last_bit(). So just use it as described in the
comment.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:16 +0000 (10:48 +1100)]
blackfin: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:15 +0000 (10:48 +1100)]
unicore32: use block_sigmask()
Use the new helper function introduced in commit 5e6292c0f28f ("signal:
add block_sigmask() for adding sigmask to current->blocked") which
centralises the code for updating current->blocked after successfully
delivering a signal and reduces the amount of duplicate code across
architectures. In the past some architectures got this code wrong, so
using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:12 +0000 (10:48 +1100)]
score: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Lennox Wu <lennox.wu@gmail.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:11 +0000 (10:48 +1100)]
MIPS: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Daney <ddaney@caviumnetworks.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:10 +0000 (10:48 +1100)]
microblaze: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:10 +0000 (10:48 +1100)]
microblaze: fix signal masking
There are a couple of problems with the current signal code,
1. If we failed to setup the signal stack frame then we should not be
masking any signals.
2. ka->sa.sa_mask is only added to the current blocked signals list if
SA_NODEFER is set in ka->sa.sa_flags. If we successfully setup the
signal frame and are going to run the handler then we must honour
sa_mask.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:10 +0000 (10:48 +1100)]
microblaze: no need to reset handler if SA_ONESHOT
get_signal_to_deliver() already resets the signal handler if SA_ONESHOT is
set in ka->sa.sa_flags, there's no need to do it again in handle_signal().
Furthermore, because we were modifying ka->sa.sa_handler (which is a copy
of sighand->action[]) instead of sighand->action[] the original code
actually had no effect on signal delivery.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:09 +0000 (10:48 +1100)]
microblaze: don't reimplement force_sigsegv()
Instead of open coding the sequence from force_sigsegv() just call it.
This also fixes a bug because we were modifying ka->sa.sa_handler (which
is a copy of sighand->action[]), whereas the intention of the code was to
modify sighand->action[] directly.
As the original code was working with a copy it had no effect on signal
delivery.
Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:08 +0000 (10:48 +1100)]
ia64: use set_current_blocked() and block_sigmask()
As described in e6fa16ab ("signal: sigprocmask() should do
retarget_shared_pending()") the modification of current->blocked is
incorrect as we need to check whether the signal we're about to block is
pending in the shared queue.
Also, use the new helper function introduced in commit 5e6292c0f28f
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate code
across architectures. In the past some architectures got this code wrong,
so using this helper function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Tue, 20 Mar 2012 23:48:07 +0000 (10:48 +1100)]
percpu: remove percpu_xxx() functions
There are no percpu_xxx callers remaining
Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Tue, 20 Mar 2012 23:48:07 +0000 (10:48 +1100)]
net: use this_cpu_xxx replace percpu_xxx funcs
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.
And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
performance since __this_cpu_xxx has no redundant preempt_disable()
Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Tue, 20 Mar 2012 23:48:06 +0000 (10:48 +1100)]
x86: change percpu_read_stable() to this_cpu_read_stable()
It has no function change. It's a preparation for percpu_xxx serial
function change.
Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Tue, 20 Mar 2012 23:48:06 +0000 (10:48 +1100)]
x86: use this_cpu_xxx to replace percpu_xxx funcs
Since percpu_xxx() serial functions are duplicate with this_cpu_xxx().
Removing percpu_xxx() definition and replacing them by this_cpu_xxx() in
code.
And further more, as Christoph Lameter's requirement, I try to use
__this_cpu_xx to replace this_cpu_xxx if it is in preempt safe scenario.
The preempt safe scenarios include:
1, in irq/softirq/nmi handler
2, protected by preempt_disable
3, protected by spin_lock
4, if the code context imply that it is preempt safe, like the code is
follows or be followed a preempt safe code.
BTW, In fact, this_cpu_xxx are same as __this_cpu_xxx since all funcs
implement in a single instruction for x86 machine. But it maybe other
platforms' performance.
[akpm@linux-foundation.org: fix build]
[sfr@canb.auug.org.au: arch/x86/include/asm/desc.h: fix smp_processor_id's need for this_cpu_read] Signed-off-by: Alex Shi <alex.shi@intel.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:06 +0000 (10:48 +1100)]
avr32: use block_sigmask()
Use the new helper function introduced in commit 5e6292c0f28f ("signal:
add block_sigmask() for adding sigmask to current->blocked") which
centralises the code for updating current->blocked after successfully
delivering a signal and reduces the amount of duplicate code across
architectures.
In the past some architectures got this code wrong, so using this helper
function should stop that from happening again.
Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Havard Skinnemoen <hskinnemoen@gmail.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matt Fleming [Tue, 20 Mar 2012 23:48:05 +0000 (10:48 +1100)]
avr32: don't mask signals in the error path
The current handle_signal() implementation is broken - it will mask
signals if we fail to setup the signal stack frame, which isn't the
desired behaviour, we should only be masking signals if we succeed in
setting up the stack frame. It looks like this code was copied from the
old (broken) arm implementation but wasn't updated when the arm code was
fixed in commit a6c61e9dfdd0 ("[ARM] 3168/1: Update ARM signal delivery
and masking").
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Acked-by: Havard Skinnemoen <hskinnemoen@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shérab [Tue, 20 Mar 2012 23:48:03 +0000 (10:48 +1100)]
arch/x86/platform/iris/iris.c: register a platform device and a platform driver
This makes the iris driver use the platform API, so it is properly exposed
in /sys.
[akpm@linux-foundation.org: remove commented-out code, add missing space to printk, clean up code layout] Signed-off-by: Shérab <Sebastien.Hinderer@ens-lyon.org> Cc: Len Brown <lenb@kernel.org> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Feuerer [Tue, 20 Mar 2012 23:48:03 +0000 (10:48 +1100)]
acerhdf: lowered default temp fanon/fanoff values
Due to new supported hardware, of which the actual temperature limits of
processor, harddisk and other components are unknown, it feels safer with
lower fanon / fanoff settings.
It won't change much for most people, already using acerhdf, as they use
their own fanon/fanoff variable settings when loading the module.
Furthermore seems like kernel and userspace tools have been improved to
work more efficient and netbooks don't get so hot anymore.
Signed-off-by: Peter Feuerer <peter@piie.net> Acked-by: Borislav Petkov <petkovbb@gmail.com> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Bligh [Tue, 20 Mar 2012 23:48:01 +0000 (10:48 +1100)]
net/netfilter/nf_conntrack_netlink.c: fix Oops on container destroy
Problem:
A repeatable Oops can be caused if a container with networking
unshared is destroyed when it has nf_conntrack entries yet to expire.
A copy of the oops follows below. A perl program generating the oops
repeatably is attached inline below.
Analysis:
The oops is called from cleanup_net when the namespace is
destroyed. conntrack iterates through outstanding events and calls
death_by_timeout on each of them, which in turn produces a call to
ctnetlink_conntrack_event. This calls nf_netlink_has_listeners, which
oopses because net->nfnl is NULL.
The perl program generates the container through fork() then
clone(NS_NEWNET). I does not explicitly set up netlink
explicitly set up netlink, but I presume it was set up else net->nfnl
would have been NULL earlier (i.e. when an earlier connection
timed out). This would thus suggest that net->nfnl is made NULL
during the destruction of the container, which I think is done by
nfnetlink_net_exit_batch.
I can see that the various subsystems are deinitialised in the opposite
order to which the relevant register_pernet_subsys calls are called,
and both nf_conntrack and nfnetlink_net_ops register their relevant
subsystems. If nfnetlink_net_ops registered later than nfconntrack,
then its exit routine would have been called first, which would cause
the oops described. I am not sure there is anything to prevent this
happening in a container environment.
Whilst there's perhaps a more complex problem revolving around ordering
of subsystem deinit, it seems to me that missing a netlink event on a
container that is dying is not a disaster. An early check for net->nfnl
being non-NULL in ctnetlink_conntrack_event appears to fix this. There
may remain a potential race condition if it becomes NULL immediately
after being checked (I am not sure any lock is held at this point or
how synchronisation for subsystem deinitialization works).
Patch:
The patch attached should apply on everything from 2.6.26 (if not before)
onwards; it appears to be a problem on all kernels. This was taken against
Ubuntu-3.0.0-11.17 which is very close to 3.0.4. I have torture-tested it
with the above perl script for 15 minutes or so; the perl script hung the
machine within 20 seconds without this patch.
Applicability:
If this is the right solution, it should be applied to all stable kernels
as well as head. Apart from the minor overhead of checking one variable
against NULL, it can never 'do the wrong thing', because if net->nfnl
is NULL, an oops will inevitably result. Therefore, checking is a reasonable
thing to do unless it can be proven than net->nfnl will never be NULL.
Check net->nfnl for NULL in ctnetlink_conntrack_event to avoid Oops on
container destroy
Signed-off-by: Alex Bligh <alex@alex.org.uk> Cc: Patrick McHardy <kaber@trash.net> Cc: David Miller <davem@davemloft.net> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>