git.kernelconcepts.de Git - karo-tx-linux.git/log
karo-tx-linux.git
12 years agoAdd linux-next specific files for 20110727 next-20110727
Stephen Rothwell [Wed, 27 Jul 2011 03:56:57 +0000 (13:56 +1000)]
Add linux-next specific files for 20110727

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
12 years agosparc: rename atomic_add_unless
Stephen Rothwell [Wed, 27 Jul 2011 03:48:55 +0000 (13:48 +1000)]
sparc: rename atomic_add_unless

Should have been done in commit 1af08a1407f4 ("This is in preparation
for more generic atomic").

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
12 years agoMerge branch 'akpm'
Stephen Rothwell [Wed, 27 Jul 2011 03:28:06 +0000 (13:28 +1000)]
Merge branch 'akpm'

12 years agoOnly a few core funcs need to be implemented for SMP systems, so allow the
Mike Frysinger [Tue, 26 Jul 2011 10:15:19 +0000 (20:15 +1000)]
Only a few core funcs need to be implemented for SMP systems, so allow the
arches to override them while getting the rest for free.

At least, this is enough to allow the Blackfin SMP port to use things.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoSince arches are expected to implement this guy, add a common version for
Mike Frysinger [Tue, 26 Jul 2011 10:15:18 +0000 (20:15 +1000)]
Since arches are expected to implement this guy, add a common version for
people the same way as atomic_clear_mask is handled.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe atomic helpers are supposed to take an atomic_t pointer, not a random
Mike Frysinger [Tue, 26 Jul 2011 10:15:18 +0000 (20:15 +1000)]
The atomic helpers are supposed to take an atomic_t pointer, not a random
unsigned long pointer.  So convert atomic_clear_mask over.

While we're here, also add some nice documentation to the func.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWe already declared inc/dec helpers, so we don't need to call the
Mike Frysinger [Tue, 26 Jul 2011 10:15:18 +0000 (20:15 +1000)]
We already declared inc/dec helpers, so we don't need to call the
atomic_{add,sub}_return funcs directly.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThis clarifies the differences between <linux/atomic.h> and
Arun Sharma [Tue, 26 Jul 2011 10:15:18 +0000 (20:15 +1000)]
This clarifies the differences between <linux/atomic.h> and
<asm-generic/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Suggested-by: Mike Frysinger <vapier.adi@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoAfter changing all consumers of atomics to include
Arun Sharma [Tue, 26 Jul 2011 10:15:17 +0000 (20:15 +1000)]
After changing all consumers of atomics to include
<linux/atomic.h>, we ran into some compile time errors
due to this dependency chain:

linux/atomic.h
  -> asm/atomic.h
    -> asm-generic/atomic-long.h

where atomic-long.h could use funcs defined later in
linux/atomic.h without a prototype. This patch moves
the code that includes asm-generic/atomic*.h to
linux/atomic.h.

Archs that need <asm-generic/atomic64.h> need to select
CONFIG_GENERIC_ATOMIC64 from now on (some of them used
to include it unconditionally).

Compile tested on i386 and x86_64 with allnoconfig.

Signed-off-by: Arun Sharma <asharma@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThis is in preparation for more generic atomic
Arun Sharma [Tue, 26 Jul 2011 10:15:17 +0000 (20:15 +1000)]
This is in preparation for more generic atomic
primitives based on __atomic_add_unless.

Signed-off-by: Arun Sharma <asharma@fb.com>
Signed-off-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
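
The relationship between __atomic_add_unless and the helpers built on it can be sketched in user space. This is only a toy model, assuming the usual kernel semantics (the raw primitive returns the old value, the wrapper reports whether the add happened); the AtomicInt class and its lock are hypothetical stand-ins for atomic_t and real hardware atomics.

```python
import threading


class AtomicInt:
    """Toy stand-in for atomic_t; a lock replaces hardware atomics (assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def raw_add_unless(self, a, u):
        """Model of __atomic_add_unless(): add a unless value == u, return the old value."""
        with self._lock:
            old = self._value
            if old != u:
                self._value = old + a
            return old


def atomic_add_unless(v, a, u):
    """Generic wrapper: True if the add was performed."""
    return v.raw_add_unless(a, u) != u


def atomic_inc_not_zero(v):
    """Increment unless the counter is already zero."""
    return atomic_add_unless(v, 1, 0)
```

With only the raw primitive left per architecture, wrappers such as atomic_inc_not_zero() can live in one shared place.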
12 years agoThis allows us to move duplicated code in <asm/atomic.h>
Arun Sharma [Tue, 26 Jul 2011 10:15:16 +0000 (20:15 +1000)]
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe majority of architectures implement ext2 atomic bitops as
Akinobu Mita [Tue, 26 Jul 2011 10:15:15 +0000 (20:15 +1000)]
The majority of architectures implement ext2 atomic bitops as
test_and_{set,clear}_bit() without spinlock.

This adds this type of generic implementation in ext2-atomic-setbit.h and
uses it wherever possible.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThis changes should_fail_request() into a more usable wrapper around
Akinobu Mita [Tue, 26 Jul 2011 10:15:15 +0000 (20:15 +1000)]
This changes should_fail_request() into a more usable wrapper around
should_fail().  This avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
the middle of a function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoNow cleanup_fault_attr_dentries() recursively removes a directory, so we
Akinobu Mita [Tue, 26 Jul 2011 10:15:15 +0000 (20:15 +1000)]
Now cleanup_fault_attr_dentries() recursively removes a directory, so we
can simplify the error handling in the initialization code and no longer
need to hold dentry structs for each debugfs file.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoNow cleanup_fault_attr_dentries() recursively removes a directory, so we
Akinobu Mita [Tue, 26 Jul 2011 10:15:14 +0000 (20:15 +1000)]
Now cleanup_fault_attr_dentries() recursively removes a directory, so we
can simplify the error handling in the initialization code and no longer
need to hold dentry structs for each debugfs file.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoUse debugfs_remove_recursive() to simplify initialization and
Akinobu Mita [Tue, 26 Jul 2011 10:15:14 +0000 (20:15 +1000)]
Use debugfs_remove_recursive() to simplify initialization and
deinitialization of fault injection debugfs files.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoMinor cosmetic changes for simple attribute of stacktrace_depth:
Akinobu Mita [Tue, 26 Jul 2011 10:15:14 +0000 (20:15 +1000)]
Minor cosmetic changes for simple attribute of stacktrace_depth:

 - use min_t()
 - reduce #ifdef by moving a function
 - do not use partly capitalized function name

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoshould_fail_srandom() does not exist.
Akinobu Mita [Tue, 26 Jul 2011 10:15:13 +0000 (20:15 +1000)]
should_fail_srandom() does not exist.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoNo need to include linux/kallsyms.h.
Akinobu Mita [Tue, 26 Jul 2011 10:15:13 +0000 (20:15 +1000)]
No need to include linux/kallsyms.h.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWhile ramoops writes to ram, accessing the dump requires using /dev/mem
Sergiu Iordache [Tue, 26 Jul 2011 10:15:13 +0000 (20:15 +1000)]
While ramoops writes to ram, accessing the dump requires using /dev/mem
and knowing the memory location (or a similar solution).  This patch
provides a debugfs interface through which the respective memory area can
be easily accessed.

The entry added is /sys/kernel/debug/ramoops/next

The entry returns a dump of size record_size each time, skipping invalid
dumps.  When it reaches the end of the memory area reserved for dumps it
returns an empty record and resets the current record count.

Signed-off-by: Sergiu Iordache <sergiu@chromium.org>
Acked-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
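
The read behaviour described for /sys/kernel/debug/ramoops/next can be modelled roughly as follows. This is a sketch, not the driver code: the RamoopsReader class and the is_valid callback are hypothetical, and how the real driver recognises a valid dump is not stated in the message.

```python
class RamoopsReader:
    """Toy model of the debugfs 'next' entry: one record per read."""

    def __init__(self, memory, record_size, is_valid):
        self.memory = memory            # bytes of the reserved dump area
        self.record_size = record_size  # size of one dump record
        self.is_valid = is_valid        # hypothetical validity check
        self.index = 0                  # current record count

    def read_next(self):
        count = len(self.memory) // self.record_size
        while self.index < count:
            start = self.index * self.record_size
            record = self.memory[start:start + self.record_size]
            self.index += 1
            if self.is_valid(record):
                return record
            # invalid dumps are skipped silently
        # end of the area: return an empty record and start over
        self.index = 0
        return b""
```

A caller keeps reading until it gets an empty record; the next read then starts again from the first dump.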
12 years agoThe size of the dump is currently set using the RECORD_SIZE macro which is
Sergiu Iordache [Tue, 26 Jul 2011 10:15:12 +0000 (20:15 +1000)]
The size of the dump is currently set using the RECORD_SIZE macro which is
set to a page size.  This patch makes the record size a module parameter
and allows it to be set through platform data as well to allow larger
dumps if needed.

Signed-off-by: Sergiu Iordache <sergiu@chromium.org>
Acked-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe platform driver currently allows setting the mem_size and mem_address.
Sergiu Iordache [Tue, 26 Jul 2011 10:15:12 +0000 (20:15 +1000)]
The platform driver currently allows setting the mem_size and mem_address.
Since dump_oops is also a module parameter it would be more consistent if
it could be set through platform data as well.

Signed-off-by: Sergiu Iordache <sergiu@chromium.org>
Acked-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoAdd new line to each print.
Marco Stornelli [Tue, 26 Jul 2011 10:15:12 +0000 (20:15 +1000)]
Add new line to each print.

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoERROR: Invalid UTF-8, patch and commit message should be encoded in UTF-8
Andrew Morton [Tue, 26 Jul 2011 10:15:12 +0000 (20:15 +1000)]
ERROR: Invalid UTF-8, patch and commit message should be encoded in UTF-8
#10:
Cc: Américo Wang <xiyou.wangcong@gmail.com>
      ^

ERROR: that open brace { should be on the previous line
#81: FILE: drivers/char/ramoops.c:182:
+ if (ret == -ENODEV)
+ {

ERROR: trailing whitespace
#83: FILE: drivers/char/ramoops.c:184:
+^I^I/* $

ERROR: trailing whitespace
#97: FILE: drivers/char/ramoops.c:198:
+^I^Iif (IS_ERR(dummy)) $

ERROR: trailing whitespace
#102: FILE: drivers/char/ramoops.c:203:
+^I$

total: 5 errors, 0 warnings, 87 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
      scripts/cleanfile

./patches/ramoops-use-module-parameters-instead-of-platform-data-if-not-available.patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Stevie Trujillo <stevie.trujillo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoUse generic module parameters instead of platform data, if platform data
Marco Stornelli [Tue, 26 Jul 2011 10:15:11 +0000 (20:15 +1000)]
Use generic module parameters instead of platform data, if platform data
are not available.  This limitation was introduced by commit
c3b92ce9e75 ("ramoops: use the platform data structure instead of module
params").

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Reported-by: Stevie Trujillo <stevie.trujillo@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agofix comment layout & grammar
Andrew Morton [Tue, 26 Jul 2011 10:15:11 +0000 (20:15 +1000)]
fix comment layout & grammar

Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWith the arrival of concurrency-managed workqueues there is no need for
Dmitry Torokhov [Tue, 26 Jul 2011 10:15:11 +0000 (20:15 +1000)]
With the arrival of concurrency-managed workqueues there is no need for
our driver to use a dedicated workqueue; the system-wide one should suffice
just fine.

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoDon't force output if you intend to reboot immediately.
Mandeep Singh Baines [Tue, 26 Jul 2011 10:15:11 +0000 (20:15 +1000)]
Don't force output if you intend to reboot immediately.

In this patch, I'm disabling the functionality enabled by
vc->vc_panic_force_write if panic_timeout < 0 (i.e.  no timeout).
vc_panic_force_write is only enabled for fb video consoles if the
FBINFO_CAN_FORCE_OUTPUT flag is set.

For our application, we're using ram_oops to preserve the panic in
memory.  We want to reliably, and as fast as possible, machine_restart.
The vc_panic_force_write flag results in a bunch of graphics driver code
being invoked, which slows down restart and decreases reliability.  Since
we're already storing the panic in RAM and are going to reboot
immediately, there is no benefit in mode switching back to the vc in order
to display the panic output.  The log buffer will get flushed by the
console_unblank() call so remote management consoles should see all
output.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWhen kernel BUG or oops occurs, ChromeOS intends to panic and immediately
Hugh Dickins [Tue, 26 Jul 2011 10:15:10 +0000 (20:15 +1000)]
When kernel BUG or oops occurs, ChromeOS intends to panic and immediately
reboot, with stacktrace and other messages preserved in RAM across reboot.
But the longer we delay, the more likely the user is to poweroff and lose
the info.

panic_timeout (seconds before rebooting) is set by panic= boot option or
sysctl or /proc/sys/kernel/panic; but 0 means wait forever, so at present
we have to delay at least 1 second.

Let a negative number mean reboot immediately (with the small cosmetic
benefit of suppressing that newline-less "Rebooting in %d seconds.."
message).

Signed-off-by: Hugh Dickins <hughd@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
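
The three cases for panic_timeout after this change can be summarised in a few lines. The function below is only an illustration of the semantics described above; panic_action is a hypothetical name and does not correspond to a kernel symbol.

```python
def panic_action(panic_timeout):
    """Describe what happens after a panic for a given panic_timeout value."""
    if panic_timeout > 0:
        return "reboot after %d seconds" % panic_timeout
    if panic_timeout < 0:
        return "reboot immediately"
    # 0 keeps its existing meaning
    return "wait forever"
```

For example, booting with panic=-1 would fall into the "reboot immediately" case, while panic=0 still waits forever.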
12 years agoParameter offset_in_page in edac_mc_handle_ce() should mask the higher
Kai.Jiang [Tue, 26 Jul 2011 10:15:10 +0000 (20:15 +1000)]
Parameter offset_in_page in edac_mc_handle_ce() should mask the higher
bits above the page size, not the lower bits.  The original input
sometimes causes a crash.

Signed-off-by: Kai.Jiang <Kai.Jiang@freescale.com>
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Cc: Anton Vorontsov <avorontsov@mvista.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
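
The masking mistake is easy to see with concrete numbers. The sketch below assumes 4 KiB pages (PAGE_SHIFT = 12) and uses hypothetical helper names; it only illustrates the difference between keeping the bits below the page size and keeping the bits above it.

```python
PAGE_SHIFT = 12                  # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = ~(PAGE_SIZE - 1)     # selects the bits above the page offset


def split_address(err_addr):
    """Return (page frame number, offset within the page)."""
    pfn = err_addr >> PAGE_SHIFT
    offset_in_page = err_addr & ~PAGE_MASK   # mask off the higher bits
    return pfn, offset_in_page


def buggy_offset(err_addr):
    """The wrong variant: masks off the lower bits instead."""
    return err_addr & PAGE_MASK
```

For err_addr = 0x12345678 the correct offset is 0x678, whereas the buggy variant yields 0x12345000, which is far larger than a page.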
12 years agoUpdate kernel-parameters.txt to point users to the authoritative comment
Will Drewry [Tue, 26 Jul 2011 10:15:10 +0000 (20:15 +1000)]
Update kernel-parameters.txt to point users to the authoritative comment
for name_to_dev_t.  In addition, update other places where some
name_to_dev_t behavior was discussed.  All other references to root=
appear to be for explicit sample usage or just side comments when
discussing other kernel parameters.

Signed-off-by: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoExpand root=PARTUUID=UUID syntax to support selecting a root partition by
Will Drewry [Tue, 26 Jul 2011 10:15:09 +0000 (20:15 +1000)]
Expand root=PARTUUID=UUID syntax to support selecting a root partition by
integer offset from a known, unique partition.  This approach provides
similar properties to specifying a device and partition number, but using
the UUID as the unique path prior to evaluating the offset.

For example,
  root=PARTUUID=99DE9194-FC15-4223-9192-FC243948F88B/PARTNROFF=1
selects the partition with UUID 99DE.. and then selects the next
partition.

This change is motivated by a particular use case in Chromium OS where the
bootloader can easily determine what partition it is on (by UUID) but
doesn't perform general partition table walking.

That said, support for this model provides a direct mechanism for the user
to modify the root partition to boot without specifically needing to
extract each UUID or update the bootloader explicitly when the root
partition UUID is changed (if it is recreated to be larger, for instance).
Pinning to a /boot-style partition UUID allows arbitrary root
partition reconfiguration/modification with slightly less ambiguity than
just [dev][partition] and less stringency than the specific root partition
UUID.

Signed-off-by: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
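
The PARTUUID=.../PARTNROFF=n form can be illustrated with a small parser. This is a user-space sketch under stated assumptions: the partition table is a plain list of UUID strings in partition-number order, and parse_root_spec and select_partition are hypothetical names, not the kernel's name_to_dev_t code.

```python
def parse_root_spec(spec):
    """Split 'PARTUUID=<uuid>[/PARTNROFF=<n>]' into (uuid, offset)."""
    prefix = "PARTUUID="
    if not spec.startswith(prefix):
        raise ValueError("not a PARTUUID specification")
    uuid, sep, tail = spec[len(prefix):].partition("/")
    offset = 0
    if sep:
        key, eq, value = tail.partition("=")
        if key != "PARTNROFF" or not eq:
            raise ValueError("unknown suffix: " + tail)
        offset = int(value)
    return uuid, offset


def select_partition(partitions, spec):
    """Return the 1-based partition number chosen by spec.

    partitions is a hypothetical table: UUIDs listed in partition order.
    """
    uuid, offset = parse_root_spec(spec)
    index = partitions.index(uuid) + offset
    if not 0 <= index < len(partitions):
        raise IndexError("offset points outside the partition table")
    return index + 1
```

With the example above, the partition carrying UUID 99DE.. is located first and the offset of 1 then selects the partition after it.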
12 years agoCc: "Eric W. Biederman" <ebiederm@xmission.com>
Andrew Morton [Tue, 26 Jul 2011 10:15:09 +0000 (20:15 +1000)]
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago- fix shm_rmid_forced/shm_forced_rmid confusion
Andrew Morton [Tue, 26 Jul 2011 10:15:09 +0000 (20:15 +1000)]
- fix shm_rmid_forced/shm_forced_rmid confusion

- use standard comment layout

Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoshm_may_destroy() and ipc_namespace.shm_forced_rmid lack comments.
Vasiliy Kulikov [Tue, 26 Jul 2011 10:15:09 +0000 (20:15 +1000)]
shm_may_destroy() and ipc_namespace.shm_forced_rmid lack comments.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoreadability/conventionality tweaks
Andrew Morton [Tue, 26 Jul 2011 10:15:08 +0000 (20:15 +1000)]
readability/conventionality tweaks

Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoinclude/linux/shm.h: In function 'exit_shm':
Andrew Morton [Tue, 26 Jul 2011 10:15:08 +0000 (20:15 +1000)]
include/linux/shm.h: In function 'exit_shm':
include/linux/shm.h:122: warning: 'return' with a value, in function returning void

Testing?

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agofix documentation, per Randy
Andrew Morton [Tue, 26 Jul 2011 10:15:08 +0000 (20:15 +1000)]
fix documentation, per Randy

Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoAdd support for the shm_rmid_forced sysctl. If set to 1, all
Vasiliy Kulikov [Tue, 26 Jul 2011 10:15:07 +0000 (20:15 +1000)]
Add support for the shm_rmid_forced sysctl.  If set to 1, all
shared memory objects in current ipc namespace will be automatically
forced to use IPC_RMID.

The POSIX way of handling shmem allows one to create shm objects and call
shmdt(), leaving shm object associated with no process, thus consuming
memory not counted via rlimits.

With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.

It obviously breaks POSIX - some programs relying on the feature would
stop working.  So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory.  Use shm_rmid_forced=0 by default for compatibility
reasons.

The feature was previously implemented in -ow as a configure option.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWe return ENOMEM from mqueue_get_inode even when we have enough memory.
Jiri Slaby [Tue, 26 Jul 2011 10:15:07 +0000 (20:15 +1000)]
We return ENOMEM from mqueue_get_inode even when we have enough memory.
Namely in case the system rlimit of mqueue was reached.  This error
propagates to mq_queue and user sees the error unexpectedly.  So fix this
up to properly return EMFILE as described in the manpage:

EMFILE The process already has the maximum number of files and
       message queues open.
instead of:
ENOMEM Insufficient memory.

With the previous patch we just switch to ERR_PTR/PTR_ERR/IS_ERR error
handling here.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIf new_inode fails to allocate an inode we need only to return with NULL.
Jiri Slaby [Tue, 26 Jul 2011 10:15:06 +0000 (20:15 +1000)]
If new_inode fails to allocate an inode we need only to return with NULL.
But now we test the opposite and have all the work in a nested block.  So
do the opposite to save one indentation level (and remove unnecessary line
breaks).

This is only a preparation/cleanup for the next patch where we fix up
return values from mqueue_get_inode.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoacct_arg_size() takes ->page_table_lock around add_mm_counter() if
Oleg Nesterov [Tue, 26 Jul 2011 10:15:06 +0000 (20:15 +1000)]
acct_arg_size() takes ->page_table_lock around add_mm_counter() if
!SPLIT_RSS_COUNTING.  This is not needed after 172703b0 ("mm: delete
non-atomic mm counter implementation").

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIf CONFIG_MODULES=n, it makes no sense to retry the list of binary formats
Tetsuo Handa [Tue, 26 Jul 2011 10:15:06 +0000 (20:15 +1000)]
If CONFIG_MODULES=n, it makes no sense to retry the list of binary formats
handler because the list will not be modified by request_module().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Richard Weinberger <richard@nod.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoCurrently, search_binary_handler() tries to load a binary loader module
Tetsuo Handa [Tue, 26 Jul 2011 10:15:06 +0000 (20:15 +1000)]
Currently, search_binary_handler() tries to load a binary loader module
using request_module() if a loader for the requested program is not yet
loaded.  But a second attempt of request_module() does not affect the
result of search_binary_handler().

If request_module() triggers recursion, calling request_module() twice
causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions.  It is
not an infinite loop, but it is long enough for users to regard it as a
hang.

Therefore, this patch avoids calling request_module() twice, leaving 1
to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Richard Weinberger <richard@nod.at>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
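
The arithmetic behind the apparent hang can be checked directly. The sketch below treats the recursion as a tree in which every level makes a fixed number of request_module() attempts; the function names are hypothetical and the model is a simplification of the real call chain.

```python
MAX_KMOD_CONCURRENT = 50  # recursion depth limit quoted in the message


def repetitions(attempts_per_level, depth=MAX_KMOD_CONCURRENT):
    """Closed form: attempts_per_level to the power of depth."""
    return attempts_per_level ** depth


def count_by_recursion(attempts_per_level, depth):
    """Same count obtained by actually walking the call tree (small depths only)."""
    if depth == 0:
        return 1
    return sum(count_by_recursion(attempts_per_level, depth - 1)
               for _ in range(attempts_per_level))
```

Two attempts per level give 2**50 = 1125899906842624 repetitions, while one attempt per level gives exactly 1.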
12 years agoa8bef8ff ("mm: migration: avoid race between shift_arg_pages() and
Michal Hocko [Tue, 26 Jul 2011 10:15:05 +0000 (20:15 +1000)]
a8bef8ff ("mm: migration: avoid race between shift_arg_pages() and
rmap_walk() during migration by not migrating temporary stacks")
introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
VM_STACK_INCOMPLETE_SETUP do not overlap.  The check is a compile time
one, so BUILD_BUG_ON is more appropriate.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoSigned-off-by: Daniel Rebelo de Oliveira <psykon@gmail.com>
Daniel Rebelo de Oliveira [Tue, 26 Jul 2011 10:15:05 +0000 (20:15 +1000)]
Signed-off-by: Daniel Rebelo de Oliveira <psykon@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodo_coredump() assumes that if format_corename() fails it should return
Oleg Nesterov [Tue, 26 Jul 2011 10:15:05 +0000 (20:15 +1000)]
do_coredump() assumes that if format_corename() fails it should return
-ENOMEM.  This is not true, for example cn_print_exe_file() can propagate
the error from d_path.  Even if it was true, this is too fragile.  Change
the code to check "ispipe < 0".

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoChange every occurrence of / in comm and hostname to !. If the process
Jiri Slaby [Tue, 26 Jul 2011 10:15:05 +0000 (20:15 +1000)]
Change every occurrence of / in comm and hostname to !.  If the process
changes its name to contain /, the core is not dumped (if the directory
tree doesn't exist like that).  The same with hostname being something
like myhost/3.  Fix this behaviour by using the escape loop used in %E.
(We extract it to a separate function.)

Now, with both comm == myprocess/1 and hostname == myhost/1, the core is
dumped as follows (kernel.core_pattern='core.%p.%e.%h'):
core.2349.myprocess!1.myhost!1

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
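
The escaping rule can be reproduced in a few lines of user-space code. This sketch handles only the %p, %e, %h and %% specifiers and is not the kernel's format_corename(); the function names and the handling of unknown specifiers are assumptions made for illustration.

```python
def escape_slashes(text):
    """Replace every / with ! so the value cannot introduce a directory."""
    return text.replace("/", "!")


def format_corename(pattern, pid, comm, hostname):
    """Expand a small subset of core_pattern specifiers."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            spec = pattern[i + 1]
            if spec == "p":
                out.append(str(pid))
            elif spec == "e":
                out.append(escape_slashes(comm))
            elif spec == "h":
                out.append(escape_slashes(hostname))
            elif spec == "%":
                out.append("%")
            # any other specifier is outside this sketch and is dropped
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Calling format_corename("core.%p.%e.%h", 2349, "myprocess/1", "myhost/1") gives the file name quoted in the commit message.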
12 years agoIf we don't know the file corresponding to the binary (i.e. exe_file is
Jiri Slaby [Tue, 26 Jul 2011 10:15:04 +0000 (20:15 +1000)]
If we don't know the file corresponding to the binary (i.e.  exe_file is
unknown), use "task->comm (path unknown)" instead of simple "(unknown)" as
suggested by ak.

The fallback is the same as %e except it will append "(path unknown)".

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
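
The fallback described here amounts to a single branch. The sketch below assumes that a known path is escaped by replacing / with ! (as %E is said to do elsewhere in this log) and that the fallback uses the plain comm value; exe_file_name is a hypothetical helper, not a kernel function.

```python
def exe_file_name(exe_path, comm):
    """Value used for the executable name in a core file name.

    exe_path is None when the file behind the binary is unknown.
    """
    if exe_path is None:
        # same as %e, with a marker appended
        return comm + " (path unknown)"
    return exe_path.replace("/", "!")
```

So a task named bash with no known exe_file would be reported as "bash (path unknown)" rather than just "(unknown)".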
12 years agosys_ssetmask(), sys_rt_sigsuspend() and compat_sys_rt_sigsuspend()
Oleg Nesterov [Tue, 26 Jul 2011 10:15:04 +0000 (20:15 +1000)]
sys_ssetmask(), sys_rt_sigsuspend() and compat_sys_rt_sigsuspend()
change ->blocked directly. This is not correct, see the changelog in
e6fa16ab "signal: sigprocmask() should do retarget_shared_pending()"

Change them to use set_current_blocked().

Another change is that now we are doing ->saved_sigmask = ->blocked
lockless, it doesn't make any sense to do this under ->siglock.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoNo need to declare show_regs() in ptrace.h, sched.h does this.
Oleg Nesterov [Tue, 26 Jul 2011 10:15:04 +0000 (20:15 +1000)]
No need to declare show_regs() in ptrace.h, sched.h does this.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoSigned-off-by: Mike Frysinger <vapier@gentoo.org>
Mike Frysinger [Tue, 26 Jul 2011 10:15:03 +0000 (20:15 +1000)]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomake David happy
Andrew Morton [Tue, 26 Jul 2011 10:15:03 +0000 (20:15 +1000)]
make David happy

Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoKosaki Motohiro raised a concern that copy_process is a hot path and we do
Michal Hocko [Tue, 26 Jul 2011 10:15:03 +0000 (20:15 +1000)]
Kosaki Motohiro raised a concern that copy_process is a hot path and we do
not want to initialize cpuset_{mem,slab}_spread_rotor if they are not used
most of the time.

I think that we should rather initialize it lazily when rotors are used for
the first time.  This will also catch the case when we set up spread
mem/slab later.

Also, do not use -1 for uninitialized nodes; use NUMA_NO_NODE
instead.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agofix CONFIG_NUMA=y, MAX_NUMNODES>1 build
Andrew Morton [Tue, 26 Jul 2011 10:15:02 +0000 (20:15 +1000)]
fix CONFIG_NUMA=y, MAX_NUMNODES>1 build

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Menage <menage@google.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago[This patch has already been accepted as 0ac0c0d but later reverted
Michal Hocko [Tue, 26 Jul 2011 10:15:02 +0000 (20:15 +1000)]
[This patch has already been accepted as 0ac0c0d but later reverted
(35926ff) because it introduced arch specific __node_random which was
defined only for x86 code so it broke other archs.  This is a followup
without any arch specific code.  Other than that there are no functional
changes.]

Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems).  Part of the reason is that
the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
node 0 for newly created tasks.

This patch changes the rotor to be initialized to a random node number of
the cpuset.

[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
[mhocko@suse.cz: Make it arch independent]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agopercpu_charge_mutex protects from multiple simultaneous per-cpu charge
Michal Hocko [Tue, 26 Jul 2011 10:15:02 +0000 (20:15 +1000)]
percpu_charge_mutex protects from multiple simultaneous per-cpu charge
caches draining because we might end up having too many work items.  At
least this was the case until 26fe6168 (memcg: fix percpu cached charge
draining frequency) when we introduced a more targeted draining for async
mode.

Now that also sync draining is targeted we can safely remove mutex because
we will not send more work than the current number of CPUs.
FLUSHING_CACHED_CHARGE protects from sending the same work multiple times
and stock->nr_pages == 0 protects from pointless sending a work if there
is obviously nothing to be done.  This is of course racy but we can live
with it as the race window is really small (we would have to see
FLUSHING_CACHED_CHARGE cleared while nr_pages would be still non-zero).

The only remaining place where we can race is synchronous mode when we
rely on FLUSHING_CACHED_CHARGE test which might have been set by other
drainer on the same group but we should wait in that case as well.
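
A toy model of the three checks above (empty stock, unrelated group,
flush already pending) might look like the following.  The Stock class
and cpus_to_drain() are hypothetical names used for illustration, and the
"flushing" field merely stands in for the FLUSHING_CACHED_CHARGE bit.

```python
from dataclasses import dataclass


@dataclass
class Stock:
    """Toy per-CPU charge cache (hypothetical model of the per-CPU stock)."""
    cached: str = ""        # name of the group whose charges are cached
    nr_pages: int = 0       # pages currently held in the cache
    flushing: bool = False  # models the FLUSHING_CACHED_CHARGE bit


def cpus_to_drain(stocks, in_subtree):
    """Pick CPUs that need a drain work item and mark them as flushing."""
    scheduled = []
    for cpu, stock in enumerate(stocks):
        if stock.nr_pages == 0:
            continue  # nothing cached, a work item would be pointless
        if not in_subtree(stock.cached):
            continue  # cache belongs to an unrelated group
        if stock.flushing:
            continue  # a drain for this CPU is already queued
        stock.flushing = True  # test-and-set, so the work is sent only once
        scheduled.append(cpu)
    return scheduled
```

In this model at most one work item per CPU can ever be outstanding,
which is the property the text relies on when dropping the mutex.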

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWe are checking whether a given two groups are same or at least in the
Michal Hocko [Tue, 26 Jul 2011 10:15:02 +0000 (20:15 +1000)]
We are checking whether a given two groups are same or at least in the
same subtree of a hierarchy at several places.  Let's make a helper for it
to make code easier to read.
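
As an illustration of what such a helper checks, here is a minimal
sketch.  The Group class and the same_or_subtree() name are assumptions
made for this example, and the use_hierarchy handling is a guess at the
intended semantics rather than the patch itself.

```python
class Group:
    """Toy cgroup node with a parent link (hypothetical model)."""

    def __init__(self, name, parent=None, use_hierarchy=True):
        self.name = name
        self.parent = parent
        self.use_hierarchy = use_hierarchy


def same_or_subtree(root, group):
    """True if group is root itself or, with hierarchy on, a descendant."""
    if group is root:
        return True
    if not root.use_hierarchy:
        return False  # without hierarchy, only the group itself matches
    node = group.parent
    while node is not None:
        if node is root:
            return True
        node = node.parent
    return False
```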

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoCurrently we have two ways how to drain per-CPU caches for charges.
Michal Hocko [Tue, 26 Jul 2011 10:15:01 +0000 (20:15 +1000)]
Currently we have two ways how to drain per-CPU caches for charges.
drain_all_stock_sync will synchronously drain all caches while
drain_all_stock_async will asynchronously drain only those that refer to a
given memory cgroup or its subtree in hierarchy.  Targeted async draining
has been introduced by 26fe6168 (memcg: fix percpu cached charge draining
frequency) to reduce the cpu workers number.

sync draining is currently triggered only from mem_cgroup_force_empty
which is triggered only by userspace (mem_cgroup_force_empty_write) or
when a cgroup is removed (mem_cgroup_pre_destroy).  Although these are not
usually frequent operations it still makes some sense to do targeted
draining as well, especially if the box has many CPUs.

This patch unifies both methods to use the single code (drain_all_stock)
which relies on the original async implementation and just adds flush_work
to wait on all caches that are still under work for the sync mode.  We are
using FLUSHING_CACHED_CHARGE bit check to prevent from waiting on a work
that we haven't triggered.  Please note that both sync and async functions
are currently protected by percpu_charge_mutex so we cannot race with
other drainers.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrain_all_stock_async tries to optimize a work to be done on the work
Michal Hocko [Tue, 26 Jul 2011 10:15:01 +0000 (20:15 +1000)]
drain_all_stock_async tries to optimize a work to be done on the work
queue by excluding any work for the current CPU because it assumes that
the context we are called from already tried to charge from that cache and
it's failed so it must be empty already.

While the assumption is correct we can optimize it even more by checking
the current number of pages in the cache.  This will also reduce a work on
other CPUs with an empty stock.

For the current CPU we can simply call drain_local_stock rather than
deferring it to the work queue.

[kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe commit log of 0ae5e89 ("memcg: count the soft_limit reclaim in...")
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 10:15:01 +0000 (20:15 +1000)]
The commit log of 0ae5e89 ("memcg: count the soft_limit reclaim in...")
says it adds scanning stats to memory.stat file.  But it doesn't because
we decided we first needed to reach a consensus on such new APIs.

This patch is a trial to add memory.scan_stat. This shows
  - the number of scanned pages(total, anon, file)
  - the number of rotated pages(total, anon, file)
  - the number of freed pages(total, anon, file)
  - the elapsed time (including sleep/pause time)

  for both of direct/soft reclaim.

The biggest difference from Ying's original version is that this file
can be reset by a write, as

  # echo 0 ...../memory.scan_stat

Example of output is here. This is a result after make -j 6 kernel
under 300M limit.

[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
scanned_pages_by_limit 9471864
scanned_anon_pages_by_limit 6640629
scanned_file_pages_by_limit 2831235
rotated_pages_by_limit 4243974
rotated_anon_pages_by_limit 3971968
rotated_file_pages_by_limit 272006
freed_pages_by_limit 2318492
freed_anon_pages_by_limit 962052
freed_file_pages_by_limit 1356440
elapsed_ns_by_limit 351386416101
scanned_pages_by_system 0
scanned_anon_pages_by_system 0
scanned_file_pages_by_system 0
rotated_pages_by_system 0
rotated_anon_pages_by_system 0
rotated_file_pages_by_system 0
freed_pages_by_system 0
freed_anon_pages_by_system 0
freed_file_pages_by_system 0
elapsed_ns_by_system 0
scanned_pages_by_limit_under_hierarchy 9471864
scanned_anon_pages_by_limit_under_hierarchy 6640629
scanned_file_pages_by_limit_under_hierarchy 2831235
rotated_pages_by_limit_under_hierarchy 4243974
rotated_anon_pages_by_limit_under_hierarchy 3971968
rotated_file_pages_by_limit_under_hierarchy 272006
freed_pages_by_limit_under_hierarchy 2318492
freed_anon_pages_by_limit_under_hierarchy 962052
freed_file_pages_by_limit_under_hierarchy 1356440
elapsed_ns_by_limit_under_hierarchy 351386416101
scanned_pages_by_system_under_hierarchy 0
scanned_anon_pages_by_system_under_hierarchy 0
scanned_file_pages_by_system_under_hierarchy 0
rotated_pages_by_system_under_hierarchy 0
rotated_anon_pages_by_system_under_hierarchy 0
rotated_file_pages_by_system_under_hierarchy 0
freed_pages_by_system_under_hierarchy 0
freed_anon_pages_by_system_under_hierarchy 0
freed_file_pages_by_system_under_hierarchy 0
elapsed_ns_by_system_under_hierarchy 0

total_xxxx is for hierarchy management.

This will be useful for further memcg development and needs to be
developed before we do any complicated rework of LRU/softlimit
management.

This patch adds a new struct memcg_scanrecord into scan_control struct.
sc->nr_scanned et al. are not designed for exporting information. For example,
nr_scanned is reset frequently and incremented by 2 when scanning mapped pages.

To avoid complexity, I added a new param in scan_control which is for
exporting scanning score.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Andrew Bresticker <abrestic@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago22a668d7 ("memcg: fix behavior under memory.limit equals to memsw.limit")
Daisuke Nishimura [Tue, 26 Jul 2011 10:15:01 +0000 (20:15 +1000)]
22a668d7 ("memcg: fix behavior under memory.limit equals to memsw.limit")
introduced "memsw_is_minimum" flag, which becomes true when mem_limit ==
memsw_limit.  The flag is checked at the beginning of reclaim, and
"noswap" is set if the flag is true, because using swap is meaningless in
this case.

This works well in most cases, but when we try to shrink mem_limit, which
is the same as memsw_limit now, we might fail to shrink mem_limit because
swap isn't used.

This patch fixes this behavior by:
- checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
- if it is set, not setting the "noswap" flag even if memsw_is_minimum is true.
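
The resulting decision can be sketched as follows.  The flag values and
the should_skip_swap() helper are hypothetical and exist only for this
illustration.  The rule itself (a shrink request overrides
memsw_is_minimum) is taken from the two points above.

```python
RECLAIM_NOSWAP = 0x1  # hypothetical flag values for this sketch
RECLAIM_SHRINK = 0x2


def should_skip_swap(reclaim_flags, memsw_is_minimum):
    """Decide whether reclaim should avoid swap."""
    noswap = bool(reclaim_flags & RECLAIM_NOSWAP)
    shrink = bool(reclaim_flags & RECLAIM_SHRINK)
    # When shrinking the limit, swap must stay usable even if
    # mem_limit == memsw_limit; otherwise the shrink could fail.
    if not shrink and memsw_is_minimum:
        noswap = True
    return noswap
```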

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago246e87a ("memcg: fix get_scan_count() for small targets") fixes the
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 10:15:00 +0000 (20:15 +1000)]
246e87a ("memcg: fix get_scan_count() for small targets") fixes the
memcg/kswapd behavior against small targets and prevents the vmscan
priority from getting too high.

But the implementation is too naive and adds another problem for small
memcgs.  It always forces a scan of 32 pages of file/anon and doesn't
handle swappiness and other rotate_info.  It makes vmscan scan the anon
LRU regardless of swappiness and makes reclaim worse.  This patch fixes it
by adjusting the scan count with regard to swappiness et al.

At a test "cat 1G file under 300M limit." (swappiness=20)
 before patch
        scanned_pages_by_limit 360919
        scanned_anon_pages_by_limit 180469
        scanned_file_pages_by_limit 180450
        rotated_pages_by_limit 31
        rotated_anon_pages_by_limit 25
        rotated_file_pages_by_limit 6
        freed_pages_by_limit 180458
        freed_anon_pages_by_limit 19
        freed_file_pages_by_limit 180439
        elapsed_ns_by_limit 429758872
 after patch
        scanned_pages_by_limit 180674
        scanned_anon_pages_by_limit 24
        scanned_file_pages_by_limit 180650
        rotated_pages_by_limit 35
        rotated_anon_pages_by_limit 24
        rotated_file_pages_by_limit 11
        freed_pages_by_limit 180634
        freed_anon_pages_by_limit 0
        freed_file_pages_by_limit 180634
        elapsed_ns_by_limit 367119089
        scanned_pages_by_system 0

The number of scanned anon pages decreases (as expected), and the elapsed
time is reduced.  With this patch, small memcgs will work better.
(*) Because the amount of file cache is much bigger than anon,
    reclaim_stat's rotate-scan counter makes reclaim scan files more.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg_oom_mutex is used to protect memcg OOM path and eventfd interface
Michal Hocko [Tue, 26 Jul 2011 10:15:00 +0000 (20:15 +1000)]
memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
for oom_control.  None of the critical sections which it protects sleep
(eventfd_signal works from atomic context and the rest are simple linked
list and oom_lock atomic operations).

Mutex is also too heavyweight for those code paths because it triggers a
lot of scheduling.  It also makes convoying effects more visible when we
have a large number of OOM kills because we take the lock multiple times
during mem_cgroup_handle_oom so we have multiple places where many
processes can sleep.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago867578cb ("memcg: fix oom kill behavior") introduced oom_lock counter
Michal Hocko [Tue, 26 Jul 2011 10:15:00 +0000 (20:15 +1000)]
867578cb ("memcg: fix oom kill behavior") introduced oom_lock counter
which is incremented by mem_cgroup_oom_lock when we are about to handle
memcg OOM situation.  mem_cgroup_handle_oom falls back to a sleep if
oom_lock > 1 to prevent from multiple oom kills at the same time.  The
counter is then decremented by mem_cgroup_oom_unlock called from the same
function.

This works correctly but it can lead to serious starvations when we have
many processes triggering OOM and many CPUs available for them (I have
tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one that
got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
processes that are blocked on the mutex.  While A releases the mutex and
calls mem_cgroup_out_of_memory others will wake up (one after another) and
increase the counter and fall into sleep (memcg_oom_waitq).

Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
decreases oom_lock and wakes other tasks (if releasing memory by somebody
else - e.g.  killed process - hasn't done it yet).

Testcase would look like:
 Assume malloc XXX is a program allocating XXX Megabytes of memory
 which touches all allocated pages in a tight loop
 # swapoff SWAP_DEVICE
 # cgcreate -g memory:A
 # cgset -r memory.oom_control=0   A
 # cgset -r memory.limit_in_bytes= 200M
 # for i in `seq 100`
 # do
 #     cgexec -g memory:A   malloc 10 &
 # done

The main problem here is that all processes still race for the mutex and
there is no guarantee that we will get counter back to 0 for those that
got back to mem_cgroup_handle_oom.  In the end the whole convoy
in/decreases the counter but we do not get to 1 that would enable killing
so nothing useful can be done.  The time is basically unbounded because it
highly depends on scheduling and ordering on mutex (I have seen this
taking hours...).

This patch replaces the counter by a simple {un}lock semantic.  As
mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have to
make sure that nobody else races with us which is guaranteed by the
memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a
subtree which is already locked: hierarchy:

          A
        /   \
       B     \
      /\      \
     C  D     E

B - C - D tree might be already locked.  While we want to enable locking E
subtree because OOM situations cannot influence each other we definitely
do not want to allow locking A.

Therefore we have to refuse lock if any subtree is already locked and
clear up the lock for all nodes that have been set up to the failure
point.
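
The "refuse and roll back" rule can be sketched with a toy hierarchy.
This is a simplified model, not the patch: the Group class and the
oom_trylock()/oom_unlock() names are hypothetical, and the sketch assumes
callers are already serialized (the role memcg_oom_mutex plays in the
text).

```python
class Group:
    """Toy cgroup node holding a boolean oom_lock (hypothetical model)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.oom_lock = False
        if parent is not None:
            parent.children.append(self)


def subtree(root):
    """Yield root and all of its descendants in pre-order."""
    yield root
    for child in root.children:
        yield from subtree(child)


def oom_trylock(root):
    """Lock every group under root, or lock nothing if one is already locked."""
    taken = []
    for group in subtree(root):
        if group.oom_lock:
            # Someone holds a lock inside our subtree: undo what we set.
            for g in taken:
                g.oom_lock = False
            return False
        group.oom_lock = True
        taken.append(group)
    return True


def oom_unlock(root):
    """Release the lock on the whole subtree we locked earlier."""
    for group in subtree(root):
        group.oom_lock = False
```

With the hierarchy drawn above, locking B and then E both succeed in this
model, while a later attempt on A fails and leaves A unlocked.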

On the other hand we have to make sure that the rest of the world will
recognize that a group is under OOM even though it doesn't have a lock.
Therefore we have to introduce under_oom variable which is incremented and
decremented for the whole subtree when we enter or leave
mem_cgroup_handle_oom, respectively.  under_oom, unlike oom_lock, doesn't
need to be updated under memcg_oom_mutex because its users only check a
single group and they use atomic operations for that.

This can be checked easily by the following test case:

 # cgcreate -g memory:A
 # cgset -r memory.use_hierarchy=1 A
 # cgset -r memory.oom_control=1   A
 # cgset -r memory.limit_in_bytes= 100M
 # cgset -r memory.memsw.limit_in_bytes= 100M
 # cgcreate -g memory:A/B
 # cgset -r memory.oom_control=1 A/B
 # cgset -r memory.limit_in_bytes=20M
 # cgset -r memory.memsw.limit_in_bytes=20M
 # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
 # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A

While B gets oom_lock A will not get it.  Both of them go into sleep and
wait for an external action.  We can make the limit higher for A to
enforce waking it up

 # cgset -r memory.memsw.limit_in_bytes=300M A
 # cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't have oom_lock.

Finally, the unlock path is very easy because we always unlock only the
subtree we have locked previously while we always decrement under_oom.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoSigned-off-by: Igor Mammedov <imammedo@redhat.com>
Igor Mammedov [Tue, 26 Jul 2011 10:14:59 +0000 (20:14 +1000)]
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm/vmscan.c: In function 'zone_nr_lru_pages':
Andrew Morton [Tue, 26 Jul 2011 10:14:59 +0000 (20:14 +1000)]
mm/vmscan.c: In function 'zone_nr_lru_pages':
mm/vmscan.c:175: warning: passing argument 2 of 'mem_cgroup_zone_nr_lru_pages' makes pointer from integer without a cast
include/linux/memcontrol.h:307: note: expected 'struct zone *' but argument is of type 'int'
mm/vmscan.c:175: error: too many arguments to function 'mem_cgroup_zone_nr_lru_pages'

Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIn mm/memcontrol.c, there are many lru stat functions as..
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 10:14:59 +0000 (20:14 +1000)]
In mm/memcontrol.c, there are many lru stat functions as..

mem_cgroup_zone_nr_lru_pages
mem_cgroup_node_nr_file_lru_pages
mem_cgroup_nr_file_lru_pages
mem_cgroup_node_nr_anon_lru_pages
mem_cgroup_nr_anon_lru_pages
mem_cgroup_node_nr_unevictable_lru_pages
mem_cgroup_nr_unevictable_lru_pages
mem_cgroup_node_nr_lru_pages
mem_cgroup_nr_lru_pages
mem_cgroup_get_local_zonestat

Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
This seems bad. This patch consolidates all functions into

mem_cgroup_zone_nr_lru_pages()
mem_cgroup_node_nr_lru_pages()
mem_cgroup_nr_lru_pages()

For these functions, "which LRU?" information is passed by a mask.

example)
mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

I also added some macros such as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
example)
mem_cgroup_nr_lru_pages(mem, ALL_LRU)

BTW, considering layout of NUMA memory placement of counters, this patch seems
to be better.

Now, when we gather all LRU information, we scan in the following order
    for_each_lru -> for_each_node -> for_each_zone.

This means we'll touch cache lines in different node in turn.

After patch, we'll scan
    for_each_node -> for_each_zone -> for_each_lru(mask)

Then, we'll gather information in the same cacheline at once.
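
A small model of the mask-based interface and the new loop order is shown
below.  It is only a sketch: the counters are plain nested lists, the
function names drop the mem_cgroup_ prefix, and the LRU index values are
assumptions chosen for this example.

```python
# Assumed LRU list indices for this sketch.
(LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
 LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, LRU_UNEVICTABLE) = range(5)


def BIT(n):
    return 1 << n


ALL_LRU_ANON = BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)
ALL_LRU_FILE = BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)
ALL_LRU = ALL_LRU_ANON | ALL_LRU_FILE | BIT(LRU_UNEVICTABLE)


def zone_nr_lru_pages(zone_counts, lru_mask):
    """Sum the counters of one zone for every LRU selected by the mask."""
    return sum(n for lru, n in enumerate(zone_counts) if lru_mask & BIT(lru))


def nr_lru_pages(nodes, lru_mask):
    """Walk node -> zone -> lru so each zone's counters are read together."""
    total = 0
    for zones in nodes:
        for zone_counts in zones:
            total += zone_nr_lru_pages(zone_counts, lru_mask)
    return total
```

The innermost loop runs over the LRUs of a single zone, so all counters
of that zone are visited before moving on to the next one.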

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoEach memory cgroup has a 'swappiness' value which can be accessed by
KAMEZAWA Hiroyuki [Tue, 26 Jul 2011 10:14:59 +0000 (20:14 +1000)]
Each memory cgroup has a 'swappiness' value which can be accessed by
get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
and swappiness is passed by argument.  It's propagated by scan_control.

get_swappiness() is a static function but some planned updates will need
to get swappiness from files other than memcontrol.c.  This patch exports
get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
swappiness argument from try_to_free...  and drop swappiness from
scan_control, since only memcg uses it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoBen reported a lockup related to rtc. The lockup happens due to:
Thomas Gleixner [Tue, 26 Jul 2011 10:14:58 +0000 (20:14 +1000)]
Ben reported a lockup related to rtc. The lockup happens due to:

CPU0                                        CPU1

rtc_irq_set_state()                         __run_hrtimer()
  spin_lock_irqsave(&rtc->irq_task_lock)      rtc_handle_legacy_irq();
                                                spin_lock(&rtc->irq_task_lock);
  hrtimer_cancel()
    while (callback_running);

So the running callback never finishes as it's blocked on
rtc->irq_task_lock.

Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while
waiting for the callback.  Fix this for both rtc_irq_set_state() and
rtc_irq_set_freq().
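
The retry pattern can be modelled in user space as below.  This is a
sketch under stated assumptions: FakeTimer, cancel_timer() and the relax
callback are hypothetical stand-ins, with try_to_cancel() returning -1
while the callback runs, as hrtimer_try_to_cancel() does.

```python
import threading


class FakeTimer:
    """Toy timer: try_to_cancel() fails with -1 while the callback runs."""

    def __init__(self):
        self.callback_running = False
        self.active = True

    def try_to_cancel(self):
        if self.callback_running:
            return -1  # callback in progress, caller must retry
        was_active = self.active
        self.active = False
        return 1 if was_active else 0


def cancel_timer(timer, lock, relax):
    """Cancel timer; the caller holds lock, which is dropped while waiting."""
    retries = 0
    while timer.try_to_cancel() < 0:
        lock.release()   # let the running callback take the lock and finish
        relax()          # stand-in for a short busy-wait pause
        lock.acquire()
        retries += 1
    return retries
```

The point of the pattern is that the lock is never held while waiting
for the callback, so the callback can acquire it and complete.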

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ben Greear <greearb@candelatech.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoDue to the hrtimer self rearming mode a user can DoS the machine simply
Thomas Gleixner [Tue, 26 Jul 2011 10:14:58 +0000 (20:14 +1000)]
Due to the hrtimer self rearming mode a user can DoS the machine simply
because it's starved by hrtimer events.

The RTC hrtimer is self rearming.  We really need to limit the frequency
to something sensible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ben Greear <greearb@candelatech.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe code checks the correctness of the parameters, but unconditionally
Thomas Gleixner [Tue, 26 Jul 2011 10:14:58 +0000 (20:14 +1000)]
The code checks the correctness of the parameters, but unconditionally
arms/disarms the hrtimer.

The result is that a random task might arm/disarm rtc timer and surprise
the real owner by either generating events or by stopping them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ben Greear <greearb@candelatech.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe help text for this config is duplicated across the x86, parisc, and
Stephen Boyd [Tue, 26 Jul 2011 10:14:57 +0000 (20:14 +1000)]
The help text for this config is duplicated across the x86, parisc, and
s390 Kconfig.debug files.  Arnd Bergmann noted that the help text was
slightly misleading and should be fixed to state that enabling this option
isn't a problem when using pre 4.4 gcc.

To simplify the rewording, consolidate the text into lib/Kconfig.debug and
modify it there to be more explicit about when you should say N to this
config.

Also, make the text a bit more generic by stating that this option enables
compile time checks so we can cover architectures which emit warnings vs.
ones which emit errors.  The details of how an architecture decided to
implement the checks aren't as important as the concept of compile time
checking of copy_from_user() calls.

While we're doing this, remove all the copy_from_user_overflow() code
that's duplicated many times and place it into lib/ so that any
architecture supporting this option can get the function for free.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoStrict user copy checks are only really supported on x86_32 even though
Stephen Boyd [Tue, 26 Jul 2011 10:14:57 +0000 (20:14 +1000)]
Strict user copy checks are only really supported on x86_32 even though
the config option is selectable on x86_64.  Add the necessary support to
the 64 bit code to trigger copy_from_user() warnings at compile time.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoEnabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
Stephen Boyd [Tue, 26 Jul 2011 10:14:57 +0000 (20:14 +1000)]
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:

In file included from arch/x86/include/asm/uaccess.h:573,
                 from kernel/kprobes.c:55:
In function 'copy_from_user',
    inlined from 'write_enabled_file_bool' at
    kernel/kprobes.c:2191:
arch/x86/include/asm/uaccess_64.h:65:
warning: call to 'copy_from_user_overflow' declared with
attribute warning: copy_from_user() buffer size is not provably
correct

presumably due to buf_size being signed causing GCC to fail to see that
buf_size can't become negative.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoauto_demotion_disable is called only for online CPUs. For hotplugged
Shaohua Li [Tue, 26 Jul 2011 10:14:57 +0000 (20:14 +1000)]
auto_demotion_disable is called only for online CPUs.  For hotplugged
CPUs, we should disable it too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agosmp_call_function() only lets all other CPUs execute a specific function,
Shaohua Li [Tue, 26 Jul 2011 10:14:56 +0000 (20:14 +1000)]
smp_call_function() only makes all other CPUs execute a specific function,
while in intel_idle we expect all CPUs to do so.  Without the fix, we could
have one CPU which has auto_demotion enabled or has no broadcast timer set
up.  Usually we don't see an impact because auto demotion just harms power
and the intel_idle init is called on CPU 0, where the broadcast timer
delivers the interrupt, but this could still be a problem.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe address limit is already set in flush_old_exec() so this
Mathias Krause [Tue, 26 Jul 2011 10:14:56 +0000 (20:14 +1000)]
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoDev_opp initial value should be ERR_PTR(), IS_ERR() is used to check
Jonghwan Choi [Tue, 26 Jul 2011 10:14:56 +0000 (20:14 +1000)]
Dev_opp initial value should be ERR_PTR(), IS_ERR() is used to check
error.

Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoExpand the fs/Kconfig "help" info to clarify why you might need to select
Robert P. J. Day [Tue, 26 Jul 2011 10:14:56 +0000 (20:14 +1000)]
Expand the fs/Kconfig "help" info to clarify why you might need to select
the TMPFS_POSIX_ACL config variable.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoExpand the fs/Kconfig "help" info to clarify why it's a bad idea to
Robert P. J. Day [Tue, 26 Jul 2011 10:14:55 +0000 (20:14 +1000)]
Expand the fs/Kconfig "help" info to clarify why it's a bad idea to
deselect the TMPFS_POSIX_ACL config variable.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoFix NULL dereference I introduced in mincore_page().
Hugh Dickins [Tue, 26 Jul 2011 10:14:55 +0000 (20:14 +1000)]
Fix NULL dereference I introduced in mincore_page().

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoRemove PageSwapBacked (!page_is_file_cache) cases from
Hugh Dickins [Tue, 26 Jul 2011 10:14:55 +0000 (20:14 +1000)]
Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now go
through shmem_add_to_page_cache().

Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(), and
add a comment on swap entries to invalidate_mapping_pages().

And mincore_page() uses find_get_page() on what might be shmem or a tmpfs
file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoBut we've not yet removed the old swp_entry_t i_direct[16] from
Hugh Dickins [Tue, 26 Jul 2011 10:14:54 +0000 (20:14 +1000)]
But we've not yet removed the old swp_entry_t i_direct[16] from
shmem_inode_info.  That's because it was still being shared with the
inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
size), and use kmemdup() for short symlinks, say, those up to 128 bytes.

I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
rather than shmem_evict_inode(), where we usually do such freeing?  I
guess it doesn't matter, and I'm not into NUMA mpol testing right now.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoConvert shmem_writepage() to use shmem_delete_from_page_cache(), which uses
Hugh Dickins [Tue, 26 Jul 2011 10:14:54 +0000 (20:14 +1000)]
Convert shmem_writepage() to use shmem_delete_from_page_cache(), which uses
shmem_radix_tree_replace() to substitute swap entry for page pointer
atomically in the radix tree.

As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
copying such code from delete_from_swap_cache, but again judged easier to
sell than making its other callers go through the extras.

Remove the toy implementation's shmem_put_swap() and shmem_get_swap(), now
unreferenced, and the hack to disable swap: it's now good to go.

The way things have worked out, info->lock no longer helps to guard the
shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
That global mutex exclusion between shmem_writepage() and shmem_unuse() is
not pretty, and we ought to find another way; but it's been forced on us
by recent race discoveries, not a consequence of this patchset.

And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a swap
entry was found already present?  That's no longer possible, the (unknown)
one inserting this page into filecache would hit the swap entry occupying
that slot.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoRemove mem_cgroup_shmem_charge_fallback(): it was only required when we
Hugh Dickins [Tue, 26 Jul 2011 10:14:54 +0000 (20:14 +1000)]
Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
had to move swappage to filecache with GFP_NOWAIT.

Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
moving its call out from shmem_add_to_page_cache() to two of its three
callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
although asymmetrical, it's easier for all 3 callers to handle.

These two changes would also be appropriate if anyone were to start using
shmem_read_mapping_page_gfp() with GFP_NOWAIT.

Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
radix_tree_exceptional_entry() to get what it needs for itself.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoConvert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
Hugh Dickins [Tue, 26 Jul 2011 10:14:54 +0000 (20:14 +1000)]
Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
swap entry returned from radix tree by find_lock_page().

Whereas the repetitive old method proceeded mainly under info->lock,
dropping and repeating whenever one of the conditions needed was not met,
now we can proceed without it, leaving shmem_add_to_page_cache() to check
for a race.

This way there is no need to preallocate a page, no need for an early
radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().

Move the error unwinding down to the bottom instead of repeating it
throughout.  ENOSPC handling is a little different from before: there is
no longer any race between find_lock_page() and finding swap, but we can
arrive at ENOSPC before calling shmem_recalc_inode(), which might
occasionally discover freed space.

Be stricter about checking i_size before returning.  info->lock is used for
little but alloced, swapped, i_blocks updates.  Move i_blocks updates out
from under the max_blocks check, so even an unlimited size=0 mount can
show accurate du.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoConvert shmem_unuse_inode() to use a lockless gang lookup of the radix
Hugh Dickins [Tue, 26 Jul 2011 10:14:53 +0000 (20:14 +1000)]
Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
tree, searching for matching swap.

This is somewhat slower than the old method: because of repeated radix
tree descents, because of copying entries up, but probably most because
the old method noted and skipped once a vector page was cleared of swap.
Perhaps we can devise a use of radix tree tagging to achieve that later.

shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
for the lockless lookup by checking that the expected entry is in place,
under lock.  It is not very satisfactory to be copying this much from
add_to_page_cache_locked(), but I think easier to sell than insisting that
every caller of add_to_page_cache*() go through the extras.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoDisable the toy swapping implementation in shmem_writepage() - it's hard
Hugh Dickins [Tue, 26 Jul 2011 10:14:53 +0000 (20:14 +1000)]
Disable the toy swapping implementation in shmem_writepage() - it's hard
to support two schemes at once - and convert shmem_truncate_range() to a
lockless gang lookup of swap entries along with pages, freeing both.

Since the second loop tightens its noose until all entries of either kind
have been squeezed out (and we shall make sure that there's not an instant
when neither is visible), there is no longer a need for yet another pass
below.

shmem_radix_tree_replace() compensates for the lockless lookup by checking
that the expected entry is in place, under lock, before replacing it.
Here it just deletes, but will be used in later patches to substitute swap
entry for page or page for swap entry.
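
The check-then-replace idea described above can be sketched in user space
roughly as follows.  This is only an illustration under stated assumptions:
a flat array stands in for the radix tree, slot_replace_sketch() is a
hypothetical name, and the lock that the real code holds is assumed to be
taken by the caller rather than modelled here.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical sketch: replace slots[index] only if it still holds the
 * entry that an earlier lockless lookup saw.  The caller is assumed to
 * hold whatever lock serializes writers to slots[].
 * A NULL replacement simply deletes the entry.
 */
int slot_replace_sketch(void **slots, size_t nslots, size_t index,
			void *expected, void *replacement)
{
	if (index >= nslots)
		return -ENOENT;
	if (slots[index] != expected)
		return -ENOENT;	/* entry changed since the lockless lookup */
	slots[index] = replacement;
	return 0;
}
```

A caller that gets -ENOENT back knows it raced with someone else and can
retry its lookup.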

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoBring truncate.c's code for truncate_inode_pages_range() inline into
Hugh Dickins [Tue, 26 Jul 2011 10:14:53 +0000 (20:14 +1000)]
Bring truncate.c's code for truncate_inode_pages_range() inline into
shmem_truncate_range(), replacing its first call (there's a followup call
below, but leave that one, it will disappear next).

Don't play with it yet, apart from leaving out the cleancache flush, and
(importantly) the nrpages == 0 skip, and moving shmem_setattr()'s partial
page preparation into its partial page handling.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWhile it's at its least, make a number of boring nitpicky cleanups to
Hugh Dickins [Tue, 26 Jul 2011 10:14:52 +0000 (20:14 +1000)]
While it's at its least, make a number of boring nitpicky cleanups to
shmem.c, mostly for consistency of variable naming.  Things like "swap"
instead of "entry", "pgoff_t index" instead of "unsigned long idx".

And since everything else here is prefixed "shmem_",
better change init_tmpfs() to shmem_init().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe maximum size of a shmem/tmpfs file has been limited by the maximum
Hugh Dickins [Tue, 26 Jul 2011 10:14:52 +0000 (20:14 +1000)]
The maximum size of a shmem/tmpfs file has been limited by the maximum
size of its triple-indirect swap vector.  With 4kB page size, maximum
filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel, MAX_LFS_FILESIZE
being then more restrictive than swap vector layout.)
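
The figures above can be reproduced with a short calculation.  This is a
sketch under an assumption not stated in this message: that the old layout
addressed 16 direct slots plus (E*E/2) * (E+1) indirect slots, where E is
the number of unsigned longs that fit in one page.  The function names are
hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: page slots addressable by the old triple-indirect
 * swap vector, ASSUMING the limit was 16 + (E*E/2) * (E+1), with
 * E = page_size / sizeof(unsigned long) on the kernel in question.
 */
uint64_t old_vector_max_pages(uint64_t page_size, uint64_t long_size)
{
	const uint64_t nr_direct = 16;	/* the i_direct[16] vector */
	uint64_t e = page_size / long_size;

	return nr_direct + (e * e / 2) * (e + 1);
}

/* The same limit expressed in bytes of file data. */
uint64_t old_vector_max_bytes(uint64_t page_size, uint64_t long_size)
{
	return old_vector_max_pages(page_size, long_size) * page_size;
}
```

With 4kB pages this gives just over 2TB for 4-byte longs and roughly one
eighth of that for 8-byte longs; with 8kB pages and 8-byte longs, just over
4TB.  It does not model the MAX_LFS_FILESIZE cap.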

It's a shame that tmpfs should be more restrictive than ramfs, and this
limitation has now been noticed.  Add another level to the swap vector?
No, it became obscure and hard to maintain, once I complicated it to make
use of highmem pages nine years ago: better choose another way.

Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
tmpfs would never have invented its own peculiar radix tree: we would have
fitted swap entries into the common radix tree instead, in much the same
way as we fit swap entries into page tables.

And why should each file have a separate radix tree for its pages and for
its swap entries?  The swap entries are required precisely where and when
the pages are not.  We want to put them together in a single radix tree:
which can then avoid much of the locking which was needed to prevent them
from being exchanged underneath us.

This also avoids the waste of memory devoted to swap vectors, first in the
shmem_inode itself, then at least two more pages once a file grew beyond
16 data pages (pages accounted by df and du, but not by memcg).  Allocated
upfront, to avoid allocation when under swapping pressure, but pure waste
when CONFIG_SWAP is not set - I have never spattered around the ifdefs to
prevent that, preferring this move to sharing the common radix tree
instead.

There are three downsides to sharing the radix tree.  One, that it binds
tmpfs more tightly to the rest of mm, either requiring knowledge of swap
entries in radix tree there, or duplication of its code here in shmem.c.
I believe that the simplifications and memory savings (and probable higher
performance, not yet measured) justify that.

Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
nodes that cannot be freed under memory pressure - whereas before it was
the less precious highmem swap vector pages that could not be freed.  I'm
hoping that 64-bit has now been accessible for long enough, that the
highmem argument has grown much less persuasive.

Three, that swapoff is slower than it used to be on tmpfs files, since
it's using a simple generic mechanism not tailored to it: I find this
noticeable, and shall want to improve, but maybe nobody else will notice.

So...  now remove most of the old swap vector code from shmem.c.  But, for
the moment, keep the simple i_direct vector of 16 pages, with simple
accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
to help mark where swap needs to be handled in subsequent patches.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIf swap entries are to be stored along with struct page pointers in a
Hugh Dickins [Tue, 26 Jul 2011 10:14:52 +0000 (20:14 +1000)]
If swap entries are to be stored along with struct page pointers in a
radix tree, they need to be distinguished as exceptional entries.

Most of the handling of swap entries in radix tree will be contained in
shmem.c, but a few functions in filemap.c's common code need to check for
their appearance: find_get_page(), find_lock_page(), find_get_pages() and
find_get_pages_contig().

So as not to slow their fast paths, tuck those checks inside the existing
checks for unlikely radix_tree_deref_slot(); except for find_lock_page(),
where it is an added test.  And make it a BUG in find_get_pages_tag(),
which is not applied to tmpfs files.

A part of the reason for eliminating shmem_readpage() earlier, was to
minimize the places where common code would need to allow for swap
entries.

The swp_entry_t known to swapfile.c must be massaged into a slightly
different form when stored in the radix tree, just as it gets massaged
into a pte_t when stored in page tables.

In an i386 kernel this limits its information (type and page offset) to 30
bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
swapfile size of 128GB.  Which is less than the 512GB we previously
allowed with X86_PAE (where the swap entry can occupy the entire upper 32
bits of a pte_t), but not a new limitation on 32-bit without PAE; and
there's not a new limitation on 64-bit (where swap filesize is already
limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
probably still enough swap for a 64GB 32-bit machine.
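
The arithmetic behind those limits, as a small sketch.  The function name
and its parameters are hypothetical; the only inputs taken from the text
are the bit counts, the 32 swapfile types (5 bits) and the 4kB page size
(shift of 12).

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest swapfile, in bytes, when a swap entry has
 * info_bits bits for type plus page offset, of which type_bits select the
 * swap area, and pages are (1 << page_shift) bytes.
 */
uint64_t max_swapfile_bytes(unsigned info_bits, unsigned type_bits,
			    unsigned page_shift)
{
	unsigned offset_bits = info_bits - type_bits;

	return (uint64_t)1 << (offset_bits + page_shift);
}
```

So 30 bits of information give 128GB, the full 32 bits available with
X86_PAE give 512GB, and a 32-bit page offset (37 bits of information with
5 type bits) gives the 16TB quoted for 64-bit.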

Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
enforce filesize limit in read_swap_header(), just as for ptes.
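
A user-space sketch of what such a pair of conversions might look like.
The type and function names carry a _sketch suffix because they are
illustrations, not the kernel definitions, and the bit layout (payload
shifted above two low tag bits, second-lowest bit set) is an assumption
based on the description of exceptional entries.

```c
#include <stdint.h>

#define SKETCH_EXCEPTIONAL_ENTRY ((uintptr_t)2)	/* assumed tag bit */
#define SKETCH_EXCEPTIONAL_SHIFT 2		/* assumed payload shift */

/* Stand-in for swp_entry_t: an opaque type-plus-offset value. */
typedef struct {
	uintptr_t val;
} swp_entry_sketch;

/* Pack a swap entry into a pointer-sized value tagged as exceptional. */
void *swp_to_radix_sketch(swp_entry_sketch entry)
{
	uintptr_t value = entry.val << SKETCH_EXCEPTIONAL_SHIFT;

	return (void *)(value | SKETCH_EXCEPTIONAL_ENTRY);
}

/* Recover the swap entry by discarding the two low tag bits. */
swp_entry_sketch radix_to_swp_sketch(void *arg)
{
	swp_entry_sketch entry;

	entry.val = (uintptr_t)arg >> SKETCH_EXCEPTIONAL_SHIFT;
	return entry;
}
```

The shift is why two bits of the word are lost to the entry, which is the
source of the 30-bit limit on a 32-bit kernel.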

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoA patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its peculiar
Hugh Dickins [Tue, 26 Jul 2011 10:14:52 +0000 (20:14 +1000)]
A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its peculiar
swap vector, instead keeping a file's swap entries in the same radix tree
as its struct page pointers: thus saving memory, and simplifying its code
and locking.

This patch:

The radix_tree is used by several subsystems for different purposes.  A
major use is to store the struct page pointers of a file's pagecache for
memory management.  But what if mm wanted to store something other than
page pointers there too?

The low bit of a radix_tree entry is already used to denote an indirect
pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
Define the next bit as denoting an exceptional entry, and supply inline
functions radix_tree_exception() to return non-0 in either unlikely case,
and radix_tree_exceptional_entry() to return non-0 in the second case.

If a subsystem already uses radix_tree with that bit set, no problem: it
does not affect internal workings at all, but is defined for the
convenience of those storing well-aligned pointers in the radix_tree.
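
The two tests can be sketched in user space like this.  The macro and
function names are hypothetical stand-ins, and the bit values (1 for the
indirect pointer, 2 for the exceptional entry) follow from "the low bit"
and "the next bit" above.

```c
#include <stdint.h>

#define INDIRECT_PTR_BIT	((uintptr_t)1)	/* low bit: internal use */
#define EXCEPTIONAL_BIT		((uintptr_t)2)	/* next bit: exceptional */

/* Non-zero in either unlikely case: indirect pointer or exceptional. */
int entry_is_exception(const void *entry)
{
	return ((uintptr_t)entry & (INDIRECT_PTR_BIT | EXCEPTIONAL_BIT)) != 0;
}

/* Non-zero only when the entry is tagged as exceptional. */
int entry_is_exceptional(const void *entry)
{
	return ((uintptr_t)entry & EXCEPTIONAL_BIT) != 0;
}
```

An ordinary pointer to an object aligned to at least four bytes has both
low bits clear, so it fails both tests and is treated as a normal entry.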

The radix_tree_gang_lookups have an implicit assumption that the caller
can deduce the offset of each entry returned, e.g. by the page->index of a
struct page.  But that may not be feasible for some kinds of item to be
stored there.

Let radix_tree_gang_lookup_slot() take an optional indices argument: an
output array in which to return those offsets.  The same could be added to
other radix_tree_gang_lookups, but for now keep it to the only one for
which we need it.
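
The shape of such an interface, sketched over a flat array rather than a
real radix tree.  gang_lookup_sketch() and its parameters are hypothetical;
the point is only the optional indices output, which the caller may pass
as NULL when it can deduce offsets some other way.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: collect up to max_items non-NULL entries from
 * slots[first_index .. nslots).  If indices is non-NULL, also record the
 * offset at which each returned entry was found.
 * Returns the number of entries stored in results[].
 */
size_t gang_lookup_sketch(void *const *slots, size_t nslots,
			  void **results, size_t *indices,
			  size_t first_index, size_t max_items)
{
	size_t found = 0;
	size_t i;

	for (i = first_index; i < nslots && found < max_items; i++) {
		if (!slots[i])
			continue;	/* empty slot */
		results[found] = slots[i];
		if (indices)
			indices[found] = i;
		found++;
	}
	return found;
}
```

Entries such as swap entries carry no page->index of their own, so a
caller iterating over them needs the offsets returned this way.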

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoUse the nice enumerated constant.
Andrew Morton [Tue, 26 Jul 2011 10:14:51 +0000 (20:14 +1000)]
Use the nice enumerated constant.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoCc: Greg KH <greg@kroah.com>
Andrew Morton [Tue, 26 Jul 2011 10:14:51 +0000 (20:14 +1000)]
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoalpha allmodconfig:
Andrew Morton [Tue, 26 Jul 2011 10:14:51 +0000 (20:14 +1000)]
alpha allmodconfig:

drivers/staging/solo6x10/p2m.c:52: error: implicit declaration of function 'kzalloc'

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoalpha allmodconfig:
Andrew Morton [Tue, 26 Jul 2011 10:14:50 +0000 (20:14 +1000)]
alpha allmodconfig:

drivers/staging/solo6x10/core.c:140: error: implicit declaration of function 'kzalloc'

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoalpha allmodconfig:
Andrew Morton [Tue, 26 Jul 2011 10:14:50 +0000 (20:14 +1000)]
alpha allmodconfig:

drivers/staging/dt3155v4l/dt3155v4l.c:434: error: implicit declaration of function 'kzalloc'

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agox86_64 allmodconfig:
Andrew Morton [Tue, 26 Jul 2011 10:14:50 +0000 (20:14 +1000)]
x86_64 allmodconfig:

In file included from arch/x86/include/asm/uaccess.h:572,
                 from include/linux/uaccess.h:5,
                 from drivers/staging/speakup/devsynth.c:4:
In function 'copy_from_user',
    inlined from 'speakup_file_write' at drivers/staging/speakup/devsynth.c:28:
arch/x86/include/asm/uaccess_64.h:64: error: call to 'copy_from_user_overflow' declared with attribute error: copy_from_user() buffer size is not provably correct

I'm not sure what was unprovable about it, but size_t is the correct type
anyway.

Also replace needless min_t() with min().

Cc: William Hubbs <w.d.hubbs@gmail.com>
Cc: Greg KH <greg@kroah.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>