git.kernelconcepts.de Git - karo-tx-linux.git/log
karo-tx-linux.git
12 years ago Add linux-next specific files for 20111101 next-20111101
Stephen Rothwell [Tue, 1 Nov 2011 09:32:32 +0000 (20:32 +1100)]
Add linux-next specific files for 20111101

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
12 years ago device-mapper: using EXPORT_SYMBOL in dm-space-map-checker.c needs export.h
Stephen Rothwell [Tue, 1 Nov 2011 09:27:43 +0000 (20:27 +1100)]
device-mapper: using EXPORT_SYMBOL in dm-space-map-checker.c needs export.h

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
12 years ago Revert "ALSA: intel8x0: Improve performance in virtual environment"
Stephen Rothwell [Tue, 1 Nov 2011 09:24:24 +0000 (20:24 +1100)]
Revert "ALSA: intel8x0: Improve performance in virtual environment"

This reverts commit 228cf79376f13b98f2e1ac10586311312757675c.

12 years ago Revert "drivers/scsi/sd.c: use ida_simple_get() and ida_simple_remove() in place...
Stephen Rothwell [Tue, 1 Nov 2011 08:55:03 +0000 (19:55 +1100)]
Revert "drivers/scsi/sd.c: use ida_simple_get() and ida_simple_remove() in place of boilerplate code"

This reverts commit ddabd33db5a288353e48b16c18a77e12209635c2.

12 years ago Merge branch 'akpm'
Stephen Rothwell [Tue, 1 Nov 2011 08:48:56 +0000 (19:48 +1100)]
Merge branch 'akpm'

12 years ago ramoops: update parameters only after successful init
Kees Cook [Mon, 24 Oct 2011 15:00:24 +0000 (02:00 +1100)]
ramoops: update parameters only after successful init

If a platform device exists on the system, but ramoops fails to attach to
it, the module parameters are overridden before ramoops can fall back and
try to use passed module parameters.  Move update to end of init routine.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Sergiu Iordache <sergiu@chromium.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago dio: using prefetch requires including prefetch.h
Stephen Rothwell [Mon, 24 Oct 2011 15:00:23 +0000 (02:00 +1100)]
dio: using prefetch requires including prefetch.h

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago dio-optimize-cache-misses-in-the-submission-path-v2-checkpatch-fixes
Andrew Morton [Mon, 24 Oct 2011 15:00:23 +0000 (02:00 +1100)]
dio-optimize-cache-misses-in-the-submission-path-v2-checkpatch-fixes

ERROR: trailing whitespace
#63: FILE: fs/direct-io.c:1109:
+^I/* $

ERROR: trailing whitespace
#80: FILE: fs/direct-io.c:1127:
+^I^Iif (unlikely((addr & blocksize_mask) || $

WARNING: suspect code indent for conditional statements (24, 33)
#82: FILE: fs/direct-io.c:1129:
  if (bdev)
+  blkbits = blksize_bits(

ERROR: trailing whitespace
#99: FILE: fs/direct-io.c:1319:
+^Istruct block_device *bdev, const struct iovec *iov, loff_t offset, $

ERROR: trailing whitespace
#103: FILE: fs/direct-io.c:1323:
+^I/* $

ERROR: trailing whitespace
#108: FILE: fs/direct-io.c:1328:
+^I * $

total: 5 errors, 1 warnings, 80 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
      scripts/cleanfile

./patches/dio-optimize-cache-misses-in-the-submission-path-v2.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago dio: optimize cache misses in the submission path
Andi Kleen [Mon, 24 Oct 2011 15:00:23 +0000 (02:00 +1100)]
dio: optimize cache misses in the submission path

Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size.  This is because it has to walk the chain from
block_dev to gendisk to queue.

The block size is needed early on to check alignment and sizes.  It's only
done if the check for the inode block size fails.  But the costly block
device state is unconditionally fetched.

- Reorganize the code to only fetch block dev state when actually
  needed.

Then do a prefetch on the block dev early on in the direct IO path.  This
is worth it, because there is substantial code run before we actually
touch the block dev now.

- I also added some unlikelies to make it clear to the compiler that the
  block device fetch code is not normally executed.

This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)
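
As a rough user-space illustration of the idea (not the code from
fs/direct-io.c), the sketch below checks alignment against the cheap inode
block size first and only reads the "expensive" device state on the
unlikely slow path.  The struct, the function names and the fetch counter
are hypothetical; unlikely() is modelled on the kernel macro using the
GCC/Clang builtins __builtin_expect and __builtin_prefetch.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang branch hint, modelled on the kernel's unlikely(). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for device state that is costly to reach. */
struct fake_bdev {
    unsigned blkbits;       /* log2 of the device's logical block size */
    unsigned long fetches;  /* how often the costly state was read */
};

/* Simulates the cache-missing walk to the device block size. */
static unsigned fetch_bdev_blkbits(struct fake_bdev *bdev)
{
    bdev->fetches++;
    return bdev->blkbits;
}

/* Return 1 if offset and len are acceptably aligned, 0 otherwise. */
static int dio_alignment_ok(struct fake_bdev *bdev, unsigned inode_blkbits,
                            uint64_t offset, uint64_t len)
{
    uint64_t mask;

    assert(inode_blkbits < 32 && bdev->blkbits < 32);

    /* Start pulling the structure into cache before it is needed. */
    __builtin_prefetch(bdev);

    mask = (1ULL << inode_blkbits) - 1;
    if (unlikely((offset | len) & mask)) {
        /* Slow path: only now pay for the device lookup. */
        mask = (1ULL << fetch_bdev_blkbits(bdev)) - 1;
        if ((offset | len) & mask)
            return 0;
    }
    return 1;
}
```

A request aligned to the inode block size never touches the device state
in this sketch; only the fallback check does.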

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago vfs: cache request_queue in struct block_device
Andi Kleen [Mon, 24 Oct 2011 15:00:22 +0000 (02:00 +1100)]
vfs: cache request_queue in struct block_device

This makes it possible to get from the inode to the request_queue with one
less cache miss.  Used in a follow-on optimization.

The lifetime of the pointer is the same as the gendisk.

This assumes that the queue will always stay the same in the gendisk while
it's visible to block_devices.  I think that's safe, correct?

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
Tao Ma [Mon, 24 Oct 2011 15:00:17 +0000 (02:00 +1100)]
fs/direct-io.c: calculate fs_count correctly in get_more_blocks()

In get_more_blocks(), we use dio_count to calculate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned.  But
actually it still has some corner cases that can't be covered.  See the
following example:

dio_write foo -s 1024 -w 4096

(direct write 4096 bytes at offset 1024).  The same goes if the offset
isn't aligned to fs_blocksize.

In this case, the old calculation counts fs_count to be 1, but actually we
will write into 2 different blocks (if fs_blocksize=4096).  The old code
just works, since it will call get_block twice (and may have to allocate
and create extents twice for filesystems like ext4).  So we'd better call
get_block just once with the proper fs_count.
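
The arithmetic behind the example can be sketched as follows.  This is an
illustration only, not the patched get_more_blocks(); both function names
are hypothetical.  The first helper counts blocks from the byte range, the
second from the length alone, which is how the example above ends up with 1.

```c
#include <assert.h>
#include <stdint.h>

/* Filesystem blocks touched by the byte range [offset, offset + count). */
static uint64_t blocks_spanned(uint64_t offset, uint64_t count,
                               unsigned blkbits)
{
    uint64_t first, last;

    assert(count > 0 && blkbits < 32);
    first = offset >> blkbits;
    last = (offset + count - 1) >> blkbits;
    return last - first + 1;
}

/* Block count derived from the length only, ignoring the start offset. */
static uint64_t blocks_from_length_only(uint64_t count, unsigned blkbits)
{
    uint64_t blkmask;

    assert(blkbits < 32);
    blkmask = (1ULL << blkbits) - 1;
    return (count >> blkbits) + ((count & blkmask) ? 1 : 0);
}
```

With fs_blocksize=4096 (blkbits 12), a 4096-byte write at offset 1024
spans two blocks, while the length-only count is one.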

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago aio: allocate kiocbs in batches
Jeff Moyer [Mon, 24 Oct 2011 15:00:17 +0000 (02:00 +1100)]
aio: allocate kiocbs in batches

In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path.  The reason
is that we take it for every I/O submitted (see __aio_get_req).  Since we
know how many I/Os are passed to io_submit, we can preallocate the kiocbs
in batches, reducing the number of times we take and release the lock.
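
The effect on lock traffic can be sketched with a toy model.  This is not
the aio code: the context struct, the counting "lock" and both allocation
helpers are hypothetical, and only the number of lock acquisitions is
modelled.

```c
#include <assert.h>

/* Hypothetical context; the "lock" only counts how often it is taken. */
struct fake_ctx {
    unsigned long lock_taken;  /* number of lock acquisitions */
    unsigned long allocated;   /* number of request objects handed out */
};

static void ctx_lock(struct fake_ctx *ctx)   { ctx->lock_taken++; }
static void ctx_unlock(struct fake_ctx *ctx) { (void)ctx; }

/* One lock round trip per request. */
static void alloc_per_request(struct fake_ctx *ctx, unsigned long nr)
{
    for (unsigned long i = 0; i < nr; i++) {
        ctx_lock(ctx);
        ctx->allocated++;
        ctx_unlock(ctx);
    }
}

/* One lock round trip per batch of up to 'batch' requests. */
static void alloc_batched(struct fake_ctx *ctx, unsigned long nr,
                          unsigned long batch)
{
    assert(batch > 0);
    while (nr > 0) {
        unsigned long n = nr < batch ? nr : batch;

        ctx_lock(ctx);
        ctx->allocated += n;
        ctx_unlock(ctx);
        nr -= n;
    }
}
```

For 128 requests, the per-request variant takes the lock 128 times; with
a batch size of 32 it is taken 4 times.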

In my testing, I was able to reduce the amount of time spent in
_raw_spin_lock_irq by 0.56% (average of 3 runs).  The command I used to
test this was:

   aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>

I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Daniel Ehrenberg <dehrenberg@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/misc/vmw_balloon.c: fix typo in code comment
Rakib Mullick [Mon, 24 Oct 2011 15:00:16 +0000 (02:00 +1100)]
drivers/misc/vmw_balloon.c: fix typo in code comment

Fix typo in code comment.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
Rakib Mullick [Mon, 24 Oct 2011 15:00:16 +0000 (02:00 +1100)]
drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop

In vmballoon_reserve_page(), flags is passed in from the caller
(vmballoon_inflate here).  So, we can determine can_sleep outside
the loop.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago w1: disable irqs in critical section
Jan Weitzel [Mon, 24 Oct 2011 15:00:16 +0000 (02:00 +1100)]
w1: disable irqs in critical section

Interrupting w1_delay() in w1_read_bit() results in missing the low level
on the w1 line and receiving "1" instead of "0".

Add local_irq_save()/local_irq_restore() around the critical section.

Signed-off-by: Jan Weitzel <j.weitzel@phytec.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/w1/w1_int.c: multiple masters used same init_name
Florian Faber [Mon, 24 Oct 2011 15:00:15 +0000 (02:00 +1100)]
drivers/w1/w1_int.c: multiple masters used same init_name

When using multiple masters, w1_int.c would use the .init_name from w1.c
for all entities, which will fail when creating a corresponding sysfs
entry.  This patch uses the unique name previously generated.
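
A minimal sketch of why a per-master name avoids the collision: each
master gets a name derived from its id, so no two entries share a
filename.  The helper name and the "w1_bus_master%u" pattern are
assumptions made for illustration, not taken from this patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a per-master device name into buf.
 * The "w1_bus_master%u" pattern is an assumed naming scheme.
 * Returns 0 on success, -1 if the name does not fit.
 */
static int master_name(char *buf, size_t len, unsigned id)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "w1_bus_master%u", id);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Two masters with ids 1 and 2 then register under distinct names, whereas
a shared static name would produce the duplicate-filename warning shown
below.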

[    0.716000] WARNING: at fs/sysfs/dir.c:451 sysfs_add_one+0x48/0x64()
[    0.716000] sysfs: cannot create duplicate filename '/devices/w1 bus
master'
[    0.716000] Modules linked in:
[    0.716000] Call trace:
[    0.716000]  [<9001a604>] warn_slowpath_common+0x34/0x44
[    0.716000]  [<9001a64c>] warn_slowpath_fmt+0x14/0x18
[    0.720000]  [<90078020>] sysfs_add_one+0x48/0x64
[    0.720000]  [<900784ec>] create_dir+0x40/0x68
[    0.720000]  [<9007857a>] sysfs_create_dir+0x66/0x78
[    0.720000]  [<900c1a8a>] kobject_add_internal+0x6e/0x104
[    0.720000]  [<900c1bc0>] kobject_add_varg+0x20/0x2c
[    0.720000]  [<900c1c1c>] kobject_add+0x30/0x3c
[    0.720000]  [<900dbd66>] device_add+0x6a/0x378
[    0.720000]  [<900dbb4a>] device_initialize+0x12/0x48
[    0.720000]  [<900dc080>] device_register+0xc/0x10
[    0.720000]  [<900f99be>] w1_add_master_device+0x162/0x274
[    0.720000]  [<90008e7a>] w1_gpio_probe+0x66/0xb4
[    0.720000]  [<9000030c>] kernel_init+0x0/0xe8
[    0.720000]  [<900dde54>] platform_drv_probe+0xc/0xe
[    0.720000]  [<9000030c>] kernel_init+0x0/0xe8
[    0.720000]  [<900dd4f8>] driver_probe_device+0x6c/0xdc
[    0.720000]  [<900dd5fc>] __driver_attach+0x34/0x48
[    0.720000]  [<900dcce8>] bus_for_each_dev+0x2c/0x48
[    0.720000]  [<900dd5c8>] __driver_attach+0x0/0x48
[    0.720000]  [<900dd38c>] driver_attach+0x10/0x14
[    0.720000]  [<900dd16a>] bus_add_driver+0x6a/0x18c
[    0.720000]  [<900dd768>] driver_register+0x60/0xb8
[    0.720000]  [<90011594>] __initcall_w1_therm_init6+0x0/0x4
[    0.720000]  [<90008e00>] w1_gpio_init+0x0/0x14
[    0.720000]  [<9000030c>] kernel_init+0x0/0xe8
[    0.720000]  [<900ddf48>] platform_driver_register+0x30/0x38
[    0.720000]  [<90011594>] __initcall_w1_therm_init6+0x0/0x4
[    0.720000]  [<90008e00>] w1_gpio_init+0x0/0x14
[    0.720000]  [<9000030c>] kernel_init+0x0/0xe8
[    0.720000]  [<900ddf5e>] platform_driver_probe+0xe/0x3c
[    0.720000]  [<90008e0c>] w1_gpio_init+0xc/0x14
[    0.720000]  [<90011594>] __initcall_w1_therm_init6+0x0/0x4
[    0.720000]  [<90008e00>] w1_gpio_init+0x0/0x14
[    0.720000]  [<900126d4>] do_one_initcall+0x34/0x130
[    0.720000]  [<90000372>] kernel_init+0x66/0xe8
[    0.720000]  [<90011594>] __initcall_w1_therm_init6+0x0/0x4
[    0.720000]  [<9001ca3e>] do_exit+0x0/0x3a6
[    0.720000]  [<9000030c>] kernel_init+0x0/0xe8
[    0.720000]  [<9001ca3e>] do_exit+0x0/0x3a6
[    0.720000]
[    0.720000] ---[ end trace 5a9233884fead918 ]---
[    0.720000] kobject_add_internal failed for w1 bus master with
-EEXIST, don't try to register things with the same name in the same
directory.

Signed-off-by: Florian Faber <faber@faberman.de>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
Clifton Barnes [Mon, 24 Oct 2011 15:00:15 +0000 (02:00 +1100)]
drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal

Fixes the deadlock when inserting and removing the ds2780.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/power/ds2780_battery.c: add a nolock function to w1 interface
Clifton Barnes [Mon, 24 Oct 2011 15:00:15 +0000 (02:00 +1100)]
drivers/power/ds2780_battery.c: add a nolock function to w1 interface

Adds a nolock function to the w1 interface to avoid locking the
mutex if needed.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/power/ds2780_battery.c: create central point for calling w1 interface
Clifton Barnes [Mon, 24 Oct 2011 15:00:14 +0000 (02:00 +1100)]
drivers/power/ds2780_battery.c: create central point for calling w1 interface

Simply creates one point to call the w1 interface.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
Jonathan Cameron [Mon, 24 Oct 2011 15:00:14 +0000 (02:00 +1100)]
w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it

Straightforward.  As an aside, the ida_init calls are not needed as far as
I can see.  (DEFINE_IDA does the same already).

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Clifton Barnes <cabarnes@indesign-llc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago pps gpio client: add missing dependency
Heiko Carstens [Mon, 24 Oct 2011 15:00:13 +0000 (02:00 +1100)]
pps gpio client: add missing dependency

Add "depends on GENERIC_HARDIRQS" to avoid compile breakage on s390:

drivers/built-in.o: In function `pps_gpio_remove':
linux-next/drivers/pps/clients/pps-gpio.c:189: undefined reference to `free_irq'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Nuss <jamesnuss@nanometrics.ca>
Cc: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago pps-new-client-driver-using-gpio-fix
Andrew Morton [Mon, 24 Oct 2011 15:00:13 +0000 (02:00 +1100)]
pps-new-client-driver-using-gpio-fix

remove unneeded cast of void*

Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Cc: James Nuss <jamesnuss@nanometrics.ca>
Cc: Ricardo Martins <rasm@fe.up.pt>
Cc: Rodolfo Giometti <giometti@linux.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago pps: new client driver using GPIO
James Nuss [Mon, 24 Oct 2011 15:00:13 +0000 (02:00 +1100)]
pps: new client driver using GPIO

This client driver allows you to use a GPIO pin as a source for PPS
signals.  Platform data [1] are used to specify the GPIO pin number,
label, assert event edge type, and whether clear events are captured.

This driver is based on the work by Ricardo Martins who submitted an
initial implementation [2] of a PPS IRQ client driver to the linuxpps
mailing-list on Dec 3 2010.

[1] include/linux/pps-gpio.h
[2] http://ml.enneenne.com/pipermail/linuxpps/2010-December/004155.html

Signed-off-by: James Nuss <jamesnuss@nanometrics.ca>
Cc: Ricardo Martins <rasm@fe.up.pt>
Acked-by: Rodolfo Giometti <giometti@linux.it>
Signed-off-by: Ricardo Martins <rasm@fe.up.pt>
Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago pps: default echo function
James Nuss [Mon, 24 Oct 2011 15:00:12 +0000 (02:00 +1100)]
pps: default echo function

A default echo function has been provided so it is no longer an error when
you specify PPS_ECHOASSERT or PPS_ECHOCLEAR without an explicit echo
function.  This allows some code re-use and also makes it easier to write
client drivers since the default echo function does not normally need to
change.
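
The pattern can be sketched in plain C as a callback that falls back to a
default when the client passes none.  The types and names below are
hypothetical and do not mirror the PPS core structures.

```c
#include <assert.h>
#include <stddef.h>

struct fake_source;
typedef void (*echo_fn)(struct fake_source *src, int event);

/* Hypothetical source descriptor with an optional echo callback. */
struct fake_source {
    echo_fn echo;
    int echoed;  /* number of events echoed so far */
};

/* Default echo: just record that an event was echoed. */
static void default_echo(struct fake_source *src, int event)
{
    (void)event;
    src->echoed++;
}

/* A NULL callback is no longer an error; the default is installed. */
static void register_source(struct fake_source *src, echo_fn echo)
{
    assert(src != NULL);
    src->echo = echo ? echo : default_echo;
    src->echoed = 0;
}
```

A client that has no special echo needs simply registers with a NULL
callback and still gets working echo behaviour.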

Signed-off-by: James Nuss <jamesnuss@nanometrics.ca>
Reviewed-by: Ben Gardiner <bengardiner@nanometrics.ca>
Acked-by: Rodolfo Giometti <giometti@linux.it>
Cc: Ricardo Martins <rasm@fe.up.pt>
Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago include/linux/dma-mapping.h: add dma_zalloc_coherent()
Andrew Morton [Mon, 24 Oct 2011 15:00:12 +0000 (02:00 +1100)]
include/linux/dma-mapping.h: add dma_zalloc_coherent()

Lots of driver code does a dma_alloc_coherent() and then zeroes out the
memory with a memset.  Make it easy for them.
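
A user-space analogue of the allocate-then-zero helper might look like
this.  fake_alloc_coherent() is a hypothetical stand-in for
dma_alloc_coherent(); it deliberately returns dirty memory so that the
zeroing is observable.  The real helper takes a device, a DMA handle and
GFP flags, which are omitted here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator that returns non-zeroed memory. */
static void *fake_alloc_coherent(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0xAA, size);
    return p;
}

/* Allocate and zero in one call, so callers need no separate memset. */
static void *fake_zalloc_coherent(size_t size)
{
    void *p;

    assert(size > 0);
    p = fake_alloc_coherent(size);
    if (p)
        memset(p, 0, size);
    return p;
}
```

Callers replace the two-step allocate and memset sequence with the single
call and keep their existing NULL check.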

Cc: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago sysctl: make CONFIG_SYSCTL_SYSCALL default to n
WANG Cong [Mon, 24 Oct 2011 15:00:11 +0000 (02:00 +1100)]
sysctl: make CONFIG_SYSCTL_SYSCALL default to n

When I tried to send a patch to remove it, Andi told me we still need to
keep compatibility with old libc, so we can't remove this completely.  So
just make it default to n and remove the doc from
feature-removal-schedule.txt.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago sysctl-add-support-for-poll-fix
Andrew Morton [Mon, 24 Oct 2011 15:00:11 +0000 (02:00 +1100)]
sysctl-add-support-for-poll-fix

s/declare/define/ for definitions

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago sysctl: add support for poll()
Lucas De Marchi [Mon, 24 Oct 2011 15:00:10 +0000 (02:00 +1100)]
sysctl: add support for poll()

Adding support for poll() in sysctl fs allows userspace to receive
notifications of changes in sysctl entries.  This adds an infrastructure to
allow files in sysctl fs to be pollable and implements it for hostname and
domainname.
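
From user space, waiting for such a notification could look roughly like
the sketch below.  The helper name is hypothetical.  It assumes that a
change is reported as POLLERR | POLLPRI, as with other pollable procfs
files, and that the current value must be read before polling; both
assumptions should be checked against the kernel in use.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the sysctl file at 'path' to change.
 * Returns 1 if a change was signalled, 0 on timeout, -1 on error.
 */
static int wait_for_sysctl_change(const char *path, int timeout_ms)
{
    char buf[256];
    struct pollfd pfd;
    int ret;

    assert(path != NULL);
    pfd.fd = open(path, O_RDONLY);
    if (pfd.fd < 0)
        return -1;

    /* Read the current value first so only later changes are reported. */
    if (read(pfd.fd, buf, sizeof(buf)) < 0) {
        close(pfd.fd);
        return -1;
    }

    pfd.events = POLLPRI;
    pfd.revents = 0;
    ret = poll(&pfd, 1, timeout_ms);
    close(pfd.fd);

    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & (POLLERR | POLLPRI))) ? 1 : 0;
}
```

A caller could pass /proc/sys/kernel/hostname and re-read the file
whenever the function returns 1.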

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago RapidIO: documentation update
Alexandre Bounine [Mon, 24 Oct 2011 15:00:10 +0000 (02:00 +1100)]
RapidIO: documentation update

Update rapidio.txt to reflect changes from recent patch.
See http://marc.info/?l=linux-kernel&m=131285620113589&w=2 for details.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Liu Gang <Gang.Liu@freescale.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/net/rionet.c: fix ethernet address macros for LE platforms
Alexandre Bounine [Mon, 24 Oct 2011 15:00:10 +0000 (02:00 +1100)]
drivers/net/rionet.c: fix ethernet address macros for LE platforms

Modify Ethernet address macros to be compatible with BE/LE platforms.
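
The general technique for endian-independent address handling is to work
on individual bytes rather than casting to a wider integer.  The sketch
below illustrates that technique only; the function names and the 4-byte
prefix are assumptions, not the macros from rionet.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed 4-byte prefix marking a RapidIO-derived address. */
static const uint8_t rio_mac_prefix[4] = { 0x00, 0x01, 0x00, 0x01 };

/* Byte-wise compare: gives the same answer on BE and LE hosts. */
static int mac_is_rio(const uint8_t mac[6])
{
    assert(mac != NULL);
    return memcmp(mac, rio_mac_prefix, sizeof(rio_mac_prefix)) == 0;
}

/* Build the 16-bit destination id from the last two bytes explicitly. */
static uint16_t mac_to_destid(const uint8_t mac[6])
{
    assert(mac != NULL);
    return (uint16_t)((mac[4] << 8) | mac[5]);
}
```

Comparing through a cast such as *(uint32_t *)mac would instead depend on
the host byte order, which is the class of bug this kind of fix targets.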

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Chul Kim <chul.kim@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: <stable@kernel.org> [2.6.39+]
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago RapidIO: fix potential null deref in rio_setup_device()
Alexandre Bounine [Mon, 24 Oct 2011 15:00:09 +0000 (02:00 +1100)]
RapidIO: fix potential null deref in rio_setup_device()

The "goto cleanup" path can dereference "rswitch" when it is NULL.
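
The shape of such a bug and its fix can be sketched in user space.  The
structures and the setup function below are hypothetical and far simpler
than rio_setup_device(); the point is only the NULL check on the shared
cleanup path.

```c
#include <assert.h>
#include <stdlib.h>

struct fake_switch { int *route_table; };
struct fake_dev { struct fake_switch *rswitch; };

/*
 * Returns 0 and stores the device in *out on success.
 * Returns -1 and stores NULL in *out on failure.
 */
static int setup_device(int is_switch, int fail_late, struct fake_dev **out)
{
    struct fake_dev *dev;
    struct fake_switch *rswitch = NULL;

    assert(out != NULL);
    *out = NULL;
    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return -1;

    if (is_switch) {
        rswitch = calloc(1, sizeof(*rswitch));
        if (!rswitch)
            goto cleanup;
        rswitch->route_table = calloc(16, sizeof(int));
        if (!rswitch->route_table)
            goto cleanup;
    }
    if (fail_late)
        goto cleanup;

    dev->rswitch = rswitch;
    *out = dev;
    return 0;

cleanup:
    /* rswitch is NULL for endpoints, so check before dereferencing it. */
    if (rswitch)
        free(rswitch->route_table);
    free(rswitch);
    free(dev);
    return -1;
}
```

Without the check, a failure on a non-switch device would read
rswitch->route_table through a NULL pointer.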

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Chul Kim <chul.kim@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago RapidIO: Tsi721 driver - fixes for the initial release
Alexandre Bounine [Mon, 24 Oct 2011 15:00:09 +0000 (02:00 +1100)]
RapidIO: Tsi721 driver - fixes for the initial release

- address comments made by Andrew Morton,
  see http://marc.info/?l=linux-kernel&m=131361256714116&w=2
- add spinlock for IB_MSG handler
- rename private BDMA channel structure to avoid conflict with DMA engine
- fix endianness bug in outbound message interrupt handler

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Chul Kim <chul.kim@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago RapidIO: add mport driver for Tsi721 bridge
Alexandre Bounine [Mon, 24 Oct 2011 15:00:09 +0000 (02:00 +1100)]
RapidIO: add mport driver for Tsi721 bridge

Add RapidIO mport driver for IDT TSI721 PCI Express-to-SRIO bridge device.
The driver provides a full set of callback functions defined for mport
devices in the RapidIO subsystem.  It is also compatible with the current
version of the RIONET driver (Ethernet over RapidIO messaging services).

This patch is applicable to kernel versions starting from 2.6.39.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Chul Kim <chul.kim@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago arch/powerpc/sysdev/fsl_rio.c: release rapidio port I/O region resource if port faile...
Liu Gang [Mon, 24 Oct 2011 15:00:08 +0000 (02:00 +1100)]
arch/powerpc/sysdev/fsl_rio.c: release rapidio port I/O region resource if port failed to initialize

The "struct rio_mport" contains a member of master port I/O memory
resource structure "struct resource iores".  This resource will be read
from device tree and be used for rapidio R/W transaction memory space.
Rapidio requests the port I/O memory resource under the root resource
"iomem_resource".

struct rio_mport *port;
port = kzalloc(sizeof(struct rio_mport), GFP_KERNEL);

request_resource(&iomem_resource, &port->iores);

If the port fails to initialize, the allocated "rio_mport" structure is
freed and the port I/O memory resource pointer "&port->iores" becomes
invalid.  If anything else later requests a resource under
"iomem_resource", the stale "&port->iores" node may still be walked in the
child resource list, and this will cause the system to crash.

So the requested port I/O memory resource should be released before
freeing allocated "rio_mport" structure.
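
The ordering problem can be modelled with a small linked list standing in
for the resource tree.  Everything below is hypothetical (fake_request()
and fake_release() are not request_resource() and release_resource());
the sketch only shows that the embedded node must be unlinked from its
parent before the structure containing it is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy resource node: a parent keeps a singly linked list of children. */
struct fake_resource {
    struct fake_resource *child;    /* first child */
    struct fake_resource *sibling;  /* next child of the same parent */
};

/* The port embeds its resource, as rio_mport embeds iores. */
struct fake_port { struct fake_resource iores; };

static void fake_request(struct fake_resource *root,
                         struct fake_resource *res)
{
    res->sibling = root->child;
    root->child = res;
}

static void fake_release(struct fake_resource *root,
                         struct fake_resource *res)
{
    struct fake_resource **p;

    for (p = &root->child; *p; p = &(*p)->sibling) {
        if (*p == res) {
            *p = res->sibling;
            res->sibling = NULL;
            return;
        }
    }
}

/* Returns the port, or NULL if allocation or (simulated) init fails. */
static struct fake_port *port_setup(struct fake_resource *root,
                                    int init_fails)
{
    struct fake_port *port;

    assert(root != NULL);
    port = calloc(1, sizeof(*port));
    if (!port)
        return NULL;

    fake_request(root, &port->iores);
    if (init_fails) {
        fake_release(root, &port->iores);  /* unlink the node first */
        free(port);                        /* then free its container */
        return NULL;
    }
    return port;
}
```

Skipping the release step would leave root->child pointing into freed
memory, which the next list walk would then dereference.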

Signed-off-by: Liu Gang <Gang.Liu@freescale.com>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago drivers/rapidio/rio-scan.c: use discovered bit to test if enumeration is complete
Liu Gang [Mon, 24 Oct 2011 15:00:08 +0000 (02:00 +1100)]
drivers/rapidio/rio-scan.c: use discovered bit to test if enumeration is complete

The discovered bit in PGCCSR register indicates if the device has been
discovered by system host.  In Rapidio systems, some agent devices can also
be master devices.  They can issue requests into the system.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago init-add-root=partuuid=uuid-partnroff=%d-support-fix
Stephen Rothwell [Mon, 24 Oct 2011 15:00:07 +0000 (02:00 +1100)]
init-add-root=partuuid=uuid-partnroff=%d-support-fix

After merging the akpm tree, today's linux-next build (lots of them)
produced this warning:

WARNING: init/mounts.o(.text+0x192): Section mismatch in reference from the function devt_from_partuuid() to the variable .init.data:root_wait
The function devt_from_partuuid() references
the variable __initdata root_wait.
This is often because devt_from_partuuid lacks a __initdata
annotation or the annotation of root_wait is wrong.

Commit 185237e4cfab ("This patch makes two changes:"
init-add-root=partuuid=uuid-partnroff=%d-support-update.patch) adds the
reference to root_wait from the non-init function devt_from_partuuid().

The easiest thing to do is to remove __initdata from root_wait.

I have applied this patch as a merge fixup for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 10 Aug 2011 12:09:35 +1000
Subject: [PATCH] do_mounts: remove __initdata from root_wait

as it is now used from a non init routine.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago init: add root=PARTUUID=UUID/PARTNROFF=%d support
Will Drewry [Mon, 24 Oct 2011 15:00:07 +0000 (02:00 +1100)]
init: add root=PARTUUID=UUID/PARTNROFF=%d support

Expand root=PARTUUID=UUID syntax to support selecting a root partition by
integer offset from a known, unique partition.  This approach provides
similar properties to specifying a device and partition number, but using
the UUID as the unique path prior to evaluating the offset.

For example,
  root=PARTUUID=99DE9194-FC15-4223-9192-FC243948F88B/PARTNROFF=1
selects the partition with UUID 99DE.., then selects the next
partition.
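
Parsing such a specification can be sketched in user space as below.
This is an illustration of the syntax, not the code in init/do_mounts.c;
the function name, buffer handling and return convention are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split "PARTUUID=<uuid>[/PARTNROFF=<n>]" into its UUID and offset.
 * The offset defaults to 0 when no /PARTNROFF= suffix is present.
 * Returns 0 on success, -1 if the string is malformed or too long.
 */
static int parse_partuuid(const char *spec, char *uuid, size_t uuid_len,
                          int *offset)
{
    static const char prefix[] = "PARTUUID=";
    const char *start, *slash;
    size_t n;
    char trailing;

    assert(spec && uuid && offset && uuid_len > 0);
    if (strncmp(spec, prefix, sizeof(prefix) - 1) != 0)
        return -1;

    start = spec + sizeof(prefix) - 1;
    slash = strchr(start, '/');
    n = slash ? (size_t)(slash - start) : strlen(start);
    if (n == 0 || n >= uuid_len)
        return -1;
    memcpy(uuid, start, n);
    uuid[n] = '\0';

    *offset = 0;
    /* The extra %c rejects trailing characters after the number. */
    if (slash &&
        sscanf(slash + 1, "PARTNROFF=%d%c", offset, &trailing) != 1)
        return -1;
    return 0;
}
```

For the example above, the function yields the full UUID string and an
offset of 1; a plain root=PARTUUID=<uuid> yields an offset of 0.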

This change is motivated by a particular usecase in Chromium OS where the
bootloader can easily determine what partition it is on (by UUID) but
doesn't perform general partition table walking.

That said, support for this model provides a direct mechanism for the user
to modify the root partition to boot without specifically needing to
extract each UUID or update the bootloader explicitly when the root
partition UUID is changed (if it is recreated to be larger, for instance).
Pinning to a /boot-style partition UUID allows arbitrary root
partition reconfiguration/modifications with slightly less ambiguity than
just [dev][partition] and less stringency than the specific root partition
UUID.

Signed-off-by: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago ipc/sem.c: alternatives to preempt_disable()
Manfred Spraul [Mon, 24 Oct 2011 15:00:07 +0000 (02:00 +1100)]
ipc/sem.c: alternatives to preempt_disable()

ipc/sem.c uses a custom wakeup scheme that relies on preempt_disable().
On -RT, this causes increased latencies and debug warnings.

The patch adds two additional schemes:
- one built around a completion - could be better for -RT kernels
- one built around a spinlock - unfortunately it's broken
- and the current one

My preferred solution would be the spinlock implementation: RT would use
preemptible spinlocks, mainline normal spinlocks.  Thus both get the
optimal implementation without any special code in ipc/sem.c.
Unfortunately, I don't see how it could be fixed.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago include/linux/sem.h: make sysv_sem empty if SYSVIPC is disabled
Manfred Spraul [Mon, 24 Oct 2011 15:00:06 +0000 (02:00 +1100)]
include/linux/sem.h: make sysv_sem empty if SYSVIPC is disabled

For the sysvsem undo, each task struct contains a sysv_sem structure with
a pointer to the undo information.

This pointer is only necessary if sysvipc is enabled - thus the pointer
can be made conditional on CONFIG_SYSVIPC.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago ipc/sem.c: remove private structures from public header file
Manfred Spraul [Mon, 24 Oct 2011 15:00:06 +0000 (02:00 +1100)]
ipc/sem.c: remove private structures from public header file

include/linux/sem.h contains several structures that are only used within
ipc/sem.c.

The patch moves them into ipc/sem.c - there is no need to expose the
structures to the whole kernel.

No functional changes, only whitespace cleanups and 80-char per line
fixes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago ipc/sem.c: handle spurious wakeups
Manfred Spraul [Mon, 24 Oct 2011 15:00:05 +0000 (02:00 +1100)]
ipc/sem.c: handle spurious wakeups

semtimedop() does not handle spurious wakeups: it returns -EINTR to user
space.  Most other schedule() users would just loop and not return to user
space.  The patch adds such a loop to semtimedop().

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago ipc/sem.c: fix return code race with semop vs. semop +semctl(IPC_RMID)
Manfred Spraul [Mon, 24 Oct 2011 15:00:05 +0000 (02:00 +1100)]
ipc/sem.c: fix return code race with semop vs. semop +semctl(IPC_RMID)

sys_semtimedop() may return -EIDRM although the semaphore operation
completed successfully:

thread 1:                               thread 2:
semtimedop(), sleeps
                                        semop():
                                        * acquires sem_lock()
semtimedop() woken up due to timeout
sem_lock() loops
                                        * notices that thread 1 could be
                                          completed.
                                        * performs the operations that
                                          thread 1 is sleeping on.
                                        * marks the semaphore operation as
                                          IN_WAKEUP
                                        * drops sem_lock(), does wakeup,
                                          sets return code to 0
* thread delayed due to interrupt,
  whatever
                                        * returns to user space
* thread still delayed
                                        semctl(IPC_RMID)
                                        * acquires sem_lock()
                                        * ipc_rmid(), ipcp->deleted=1
                                        * drops sem_lock()
* thread finally continues - but
  sem_lock() now fails due to
  ipcp->deleted == 1
* returns -EIDRM instead of 0

The fix is trivial: Always use the return code in queue.status.

In the real world, the race probably doesn't matter:
If the semaphore array is destroyed, the app is probably not interested
if the last operation succeeded or was already cancelled.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago proc: force dcache drop on unauthorized access
Vasiliy Kulikov [Mon, 24 Oct 2011 15:00:05 +0000 (02:00 +1100)]
proc: force dcache drop on unauthorized access

The patch "proc: fix races against execve() of /proc/PID/fd**" is still a
partial fix for a setxid problem.  link(2) is yet another way to
identify whether a specific fd is opened by a privileged process.  By
calling link(2) against /proc/PID/fd/* an attacker may identify whether
the fd number is valid for PID by analysing link(2) return code.

Both getattr() and link() can be used by the attacker iff the dentry is
present in the dcache.  In this case ->lookup() is not called and the only
way to check ptrace permissions is in either the operation handler or
->revalidate().  The easiest solution to prevent any unauthorized access
to /proc/PID/fd*/ files is to force the dentry drop on each unauthorized
access attempt.

If an attacker keeps an open fd of /proc/PID/fd/ and the dcache contains a
specific dentry for some /proc/PID/fd/XXX, any future attempt to use the
dentry by the attacker would lead to the dentry being dropped as a result
of a failed ptrace check in ->revalidate().  Then the attacker cannot spawn
a dentry for the specific fd number because of the ptrace check in
->lookup().

The dentry drop can still be observed by an attacker by analysing
information from /proc/slabinfo, which is addressed in the following
patch.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoproc-fix-races-against-execve-of-proc-pid-fd-fix
Vasiliy Kulikov [Mon, 24 Oct 2011 15:00:04 +0000 (02:00 +1100)]
proc-fix-races-against-execve-of-proc-pid-fd-fix

In the patch "proc: fix races against execve() of /proc/PID/fd**"
proc_pid_fd_link_getattr() leaked task_struct if ptrace check fails.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoproc: fix races against execve() of /proc/PID/fd**
Vasiliy Kulikov [Mon, 24 Oct 2011 15:00:04 +0000 (02:00 +1100)]
proc: fix races against execve() of /proc/PID/fd**

fd* files are restricted to the task's owner, and other users may not get
direct access to them.  But one may open any of these files and run any
setuid program, keeping the file descriptors open.  As there are permission
checks on open(), but not on readdir() and read(), operations on the kept
file descriptors will not be checked.  This makes it possible to violate
the procfs permission model.

Reading fdinfo/* may disclose the current fds' position and flags; reading
the directory contents of fdinfo/ and fd/ may disclose the number of files
opened by the target task.  This information is not sensitive per se, but
it can reveal some private information (like the length of a password
stored in a file) under certain conditions.

Use the existing (un)lock_trace functions to check for ptrace_may_access(),
but instead of using the EPERM return code from it use EACCES to be
consistent with the existing proc_pid_follow_link()/proc_pid_readlink()
return code.  If they differ, an attacker can guess what fds exist by
analyzing the stat() return code.  Patched handlers: stat() for fd/*,
stat() and read() for fdinfo/*, readdir() and lookup() for fd/ and fdinfo/.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoprocfs: report EISDIR when reading sysctl dirs in proc
Pavel Emelyanov [Mon, 24 Oct 2011 15:00:03 +0000 (02:00 +1100)]
procfs: report EISDIR when reading sysctl dirs in proc

On reading sysctl dirs we should return -EISDIR instead of -EINVAL.
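
The convention this patch follows can be observed from user space on an ordinary directory. The following is only a sketch: the helper name errno_from_reading_dir() is made up for illustration, and it probes a regular directory rather than a sysctl directory under /proc/sys.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open a directory read-only, attempt a plain read() on it and report
 * the errno that read() produced.  Returns 0 if the read unexpectedly
 * succeeded and -1 if the directory could not be opened at all.
 */
static int errno_from_reading_dir(const char *path)
{
    char buf[16];
    int err = 0;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (read(fd, buf, sizeof(buf)) < 0)
        err = errno;

    close(fd);
    return err;
}
```

On a typical Linux filesystem, calling errno_from_reading_dir("/") is expected to yield EISDIR, which is the value the sysctl directories should report as well.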

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocpusets: avoid looping when storing to mems_allowed if one node remains set
David Rientjes [Mon, 24 Oct 2011 15:00:03 +0000 (02:00 +1100)]
cpusets: avoid looping when storing to mems_allowed if one node remains set

{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.

This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself.  It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example.  In low memory
conditions, this is problematic because it's one of the most important
times to change cpuset.mems in the first place!

The only way a task's set of allowable nodes may change is through cpusets
by writing to cpuset.mems and when attaching a task to a cpuset.  This is
done by setting all new nodes, ensuring generic code is not reading the
nodemask with get_mems_allowed() at the same time, and then clearing all
the old nodes.  This prevents the possibility that a reader will see an
empty nodemask at the same time the writer is storing a new nodemask.

If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes.  Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.
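
The set-then-clear ordering described above can be sketched in user-space C. This is an illustration under assumptions, not the cpuset code: struct nodemask, MASK_WORDS and nodemask_update_fast() are hypothetical names, and a real lockless reader would additionally need the appropriate memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MASK_WORDS 4 /* hypothetical: a nodemask wider than one long */

struct nodemask {
    unsigned long bits[MASK_WORDS];
};

/* True if the two masks share at least one set bit. */
static bool nodemask_intersects(const struct nodemask *a,
                                const struct nodemask *b)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        if (a->bits[i] & b->bits[i])
            return true;
    return false;
}

/*
 * Store newmask into cur without ever passing through an empty mask.
 * This only works when at least one node is set both before and after
 * the change; otherwise return false so the caller can fall back to a
 * slower path that synchronises with readers.
 */
static bool nodemask_update_fast(struct nodemask *cur,
                                 const struct nodemask *newmask)
{
    assert(cur != NULL && newmask != NULL);

    if (!nodemask_intersects(cur, newmask))
        return false;

    /* Pass 1: set all new nodes (the mask only grows). */
    for (size_t i = 0; i < MASK_WORDS; i++)
        cur->bits[i] |= newmask->bits[i];

    /* Pass 2: clear the old nodes that are not in the new mask. */
    for (size_t i = 0; i < MASK_WORDS; i++)
        cur->bits[i] &= newmask->bits[i];

    return true;
}
```

Because the shared node stays set during both passes, a reader that samples the mask word by word in the middle of the update still finds at least one allowed node.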

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Paul Menage <paul@paulmenage.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomm/page_cgroup.c: quiet sparse noise
H Hartley Sweeten [Mon, 24 Oct 2011 15:00:03 +0000 (02:00 +1100)]
mm/page_cgroup.c: quiet sparse noise

warning: symbol 'swap_cgroup_ctrl' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: Fix race condition in memcg_check_events() with this_cpu usage
Steven Rostedt [Mon, 24 Oct 2011 15:00:02 +0000 (02:00 +1100)]
memcg: Fix race condition in memcg_check_events() with this_cpu usage

Various code in memcontrol.c calls this_cpu_read() for calculations that
combine two different percpu variables, or does an open-coded
read-modify-write on a single percpu variable.

Disable preemption throughout these operations so that the writes go to
the correct places.

[ Added this_cpu to __this_cpu conversion by Johannes ]

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: close race between charge and putback
Johannes Weiner [Mon, 24 Oct 2011 15:00:02 +0000 (02:00 +1100)]
memcg: close race between charge and putback

There is a potential race between a thread charging a page and another
thread putting it back to the LRU list:

charge:                         putback:
SetPageCgroupUsed               SetPageLRU
PageLRU && add to memcg LRU     PageCgroupUsed && add to memcg LRU

The order of setting one flag and checking the other is crucial, otherwise
the charge may observe !PageLRU while the putback observes !PageCgroupUsed
and the page is not linked to the memcg LRU at all.

Global memory pressure may fix this by trying to isolate and putback the
page for reclaim, where that putback would link it to the memcg LRU again.
Without that, the memory cgroup is undeletable due to a charge whose
physical page cannot be found and moved out.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Cc: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg-skip-scanning-active-lists-based-on-individual-size-fix
Johannes Weiner [Mon, 24 Oct 2011 15:00:01 +0000 (02:00 +1100)]
memcg-skip-scanning-active-lists-based-on-individual-size-fix

Also ditch the documentation note for the removed stats value.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: skip scanning active lists based on individual size
Johannes Weiner [Mon, 24 Oct 2011 15:00:01 +0000 (02:00 +1100)]
memcg: skip scanning active lists based on individual size

Reclaim decides to skip scanning an active list when the corresponding
inactive list is above a certain size in comparison, so as to leave the
assumed working set alone while there are still enough reclaim candidates
around.

The memcg implementation of comparing those lists instead reports whether
the whole memcg is low on the requested type of inactive pages,
considering all nodes and zones.

This can lead to an oversized active list not being scanned because of the
state of the other lists in the memcg, as well as an active list being
scanned while its corresponding inactive list has enough pages.

Not only is this wrong, it's also a scalability hazard, because the global
memory state over all nodes and zones has to be gathered for each memcg
and zone scanned.

Make these calculations purely based on the size of the two LRU lists
that are actually affected by the outcome of the decision.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: do not expose uninitialized mem_cgroup_per_node to world
Igor Mammedov [Mon, 24 Oct 2011 15:00:01 +0000 (02:00 +1100)]
memcg: do not expose uninitialized mem_cgroup_per_node to world

If somebody is touching data too early, it might be easier to diagnose a
problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
understand why mem_cgroup_per_zone is [un|partly]initialized.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: replace ss->id_lock with a rwlock
Andrew Bresticker [Mon, 24 Oct 2011 15:00:00 +0000 (02:00 +1100)]
memcg: replace ss->id_lock with a rwlock

While back-porting Johannes Weiner's patch "mm: memcg-aware global
reclaim" for an internal effort, we noticed a significant performance
regression during page-reclaim heavy workloads due to high contention of
the ss->id_lock.  This lock protects idr map, and serializes calls to
idr_get_next() in css_get_next() (which is used during the memcg hierarchy
walk).  Since idr_get_next() is just doing a look up, we need only
serialize it with respect to idr_remove()/idr_get_new().  By making the
ss->id_lock a rwlock, contention is greatly reduced and performance
improves.

Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
each core (one file + container per core) in parallel on a NUMA machine.
Result is the time for the test to complete in 1 of the containers.  Both
kernels included Johannes' memcg-aware global reclaim patches.

Before rwlock patch: 1710.778s
After rwlock patch: 152.227s

Signed-off-by: Andrew Bresticker <abrestic@google.com>
Cc: Paul Menage <menage@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: fix oom schedule_timeout()
KAMEZAWA Hiroyuki [Mon, 24 Oct 2011 15:00:00 +0000 (02:00 +1100)]
memcg: fix oom schedule_timeout()

Before calling schedule_timeout(), task state should be changed.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agomemcg: rename mem variable to memcg
Raghavendra K T [Mon, 24 Oct 2011 15:00:00 +0000 (02:00 +1100)]
memcg: rename mem variable to memcg

The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
"struct mem_cgroup *memcg".  Rename all mem variables to memcg in the
source file.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: ERR_PTR needs err.h
Stephen Rothwell [Mon, 24 Oct 2011 14:59:59 +0000 (01:59 +1100)]
cgroups: ERR_PTR needs err.h

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: add a task counter subsystem
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:59 +0000 (01:59 +1100)]
cgroups: add a task counter subsystem

Add a new subsystem to limit the number of running tasks, similar to the
NR_PROC rlimit but in the scope of a cgroup.

The user can set an upper bound limit that is checked every time a task
forks in a cgroup or is moved into a cgroup with that subsystem bound.

The primary goal is to protect against forkbombs that explode inside a
container.  The traditional NR_PROC rlimit is not efficient in that case
because if we run containers in parallel under the same user, one of these
could starve all the others by spawning a high number of tasks close to
the user wide limit.

This is a prevention against forkbombs, so it's not meant to cure the
effects of a forkbomb when the system is already in a state where it's not
responsive.  It's aimed at preventing the system from ever reaching that
state and at stopping the spreading of tasks early.  While defining the
limit on the allowed number of tasks, it's up to the user to find the right
balance between the resources the containers may need and what the system
can afford to provide.

As it's totally dissociated from the rlimit NR_PROC, both can be
complementary: the cgroup task counter can set an upper bound per
container and the rlimit can be an upper bound on the overall set of
containers.

Also this subsystem can be used to kill all the tasks in a cgroup without
races against concurrent forks: by setting the limit of tasks to 0, any
further forks can be rejected.  This is a good way to kill a forkbomb in a
container, or simply kill any container without the need to retry an
unbound number of times.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: allow subsystems to cancel a fork
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:58 +0000 (01:59 +1100)]
cgroups: allow subsystems to cancel a fork

Let the subsystems' fork callbacks return an error value so that they can
cancel a fork.  This is going to be used by the task counter subsystem to
implement the limit.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: pull up res counter charge failure interpretation to caller
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:58 +0000 (01:59 +1100)]
cgroups: pull up res counter charge failure interpretation to caller

res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.

However it's up to the caller to interpret this failure and return the
appropriate error value.  The task counter subsystem will need to report
to the user that a fork() has been cancelled because some limit was
reached, not because we are too short on memory.

Fix this by returning -1 when res_counter_charge() fails.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agores_counter: allow charge failure pointer to be null
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:58 +0000 (01:59 +1100)]
res_counter: allow charge failure pointer to be null

So that callers of res_counter_charge() don't have to create and pass this
pointer even if they aren't interested in it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: add res counter common ancestor searching
Kirill A. Shutemov [Mon, 24 Oct 2011 14:59:57 +0000 (01:59 +1100)]
cgroups: add res counter common ancestor searching

Add a new API to find the common ancestor between two resource counters.
This includes the passed resource counters themselves.
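
One straightforward way to find such an ancestor is to walk both parent chains. This is a sketch with a hypothetical struct counter_node; it does not claim to match the function added by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter with a link to its parent (NULL at the root). */
struct counter_node {
    struct counter_node *parent;
};

/*
 * Return the closest node that is an ancestor of both a and b, where a
 * node counts as its own ancestor.  Returns NULL if the two counters do
 * not belong to the same hierarchy.
 */
static struct counter_node *common_ancestor(struct counter_node *a,
                                            struct counter_node *b)
{
    assert(a != NULL && b != NULL);

    for (struct counter_node *x = a; x != NULL; x = x->parent)
        for (struct counter_node *y = b; y != NULL; y = y->parent)
            if (x == y)
                return x;
    return NULL;
}
```

The nested walk is quadratic in the depth of the hierarchy, which is acceptable for a sketch since cgroup hierarchies are usually shallow.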

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: ability to stop res charge propagation on bounded ancestor
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:57 +0000 (01:59 +1100)]
cgroups: ability to stop res charge propagation on bounded ancestor

Moving a task from one cgroup to another may require subtracting its
resource charge from the old cgroup and adding it to the new one.

For this to happen, the uncharge/charge propagation can just stop when we
reach the common ancestor of the two cgroups.  Beyond the performance
reasons, we also want to avoid temporarily overloading the common
ancestors with an inaccurate resource counter usage if we charge the new
cgroup first and uncharge the old one thereafter.  This is going to be a
requirement for the coming max number of tasks subsystem.

To solve this, provide a pair of new APIs that can charge/uncharge a
resource counter until we reach a given ancestor.
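
A bounded charge/uncharge pair could look roughly like this. It is a sketch under assumptions: struct counter, charge_until() and uncharge_until() are hypothetical names, the stopping ancestor itself is left untouched, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hierarchical counter. */
struct counter {
    struct counter *parent; /* NULL at the root */
    unsigned long usage;
    unsigned long limit;
};

/*
 * Charge val to c and to each of its ancestors below top.  top itself is
 * not charged; pass NULL to charge all the way up to the root.  On
 * failure every level charged so far is rolled back and -1 is returned.
 */
static int charge_until(struct counter *c, struct counter *top,
                        unsigned long val)
{
    struct counter *u;

    assert(c != NULL);

    for (u = c; u != NULL && u != top; u = u->parent) {
        if (u->usage + val > u->limit)
            goto undo;
        u->usage += val;
    }
    return 0;

undo:
    /* Roll back the levels between c and the one that refused. */
    for (struct counter *v = c; v != u; v = v->parent)
        v->usage -= val;
    return -1;
}

/* Uncharge val from c and from each of its ancestors below top. */
static void uncharge_until(struct counter *c, struct counter *top,
                           unsigned long val)
{
    assert(c != NULL);

    for (struct counter *u = c; u != NULL && u != top; u = u->parent)
        u->usage -= val;
}
```

When a task moves between two siblings, charging the new counter and uncharging the old one with their common ancestor as top leaves the ancestor's usage unchanged throughout, so it is never transiently over its limit.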

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: new cancel_attach_task() subsystem callback
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:57 +0000 (01:59 +1100)]
cgroups: new cancel_attach_task() subsystem callback

To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no way to cancel
the attachment individually for each member of the process group.

This is going to be needed for the max number of tasks subsystem that is
coming.

To prepare for this integration, call a new cancel_attach_task() callback
on each task of the group until we reach the member that failed to attach.
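
The per-task rollback can be sketched as a loop that undoes every member handled before the failing one. All names here are hypothetical stand-ins for the subsystem callbacks, and -EBUSY is an arbitrary error code picked for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical task: "refuse" makes the attach check fail. */
struct task_sketch {
    int refuse;
    int reserved;
};

static int cancel_calls; /* counts rollbacks, for illustration only */

/* Stand-in for a subsystem's can_attach_task() callback. */
static int demo_can_attach_task(struct task_sketch *t)
{
    if (t->refuse)
        return -EBUSY;
    t->reserved = 1;
    return 0;
}

/* Stand-in for the new cancel_attach_task() callback. */
static void demo_cancel_attach_task(struct task_sketch *t)
{
    t->reserved = 0;
    cancel_calls++;
}

/*
 * Try to attach every member of the group.  If member i fails, cancel
 * members 0..i-1, which are the ones that had already been accepted.
 */
static int attach_group(struct task_sketch *tasks, size_t n)
{
    assert(tasks != NULL);

    for (size_t i = 0; i < n; i++) {
        int err = demo_can_attach_task(&tasks[i]);

        if (err) {
            for (size_t j = 0; j < i; j++)
                demo_cancel_attach_task(&tasks[j]);
            return err;
        }
    }
    return 0;
}
```

The member that failed is not cancelled, since its own check never took effect.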

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: add previous cgroup in can_attach_task/attach_task callbacks
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:56 +0000 (01:59 +1100)]
cgroups: add previous cgroup in can_attach_task/attach_task callbacks

This is to prepare for the integration of a new max number of proc cgroup
subsystem.  We'll need to release some resources from the previous cgroup.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: new resource counter inheritance API
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:56 +0000 (01:59 +1100)]
cgroups: new resource counter inheritance API

Provide an API to inherit a counter value from a parent.  This can be
useful to implement cgroup.clone_children on a resource counter.

Still the resources of the children are limited by those of the parent, so
this is only to provide a default setting behaviour when clone_children is
set.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: add res_counter_write_u64() API
Frederic Weisbecker [Mon, 24 Oct 2011 14:59:55 +0000 (01:59 +1100)]
cgroups: add res_counter_write_u64() API

Extend the resource counter API with a mirror of res_counter_read_u64() to
make it handy to update a resource counter value from a cgroup subsystem
u64 value file.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Aditya Kali <adityakali@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroup/kmemleak: Annotate alloc_page() for cgroup allocations
Steven Rostedt [Mon, 24 Oct 2011 14:59:55 +0000 (01:59 +1100)]
cgroup/kmemleak: Annotate alloc_page() for cgroup allocations

When the cgroup base was allocated with kmalloc, it was necessary to
annotate the variable with kmemleak_not_leak().  But because it has
recently been changed to be allocated with alloc_page() (which skips
kmemleak checks), the annotation now causes a warning on boot up.

I was triggering this output:

 allocated 8388608 bytes of page_cgroup
 please try 'cgroup_disable=memory' option if you don't want memory cgroups
 kmemleak: Trying to color unknown object at 0xf5840000 as Grey
 Pid: 0, comm: swapper Not tainted 3.0.0-test #12
 Call Trace:
  [<c17e34e6>] ? printk+0x1d/0x1f
  [<c10e2941>] paint_ptr+0x4f/0x78
  [<c178ab57>] kmemleak_not_leak+0x58/0x7d
  [<c108ae9f>] ? __rcu_read_unlock+0x9/0x7d
  [<c1cdb462>] kmemleak_init+0x19d/0x1e9
  [<c1cbf771>] start_kernel+0x346/0x3ec
  [<c1cbf1b4>] ? loglevel+0x18/0x18
  [<c1cbf0aa>] i386_start_kernel+0xaa/0xb0

After a bit of debugging I tracked the object 0xf5840000 (and others) down
to the cgroup code.  The change from allocating base with kmalloc to
alloc_page() has the base not calling kmemleak_alloc() which adds the
pointer to the object_tree_root, but kmemleak_not_leak() adds it to the
crt_early_log[] table.  On kmemleak_init(), the entry is found in the
early_log[] but not the object_tree_root, and this error message is
displayed.

If alloc_page() fails then it defaults back to vmalloc() which still uses
the kmemleak_alloc() which makes us still need the kmemleak_not_leak()
call.  The solution is to call the kmemleak_alloc() directly if the
alloc_page() succeeds.

Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: don't attach task to subsystem if migration failed
Ben Blum [Mon, 24 Oct 2011 14:59:55 +0000 (01:59 +1100)]
cgroups: don't attach task to subsystem if migration failed

If a task has exited to the point it has called cgroup_exit() already,
then we can't migrate it to another cgroup anymore.

This can happen when we are attaching a task to a new cgroup between the
call to ->can_attach_task() on subsystems and the migration that is
eventually tried in cgroup_task_migrate().

In this case cgroup_task_migrate() returns -ESRCH and we don't want to
attach the task to the subsystems because the attachment to the new cgroup
itself failed.

Fix this by only calling ->attach_task() on the subsystems if the cgroup
migration succeeded.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocgroups: more safe tasklist locking in cgroup_attach_proc
Ben Blum [Mon, 24 Oct 2011 14:59:54 +0000 (01:59 +1100)]
cgroups: more safe tasklist locking in cgroup_attach_proc

Fix unstable tasklist locking in cgroup_attach_proc.

According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
not sufficient to guarantee the tasklist is stable w.r.t.  de_thread and
exit.  Taking tasklist_lock for reading, instead of rcu_read_lock, ensures
proper exclusion.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agohfs: fix hfs_find_init() sb->ext_tree NULL ptr oops
Phillip Lougher [Mon, 24 Oct 2011 14:59:48 +0000 (01:59 +1100)]
hfs: fix hfs_find_init() sb->ext_tree NULL ptr oops

Clement Lecigne reports a filesystem which causes a kernel oops in
hfs_find_init() trying to dereference sb->ext_tree which is NULL.

This proves to be because the filesystem has a corrupted MDB extent
record, where the extents file does not fit into the first three extents
in the file record (the first blocks).

In hfs_get_block() when looking up the blocks for the extent file
(HFS_EXT_CNID), it fails the first blocks special case, and falls through
to the extent code (which ultimately calls hfs_find_init()) which is in
the process of being initialised.

HFS avoids this scenario by always having the extents B-tree fit into
the first blocks (the extents B-tree can't have overflow extents).

The fix is to check at mount time that the B-tree fits into first blocks,
i.e.  fail if HFS_I(inode)->alloc_blocks >= HFS_I(inode)->first_blocks

Note, the existing commit 47f365eb57573 ("hfs: fix oops on mount with
corrupted btree extent records") becomes subsumed into this as a special
case, but only for the extents B-tree (HFS_EXT_CNID); it is perfectly
acceptable for the catalog B-tree file to grow beyond three extents, with
the remaining extent descriptors in the extents overflow.

This fixes CVE-2011-2203

Reported-by: Clement LECIGNE <clement.lecigne@netasq.com>
Signed-off-by: Phillip Lougher <plougher@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoisofs: add readpages support
Namjae Jeon [Mon, 24 Oct 2011 14:59:06 +0000 (01:59 +1100)]
isofs: add readpages support

Use mpage_readpages() instead of multiple calls to isofs_readpage() to
reduce CPU utilization and improve performance.

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agominix: describe usage of different magic numbers
Sami Kerola [Mon, 24 Oct 2011 14:59:05 +0000 (01:59 +1100)]
minix: describe usage of different magic numbers

One can get this information from minix/inode.c, but adding the
explanations at the definition sites is more appropriate.

Signed-off-by: Sami Kerola <kerolasa@iki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/rtc/rtc-mc13xxx.c: move probe and remove callbacks to .init.text and .exit...
Uwe Kleine-König [Mon, 24 Oct 2011 14:59:05 +0000 (01:59 +1100)]
drivers/rtc/rtc-mc13xxx.c: move probe and remove callbacks to .init.text and .exit.text

The driver is added using platform_driver_probe(), so the callbacks can be
discarded more aggressively.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agortc: add initial support for mcp7941x parts
David Anders [Mon, 24 Oct 2011 14:59:04 +0000 (01:59 +1100)]
rtc: add initial support for mcp7941x parts

Add initial support for the Microchip mcp7941x series of real time clocks.

The mcp7941x series is generally compatible with the ds1307 and ds1337 rtc
devices from Dallas Semiconductor.  Minor differences include a backup
battery enable bit, and the polarity of the oscillator enable bit.

Signed-off-by: David Anders <danders.dev@gmail.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Reviewed-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/rtc/class.c: convert idr to ida and use ida_simple_get()
Jonathan Cameron [Mon, 24 Oct 2011 14:59:04 +0000 (01:59 +1100)]
drivers/rtc/class.c: convert idr to ida and use ida_simple_get()

This is the one use of an ida that doesn't retry on receiving -EAGAIN.
I'm assuming doing so will cause no harm and may help on a rare occasion.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agooprofilefs: handle zero-length writes
Mike Waychison [Mon, 24 Oct 2011 14:59:04 +0000 (01:59 +1100)]
oprofilefs: handle zero-length writes

Currently in oprofilefs, files that use ulong_fops mis-handle writes of
zero length.  A count of 0 causes oprofilefs_ulong_from_user to return 0
(success), which then leads to oprofile_set_ulong being called to stuff
"value" into file->private_data without it being initialized.

Fix this by moving the check for a zero-length write up into
ulong_write_file.

Signed-off-by: Mike Waychison <mikew@google.com>
Cc: Robert Richter <robert.richter@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoinit/do_mounts_rd.c: fix ramdisk identification for padded cramfs
Neil Armstrong [Mon, 24 Oct 2011 14:59:03 +0000 (01:59 +1100)]
init/do_mounts_rd.c: fix ramdisk identification for padded cramfs

When a cramfs ramdisk padded with 512 bytes is given to the kernel, the
current identify_ramdisk_image function fails to identify it.

Tested with a padded cramfs image on an ARM based board.
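
The check described above can be pictured with a small userspace sketch.
It is not the code from init/do_mounts_rd.c: find_cramfs_magic() and
CRAMFS_PAD_SIZE are hypothetical names, and the magic is compared in host
byte order on the assumption that the image matches the host's endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRAMFS_MAGIC    0x28cd3d45u	/* cramfs superblock magic */
#define CRAMFS_PAD_SIZE 512		/* padding size named in the log above */

/*
 * Hypothetical helper: return the offset of the cramfs superblock in buf
 * (0 or 512), or -1 if the magic is found at neither place.
 */
static long find_cramfs_magic(const unsigned char *buf, size_t len)
{
	static const size_t offsets[] = { 0, CRAMFS_PAD_SIZE };
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
		uint32_t magic;

		/* Skip offsets that would read past the end of the buffer. */
		if (len < offsets[i] + sizeof(magic))
			continue;
		memcpy(&magic, buf + offsets[i], sizeof(magic));
		if (magic == CRAMFS_MAGIC)
			return (long)offsets[i];
	}
	return -1;
}
```

An unpadded image reports offset 0 and a padded one reports 512, so a
caller can tell the two layouts apart instead of rejecting the padded one.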

Signed-off-by: Neil Armstrong <narmstrong@neotion.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Davidlohr Bueso <dave@gnu.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoramfs: remove module leftovers
Richard Weinberger [Mon, 24 Oct 2011 14:59:03 +0000 (01:59 +1100)]
ramfs: remove module leftovers

Since ramfs is hard-selected to "y", the module leftovers make no sense.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agobinfmt_elf: fix PIE execution with randomization disabled
Jiri Kosina [Mon, 24 Oct 2011 14:59:02 +0000 (01:59 +1100)]
binfmt_elf: fix PIE execution with randomization disabled

The case of address space randomization being disabled at runtime through
the randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL at exec() time for certain PIE-linked binaries.

Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.

Based on original patch by H.J. Lu and Josh Boyer.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: H.J. Lu <hongjiu.lu@intel.com>
Cc: <stable@kernel.org>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoepoll: limit paths
Jason Baron [Mon, 24 Oct 2011 14:59:02 +0000 (01:59 +1100)]
epoll: limit paths

The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely.  A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited.  Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm.  In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to .3 seconds.

Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but that is
more complex, since there can be multiple wakeups on different cpus.
Thus, I've opted to limit the number of possible wakeup paths when the
paths are created.

This is accomplished by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link) are
actually the sources for wakeup events.  I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'.  In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5.  Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links.  This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
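
As a rough illustration of the per-length limits listed above, here is a
minimal userspace sketch.  The names (path_limits, path_count,
path_count_inc) are chosen for illustration and are not quoted from the
patch; only the five limit values come from the text.

```c
#include <assert.h>
#include <string.h>

#define PATH_ARR_SIZE 5

/* Allowed number of wakeup paths of length 1, 2, 3, 4 and 5. */
static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
static int path_count[PATH_ARR_SIZE];

/* Reset the counters before checking one 'source file descriptor'. */
static void path_count_init(void)
{
	memset(path_count, 0, sizeof(path_count));
}

/*
 * Record one wakeup path of the given length (1..5).
 * Returns 0 if it is still within the limits, -1 if the path is too long
 * or there are already too many paths of that length.
 */
static int path_count_inc(int length)
{
	assert(length >= 1);
	if (length > PATH_ARR_SIZE)
		return -1;
	if (++path_count[length - 1] > path_limits[length - 1])
		return -1;
	return 0;
}
```

In this sketch a caller would walk the paths emanating from each source
file descriptor, call path_count_inc() once per path, and refuse the new
link as soon as any call returns -1.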

In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file descriptors'.  In
this case, each 'source file descriptor' has 1 path of length 1.  Thus, I
believe that the limits I'm proposing are quite reasonable and in fact may
be too generous.  Thus, I'm hoping that the proposed limits will not cause
any workloads that currently work to fail.

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations.  Currently it's only used in a subset
of the add paths.  I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths.  I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead.  Also worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.

Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1).  However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order.  Thus, this limit is currently easily bypassed.  The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.

Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL.  I've also
tested using the piptest.c epoll tester, which showed no difference in
performance.  I've also created a number of different epoll networks and
tested that they behave as expected.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoepoll: fix spurious lockdep warnings
Nelson Elhage [Mon, 24 Oct 2011 14:59:02 +0000 (01:59 +1100)]
epoll: fix spurious lockdep warnings

epoll can recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd.  This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings.  Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.

Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:

--------------------8<--------------------
#include <sys/epoll.h>

int main(void) {
   int e1, e2;
   struct epoll_event evt = {
       .events = EPOLLIN
   };

   e1 = epoll_create1(0);
   e2 = epoll_create1(0);
   /* Make e1 monitor e2: one epoll fd watching another. */
   epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
   return 0;
}
--------------------8<--------------------

Reported-by: Paul Bolle <pebolle@tiscali.nl>
Tested-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Nelson Elhage <nelhage@nelhage.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib-crc-add-slice-by-8-algorithm-to-crc32c-fix
Andrew Morton [Mon, 24 Oct 2011 14:59:01 +0000 (01:59 +1100)]
lib-crc-add-slice-by-8-algorithm-to-crc32c-fix

don't include asm/msr.h

Cc: Bob Pearson <rpearson@systemfabricworks.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: frank zago <fzago@systemfabricworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/crc: add slice by 8 algorithm to crc32.c
frank zago [Mon, 24 Oct 2011 14:59:01 +0000 (01:59 +1100)]
lib/crc: add slice by 8 algorithm to crc32.c

Add support for slice by 8 to existing crc32 algorithm.  Also modify
gen_crc32table.c to only produce table entries that are actually used.
The parameters CRC_LE_BITS and CRC_BE_BITS determine the number of bits in
the input array that are processed during each step.  Generally the more
bits the faster the algorithm is but the more table data required.

Using an x86_64 Opteron machine running at 2100MHz the following table was
collected with a pre-warmed cache by computing the crc 1000 times on a
buffer of 4096 bytes.

BITS    Size    LE Cycles/byte  BE Cycles/byte
----------------------------------------------
1        873    41.65           34.60
2       1097    25.43           29.61
4       1057    13.29           15.28
8       2913     7.13            8.19
32      9684     2.80            2.82
64     18178     1.53            1.53

BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
default was 8 which actually selected the 32 bit algorithm. In
this version the value 8 is used to select the standard
8 bit algorithm and two new values: 32 and 64 are introduced
to select the slice by 4 and slice by 8 algorithms respectively.

Where Size is the size of crc32.o's text segment which includes
code and table data when both LE and BE versions are set to BITS.

The current version of crc32.c by default uses the slice by 4 algorithm
which requires about 2.8 cycles per byte.  The slice by 8 algorithm is
roughly 2X faster and enables packet processing at over 1GB/sec on a
typical 2-3GHz system.
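
For readers unfamiliar with the technique, the following is a minimal
userspace sketch of slice by 8 for the little-endian CRC, not the code
added to lib/crc32.c.  The function and table names are hypothetical, the
tables are built at run time rather than by gen_crc32table.c, and as an
assumption of this sketch the caller applies the usual ~0 seed and final
inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_LE 0xedb88320u	/* reflected CRC-32 polynomial */

static uint32_t crc_tab[8][256];

/* Build the byte-at-a-time table, then derive the 7 extra slice tables. */
static void crc32_sketch_init(void)
{
	uint32_t i;
	int k, t;

	for (i = 0; i < 256; i++) {
		uint32_t c = i;

		for (k = 0; k < 8; k++)
			c = (c & 1) ? (c >> 1) ^ CRCPOLY_LE : c >> 1;
		crc_tab[0][i] = c;
	}
	for (i = 0; i < 256; i++)
		for (t = 1; t < 8; t++)
			crc_tab[t][i] = (crc_tab[t - 1][i] >> 8) ^
					crc_tab[0][crc_tab[t - 1][i] & 0xff];
}

/* Reference version: one table lookup per input byte (BITS = 8). */
static uint32_t crc32_bytewise(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
		crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
	return crc;
}

/* Slice by 8: consume 8 input bytes per step using 8 lookups (BITS = 64). */
static uint32_t crc32_slice8(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);
	while (len >= 8) {
		/* Assemble two little-endian 32-bit words byte by byte. */
		uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
				     (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
		uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
			      (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;

		crc = crc_tab[7][lo & 0xff] ^ crc_tab[6][(lo >> 8) & 0xff] ^
		      crc_tab[5][(lo >> 16) & 0xff] ^ crc_tab[4][lo >> 24] ^
		      crc_tab[3][hi & 0xff] ^ crc_tab[2][(hi >> 8) & 0xff] ^
		      crc_tab[1][(hi >> 16) & 0xff] ^ crc_tab[0][hi >> 24];
		p += 8;
		len -= 8;
	}
	/* Fewer than 8 bytes left: finish one byte at a time. */
	return crc32_bytewise(crc, p, len);
}
```

The sketch uses 8 tables of 256 32-bit entries, i.e. 8KB of table data.
This is the table-size versus speed trade-off that the BITS parameter
selects.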

Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agocheckpatch: add a --strict check for utf-8 in commit logs
Joe Perches [Mon, 24 Oct 2011 14:59:01 +0000 (01:59 +1100)]
checkpatch: add a --strict check for utf-8 in commit logs

Some find using utf-8 in commit logs inappropriate.

Some patch commit logs contain unintended utf-8 characters when doing
things like copy/pasting compilation output.

Look for the start of any commit log by skipping initial lines that look
like email headers and "From: " lines.

Stop looking for utf-8 at the first signature line.

Signed-off-by: Joe Perches <joe@perches.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agokernel.h/checkpatch: mark strict_strto<foo> and simple_strto<foo> as obsolete
Joe Perches [Mon, 24 Oct 2011 14:59:00 +0000 (01:59 +1100)]
kernel.h/checkpatch: mark strict_strto<foo> and simple_strto<foo> as obsolete

Mark obsolete/deprecated strict_strto<foo> and simple_strto<foo> functions
and macros as obsolete.

Update checkpatch to warn about their use.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agollist-return-whether-list-is-empty-before-adding-in-llist_add-fix
Andrew Morton [Mon, 24 Oct 2011 14:59:00 +0000 (01:59 +1100)]
llist-return-whether-list-is-empty-before-adding-in-llist_add-fix

clarify comment

Cc: Huang Ying <ying.huang@intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agowireless: at76c50x: follow rename pack_hex_byte to hex_byte_pack
Andy Shevchenko [Mon, 24 Oct 2011 14:58:59 +0000 (01:58 +1100)]
wireless: at76c50x: follow rename pack_hex_byte to hex_byte_pack

There is no functional change.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agofat: follow rename pack_hex_byte() to hex_byte_pack()
Andy Shevchenko [Mon, 24 Oct 2011 14:58:59 +0000 (01:58 +1100)]
fat: follow rename pack_hex_byte() to hex_byte_pack()

There is no functional change.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agosecurity: follow rename pack_hex_byte() to hex_byte_pack()
Andy Shevchenko [Mon, 24 Oct 2011 14:58:59 +0000 (01:58 +1100)]
security: follow rename pack_hex_byte() to hex_byte_pack()

There is no functional change.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Mimi Zohar <zohar@us.ibm.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/string.c: fix strim() semantics for strings that have only blanks
Michael Holzheu [Mon, 24 Oct 2011 14:58:58 +0000 (01:58 +1100)]
lib/string.c: fix strim() semantics for strings that have only blanks

Commit 84c95c9acf0 ("string: on strstrip(), first remove leading spaces
before running over str") improved the performance of the strim()
function.

Unfortunately this changed the semantics of strim() and broke my code.
Before the patch it was possible to use strim() without using the return
value for removing trailing spaces from strings that had either only
blanks or only trailing blanks.

Now this does not work any longer for strings that *only* have blanks.

Before patch: "   " -> ""    (empty string)
After patch:  "   " -> "   " (no change)

I think we should remove your patch to restore the old behavior.

The description (lib/string.c):

 * Note that the first trailing whitespace is replaced with a %NUL-terminator

=> The first trailing whitespace of a string that only has whitespace
   characters is the first whitespace

The patch restores the old strim() semantics.
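
The semantics being restored can be sketched in userspace as follows.
strim_sketch() is a hypothetical stand-in for the kernel function, written
only to show the ordering that matters: trailing whitespace is cut in
place first, and leading whitespace is skipped only in the return value.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's strim(). */
static char *strim_sketch(char *s)
{
	size_t len;

	assert(s != NULL);
	len = strlen(s);

	/* Trim from the end first, so an all-blank buffer becomes "". */
	while (len > 0 && isspace((unsigned char)s[len - 1]))
		len--;
	s[len] = '\0';

	/* Then skip leading blanks; only the return value reflects this. */
	while (isspace((unsigned char)*s))
		s++;
	return s;
}
```

A caller that ignores the return value still gets trailing blanks removed
from the buffer, including the case where the buffer holds nothing but
blanks.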

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Andre Goddard Rosa <andre.goddard@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/idr.c: fix comment for ida_get_new_above()
Wang Sheng-Hui [Mon, 24 Oct 2011 14:58:57 +0000 (01:58 +1100)]
lib/idr.c: fix comment for ida_get_new_above()

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef
Glauber Costa [Mon, 24 Oct 2011 14:58:57 +0000 (01:58 +1100)]
lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef

These variables are only used when CONFIG_HOTPLUG_CPU is enabled, they are
ifdef'ed everywhere else.  So don't define them when CONFIG_HOTPLUG_CPU is
not enabled.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib-bitmapc-quiet-sparse-noise-about-address-space-fix
Andrew Morton [Mon, 24 Oct 2011 14:58:57 +0000 (01:58 +1100)]
lib-bitmapc-quiet-sparse-noise-about-address-space-fix

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: H Hartley Sweeten <hartleys@visionengravers.com>
Cc: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/bitmap.c: quiet sparse noise about address space
H Hartley Sweeten [Mon, 24 Oct 2011 14:58:56 +0000 (01:58 +1100)]
lib/bitmap.c: quiet sparse noise about address space

__bitmap_parse() and __bitmap_parselist() both take a pointer to a kernel
buffer as a parameter and then cast it to a pointer to user buffer for use
in cases when the parameter is_user indicates that the buffer is actually
located in user space.  This casting, and the casts in the callers,
results in sparse noise like the following:

warning: incorrect type in initializer (different address spaces)
  expected char const [noderef] <asn:1>*ubuf
  got char const *buf
warning: cast removes address space of expression

Since these casts are intentional, use __force to quiet the noise.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/spinlock_debug.c: print owner on spinlock lockup
Akinobu Mita [Mon, 24 Oct 2011 14:58:56 +0000 (01:58 +1100)]
lib/spinlock_debug.c: print owner on spinlock lockup

When SPIN_BUG_ON is triggered, the lock owner information is reported.
But it is omitted when spinlock lockup is detected.

This information is useful especially on the architectures which don't
implement trigger_all_cpu_backtrace() that is called just after detecting
lockup.  So report it and also avoid message format duplication.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agolib/kstrtox: common code between kstrto*() and simple_strto*() functions
Alexey Dobriyan [Mon, 24 Oct 2011 14:58:55 +0000 (01:58 +1100)]
lib/kstrtox: common code between kstrto*() and simple_strto*() functions

Currently termination logic (\0 or \n\0) is hardcoded in _kstrtoull(),
avoid that for code reuse between kstrto*() and simple_strtoull().
Essentially, make them different only in termination logic.

simple_strtoull() (and scanf(), BTW) ignores integer overflow, that's a
bug we currently don't have guts to fix, making KSTRTOX_OVERFLOW hack
necessary.
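
A minimal userspace sketch of the idea follows: one shared digit parser,
with the two front ends differing only in how they treat what follows the
digits.  All names here are hypothetical, OVERFLOW_FLAG merely stands in
for KSTRTOX_OVERFLOW, and base auto-detection and sign handling are left
out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define OVERFLOW_FLAG (1U << 31)	/* stand-in for KSTRTOX_OVERFLOW */

/*
 * Shared core: parse digits of the given base (2..16) from s.
 * Returns the number of characters consumed, with OVERFLOW_FLAG or'ed in
 * if the value did not fit in an unsigned long long.
 */
static unsigned int parse_integer_sketch(const char *s, unsigned int base,
					  unsigned long long *res)
{
	unsigned long long acc = 0;
	unsigned int n = 0;
	int overflow = 0;

	assert(base >= 2 && base <= 16);
	for (;; s++, n++) {
		int c = tolower((unsigned char)*s);
		unsigned int val;

		if (c >= '0' && c <= '9')
			val = c - '0';
		else if (c >= 'a' && c <= 'f')
			val = c - 'a' + 10;
		else
			break;
		if (val >= base)
			break;
		if (acc > (ULLONG_MAX - val) / base)
			overflow = 1;
		acc = acc * base + val;	/* wraps on overflow, as flagged */
	}
	*res = acc;
	return overflow ? (n | OVERFLOW_FLAG) : n;
}

/* Strict front end: digits must be followed by "\0" or "\n\0". */
static int kstrtoull_sketch(const char *s, unsigned int base,
			    unsigned long long *res)
{
	unsigned long long v;
	unsigned int rv = parse_integer_sketch(s, base, &v);

	if (rv & OVERFLOW_FLAG)
		return -ERANGE;
	if (rv == 0)
		return -EINVAL;
	s += rv;
	if (*s == '\n')
		s++;
	if (*s)
		return -EINVAL;
	*res = v;
	return 0;
}

/* Lenient front end: stop at the first non-digit and ignore overflow. */
static unsigned long long simple_strtoull_sketch(const char *s, char **endp,
						 unsigned int base)
{
	unsigned long long v;
	unsigned int rv = parse_integer_sketch(s, base, &v);

	if (endp)
		*endp = (char *)s + (rv & ~OVERFLOW_FLAG);
	return v;
}
```

Given "42abc", the strict front end returns -EINVAL, while the lenient one
returns 42 and leaves the end pointer at 'a'.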

Almost forgot: patch shrinks code size by about 80 bytes on x86_64.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers-leds-leds-lp5521c-check-if-reset-is-successful-fix
Andrew Morton [Mon, 24 Oct 2011 14:58:55 +0000 (01:58 +1100)]
drivers-leds-leds-lp5521c-check-if-reset-is-successful-fix

fix up code comment

Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Naga Radhesh <naga.radheshy@stericsson.com>
Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
Cc: Srinidhi KASAGAR <srinidhi.kasagar@stericsson.com>
Cc: srinidhi kasagar <srinidhi.kasagar@stericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agodrivers/leds/leds-lp5521.c: check if reset is successful
Srinidhi KASAGAR [Mon, 24 Oct 2011 14:58:55 +0000 (01:58 +1100)]
drivers/leds/leds-lp5521.c: check if reset is successful

Make sure that the reset is successful by issuing a dummy read to R
channel current register and check its default value.  On some platforms,
without this dummy read, any further access to {R/G/B}_EXEC will not have
any impact.

Signed-off-by: srinidhi kasagar <srinidhi.kasagar@stericsson.com>
Tested-by: Naga Radhesh <naga.radheshy@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoleds: turn the blink_timer off before starting to blink
Antonio Ospite [Mon, 24 Oct 2011 14:58:54 +0000 (01:58 +1100)]
leds: turn the blink_timer off before starting to blink

Depending on the implementation of the hardware blinking function in
blink_set(), the led can support hardware blinking for some values of
delay_on and delay_off and fall back to software blinking for some other
values.

Turning off the blink_timer unconditionally before starting to blink
makes sure that a sequence like:

  OFF
  hardware blinking
  software blinking
  hardware blinking

does not leave the software blinking timer active.

Signed-off-by: Antonio Ospite <ospite@studenti.unina.it>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>