12 years ago: Add linux-next specific files for 20110923 (next-20110923)
Stephen Rothwell [Fri, 23 Sep 2011 05:38:27 +0000 (15:38 +1000)]
Add linux-next specific files for 20110923

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
12 years ago: Merge branch 'akpm'
Stephen Rothwell [Fri, 23 Sep 2011 05:16:03 +0000 (15:16 +1000)]
Merge branch 'akpm'

12 years ago: Fixes the deadlock when inserting and removing the ds2780.
Clifton Barnes [Wed, 24 Aug 2011 23:47:59 +0000 (09:47 +1000)]
Fixes the deadlock when inserting and removing the ds2780.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Adds a nolock function to the w1 interface to avoid locking the
Clifton Barnes [Wed, 24 Aug 2011 23:47:58 +0000 (09:47 +1000)]
Adds a nolock function to the w1 interface to avoid locking the
mutex if needed.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Simply creates one point to call the w1 interface.
Clifton Barnes [Wed, 24 Aug 2011 23:47:57 +0000 (09:47 +1000)]
Simply creates one point to call the w1 interface.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Straightforward. As an aside, the ida_init calls are not needed as far as
Jonathan Cameron [Wed, 24 Aug 2011 23:47:56 +0000 (09:47 +1000)]
Straightforward.  As an aside, the ida_init calls are not needed as far as
I can see.  (DEFINE_IDA does the same already.)

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Clifton Barnes <cabarnes@indesign-llc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: remove unneeded cast of void*
Andrew Morton [Wed, 24 Aug 2011 23:47:55 +0000 (09:47 +1000)]
remove unneeded cast of void*

Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Cc: James Nuss <jamesnuss@nanometrics.ca>
Cc: Ricardo Martins <rasm@fe.up.pt>
Cc: Rodolfo Giometti <giometti@linux.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: This client driver allows you to use a GPIO pin as a source for PPS
James Nuss [Wed, 24 Aug 2011 23:47:54 +0000 (09:47 +1000)]
This client driver allows you to use a GPIO pin as a source for PPS
signals.  Platform data [1] are used to specify the GPIO pin number,
label, assert event edge type, and whether clear events are captured.

This driver is based on the work by Ricardo Martins who submitted an
initial implementation [2] of a PPS IRQ client driver to the linuxpps
mailing-list on Dec 3 2010.

[1] include/linux/pps-gpio.h
[2] http://ml.enneenne.com/pipermail/linuxpps/2010-December/004155.html

Signed-off-by: James Nuss <jamesnuss@nanometrics.ca>
Cc: Ricardo Martins <rasm@fe.up.pt>
Acked-by: Rodolfo Giometti <giometti@linux.it>
Cc: Ricardo Martins <rasm@fe.up.pt>
Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: A default echo function has been provided so it is no longer an error when
James Nuss [Wed, 24 Aug 2011 23:47:54 +0000 (09:47 +1000)]
A default echo function has been provided so it is no longer an error when
you specify PPS_ECHOASSERT or PPS_ECHOCLEAR without an explicit echo
function.  This allows some code re-use and also makes it easier to write
client drivers since the default echo function does not normally need to
change.

Signed-off-by: James Nuss <jamesnuss@nanometrics.ca>
Reviewed-by: Ben Gardiner <bengardiner@nanometrics.ca>
Acked-by: Rodolfo Giometti <giometti@linux.it>
Cc: Ricardo Martins <rasm@fe.up.pt>
Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: When I tried to send a patch to remove it, Andi told me we still need to
WANG Cong [Wed, 24 Aug 2011 23:47:49 +0000 (09:47 +1000)]
When I tried to send a patch to remove it, Andi told me we still need to
keep compatibility with old libc, so we can't remove this completely.
Instead, just make it default to n and remove the doc from
feature-removal-schedule.txt.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Add RapidIO mport driver for IDT TSI721 PCI Express-to-SRIO bridge device.
Alexandre Bounine [Wed, 24 Aug 2011 23:47:48 +0000 (09:47 +1000)]
Add RapidIO mport driver for IDT TSI721 PCI Express-to-SRIO bridge device.
The driver provides the full set of callback functions defined for mport
devices in the RapidIO subsystem.  It is also compatible with the current
version of the RIONET driver (Ethernet over RapidIO messaging services).

This patch is applicable to kernel versions starting from 2.6.39.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Chul Kim <chul.kim@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: The discovered bit in PGCCSR register indicates if the device has been
Liu Gang [Wed, 24 Aug 2011 23:47:47 +0000 (09:47 +1000)]
The discovered bit in the PGCCSR register indicates whether the device has
been discovered by the system host.  In RapidIO systems, some agent
devices can also be master devices: they can issue requests into the
system.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: After merging the akpm tree, today's linux-next build (lots of them)
Stephen Rothwell [Wed, 24 Aug 2011 23:47:46 +0000 (09:47 +1000)]
After merging the akpm tree, today's linux-next build (lots of them)
produced this warning:

WARNING: init/mounts.o(.text+0x192): Section mismatch in reference from the function devt_from_partuuid() to the variable .init.data:root_wait
The function devt_from_partuuid() references
the variable __initdata root_wait.
This is often because devt_from_partuuid lacks a __initdata
annotation or the annotation of root_wait is wrong.

Commit 185237e4cfab ("This patch makes two changes:"
init-add-root=partuuid=uuid-partnroff=%d-support-update.patch) adds the
reference to root_wait from the non-init function devt_from_partuuid().

The easiest thing to do is to remove __initdata from root_wait.

I have applied this patch as a merge fixup for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 10 Aug 2011 12:09:35 +1000
Subject: [PATCH] do_mounts: remove __init_data from root_wait

as it is now used from a non-init routine.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Expand root=PARTUUID=UUID syntax to support selecting a root partition by
Will Drewry [Wed, 24 Aug 2011 23:47:45 +0000 (09:47 +1000)]
Expand root=PARTUUID=UUID syntax to support selecting a root partition by
integer offset from a known, unique partition.  This approach provides
similar properties to specifying a device and partition number, but using
the UUID as the unique path prior to evaluating the offset.

For example,
  root=PARTUUID=99DE9194-FC15-4223-9192-FC243948F88B/PARTNROFF=1
selects the partition with UUID 99DE.. then selects the next
partition.

This change is motivated by a particular usecase in Chromium OS where the
bootloader can easily determine what partition it is on (by UUID) but
doesn't perform general partition table walking.

That said, support for this model provides a direct mechanism for the user
to modify the root partition to boot without specifically needing to
extract each UUID or update the bootloader explicitly when the root
partition UUID is changed (if it is recreated to be larger, for instance).
Pinning to a /boot-style partition UUID allows arbitrary root partition
reconfiguration/modification with slightly less ambiguity than just
[dev][partition] and less stringency than the specific root partition
UUID.

Signed-off-by: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Force this on for -next/mm testing purposes.
Andrew Morton [Wed, 24 Aug 2011 23:47:45 +0000 (09:47 +1000)]
Force this on for -next/mm testing purposes.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: On reading sysctl dirs we should return -EISDIR instead of -EINVAL.
Pavel Emelyanov [Wed, 24 Aug 2011 23:47:44 +0000 (09:47 +1000)]
On reading sysctl dirs we should return -EISDIR instead of -EINVAL.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Both mem_cgroup_charge_statistics() and mem_cgroup_move_account() were
Greg Thelen [Wed, 24 Aug 2011 23:47:43 +0000 (09:47 +1000)]
Both mem_cgroup_charge_statistics() and mem_cgroup_move_account() were
unnecessarily disabling preemption when adjusting per-cpu counters:
    preempt_disable()
    __this_cpu_xxx()
    __this_cpu_yyy()
    preempt_enable()

This change does not disable preemption and thus CPU switch is possible
within these routines.  This does not cause a problem because the total
of all cpu counters is summed when reporting stats.  Now both
mem_cgroup_charge_statistics() and mem_cgroup_move_account() look like:
    this_cpu_xxx()
    this_cpu_yyy()

akpm: this is an optimisation for x86 and a deoptimisation for non-x86.
The non-x86 situation will be fixed as architectures implement their
atomic this_cpu_foo() operations.

Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: If somebody is touching data too early, it might be easier to diagnose a
Igor Mammedov [Wed, 24 Aug 2011 23:47:42 +0000 (09:47 +1000)]
If somebody is touching data too early, it might be easier to diagnose a
problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
understand why mem_cgroup_per_zone is [un|partly]initialized.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: While back-porting Johannes Weiner's patch "mm: memcg-aware global
Andrew Bresticker [Wed, 24 Aug 2011 23:47:41 +0000 (09:47 +1000)]
While back-porting Johannes Weiner's patch "mm: memcg-aware global
reclaim" for an internal effort, we noticed a significant performance
regression during page-reclaim heavy workloads due to high contention of
the ss->id_lock.  This lock protects the idr map, and serializes calls to
idr_get_next() in css_get_next() (which is used during the memcg hierarchy
walk).  Since idr_get_next() is just doing a look up, we need only
serialize it with respect to idr_remove()/idr_get_new().  By making the
ss->id_lock a rwlock, contention is greatly reduced and performance
improves.

Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
each core (one file + container per core) in parallel on a NUMA machine.
Result is the time for the test to complete in 1 of the containers.  Both
kernels included Johannes' memcg-aware global reclaim patches.

Before rwlock patch: 1710.778s
After rwlock patch: 152.227s
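
A minimal sketch of the locking change (field and function names as used
by the cgroup code of this era; the exact call sites are illustrative):

    /* before: lookups and updates serialized on the same spinlock */
    spin_lock(&ss->id_lock);
    tmp = idr_get_next(&ss->idr, &tmpid);
    spin_unlock(&ss->id_lock);

    /* after: lookups in css_get_next() can run in parallel ... */
    read_lock(&ss->id_lock);
    tmp = idr_get_next(&ss->idr, &tmpid);
    read_unlock(&ss->id_lock);

    /* ... while idr_remove()/idr_get_new() remain exclusive */
    write_lock(&ss->id_lock);
    idr_remove(&ss->idr, id);
    write_unlock(&ss->id_lock);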

Signed-off-by: Andrew Bresticker <abrestic@google.com>
Cc: Paul Menage <menage@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Before calling schedule_timeout(), the task state should be changed.
KAMEZAWA Hiroyuki [Wed, 24 Aug 2011 23:47:40 +0000 (09:47 +1000)]
Before calling schedule_timeout(), the task state should be changed.
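
For reference, the usual pattern (a sketch, not the exact hunk):

    /* buggy: still TASK_RUNNING, so the task does not really sleep;
       schedule_timeout() degenerates into a yield */
    schedule_timeout(timeout);

    /* fixed: declare the sleep state first */
    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout(timeout);

schedule_timeout_interruptible(timeout) wraps both steps and is often the
tidier choice.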

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
Raghavendra K T [Wed, 24 Aug 2011 23:47:40 +0000 (09:47 +1000)]
The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
"struct mem_cgroup *memcg".  Rename all mem variables to memcg in source
file.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: One can get this information from minix/inode.c, but adding the
Sami Kerola [Wed, 24 Aug 2011 23:47:39 +0000 (09:47 +1000)]
One can get this information from minix/inode.c, but adding the
explanations at the definition sites is more appropriate.

Signed-off-by: Sami Kerola <kerolasa@iki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: This is the one use of an ida that doesn't retry on receiving -EAGAIN.
Jonathan Cameron [Wed, 24 Aug 2011 23:47:38 +0000 (09:47 +1000)]
This is the one use of an ida that doesn't retry on receiving -EAGAIN.
I'm assuming doing so will cause no harm and may help on a rare occasion.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Currently in oprofilefs, files that use ulong_fops mis-handle writes of
Mike Waychison [Wed, 24 Aug 2011 23:47:38 +0000 (09:47 +1000)]
Currently in oprofilefs, files that use ulong_fops mis-handle writes of
zero length.  A count of 0 causes oprofilefs_ulong_from_user to return 0
(success), which then leads to oprofile_set_ulong being called to stuff
"value" into file->private_data without it being initialized.

Fix this by moving the check for a zero-length write up into
ulong_write_file.
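
A sketch of the shape of the fix (simplified from the oprofilefs paths
named above):

    static ssize_t ulong_write_file(struct file *file,
                                    char const __user *buf,
                                    size_t count, loff_t *offset)
    {
        unsigned long value;
        int retval;

        if (*offset)
            return -EINVAL;

        /* handle zero-length writes here, before "value" could be
           consumed uninitialized further down */
        if (!count)
            return 0;

        retval = oprofilefs_ulong_from_user(&value, buf, count);
        if (retval)
            return retval;

        retval = oprofile_set_ulong(file->private_data, value);
        return retval ? retval : count;
    }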

Signed-off-by: Mike Waychison <mikew@google.com>
Cc: Robert Richter <robert.richter@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: epoll can recursively acquire ep->mtx on multiple "struct
Nelson Elhage [Wed, 24 Aug 2011 23:47:37 +0000 (09:47 +1000)]
epoll can recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd.  This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings.  Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.

Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:

--------------------8<--------------------

#include <sys/epoll.h>

int main(void) {
	int e1, e2;
	struct epoll_event evt = {
		.events = EPOLLIN
	};

	e1 = epoll_create1(0);
	e2 = epoll_create1(0);
	epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
	return 0;
}
--------------------8<--------------------

Reported-by: Paul Bolle <pebolle@tiscali.nl>
Tested-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Nelson Elhage <nelhage@nelhage.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: don't include asm/msr.h
Andrew Morton [Wed, 24 Aug 2011 23:47:37 +0000 (09:47 +1000)]
don't include asm/msr.h

Cc: Bob Pearson <rpearson@systemfabricworks.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: frank zago <fzago@systemfabricworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Add support for slice by 8 to existing crc32 algorithm. Also modify
frank zago [Wed, 24 Aug 2011 23:47:36 +0000 (09:47 +1000)]
Add support for slice by 8 to existing crc32 algorithm.  Also modify
gen_crc32table.c to only produce table entries that are actually used.
The parameters CRC_LE_BITS and CRC_BE_BITS determine the number of bits in
the input array that are processed during each step.  Generally, the more
bits, the faster the algorithm, but the more table data is required.

Using an x86_64 Opteron machine running at 2100MHz the following table was
collected with a pre-warmed cache by computing the crc 1000 times on a
buffer of 4096 bytes.

BITS   Size   LE Cycles/byte   BE Cycles/byte
----------------------------------------------
  1     873       41.65            34.60
  2    1097       25.43            29.61
  4    1057       13.29            15.28
  8    2913        7.13             8.19
 32    9684        2.80             2.82
 64   18178        1.53             1.53

BITS is the value of CRC_LE_BITS or CRC_BE_BITS, and Size is the size of
crc32.o's text segment, which includes code and table data when both LE
and BE versions are set to BITS.  The old default was 8, which actually
selected the 32-bit algorithm.  In this version the value 8 is used to
select the standard 8-bit algorithm, and two new values, 32 and 64, are
introduced to select the slice-by-4 and slice-by-8 algorithms
respectively.

The current version of crc32.c by default uses the slice by 4 algorithm
which requires about 2.8 cycles per byte.  The slice by 8 algorithm is
roughly 2X faster and enables packet processing at over 1GB/sec on a
typical 2-3GHz system.
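
For illustration, a self-contained userspace sketch of the slice-by-8
technique (little-endian CRC32, polynomial 0xEDB88320; this shows the
general idea, not the kernel's generated-table implementation):

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t tab[8][256];

    static void crc32_build_tables(void)
    {
        int i, t, k;

        for (i = 0; i < 256; i++) {
            uint32_t c = i;
            for (k = 0; k < 8; k++)
                c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
            tab[0][i] = c;          /* classic byte-at-a-time table */
        }
        for (t = 1; t < 8; t++)     /* tab[t][i]: byte i followed by t zero bytes */
            for (i = 0; i < 256; i++)
                tab[t][i] = (tab[t - 1][i] >> 8) ^ tab[0][tab[t - 1][i] & 0xff];
    }

    /* eight bytes per iteration: two 32-bit loads, eight table lookups */
    static uint32_t crc32_slice8(uint32_t crc, const unsigned char *p, size_t len)
    {
        crc = ~crc;
        while (len >= 8) {
            uint32_t lo = crc ^ (p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24);
            uint32_t hi = p[4] | p[5] << 8 | p[6] << 16 | (uint32_t)p[7] << 24;

            crc = tab[7][lo & 0xff] ^ tab[6][(lo >> 8) & 0xff] ^
                  tab[5][(lo >> 16) & 0xff] ^ tab[4][lo >> 24] ^
                  tab[3][hi & 0xff] ^ tab[2][(hi >> 8) & 0xff] ^
                  tab[1][(hi >> 16) & 0xff] ^ tab[0][hi >> 24];
            p += 8;
            len -= 8;
        }
        while (len--)               /* byte-at-a-time tail */
            crc = tab[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
        return ~crc;
    }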

Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Currently termination logic (\0 or \n\0) is hardcoded in _kstrtoull(),
Alexey Dobriyan [Wed, 24 Aug 2011 23:47:36 +0000 (09:47 +1000)]
Currently the termination logic (\0 or \n\0) is hardcoded in _kstrtoull();
avoid that, for code reuse between kstrto*() and simple_strtoull().
Essentially, make them differ only in termination logic.

simple_strtoull() (and scanf(), BTW) ignores integer overflow; that's a
bug we currently don't have the guts to fix, making the KSTRTOX_OVERFLOW
hack necessary.

Almost forgot: the patch shrinks code size by about 80 bytes on x86_64.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: The memory for struct led_trigger should be kfreed in the
Masakazu Mokuno [Wed, 24 Aug 2011 23:47:36 +0000 (09:47 +1000)]
The memory for struct led_trigger should be kfreed in the
led_trigger_register() error path.  Also this function should return NULL
on error.

Signed-off-by: Masakazu Mokuno <mokuno@sm.sony.co.jp>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: include linux/module.h
Axel Lin [Wed, 24 Aug 2011 23:47:35 +0000 (09:47 +1000)]
include linux/module.h

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Cc: Magnus Damm <damm@opensource.se>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Add V2 of the LED driver for a single timer channel for the TPU hardware
Magnus Damm [Wed, 24 Aug 2011 23:47:35 +0000 (09:47 +1000)]
Add V2 of the LED driver for a single timer channel for the TPU hardware
block commonly found in Renesas SoCs.

The driver has been written with optimal Power Management in mind, so to
save power the LED is driven as a regular GPIO pin in the cases of maximum
brightness and power off, which allows the TPU hardware to be idle and
which in turn allows the clocks to be stopped and the power domain to be
turned off transparently.

Any other brightness level requires use of the TPU hardware in PWM mode.
TPU hardware device clocks and power are managed through Runtime PM.
System suspend and resume is known to be working - during suspend the LED
is set to off by the generic LED code.

The TPU hardware timer is equipped with a 16-bit counter together with an
up-to-divide-by-64 prescaler which makes the hardware suitable for
brightness control.  Hardware blink is unsupported.

The LED PWM waveform has been verified with a Fluke 123 Scope meter on a
sh7372 Mackerel board.  Tested with experimental sh7372 A3SP power domain
patches.  Platform device bind/unbind tested ok.

V2 has been tested on the DS2 LED of the sh73a0-based AG5EVM.

Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Cc: Al Viro <viro@zeniv.linux.org.uk>
Andrew Morton [Wed, 24 Aug 2011 23:47:34 +0000 (09:47 +1000)]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: We are enabling some power features on medfield. To test suspend-2-RAM
Yanmin Zhang [Wed, 24 Aug 2011 23:47:34 +0000 (09:47 +1000)]
We are enabling some power features on Medfield.  To test suspend-to-RAM
conveniently, we need to turn console_suspend_enabled on/off frequently.

Add a module parameter so users can change it via:
/sys/module/printk/parameters/console_suspend
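
The mechanism is the stock module_param one; a sketch of roughly what
this looks like in kernel/printk.c (the exact type and permission flags
are illustrative):

    module_param_named(console_suspend, console_suspend_enabled,
                       int, S_IRUGO | S_IWUSR);
    MODULE_PARM_DESC(console_suspend,
                     "suspend console during suspend and hibernate operations");

Since printk is built in, the parameter still shows up under
/sys/module/printk/parameters/, so writing 0 or 1 there toggles the
behaviour at runtime.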

Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: We are enabling some power features on medfield. To test suspend-2-RAM
Yanmin Zhang [Wed, 24 Aug 2011 23:47:33 +0000 (09:47 +1000)]
We are enabling some power features on Medfield.  To test suspend-to-RAM
conveniently, we need to turn ignore_loglevel on/off frequently without
rebooting.

Add a module parameter so users can change it via:
/sys/module/printk/parameters/ignore_loglevel

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWe are enabling some power features on medfield. To test suspend-2-RAM
Yanmin Zhang [Wed, 24 Aug 2011 23:47:33 +0000 (09:47 +1000)]
We are enabling some power features on Medfield.  To test suspend-to-RAM
conveniently, we need to turn ignore_loglevel on/off frequently without
rebooting.

Add a module parameter so users can change it via:
/sys/module/printk/parameters/ignore_loglevel

Signed-off-by: Yanmin Zhang <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Éric Piel [Wed, 24 Aug 2011 23:47:33 +0000 (09:47 +1000)]
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Change exported functions to use the device given as parameter
Éric Piel [Wed, 24 Aug 2011 23:47:32 +0000 (09:47 +1000)]
Change exported functions to use the device given as parameter
instead of the global one.

Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Éric Piel [Wed, 24 Aug 2011 23:47:31 +0000 (09:47 +1000)]
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Éric Piel [Wed, 24 Aug 2011 23:47:31 +0000 (09:47 +1000)]
Signed-off-by: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Add axis correction for HP ProBook 6555b.
Éric Piel [Wed, 24 Aug 2011 23:47:30 +0000 (09:47 +1000)]
Add axis correction for HP ProBook 6555b.

Signed-off-by: Malte Starostik <m-starostik@versanet.de>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Add axis correction for HP EliteBook 8540w.
Éric Piel [Wed, 24 Aug 2011 23:47:29 +0000 (09:47 +1000)]
Add axis correction for HP EliteBook 8540w.

Reported-by: Lyall Pearce <lyall.pearce@hp.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Add axis correction for HP EliteBook 2730p.
Éric Piel [Wed, 24 Aug 2011 23:47:29 +0000 (09:47 +1000)]
Add axis correction for HP EliteBook 2730p.

Tested-by: Witold Pilat <witold.pilat@gmail.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: In the move of the lis3 driver, the hp_accel.c file got dropped from the
Éric Piel [Wed, 24 Aug 2011 23:47:28 +0000 (09:47 +1000)]
In the move of the lis3 driver, the hp_accel.c file got dropped from the
MAINTAINERS file.  Make it explicit again that this file is tied to lis3.

Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: After an "unexpected" reboot, I found this Oops in my logs:
Éric Piel [Wed, 24 Aug 2011 23:47:28 +0000 (09:47 +1000)]
After an "unexpected" reboot, I found this Oops in my logs:

divide error: 0000 [#1] PREEMPT SMP
CPU 0
Modules linked in: lis3lv02d hp_wmi input_polldev [...]
Pid: 390, comm: modprobe Tainted: G         C  2.6.39-rc7-wl+
RIP: 0010:[<ffffffffa014b427>]  [<ffffffffa014b427>]
 lis3lv02d_poweron+0x4e/0x94 [lis3lv02d]
RSP: 0018:ffff8801d6407cf8  EFLAGS: 00010246
RAX: 0000000000000bb8 RBX: ffffffffa014e000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffea00066e4708 RDI: ffff8801df002700
RBP: ffff8801d6407d18 R08: ffffea00066c5a30 R09: ffffffff812498c9
R10: ffff8801d7bfcea0 R11: ffff8801d7bfce10 R12: 0000000000000bb8
R13: 00000000ffffffda R14: ffffffffa0154120 R15: ffffffffa0154030
FS:  00007fc0705db700(0000) GS:ffff8801dfa00000(0000) knlGS:0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f33549174f0 CR3: 00000001d65c9000 CR4: 00000000000406f0
Process modprobe (pid: 390, threadinfo ffff8801d6406000, task ffff8801d6b40000)
Stack:
 ffffffffa0154120 62ffffffa0154030 ffffffffa014e000 00000000ffffffea
 ffff8801d6407d58 ffffffffa014bcc1 0000000000000000 0000000000000048
 ffff8801d8bae800 00000000ffffffea 00000000ffffffda ffffffffa0154120
Call Trace:
 [<ffffffffa014bcc1>] lis3lv02d_init_device+0x1ce/0x496 [lis3lv02d]
 [<ffffffffa01522ff>] lis3lv02d_add+0x10f/0x17c [hp_accel]
 [<ffffffff81233e11>] acpi_device_probe+0x49/0x117
[...]
Code: 3a 75 06 80 4d ef 50 eb 04 80 4d ef 40 0f b6 55 ef be 21
00 00 00 48 89 df ff 53 18 44 8b 63 6c e8 3e fc ff ff 89 c1 44
89 e0 99 <f7> f9 89 c7 e8 93 82 ef e0 48 83 7b 30 00 74 2d 45
31 e4 80 7b
RIP  [<ffffffffa014b427>] lis3lv02d_poweron+0x4e/0x94 [lis3lv02d]
 RSP <ffff8801d6407cf8>

From my POV, it looks like the hardware is not working as expected
and returns a bogus data rate.  The driver doesn't check the result
and directly uses it as some sort of divisor in some places:

msleep(lis3->pwron_delay / lis3lv02d_get_odr());

Under these circumstances, this could very well cause the
"divide by zero" exception from above.

For now, I fixed it the easiest and most obvious way:
Check if the result is sane and, if it isn't, use a sane default
instead. I went for "100" in the latter case, simply because
/sys/devices/platform/lis3lv02d/rate returns it on a successful
boot.
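
A sketch of that guard (lis3lv02d_get_odr() is real per the trace above;
the wrapper name here is hypothetical):

    /* hypothetical helper: never hand a zero divisor to msleep() */
    static int lis3lv02d_get_odr_safe(void)
    {
        int odr = lis3lv02d_get_odr();

        /* misbehaving hardware can report a bogus rate of 0 */
        return odr > 0 ? odr : 100;  /* 100 matches .../lis3lv02d/rate on a good boot */
    }

    /* at the call site: */
    msleep(lis3->pwron_delay / lis3lv02d_get_odr_safe());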

Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Witold Pilat <witold.pilat@gmail.com>
Cc: Lyall Pearce <lyall.pearce@hp.com>
Cc: Malte Starostik <m-starostik@versanet.de>
Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: s/lib-/obj-/ for usercopy.o
Sedat Dilek [Wed, 24 Aug 2011 23:47:27 +0000 (09:47 +1000)]
s/lib-/obj-/ for usercopy.o

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Randy Dunlap <rdunlap@xenotime.net>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: The help text for this config is duplicated across the x86, parisc, and
Stephen Boyd [Wed, 24 Aug 2011 23:47:27 +0000 (09:47 +1000)]
The help text for this config is duplicated across the x86, parisc, and
s390 Kconfig.debug files.  Arnd Bergmann noted that the help text was
slightly misleading and should be fixed to state that enabling this option
isn't a problem when using pre-4.4 gcc.

To simplify the rewording, consolidate the text into lib/Kconfig.debug and
modify it there to be more explicit about when you should say N to this
config.

Also, make the text a bit more generic by stating that this option enables
compile time checks so we can cover architectures which emit warnings vs.
ones which emit errors.  The details of how an architecture decided to
implement the checks isn't as important as the concept of compile time
checking of copy_from_user() calls.

While we're doing this, remove all the copy_from_user_overflow() code
that's duplicated many times and place it into lib/ so that any
architecture supporting this option can get the function for free.
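
The shared helper is tiny; a sketch of what such a lib/ version can look
like, per the description above:

    /* lib/usercopy.c: one definition replaces the per-arch copies */
    #include <linux/export.h>
    #include <linux/bug.h>

    void copy_from_user_overflow(void)
    {
        WARN(1, "Buffer overflow detected!\n");
    }
    EXPORT_SYMBOL(copy_from_user_overflow);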

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Strict user copy checks are only really supported on x86_32 even though
Stephen Boyd [Wed, 24 Aug 2011 23:47:26 +0000 (09:47 +1000)]
Strict user copy checks are only really supported on x86_32 even though
the config option is selectable on x86_64.  Add the necessary support to
the 64 bit code to trigger copy_from_user() warnings at compile time.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
Stephen Boyd [Wed, 24 Aug 2011 23:47:26 +0000 (09:47 +1000)]
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:

In file included from arch/x86/include/asm/uaccess.h:573,
                 from kernel/kprobes.c:55:
In function 'copy_from_user',
    inlined from 'write_enabled_file_bool' at
    kernel/kprobes.c:2191:
arch/x86/include/asm/uaccess_64.h:65:
warning: call to 'copy_from_user_overflow' declared with
attribute warning: copy_from_user() buffer size is not provably
correct

presumably due to buf_size being signed causing GCC to fail to see that
buf_size can't become negative.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: auto_demotion_disable is called only for online CPUs. For hotplugged
Shaohua Li [Wed, 24 Aug 2011 23:47:24 +0000 (09:47 +1000)]
auto_demotion_disable is called only for online CPUs.  For hotplugged
CPUs, we should disable it too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: smp_call_function() only lets all other CPUs execute a specific function,
Shaohua Li [Wed, 24 Aug 2011 23:47:24 +0000 (09:47 +1000)]
smp_call_function() only lets all other CPUs execute a specific function,
while we expect all CPUs to do so in intel_idle.  Without the fix, we
could have one CPU which has auto_demotion enabled or has no broadcast
timer set up.  Usually we don't see an impact because auto demotion just
harms power and the intel_idle init is called on CPU 0, where the
broadcast timer delivers interrupts, but this still could be a problem.
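
The distinction in a sketch (assuming the fix is to switch to
on_each_cpu(), which also runs the callback on the calling CPU):

    /* runs func on every online CPU except the caller */
    smp_call_function(func, info, 1);

    /* runs func on every online CPU, including the caller */
    on_each_cpu(func, info, 1);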

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: The current implementation of the /dev/hpet driver couples opening the
Magnus Lynch [Wed, 24 Aug 2011 23:47:23 +0000 (09:47 +1000)]
The current implementation of the /dev/hpet driver couples opening the
device with allocating one of the (scarce) timers (aka comparators).  This
is a limitation in that the main counter may be valuable to applications
seeking a high-resolution timer who have no use for the interrupt
generating functionality of the comparators.

This patch alters the open semantics so that when the device is opened, no
timer is allocated.  Operations that depend on a timer being in context
implicitly attempt allocating a timer, to maintain backward compatibility.
 There is also an IOCTL (HPET_ALLOC_TIMER _IO) added so that the
allocation may be done explicitly.  (I prefer the explicit open then
allocate pattern but don't know how practical it would be to require all
existing code to be changed.)

/dev/hpet is accessed via mmap().  This is the only interface of /dev/hpet
that is actually used in practice.

[akpm@linux-foundation.org: coding-style tweaks]
[arnd@arndb.de: fix build]
Signed-off-by: Magnus Lynch <maglyx@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Acked-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Make the security_inode_init_security() initxattrs arg const, to match the
Andrew Morton [Wed, 24 Aug 2011 23:47:22 +0000 (09:47 +1000)]
Make the security_inode_init_security() initxattrs arg const, to match the
non-stubbed version of that function.

Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Andy Shevchenko [Wed, 24 Aug 2011 23:47:21 +0000 (09:47 +1000)]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: reduce ifdeffery
Andrew Morton [Wed, 24 Aug 2011 23:47:20 +0000 (09:47 +1000)]
reduce ifdeffery

Cc: Amerigo Wang <amwang@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: vmstat_text is only available when PROC_FS or SYSFS is enabled. This
Randy Dunlap [Wed, 24 Aug 2011 23:47:20 +0000 (09:47 +1000)]
vmstat_text is only available when PROC_FS or SYSFS is enabled.  This
causes build errors in drivers/base/node.c when they are both disabled:

drivers/built-in.o: In function `node_read_vmstat':
node.c:(.text+0x10e28f): undefined reference to `vmstat_text'

Rather than litter drivers/base/node.c with #ifdef/#endif around the
affected lines of code, add macros for optional sysdev attributes so that
those lines of code will be ignored, without using #ifdef/#endif in the .c
file(s).  I.e., the ifdeffery is done only in a header file with
sysdev_create_file_optional() and sysdev_remove_file_optional().
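
One plausible shape for those helpers (a sketch; the actual patch may
differ in detail):

    /* compile the attribute away when vmstat_text cannot exist */
    #if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS)
    #define sysdev_create_file_optional(dev, attr) sysdev_create_file(dev, attr)
    #define sysdev_remove_file_optional(dev, attr) sysdev_remove_file(dev, attr)
    #else
    #define sysdev_create_file_optional(dev, attr) 0
    #define sysdev_remove_file_optional(dev, attr) do { } while (0)
    #endif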

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Wed, 24 Aug 2011 23:47:19 +0000 (09:47 +1000)]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Cc: Andrea Arcangeli <aarcange@redhat.com>
Andrew Morton [Wed, 24 Aug 2011 23:47:18 +0000 (09:47 +1000)]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: coding-style nitpicking
Andrew Morton [Wed, 24 Aug 2011 23:47:17 +0000 (09:47 +1000)]
coding-style nitpicking

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: This adds THP support to mremap (decreases the number of split_huge_page()
Andrea Arcangeli [Wed, 24 Aug 2011 23:47:17 +0000 (09:47 +1000)]
This adds THP support to mremap (decreases the number of split_huge_page()
calls).

Here are also some benchmarks with a proggy like this:

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (5UL*1024*1024*1024)

int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;

	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);

	gettimeofday(&oldstamp, NULL);
	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	gettimeofday(&newstamp, NULL);
	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
	printf("usec %ld\n", diffsec);

	if (p == MAP_FAILED || p4 != p3)
		perror("mremap"), exit(1);
	if (memcmp(p4, p2, SIZE))
		printf("mremap bug\n"), exit(1);
	printf("ok\n");

	return 0;
}
===

THP on

 Performance counter stats for './largepage13' (3 runs):

          69195836 dTLB-loads                 ( +-   3.546% )  (scaled from 50.30%)
             60708 dTLB-load-misses           ( +-  11.776% )  (scaled from 52.62%)
         676266476 dTLB-stores                ( +-   5.654% )  (scaled from 69.54%)
             29856 dTLB-store-misses          ( +-   4.081% )  (scaled from 89.22%)
        1055848782 iTLB-loads                 ( +-   4.526% )  (scaled from 80.18%)
              8689 iTLB-load-misses           ( +-   2.987% )  (scaled from 58.20%)

        7.314454164  seconds time elapsed   ( +-   0.023% )

THP off

 Performance counter stats for './largepage13' (3 runs):

        1967379311 dTLB-loads                 ( +-   0.506% )  (scaled from 60.59%)
           9238687 dTLB-load-misses           ( +-  22.547% )  (scaled from 61.87%)
        2014239444 dTLB-stores                ( +-   0.692% )  (scaled from 60.40%)
           3312335 dTLB-store-misses          ( +-   7.304% )  (scaled from 67.60%)
        6764372065 iTLB-loads                 ( +-   0.925% )  (scaled from 79.00%)
              8202 iTLB-load-misses           ( +-   0.475% )  (scaled from 70.55%)

        9.693655243  seconds time elapsed   ( +-   0.069% )

grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0

thp_split 0 confirms no thp split despite plenty of hugepages allocated.

The measurement of only the mremap time (so excluding the three long
memsets and the final long 10GB memory-accessing memcmp):

THP on

usec 14824
usec 14862
usec 14859

THP off

usec 256416
usec 255981
usec 255847

With an older kernel without the mremap optimizations (the below patch
optimizes the non-THP version too).

THP on

usec 392107
usec 390237
usec 404124

THP off

usec 444294
usec 445237
usec 445820

I guess with a threaded program that sends more IPIs on a large SMP
system it'd create an even larger difference.

All debug options are off except DEBUG_VM to avoid skewing the
results.

The only caveat for a native 2M mremap like the one above is that both the
source and destination addresses must be 2M aligned, or the hugepmd can't
be moved without a split; but that is a hardware limitation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
Andrea Arcangeli [Wed, 24 Aug 2011 23:47:16 +0000 (09:47 +1000)]
This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.

The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly but this is not going to fundamentally change things.  It was
more by accident that the region under mremap was for the most part still
available for secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range, which sounds impractical and
flaky).  If users want secondary MMUs not to lose access to a large
region under mremap they should reduce the mremap size accordingly in
userland and run multiple calls.  Overall this will run faster, so it's
actually going to reduce the time the region is under mremap for the
primary MMU, which should provide a net benefit to apps.

For KVM this is a noop because the guest physical memory is never
mremapped; there's just no point in ever moving it while the guest runs.
One target of this optimization is JVM GC (so unrelated to the mmu
notifier logic).
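
In sketch form, the change inside mm/mremap.c's move_ptes() (simplified,
with page-table locking elided):

    /* before: one flush (and potential IPI) per PTE */
    for (; old_addr < old_end; old_pte++, new_pte++,
           old_addr += PAGE_SIZE, new_addr += PAGE_SIZE) {
        pte = ptep_clear_flush(vma, old_addr, old_pte);
        pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
        set_pte_at(mm, new_addr, new_pte, pte);
    }

    /* after: clear without flushing, then one flush for the whole range */
    for (; old_addr < old_end; old_pte++, new_pte++,
           old_addr += PAGE_SIZE, new_addr += PAGE_SIZE) {
        pte = ptep_get_and_clear(mm, old_addr, old_pte);
        pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
        set_pte_at(mm, new_addr, new_pte, pte);
    }
    flush_tlb_range(vma, start, old_end);  /* start: old_addr before the loop */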

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
Andrea Arcangeli [Wed, 24 Aug 2011 23:47:14 +0000 (09:47 +1000)]
Using "- 1" relies on old_end being page aligned and PAGE_SIZE > 1.
Those are reasonable requirements, but the check remains obscure and it
looks more like an off-by-one error than an overflow check.  This, I feel,
will improve readability.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: With the NO_BOOTMEM symbol added architectures may now use the following
Sam Ravnborg [Wed, 24 Aug 2011 23:47:14 +0000 (09:47 +1000)]
With the NO_BOOTMEM symbol added, architectures may now use the following
syntax to tell that they do not need bootmem:

select NO_BOOTMEM

This is much more convenient than adding a new kconfig symbol, which was
otherwise required.

Adding this symbol does not conflict with the architectures that already
define their own symbol.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: SPARC32 requires access to the start address. Add a new helper
Sam Ravnborg [Wed, 24 Aug 2011 23:47:13 +0000 (09:47 +1000)]
SPARC32 requires access to the start address.  Add a new helper
memblock_start_of_DRAM() to give access to the address of the first
memblock - which contains the lowest address.

The awkward name was chosen to match the already present
memblock_end_of_DRAM().
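
The helper itself can be a one-liner; a sketch matching the description:

    phys_addr_t __init_memblock memblock_start_of_DRAM(void)
    {
        /* memory regions are kept sorted: the first has the lowest base */
        return memblock.memory.regions[0].base;
    }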

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
Konstantin Khlebnikov [Wed, 24 Aug 2011 23:47:12 +0000 (09:47 +1000)]
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").

Currently these pages can become "first class citizens" only after their
second usage.  After this patch page_check_references() will activate them
after the first usage, and executable code gets a yet better chance to
stay in memory.
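
Roughly where this lands in page_check_references() (a simplified sketch
of the vmscan logic of this era, not the exact hunk):

    /* inside page_check_references(), when referenced_ptes != 0 */
    if (PageAnon(page))
        return PAGEREF_ACTIVATE;
    /* ... dirty/swap-backed handling elided ... */
    if (vm_flags & VM_EXEC)
        return PAGEREF_ACTIVATE;  /* executable mapping: first use is enough */
    return PAGEREF_KEEP;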

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Commit 645747462435 ("vmscan: detect mapped file pages used only once")
Konstantin Khlebnikov [Wed, 24 Aug 2011 23:47:11 +0000 (09:47 +1000)]
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases the lifetime of single-used mapped file pages.
Unfortunately it also decreases the lifetime of all shared mapped file
pages, because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") the page-fault handler does not mark the page active or
even referenced.

Thus page_check_references() activates a file page only if it was used
twice while it stays on the inactive list, while it activates anon pages
after the first access.  The inactive list can be small enough that the
reclaimer can accidentally throw away any widely used page if it wasn't
used twice in a short period.

After this patch page_check_references() also activates a file-mapped
page at the first inactive-list scan if the page is already used multiple
times via several ptes.

I found this while trying to fix a degradation in rhel6 (~2.6.32) versus
rhel5 (~2.6.18).  There is a complete mess with >100 web/mail/spam/ftp
containers; they share all their files but there are a lot of anonymous
pages: ~500mb of shared file-mapped memory and 15-20Gb of non-shared
anonymous memory.  In this situation major pagefaults are very costly,
because all containers share the same page.  In my load the kernel created
a disproportionate pressure on the file memory, compared with the
anonymous; they equalized only if I raised swappiness up to 150 =)

These patches actually didn't help a lot with my problem, but I saw a
noticeable (10-20 times) reduction in the count and average time of
major pagefaults in file-mapped areas.

Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages).

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: The /proc/vmallocinfo shows information about vmalloc allocations in
Mitsuo Hayasaka [Wed, 24 Aug 2011 23:47:10 +0000 (09:47 +1000)]
/proc/vmallocinfo shows information about vmalloc allocations in
vmlist, which is a linked list of vm_struct.  It may, however, access the
pages field of a vm_struct whose pages were not allocated.  This results
in a NULL pointer access and leads to a kernel panic.

Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node().  In other words, it is added to vmlist before it is
fully initialized.  At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info().  Thus, a null pointer access happens.

The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized.  So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: massage atomic.h inclusions
Andrew Morton [Wed, 24 Aug 2011 23:47:09 +0000 (09:47 +1000)]
massage atomic.h inclusions

Cc: Dave Chinner <david@fromorbit.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Use atomic-long operations instead of looping around cmpxchg().
Konstantin Khlebnikov [Wed, 24 Aug 2011 23:47:08 +0000 (09:47 +1000)]
Use atomic-long operations instead of looping around cmpxchg().
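
The transformation in a sketch (with the counter retyped as
atomic_long_t):

    /* before: open-coded read-modify-write loop */
    do {
        old = counter;
    } while (cmpxchg(&counter, old, old + delta) != old);

    /* after */
    atomic_long_add(delta, &counter);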

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: A shrinker function can return -1, meaning that it cannot do anything
Konstantin Khlebnikov [Wed, 24 Aug 2011 23:47:07 +0000 (09:47 +1000)]
A shrinker function can return -1, meaning that it cannot do anything
without a risk of deadlock.  For example prune_super() does this if it
cannot grab a superblock reference, even if nr_to_scan=0.  Currently we
interpret this -1 as a ULONG_MAX-size shrinker and evaluate `total_scan'
accordingly.  So the next time around this shrinker can cause really
big pressure.  Let's skip such shrinkers instead.

Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
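
A sketch of the skip in shrink_slab() (do_shrinker_shrink() is the
helper used by this era's shrink_slab()):

    long max_pass = do_shrinker_shrink(shrinker, shrink, 0);
    if (max_pass <= 0)
        continue;  /* -1: can't run without deadlock risk, so skip it */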

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Use newly introduced memchr_inv() for page verification.
Akinobu Mita [Wed, 24 Aug 2011 23:47:07 +0000 (09:47 +1000)]
Use newly introduced memchr_inv() for page verification.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: memchr_inv() is mainly used to check whether the whole buffer is filled
Akinobu Mita [Wed, 24 Aug 2011 23:47:06 +0000 (09:47 +1000)]
memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.

The function name and prototype are stolen from logfs and the
implementation is from SLUB.
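
Sketch of the intended use (returns NULL when every byte matches, else a
pointer to the first mismatching byte):

    u8 *fault;

    fault = memchr_inv(start, POISON_FREE, length);
    if (fault)
        pr_err("poison overwritten at offset %td\n", fault - start);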

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: printk_ratelimit() should not be used, because it shares ratelimiting
Akinobu Mita [Wed, 24 Aug 2011 23:47:05 +0000 (09:47 +1000)]
printk_ratelimit() should not be used, because it shares ratelimiting
state with all other unrelated printk_ratelimit() callsites.
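
The usual replacement is a dedicated ratelimit state per callsite (a
sketch; "my_rs" is a placeholder name):

    #include <linux/ratelimit.h>

    static DEFINE_RATELIMIT_STATE(my_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);

    if (__ratelimit(&my_rs))
        pr_warn("something worth ratelimiting\n");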

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: It's possible a zone watermark is ok when entering the balance_pgdat()
Shaohua Li [Wed, 24 Aug 2011 23:47:04 +0000 (09:47 +1000)]
It's possible a zone watermark is ok when entering the balance_pgdat()
loop, while the zone is within the requested classzone_idx.  Count pages
from this zone into `balanced'.  In this way, we can skip shrinking zones
too much for high order allocation.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: When direct reclaim encounters a dirty page, it gets recycled around the
Mel Gorman [Wed, 24 Aug 2011 23:47:03 +0000 (09:47 +1000)]
When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle.  This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned.  This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago: Workloads that are allocating frequently and writing files place a large
Mel Gorman [Wed, 24 Aug 2011 23:47:02 +0000 (09:47 +1000)]
Workloads that are allocating frequently and writing files place a large
number of dirty pages on the LRU.  With use-once logic, it is possible for
them to reach the end of the LRU quickly, requiring the reclaimer to scan
more to find clean pages.  Ordinarily, processes that are dirtying memory
will get throttled by dirty balancing, but this is a global heuristic and
does not take into account that LRUs are maintained on a per-zone basis.
This can lead to a situation whereby reclaim is scanning heavily, skipping
over a large number of pages under writeback and recycling them around the
LRU, consuming CPU.

This patch checks how many of the pages isolated from the LRU were dirty
or under writeback.  If a sufficient percentage of them are under
writeback, the process is throttled when a backing device or the zone is
congested.  Note that this applies whether the pages under writeback are
anonymous or file-backed, meaning that swapping is potentially throttled.
This is intentional: if the swap device is congested, scanning more pages
and dispatching more IO is not going to help matters.

The percentage that must be in writeback depends on the priority.  At
default priority, all of them must be under writeback.  At DEF_PRIORITY-1,
50% of them must be; at DEF_PRIORITY-2, 25%; etc.  In other words, as
pressure increases, the more likely it is that the process will get
throttled, allowing the flusher threads to make some progress.
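
A hedged sketch of the check, assuming the counters gathered in
shrink_inactive_list():

    /* priority starts at DEF_PRIORITY and drops under pressure, so
     * the writeback threshold halves with each priority step */
    if (nr_writeback && nr_writeback >=
            (nr_taken >> (DEF_PRIORITY - priority)))
        wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);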

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIt is preferable that no dirty pages are dispatched for cleaning from the
Mel Gorman [Wed, 24 Aug 2011 23:47:01 +0000 (09:47 +1000)]
It is preferable that no dirty pages are dispatched for cleaning from the
page reclaim path.  At normal priorities, this patch prevents kswapd
writing pages.

However, page reclaim does have a requirement that pages be freed in a
particular zone.  If it is failing to make sufficient progress (reclaiming
< SWAP_CLUSTER_MAX at any priority), the priority is raised to
scan more pages.  A priority of DEF_PRIORITY - 3 is considered to be the
point where kswapd is getting into trouble reclaiming pages.  If this
priority is reached, kswapd will dispatch pages for writing.
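
A sketch of the gate, assuming the pageout decision in
shrink_page_list():

    /* only kswapd may write file pages, and only once the priority
     * shows it is struggling (DEF_PRIORITY - 3 or lower) */
    if (page_is_file_cache(page) &&
        (!current_is_kswapd() || priority >= DEF_PRIORITY - 2))
        goto keep_locked;  /* leave the page for the flusher threads */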

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoDirect reclaim should never writeback pages. Warn if an attempt is made.
Mel Gorman [Wed, 24 Aug 2011 23:47:00 +0000 (09:47 +1000)]
Direct reclaim should never writeback pages.  Warn if an attempt is made.
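
The warning can be a one-liner near the top of a filesystem's
->writepage; a sketch (the redirty label is assumed):

    /* PF_MEMALLOC set without PF_KSWAPD identifies direct reclaim */
    if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC | PF_KSWAPD)) ==
                     PF_MEMALLOC))
        goto redirty;  /* refuse the write, keep the page dirty */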

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoDirect reclaim should never writeback pages. For now, handle the
Mel Gorman [Wed, 24 Aug 2011 23:46:59 +0000 (09:46 +1000)]
Direct reclaim should never writeback pages.  For now, handle the
situation and warn about it.  Ultimately, this will be a BUG_ON.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoLumpy reclaim worked with two passes - the first which queued pages for IO
Mel Gorman [Wed, 24 Aug 2011 23:46:58 +0000 (09:46 +1000)]
Lumpy reclaim worked with two passes: the first queued pages for IO and
the second waited on writeback.  As direct reclaim can no longer write
pages, there is some dead code.  This patch removes it, but direct reclaim
will continue to wait on pages under writeback while in synchronous
reclaim mode.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoTesting from the XFS folk revealed that there is still too much I/O from
Mel Gorman [Wed, 24 Aug 2011 23:46:57 +0000 (09:46 +1000)]
Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd.  Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim,
with testing generally showing about 0.3% of pages reclaimed were written
back (higher if memory was low).  The claim that writing back a small
number of pages is ok has been heavily disputed for quite some time, and
Dave Chinner explained it well:

It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.

To complicate matters, filesystems respond very differently to requests
from reclaim, according to Christoph Hellwig:

xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request

As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied.  In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.

The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.  The new problem is that reclaim has very little
control over how long it takes before a page in a particular zone or
container is cleaned, which is discussed later.  A secondary goal is to
avoid the problem whereby direct reclaim splices two potentially deep call
stacks together.

Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written.

Patch 2 removes dead code in lumpy reclaim as it is no longer able
to synchronously write pages. This hurts lumpy reclaim but
there is an expectation that compaction is used for hugepage
allocations these days and lumpy reclaim's days are numbered.

Patches 3-4 add warnings to XFS and ext4 if called from
direct reclaim. With patch 1, this "never happens" and is
intended to catch regressions in this logic in the future.

Patch 5 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.

Patch 6 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.

Patch 7 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them.

I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.

I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings.  The command line for fs_mark when
booted with 512M looked something like

./fs_mark -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760

The number of files was adjusted depending on the amount of available
memory so that the total size of the files created was about 3xRAM.  For
multiple threads, the -d switch is specified multiple times.

The test machine is x86-64 with an older generation of AMD processor with
4 cores.  The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available.  Swap is on a
separate disk.  Dirty ratio was tuned to 40% instead of the default of
20%.

Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.

Here is a summary of results based on testing XFS.

512M1P-xfs           Files/s  mean                 32.69 ( 0.00%)     34.44 ( 5.08%)
512M1P-xfs           Elapsed Time fsmark                    51.41     48.29
512M1P-xfs           Elapsed Time simple-wb                114.09    108.61
512M1P-xfs           Elapsed Time mmap-strm                113.46    109.34
512M1P-xfs           Kswapd efficiency fsmark                 62%       63%
512M1P-xfs           Kswapd efficiency simple-wb              56%       61%
512M1P-xfs           Kswapd efficiency mmap-strm              44%       42%
512M-xfs             Files/s  mean                 30.78 ( 0.00%)     35.94 (14.36%)
512M-xfs             Elapsed Time fsmark                    56.08     48.90
512M-xfs             Elapsed Time simple-wb                112.22     98.13
512M-xfs             Elapsed Time mmap-strm                219.15    196.67
512M-xfs             Kswapd efficiency fsmark                 54%       56%
512M-xfs             Kswapd efficiency simple-wb              54%       55%
512M-xfs             Kswapd efficiency mmap-strm              45%       44%
512M-4X-xfs          Files/s  mean                 30.31 ( 0.00%)     33.33 ( 9.06%)
512M-4X-xfs          Elapsed Time fsmark                    63.26     55.88
512M-4X-xfs          Elapsed Time simple-wb                100.90     90.25
512M-4X-xfs          Elapsed Time mmap-strm                261.73    255.38
512M-4X-xfs          Kswapd efficiency fsmark                 49%       50%
512M-4X-xfs          Kswapd efficiency simple-wb              54%       56%
512M-4X-xfs          Kswapd efficiency mmap-strm              37%       36%
512M-16X-xfs         Files/s  mean                 60.89 ( 0.00%)     65.22 ( 6.64%)
512M-16X-xfs         Elapsed Time fsmark                    67.47     58.25
512M-16X-xfs         Elapsed Time simple-wb                103.22     90.89
512M-16X-xfs         Elapsed Time mmap-strm                237.09    198.82
512M-16X-xfs         Kswapd efficiency fsmark                 45%       46%
512M-16X-xfs         Kswapd efficiency simple-wb              53%       55%
512M-16X-xfs         Kswapd efficiency mmap-strm              33%       33%

Up until 512M-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely.  The time to completion for all tests is improved which is an
important result.  In general, kswapd efficiency is not affected by
skipping dirty pages.

1024M1P-xfs          Files/s  mean                 39.09 ( 0.00%)     41.15 ( 5.01%)
1024M1P-xfs          Elapsed Time fsmark                    84.14     80.41
1024M1P-xfs          Elapsed Time simple-wb                210.77    184.78
1024M1P-xfs          Elapsed Time mmap-strm                162.00    160.34
1024M1P-xfs          Kswapd efficiency fsmark                 69%       75%
1024M1P-xfs          Kswapd efficiency simple-wb              71%       77%
1024M1P-xfs          Kswapd efficiency mmap-strm              43%       44%
1024M-xfs            Files/s  mean                 35.45 ( 0.00%)     37.00 ( 4.19%)
1024M-xfs            Elapsed Time fsmark                    94.59     91.00
1024M-xfs            Elapsed Time simple-wb                229.84    195.08
1024M-xfs            Elapsed Time mmap-strm                405.38    440.29
1024M-xfs            Kswapd efficiency fsmark                 79%       71%
1024M-xfs            Kswapd efficiency simple-wb              74%       74%
1024M-xfs            Kswapd efficiency mmap-strm              39%       42%
1024M-4X-xfs         Files/s  mean                 32.63 ( 0.00%)     35.05 ( 6.90%)
1024M-4X-xfs         Elapsed Time fsmark                   103.33     97.74
1024M-4X-xfs         Elapsed Time simple-wb                204.48    178.57
1024M-4X-xfs         Elapsed Time mmap-strm                528.38    511.88
1024M-4X-xfs         Kswapd efficiency fsmark                 81%       70%
1024M-4X-xfs         Kswapd efficiency simple-wb              73%       72%
1024M-4X-xfs         Kswapd efficiency mmap-strm              39%       38%
1024M-16X-xfs        Files/s  mean                 42.65 ( 0.00%)     42.97 ( 0.74%)
1024M-16X-xfs        Elapsed Time fsmark                   103.11     99.11
1024M-16X-xfs        Elapsed Time simple-wb                200.83    178.24
1024M-16X-xfs        Elapsed Time mmap-strm                397.35    459.82
1024M-16X-xfs        Kswapd efficiency fsmark                 84%       69%
1024M-16X-xfs        Kswapd efficiency simple-wb              74%       73%
1024M-16X-xfs        Kswapd efficiency mmap-strm              39%       40%

All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster, with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings, which
were slower in two cases.

In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped.  The number of additional pages
swapped is almost identical to the reduction in the number of pages
written from reclaim.  In other words, roughly the same number of pages
were reclaimed, but swapping was slower.  As the test is a bit unrealistic
and stresses memory heavily, the small shift is acceptable.

4608M1P-xfs          Files/s  mean                 29.75 ( 0.00%)     30.96 ( 3.91%)
4608M1P-xfs          Elapsed Time fsmark                   512.01    492.15
4608M1P-xfs          Elapsed Time simple-wb                618.18    566.24
4608M1P-xfs          Elapsed Time mmap-strm                488.05    465.07
4608M1P-xfs          Kswapd efficiency fsmark                 93%       86%
4608M1P-xfs          Kswapd efficiency simple-wb              88%       84%
4608M1P-xfs          Kswapd efficiency mmap-strm              46%       45%
4608M-xfs            Files/s  mean                 27.60 ( 0.00%)     28.85 ( 4.33%)
4608M-xfs            Elapsed Time fsmark                   555.96    532.34
4608M-xfs            Elapsed Time simple-wb                659.72    571.85
4608M-xfs            Elapsed Time mmap-strm               1082.57   1146.38
4608M-xfs            Kswapd efficiency fsmark                 89%       91%
4608M-xfs            Kswapd efficiency simple-wb              88%       82%
4608M-xfs            Kswapd efficiency mmap-strm              48%       46%
4608M-4X-xfs         Files/s  mean                 26.00 ( 0.00%)     27.47 ( 5.35%)
4608M-4X-xfs         Elapsed Time fsmark                   592.91    564.00
4608M-4X-xfs         Elapsed Time simple-wb                616.65    575.07
4608M-4X-xfs         Elapsed Time mmap-strm               1773.02   1631.53
4608M-4X-xfs         Kswapd efficiency fsmark                 90%       94%
4608M-4X-xfs         Kswapd efficiency simple-wb              87%       82%
4608M-4X-xfs         Kswapd efficiency mmap-strm              43%       43%
4608M-16X-xfs        Files/s  mean                 26.07 ( 0.00%)     26.42 ( 1.32%)
4608M-16X-xfs        Elapsed Time fsmark                   602.69    585.78
4608M-16X-xfs        Elapsed Time simple-wb                606.60    573.81
4608M-16X-xfs        Elapsed Time mmap-strm               1549.75   1441.86
4608M-16X-xfs        Kswapd efficiency fsmark                 98%       98%
4608M-16X-xfs        Kswapd efficiency simple-wb              88%       82%
4608M-16X-xfs        Kswapd efficiency mmap-strm              44%       42%

Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.

There are other indications that this is an improvement as well.  For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim, implying in many cases that stalls due to direct reclaim
are reduced.  Kswapd is scanning more due to skipping dirty pages, which
is unfortunate, but the CPU usage is still acceptable.

In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher.  However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.

On a laptop, I plugged in a USB stick and ran a similar set of tests
using it as backing storage.  A desktop environment was running, and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.

1024M-xfs            Files/s  mean               0.41 ( 0.00%)        0.44 ( 6.82%)
1024M-xfs            Elapsed Time fsmark               2053.52   1641.03
1024M-xfs            Elapsed Time simple-wb            1229.53    768.05
1024M-xfs            Elapsed Time mmap-strm            4126.44   4597.03
1024M-xfs            Kswapd efficiency fsmark              84%       85%
1024M-xfs            Kswapd efficiency simple-wb           92%       81%
1024M-xfs            Kswapd efficiency mmap-strm           60%       51%
1024M-xfs            Avg wait ms fsmark                5404.53     4473.87
1024M-xfs            Avg wait ms simple-wb             2541.35     1453.54
1024M-xfs            Avg wait ms mmap-strm             3400.25     3852.53

The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory.  On the positive side, firefox launched
marginally faster with the patches applied.  Time to completion for many
tests was faster but, more importantly, the "Avg wait" time as measured by
iostat was far lower, implying the system would be more responsive.  It was
also the case that "Avg wait ms" on the root filesystem was lower.  I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.

This patch: do not writeback filesystem pages in direct reclaim.

When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does.  If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.

This causes two problems.  First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex.  Some
filesystems ignore write requests from direct reclaim as a result.  The
second is that a single-page flush is inefficient in terms of IO.  While
there is an expectation that the elevator will merge requests, this does
not always happen.  Quoting Christoph Hellwig:

The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.

This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd.  Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists.  There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.
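
A minimal sketch of this patch's check in shrink_page_list() (the
kswapd priority gate is added separately in this series):

    /* file pages are never written from direct reclaim; anonymous
     * pages still go to swap */
    if (page_is_file_cache(page) && !current_is_kswapd())
        goto keep_locked;  /* back onto the LRU list */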

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoadd missing ;
Andrew Morton [Wed, 24 Aug 2011 23:46:56 +0000 (09:46 +1000)]
add missing ;

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoAdd comments to explain the page statistics field in the mm_struct.
Christoph Lameter [Wed, 24 Aug 2011 23:46:56 +0000 (09:46 +1000)]
Add comments to explain the page statistics field in the mm_struct.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoSome kernel components pin user space memory (infiniband and perf) (by
Christoph Lameter [Wed, 24 Aug 2011 23:46:54 +0000 (09:46 +1000)]
Some kernel components (infiniband and perf) pin user space memory by
increasing the page count and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
   swapping. Page migration may move them around though.
   They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
   directly access physical memory. They may not be on any
   LRU list.

I recently saw an mlockall()ed process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needed to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them separately.
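
A hedged sketch of the change at a pinning site (for example infiniband
memory registration; npages is illustrative):

    down_write(&current->mm->mmap_sem);
    /* was: current->mm->locked_vm += npages; */
    current->mm->pinned_vm += npages;
    up_write(&current->mm->mmap_sem);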

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe nr_force_scan[] tuple holds the effective scan numbers for anon and
Johannes Weiner [Wed, 24 Aug 2011 23:46:54 +0000 (09:46 +1000)]
The nr_force_scan[] tuple holds the effective scan numbers for anon and
file pages in case the situation called for a forced scan and the
regularly calculated scan numbers turned out to be zero.

However, the effective scan number can always be assumed to be
SWAP_CLUSTER_MAX right before the division into anon and file.  The
numerators and denominator are properly set up for all cases, be it force
scan for just file, just anon, or both, to do the right thing.
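
A sketch of the simplification in get_scan_count(), with fraction[] and
denominator as already computed there:

    unsigned long scan = SWAP_CLUSTER_MAX;  /* effective forced total */

    scan = div64_u64(scan * fraction[file], denominator);
    nr[lru] = scan;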

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWhen we get a bad_page bug report, it's useful to see what modules the
Dave Jones [Wed, 24 Aug 2011 23:46:51 +0000 (09:46 +1000)]
When we get a bad_page bug report, it's useful to see what modules the
user had loaded.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoAdd the leading word "tmpfs" to the Kconfig string to make it blindingly
Robert P. J. Day [Wed, 24 Aug 2011 23:46:51 +0000 (09:46 +1000)]
Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoAfter selecting a task to kill, the oom killer iterates all processes and
David Rientjes [Wed, 24 Aug 2011 23:46:49 +0000 (09:46 +1000)]
After selecting a task to kill, the oom killer iterates all processes and
kills all other threads that share the same mm_struct in different thread
groups.  It would not otherwise be helpful to kill a thread if its memory
would not be subsequently freed.

A kernel thread, however, may assume a user thread's mm by using
use_mm().  This is only temporary and should not result in sending a
SIGKILL to that kthread.

This patch ensures that only user threads and not kthreads are sent a
SIGKILL if they share the same mm_struct as the oom killed task.
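
A sketch of the filter, assuming the kill loop in mm/oom_kill.c with p
as the chosen victim:

    struct task_struct *q;

    for_each_process(q)
        if (q->mm == mm && !same_thread_group(q, p) &&
            !(q->flags & PF_KTHREAD))  /* skip kthreads using use_mm() */
            force_sig(SIGKILL, q);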

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe tracing ring-buffer used this function briefly, but not anymore.
Johannes Weiner [Wed, 24 Aug 2011 23:46:49 +0000 (09:46 +1000)]
The tracing ring-buffer used this function briefly, but not anymore.
Make it local to the writeback code again.

Also, move the function so that no forward declaration needs to be
reintroduced.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoper-task block plug can reduce block queue lock contention and increase
Shaohua Li [Wed, 24 Aug 2011 23:46:48 +0000 (09:46 +1000)]
A per-task block plug can reduce block queue lock contention and increase
request merging.  Currently page reclaim doesn't use it.  I originally
thought page reclaim didn't need it, because the kswapd thread count is
limited and file cache writes are mostly done by the flusher threads.

When I tested a workload with heavy swap on a 4-node machine, every CPU
was doing direct page reclaim and swapping.  This caused block queue lock
contention.  In my test, without the patch below, the CPU utilization is
about 2% ~ 7%.  With the patch, the CPU utilization is about 1% ~ 3%.
Disk throughput isn't changed.  This should improve normal kswapd writes
and file cache writes too (by increasing request merging, for example),
but the effect might not be as obvious as in the case above.
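
The change boils down to wrapping the reclaim work in a plug; a sketch:

    #include <linux/blkdev.h>

    struct blk_plug plug;

    blk_start_plug(&plug);
    /* ... page reclaim work that submits IO (pageout, swap) ... */
    blk_finish_plug(&plug);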

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoradix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
Hugh Dickins [Wed, 24 Aug 2011 23:46:48 +0000 (09:46 +1000)]
radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
unsafe and was removed in 2.6.34, but the pointless saw_unset_tag was left
behind.

Remove it now, and return 0 as soon as we see an unset tag - we already
rely upon the root tag to be correct, returning 0 immediately if it's not
set.
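
A sketch of the simplified walk in radix_tree_tag_get():

    /* with the root tag already verified, any unset tag on the way
     * down means the item cannot be tagged */
    if (!tag_get(node, tag, offset))
        return 0;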

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agounmap_and_move() is one big messy function. Clean it up.
Minchan Kim [Wed, 24 Aug 2011 23:46:46 +0000 (09:46 +1000)]
unmap_and_move() is one big messy function.  Clean it up.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIn __zone_reclaim case, we don't want to shrink mapped page. Nonetheless,
Minchan Kim [Wed, 24 Aug 2011 23:46:45 +0000 (09:46 +1000)]
In the __zone_reclaim case, we don't want to shrink mapped pages.
Nonetheless, we isolate mapped pages and re-add them at the head of the
LRU.  This is unnecessary CPU overhead and causes LRU churning.

Of course, when we isolate a page it might be mapped, but by the time we
try to migrate it the page may no longer be mapped, so it could be
migrated.  But the race is rare, and even when it happens it's no big deal.
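
A sketch of the fix, assuming the ISOLATE_UNMAPPED flag from the
isolate_mode_t conversion below:

    /* zone reclaim with may_unmap == 0: have the isolator skip mapped
     * pages instead of isolating them and putting them back */
    if (!sc->may_unmap)
        isolate_mode |= ISOLATE_UNMAPPED;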

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoIn async mode, compaction doesn't migrate dirty or writeback pages. So,
Minchan Kim [Wed, 24 Aug 2011 23:46:44 +0000 (09:46 +1000)]
In async mode, compaction doesn't migrate dirty or writeback pages.  So
it's meaningless to pick such a page and re-add it to the LRU list.

Of course, when we isolate a page in compaction, it might be dirty or
under writeback, but by the time we try to migrate it the page may no
longer be dirty or under writeback, so it could be migrated.  But that is
very unlikely, as the isolate-and-migrate cycle is much faster than
writeout.

So this patch reduces CPU overhead and prevents unnecessary LRU churning.
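
A sketch of the filter on the compaction side, assuming the
ISOLATE_CLEAN flag from the isolate_mode_t conversion below:

    /* async compaction: only clean pages are worth isolating */
    if (!cc->sync)
        mode |= ISOLATE_CLEAN;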

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoReplace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.
Minchan Kim [Wed, 24 Aug 2011 23:46:44 +0000 (09:46 +1000)]
Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.
Macros are generally not recommended here, as they are type-unsafe and
make debugging harder: the symbols cannot be passed through to the
debugger.

Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."

This patch moves the isolate mode definitions from swap.h to mmzone.h,
for use by memcontrol.h.
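
The flag type ends up roughly like this (a sketch of the conversion):

    /* sparse-checked flag type replacing the old ISOLATE_XXX macros */
    typedef unsigned __bitwise__ isolate_mode_t;

    #define ISOLATE_INACTIVE ((__force isolate_mode_t)0x1) /* inactive LRU */
    #define ISOLATE_ACTIVE   ((__force isolate_mode_t)0x2) /* active LRU */
    #define ISOLATE_CLEAN    ((__force isolate_mode_t)0x4) /* clean pages */
    #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8) /* unmapped pages */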

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoacct_isolated of compaction uses page_lru_base_type which returns only
Minchan Kim [Wed, 24 Aug 2011 23:46:42 +0000 (09:46 +1000)]
acct_isolated of compaction uses page_lru_base_type which returns only
the base type of an LRU list, so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE.  In addition, cc->nr_[anon|file] is used only in
acct_isolated, so they don't need to be fields in compact_control.

This patch removes those fields from compact_control and clarifies the
function of acct_isolated, which counts the number of anon|file pages
isolated.
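
The clarified helper comes out roughly like this (a sketch):

    static void acct_isolated(struct zone *zone, struct compact_control *cc)
    {
        unsigned int count[2] = { 0, };
        struct page *page;

        list_for_each_entry(page, &cc->migratepages, lru)
            count[!!page_is_file_cache(page)]++;

        __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
        __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
    }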

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago> You might get some speed benefit by optimising for the small copies
Christopher Yeoh [Wed, 24 Aug 2011 23:46:42 +0000 (09:46 +1000)]
> You might get some speed benefit by optimising for the small copies
> here.  Define a local on-stack array of N page*'s and point
> process_pages at that if the number of pages is <= N.  Saves a
> malloc/free and is more cache-friendly.  But only if the result is
> measurable!

I have done some benchmarking on this, and it gains about 5-7% on a
microbenchmark with 4kb size copies and about a 1% gain with a more
realistic (but modified for smaller copies) hpcc benchmark.  The
performance gain disappears into the noise by about 64kb sized copies,
with no measurable overhead for larger copies.  So I think it's worth
including.
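
A hedged sketch of the small-copy fast path; PVM_LOCAL_PAGES is a
hypothetical threshold name:

    #define PVM_LOCAL_PAGES 16  /* hypothetical small-copy threshold */

    struct page *local_pages[PVM_LOCAL_PAGES];
    struct page **process_pages = local_pages;

    if (nr_pages > PVM_LOCAL_PAGES) {
        process_pages = kmalloc(nr_pages * sizeof(struct page *),
                                GFP_KERNEL);
        if (!process_pages)
            return -ENOMEM;
    }
    /* ... pin the pages and do the copy ... */
    if (process_pages != local_pages)
        kfree(process_pages);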

Included below is the patch (based on v4).  For ease of review, the first
diff is just against the latest version of CMA which has been posted here
previously; the second is the entire CMA patch.

Signed-off-by: Chris Yeoh <cyeoh@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years ago- Add x86_64 specific wire up
Christopher Yeoh [Wed, 24 Aug 2011 23:46:41 +0000 (09:46 +1000)]
- Add x86_64 specific wire up

- Change behaviour so process_vm_readv and process_vm_writev return
  the number of bytes successfully read or written even if an error
  occurs

- Add more kernel doc interface comments

- Rename some internal functions (process_vm_rw_check_iovecs,
  process_vm_rw) so they make more sense.

- Add licence message

- Fix kernel-doc comment format

Still need to do benchmarking to see if the optimisation for small copies
using a local on-stack array in process_vm_rw_core is worth it.

Signed-off-by: Chris Yeoh <cyeoh@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoThe basic idea behind cross memory attach is to allow MPI programs doing
Christopher Yeoh [Wed, 24 Aug 2011 23:46:40 +0000 (09:46 +1000)]
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call.  There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
  using it:
  - It does not allow specifying iovecs for both src and dest; assuming
    preadv or pwritev were implemented, either the area read from or
    written to would need to be contiguous.
  - Currently mem_read allows only processes that are currently
    ptrace'ing the target (and are still able to ptrace it) to read from
    the target.  This check could possibly be moved to the open call, but
    it's not clear exactly what race this restriction is stopping (the
    reason appears to have been lost).
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other.
  - It doesn't allow for some future uses of the interface we would like
    to consider adding (see below).
  - Interestingly, reading from /proc/pid/mem currently actually
    involves two copies!  (But this could be fixed pretty easily.)

As mentioned previously, use of vmsplice instead was considered, but it
has problems.  Since you need the reader and writer working
co-operatively, if the pipe is not drained then you block.  This requires
some wrapping to do non-blocking IO on the send side, or polling on the
receive side.  In all-to-all communication it requires ordering, otherwise
you can deadlock.  And in the example of many MPI tasks writing to one MPI
task, vmsplice serialises the copying.

There are some cases of MPI collectives where even a single-copy
interface does not get us the performance gain we could achieve.  For
example, in an MPI_Reduce, rather than copy the data from the source we
would like to instead use it directly in a math op (say the reduce is
doing a sum), as this would save us doing a copy.  We don't need to keep a
copy of the data from the source.  I haven't implemented this, but I think
this interface could in the future do all this through the use of the
flags - e.g. we could specify the math operation and type, and the kernel,
rather than just copying the data, would apply the specified operation
between the source and destination and store it in the destination.

Although we don't have a "second user" of the interface yet (though I've
had some nibbles from people who may be interested in using it for
intra-process messaging which is not MPI), this interface is something
hardware vendors are already implementing in their custom drivers for fast
local communication.  So in addition to being useful for OpenMPI, it would
mean the driver maintainers don't have to fix things up when the mm
changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
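
A userspace usage sketch following the draft man page (assumes a libc
wrapper for the new syscall, or an equivalent syscall(2) stub):

    #include <sys/uio.h>

    /* copy len bytes from remote_addr in process pid into buf; returns
     * bytes transferred, or -1 with errno set */
    static ssize_t copy_from_process(pid_t pid, void *remote_addr,
                                     void *buf, size_t len)
    {
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }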

This has been implemented for x86 and powerpc; other architectures should
mainly (I think) just need to add syscall numbers for process_vm_readv
and process_vm_writev.  There are 32-bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWhen we get corruption reports, it's useful to see if the kernel was
Dave Jones [Wed, 24 Aug 2011 23:46:39 +0000 (09:46 +1000)]
When we get corruption reports, it's useful to see if the kernel was
tainted, to rule out problems we can't do anything about.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 years agoWhen we get corruption reports, it's useful to see if the kernel was
Dave Jones [Wed, 24 Aug 2011 23:46:38 +0000 (09:46 +1000)]
When we get corruption reports, it's useful to see if the kernel was
tainted, to rule out problems we can't do anything about.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>