git.kernelconcepts.de Git - karo-tx-linux.git/log
karo-tx-linux.git
10 years ago Add linux-next specific files for 20130912 next-20130912
Stephen Rothwell [Thu, 12 Sep 2013 04:15:32 +0000 (14:15 +1000)]
Add linux-next specific files for 20130912

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
10 years ago Merge branch 'akpm/master'
Stephen Rothwell [Thu, 12 Sep 2013 03:49:57 +0000 (13:49 +1000)]
Merge branch 'akpm/master'

10 years ago mm/Kconfig: add MMU dependency for MIGRATION.
Chen Gang [Wed, 28 Aug 2013 00:18:21 +0000 (10:18 +1000)]
mm/Kconfig: add MMU dependency for MIGRATION.

MIGRATION must depend on MMU, or allmodconfig for the nommu sh
architecture fails to build:

    CC      mm/migrate.o
  mm/migrate.c: In function 'remove_migration_pte':
  mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration]
     if (pmd_trans_huge(*pmd))
     ^
  mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration]
    if (!is_swap_pte(pte))
    ^
  ...

Also let CMA depend on MMU, or when NOMMU, if we select CMA, it will
select MIGRATION by force.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago kernel: replace strict_strto*() with kstrto*()
Jingoo Han [Wed, 28 Aug 2013 00:17:52 +0000 (10:17 +1000)]
kernel: replace strict_strto*() with kstrto*()

The usage of strict_strto*() is not preferred, because strict_strto*() is
obsolete.  Thus, kstrto*() should be used.
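
As a rough userspace illustration of the strict parsing contract that
kstrto*()-style helpers are generally expected to offer (the whole string
must be a number, with at most one trailing newline), the sketch below wraps
strtoul().  The name parse_ulong_strict() is hypothetical and this is not
the kernel implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace helper; NOT the kernel's kstrtoul(). */
static int parse_ulong_strict(const char *s, unsigned int base,
			      unsigned long *res)
{
	char *end;
	unsigned long val;

	assert(s != NULL && res != NULL);

	/* strtoul() accepts leading whitespace and '-'; reject both here. */
	if (*s == '+')
		s++;
	if (!isalnum((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;		/* value does not fit */
	if (end == s)
		return -EINVAL;		/* no digits consumed */

	/* Tolerate a single trailing newline, nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*res = val;
	return 0;
}
```

Callers check the return value instead of inspecting an end pointer, so
trailing garbage such as "42abc" is rejected rather than silently parsed
as 42.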

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm, thp: count thp_fault_fallback anytime thp fault fails
David Rientjes [Wed, 28 Aug 2013 00:17:51 +0000 (10:17 +1000)]
mm, thp: count thp_fault_fallback anytime thp fault fails

Currently, thp_fault_fallback in vmstat only gets incremented if a
hugepage allocation fails.  If current's memcg hits its limit or the page
fault handler returns an error, it is incorrectly accounted as a
successful thp_fault_alloc.

Count thp_fault_fallback anytime the page fault handler falls back to
using regular pages and only count thp_fault_alloc when a hugepage has
actually been faulted.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:51 +0000 (10:17 +1000)]
thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()

do_huge_pmd_anonymous_page() has a copy-pasted piece of handle_mm_fault()
to handle the fallback path.

Let's consolidate the code by introducing the VM_FAULT_FALLBACK return
code.
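
The idea of signalling "fall back" through a return code, so that a single
caller owns the fallback path, can be sketched in plain C as follows.  All
sketch_* names and the enum values are hypothetical stand-ins, not the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this sketch; not the kernel's values. */
enum sketch_fault {
	SKETCH_FAULT_OK = 0,
	SKETCH_FAULT_FALLBACK = 1,
};

struct sketch_stats {
	unsigned long huge_faults;
	unsigned long small_faults;
};

/* Huge-page path: report FALLBACK instead of duplicating the small path. */
static enum sketch_fault sketch_huge_fault(bool huge_page_available,
					   struct sketch_stats *st)
{
	if (!huge_page_available)
		return SKETCH_FAULT_FALLBACK;
	st->huge_faults++;
	return SKETCH_FAULT_OK;
}

/* Regular-page path. */
static enum sketch_fault sketch_small_fault(struct sketch_stats *st)
{
	st->small_faults++;
	return SKETCH_FAULT_OK;
}

/* The single caller decides what to do when the huge path gives up. */
static enum sketch_fault sketch_handle_fault(bool huge_page_available,
					     struct sketch_stats *st)
{
	enum sketch_fault ret;

	assert(st != NULL);
	ret = sketch_huge_fault(huge_page_available, st);
	if (ret != SKETCH_FAULT_FALLBACK)
		return ret;
	return sketch_small_fault(st);
}
```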

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago thp: do_huge_pmd_anonymous_page() cleanup
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:50 +0000 (10:17 +1000)]
thp: do_huge_pmd_anonymous_page() cleanup

Minor cleanup: unindent most of the function's code by inverting one
condition.  It's preparation for the next patch.

No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:50 +0000 (10:17 +1000)]
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

It's confusing that mk_huge_pmd() has semantics different from mk_pte() or
mk_pmd().  I spent some time debugging an issue caused by this
inconsistency.

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust prototype
to match mk_pte().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm: cleanup add_to_page_cache_locked()
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:49 +0000 (10:17 +1000)]
mm: cleanup add_to_page_cache_locked()

Make add_to_page_cache_locked() cleaner:

 - unindent most code of the function by inverting one condition;
 - streamline the no-error code path;
 - move the insert error path outside the normal code path;
 - call radix_tree_preload_end() earlier.

No functional changes.
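
The "invert one condition to unindent" style of cleanup can be illustrated
with a small, generic C sketch.  The sketch_* names are hypothetical and the
code is unrelated to the page cache; it only shows the before/after shape
and checks that both variants behave the same.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_slot {
	int used;
	int value;
};

/* Before: the main work is nested inside the success conditions. */
static int sketch_insert_nested(struct sketch_slot *slot, int value)
{
	int err = 0;

	if (slot != NULL) {
		if (!slot->used) {
			slot->value = value;
			slot->used = 1;
		} else {
			err = -EEXIST;
		}
	} else {
		err = -EINVAL;
	}
	return err;
}

/* After: error paths leave early, the no-error path reads straight down. */
static int sketch_insert_flat(struct sketch_slot *slot, int value)
{
	if (slot == NULL)
		return -EINVAL;
	if (slot->used)
		return -EEXIST;

	slot->value = value;
	slot->used = 1;
	return 0;
}

/* Both variants must agree, since the cleanup claims no functional change. */
static void sketch_check_equivalent(int used, int value)
{
	struct sketch_slot a = { used, 0 }, b = { used, 0 };

	assert(sketch_insert_nested(&a, value) == sketch_insert_flat(&b, value));
	assert(a.used == b.used && a.value == b.value);
}
```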

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago thp: account anon transparent huge pages into NR_ANON_PAGES
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:48 +0000 (10:17 +1000)]
thp: account anon transparent huge pages into NR_ANON_PAGES

We use NR_ANON_PAGES as the base for reporting AnonPages to the user.  It
makes little sense to leave transparent huge pages out of that counter and
add them back only when printing to the user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm-drop-actor-argument-of-do_generic_file_read-fix
Andrew Morton [Wed, 28 Aug 2013 00:17:48 +0000 (10:17 +1000)]
mm-drop-actor-argument-of-do_generic_file_read-fix

fix mm-drop-actor-argument-of-do_generic_file_read for linux-next changes

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm: drop actor argument of do_generic_file_read()
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:47 +0000 (10:17 +1000)]
mm: drop actor argument of do_generic_file_read()

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor().  No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago truncate: drop 'oldsize' truncate_pagecache() parameter
Kirill A. Shutemov [Wed, 28 Aug 2013 00:17:47 +0000 (10:17 +1000)]
truncate: drop 'oldsize' truncate_pagecache() parameter

truncate_pagecache() doesn't care about old size since cedabed49b ("vfs:
Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm: make lru_add_drain_all() selective
Chris Metcalf [Wed, 28 Aug 2013 00:17:46 +0000 (10:17 +1000)]
mm: make lru_add_drain_all() selective

Make lru_add_drain_all() only selectively interrupt the cpus that have
per-cpu free pages that can be drained.

This is important in nohz mode where calling mlockall(), for example,
otherwise will interrupt every core unnecessarily.

This is important on workloads where nohz cores are handling 10 Gb traffic
in userspace.  Those CPUs do not enter the kernel and place pages into LRU
pagevecs and they really, really don't want to be interrupted, or they
drop packets on the floor.
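
A minimal userspace model of the selective behaviour described above: only
CPUs that actually have pages parked are touched, and the rest are left
undisturbed.  The struct and function names are hypothetical, and the real
per-cpu checks in the kernel are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state: pages parked in that CPU's pagevecs. */
struct sketch_cpu {
	unsigned int pending_pages;
	bool interrupted;
};

/* Drain only CPUs with pending pages; return how many were interrupted. */
static size_t sketch_drain_selective(struct sketch_cpu cpus[], size_t nr)
{
	size_t i, interrupted = 0;

	assert(cpus != NULL);
	for (i = 0; i < nr; i++) {
		if (cpus[i].pending_pages == 0)
			continue;	/* nothing to drain: leave this CPU alone */
		cpus[i].interrupted = true;
		cpus[i].pending_pages = 0;
		interrupted++;
	}
	return interrupted;
}
```

In this model a CPU that never entered the kernel has pending_pages == 0
and is therefore never interrupted.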

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: document cgroup dirty/writeback memory statistics
Sha Zhengju [Wed, 28 Aug 2013 00:17:45 +0000 (10:17 +1000)]
memcg: document cgroup dirty/writeback memory statistics

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: add per cgroup writeback pages accounting
Sha Zhengju [Wed, 28 Aug 2013 00:17:45 +0000 (10:17 +1000)]
memcg: add per cgroup writeback pages accounting

Add memcg routines to count writeback pages; dirty pages will also be
accounted later.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
accounting"), we can use 'struct page' flag to test page state instead of
per page_cgroup flag.  But memcg has a feature to move a page from a
cgroup to another one and may have race between "move" and "page stat
accounting".  So in order to avoid the race we have designed a new lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b) (writeback pages accounting) to be protected
by mem_cgroup_{begin/end}_update_page_stat().  It's a full no-op for
!CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an rcu
read lock in most cases (no task is moving), and spin_lock_irqsave on
top in the slow path.

There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
And the lock order is:
--> memcg->move_lock
  --> mapping->tree_lock

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: check for proper lock held in mem_cgroup_update_page_stat
Sha Zhengju [Wed, 28 Aug 2013 00:17:44 +0000 (10:17 +1000)]
memcg: check for proper lock held in mem_cgroup_update_page_stat

We should call mem_cgroup_begin_update_page_stat() before
mem_cgroup_update_page_stat() to get proper locks; however, the latter
doesn't check that we use proper locking, which would be hard to do.
As suggested by Michal Hocko, we can at least test for rcu_read_lock_held()
because RCU is held if !mem_cgroup_disabled().

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: remove MEMCG_NR_FILE_MAPPED
Sha Zhengju [Wed, 28 Aug 2013 00:17:43 +0000 (10:17 +1000)]
memcg: remove MEMCG_NR_FILE_MAPPED

While accounting memcg page stat, it's not worth using
MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
complexity and presumed performance overhead.  We can use
MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: reduce function dereference
Sha Zhengju [Wed, 28 Aug 2013 00:17:43 +0000 (10:17 +1000)]
memcg: reduce function dereference

This function dereferences res far too often, so optimize it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: avoid overflow caused by PAGE_ALIGN
Sha Zhengju [Wed, 28 Aug 2013 00:17:42 +0000 (10:17 +1000)]
memcg: avoid overflow caused by PAGE_ALIGN

PAGE_ALIGN aligns up (to the next page boundary), so the value may
overflow after PAGE_ALIGN, for example when writing the MAX value to
*.limit_in_bytes.

$ cat /cgroup/memory/memory.limit_in_bytes
18446744073709551615

# echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
bash: echo: write error: Invalid argument

Some user programs might depend on such behaviour (libcg, for example,
reads the value in a snapshot and later uses it to reset the cgroup), and
the failure will cause confusion.  So we need to fix it.
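
The wrap-around can be reproduced in userspace with a macro of the same
shape as PAGE_ALIGN().  The 4096-byte page size, the sketch_* names, and the
"clamp to the maximum on overflow" policy are assumptions made for this
sketch; the log does not show how the patch itself handles the case.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size */

/* Round up to the next page boundary, same shape as PAGE_ALIGN(). */
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/* Naive: wraps around for values within one page of ULLONG_MAX. */
static unsigned long long sketch_align_naive(unsigned long long val)
{
	return SKETCH_PAGE_ALIGN(val);
}

/* Guarded: detect the wrap and clamp to the maximum (assumed policy). */
static unsigned long long sketch_align_guarded(unsigned long long val)
{
	unsigned long long aligned = SKETCH_PAGE_ALIGN(val);

	if (aligned < val)		/* the addition wrapped past zero */
		return ULLONG_MAX;

	assert(aligned % SKETCH_PAGE_SIZE == 0);
	return aligned;
}
```

With the naive version, aligning ULLONG_MAX yields 0, which is why writing
the value that was just read back can be rejected.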

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
Sha Zhengju [Wed, 28 Aug 2013 00:17:41 +0000 (10:17 +1000)]
memcg: rename RESOURCE_MAX to RES_COUNTER_MAX

RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: correct RESOURCE_MAX to ULLONG_MAX
Sha Zhengju [Wed, 28 Aug 2013 00:17:41 +0000 (10:17 +1000)]
memcg: correct RESOURCE_MAX to ULLONG_MAX

The current RESOURCE_MAX is ULONG_MAX, but the value used to set a resource
limit is an unsigned long long, so a value bigger than the maximum can be
set, which is strange.  XXX_MAX should be a reasonable maximum; anything
bigger should overflow.

Note that this change affects the user-visible output of the default
*.limit_in_bytes:
before change:
$ cat /cgroup/memory/memory.limit_in_bytes
9223372036854775807

after change:
$ cat /cgroup/memory/memory.limit_in_bytes
18446744073709551615

But it doesn't alter the API in terms of input: we can still use "echo -1
> *.limit_in_bytes" to reset the numbers to "unlimited".
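
The two numbers shown above are what "%llu" prints for LLONG_MAX and
ULLONG_MAX when long long is 64 bits wide, and -1 converted to unsigned
long long is exactly ULLONG_MAX.  The sketch below checks this in
userspace; the sketch_* helpers and the rule that the literal string "-1"
means "unlimited" are assumptions made for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical input rule for this sketch: "-1" means "unlimited". */
static unsigned long long sketch_parse_limit(const char *input)
{
	assert(input != NULL);
	if (strcmp(input, "-1") == 0)
		return ULLONG_MAX;
	return strtoull(input, NULL, 10);
}

/* Format a limit the way a "%llu" read-out would show it. */
static const char *sketch_limit_str(unsigned long long limit)
{
	static char buf[32];

	snprintf(buf, sizeof(buf), "%llu", limit);
	return buf;
}
```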

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm: memcg: do not trap chargers with full callstack on OOM
Johannes Weiner [Wed, 28 Aug 2013 00:17:40 +0000 (10:17 +1000)]
mm: memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0              # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase the
group's limit.  But a userspace OOM handler is prone to deadlock itself on
the locks held by the waiting tasks.  For example one of the sleeping
tasks may be stuck in a brk() call with the mmap_sem held for writing but
the userspace handler, in order to pick an optimal victim, may need to
read files from /proc/<pid>, which tries to acquire the same mmap_sem for
reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.
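
The two-step flow described in point 2 can be modelled in userspace roughly
as follows.  Every sketch_* name is hypothetical, and a counter stands in
for sleeping on the OOM waitqueue; this illustrates the control flow only,
not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_memcg {
	bool oom_handled_elsewhere;	/* kernel killer or userspace handler busy */
	unsigned int waiters;		/* stands in for the OOM waitqueue */
};

/* Hypothetical per-task state: which group's OOM we ran into, if any. */
struct sketch_task {
	struct sketch_memcg *oom_memcg;
	unsigned int locks_held;
};

enum sketch_action {
	SKETCH_NOT_MEMCG,		/* -ENOMEM came from somewhere else */
	SKETCH_RESTART_FAULT,
	SKETCH_SLEEP_THEN_RESTART,
};

/* Step 1, deep in the charge path (locks held): record and fail. */
static int sketch_charge_fail(struct sketch_task *t, struct sketch_memcg *cg)
{
	assert(t != NULL && cg != NULL);
	t->oom_memcg = cg;		/* remember the OOMing group */
	return -ENOMEM;			/* unwind instead of sleeping here */
}

/* Step 2, after the fault stack is unwound: decide what to do. */
static enum sketch_action sketch_fault_oom(struct sketch_task *t)
{
	struct sketch_memcg *cg = t->oom_memcg;

	assert(t->locks_held == 0);	/* waiting is only safe without locks */
	if (cg == NULL)
		return SKETCH_NOT_MEMCG;

	t->oom_memcg = NULL;
	if (cg->oom_handled_elsewhere) {
		cg->waiters++;
		return SKETCH_SLEEP_THEN_RESTART;
	}
	return SKETCH_RESTART_FAULT;
}
```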

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm: memcg: rework and document OOM waiting and wakeup
Johannes Weiner [Wed, 28 Aug 2013 00:17:40 +0000 (10:17 +1000)]
mm: memcg: rework and document OOM waiting and wakeup

The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies.  However, it would be nice if this construct would be
clearly recognizable and not be as obfuscated as it is right now.  Clean
up as follows:

1. Remove the return value of mem_cgroup_oom_unlock()

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
   makes it more obvious that the task has to be on the waitqueue
   before attempting to OOM-trylock the hierarchy, to not miss any
   wakeups before going to sleep.  It just didn't matter until now
   because it was all lumped together into the global memcg_oom_lock
   spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
   It is protected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
   lock in any given hierarchy atomically.  Restrict its scope to
   mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
   function.  Only the lockholder has to wake up the next in line
   after releasing the lock.

   Note that the lockholder kicks off the OOM-killer, which in turn
   leads to wakeups from the uncharges of the exiting task.  But a
   contender is not guaranteed to see them if it enters the OOM path
   after the OOM kills but before the lockholder releases the lock.
   Thus there has to be an explicit wakeup after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
   under OOM as that is the point where we start to receive wakeups.
   No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
   symmetry.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago mm: memcg: enable memcg OOM killer only for user faults
Johannes Weiner [Wed, 28 Aug 2013 00:17:39 +0000 (10:17 +1000)]
mm: memcg: enable memcg OOM killer only for user faults

System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really the
only option available.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago x86: finish user fault error path with fatal signal
Johannes Weiner [Wed, 28 Aug 2013 00:17:38 +0000 (10:17 +1000)]
x86: finish user fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the task
has a fatal signal pending.  For a subsequent patch this is a problem in
OOM situations because it relies on pagefault_out_of_memory() being called
even when the task has been killed, to perform proper per-task OOM state
unwinding.

Shortcutting the fault like this is a rather minor optimization that saves
a few instructions in rare cases.  Just remove it for user-triggered
faults.

Use the opportunity to split the fault retry handling from actual fault
errors and add locking documentation that reads surprisingly similar to
ARM's.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago arch: mm: pass userspace fault flag to generic fault handler
Johannes Weiner [Wed, 28 Aug 2013 00:17:38 +0000 (10:17 +1000)]
arch: mm: pass userspace fault flag to generic fault handler

Unlike global OOM handling, memory cgroup code will invoke the OOM killer
in any OOM situation because it has no way of telling faults occurring in
kernel context - which could be handled more gracefully - from
user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.
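
A small model of the flag-passing idea: the architecture-level handler
knows whether the fault came from user mode and passes that on as a flag,
and the generic code uses it to decide whether escalating to the OOM killer
is appropriate.  The flag name, its value, and the sketch_* functions are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_FAULT_FLAG_USER 0x01u	/* hypothetical flag bit */

struct sketch_result {
	int err;
	bool oom_killer_invoked;
};

/* Generic code: on allocation failure, only user faults escalate. */
static struct sketch_result sketch_generic_fault(unsigned int flags,
						 bool alloc_ok)
{
	struct sketch_result r = { 0, false };

	if (alloc_ok)
		return r;

	r.err = -ENOMEM;
	if (flags & SKETCH_FAULT_FLAG_USER)
		r.oom_killer_invoked = true;
	return r;
}

/* Arch code: it alone knows whether the fault came from user mode. */
static struct sketch_result sketch_arch_fault(bool from_user_mode,
					      bool alloc_ok)
{
	unsigned int flags = 0;
	struct sketch_result r;

	if (from_user_mode)
		flags |= SKETCH_FAULT_FLAG_USER;

	r = sketch_generic_fault(flags, alloc_ok);
	/* Kernel-context faults must never reach the OOM killer. */
	assert(from_user_mode || !r.oom_killer_invoked);
	return r;
}
```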

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago arch: mm: do not invoke OOM killer on kernel fault OOM
Johannes Weiner [Wed, 28 Aug 2013 00:17:37 +0000 (10:17 +1000)]
arch: mm: do not invoke OOM killer on kernel fault OOM

Kernel faults are expected to handle OOM conditions gracefully (gup,
uaccess etc.), so they should never invoke the OOM killer.  Reserve this
for faults triggered in user context when it is the only option.

Most architectures already do this, fix up the remaining few.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago arch: mm: remove obsolete init OOM protection
Johannes Weiner [Wed, 28 Aug 2013 00:17:37 +0000 (10:17 +1000)]
arch: mm: remove obsolete init OOM protection

The memcg code can trap tasks in the context of the failing allocation
until an OOM situation is resolved.  They can hold all kinds of locks (fs,
mm) at this point, which makes it prone to deadlocking.

This series converts memcg OOM handling into a two step process that is
started in the charge context, but any waiting is done after the fault
stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
faults, because they can gracefully unwind the stack with -ENOMEM.  OOM
handling is restricted to user triggered faults that have no other option.

Patch 6 reworks memcg's hierarchical OOM locking to make it a little more
obvious wth is going on in there: reduce locked regions, rename locking
functions, reorder and document.

Patch 7 implements the two-part OOM handling such that tasks are never
trapped with the full charge stack in an OOM situation.

This patch:

Back before smart OOM killing, when faulting tasks were killed directly on
allocation failures, the arch-specific fault handlers needed special
protection for the init process.

Now that all fault handlers call into the generic OOM killer (609838c "mm:
invoke oom-killer from remaining unconverted page fault handlers"), which
already provides init protection, the arch-specific leftovers can be
removed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: trivial cleanups
Andrew Morton [Wed, 28 Aug 2013 00:17:36 +0000 (10:17 +1000)]
memcg: trivial cleanups

Clean up some mess made by the "Soft limit rework" series, and a few other
things.

Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg, vmscan: do not fall into reclaim-all pass too quickly
Michal Hocko [Wed, 28 Aug 2013 00:17:35 +0000 (10:17 +1000)]
memcg, vmscan: do not fall into reclaim-all pass too quickly

shrink_zone starts with a soft reclaim pass and then falls back to
regular reclaim if nothing has been scanned.  This behavior is natural but
there is a catch.  Memcg iterators, when used with the reclaim cookie, are
designed to help prevent over-reclaim by interleaving reclaimers
(per node-zone-priority) so the tree walk might miss many (even all) nodes
in the hierarchy e.g.  when there are direct reclaimers racing with each
other or with kswapd in the global case or multiple allocators reaching
the limit for the target reclaim case.  To make it even more complicated,
targeted reclaim doesn't do the whole tree walk because it stops
reclaiming once it reclaims sufficient pages.  As a result groups over the
limit might be missed, thus nothing is scanned, and reclaim would fall
back to the reclaim all mode.

This patch checks for the incomplete tree walk in shrink_zone.  If no
group has been visited and the hierarchy is soft reclaimable then we must
have missed some groups, in which case the __shrink_zone is called again.
This doesn't guarantee there will be some progress of course because the
current reclaimer might be still racing with others but it would at least
give a chance to start the walk without a big risk of reclaim latencies.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years ago memcg: track all children over limit in the root
Michal Hocko [Wed, 28 Aug 2013 00:17:35 +0000 (10:17 +1000)]
memcg: track all children over limit in the root

Children in soft limit excess are currently tracked up the hierarchy in
memcg->children_in_excess.  Nevertheless, there might still be many groups
that are not in a hierarchical relation to the root cgroup (e.g.  all
first level groups if root_mem_cgroup->use_hierarchy == false).

As the whole tree walk has to be done when the iteration starts at
root_mem_cgroup the iterator should be able to skip the walk if there is
no child above the limit without iterating them.  This can be done easily
if the root tracks all children rather than only hierarchical children.
This is done by this patch which updates root_mem_cgroup
children_in_excess if root_mem_cgroup->use_hierarchy == false so the root
knows about all children in excess.

Please note that this is not an issue for inner memcgs which have
use_hierarchy == false because then only the single group is visited so no
special optimization is necessary.
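
The bookkeeping described above can be modelled in userspace along these
lines: a change in "excess" state is propagated to hierarchical ancestors,
and the root is additionally updated when it does not use hierarchy, so it
still knows about every child in excess.  The struct layout and the
sketch_* helpers are hypothetical simplifications, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_group {
	struct sketch_group *parent;
	bool use_hierarchy;
	int children_in_excess;
};

/* A parent only counts as hierarchical if it has use_hierarchy set. */
static struct sketch_group *sketch_hier_parent(struct sketch_group *g)
{
	if (g->parent != NULL && g->parent->use_hierarchy)
		return g->parent;
	return NULL;
}

/* Apply delta (+1 or -1) for group g entering or leaving excess. */
static void sketch_update_excess(struct sketch_group *root,
				 struct sketch_group *g, int delta)
{
	struct sketch_group *p = g;

	assert(root != NULL && g != NULL);

	/* Regular case: every hierarchical ancestor is updated. */
	while ((p = sketch_hier_parent(p)) != NULL)
		p->children_in_excess += delta;

	/* Extra case: a non-hierarchical root still tracks all children. */
	if (g != root && !root->use_hierarchy)
		root->children_in_excess += delta;
}
```

A walk starting at the root can then be skipped whenever the root's
children_in_excess is zero, without visiting any child.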

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg, vmscan: do not attempt soft limit reclaim if it would not scan anything
Michal Hocko [Wed, 28 Aug 2013 00:17:34 +0000 (10:17 +1000)]
memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything

mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass is
done and it currently always says yes.  Memcg iterators are clever enough
to skip nodes that are not soft reclaimable quite efficiently, but
mem_cgroup_should_soft_reclaim can be more clever and not start the soft
reclaim pass at all if it knows that nothing would be scanned anyway.

In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for the
target group of the reclaim and allow the pass only if the whole subtree
wouldn't be skipped.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: track children in soft limit excess to improve soft limit
Michal Hocko [Wed, 28 Aug 2013 00:17:33 +0000 (10:17 +1000)]
memcg: track children in soft limit excess to improve soft limit

Soft limit reclaim has to check the whole reclaim hierarchy while doing
the first pass of the reclaim.  This leads to a higher system time which
can be visible especially when there are many groups in the hierarchy.

This patch adds a per-memcg counter of children in excess.  It also
restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a
proper batching.

If a group crosses the soft limit for the first time it increases
children_in_excess in its parents up the hierarchy.  Similarly, if a group
gets below the limit it decreases the counter.  The transition is recorded
in the soft_contributed flag.

mem_cgroup_soft_reclaim_eligible then uses this information to better
decide whether to skip the node or the whole subtree.  The rule is simple:
skip just the node if it has children in excess, or skip the whole subtree
otherwise.
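
A minimal model of the counter update and of the skip rule might look like
the sketch below.  It is not the kernel implementation: the Filter values,
the Memcg class and both function names are assumptions made for
illustration, and usage/soft_limit are plain integers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Filter(Enum):
    VISIT = "visit"          # reclaim from this group
    SKIP = "skip"            # skip the group, keep walking its children
    SKIP_TREE = "skip_tree"  # skip the group and everything below it


@dataclass
class Memcg:
    soft_limit: int
    usage: int = 0
    parent: Optional["Memcg"] = None
    children_in_excess: int = 0
    soft_contributed: bool = False


def update_soft_limit_state(group: Memcg) -> None:
    """Record a crossing of the soft limit in all ancestors, exactly once."""
    over = group.usage > group.soft_limit
    if over == group.soft_contributed:
        return  # no transition since the last update
    group.soft_contributed = over
    delta = 1 if over else -1
    node = group.parent
    while node is not None:
        node.children_in_excess += delta
        node = node.parent


def soft_reclaim_eligible(group: Memcg) -> Filter:
    """Visit groups in excess (or below a parent in excess), else skip."""
    node = group
    while node is not None:
        if node.usage > node.soft_limit:
            return Filter.VISIT
        node = node.parent
    if group.children_in_excess > 0:
        return Filter.SKIP
    return Filter.SKIP_TREE
```

The soft_contributed flag is what keeps repeated updates from counting the
same group twice.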

This has been tested by a stream IO (dd if=/dev/zero of=file with
4*MemTotal size) which is quite sensitive to overhead during reclaim.  The
load is running in a group either with the soft limit set to 0 or without
any limit.  Apart from that there was a hierarchy with ~500, 2k and 8k
groups (two groups on each level) without any pages in them.  base denotes
the kernel on which the whole series is based, rework is the kernel before
this patch and reworkoptim is with this patch applied:

* Run with soft limit set to 0
Elapsed
0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
System
0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
Elapsed
0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
System
2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
Elapsed
2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
System
8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
Elapsed
8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3

System time is increased by 30-40% but it is reduced a lot compared to
the kernel without this patch.  The higher time can be explained by the
fact that the original soft reclaim scanned at priority 0 so it was much
more effective for this workload (which is basically touch once and
writeback).  The Elapsed time looks better though (~20%).
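
The bracketed figures in the tables are consistent with each value being
expressed as a percentage of the corresponding base value.  A small
sketch of that arithmetic, with relative_to_base() being a hypothetical
helper name:

```python
def relative_to_base(value: float, base: float) -> float:
    """Return `value` as a percentage of `base`, rounded to one decimal."""
    return round(100.0 * value / base, 1)


# Elapsed avg, 0-0-limit: rework 77.84 against base 91.73
elapsed_rework = relative_to_base(77.84, 91.73)

# System avg, 8k-0-limit: reworkoptim 54.43 against base 41.09
system_reworkoptim = relative_to_base(54.43, 41.09)
```

A result below 100 means the patched kernel took less time than base, a
result above 100 means it took more.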

* Run with no soft limit set
System
0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
Elapsed
0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
System
0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
Elapsed
0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
System
2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
Elapsed
2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
System
8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
Elapsed
8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3

Both System and Elapsed are within stdev of the base kernel for all
configurations except for 8k where both System and Elapsed are up by 35%.
I do not have a good explanation for this because there is no soft reclaim
pass going on as no group is above the limit which is checked in
mem_cgroup_should_soft_reclaim.

Then I have tested a kernel build with the same configuration to see the
behavior with a more general workload.

* Soft limit set to 0 for the build
System
0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
0-0-limit/rework: min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
Elapsed
0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
0-0-limit/rework: min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
System
0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
0.5k-0-limit/rework: min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
Elapsed
0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
0.5k-0-limit/rework: min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
System
2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
2k-0-limit/rework: min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
Elapsed
2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
2k-0-limit/rework: min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
System
8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
8k-0-limit/rework: min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
Elapsed
8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
8k-0-limit/rework: min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3

Apart from 8k the system time is comparable with the base kernel while
Elapsed is up to 20% better with all configurations.

* No soft limit set
System
0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
0-no-limit/rework: min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
Elapsed
0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
0-no-limit/rework: min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
System
0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
0.5k-no-limit/rework: min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
Elapsed
0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
0.5k-no-limit/rework: min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
System
2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
2k-no-limit/rework: min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
Elapsed
2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
2k-no-limit/rework: min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
System
8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
8k-no-limit/rework: min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
Elapsed
8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
8k-no-limit/rework: min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3

and these numbers look good as well.  System time is around 100%
(surprisingly better for the 8k case) and Elapsed copies that trend.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: fix memcg-less page reclaim
Hugh Dickins [Wed, 28 Aug 2013 00:17:33 +0000 (10:17 +1000)]
mm: fix memcg-less page reclaim

Now that everybody loves memcg, configures it on, and would not dream
of booting with cgroup_disable=memory, it can pass unnoticed for weeks
that memcg-less page reclaim is completely broken.

mmotm's "memcg: enhance memcg iterator to support predicates" replaces
__shrink_zone()'s "do { } while (memcg);" loop by a "while (memcg) {}"
loop: which is nicer for memcg, but does nothing for !CONFIG_MEMCG or
cgroup_disable=memory.  Page reclaim hangs, making no progress.

Adding mem_cgroup_disabled() and once++ test there is ugly.  Ideally,
even a !CONFIG_MEMCG build might in future have a stub root_mem_cgroup,
which would get around this: but that's not so at present.

However, it appears that nothing actually dereferences the memcg pointer
in the mem_cgroup_disabled() case, here or anywhere else that case can
reach mem_cgroup_iter() (mem_cgroup_iter_break() is not called in
global reclaim).

So, simply pass back an ordinarily-oopsing non-NULL address the first
time, and we shall hear about it if I'm wrong.
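
The difference between the two loop shapes, and the effect of handing back
a non-NULL value on the first call, can be sketched as follows.  This is a
toy model, not kernel code: DUMMY stands in for the "ordinarily-oopsing
non-NULL address", None stands in for NULL, and all function names are
hypothetical.

```python
DUMMY = object()  # placeholder for a non-NULL pointer that is never dereferenced


def iter_disabled_broken(prev):
    """Iterator with memcg disabled: there is never a group to return."""
    return None


def iter_disabled_fixed(prev):
    """Return a dummy on the first call (prev is None), then end the walk."""
    return DUMMY if prev is None else None


def reclaim_passes(iterate) -> int:
    """Model of the 'while (memcg) {}' loop; counts how often the body runs."""
    passes = 0
    memcg = iterate(None)
    while memcg is not None:
        passes += 1  # the zone's LRUs would be shrunk here
        memcg = iterate(memcg)
    return passes
```

With the broken iterator the loop body never runs, which matches the
reported hang; with the fixed one it runs exactly once, as the old
"do { } while (memcg)" loop did.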

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: enhance memcg iterator to support predicates
Michal Hocko [Wed, 28 Aug 2013 00:17:32 +0000 (10:17 +1000)]
memcg: enhance memcg iterator to support predicates

The caller of the iterator might know that some nodes or even subtrees
should be skipped but there is no way to tell iterators about that so the
only choice left is to let iterators visit each node and do the selection
outside of the iterating code.  This, however, doesn't scale well with
hierarchies with many groups where only a few groups are interesting.

This patch adds a mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node.  There are three
possible ways the callback can influence the walk: the node is visited,
the node is skipped but the tree walk continues down the tree, or the
whole subtree of the current group is skipped.
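
The three outcomes can be illustrated with a small pre-order walk.  This
is a sketch under assumptions rather than the kernel iterator: Group,
Filter and iter_cond() are hypothetical names, and the real iterator is
stateful and reference counted, which is ignored here.

```python
from enum import Enum


class Filter(Enum):
    VISIT = 1      # hand the node to the caller
    SKIP = 2       # do not hand it out, but descend into its children
    SKIP_TREE = 3  # ignore the node and its whole subtree


class Group:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def iter_cond(root, cond):
    """Yield groups in pre-order, asking cond(group) what to do with each."""
    stack = [root]
    while stack:
        group = stack.pop()
        verdict = cond(group)
        if verdict is Filter.SKIP_TREE:
            continue  # children are never pushed, so the subtree is pruned
        if verdict is Filter.VISIT:
            yield group
        stack.extend(reversed(group.children))


tree = Group("root", [
    Group("a", [Group("a1"), Group("a2")]),
    Group("b", [Group("b1")]),
])
verdicts = {"a": Filter.SKIP, "b": Filter.SKIP_TREE}
visited = [g.name for g in iter_cond(tree, lambda g: verdicts.get(g.name, Filter.VISIT))]
```

Pruning at the iterator level means the cost of the walk depends on the
number of interesting groups rather than on the size of the hierarchy.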

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agovmscan, memcg: do softlimit reclaim also for targeted reclaim
Michal Hocko [Wed, 28 Aug 2013 00:17:31 +0000 (10:17 +1000)]
vmscan, memcg: do softlimit reclaim also for targeted reclaim

Soft reclaim has been done only for the global reclaim (both background
and direct).  Since "memcg: integrate soft reclaim tighter with zone
shrinking code" there is no reason for this limitation anymore as the soft
limit reclaim doesn't use any special code paths and it is a part of the
zone shrinking code which is used by both global and targeted reclaims.

From the semantic point of view it is natural to consider the soft limit
before touching all groups in a hierarchy which has hit the hard limit,
because the soft limit tells us where to push back when there is memory
pressure.  It is not important whether the pressure comes from the limit
or imbalanced zones.

This patch simply enables soft reclaim unconditionally in
mem_cgroup_should_soft_reclaim so it is enabled for both global and
targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
about the root of the reclaim to know where to stop checking soft limit
state of parents up the hierarchy.  Say we have

A (over soft limit)
 \
  B (below s.l., hit the hard limit)
 / \
C   D (below s.l.)

B is now the source of the outside memory pressure for D but we shouldn't
soft reclaim D because it is behaving well under the B subtree and we can
still reclaim from C (presumably it is over the limit).
mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
hierarchy at B (root of the memory pressure).
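
The A/B/C/D example can be written down as a short sketch.  The Group
class and the boolean over_soft_limit field are assumptions made for
illustration; the function only models the "stop climbing at the reclaim
root" rule and returns a plain bool.

```python
class Group:
    def __init__(self, name, over_soft_limit, parent=None):
        self.name = name
        self.over_soft_limit = over_soft_limit
        self.parent = parent


def soft_reclaim_eligible(group, reclaim_root):
    """True if group, or an ancestor up to and including reclaim_root, is in excess."""
    node = group
    while node is not None:
        if node.over_soft_limit:
            return True
        if node is reclaim_root:
            break  # ancestors above the source of the pressure do not count
        node = node.parent
    return False


A = Group("A", over_soft_limit=True)
B = Group("B", over_soft_limit=False, parent=A)
C = Group("C", over_soft_limit=True, parent=B)   # assumed to be over its limit
D = Group("D", over_soft_limit=False, parent=B)
```

When B hits its hard limit the reclaim root is B, so D is left alone even
though A is over its soft limit; under pressure rooted at A, D would be
eligible again.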

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: get rid of soft-limit tree infrastructure
Michal Hocko [Wed, 28 Aug 2013 00:17:31 +0000 (10:17 +1000)]
memcg: get rid of soft-limit tree infrastructure

Now that the soft limit is integrated into reclaim directly, the whole
soft-limit tree infrastructure is not needed anymore.  Rip it out.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg, vmscan: integrate soft reclaim tighter with zone shrinking code
Michal Hocko [Wed, 28 Aug 2013 00:17:30 +0000 (10:17 +1000)]
memcg, vmscan: integrate soft reclaim tighter with zone shrinking code

This patchset has been sitting out of tree for quite some time without any
objections.  I would be really happy if it made it into 3.12.  I do not
want to push it too hard but I think this work is basically ready and
waiting more doesn't help.

The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is done in two passes now.  First it tries to do the soft
limit reclaim and it falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned.  The second pass happens at the
same priority so the only time we waste is the memcg tree walk which has
been updated in the third step to have only negligible overhead.

As a bonus we will get rid of a _lot_ of code by this and soft reclaim
will not stand out like before when it wasn't integrated into the zone
shrinking code and it reclaimed at priority 0 (the testing results show
that some workloads suffer from such an aggressive reclaim).  The clean
up is in a separate patch because I felt it would be easier to review that
way.

The second step is soft limit reclaim integration into targeted reclaim.
It should be rather straightforward.  Soft limit has been used only for
the global reclaim so far but it makes sense for any kind of pressure
coming from up-the-hierarchy, including targeted reclaim.

The third step (patches 4-8) addresses the tree walk overhead by enhancing
memcg iterators to enable skipping whole subtrees and tracking the number
of over soft limit children at each level of the hierarchy.  This
information is updated the same way the old soft limit tree was updated
(from memcg_check_events) so we shouldn't see any additional overhead.  In
fact mem_cgroup_update_soft_limit is much simpler than the tree
manipulation done previously.

__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
mem_cgroup_iter so the decision whether a particular group should be
visited is done at the iterator level which allows us to decide to skip
the whole subtree as well (if there is no child in excess).  This reduces
the tree walk overhead considerably.

* TEST 1
========

My primary test case was a parallel kernel build with 2 groups (make is
running with -j8 with a distribution .config in a separate cgroup without
any hard limit) on a 32 CPU machine booted with 1GB memory and both builds
bound by taskset to Node 0 cpus.

I was mostly interested in 2 setups: the default with no soft limit set,
and a 0 soft limit set for both groups.  The first one should tell us
whether the rework regresses the default behavior while the second one
should show us improvements in an extreme case where both workloads are
always over the soft limit.

/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after fresh boot without any other load on the
system.

base is mmotm-2013-07-18-16-40
rework is all 8 patches applied on top of base

* No-limit
User
no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
System
no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
Elapsed
no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6

The results are within noise. Elapsed time has a bigger variance but the
average looks good.

* 0-limit
User
0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
System
0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
Elapsed
0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6

The improvement is really huge here (even bigger than with my previous
testing and I suspect that this highly depends on the storage).  Page
fault statistics tell us at least part of the story:

Minor
0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
Major
0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6

Same as with my previous testing, Minor faults are more or less within
noise but the Major fault count is way below the base kernel.

While this looks like a nice win it is fair to say that the 0-limit
configuration is quite artificial. So I was playing with 0-no-limit
loads as well.

* TEST 2
========

The following results are from a 2 group configuration on a 16GB machine
(single NUMA node).

- A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
  2*TotalMem with 0 soft limit.
- B running a mem_eater which consumes TotalMem-1G without any limit. The
  mem_eater consumes the memory in 100 chunks with 1s nap after each
  mmap+populate so that both loads have a chance to fight for the memory.
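
The source of the mem_eater is not part of this message.  A rough sketch
of a load with the described shape (anonymous mappings, every page
touched, a nap after each chunk) might look like this; eat_memory() and
its parameters are hypothetical and the sizes are up to the caller.

```python
import mmap
import time


def eat_memory(total_bytes: int, chunks: int = 100, nap_seconds: float = 1.0):
    """Map total_bytes of anonymous memory in `chunks` steps and populate it."""
    chunk_size = total_bytes // chunks
    held = []
    for _ in range(chunks):
        region = mmap.mmap(-1, chunk_size)  # anonymous mapping
        # Touch one byte per page so that the pages are really allocated.
        for offset in range(0, chunk_size, mmap.PAGESIZE):
            region[offset] = 1
        held.append(region)  # keep a reference so the memory stays mapped
        time.sleep(nap_seconds)
    return held
```

The nap between chunks is what gives the competing stream IO time to
build up page cache before the next allocation step.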

The expected result is that B shouldn't be reclaimed and A shouldn't see
a big dropdown in elapsed time.

User
base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
System
base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
Elapsed
base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3

System time improved slightly as well as Elapsed. My previous testing
has shown worse numbers but this again seems to depend on the storage
speed.

My theory is that the writeback doesn't catch up and prio-0 soft reclaim
falls into waiting on writeback pages too often in the base kernel. The
patched kernel doesn't do that because the soft reclaim is done from the
kswapd/direct reclaim context. This can be seen nicely on the following
graph. Group A's usage_in_bytes regularly drops really low.

All 3 runs
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
resp. a detail of the single run
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png

mem_eater seems to be doing better as well. It gets to the full
allocation size faster as can be seen on the following graph:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png

/proc/meminfo collected during the test also shows that the rework kernel
hasn't swapped that much (well, almost not at all):
base: max: 123900 K avg: 56388.29 K
rework: max: 300 K avg: 128.68 K

kswapd and direct reclaim statistics are of no use unfortunately because
soft reclaim is not accounted properly as the counters are hidden by
global_reclaim() checks in the base kernel.

* TEST 3
========

Another test was the same configuration as TEST2 except the stream IO was
replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
left.

Kbuild did better with the rework kernel here as well:
User
base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
System
base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
Elapsed
base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
Minor
base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
Major
base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3

Again we can see a significant improvement in Elapsed (it also seems to
be more stable), there is a huge dropdown for the Major page faults and
much less swapping:
base: max: 583736 K avg: 112547.43 K
rework: max: 4012 K avg: 124.36 K

Graphs from all three runs show the variability of the kbuild quite
nicely.  It even seems that it took longer after every run with the base
kernel which would be quite surprising as the source tree for the build is
removed and caches are dropped after each run so the build operates on
freshly extracted sources every time.
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png

My other testing shows that this is just a matter of timing and other runs
behave differently; the std for Elapsed time is similar, ~50.  Example of
other three runs:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png

So to wrap this up: the series is still doing well and improves the soft
limit.

The testing results for a bunch of cgroups with both stream IO and kbuild
loads can be found in "memcg: track children in soft limit excess to
improve soft limit".

This patch:

Memcg soft reclaim has been traditionally triggered from the global
reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
then picked up a group which exceeds the soft limit the most and reclaimed
it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.

The infrastructure requires per-node-zone trees which hold over-limit
groups and keep them up-to-date (via memcg_check_events) which is not cost
free.  Although this overhead hasn't turned out to be a bottleneck the
implementation is suboptimal because mem_cgroup_update_tree has no idea
which zones consumed memory over the limit so we could easily end up
having a group on a node-zone tree having only a few pages from that
node-zone.

This patch doesn't try to fix node-zone trees management because
integrating soft reclaim into zone shrinking seems much easier and more
appropriate for several reasons.  First of all 0 priority reclaim was
a crude hack which might lead to big stalls if the group's LRUs are big
and hard to reclaim (e.g.  a lot of dirty/writeback pages).  Soft reclaim
should be applicable also to the targeted reclaim which is awkward right
now without additional hacks.  Last but not least the whole infrastructure
eats quite some code.

After this patch shrink_zone is done in 2 passes.  First it tries to do
the soft reclaim if appropriate (only for global reclaim for now to keep
compatible with the original state) and falls back to ignoring the soft
limit if no group is eligible for soft reclaim or nothing has been scanned
during the first pass.  Only groups which are over their soft limit, or
which have any parent up the hierarchy over the limit, are considered
eligible during the first pass.

The soft limit tree, which is not necessary anymore, will be removed in a
follow-up patch to make this patch smaller and easier to review.
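
The two-pass control flow can be sketched as below.  This is a toy model
under assumptions, not the kernel function: groups are arbitrary objects,
and eligible() and scan() are hypothetical callbacks standing in for the
soft limit check and for shrinking one group's LRUs.

```python
def shrink_zone(groups, eligible, scan, soft_reclaim=True):
    """Return (pages_scanned, fell_back) for one two-pass shrink attempt."""
    if soft_reclaim:
        # Pass 1: only groups that are eligible for soft reclaim.
        scanned = sum(scan(g) for g in groups if eligible(g))
        if scanned:
            return scanned, False
    # Pass 2: nothing eligible or nothing scanned, so ignore the soft limit.
    return sum(scan(g) for g in groups), True


pages = {"a": 0, "b": 32}
first = shrink_zone(pages, eligible=lambda g: g == "b", scan=pages.get)
second = shrink_zone(pages, eligible=lambda g: g == "a", scan=pages.get)
```

In the second call group "a" is eligible but has nothing to scan, so the
function falls back to scanning every group.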

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: remove redundant code in mem_cgroup_force_empty_write()
Li Zefan [Wed, 28 Aug 2013 00:17:30 +0000 (10:17 +1000)]
memcg: remove redundant code in mem_cgroup_force_empty_write()

vfs guarantees the cgroup won't be destroyed, so it's redundant to get a
css reference.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoMerge branch 'akpm-current/current'
Stephen Rothwell [Thu, 12 Sep 2013 03:28:18 +0000 (13:28 +1000)]
Merge branch 'akpm-current/current'

Conflicts:
Documentation/block/cmdline-partition.txt
drivers/block/aoe/aoeblk.c
drivers/rtc/rtc-hid-sensor-time.c
fs/namei.c
fs/namespace.c
include/linux/smp.h
kernel/fork.c
mm/mempolicy.c
mm/mlock.c
mm/sparse.c
scripts/checkpatch.pl

10 years agoMerge remote-tracking branch 'aio/master'
Stephen Rothwell [Thu, 12 Sep 2013 02:54:20 +0000 (12:54 +1000)]
Merge remote-tracking branch 'aio/master'

Conflicts:
fs/block_dev.c
fs/nfs/direct.c

10 years agoMerge remote-tracking branch 'lzo-update/lzo-update'
Stephen Rothwell [Thu, 12 Sep 2013 02:52:18 +0000 (12:52 +1000)]
Merge remote-tracking branch 'lzo-update/lzo-update'

10 years agoMerge remote-tracking branch 'dma-buf/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:50:28 +0000 (12:50 +1000)]
Merge remote-tracking branch 'dma-buf/for-next'

10 years agoMerge remote-tracking branch 'dma-mapping/dma-mapping-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:48:46 +0000 (12:48 +1000)]
Merge remote-tracking branch 'dma-mapping/dma-mapping-next'

10 years agoMerge remote-tracking branch 'samsung/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:46:44 +0000 (12:46 +1000)]
Merge remote-tracking branch 'samsung/for-next'

10 years agoMerge remote-tracking branch 'renesas/next'
Stephen Rothwell [Thu, 12 Sep 2013 02:45:00 +0000 (12:45 +1000)]
Merge remote-tracking branch 'renesas/next'

Conflicts:
arch/arm/boot/dts/r8a73a4.dtsi
arch/arm/boot/dts/r8a7790.dtsi

10 years agoMerge remote-tracking branch 'mvebu/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:43:05 +0000 (12:43 +1000)]
Merge remote-tracking branch 'mvebu/for-next'

Conflicts:
arch/arm/boot/dts/kirkwood.dtsi
drivers/pci/host/Kconfig
drivers/pci/host/pci-mvebu.c

10 years agoMerge remote-tracking branch 'msm/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:42:54 +0000 (12:42 +1000)]
Merge remote-tracking branch 'msm/for-next'

10 years agoMerge remote-tracking branch 'imx-mxs/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:41:03 +0000 (12:41 +1000)]
Merge remote-tracking branch 'imx-mxs/for-next'

10 years agoMerge remote-tracking branch 'ep93xx/ep93xx-for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:39:16 +0000 (12:39 +1000)]
Merge remote-tracking branch 'ep93xx/ep93xx-for-next'

10 years agoMerge remote-tracking branch 'cortex/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:39:09 +0000 (12:39 +1000)]
Merge remote-tracking branch 'cortex/for-next'

10 years agoMerge remote-tracking branch 'arm-soc/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:37:27 +0000 (12:37 +1000)]
Merge remote-tracking branch 'arm-soc/for-next'

10 years agoMerge remote-tracking branch 'vhost/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:35:34 +0000 (12:35 +1000)]
Merge remote-tracking branch 'vhost/linux-next'

10 years agoMerge remote-tracking branch 'bcon/master'
Stephen Rothwell [Thu, 12 Sep 2013 02:32:57 +0000 (12:32 +1000)]
Merge remote-tracking branch 'bcon/master'

Conflicts:
drivers/block/Kconfig

10 years agoMerge remote-tracking branch 'target-updates/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:30:28 +0000 (12:30 +1000)]
Merge remote-tracking branch 'target-updates/for-next'

10 years agoMerge remote-tracking branch 'scsi/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:26:24 +0000 (12:26 +1000)]
Merge remote-tracking branch 'scsi/for-next'

10 years agoMerge remote-tracking branch 'leds/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 02:24:24 +0000 (12:24 +1000)]
Merge remote-tracking branch 'leds/for-next'

Conflicts:
drivers/leds/leds-renesas-tpu.c

10 years agoMerge remote-tracking branch 'drivers-x86/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:59:40 +0000 (11:59 +1000)]
Merge remote-tracking branch 'drivers-x86/linux-next'

10 years agoMerge remote-tracking branch 'workqueues/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:57:59 +0000 (11:57 +1000)]
Merge remote-tracking branch 'workqueues/for-next'

10 years agoMerge remote-tracking branch 'xen-tip/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:56:14 +0000 (11:56 +1000)]
Merge remote-tracking branch 'xen-tip/linux-next'

10 years agoMerge remote-tracking branch 'tip/auto-latest'
Stephen Rothwell [Thu, 12 Sep 2013 01:54:09 +0000 (11:54 +1000)]
Merge remote-tracking branch 'tip/auto-latest'

10 years agoMerge remote-tracking branch 'spi/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:52:27 +0000 (11:52 +1000)]
Merge remote-tracking branch 'spi/for-next'

10 years agoMerge remote-tracking branch 'iommu/next'
Stephen Rothwell [Thu, 12 Sep 2013 01:49:18 +0000 (11:49 +1000)]
Merge remote-tracking branch 'iommu/next'

10 years agoMerge remote-tracking branch 'watchdog/master'
Stephen Rothwell [Thu, 12 Sep 2013 01:47:33 +0000 (11:47 +1000)]
Merge remote-tracking branch 'watchdog/master'

10 years agoMerge remote-tracking branch 'selinux/master'
Stephen Rothwell [Thu, 12 Sep 2013 01:42:27 +0000 (11:42 +1000)]
Merge remote-tracking branch 'selinux/master'

Conflicts:
security/selinux/hooks.c

10 years agoMerge remote-tracking branch 'regulator/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:40:38 +0000 (11:40 +1000)]
Merge remote-tracking branch 'regulator/for-next'

10 years agoMerge remote-tracking branch 'omap_dss2/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:40:33 +0000 (11:40 +1000)]
Merge remote-tracking branch 'omap_dss2/for-next'

10 years agoMerge remote-tracking branch 'fbdev/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:38:44 +0000 (11:38 +1000)]
Merge remote-tracking branch 'fbdev/for-next'

10 years agoMerge remote-tracking branch 'mfd-lj/for-mfd-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:36:58 +0000 (11:36 +1000)]
Merge remote-tracking branch 'mfd-lj/for-mfd-next'

10 years agoMerge remote-tracking branch 'slab/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:35:07 +0000 (11:35 +1000)]
Merge remote-tracking branch 'slab/for-next'

10 years agoMerge remote-tracking branch 'kgdb/kgdb-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:35:02 +0000 (11:35 +1000)]
Merge remote-tracking branch 'kgdb/kgdb-next'

10 years agoMerge remote-tracking branch 'block/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:32:57 +0000 (11:32 +1000)]
Merge remote-tracking branch 'block/for-next'

10 years agoMerge remote-tracking branch 'cgroup/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:31:13 +0000 (11:31 +1000)]
Merge remote-tracking branch 'cgroup/for-next'

10 years agoMerge remote-tracking branch 'drm-intel/for-linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:29:08 +0000 (11:29 +1000)]
Merge remote-tracking branch 'drm-intel/for-linux-next'

10 years agoMerge remote-tracking branch 'l2-mtd/master'
Stephen Rothwell [Thu, 12 Sep 2013 01:27:23 +0000 (11:27 +1000)]
Merge remote-tracking branch 'l2-mtd/master'

10 years agoMerge remote-tracking branch 'bluetooth/master'
Stephen Rothwell [Thu, 12 Sep 2013 01:25:37 +0000 (11:25 +1000)]
Merge remote-tracking branch 'bluetooth/master'

10 years agoMerge remote-tracking branch 'ipsec-next/master'
Stephen Rothwell [Thu, 12 Sep 2013 01:23:45 +0000 (11:23 +1000)]
Merge remote-tracking branch 'ipsec-next/master'

Conflicts:
include/net/xfrm.h

10 years agoMerge remote-tracking branch 'slave-dma/next'
Stephen Rothwell [Thu, 12 Sep 2013 01:23:31 +0000 (11:23 +1000)]
Merge remote-tracking branch 'slave-dma/next'

10 years agoMerge remote-tracking branch 'ubi/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:21:42 +0000 (11:21 +1000)]
Merge remote-tracking branch 'ubi/linux-next'

10 years agoMerge remote-tracking branch 'thermal/next'
Stephen Rothwell [Thu, 12 Sep 2013 01:19:55 +0000 (11:19 +1000)]
Merge remote-tracking branch 'thermal/next'

10 years agoMerge remote-tracking branch 'idle/next'
Stephen Rothwell [Thu, 12 Sep 2013 01:18:04 +0000 (11:18 +1000)]
Merge remote-tracking branch 'idle/next'

10 years agoMerge remote-tracking branch 'pm/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:16:20 +0000 (11:16 +1000)]
Merge remote-tracking branch 'pm/linux-next'

10 years agoMerge remote-tracking branch 'libata/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:16:17 +0000 (11:16 +1000)]
Merge remote-tracking branch 'libata/for-next'

10 years agoMerge remote-tracking branch 'kbuild/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:16:07 +0000 (11:16 +1000)]
Merge remote-tracking branch 'kbuild/for-next'

10 years agoMerge remote-tracking branch 'v4l-dvb/master'
Stephen Rothwell [Thu, 12 Sep 2013 01:16:04 +0000 (11:16 +1000)]
Merge remote-tracking branch 'v4l-dvb/master'

Conflicts:
drivers/media/platform/s5p-mfc/s5p_mfc_dec.c

10 years agoMerge remote-tracking branch 'hid/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:15:13 +0000 (11:15 +1000)]
Merge remote-tracking branch 'hid/for-next'

10 years agoMerge remote-tracking branch 'vfs/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:06:15 +0000 (11:06 +1000)]
Merge remote-tracking branch 'vfs/for-next'

10 years agoMerge remote-tracking branch 'xfs/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:04:26 +0000 (11:04 +1000)]
Merge remote-tracking branch 'xfs/for-next'

10 years agoMerge remote-tracking branch 'ubifs/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:02:43 +0000 (11:02 +1000)]
Merge remote-tracking branch 'ubifs/linux-next'

10 years agoMerge remote-tracking branch 'ocfs2/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 01:00:17 +0000 (11:00 +1000)]
Merge remote-tracking branch 'ocfs2/linux-next'

10 years agoMerge remote-tracking branch 'nfs/linux-next'
Stephen Rothwell [Thu, 12 Sep 2013 00:58:26 +0000 (10:58 +1000)]
Merge remote-tracking branch 'nfs/linux-next'

10 years agoMerge remote-tracking branch 'logfs/master'
Stephen Rothwell [Thu, 12 Sep 2013 00:56:17 +0000 (10:56 +1000)]
Merge remote-tracking branch 'logfs/master'

10 years agoMerge remote-tracking branch 'jfs/jfs-next'
Stephen Rothwell [Thu, 12 Sep 2013 00:54:33 +0000 (10:54 +1000)]
Merge remote-tracking branch 'jfs/jfs-next'

10 years agoMerge remote-tracking branch 'fuse/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 00:52:51 +0000 (10:52 +1000)]
Merge remote-tracking branch 'fuse/for-next'

10 years agoMerge remote-tracking branch 'fscache/fscache'
Stephen Rothwell [Thu, 12 Sep 2013 00:51:09 +0000 (10:51 +1000)]
Merge remote-tracking branch 'fscache/fscache'

10 years agoMerge remote-tracking branch 'ecryptfs/next'
Stephen Rothwell [Thu, 12 Sep 2013 00:49:26 +0000 (10:49 +1000)]
Merge remote-tracking branch 'ecryptfs/next'

10 years agoMerge remote-tracking branch 'cifs/for-next'
Stephen Rothwell [Thu, 12 Sep 2013 00:47:41 +0000 (10:47 +1000)]
Merge remote-tracking branch 'cifs/for-next'

10 years agoMerge remote-tracking branch 'ceph/master'
Stephen Rothwell [Thu, 12 Sep 2013 00:45:59 +0000 (10:45 +1000)]
Merge remote-tracking branch 'ceph/master'

10 years agoMerge remote-tracking branch 'btrfs/next'
Stephen Rothwell [Thu, 12 Sep 2013 00:44:11 +0000 (10:44 +1000)]
Merge remote-tracking branch 'btrfs/next'