If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time. This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.
This bug has been in the ext4 code since almost the very beginning of
ext4's development. Fortunately once the data is stored in the page
cache cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult. Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.
The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area. That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.
When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate. Hence, we need to make sure the
BH_Unwritten is not left set, since the combination of BH_Mapped and
BH_Unwritten is not allowed; among other things, it will result ext4's
get_block() to be called over and over again during the write_begin
phase of write(2).
Use a very large unsigned number (~0xffff) as as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.
We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.
Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.
Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature
bit is set. The field is normally zero, but older versions of e2fsck
didn't automatically check to make sure of this, so in the spirit of
"be liberal in what you accept", don't look at i_file_acl_high unless
we are using a 64-bit filesystem.
If the block containing external extended attributes (which is stored
in i_file_acl and i_file_acl_high) is larger than the on-disk
filesystem, the process which tried to access the extended attributes
will endlessly issue kernel printks complaining that
"__find_get_block_slow() failed", locking up that CPU until the system
is forcibly rebooted.
So when we read in the inode, make sure the i_file_acl value is legal,
and if not, flag the filesystem as being corrupted.
Update information about locking in JBD2 revoke code. Inconsistency in
comments found by Lin Tan <tammy000@gmail.com>
CC: Lin Tan <tammy000@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file. This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code. This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.
With delayed allocation we should not/cannot discard inode prealloc
space during file close. We would still have dirty pages for which we
haven't allocated blocks yet. With this fix after each get_blocks
request we check whether we have zero reserved blocks and if yes and
we don't have any writers on the file we discard inode prealloc space.
When renaming a file such that a link to another inode is overwritten,
force any delay allocated blocks that to be allocated so that if the
filesystem is mounted with data=ordered, the data blocks will be
pushed out to disk along with the journal commit. Many application
programs expect this, so we do this to avoid zero length files if the
system crashes unexpectedly.
When closing a file that had been previously truncated, force any
delay allocated blocks that to be allocated so that if the filesystem
is mounted with data=ordered, the data blocks will be pushed out to
disk along with the journal commit. Many application programs expect
this, so we do this to avoid zero length files if the system crashes
unexpectedly.
Add an ioctl which forces all of the delay allocated blocks to be
allocated. This also provides a function ext4_alloc_da_blocks() which
will be used by the following commits to force files to be fully
allocated to preserve application-expected ext3 behaviour.
Some poeple are reading the ext4 feature list too literally and create
dubious test cases involving very long filenames and 1k blocksize and
then complain when they run into an htree-imposed limit. So add fine
print to the "fix 32000 subdirectory limit" ext4 feature.
ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext4_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted
such that a directory entry references a deleted inode. This leads to
a misleading error message - "Stale NFS file handle" - and confusion
on the part of the admin.
The bug can be easily reproduced by creating a new filesystem, making
a link to an unused inode using debugfs, then mounting and attempting
to ls -l said link.
This patch thus changes ext4_lookup to return -EIO if it receives
-ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
corruption; and also invokes the appropriate ext*_error functions when
this case is detected.
At the moment there are few restrictions on which flags may be set on
which inodes. Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.
Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.
At present INDEX and EXTENTS are the only flags that new ext4 inodes do
NOT inherit from their parent. In addition prevent the flags DIRTY,
ECOMPR, IMAGIC, TOPDIR, HUGE_FILE and EXT_MIGRATE from being inherited.
List inheritable flags explicitly to prevent future flags from
accidentally being inherited.
This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.
HPET on AMD 81xx chipset needs a second write (with HPET_TN_SETVAL
cleared) to T0_CMP register to set the period in periodic mode.
With this patch HPET_COUNTER is still stopped but not reset when HPET
is programmed in periodic mode. This should help to avoid races when
HPET is programmed in periodic mode and fixes a boot time hang that
I've observed on a machine when using 1000HZ.
[ Impact: fix boot time hang on machines with AMD 81xx chipset ]
Reported-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Tested-by: Jeff Mahoney <jeffm@suse.com>
LKML-Reference: <20090421180037.GA2763@alberich.amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
x86: hpet: stop HPET_COUNTER when programming periodic mode
Impact: fix system hang on some systems operating with HZ_1000
On a system that stalled with HZ_1000, the first value written to
T0_CMP (when the main counter was not stopped) did not trigger an
interrupt. Instead after the main counter wrapped around (after
several minutes) an interrupt was triggered and afterwards the
periodic interrupt took effect.
This can be fixed by implementing HPET spec recommendation for
programming the periodic mode (i.e. stopping the main counter).
When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed. If it
has already passed exit_notify() we can leak a zombie, because a) ptracing
disables the auto-reaping logic, and b) ->real_parent was not notified
about the child's death.
ptrace_detach() should follow the ptrace_exit's logic, change the code
accordingly.
ignoring_children() takes parent->sighand->siglock and checks
k_sigaction[SIGCHLD] atomically. But this buys nothing, we can't get the
"really" wrong result even if we race with sigaction(SIGCHLD). If we read
the "stale" sa_handler/sa_flags we can pretend it was changed right after
the check.
Remove spin_lock(->siglock), and kill "int ign" which caches the result of
ignoring_children() which becomes rather trivial.
Perhaps it makes sense to export this helper, do_notify_parent() can use
it too.
Move the code from __ptrace_detach() to its single caller and kill this
helper.
Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks.
Actually, I think task_is_stopped_or_traced() makes more sense, but this
needs another patch.
The commit a760a6656e6f00bb0144a42a048cf0266646e22c (crypto:
api - Fix module load deadlock with fallback algorithms) broke
the auto-loading of algorithms that require fallbacks. The
problem is that the fallback mask check is missing an and which
cauess bits that should be considered to interfere with the
result.
Since the padlock-aes driver doesn't require a fallback (it's
only padlock-sha that does), it should use the aes alias rather
than aes-all so that ones that do need a fallback can use it.
When request_key() is called, without there being any standard process
keyrings on which to fall back if a destination keyring is not specified, an
oops is liable to occur when construct_alloc_key() calls down_write() on
dest_keyring's semaphore.
Due to function inlining this may be seen as an oops in down_write() as called
from request_key_and_link().
This situation crops up during boot, where request_key() is called from within
the kernel (such as in CIFS mounts) where nobody is actually logged in, and so
PAM has not had a chance to create a session keyring and user keyrings to act
as the fallback.
To fix this, make construct_alloc_key() not attempt to cache a key if there is
no fallback key if no destination keyring is given specifically.
Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
ACM sets the low latency flag but calls the flip buffer routines from
IRQ context which isn't permitted (and as of 2.6.29 causes a warning
hence this one was caught)
Fortunatelt ACM doesn't need to set this flag in the first place as it
only set it to work around problems in ancient (pre tty flip rewrite)
kernels.
ide_tape_issue_pc() assumed drive->pc isn't NULL on invocation when
checking for back-to-back request sense issues but drive->pc can be
NULL and even when it's not NULL, it's not safe to dereference it once
the previous command is complete because pc could have been freed or
was on stack. Kill back-to-back REQUEST_SENSE detection.
This patch fixes the following regression that occurred during the
scsi_dma_map()/unmap()
changes when compiling with CONFIG_DMA_API_DEBUG=y :
WARNING: at lib/dma-debug.c:496 check_unmap+0x142/0x542()
Hardware name:
3w-xxxx 0000:02:02.0: DMA-API: device driver tries to free DMA memory
it has not allocated [device address=0x0000000000000000] [size=36
bytes]
Signed-off-by: Adam Radford <aradford@gmail.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Martin Knoblauch reports that trying to build 2.6.30-rc6-git3 with
RHEL4.3 userspace (gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)) causes an
internal compiler error (ICE):
and after some debugging it turns out that it's due to the code trying
to figure out the rough value of the current stack pointer by taking an
address of an uninitialized variable and casting that to an integer.
This is clearly a compiler bug, but it's not worth fighting - while the
current stack kernel pointer might be somewhat hard to predict in user
space, it's also not generally going to change for a lot of the call
chains for a particular process.
So just drop it, and mumble some incoherent curses at the compiler.
Tested-by: Martin Knoblauch <spamtrap@knobisoft.de> Cc: Matt Mackall <mpm@selenic.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It's a really simple patch that basically just open-codes the current
"secure_ip_id()" call, but when open-coding it we now use a _static_
hashing area, so that it gets updated every time.
And to make sure somebody can't just start from the same original seed of
all-zeroes, and then do the "half_md4_transform()" over and over until
they get the same sequence as the kernel has, each iteration also mixes in
the same old "current->pid + jiffies" we used - so we should now have a
regular strong pseudo-number generator, but we also have one that doesn't
have a single seed.
Note: the "pid + jiffies" is just meant to be a tiny tiny bit of noise. It
has no real meaning. It could be anything. I just picked the previous
seed, it's just that now we keep the state in between calls and that will
feed into the next result, and that should make all the difference.
I made that hash be a per-cpu data just to avoid cache-line ping-pong:
having multiple CPU's write to the same data would be fine for randomness,
and add yet another layer of chaos to it, but since get_random_int() is
supposed to be a fast interface I did it that way instead. I considered
using "__raw_get_cpu_var()" to avoid any preemption overhead while still
getting the hash be _mostly_ ping-pong free, but in the end good taste won
out.
Add barrier() to bnx2_get_hw_{tx|rx}_cons() to fix this issue:
http://bugzilla.kernel.org/show_bug.cgi?id=12698
This issue was reported by multiple i386 users. Without barrier(),
the compiled code looks like the following where %eax contains the
address of the tx_cons or rx_cons in the DMA status block. The
status block contents can change between the cmpb and the movzwl
instruction. The driver would crash if the value was not 0xff during
the cmpb instruction, but changed to 0xff during the movzwl
instruction.
Thanks to Pascal de Bruijn <pmjdebruijn@pcode.nl> for reporting the
problem and Holger Noefer <hnoefer@pironet-ndh.com> for patiently
testing test patches for us.
[greg - took out version change]
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
"There is another problem with this piece of code. The sband will be NULL
after second iteration on single band device and cause null pointer
dereference. Everything is working with dual band card. Sorry, but i
don't know how to explain this clearly in English. I have looked on the
second patch for pid algorithm and found similar bug."
Reported-by: Karol Szuster <qflon@o2.pl> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
pid doesn't count with some band having more bitrates than the one
associated the first time.
Fix that by counting the maximal available bitrate count and allocate
big enough space.
Secondly, fix touching uninitialized memory which causes panics.
Index sucked from this random memory points to the hell.
The fix is to sort the rates on each band change.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
minstrel doesn't count max rate count in fact, since it doesn't use
a loop variable `i' and hence allocs space only for bitrates found in
the first band.
Fix it by involving the `i' as an index so that it traverses all the
bands now and finds the real max bitrate count.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We need to be symmetrical in what is done when key is set and cleared.
This is important wrt the key flags as they are used during key
clearing and if they are not set when the key is set the key cannot be
cleared completely.
This addresses the many occurences of the WARN found in
iwl_set_tkip_dynamic_key_info() and tracked in
http://www.kerneloops.org/searchweek.php?search=iwl_set_dynamic_key
If calling iwl_set_tkip_dynamic_key_info()/iwl_remove_dynamic_key()
pair a few times in a row will cause that we run out of key space.
This is because the index stored in the key flags is used by
iwl_remove_dynamic_key() to decide if it should remove the key.
Unfortunately the key flags, and hence the key index is currently only
set at the time the key is written to the device (in
iwl_update_tkip_key()) and _not_ in iwl_set_tkip_dynamic_key_info().
Fix this by setting flags in iwl_set_tkip_dynamic_key_info().
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Jeff Kirsher [Tue, 2 Jun 2009 23:38:52 +0000 (16:38 -0700)]
igb: fix LRO warning
This fix is only needed for 2.6.29.y tree, since in 2.6.30 and later IGB
has moved to using GRO instead of LRO.
igb supports LRO, but was not setting any hooks to the ->set_flags
ethtool_ops function. This would trigger warnings if the user tried
to enable or disable LRO.
Based on the patch provided by Stephen Hemminger <shemminger@vyatta.com>
Reported-by: Sergey Kononenko <sergk@sergk.org.ua> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Patch to fix bad length checking in e1000. E1000 by default does two
things:
1) Spans rx descriptors for packets that don't fit into 1 skb on recieve
2) Strips the crc from a frame by subtracting 4 bytes from the length prior to
doing an skb_put
Since the e1000 driver isn't written to support receiving packets that span
multiple rx buffers, it checks the End of Packet bit of every frame, and
discards it if its not set. This places us in a situation where, if we have a
spanning packet, the first part is discarded, but the second part is not (since
it is the end of packet, and it passes the EOP bit test). If the second part of
the frame is small (4 bytes or less), we subtract 4 from it to remove its crc,
underflow the length, and wind up in skb_over_panic, when we try to skb_put a
huge number of bytes into the skb. This amounts to a remote DOS attack through
careful selection of frame size in relation to interface MTU. The fix for this
is already in the e1000e driver, as well as the e1000 sourceforge driver, but no
one ever pushed it to e1000. This is lifted straight from e1000e, and prevents
small frames from causing the underflow described above
Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Tested-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Eric Paris [Mon, 1 Jun 2009 14:21:05 +0000 (10:21 -0400)]
SELinux: BUG in SELinux compat_net code
This patch is not applicable to Linus's tree as the code in question has
been removed for 2.6.30. I'm sending in case any of the stable
maintainers would like to push to their branches (which I think anything
pre 2.6.30 would like to do).
Ubuntu users were experiencing a kernel panic when they enabled SELinux
due to an old bug in our handling of the compatibility mode network
controls, introduced Jan 1 2008 effad8df44261031a882e1a895415f7186a5098e
Most distros have not used the compat_net code since the new code was
introduced and so noone has hit this problem before. Ubuntu is the only
distro I know that enabled that legacy cruft by default. But, I was ask
to look at it and found that the above patch changed a call to
avc_has_perm from if(send_perm) to if(!send_perm) in
selinux_ip_postroute_iptables_compat(). The result is that users who
turn on SELinux and have compat_net set can (and oftern will) BUG() in
avc_has_perm_noaudit since they are requesting 0 permissions.
This patch corrects that accidental bug introduction.
Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It is possible for ide-cd to ignore ide_error()'s return value under
some circumstances. Workaround it in ide_intr() and ide_timer_expiry()
by checking if there is a device/port reset pending currently.
Under CONFIG_MAXSMP, cpus_hardware_enabled is allocated from the heap and
not statically initialized. This causes a crash on reboot when kvm thinks
vmx is enabled on random nonexistent cpus and accesses nonexistent percpu
lists.
Fix by explicitly clearing the variable.
Reported-and-tested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It turns out that such devices lack cable detection altogether
(which in turn results in incorrect detection of 40-wire cables
by our current cable detection strategy) so always handle them
by trusting host-side cable detection only.
When AMD C1E is enabled, local APIC timer will stop even in C1. To avoid
suspend/resume hang, this patch removes C1 and replace it with a cpu_relax() in
suspend/resume path. This hasn't any impact in runtime path.
http://bugzilla.kernel.org/show_bug.cgi?id=13233
[ impact: avoid suspend/resume hang in AMD CPU with C1E enabled ]
Tested-by: Dmitry Lyzhyn <thisistempbox@yahoo.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 5b7f3a50 (fix dataflash 64-bit divisions) unfortunately
introduced a typo. Erase addr and len were swapped in the pageaddr
calculation, causing the wrong sectors to get erased.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk> Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Pascal reported and bisected a commit:
| x86/PCI: don't call e820_all_mapped with -1 in the mmconfig case
which broke one system system.
ACPI: Using IOAPIC for interrupt routing
PCI: MCFG configuration 0: base f0000000 segment 0 buses 0 - 255
PCI: MCFG area at f0000000 reserved in ACPI motherboard resources
PCI: Using MMCONFIG for extended config space
it didn't have
PCI: updated MCFG configuration 0: base f0000000 segment 0 buses 0 - 63
anymore, and try to use 0xf000000 - 0xffffffff for mmconfig
For 32bit, mcfg_res->end could be 32bit only (if 64 resources aren't used)
So use end - 1 to pass the value in mcfg->end to avoid overflow.
We don't need to worry about the e820 path, they are always 64 bit.
This patch (as1244) fixes a crash in usb-serial that occurs when a
sub-driver returns a positive value from its attach method, indicating
that new firmware was loaded and the device will disconnect and
reconnect. The usb-serial core then skips the step of registering the
port devices; when the disconnect occurs, the attempt to unregister
the ports fails dramatically.
This problem shows up with Keyspan devices and it might affect others
as well.
When the attach method returns a positive value, the patch sets
num_ports to 0. This tells usb_serial_disconnect() not to try
unregistering any of the ports; instead they are cleaned up by
destroy_serial().
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Kernel 2.6.18 broke the MotU Fastlane, which uses duplicate endpoint
numbers in a manner that is not only illegal but also confuses the
kernel's endpoint descriptor caching mechanism. To work around this, we
have to add a separate usb_set_interface() call to guide the USB core to
the correct descriptors.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Reported-and-tested-by: David Fries <david@fries.net> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The option driver (and presumably others) allocates several URBs when it
opens and tries to free them when it closes. The isp1760_urb_dequeue
function gets called, but the packet being dequeued is not necessarily at
the
front of one of the 32 queues. If not, the isp1760_urb_done function doesn't
get called for the URB and the process trying to free it hangs forever on a
wait_queue. This patch does two things. If the URB being dequeued has others
queued behind it, it re-queues them. And it searches the queues looking for
the URB being dequeued rather than just looking at the one at the front of
the queue.
hugetlbfs reserves huge pages but does not fault them at mmap() time to
ensure that future faults succeed. The reservation behaviour differs
depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE.
For MAP_SHARED mappings, hugepages are reserved when mmap() is first
called and are tracked based on information associated with the inode.
Other processes mapping MAP_SHARED use the same reservation. MAP_PRIVATE
track the reservations based on the VMA created as part of the mmap()
operation. Each process mapping MAP_PRIVATE must make its own
reservation.
hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag
and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs,
VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened
read-write. If a shared memory mapping was mapped shared-read-write for
populating of data and mapped shared-read-only by other processes, then
hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This
causes processes to fail to map the file MAP_SHARED even though it should
succeed as the reservation is there.
This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE
when the intent of the code was to check whether the VMA was mapped
MAP_SHARED or MAP_PRIVATE.
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <starlight@binnacle.cx> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
On x86 and x86-64, it is possible that page tables are shared beween
shared mappings backed by hugetlbfs. As part of this,
page_table_shareable() checks a pair of vma->vm_flags and they must match
if they are to be shared. All VMA flags are taken into account, including
VM_LOCKED.
The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment forks() to exec() a helper, there will be shared
VMAs with different flags. The impact is that the shared segment is
sometimes considered shareable and other times not, depending on what
process is checking.
What happens is that the segment page tables are being shared but the
count is inaccurate depending on the ordering of events. As the page
tables are freed with put_page(), bad pmd's are found when some of the
children exit. The hugepage counters also get corrupted and the Total and
Free count will no longer match even when all the hugepage-backed regions
are freed. This requires a reboot of the machine to "fix".
This patch addresses the problem by comparing all flags except VM_LOCKED
when deciding if pagetables should be shared or not for hugetlbfs-backed
mapping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <starlight@binnacle.cx> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Its possible for cfg80211 to have scheduled the work and for
the global workqueue to not have kicked in prior to a cfg80211
driver's regulatory hint or wiphy_apply_custom_regulatory().
Although this is very unlikely its possible and should fix
this race. When this race would happen you are expected to have
hit a null pointer dereference panic.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Tested-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The processor is documented to reload the PDPTRs while in PAE mode if any
of the CR4 bits PSE, PGE, or PAE change. Linux relies on this
behaviour when zapping the low mappings of PAE kernels during boot.
The code already handled changes to CR4.PAE; augment it to also notice changes
to PSE and PGE.
This triggered while booting an F11 PAE kernel; the futex initialization code
runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem
ended up uninitialized, killing PI futexes and pulseaudio which uses them.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The paravirt tlb flush may be used not only to flush TLBs, but also
to reload the four page-directory-pointer-table entries, as it is used
as a replacement for reloading CR3. Change the code to do the entire
CR3 reloading dance instead of simply flushing the TLB.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Actually the icom driver is crashing when is being removed because
the driver is kfreeing the adapter structure before calling
pci_release_regions(), which result in the following error:
get_event_name uses sprintf to fill a buffer declared on the stack. It fills
the buffer 2 bytes at a time. What the code doesn't take into account is that
sprintf(buf, "%02x", data) actually writes 3 bytes. 2 bytes for the data and
then it nul terminates the string. Since we declare buf to be 40 characters
long and then we write 40 bytes of data into buf sprintf is going to write 41
characters. The fix is to leave room in buf for the nul terminator.
Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This situation can occur when attempting to attach a block device whose
backend is an empty physical CD-ROM driver. The backend in this case
will go directly from the Initialising state to Closing->Closed.
Previously this would result in a NULL pointer deref on info->gd
(xenbus_dev_fatal does not return as a1a15ac5 seems to expect)
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The futex code installs a read only mapping via get_user_pages_fast()
even if the futex op function has to modify user space data. The
eventual fault was fixed up by futex_handle_fault() which walked the
VMA with mmap_sem held.
After the cleanup patches which removed the mmap_sem dependency of the
futex code commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex:
clean up fault logic) removed the private VMA walk logic from the
futex code. This change results in a stale RO mapping which is not
fixed up.
Instead of reintroducing the previous fault logic we set up the
mapping in get_user_pages_fast() read/write for all operations which
modify user space data. Also handle private futexes in the same way
and make the current unconditional access_ok(VERIFY_WRITE) depend on
the futex op.
Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The problem is that permission checking is skipped if atomic open is
possible, but when exec opens a file, it just opens it O_READONLY which
means EXEC permission will not be checked at that time.
This problem is observed by the following sequence (executed as root):
mount -t nfs4 server:/ /mnt4
echo "ls" >/mnt4/foo
chmod 744 /mnt4/foo
su guest -c "mnt4/foo"
When sending a message to user space using wimax_msg(), if nla_put()
fails, correctly interpret the return code from wimax_msg_alloc() as
an err ptr and return the error code instead of crashing (as it is
assuming than non-NULL means the pointer is ok).
Commit c45d6320 ("fix reference counting of ftdi_private") stopped
ftdi_sio_port_remove() from directly freeing the port-private data, with
the intention if the port was still open, it would be freed when
ftdi_close() is eventually called and releases the last refcount on the
structure.
That's all very well, but ftdi_sio_port_remove() still contains a call
to usb_set_serial_port_data(port, NULL) -- so by the time we get to
ftdi_close() for the port which was unplugged, it _still_ oopses on
dereferencing that NULL pointer, as it did before (and does in 2.6.29).
The fix is just not to clear the private data in ftdi_sio_port_remove().
Then the refcount is properly reduced to zero when the final kref_put()
happens in ftdi_close().
Remove a bogus comment too, while we're at it. And stop doing things
inside "if (priv)" -- it must _always_ be there.
Based loosely on an earlier patch by Daniel Mack, and suggestions by
Alan Stern.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Tested-by: Daniel Mack <daniel@caiaq.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If there is a dummy "espdma" or "ledma" parent device above ESP scsi
or LE ethernet device nodes, we have to match the bus as SBUS.
Otherwise the address and size cell counts are wrong and we don't
calculate the final physical device resource values correctly at all.
Commit 5280267c1dddb8d413595b87dc406624bb497946 ("sparc: Fix handling
of LANCE and ESP parent nodes in of_device.c") was meant to fix this
problem, but that only influences the inner loop of
build_device_resources(). We need this logic to also kick in at the
beginning of build_device_resources() as well, when we make the first
attempt to determine the device's immediate parent bus type for 'reg'
property element extraction.
Based almost entirely upon a patch by Friedrich Oslage.
Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The 8169 chip only generates MSI interrupts when all enabled event
sources are quiescent and one or more sources transition to active. If
not all of the active events are acknowledged, or a new event becomes
active while the existing ones are cleared in the handler, we will not
see a new interrupt.
The current interrupt handler masks off the Rx and Tx events once the
NAPI handler has been scheduled, which opens a race window in which we
can get another Rx or Tx event and never ACK'ing it, stopping all
activity until the link is reset (ifconfig down/up). Fix this by always
ACK'ing all event sources, and loop in the handler until we have all
sources quiescent.
Signed-off-by: David Dillow <dave@thedillows.org> Tested-by: Michael Buesch <mb@bu3sch.de> Tested-by: Michael Riepe <michael.riepe@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix locking issue in alb MAC address management; removed
incorrect locking and replaced with correct locking. This bug was
introduced in commit 059fe7a578fba5bbb0fdc0365bfcf6218fa25eb0
("bonding: Convert locks to _bh, rework alb locking for new locking")
Bug reported by Paul Smith <paul@mad-scientist.net>, who also
tested the fix.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Changeset ca17584bf2ad1b1e37a5c0e4386728cc5fc9dabc ("mac8390: update
to net_device_ops") broke mac8390 by adding 8390.o to the link. That
meant that lib8390.c was included twice, once in mac8390.c and once in
8390.c, subject to different macros. This patch reverts that by
avoiding the wrappers in 8390.c. They seem to be of no value since
COMPAT_NET_DEV_OPS is going away soon.
Tested with a Kinetics EtherPort card.
Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Check whether the underlying device provides a set of ethtool ops before
checking for individual handlers to avoid NULL pointer dereferences.
Reported-by: Art van Breemen <ard@telegraafnet.nl> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
typo -- pkt_dev->nflows is for stats only, the number of concurrent
flows is stored in cflows.
Reported-By: Vladimir Ivashchenko <hazard@francoudi.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339
Quoted here because its a perfect one :
begin_of_quotation
2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
patch has at least one critical flaw, and another problem.
rt_intern_hash calculates rthi pointer, which is later used for new entry
insertion. The same loop calculates cand pointer which is used to clean the
list. If the pointers are the same, rtable leak occurs, as first the cand is
removed then the new entry is appended to it.
This leak leads to unregister_netdevice problem (usage count > 0).
Another problem of the patch is that it tries to insert the entries in certain
order, to facilitate counting of entries distinct by all but QoS parameters.
Unfortunately, referencing an existing rtable entry moves it to list beginning,
to speed up further lookups, so the carefully built order is destroyed.
For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
it will also destroy the ordering.
end_of_quotation
Trying to keep dst_entries ordered is too complex and breaks the fact that
order should depend on the frequency of use for garbage collection.
A possible fix is to make rt_intern_hash() simpler, and only makes
rt_check_expire() a litle bit smarter, being able to cope with an arbitrary
entries order. The added loop is running on cache hot data, while cpu
is prefetching next object, so should be unnoticied.
Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
rt_check_expire() computes average and standard deviation of chain lengths,
but not correclty reset length to 0 at beginning of each chain.
This probably gives overflows for sum2 (and sum) on loaded machines instead
of meaningful results.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When called with a consumed value that is less than skb_headlen(skb)
bytes into a page frag, skb_seq_read() incorrectly returns an
offset/length relative to skb->data. Ensure that data which should come
from a page frag does.
Signed-off-by: Thomas Chenault <thomas_chenault@dell.com> Tested-by: Shyam Iyer <shyam_iyer@dell.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
A long-standing feature in tcp_init_metrics() is such that
any of its goto reset prevents call to tcp_init_cwnd().
Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 518a09ef11 (tcp: Fix recvmsg MSG_PEEK influence of
blocking behavior) lets the loop run longer than the race check
did previously expect, so we need to be more careful with this
check and consider the work we have been doing.
I tried my best to deal with urg hole madness too which happens
here:
if (!sock_flag(sk, SOCK_URGINLINE)) {
++*seq;
...
by using additional offset by one but I certainly have very
little interest in testing that part.
Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Frans Pop <elendil@planet.nl> Tested-by: Ian Zimmermann <itz@buug.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When kernel inserts a temporary SA for IKE, it uses the wrong hash
value for dst list. Two hash values were calcultated before: one with
source address and one with a wildcard source address.
Bug hinted by Junwei Zhang <junwei.zhang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The MPC5200 PSC device is wired up to a dedicated interrupt line
which is never shared. This patch removes the IRQF_SHARED flag
from the request_irq() call which eliminates the "IRQF_DISABLED
is not guaranteed on shared IRQs" warning message from the console
output.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Reviewed-by: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch fixes an invalid pointer access in case the receive queue
holds no pointer to the next skb when the queue is empty.
Signed-off-by: Hannes Hering <hering2@de.ibm.com> Signed-off-by: Jan-Bernd Themann <themann@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Rearrange locking of i_mutex on destination and call to
ocfs2_rw_lock() so locks are only held while buffers are copied with
the pipe_to_file() actor, and not while waiting for more data on the
pipe.
Rearrange locking of i_mutex on destination so it's only held while
buffers are copied with the pipe_to_file() actor, and not while
waiting for more data on the pipe.
splice_from_pipe_next() will wait (if necessary) for more buffers to
be added to the pipe. splice_from_pipe_feed() will feed the buffers
to the supplied actor and return when there's no more data available
(or if all of the requested data has been copied).
This is necessary so that implementations can do locking around the
non-waiting splice_from_pipe_feed().
This patch should not cause any change in behavior.
This was an omission from commit 26c3679101dbccc054dcf370143941844ba70531
"fuse: destroy bdi on umount", which moved the bdi_destroy() call from
fuse_conn_put() to fuse_put_super().
KVM optimizes guest port 80 accesses by passthing them through to the host.
Some AMD machines die on port 80 writes, allowing the guest to hard-lock the
host.
Remove the port passthrough to avoid the problem.
Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Tested-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch (as1240) adds the NOGET quirk for three devices from CH
Products: the Pro pedals, the Combatstick joystick, and the Flight-Sim
yoke. Without these quirks, the devices haven't worked for many
kernel releases. Sometimes replugging them after boot-up would get
them to work and sometimes they wouldn't work at all.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Sean Hildebrand <silverwraithii@gmail.com> Reported-by: Sid Boyce <sboyce@blueyonder.co.uk> Tested-by: Sean Hildebrand <silverwraithii@gmail.com> Tested-by: Sid Boyce <sboyce@blueyonder.co.uk> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>