Currently we are not including randomized stack size when calculating
mmap_base address in arch_pick_mmap_layout for topdown case. This might
cause that mmap_base starts in the stack reserved area because stack is
randomized by 1GB for 64b (8MB for 32b) and the minimum gap is 128MB.
If the stack really grows down to mmap_base then we can get silent mmap
region overwrite by the stack values.
Let's include maximum stack randomization size into MIN_GAP which is
used as the low bound for the gap in mmap.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
LKML-Reference: <1252400515-6866-1-git-send-email-mhocko@suse.cz> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When calling vfs_unlink() on the lower dentry, d_delete() turns the
dentry into a negative dentry when the d_count is 1. This eventually
caused a NULL pointer deref when a read() or write() was done and the
negative dentry's d_inode was dereferenced in
ecryptfs_read_update_atime() or ecryptfs_getxattr().
Placing mutt's tmpdir in an eCryptfs mount is what initially triggered
the oops and I was able to reproduce it with the following sequence:
Which is why I have always preferred sizeof(struct foo) over
sizeof(var).
Signed-off-by: Jean Delvare <khali@linux-fr.org> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We noticed very erratic behavior [throughput] with the AIM7 shared
workload running on recent distro [SLES11] and mainline kernels on an
8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
[2.6.27.19+] with Barcelona processors, as we increased the load [10s of
thousands of tasks], the throughput would vary between two "plateaus"--one
at ~65K jobs per minute and one at ~130K jpm. The simple patch below
causes the results to smooth out at the ~130k plateau.
But wait, there's more:
We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
This could be the result of the larger number of cpus on the larger
platform--a scalability issue--or it could be the result of the larger
number of interconnect "hops" between some nodes in this platform and how
the tasks for a given load end up distributed over the nodes' cpus and
memories--a stochastic NUMA effect.
The variability in the results are less pronounced [on the same platform]
with Shanghai processors and with mainline kernels. With 31-rc6 on
Shanghai processors and 288 file systems on 288 fibre attached storage
volumes, the curves [jpm vs load] are both quite flat with the patched
kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
jpm] than the unpatched kernel.
Profiling indicated that the "slow" runs were incurring high[er]
contention on an anon_vma lock in vma_adjust(), apparently called from the
sbrk() system call.
The patch:
A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
anon_vma lock when we're only adjusting the end of a vma, as is the case
for brk(). The comment questions whether it's worth while to optimize for
this case. Apparently, on the newer, larger x86_64 platforms, with
interesting NUMA topologies, it is worth while--especially considering
that the patch [if correct!] is quite simple.
We can detect this condition--no overlap with next vma--by noting a NULL
"importer". The anon_vma pointer will also be NULL in this case, so
simply avoid loading vma->anon_vma to avoid the lock.
However, we DO need to take the anon_vma lock when we're inserting a vma
['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
need to check for 'insert', as well. And Hugh points out that we should
also take it when adjusting vm_start (so that rmap.c can rely upon
vma_address() while it holds the anon_vma lock).
akpm: Zhang Yanmin reprts a 150% throughput improvement with aim7, so it
might be -stable material even though thiss isn't a regression: "this
issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
severe on large machine (4*8*2 cpu)"
[hugh.dickins@tiscali.co.uk: test vma start too] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: Eric Whitney <eric.whitney@hp.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
do_anonymous_page() has been wrong to dirty the pte regardless.
If it's not going to mark the pte writable, then it won't help
to mark it dirty here, and clogs up memory with pages which will
need swap instead of being thrown away. Especially wrong if no
overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
we could exceed the limit and OOM despite no overcommit.
Lee Schermerhorn [Tue, 22 Sep 2009 00:01:04 +0000 (17:01 -0700)]
hugetlb: restore interleaving of bootmem huge pages (2.6.31)
Not upstream as it is fixed differently in .32
I noticed that alloc_bootmem_huge_page() will only advance to the next
node on failure to allocate a huge page. I asked about this on linux-mm
and linux-numa, cc'ing the usual huge page suspects. Mel Gorman
responded:
I strongly suspect that the same node being used until allocation
failure instead of round-robin is an oversight and not deliberate
at all. It appears to be a side-effect of a fix made way back in
commit 63b4613c3f0d4b724ba259dc6c201bb68b884e1a ["hugetlb: fix
hugepage allocation with memoryless nodes"]. Prior to that patch
it looked like allocations would always round-robin even when
allocation was successful.
Andy Whitcroft countered that the existing behavior looked like Andi
Kleen's original implementation and suggested that we ask him. We did and
Andy replied that his intention was to interleave the allocations. So,
...
This patch moves the advance of the hstate next node from which to
allocate up before the test for success of the attempted allocation. This
will unconditionally advance the next node from which to alloc,
interleaving successful allocations over the nodes with sufficient
contiguous memory, and skipping over nodes that fail the huge page
allocation attempt.
Note that alloc_bootmem_huge_page() will only be called for huge pages of
order > MAX_ORDER.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The length of the to-copy data structure is currently stored in
a signed integer. However many comparisons are done with sizeof(..)
which is unsigned. It's more suitable for this variable to be unsigned
to make these comparisons more naturally right.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
as a security check against optlen being negative (or zero) in the
set socket option.
Unfortunately, "sizeof(int)" is an unsigned property, with the
result that the whole comparison is done in unsigned, letting
negative values slip through.
This patch changes this to
if (optlen < (int)sizeof(int))
return -EINVAL;
so that the comparison is done as signed, and negative values
get properly caught.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This incorrect backport to 2.6.28.10 placed some code into the probe function
which used a pointer before it was initialized. Moving this to the correct
place (as it is in upstream).
Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Acked-by: Steve Conklin <steve.conklin@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Cord Walter <qord@cwalter.net> Signed-off-by: Komuro <komurojun-mbn@nifty.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The enc28j60 driver doesn't check whether the length of the packet as reported
by the hardware fits into the preallocated buffer. When stressed, the hardware
may report insanely large packets even tough the "Receive OK" bit is set. Fix
this.
Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds a new usbid for Zcomax XG-705A to the device table.
Reported-by: Jari Jaakola <jari.jaakola@gmail.com> Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In theory it could happen that on one CPU we initialize a new inode but
clearing of I_NEW | I_LOCK gets reordered before some of the
initialization. Thus on another CPU we return not fully uptodate inode
from iget_locked().
This seems to fix a corruption issue on ext3 mounted over NFS.
[akpm@linux-foundation.org: add some commentary] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
nfsd: fix hung up of nfs client while sync write data to nfs server
Commit 'Short write in nfsd becomes a full write to the client'
(31dec2538e45e9fff2007ea1f4c6bae9f78db724) broken the sync write.
With the following commands to reproduce:
$ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt
$ cd /mnt
$ echo aaaa > temp.txt
Then nfs client is hung up.
In SYNC mode the server alaways return the write count 0 to the
client. This is because the value of host_err in nfsd_vfs_write()
will be overwrite in SYNC mode by 'host_err=nfsd_sync(file);',
and then we return host_err(which is now 0) as write count.
This patch fixed the problem.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Short write in nfsd becomes a full write to the client
If a filesystem being written to via NFS returns a short write count
(as opposed to an error) to nfsd, nfsd treats that as a success for
the entire write, rather than the short count that actually succeeded.
For example, given a 8192 byte write, if the underlying filesystem
only writes 4096 bytes, nfsd will ack back to the nfs client that all
8192 bytes were written. The nfs client does have retry logic for
short writes, but this is never called as the client is told the
complete write succeeded.
There are probably other ways it could happen, but in my case it
happened with a fuse (filesystem in userspace) filesystem which can
rather easily have a partial write.
Here is a patch to properly return the short write count to the
client.
Signed-off-by: David Shaw <dshaw@jabberwocky.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When the volume is changed continuously (e.g., when the user drags a
volume slider with the mouse), the driver does lots of I2C writes.
Apparently, the sound chip can get confused when we poll the I2C status
register too much, and fails to complete a read from it. On the PCI-E
models, the PCI-E/PCI bridge gets upset by this and generates a machine
check exception.
To avoid this, this patch replaces the polling with an unconditional
wait that is guaranteed to be long enough.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Tested-by: Johann Messner <johann.messner at jku.at> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The SLB can change sizes across a live migration, which was not
being handled, resulting in possible machine crashes during
migration if migrating to a machine which has a smaller max SLB
size than the source machine. Fix this by first reducing the
SLB size to the minimum possible value, which is 32, prior to
migration. Then during the device tree update which occurs after
migration, we make the call to ensure the SLB gets updated. Also
add the slb_size to the lparcfg output so that the migration
tools can check to make sure the kernel has this capability
before allowing migration in scenarios where the SLB size will change.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
ata_tf_read_block() has off-by-one error when converting CHS address
to LBA. The bug isn't very visible because ata_tf_read_block() is
used only when generating sense data for a failed RW command and CHS
addressing isn't used too often these days.
Fix minimum period size for cs46xx cards. This fixes a problem in the
case where neither a period size nor a buffer size is passed to ALSA;
this is the case in Audacious, OpenAL, and others.
Signed-off-by: Sophie Hamilton <kernel@theblob.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some drives report 0 as the number of written blocks when there are some blocks
recorded. Use device size in such case so that we can automagically mount such
media.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When probing the device in tpm_tis_init the call request_locality
uses timeout_a, which wasn't being initalized until after
request_locality. This results in request_locality falsely timing
out if the chip is still starting. Move the initialization to before
request_locality.
This probably only matters for embedded cases (ie mine), a BIOS likely
gets the TPM into a state where this code path isn't necessary.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Acked-by: Rajiv Andrade <srajiv@linux.vnet.ibm.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
After applying the patch:
$ ./hello
Trace trap # user-mode execution after execve() finishes
If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
suggested checking for PROT_WRITE in the bss logic. I think it makes
most sense to simply apply the bss logic only when there is bss.
This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.
Reported-by: John Reiser <jreiser@bitwagon.com> Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close() and
*before* sysfs_slab_remove() or risk rcu_free_slab() being called after
kmem_cache is deleted (kfreed).
rmmod nf_conntrack can crash the machine because it has to kmem_cache_destroy()
a SLAB_DESTROY_BY_RCU enabled cache.
Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The function jffs2_nor_wbuf_flash_setup() doesn't allocate the verify buffer
if CONFIG_JFFS2_FS_WBUF_VERIFY is defined, so causing a kernel panic when
that macro is enabled and the verify function is called. Similarly the
jffs2_nor_wbuf_flash_cleanup() must free the buffer if
CONFIG_JFFS2_FS_WBUF_VERIFY is enabled.
The following patch fixes the problem.
The following patch applies to 2.6.30 kernel.
memcpy() should take into account size of pointers,
not only number of pointers to copy.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
E100 places it's RX packet descriptors inside skb->data and uses them
with bidirectional streaming DMA mapping. Data in descriptors is
accessed simultaneously by the chip (writing status and size when
a packet is received) and CPU (reading to check if the packet was
received). This isn't a valid usage of PCI DMA API, which requires use
of the coherent (consistent) memory for such purpose. Unfortunately e100
chips working in "simplified" RX mode have to store received data
directly after the descriptor. Fixing the driver to conform to the API
would require using unsupported "flexible" RX mode or receiving data
into a coherent memory and using CPU to copy it to network buffers.
This patch, while not yet making the driver conform to the PCI DMA API,
allows it to work correctly on X86 with swiotlb (while not breaking
other architectures).
Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Trond Myklebust [Fri, 21 Aug 2009 17:37:17 +0000 (13:37 -0400)]
SUNRPC: Fix tcp reconnection
This fixes a problem that was reported as Red Hat Bugzilla entry number
485339, in which rpciod starts looping on the TCP connection code,
rendering the NFS client unusable for 1/2 minute or so.
[SCSI] sr: report more accurate drive status after closing the tray.
So, what's happening here is that the drive is reporting a sense of
2/4/1 ("logical unit is becoming ready") from sr_test_unit_ready(), and
then we ask for the media event notification before checking that result
at all. The check_media_event_descriptor() call isn't getting a check
condition, but it's also reporting that the tray is closed and that
there's no media. In actuality it doesn't yet know if there's media or
not, but there's no way to express that in the media event status field.
My current thought is that if it told us the device isn't yet ready, we
should return that immediately, since there's nothing that'll tell us
any more data than that reliably:
USB: removal of tty->low_latency hack dating back to the old serial code
This removes tty->low_latency from all USB serial drivers that push
data into the tty layer at hard interrupt context. It's no longer needed
and actually harmful.
Signed-off-by: Oliver Neukum <oliver@neukum.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Ideally we should have a directory of drivers and a link to the 'active'
driver. For now just show the first device which is effectively the existing
semantics without a warning.
This is an update on the original buggy patch that I then forgot to
resubmit. Confusingly it was proposed by Red Hat, written by Etched Pixels
fixed and submitted by Intel ...
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In a non-sparse extend, we correctly allocate (and zero) the clusters between
the old_i_size and pos, but we don't zero the portions of the cluster we're
writing to outside of pos<->len.
It handles clustersize > pagesize and blocksize < pagesize.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sllc_arphrd member of sockaddr_llc might not be changed. Zero sllc
before copying to the above layer's structure.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Oleg Nesterov [Mon, 24 Aug 2009 10:45:29 +0000 (12:45 +0200)]
kthreads: fix kthread_create() vs kthread_stop() race
The bug should be "accidently" fixed by recent changes in 2.6.31,
all kernels <= 2.6.30 need the fix. The problem was never noticed before,
it was found because it causes mysterious failures with GFS mount/umount.
Credits to Robert Peterson. He blaimed kthread.c from the very beginning.
But, despite my promise, I forgot to inspect the old implementation until
he did a lot of testing and reminded me. This led to huge delay in fixing
this bug.
kthread_stop() does put_task_struct(k) before it clears kthread_stop_info.k.
This means another kthread_create() can re-use this task_struct, but the
new kthread can still see kthread_should_stop() == T and exit even without
calling threadfn().
Reported-by: Robert Peterson <rpeterso@redhat.com> Tested-by: Robert Peterson <rpeterso@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Ulrich Drepper correctly points out that there is generally padding in
the structure on 64-bit hosts, and that copying the structure from
kernel to user space can leak information from the kernel stack in those
padding bytes.
Avoid the whole issue by just copying the three members one by one
instead, which also means that the function also can avoid the need for
a stack frame. This also happens to match how we copy the new structure
from user space, so it all even makes sense.
[ The obvious solution of adding a memset() generates horrid code, gcc
does really stupid things. ]
raw_getname() can leak 10 bytes of kernel memory to user
(two bytes hole between can_family and can_ifindex,
8 bytes at the end of sockaddr_can structure)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Oliver Hartkopp <oliver@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
atalk_getname() can leak 8 bytes of kernel memory to user
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The processor is documented to reload the PDPTRs while in PAE mode if any
of the CR4 bits PSE, PGE, or PAE change. Linux relies on this
behaviour when zapping the low mappings of PAE kernels during boot.
The code already handled changes to CR4.PAE; augment it to also notice changes
to PSE and PGE.
This triggered while booting an F11 PAE kernel; the futex initialization code
runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem
ended up uninitialized, killing PI futexes and pulseaudio which uses them.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The paravirt tlb flush may be used not only to flush TLBs, but also
to reload the four page-directory-pointer-table entries, as it is used
as a replacement for reloading CR3. Change the code to do the entire
CR3 reloading dance instead of simply flushing the TLB.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
KVM optimizes guest port 80 accesses by passthing them through to the host.
Some AMD machines die on port 80 writes, allowing the guest to hard-lock the
host.
Remove the port passthrough to avoid the problem.
Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Tested-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
vmx_set_msr() does not allow i386 guests to touch EFER, but they can still
do so through the default: label in the switch. If they set EFER_LME, they
can oops the host.
Fix by having EFER access through the normal channel (which will check for
EFER_LME) even on i386.
Reported-and-tested-by: Benjamin Gilbert <bgilbert@cs.cmu.edu> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
One of vcpu_setup responsibilities is to do mmu initialization.
However, in case we fail in kvm_arch_vcpu_reset, before we get the
chance to init mmu. OTOH, vcpu_destroy will attempt to destroy mmu,
triggering a bug. Keeping track of whether or not mmu is initialized
would unnecessarily complicate things. Rather, we just make return,
making sure any needed uninitialization is done before we return, in
case we fail.
Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There is a potential issue that, when guest using pagetable without vmexit when
EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
which would be inconsistent with host side and would cause host MCE due to
inconsistent cache attribute.
The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
memory type to protect host (notice that all memory mapped by KVM should be WB).
Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The page fault path can use two rmap_desc structures, if:
- walk_addr's dirty pte update allocates one rmap_desc.
- mmu_lock is dropped, sptes are zapped resulting in rmap_desc being
freed.
- fetch->mmu_set_spte allocates another rmap_desc.
Currently KVM implements MC0-MC4_MISC read support. When booting Linux this
results in KVM warnings in the kernel log when the guest tries to read
MC5_MISC. Fix this warnings with this patch.
We're in a hot path. We can't use kmalloc() because
it might impact performance. So, we just stick the buffer that
we need into the kvm_vcpu_arch structure. This is used very
often, so it is not really a waste.
We also have to move the buffer structure's definition to the
arch-specific x86 kvm header.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
On my machine with gcc 3.4, kvm uses ~2k of stack in a few
select functions. This is mostly because gcc fails to
notice that the different case: statements could have their
stack usage combined. It overflows very nicely if interrupts
happen during one of these large uses.
This patch uses two methods for reducing stack usage.
1. dynamically allocate large objects instead of putting
on the stack.
2. Use a union{} member for all of the case variables. This
tricks gcc into combining them all into a single stack
allocation. (There's also a comment on this)
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The accessed bit was accidentally turned on in a random flag word, rather
than, the spte itself, which was lucky, since it used the non-EPT compatible
PT_ACCESSED_MASK.
Fix by turning the bit on in the spte and changing it to use the portable
accessed mask.
Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is esoteric and only needed to break COW on MAP_SHARED mappings. Since
KVM no longer does these sorts of mappings, breaking COW on them is no longer
necessary.
Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch fixes the bug that was reported in
http://bugzilla.kernel.org/show_bug.cgi?id=14053
If we're in the case where we need to force a reencode and then resend of
the RPC request, due to xprt_transmit failing with a networking error, then
we _must_ retransmit the entire request.
snd_interval_list() expected a sorted list but did not document this, so
there are drivers that give it an unsorted list. To fix this, change
the algorithm to work with any list.
This fixes the "Slave PCM not usable" error with USB devices that have
multiple alternate settings with sample rates in decreasing order, such
as the Philips Askey VC010 WebCam.
http://bugzilla.kernel.org/show_bug.cgi?id=14028
Reported-and-tested-by: Andrzej <adkadk@gmail.com> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch fixes the napi list handling when an ehea interface is shut
down to avoid corruption of the napi list.
Signed-off-by: Hannes Hering <hering2@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
do {
ret = waitpid(pid, &status, 0);
} while (ret == -1 && errno == EINTR);
}
return 0;
}
quickly creates an unkillable task.
If copy_process(CLONE_THREAD) races with de_thread()
copy_signal()->atomic(signal->count) breaks the signal->notify_count
logic, and the execing thread can hang forever in kernel space.
Change copy_process() to increment count/live only when we know for sure
we can't fail. In this case the forked thread will take care of its
reference to signal correctly.
If copy_process() fails, check CLONE_THREAD flag. If it it set - do
nothing, the counters were not changed and current belongs to the same
thread group. If it is not set, ->signal must be released in any case
(and ->count must be == 1), the forked child is the only thread in the
thread group.
We need more cleanups here, in particular signal->count should not be used
by de_thread/__exit_signal at all. This patch only fixes the bug.
This patch fixes the wrong headphone output routing for MacBookPro 3,1/4,1
quirk with ALC889A codec, which caused the silent headphone output.
Also, this gives the individual Headphone and Speaker volume controls.
We can't call nfs_readdata_release()/nfs_writedata_release() without
first initialising and referencing args.context. Doing so inside
nfs_direct_read_schedule_segment()/nfs_direct_write_schedule_segment()
causes an Oops.
We should rather be calling nfs_readdata_free()/nfs_writedata_free() in
those cases.
Looking at the O_DIRECT code, the "struct nfs_direct_req" is already
referencing the nfs_open_context for us. Since the readdata and writedata
structures carry a reference to that, we can simplify things by getting rid
of the extra nfs_open_context references, so that we can replace all
instances of nfs_readdata_release()/nfs_writedata_release().
kernel_sendpage() does the proper default case handling for when the
socket doesn't have a native sendpage implementation.
Now, arguably this might be something that we could instead solve by
just specifying that all protocols should do it themselves at the
protocol level, but we really only care about the common protocols.
Does anybody really care about sendpage on something like Appletalk? Not
likely.
Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Julien TINNES <julien@cr0.org> Acked-by: Tavis Ormandy <taviso@sdf.lonestar.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.
Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.
Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.
With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
This patch (as1272) changes the error code returned when an open call
for a USB device node fails to locate the corresponding device. The
appropriate error code is -ENODEV, not -ENOENT.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> CC: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Attached patch adds USB vendor and product IDs for Bayer's USB to serial
converter cable used by Bayer blood glucose meters. It seems to be a
FT232RL based device and works without any problem with ftdi_sio driver
when this patch is applied. See: http://winglucofacts.com/cables/
Signed-off-by: Marko Hänninen <bugitus@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
access_ok() checks must be done on every part of the userspace structure
that is accessed. If access_ok() on one part of the struct succeeded, it
does not imply it will succeed on other parts of the struct. (Does
depend on the architecture implementation of access_ok()).
This changes the __get_user() users to first check access_ok() on the
data structure.
Signed-off-by: Michael Buesch <mb@bu3sch.de> Cc: Pete Zaitcev <zaitcev@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
While looking at Jens Rosenboom bug report
(http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from
a dying "ps" program, we found following problem.
clone() syscall has special support for TID of created threads. This
support includes two features.
One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
TID value.
One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
thread dies.
The integer location is a user provided pointer, provided at clone()
time.
kernel keeps this pointer value into current->clear_child_tid.
At execve() time, we should make sure kernel doesnt keep this user
provided pointer, as full user memory is replaced by a new one.
As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
memory in forked processes.
Following sequence could happen:
1) bash (or any program) starts a new process, by a fork() call that
glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
...) syscall
2) When new process starts, its current->clear_child_tid is set to a
location that has a meaning only in bash (or initial program) context
(&THREAD_SELF->tid)
3) This new process does the execve() syscall to start a new program.
current->clear_child_tid is left unchanged (a non NULL value)
4) If this new program creates some threads, and initial thread exits,
kernel will attempt to clear the integer pointed by
current->clear_child_tid from mm_release() :
/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
<< here >> put_user(0, tidptr);
sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
}
5) OR : if new program is not multi-threaded, but spied by /proc/pid
users (ps command for example), mm_users > 1, and the exiting program
could corrupt 4 bytes in a persistent memory area (shm or memory mapped
file)
If current->clear_child_tid points to a writeable portion of memory of the
new program, kernel happily and silently corrupts 4 bytes of memory, with
unexpected effects.
Fix is straightforward and should not break any sane program.
The FIEMAP_IOC_FIEMAP mapping ioctl was missing a 32-bit compat handler,
which means that 32-bit suerspace on 64-bit kernels cannot use this ioctl
command.
The structure is nicely aligned, padded, and sized, so it is just this
simple.
Tested w/ 32-bit ioctl tester (from Josef) on a 64-bit kernel on ext4.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Mark Lord <lkml@rtr.ca> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Josef Bacik <josef@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>