We can call ext4_mb_check_limits even after successfully allocating
the requested blocks. In that case, make sure we don't overwrite
ac_status if it already has the status AC_STATUS_FOUND. This fixes
the lockdep warning:
=============================================
[ INFO: possible recursive locking detected ]
2.6.28-rc6-autokern1 #1
---------------------------------------------
fsstress/11948 is trying to acquire lock:
(&meta_group_info[i]->alloc_sem){----}, at: [<c04d9a49>] ext4_mb_load_buddy+0x9f/0x278
.....
Xen doesn't report that barriers are not supported until buffer I/O is
reported as completed, instead of when the buffer I/O is submitted.
Add a check and a fallback codepath to journal_wait_on_commit_record()
to detect this case, so that attempts to mount ext4 filesystems on
LVM/devicemapper devices on Xen guests don't blow up with an "Aborting
journal on device XXX"; "Remounting filesystem read-only" error.
Thanks to Andreas Sundstrom for reporting this issue.
I chased the cause of following ext4 oops report which is tested on
ia64 box.
http://bugzilla.kernel.org/show_bug.cgi?id=12018
The cause is the size of s_mb_maxs array that is defined as "unsigned
short" in ext4_sb_info structure. If the file system's block size is
8k or greater, an unsigned short is not wide enough to contain the
value fs->blocksize << 3.
When iterating through the pages which have mapped buffer_heads, we
failed to update the b_state value. This results in allocating blocks
at logical offset 0.
If the filesystem has errors, ext4_da_writepages() will return a *lot*
of errors, including lots and lots of stack dumps. While it's true
that we are dropping user data on the floor, which is unfortunate, the
stack dumps aren't helpful, and they tend to obscure the true original
root cause of the problem. So in the case where the filesystem has
aborted, return an EROFS right away.
The original ext3 hash algorithms assumed that variables of type char
were signed, as God and K&R intended. Unfortunately, this assumption
is not true on some architectures. Userspace support for marking
filesystems with non-native signed/unsigned chars was added two years
ago, but the kernel-side support was never added (until now).
Ensure that we do not set pcb->data.raw beyond its size, print an error message
and return false if we attempt to. A timout message was printed one too early.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around. According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:
In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x. During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork). Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.
Both parent and child process crash some time later due to inconsistencies
in their address spaces. The issue seems to be timing sensitive, various
attempts to recreate it have failed.
This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated. This patch keeps users from running into
that issue.
[akpm@linux-foundation.org: cleanup] Signed-off-by: Carsten Otte <cotte@de.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 3d2a71a596bd9c761c8487a2178e95f8a61da083 ("x86, traps: converge
do_debug handlers") changed the preemption disable logic of do_debug()
so vm86_handle_trap() is called with preemption disabled resulting in:
BUG: sleeping function called from invalid context at include/linux/kernel.h:155
in_atomic(): 1, irqs_disabled(): 0, pid: 3005, name: dosemu.bin
Pid: 3005, comm: dosemu.bin Tainted: G W 2.6.29-rc1 #51
Call Trace:
[<c050d669>] copy_to_user+0x33/0x108
[<c04181f4>] save_v86_state+0x65/0x149
[<c0418531>] handle_vm86_trap+0x20/0x8f
[<c064e345>] do_debug+0x15b/0x1a4
[<c064df1f>] debug_stack_correct+0x27/0x2c
[<c040365b>] sysenter_do_call+0x12/0x2f
BUG: scheduling while atomic: dosemu.bin/3005/0x10000001
Restore the original calling convention and reenable preemption before
calling handle_vm86_trap().
Reported-by: Michal Suchanek <hramrach@centrum.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Impact: fix race leading to crash under KVM and Xen
The CPA code may be called while we're in lazy mmu update mode - for
example, when using DEBUG_PAGE_ALLOC and doing a slab allocation
in an interrupt handler which interrupted a lazy mmu update. In this
case, the in-memory pagetable state may be out of date due to pending
queued updates. We need to flush any pending updates before inspecting
the page table. Similarly, we must explicitly flush any modifications
CPA may have made (which comes down to flushing queued operations when
flushing the TLB).
I am not sure what happened. It looks like we have always leaked
the q->queue that is allocated from the kfifo_init call. nab finally
noticed that we were leaking and this patch fixes it by adding a
kfree call to iscsi_pool_free. kfifo_free is not used per kfifo_init's
instructions to use kfree.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jean Delvare <jdelvare@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is the cause of the DMA faults and disk corruption that people have
been seeing. Some chipsets neglect to report the RWBF "capability" --
the flag which says that we need to flush the chipset write-buffer when
changing the DMA page tables, to ensure that the change is visible to
the IOMMU.
Override that bit on the affected chipsets, and everything is happy
again.
Thanks to Chris and Bhavesh and others for helping to debug.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Tested-by: Chris Wright <chrisw@sous-sol.org> Reviewed-by: Bhavesh Davda <bhavesh@vmware.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
bugzilla: #12363
commit 7cd5b08be3c489df11b559fef210b81133764ad4 added a second regression:
some Dell's and Compaq's lockup on boot. So we revert most of the code.
The ICH9 reboot issue remains in place and will need some more fixing... :-(
Signed-off-by: Wim Van Sebroeck <wim@iguana.be> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If a process registers for asynchronous notification on a POSIX message
queue, it gets a signal and a siginfo_t structure when a message arrives
on the message queue. The si_pid in the siginfo_t structure is set to the
PID of the process that sent the message to the message queue.
The principle is the following:
. when mq_notify(SIGEV_SIGNAL) is called, the caller registers for
notification when a msg arrives. The associated pid structure is stroed into
inode_info->notify_owner. Let's call this process P1.
. when mq_send() is called by say P2, P2 sends a signal to P1 to notify
him about msg arrival.
The way .si_pid is set today is not correct, since it doesn't take into account
the fact that the process that is sending the message might not be in the
same namespace as the notified one.
This patch proposes to set si_pid to the sender's pid into the notify_owner
namespace.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
A current problem with the pid namespace is that it is easy to do pid
related work after exit_task_namespaces which drops the nsproxy pointer.
However if we are doing pid namespace related work we are always operating
on some struct pid which retains the pid_namespace pointer of the pid
namespace it was allocated in.
So provide ns_of_pid which allows us to find the pid namespace a pid was
allocated in.
Using this we have the needed infrastructure to do pid namespace related
work at anytime we have a struct pid, removing the chance of accidentally
having a NULL pointer dereference when accessing current->nsproxy.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.
The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.
But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.
The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.
The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused. Yet, that is
all that the socket send side has at this point.
This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.
The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.
This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.
When user tries to map all chunks given in argument, kernel
works on a copy of the chunkmap, but at the end it doesn't
check the copy, but the orginal one.
Signed-off-by: Qu Haoran <haoran.qu@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The patch fixes a typo in the inverse mapping of Node Information
request. Following draft-ietf-ipngwg-icmp-name-lookups-09, "Querier"
sends a type 139 (ICMPV6_NI_QUERY) packet to "Responder" which answer
with a type 140 (ICMPV6_NI_REPLY) packet.
Signed-off-by: Eric Leblond <eric@inl.fr> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The dev->pio_mode > XFER_PIO_0 test is there to avoid unnecessary
speed down warning messages but it accidentally disabled SATA link spd
down during configuration phase after reset where PIO mode is always
zero.
This patch fixes the problem by moving the test where it belongs.
This makes libata probing sequence behave better when the connection
is flaky at higher link speeds which isn't too uncommon for eSATA
devices.
When checking for the CFA feature set support, ata_id_is_cfa() tests bit 2 in
word 82 of the identify data instead the word 83; it also checks the ATA/PI
version support in the word 80 (which the CompactFlash specifications have as
reserved), this having no slightest chance to work on the modern CF cards that
don't have 0x848A in the word 0...
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Report descriptor fixup for MS 1028 receiver changes also values for
Keyboard and Consumer, which incorrectly trims the range, causing correct
events being thrown away before passing to userspace.
We need to keep the GenDesk usage fixup though, as it reports totally bogus
values about axis.
Fix the initial value for input hwport. The old value (-1) may cause
Oops when an realtime MIDI byte is received before the input port is
explicitly given.
Instead, now it's set to the broadcasting as default.
sparc64 needs sign-extended function parameters. We have to enable
the system call wrappers.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
tcp_splice_data_recv has two lengths to consider: the len parameter it
gets from tcp_read_sock, which specifies the amount of data in the skb,
and rd_desc->count, which is the amount of data the splice caller still
wants. Currently it passes just the latter to skb_splice_bits, which then
splices min(rd_desc->count, skb->len - offset) bytes.
Most of the time this is fine, except when the skb contains urgent data.
In that case len goes only up to the urgent byte and is less than
skb->len - offset. By ignoring len tcp_splice_data_recv may a) splice
data tcp_read_sock told it not to, b) return to tcp_read_sock a value > len.
Now, tcp_read_sock doesn't handle used > len and leaves the socket in a
bad state (both sk_receive_queue and copied_seq are bad at that point)
resulting in duplicated data and corruption.
Fix by passing min(rd_desc->count, len) to skb_splice_bits.
Signed-off-by: Dimitris Michailidis <dm@chelsio.com> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.
Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
often.
With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call. With typical 1460 bytes
of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
16*1460 = 23360 bytes.
Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
802.1Q expanded the maximum ethernet frame size by 4 bytes for the
VLAN tag. We're not taking this into account in virtio_net, which
means the buffers we provide to the backend in the virtqueue RX ring
aren't big enough to hold a full MTU VLAN packet. For QEMU/KVM,
this results in the backend exiting with a packet truncation error.
Signed-off-by: Alex Williamson <alex.williamson@hp.com> Acked-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Tap devices can make use of a small MAC filter set via the
TUNSETTXFILTER ioctl. The filter has a set of exact matches
plus a hash for imperfect filtering of additional multicast
addresses. The current code is unbalanced, adding unicast
addresses to the multicast hash, but only checking the hash
against multicast addresses. This results in the filter
dropping unicast addresses that overflow the exact filter.
The fix is simply to disable the filter by leaving count set
to zero if we find non-multicast addresses after the exact
match table is filled.
Signed-off-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Printing anything over netconsole before hw is up and running is,
of course, not going to work.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In function sock_getsockopt() located in net/core/sock.c, optval v.val
is not correctly initialized and directly returned in userland in case
we have SO_BSDCOMPAT option set.
As the options passed to ip6_append_data may be ephemeral, we need
to duplicate it for corking. This patch applies the simplest fix
which is to memdup all the relevant bits.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This field can be used to detect incorrect sizing of socket receive
buffers.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The UDP header pointer assignment must happen after calling
pskb_may_pull(). As pskb_may_pull() can potentially alter the SKB
buffer.
This was exposted by running multicast traffic through the NIU driver,
as it won't prepull the protocol headers into the linear area on
receive.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In the lockup situation the driver seems to go off in an eternal storm
of interrupts right after calling request_irq(). It doesn't actually
do anything interesting in the interrupt handler. Since connecting the link
afterwards works, something later in initialization must fix this.
Looking at gem_do_start() and gem_open(), it seems that the only thing
done while opening the device after the request_irq(), is a call to
napi_enable().
I don't know what the ordering requirements are for the
initialization, but I boldly tried to move the napi_enable() call
inside gem_do_start() before the link state is checked and interrupts
subsequently enabled, and it seems to work for me. Doesn't even break
anything too obvious...
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As the mmap handler gets called under mmap_sem, and we may grab
mmap_sem elsewhere under the socket lock to access user data, we
should avoid grabbing the socket lock in the mmap handler.
Since the only thing we care about in the mmap handler is for
pg_vec* to be invariant, i.e., to exclude packet_set_ring, we
can achieve this by simply using a new mutex.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Martin MOKREJĹ <mmokrejs@ribosome.natur.cuni.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
packet_lookup_frames() fails to get user frame if current frame header
status contains extra flags.
This is due to the wrong assumption on the operators precedence during
frame status tests.
Fixed by forcing the right operators precedence order with explicit brackets.
Signed-off-by: Paolo Abeni <paolo.abeni@gmail.com> Signed-off-by: Sebastiano Di Paola <sebastiano.dipaola@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We can't really let that get exposed to userspace
because this conflicts with types defined in netinet/ip.h
which userland is almost certainly going to have included
either explicitly or implicitly.
So guard this include with a __KERNEL__ ifdef.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Benjamin Zores <benjamin.zores@alcatel-lucent.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Actually I did some testing and added a few printks and found that the
st->cur_skb->data was 0 and hence the ptr used by iscsi_tcp was null.
This caused the kernel panic.
I enabled the debug_tcp and with a few printks found that the code did
not go to the next_skb label and could find that the sequence being
followed was this -
Reversing the two conditions the attached patch fixes the issue for me
on top of Herbert's patches.
Signed-off-by: Shyam Iyer <shyam_iyer@dell.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The frag_list handling was broken in skb_seq_read:
1) We didn't add the stepped offset when looking at the head
are of fragments other than the first.
2) We didn't take the stepped offset away when setting the data
pointer in the head area.
3) The frag index wasn't reset.
This patch fixes both issues.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Recent changes to the retransmit code exposed a long standing
bug where it was possible for a chunk to be time stamped
after the retransmit timer was reset. This caused a rare
situation where the retrnamist timer has expired, but
nothing was marked for retrnasmission because all of
timesamps on data were less then 1 rto ago. As result,
the timer was never restarted since nothing was retransmitted,
and this resulted in a hung association that did couldn't
complete the data transfer. The solution is to timestamp
the chunk when it's added to the packet for transmission
purposes. After the packet is trsnmitted the rtx timer
is restarted. This guarantees that when the timer expires,
there will be data to retransmit.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 62aeaff5ccd96462b7077046357a6d7886175a57
(sctp: Start T3-RTX timer when fast retransmitting lowest TSN)
introduced a regression where it was possible to forcibly
restart the sctp retransmit timer at the transmission of any
new chunk. This resulted in much longer timeout times and
sometimes hung sctp connections.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
crc32c algorithm provides a byteswaped result. On little-endian
arches, the result ends up in big-endian/network byte order.
On big-endinan arches, the result ends up in little-endian
order and needs to be byte swapped again. Thus calling cpu_to_le32
gives the right output.
Tested-by: Jukka Taimisto <jukka.taimisto@mail.suomi.net> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If a client requests a blocking lock, is denied, then requests it again,
then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP
set, because we've already queued a block and don't need the locks code
to do it again.
But that means vfs_lock_file() will return -EAGAIN instead of
FILE_LOCK_DENIED. So we still need to translate that -EAGAIN return
into a nlm_lck_blocked error in this case, and put ourselves back on
lockd's block list.
The bug was introduced by bde74e4bc64415b1 "locks: add special return
value for asynchronous locks".
Thanks to Frank van Maarseveen for the report; his original test
case was essentially
for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done
Tested-by: Frank van Maarseveen <frankvm@frankvm.com> Reported-by: Frank van Maarseveen <frankvm@frankvm.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fixed v_mapped_by_tlbcam() and p_mapped_by_tlbcam() to use phys_addr_t
instead of unsigned long. In 36-bit physical mode we really need these
functions to deal with phys_addr_t when trying to match a physical
address or when returning one.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Christophe Saout reported [in precursor to:
http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:
> Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU.
> Seems like Xen tears down current->mm early on process termination, so
> that __get_user_pages in exit_mmap causes nasty messages when the
> process had any mlocked pages. (in fact, it somehow manages to get into
> the swapping code and produces a null pointer dereference trying to get
> a swap token)
Jeremy explained:
Yes. In the normal case under Xen, an in-use pagetable is "pinned",
meaning that it is RO to the kernel, and all updates must go via hypercall
(or writes are trapped and emulated, which is much the same thing). An
unpinned pagetable is not currently in use by any process, and can be
directly accessed as normal RW pages.
As an optimisation at process exit time, we unpin the pagetable as early
as possible (switching the process to init_mm), so that all the normal
pagetable teardown can happen with direct memory accesses.
This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
a few lines below. The obvious thing to do would be to move
arch_exit_mmap() to below the munlock code, but I think we'd want to
call it even if mm->mmap is NULL, just to be on the safe side.
Thus, this patch:
exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
as the latter may switch the current mm to init_mm, which would cause the
former to fail.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christophe Saout <christophe@saout.de> Cc: Keir Fraser <keir.fraser@eu.citrix.com> Cc: Christophe Saout <christophe@saout.de> Cc: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It made the following changes:
1. Decrement wbc->nr_to_write instead of nr_to_write
2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
3. If synced nr_to_write pages, stop only if if wbc->sync_mode ==
WB_SYNC_NONE, otherwise keep going.
However, according to the commit message, the intention was to only make
change 3. Change 1 is a bug. Change 2 does not seem to be necessary,
and it breaks UBIFS expectations, so if needed, it should be done
separately later. And change 2 does not seem to be documented in the
commit message.
This patch does the following:
1. Undo changes 1 and 2
2. Add a comment explaining change 3 (it very useful to have comments
in _code_, not only in the commit).
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
A bug was introduced into write_cache_pages cyclic writeout by commit 31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
fix"). The intention (and comments) is that we should cycle back and
look for more dirty pages at the beginning of the file if there is no
more work to be done.
But the !done condition was dropped from the test. This means that any
time the page writeout loop breaks (eg. due to nr_to_write == 0), we
will set index to 0, then goto again. This will set done_index to
index, then find done is set, so will proceed to the end of the
function. When updating mapping->writeback_index for cyclic writeout,
we now use done_index == 0, so we're always cycling back to 0.
This seemed to be causing random mmap writes (slapadd and iozone) to
start writing more pages from the LRU and writeout would slowdown, and
caused bugzilla entry
http://bugzilla.kernel.org/show_bug.cgi?id=12604
about Berkeley DB slowing down dramatically.
With this patch, iozone random write performance is increased nearly
5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The SYSCALL wrapper patches played havoc with kernel-doc for
syscalls. Syscalls that were scanned for DocBook processing
reported warnings like this one, for sys_tgkill:
Warning(kernel/signal.c:2285): No description found for parameter 'tgkill'
Warning(kernel/signal.c:2285): No description found for parameter 'pid_t'
Warning(kernel/signal.c:2285): No description found for parameter 'int'
because the macro parameters all "look like" function parameters,
although they are not:
/**
* sys_tgkill - send signal to one specific thread
* @tgid: the thread group ID of the thread
* @pid: the PID of the thread
* @sig: signal to be sent
*
* This syscall also checks the @tgid and returns -ESRCH even if the PID
* exists but it's not belonging to the target process anymore. This
* method solves the problem of threads exiting and PIDs getting reused.
*/
SYSCALL_DEFINE3(tgkill, pid_t, tgid, pid_t, pid, int, sig)
{
...
This patch special-cases the handling SYSCALL_DEFINE* function
prototypes by expanding them to
long sys_foobar(type1 arg1, type1 arg2, ...)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With the new system call defines we get this on uml:
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x308): undefined reference to `sys_sigprocmask'
Reason for this is that uml passes the preprocessor option
-Dsigprocmask=kernel_sigprocmask to gcc when compiling the kernel.
This causes SYSCALL_DEFINE3(sigprocmask, ...) to be expanded to
SYSCALL_DEFINEx(3, kernel_sigprocmask, ...) and finally to a system
call named sys_kernel_sigprocmask. However sys_sigprocmask is missing
because of this.
To avoid macro expansion for the system call name just concatenate the
name at first define instead of carrying it through severel levels.
This was pointed out by Al Viro.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: WANG Cong <wangcong@zeuux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Since netmos 9835 with subids 0x1014(IBM):0x0299 is now bound with
serial/8250_pci, because it has no parallel ports and subdevice id isn't
in the expected form, return -ENODEV from probe function.
Three people (Petr Mensik <pihhan@cipis.net>
["si" should be U+0161 U+00ED], Stephen Ho <stephenhoinhk@gmail.com>
on zd1211-devs and Ismael Ojeda Perez <iojedaperez@gmail.com>
on linux-wireless) reported success in getting TP-Link WN322G/WN422G
working by treating MAXIM_NEW_RF(0x08) as UW2453_RF(0x09) for rf
chip hardware initialization.
Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Tested-by: Petr Mensik <pihhan@cipis.net> Tested-by: Stephen Ho <stephenhoinhk@gmail.com> Tested-by: Ismael Ojeda Perez <iojedaperez@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Christoph Biedl <sourceforge.bnwi@manchmal.in-ulm.de> reported success
in the sourceforge zd1211 mailing list on this addition. This product ID
was supported by the vendor driver ZD1211LnxDrv 2.22.0.0 (and possibly
earlier) and it probably should have been added earlier.
Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Tested-by: Christoph Biedl <sourceforge.bnwi@manchmal.in-ulm.de> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When the temperature exceeds 32767 milli-degrees the temperature overflows
to -32768 millidegrees. These are bothe well within the -55 - +125 degree
range for the sensor.
Fix overflow in left-shift of a u8.
Signed-off-by: Ian Dall <ian@beware.dropbear.id.au> Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We try to find the correct outgoing interface for injected frames
based on the TA, but since this is a hack for hostapd 11w, restrict
the heuristic to AP mode interfaces. At some point we'll add the
ability to give an interface index in radiotap or so and just
remove this heuristic again.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This can also occur when an nbd device loses its nbd-client/server
connection. Although we clear the queue of any outstanding I/Os after the
client/server connection fails, any additional I/Os that get queued later
will hang.
This bug may also be the problem reported in this bug report:
http://bugzilla.kernel.org/show_bug.cgi?id=12277
Testing would need to be performed to determine if the two issues are the
same.
This problem was introduced by the new request handling thread code ("NBD:
allow nbd to be used locally", 3/2008), which entered into mainline around
2.6.25.
The fix, which is fairly simple, is to restore the check for lo->sock
being NULL in do_nbd_request. This causes I/O to an uninitialized nbd to
immediately fail with an I/O error, as it did prior to the introduction of
this bug.
Signed-off-by: Paul Clements <paul.clements@steeleye.com> Reported-by: Jon Nelson <jnelson-kernel-bugzilla@jamponi.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 6194ba6ff6ccf8d5c54c857600843c67aa82c407 ("x86: don't special-case
pmd allocations as much") made changes to the way we handle pmd allocations,
and while doing that it dropped a call to paravirt_release_pd on the
pgd page from the pgd_dtor code path.
As a result of this missing release, the hypervisor is now unaware of the
pgd page being freed, and as a result it ends up tracking this page as a
page table page.
After this the guest may start using the same page for other purposes, and
depending on what use the page is put to, it may result in various performance
and/or functional issues ( hangs, reboots).
Since this release is only required for VMI, I now release the pgd page from
the (vmi)_pgd_free hook.
Signed-off-by: Alok N Kataria <akataria@vmware.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There is a race between sctp_rcv() and sctp_accept() where we
have moved the association from the listening socket to the
accepted socket, but sctp_rcv() processing cached the old
socket and continues to use it.
The easy solution is to check for the socket mismatch once we've
grabed the socket lock. If we hit a mis-match, that means
that were are currently holding the lock on the listening socket,
but the association is refrencing a newly accepted socket. We need
to drop the lock on the old socket and grab the lock on the new one.
A more proper solution might be to create accepted sockets when
the new association is established, similar to TCP. That would
eliminate the race for 1-to-1 style sockets, but it would still
existing for 1-to-many sockets where a user wished to peeloff an
association. For now, we'll live with this easy solution as
it addresses the problem.
Reported-by: Michal Hocko <mhocko@suse.cz> Reported-by: Karsten Keil <kkeil@suse.de> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch (as1202) adds Pentax to usb-storage's list of bad vendors
whose devices always need the CAPACITY_HEURISTICS flag. This is in
addition to the existing entries: Nokia, Nikon, and Motorola.
Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch fixes sporadic firmware restarts when scanning while associated.
The firmware will quietly cancel a scan (while associated) if the dwell time
for a channel to be scanned is larger than the time it may stay away from the
operating channel (because of DTIM catching). Unfortunately the driver is not
notified about the canceled scan and therefore the scan watchdog timeout will
be hit and the driver causes a firmware restart which results in
disassociation. This mainly happens on passive channels which use a dwell time
of 120 whereas a typical beacon interval is around 100.
The patch changes the dwell time for passive channels to be slightly smaller
than the actual beacon interval to work around the firmware issue. Furthermore
the number of allowed beacon misses is increased from one to three as otherwise
most scans (while associated) won't complete successfully.
However scanning while associated will still fail in corner cases such as a
beacon intervals below 30.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Section B.6.2 of ACPI 3.0b specification that defines _BCL method
doesn't require the brightness levels returned to be sorted.
At least ThinkPad SL300 (and probably all IdeaPads) returns the
array reversed (i.e. bightest levels have lowest indexes), which
causes the brightness management behave in completely reversed
manner on these machines (brightness increases when the laptop is
idle, while the display dims when used).
Sorting the array by brightness level values after reading the list
fixes the issue.
http://bugzilla.kernel.org/show_bug.cgi?id=12037
Signed-off-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Thomas Renninger <trenn@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.
Checking for VM_READ optimizes out the case where get_user() would fail
anyway. The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.
Commit 27421e211a39784694b597dbf35848b88363c248, Manually revert
"mlock: downgrade mmap sem while populating mlocked regions", has
introduced its own regression: __mlock_vma_pages_range() may report
an error (for example, -EFAULT from trying to lock down pages from
beyond EOF), but mlock_vma_pages_range() must hide that from its
callers as before.
The PCI-card identified as "Oxford Semiconductor Ltd EXSYS EX-41092 Dual
16950 Serial adapter" is only usable with other devices (i.e. not the same
card) after doing a "setserial /dev/ttyS<n> baud_base 115200". This
baud_base should be default for this card.
Signed-off-by: Niels de Vos <niels.devos@wincor-nixdorf.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
lseek() further than length of the file will leave stale ->index
(second-to-last during iteration). Next seq_read() will not notice
that ->f_pos is big enough to return 0, but will print last item
as if ->f_pos is pointing to it.
In 2.6.25 some /proc files were converted to use the seq_file
infrastructure. But seq_files do not correctly support pread(), which
broke some usersapce applications.
To handle pread correctly we can't assume that f_pos is where we left it
in seq_read. So move traverse() so that we can eventually use it in
seq_read and do thus some day support pread().
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Cc: Paul Turner <pjt@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We only want to disable ASPM when the last function is removed from
the parent's device list. We determine this by checking to see if
the parent's device list is completely empty.
Unfortunately, we never hit that code because the parent is considered
an upstream port, and never had an ASPM link_state associated with it.
The early check for !link_state causes us to return early, we never
discover that our device list is empty, and thus we never remove the
downstream ports' link_state nodes.
Instead of checking to see if the parent's device list is empty, we can
check to see if we are the last device on the list, and if so, then we
know that we can clean up properly.
Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
During early boot, ACPI RSDT/XSDT table entries are gathered into the
'initial_tables[]' array. This array is currently statically defined (see
./drivers/acpi/tables.c). When there are more table entries than can be
held in the 'initial_tables[]' array, the message "Truncating N table
entries!" is output. As currently implemented, this message will always
erroneously calculate N as 0.
This patch fixes the calculation that determines how many table entries
will be missing (truncated).
This modification may be used under either the GPL or the BSD-style
license used for Intel ACPI CA code.
Signed-off-by: Myron Stowe <myron.stowe@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
According to the Spec the first two elements in the _BCL package won't be
regarded as the available brightness level. The first is the brightness when
full power is connected to the box(It means that the AC adapter is plugged).
The second is the brightness level when the box is on battery.
If the first two elements are still used while finding the next brightness
level, it will fall back to the lowest level when keeping on pressing
hotkey. (On some boxes the brightness will be changed twice when hotkey is
pressed once. One is in the ACPI video driver. The other is changed by sys I/F.
In the ACPI video driver the first two elements will be used while changing
the brightness. But the first two elements is skipped while using sys I/F.
In such case there exists the inconsistency).
So he first two elements had better be skipped while showing the available
brightness or finding the next brightness level.
http://bugzilla.kernel.org/show_bug.cgi?id=12450
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>