Here a newly created task is having its runqueue assigned. The new task
is not yet on the tasklist, so cannot go away. This is therefore a false
positive, suppress with an RCU read-side critical section.
Reported-by: Ben Greear <greearb@candelatech.com Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Ben Greear <greearb@candelatech.com Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If a high priority task is waking up on a CPU that is running a
lower priority task that is bound to a CPU, see if we can move the
high RT task to another CPU first. Note, if all other CPUs are
running higher priority tasks than the CPU bounded current task,
then it will be preempted regardless.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.888922071@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.
When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.
This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.
But in reality, the above does not happen much either, because the
wake up of the lower prio task, which has already been boosted, if
it was on the same CPU as the higher prio task, it would then
migrate off of it. But anyway, we do not want to migrate them
either.
To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes different
options. The options used in this change log was to have priority
inheritance mutexes or not.
All threads did the following loop:
static void grab_lock(long id, int iter, int l)
{
ftrace_write("thread %ld iter %d, taking lock %d\n",
id, iter, l);
pthread_mutex_lock(&locks[l]);
ftrace_write("thread %ld iter %d, took lock %d\n",
id, iter, l);
busy_loop(nr_tasks - id);
ftrace_write("thread %ld iter %d, unlock lock %d\n",
id, iter, l);
pthread_mutex_unlock(&locks[l]);
}
void *start_task(void *id)
{
[...]
while (!done) {
for (l = 0; l < nr_locks; l++) {
grab_lock(id, i, l);
ftrace_write("thread %ld iter %d sleeping\n",
id, i);
ms_sleep(id);
}
i++;
}
[...]
}
The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.
The higher the id, the higher the prio, the shorter it does the
busy loop, but the longer it spins. This is usually the case with
RT tasks, the lower priority tasks usually run longer than higher
priority tasks.
At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.
Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:
The total # of migrations did not change (several runs showed the
difference all within the noise). But we now see a dramatic
improvement to the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task to give it the
2 count. This was actually consistent with every run).
Notice that the # of iterations did not change either.
The above was with priority inheritance mutexes. That is, when the
higher prority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task (which shows
why task 6 was bumped so many times). When not using priority
inheritance mutexes, the current kernel shows this:
Which shows a even bigger change. The big difference between task 3
and task 4 is because we have only 4 CPUs on the machine, causing
the 4 highest prio tasks to always have preference.
Although I did not measure cache misses, and I'm sure there would
be little to measure since the test was not data intensive, I could
imagine large improvements for higher priority tasks when dealing
with lower priority tasks. Thus, I'm satisfied with making the
change and agreeing with what Gregory Haskins argued a few years
ago when we first had this discussion.
One final note. All tasks in the above tests were RT tasks. Any RT
task will always preempt a non RT task that is running on the CPU
the RT task wants to run on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
scheduler uses cache_nice_tries as an indicator to do cache_hot and
active load balance, when normal load balance fails. Currently,
this value is changed on any failed load balance attempt. That ends
up being not so nice to workloads that enter/exit idle often, as
they do more frequent new_idle balance and that pretty soon results
in cache hot tasks being pulled in.
Making the cache_nice_tries ignore failed new_idle balance seems to
make better sense. With that only the failed load balance in
periodic load balance gets accounted and the rate of accumulation
of cache_nice_tries will not depend on idle entry/exit (short
running sleep-wakeup kind of tasks). This reduces movement of
cache_hot tasks.
schedstat diff (after-before) excerpt from a workload that has
frequent and short wakeup-idle pattern (:2 in cpu col below refers
to NEWIDLE idx) This snapshot was across ~400 seconds.
Currently sched_avg_update() (which updates rt_avg stats in the rq)
is getting called from scale_rt_power() (in the load balance context)
which doesn't take rq->lock.
Fix it by moving the sched_avg_update() to more appropriate
update_cpu_load() where the CFS load gets updated as well.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When there's an xHCI host power loss after a suspend from memory, the USB
core attempts to reset and verify the USB devices that are attached to the
system. The xHCI driver has to reallocate those devices, since the
hardware lost all knowledge of them during the power loss.
When a hub is plugged in, and the host loses power, the xHCI hardware
structures are not updated to say the device is a hub. This is usually
done in hub_configure() when the USB hub is detected. That function is
skipped during a reset and verify by the USB core, since the core restores
the old configuration and alternate settings, and the hub driver has no
idea this happened. This bug makes the xHCI host controller reject the
enumeration of low speed devices under the resumed hub.
Therefore, make the USB core re-setup the internal xHCI hub device
information by calling update_hub_device() when hub_activate() is called
for a hub reset resume. After a host power loss, all devices under the
roothub get a reset-resume or a disconnect.
This patch should be queued for the 2.6.37 stable tree.
Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb
IPI's while the cr3 is still pointing to the prev mm. And this window
can lead to the possibility of bogus TLB fills resulting in strange
failures. One such problematic scenario is mentioned below.
T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI
etc between the point of clearing the cpu from the mm_cpumask(mm1)
and before reloading the cr3 with the new mm2.
T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
flushing the TLB for mm1. It doesn't send the flush TLB to CPU-1
as it doesn't see that cpu listed in the mm_cpumask(mm1).
T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
page-table pages associated with the removed vma mapping.
T4. CPU-2 now allocates those freed page-table pages for something
else.
T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1
can potentially speculate and walk through the page-table caches
and can insert new TLB entries. As the page-table pages are
already freed and being used on CPU-2, this page walk can
potentially insert a bogus global TLB entry depending on the
(random) contents of the page that is being used on CPU-2.
T6. This bogus TLB entry being global will be active across future CR3
changes and can result in weird memory corruption etc.
To avoid this issue, for the prev mm that is handing over the cpu to
another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
changed.
Marking it for -stable, though we haven't seen any reported failure that
can be attributed to this.
Without tmpfs, shmem_readpage() is not compiled in causing an OOPS as
soon as we try to allocate some swappable pages for GEM.
Jan 19 22:52:26 harlie kernel: Modules linked in: i915(+) drm_kms_helper cfbcopyarea video backlight cfbimgblt cfbfillrect
Jan 19 22:52:26 harlie kernel:
Jan 19 22:52:26 harlie kernel: Pid: 1125, comm: modprobe Not tainted 2.6.37Harlie #10 To be filled by O.E.M./To be filled by O.E.M.
Jan 19 22:52:26 harlie kernel: EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 3
Jan 19 22:52:26 harlie kernel: EIP is at 0x0
Jan 19 22:52:26 harlie kernel: EAX: 00000000 EBX: f7b7d000 ECX: f3383100 EDX: f7b7d000
Jan 19 22:52:26 harlie kernel: ESI: f1456118 EDI: 00000000 EBP: f2303c98 ESP: f2303c7c
Jan 19 22:52:26 harlie kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Jan 19 22:52:26 harlie kernel: Process modprobe (pid: 1125, ti=f2302000 task=f259cd80 task.ti=f2302000)
Jan 19 22:52:26 harlie kernel: Stack:
Jan 19 22:52:26 harlie udevd-work[1072]: '/sbin/modprobe -b pci:v00008086d00000046sv00000000sd00000000bc03sc00i00' unexpected exit with status 0x0009
Jan 19 22:52:26 harlie kernel: c1074061000000d0f2f42b8000000000000a13d2f2d5dcc000000001f2303cac
Jan 19 22:52:26 harlie kernel: c107416f00000000000a13d200000000f2303cd4f8d620edf2cee62000001000
Jan 19 22:52:26 harlie kernel: 00000000000a13d2f1456118f2d5dcc0f1a4000000001000f2303d04f8d637ab
Jan 19 22:52:26 harlie kernel: Call Trace:
Jan 19 22:52:26 harlie kernel: [<c1074061>] ? do_read_cache_page+0x71/0x160
Jan 19 22:52:26 harlie kernel: [<c107416f>] ? read_cache_page_gfp+0x1f/0x30
Jan 19 22:52:26 harlie kernel: [<f8d620ed>] ? i915_gem_object_get_pages+0xad/0x1d0 [i915]
Jan 19 22:52:26 harlie kernel: [<f8d637ab>] ? i915_gem_object_bind_to_gtt+0xeb/0x2d0 [i915]
Jan 19 22:52:26 harlie kernel: [<f8d65961>] ? i915_gem_object_pin+0x151/0x190 [i915]
Jan 19 22:52:26 harlie kernel: [<c11e16ed>] ? drm_gem_object_init+0x3d/0x60
Jan 19 22:52:26 harlie kernel: [<f8d65aa5>] ? i915_gem_init_ringbuffer+0x105/0x1e0 [i915]
Jan 19 22:52:26 harlie kernel: [<f8d571b7>] ? i915_driver_load+0x667/0x1160 [i915]
Reported-by: John J. Stimson-III <john@idsfa.net> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
The accelerate mode bit gets checked by certain atom
command tables to set up some register state. It needs
to be clear when setting modes and set when not.
Multipath began to use blk_abort_queue() to allow for
lower latency path deactivation. This was found to
cause list corruption:
the cmd gets blk_abort_queued/timedout run on it and the scsi eh
somehow is able to complete and run scsi_queue_insert while
scsi_request_fn is still trying to process the request.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the
size of a DM device. This additional locking is unnecessary because
i_size_write() is already protected by the existing critical section in
dm_swap_table(). DM already has a reference on md->bdev so the
associated bd_inode may be changed without lifetime concerns.
A negative side-effect of having held md->bdev->bd_inode->i_mutex was
that a concurrent DM device resize and flush (via fsync) would deadlock.
Dropping md->bdev->bd_inode->i_mutex eliminates this potential for
deadlock. The following reproducer no longer deadlocks:
https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html
It is defined in include/linux/ieee80211.h. As per IEEE spec.
bit6 to bit15 in block ack parameter represents buffer size.
So the bitmask should be 0xFFC0.
Signed-off-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Bing Zhao <bzhao@marvell.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
selinux_inode_init_security computes transitions sids even for filesystems
that use mount point labeling. It shouldn't do that. It should just use
the mount point label always and no matter what.
This causes 2 problems. 1) it makes file creation slower than it needs to be
since we calculate the transition sid and 2) it allows files to be created
with a different label than the mount point!
# id -Z
staff_u:sysadm_r:sysadm_t:s0-s0:c0.c1023
# sesearch --type --class file --source sysadm_t --target tmp_t
Found 1 semantic te rules:
type_transition sysadm_t tmp_t : file user_tmp_t;
# mount -o loop,context="system_u:object_r:tmp_t:s0" /tmp/fs /mnt/tmp
Commit 2f90b865 added two new netlink message types to the netlink route
socket. SELinux has hooks to define if netlink messages are allowed to
be sent or received, but it did not know about these two new message
types. By default we allow such actions so noone likely noticed. This
patch adds the proper definitions and thus proper permissions
enforcement.
Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The current TPM TIS driver in git discards the timeout values returned
from the TPM. The check of the response packet needs to consider that
the return_code field is 0 on success and the size of the expected
packet is equivalent to the header size + u32 length indicator for the
TPM_GetCapability() result + 3 timeout indicators of type u32.
I am also adding a sysfs entry 'timeouts' showing the timeouts that are
being used.
If duration variable value is 0 at this point, it's because
chip->vendor.duration wasn't filled by tpm_get_timeouts() yet.
This patch sets then the lowest timeout just to give enough
time for tpm_get_timeouts() to further succeed.
This fix avoids long boot times in case another entity attempts
to send commands to the TPM when the TPM isn't accessible.
Commit 1a855a0606 (2.6.37-rc4) fixed a problem where devices were
re-added when they shouldn't be but caused a regression in a less
common case that means sometimes devices cannot be re-added when they
should be.
In particular, when re-adding a device to an array without metadata
we should always access the device, but after the above commit we
didn't.
This patch sets the In_sync flag in that case so that the re-add
succeeds.
This patch is suitable for any -stable kernel to which 1a855a0606 was
applied.
pcmcia_request_irq() and pcmcia_enable_device() are intended
to be called from process context (first function allocate memory
with GFP_KERNEL, second take a mutex). We can not take spin lock
and call them.
It's safe to move spin lock after pcmcia_enable_device() as we
still hold off IRQ until dev->base_addr is 0 and driver will
not proceed with interrupts when is not ready.
We atomically tested and cleared our bit in the cpumask, and yet the
number of cpus left (ie refs) was 0. How can this be?
It turns out commit 54fdade1c3332391948ec43530c02c4794a38172
("generic-ipi: make struct call_function_data lockless") is at fault. It
removes locking from smp_call_function_many and in doing so creates a
rather complicated race.
The problem comes about because:
- The smp_call_function_many interrupt handler walks call_function.queue
without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
smp_call_function_many.
Imagine a scenario where CPU A does two smp_call_functions back to back,
and CPU B does an smp_call_function in between. We concentrate on how CPU
C handles the calls:
CPU A CPU B CPU C CPU D
smp_call_function
smp_call_function_interrupt
walks
call_function.queue sees
data from CPU A on list
smp_call_function
smp_call_function_interrupt
walks
call_function.queue sees
(stale) CPU A on list
smp_call_function int
clears last ref on A
list_del_rcu, unlock
smp_call_function reuses
percpu *data A
data->cpumask sees and
clears bit in cpumask
might be using old or new fn!
decrements refs below 0
set data->refs (too late!)
The important thing to note is since the interrupt handler walks a
potentially stale call_function.queue without any locking, then another
cpu can view the percpu *data structure at any time, even when the owner
is in the process of initialising it.
The following test case hits the WARN_ON 100% of the time on my PowerPC
box (having 128 threads does help :)
#include <linux/module.h>
#include <linux/init.h>
#define ITERATIONS 100
static void do_nothing_ipi(void *dummy)
{
}
static void do_ipis(struct work_struct *dummy)
{
int i;
for (i = 0; i < ITERATIONS; i++)
smp_call_function(do_nothing_ipi, NULL, 1);
I tried to fix it by ordering the read and the write of ->cpumask and
->refs. In doing so I missed a critical case but Paul McKenney was able
to spot my bug thankfully :) To ensure we arent viewing previous
iterations the interrupt handler needs to read ->refs then ->cpumask then
->refs _again_.
Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
[miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
[miltonm@bga.com: remove excess tests] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Remove the broken line wrapping handling in pdc_iodc_print().
It is broken in 3 ways :
- It doesn't keep track of the current screen position, it just
assumes that the new buffer will be printed at the begining of the
screen.
- It doesn't take in account that non printable characters won't
increase the current position on the screen.
- And last but not least, it triggers a kernel panic if a backspace
is the first char in the provided buffer :
Some of those functions try to adjust the CPU features, for example
to remove NAP support on some revisions. However, they seem to use
r5 as an index into the CPU table entry, which might have been right
a long time ago but no longer is. r4 is the right register to use.
This probably caused some off behaviours on some PowerMac variants
using 750cx or 7455 processor revisions.
Commit c0e69a5bbc6f ("klist.c: bit 0 in pointer can't be used as flag")
intended to make sure that all klist objects were at least pointer size
aligned, but used the constant "4" which only works on 32-bit.
Use "sizeof(void *)" which is correct in all cases.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Added 0x0307 device id to support Motorola cables to the pl2303 usb
serial driver. This cable has a modified chip that is a pl2303, but
declares itself as 0307. Fixed by adding the right device id to the
supported devices list, assigning it the code labeled
PL2303_PRODUCT_ID_MOTOROLA.
Markus Kohn ran into a hard hang regression on an acer aspire
1310, when acpi is enabled. git bisect showed the following
commit as the bad one that introduced the boot regression.
x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
Because of the UP configuration of that platform,
native_smp_prepare_cpus() bailed out (in smp_sanity_check())
before doing the set_mtrr_aps_delayed_init()
Further down the boot path, native_smp_cpus_done() will call the
delayed MTRR initialization for the AP's (mtrr_aps_init()) with
mtrr_aps_delayed_init not set. This resulted in the boot
processor reprogramming its MTRR's to the values seen during the
start of the OS boot. While this is not needed ideally, this
shouldn't have caused any side-effects. This is because the
reprogramming of MTRR's (set_mtrr_state() that gets called via
set_mtrr()) will check if the live register contents are Signed-off-by: Andi Kleen <ak@linux.intel.com>
different from what is being asked to write and will do the actual
write only if they are different.
BP's mtrr state is read during the start of the OS boot and
typically nothing would have changed when we ask to reprogram it
on BP again because of the above scenario on an UP platform. So
on a normal UP platform no reprogramming of BP MTRR MSR's
happens and all is well.
However, on this platform, bios seems to be modifying the fixed
mtrr range registers between the start of OS boot and when we
double check the live registers for reprogramming BP MTRR
registers. And as the live registers are modified, we end up
reprogramming the MTRR's to the state seen during the start of
the OS boot.
During ACPI initialization, something in the bios (probably smi
handler?) don't like this fact and results in a hard lockup.
We didn't see this boot hang issue on this platform before the
commit d0af9eed5aa91b6b7b5049cae69e5ea956fd85c3, because only
the AP's (if any) will program its MTRR's to the value that BP
had at the start of the OS boot.
Fix this issue by checking mtrr_aps_delayed_init before
continuing further in the mtrr_aps_init(). Now, only AP's (if
any) will program its MTRR's to the BP values during boot.
[ By the way, this behavior of the bios modifying MTRR's after the start
of the OS boot is not common and the kernel is not prepared to
handle this situation well. Irrespective of this issue, during
suspend/resume, linux kernel will try to reprogram the BP's MTRR values
to the values seen during the start of the OS boot. So suspend/resume might
be already broken on this platform for all linux kernel versions. ]
Reported-and-bisected-by: Markus Kohn <jabber@gmx.org> Tested-by: Markus Kohn <jabber@gmx.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Thomas Renninger <trenn@novell.com> Cc: Rafael Wysocki <rjw@novell.com> Cc: Venkatesh Pallipadi <venki@google.com>
LKML-Reference: <1296694975.4418.402.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The wake_up_process() call in ptrace_detach() is spurious and not
interlocked with the tracee state. IOW, the tracee could be running or
sleeping in any place in the kernel by the time wake_up_process() is
called. This can lead to the tracee waking up unexpectedly which can be
dangerous.
The wake_up is spurious and should be removed but for now reduce its
toxicity by only waking up if the tracee is in TRACED or STOPPED state.
This bug can possibly be used as an attack vector. I don't think it
will take too much effort to come up with an attack which triggers oops
somewhere. Most sleeps are wrapped in condition test loops and should
be safe but we have quite a number of places where sleep and wakeup
conditions are expected to be interlocked. Although the window of
opportunity is tiny, ptrace can be used by non-privileged users and with
some loading the window can definitely be extended and exploited.
Remove real devices first and dummy devices last. This gives device
driver which instantiated dummy devices themselves a chance to clean
them up before we do.
Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Hans Verkuil <hverkuil@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
A check against division by zero was modified in commit b0525b48.
Since this change time_to_empty_now is always reported as zero
while the battery is discharging and as a negative value while
the battery is charging. This is because current is negative while
the battery is discharging.
Fix the check introduced by commit b0525b48 so that time_to_empty_now
is reported correctly during discharge and as zero while charging.
Signed-off-by: Sven Neumann <s.neumann@raumfeld.com> Acked-by: Daniel Mack <daniel@caiaq.de> Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
We sometimes need to map between the virtio device and
the given pci device. One such use is OS installer that
gets the boot pci device from BIOS and needs to
find the relevant block device. Since it can't,
installation fails.
Instead of creating a top-level devices/virtio-pci
directory, create each device under the corresponding
pci device node. Symlinks to all virtio-pci
devices can be found under the pci driver link in
bus/pci/drivers/virtio-pci/devices, and all virtio
devices under drivers/bus/virtio/devices.
Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Gleb Natapov <gleb@redhat.com> Tested-by: "Daniel P. Berrange" <berrange@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
pci-stub uses strsep() to separate list of ids and generates a warning
message when it fails to parse an id. However, not specifying the
parameter results in ids set to an empty string. strsep() happily
returns the empty string as the first token and thus triggers the
warning message spuriously.
In fsl_rio_dbell_handler() the code currently simply acknowledges the QFI
queue full interrupt, but does nothing to resolve the queue full
condition. Instead, it jumps to the end of the isr. When a queue full
condition occurs, the isr is then re-entered immediately and continually,
forever.
The fix is to just fall through and read out current doorbell entries.
Signed-off-by: Thomas Taranowski <tom@baringforge.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: Thomas Moll <thomas.moll@sysgo.com> Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
rtc-cmos was setting suspend/resume hooks at the device_driver level.
However, the platform bus code (drivers/base/platform.c) only looks for
resume hooks at the dev_pm_ops level, or within the platform_driver.
Switch rtc_cmos to use dev_pm_ops so that suspend/resume code is executed
again.
Paul said:
: The user visible symptom in our (XO laptop) case was that rtcwake would
: fail to wake the laptop. The RTC alarm would expire, but the wakeup
: wasn't unmasked.
:
: As for severity, the impact may have been reduced because if I recall
: correctly, the bug only affected platforms with CONFIG_PNP disabled.
Signed-off-by: Paul Fox <pgf@laptop.org> Signed-off-by: Daniel Drake <dsd@laptop.org> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
If a timer interrupt was delayed too much, hrtimer_forward_now() will
forward the timer expiry more than once. When this happens, the
additional number of elapsed ALSA timer ticks must be passed to
snd_timer_interrupt() to prevent the ALSA timer from falling behind.
This mostly fixes MIDI slowdown problems on highly-loaded systems with
badly behaved interrupt handlers.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Andi Kleen <ak@linux.intel.com> Reported-and-tested-by: Arthur Marsh <arthur.marsh@internode.on.net> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The Conexant codec driver adds the jack arrays in init callback which
may be called also in each PM resume. This results in the addition of
new jack element at each time.
The fix is to check whether the requested jack is already present in
the array.
Fix playback/capture channels patch to change supported playback
channels of au8830 to 1,2,4 and capture channels to 1,2.
This prevent oops when oss emulation use SNDCTL_DSP_CHANNELS to
set 3 Channels
gcc 4.5+ doesn't properly evaluate some inlined expressions.
A previous patch were proposed by Andrew Morton using noinline.
However, the entire inlined function is bogus, so let's just
remove it and be happy.
There was a configuration page timing out during the initial port
enable at driver load time. The port enable would fail, and this would
result in the driver unloading itself, meanwhile the driver was accessing
freed memory in another context resulting in the panic. The fix is to
prevent access to freed memory once the driver had issued the diag reset
which woke up the sleeping port enable process. The routine
_base_reset_handler was reorganized so the last sleeping process woken up was
the port_enable.
The ioc->hba_queue_depth is not properly resized when the controller
firmware reports that it supports more outstanding IO than what can be fit
inside the reply descriptor pool depth. This is reproduced by setting the
controller global credits larger than 30,000. The bug results in an
incorrect sizing of the queues. The fix is to resize the queue_size by
dividing queue_diff by two.
When zoning end devices, the driver is not sending device
removal handshake alogrithm to firmware. This results in controller
firmware not sending sas topology add events the next time the device is
added. The fix is the driver should be doing the device removal handshake
even though the PHYSTATUS_VACANT bit is set in the PhyStatus of the
event data. The current design is avoiding the handshake when the
VACANT bit is set in the phy status.
libsas makes use of scsi_schedule_eh() but forgets to clear the
host_eh_scheduled flag in its error handling routine. Because of this,
the error handler thread never gets to sleep; it's constantly awake and
trying to run the error routine leading to console spew and inability to
run anything else (at least on a UP system). The fix is to clear the
flag as we splice the work queue.
Our current handling of medium error assumes that data is returned up
to the bad sector. This assumption holds good for all disk devices,
all DIF arrays and most ordinary arrays. However, an LSI array engine
was recently discovered which reports a medium error without returning
any data. This means that when we report good data up to the medium
error, we've reported junk originally in the buffer as good. Worse,
if the read consists of requested data plus a readahead, and the error
occurs in readahead, we'll just strip off the readahead and report
junk up to userspace as good data with no error.
The fix for this is to have the error position computation take into
account the amount of data returned by the driver using the scsi
residual data. Unfortunately, not every driver fills in this data,
but for those who don't, it's set to zero, which means we'll think a
full set of data was transferred and the behaviour will be identical
to the prior behaviour of the code (believe the buffer up to the error
sector). All modern drivers seem to set the residual, so that should
fix up the LSI failure/corruption case.
Reported-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Since commit 6cd0b1cb872b3bf9fc5de4536404206ab74bafdd "iwlagn: fix
hw-rfkill while the interface is down", we enable interrupts when
device is not ready to receive them. However hardware, when it is in
some inconsistent state, can generate other than rfkill interrupts
and crash the system. I can reproduce crash with "kernel BUG at
drivers/net/wireless/iwlwifi/iwl-agn.c:1010!" message, when forcing
firmware restarts.
To fix only enable rfkill interrupt when down device and after probe.
I checked patch on laptop with 5100 device, rfkill change is still
passed to user space when device is down.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Wey-Yi Guy <wey-yi.w.guy@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
The hv_netvsc gets RNDIS_STATUS_MEDIA_CONNECT event after the VM
is live migrated. Adding call to netif_notify_peers() for this event
to send GARP (Gratuitous ARP) to notify network peers. Otherwise,
the VM's network connection may stop after a live migration.
This patch should also be applied to stable kernel 2.6.32 and later.
The ni_labpc driver module only requests a shared IRQ for PCI devices,
requesting a non-shared IRQ for non-PCI devices.
As this module is also used by the ni_labpc_cs module for certain
National Instruments PCMCIA cards, it also needs to request a shared IRQ
for PCMCIA devices, otherwise you get a IRQ mismatch with the CardBus
controller.
If anyone comes across a high-speed hub that (by mistake or by design)
claims to have no Transaction Translators, plugging a full- or
low-speed device into it will cause the USB stack to crash. This
patch (as1446) prevents the problem by ignoring such devices, since
the kernel has no way to communicate with them.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Perry Neben <neben@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The major and minor number saved in the product_info structure
were copied from the address instead of the data, causing an
inconsistency in the reported versions during firmware loading:
usb 4-1: firmware: requesting edgeport/down.fw
/usr/src/linux/drivers/usb/serial/io_edgeport.c: downloading firmware version (930) 1.16.4
[..]
/usr/src/linux/drivers/usb/serial/io_edgeport.c: edge_startup - time 3 4328191260
/usr/src/linux/drivers/usb/serial/io_edgeport.c: FirmwareMajorVersion 0.0.4
This can cause some confusion whether firmware loaded successfully
or not.
This patch (as1442) fixes a bug in g_printer: Module parameters should
not be marked "__initdata" if they are accessible in sysfs (i.e., if
the mode value in the module_param() macro is nonzero). Otherwise
attempts to access the parameters will cause addressing violations.
Character-string module parameters must not be marked "__initdata"
if the module can be unloaded, because the kernel needs to access the
parameter variable at unload time in order to free the
dynamically-allocated string.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andi Kleen <ak@linux.intel.com> CC: Roland Kletzing <devzero@web.de> CC: Craig W. Nadler <craig@nadler.us> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch (as1440) fixes a bug in ehci-hcd. ehci->periodic_size is
used to compute the size in a dma_alloc_coherent() call, but then it
gets changed later on. As a result, the corresponding call to
dma_free_coherent() passes a different size from the original
allocation. Fix the problem by adjusting ehci->periodic_size before
carrying out any of the memory allocations.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andi Kleen <ak@linux.intel.com> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> CC: David Brownell <david-b@pacbell.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
/drivers/usb/serial/option.c: Adding support for Cinterion's HC25, HC28,
HC28J, EU3-E, EU3-P and PH8 by correcting/adding Cinterion's and
Siemens' Vendor IDs as well as Product IDs and USB_DEVICE tuples
I found the original patch on the db0fhn repeater wiki (couldn't find the email
of the origial author) I guess it was never commited.
I updated and added some Icom HAM-radio devices to the ftdi driver.
Added extra comments to make clear what devices it are.
This patch (as1438) adds an unusual_devs entry for the MagicPixel
FW_Omega2 chip, used in the CamSport Evo camera. The firmware
incorrectly reports a vendor-specific bDeviceClass.
The TrekStor DataStation maxi g.u external hard drive enclosure uses a
JMicron USB to SATA chip which needs the US_FL_IGNORE_RESIDUE flag to work
properly.
Device ID removed 0x10C4/0x8149 for West Mountain Radio Computerized
Battery Analyzer. This device is actually based on a SiLabs C8051Fxxx,
see http://www.etheus.net/SiUSBXp_Linux_Driver for further info.
Alan's commit 335f8514f200e63d689113d29cb7253a5c282967 introduced
.carrier_raised function in several drivers. That also means
tty_port_block_til_ready can now suspend the process trying to open the serial
port when Carrier Detect is low and put it into tty_port.open_wait queue. We
need to wake up the process when Carrier Detect goes high and trigger TTY
hangup when CD goes low.
Some of the devices do not report modem status line changes, or at least we
don't understand the status message, so for those we remove .carrier_raised
again.
Functions set_fan_min() and set_fan_div() assume that the fan_div
values have already been read from the register. The driver currently
doesn't initialize them at load time, they are only set when function
via686a_update_device() is called. This means that set_fan_min() and
set_fan_div() misbehave if, for example, "sensors -s" is called
before any monitoring application (e.g. "sensors") is has been run.
Fix the problem by always initializing the fan_div values at device
bind time.
When ASPM PM Feature is enabled on UMI link, devices that use ISOC stream of
data transfer may be exposed to longer latency causing less than optimal per-
formance of the device. The longer latencies are normal and are due to link
wake time coming out of low power state which happens frequently to save
power when the link is not active.
The following code will make exception for certain features of ASPM to be by
passed and keep the logic normal state only when the ISOC device is connected
and active. This change will allow the device to run at optimal performance
yet minimize the impact on overall power savings.
Signed-off-by: Alex He <alex.he@amd.com> Acked-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
In the vhci_urb_dequeue() function the TCP connection is checked twice.
Each time when the TCP connection is closed the URB is unlinked and given
back. Remove the second attempt of unlinking and giving back of the URB completely.
This patch fixes the bug described at https://bugzilla.kernel.org/show_bug.cgi?id=24872 .
David Howells [Thu, 31 Mar 2011 18:57:36 +0000 (11:57 -0700)]
Fix cred leak in AF_NETLINK
Patch cab9e9848b9a8283b0504a2d7c435a9f5ba026de to the 2.6.35.y stable tree
stored a ref to the current cred struct in struct scm_cookie. This was fine
with AF_UNIX as that calls scm_destroy() from its packet sending functions, but
AF_NETLINK, which also uses scm_send(), does not call scm_destroy() - meaning
that the copied credentials leak each time SCM data is sent over a netlink
socket.
This can be triggered quite simply on a Fedora 13 or 14 userspace with the
2.6.35.11 kernel (or something based off of that) by calling:
#!/bin/bash
for ((i=0; i<100; i++))
do
su - -c /bin/true
cut -d: -f1 /proc/slabinfo | grep 'cred\|key\|task_struct'
cat /proc/keys | wc -l
done
This leaks the session key that pam_keyinit creates for 'su -', which appears
in /proc/keys as being revoked (has the R flag set against it) afterward su is
called.
Furthermore, if CONFIG_SLAB=y, then the cred and key slab object usage counts
can be viewed and seen to increase. The key slab increases by one object per
loop, and this can be seen after the system has had a couple of minutes to
stand after the script above has been run on it.
If the system is working correctly, the key and cred counts should return to
roughly what they were before.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Use a safer coding style for the hotkey keymap. This does not fix any
problems, as the current code is correct. But it might help avoid
mistakes in the future.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
If we receive two PERF_RECORD_EXIT for the same thread, we can end up
reusing session->last_match and trying to remove the thread twice from
the rb_tree, causing a segfault, so invalidade last_match in
perf_session__remove_thread.
Receiving two PERF_RECORD_EXIT for the same thread is a bug, but its a
harmless one if we make the tool more robust, like this patch does.
Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <new-submission> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
The change 'mac80211: Fix BUG in pskb_expand_head when transmitting shared skbs'
added a check for copying the skb if it's shared, however the tx info variable
still points at the cb of the old skb
Signed-off-by: Felix Fietkau <nbd@openwrt.org> Acked-by: Helmut Schaa <helmut.schaa@googlemail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Under memory pressure, the mac80211 mesh code
may helpfully print a message that it failed
to clone a mesh frame and then will proceed
to crash trying to use it anyway. Fix that.
Avoid the reference whenever the frame copy is unsuccessful
regardless of the debug message being suppressed or printed.
Cc: stable@kernel.org [2.6.27+] Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
We can not call del_timer_sync(addba_resp_timer) from
___ieee80211_stop_tx_ba_session(), as this function can be called from
that timer callback. To fix, simply use not synchronous del_timer().
The flag PDN_INV indicates that the sensor pin S_PWR_DN has not the same
value as other webcams with the same sensor. For now, only two webcams have
been so detected: the Microsoft's VX1000 and VX3000.