The nohz full kick, which restarts the tick when any resource depend
on it, can't be executed anywhere given the operation it does on timers.
If it is called from the scheduler or timers code, chances are that
we run into a deadlock.
This is why we run the nohz full kick from an irq work. That way we make
sure that the kick runs on a virgin context.
However if that's the case when irq work runs in its own dedicated
self-ipi, things are different for the big bunch of archs that don't
support the self triggered way. In order to support them, irq works are
also handled by the timer interrupt as fallback.
Now when irq works run on the timer interrupt, the context isn't blank.
More precisely, they can run in the context of the hrtimer that runs the
tick. But the nohz kick cancels and restarts this hrtimer and cancelling
an hrtimer from itself isn't allowed. This is why we run in an endless
loop:
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
Workqueue: btrfs-endio-write normal_work_helper [btrfs]
ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
Call Trace:
<NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
[<ffffffff8a7ef928>] panic+0xd4/0x207
[<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
[<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
[<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
[<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
[<ffffffff8a187934>] perf_event_overflow+0x14/0x20
[<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
[<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
[<ffffffff8a007b72>] nmi_handle+0xd2/0x390
[<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
[<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
[<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
[<ffffffff8a008268>] do_nmi+0xb8/0x100
[<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
[<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
[<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
[<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
<<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
[<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
[<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
[<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
[<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
[<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
[<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
[<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
[<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
[<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
[<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
[<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
[<ffffffff8a0f700b>] update_process_times+0x5b/0x70
[<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
[<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
[<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
[<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
[<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
[<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
[<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
[<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80
To fix this we force non-lazy irq works to run on irq work self-IPIs
when available. That ability of the arch to trigger irq work self IPIs
is available with arch_irq_work_has_interrupt().
Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Convert uses of __get_cpu_var for creating a address from a percpu
offset to this_cpu_ptr.
The two cases where get_cpu_var is used to actually access a percpu
variable are changed to use this_cpu_read/raw_cpu_read.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a timer is enqueued or modified on a dynticks target, that CPU
must re-evaluate the next tick to service that timer.
The tick re-evaluation is performed by an IPI kick on the target.
Now while we correctly call wake_up_nohz_cpu() from add_timer_on(), the
mod_timer*() API family doesn't support so well dynticks targets.
The reason for this is likely that __mod_timer() isn't supposed to
select an idle target for a timer, unless that target is the current
CPU, in which case a dynticks idle kick isn't actually needed.
But there is a small race window lurking behind that assumption: the
elected target has all the time to turn dynticks idle between the call
to get_nohz_timer_target() and the locking of its base. Hence a risk
that we enqueue a timer on a dynticks idle destination without kicking
it. As a result, the timer might be serviced too late in the future.
Also a target elected by __mod_timer() can be in full dynticks mode
and thus require to be kicked as well. And unlike idle dynticks, this
concern both local and remote targets.
To fix this whole issue, lets centralize the dynticks kick to
internal_add_timer() so that it is well handled for all sort of timer
enqueue. Even timer migration is concerned so that a full dynticks target
is correctly kicked as needed when timers are migrating to it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Timers are serviced by the tick. But when a timer is enqueued on a
dynticks target, we need to kick it in order to make it reconsider the
next tick to schedule to correctly handle the timer's expiring time.
Now while this kick is correctly performed for add_timer_on(), the
mod_timer*() family has been a bit neglected.
To prepare for fixing this, we need internal_add_timer() to be able to
resolve the CPU target associated to a timer's object 'base' so that the
kick can be centralized there.
This can't be passed as an argument as not all the callers know the CPU
number of a timer's base. So lets store it in the struct tvec_base to
resolve the CPU without much overhead. It is set once for good at every
CPU's first boot.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>