Pull networking updates from David Miller:
1) More iov_iter conversion work from Al Viro.
[ The "crypto: switch af_alg_make_sg() to iov_iter" commit was
wrong, and this pull actually adds an extra commit on top of the
branch I'm pulling to fix that up, so that the pre-merge state is
ok. - Linus ]
2) Various optimizations to the ipv4 forwarding information base trie
lookup implementation. From Alexander Duyck.
3) Remove sock_iocb altogether, from CHristoph Hellwig.
4) Allow congestion control algorithm selection via routing metrics.
From Daniel Borkmann.
5) Make ipv4 uncached route list per-cpu, from Eric Dumazet.
6) Handle rfs hash collisions more gracefully, also from Eric Dumazet.
7) Add xmit_more support to r8169, e1000, and e1000e drivers. From
Florian Westphal.
8) Transparent Ethernet Bridging support for GRO, from Jesse Gross.
9) Add BPF packet actions to packet scheduler, from Jiri Pirko.
10) Add support for uniqu flow IDs to openvswitch, from Joe Stringer.
11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman
Kwok.
12) More sanely handle out-of-window dupacks, which can result in
serious ACK storms. From Neal Cardwell.
13) Various rhashtable bug fixes and enhancements, from Herbert Xu,
Patrick McHardy, and Thomas Graf.
14) Support xmit_more in be2net, from Sathya Perla.
15) Group Policy extensions for vxlan, from Thomas Graf.
16) Remove Checksum Offload support for vxlan, from Tom Herbert.
17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From
Vlad Yasevich.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits)
crypto: fix af_alg_make_sg() conversion to iov_iter
ipv4: Namespecify TCP PMTU mechanism
i40e: Fix for stats init function call in Rx setup
tcp: don't include Fast Open option in SYN-ACK on pure SYN-data
openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set
ipv6: Make __ipv6_select_ident static
ipv6: Fix fragment id assignment on LE arches.
bridge: Fix inability to add non-vlan fdb entry
net: Mellanox: Delete unnecessary checks before the function call "vunmap"
cxgb4: Add support in cxgb4 to get expansion rom version via ethtool
ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version
net: dsa: Remove redundant phy_attach()
IB/mlx4: Reset flow support for IB kernel ULPs
IB/mlx4: Always use the correct port for mirrored multicast attachments
net/bonding: Fix potential bad memory access during bonding events
tipc: remove tipc_snprintf
tipc: nl compat add noop and remove legacy nl framework
tipc: convert legacy nl stats show to nl compat
tipc: convert legacy nl net id get to nl compat
tipc: convert legacy nl net id set to nl compat
...
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- various sched/deadline fixes and enhancements
- rescheduling latency fixes/cleanups
- rework the rq->clock code to be more consistent and more robust.
- minor micro-optimizations
- ->avg.decay_count fixes
- add a stack overflow check to might_sleep()
- idle-poll handler fix, possibly resulting in power savings
- misc smaller updates and fixes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/Documentation: Remove unneeded word
sched/wait: Introduce wait_on_bit_timeout()
sched: Pull resched loop to __schedule() callers
sched/deadline: Remove cpu_active_mask from cpudl_find()
sched: Fix hrtick_start() on UP
sched/deadline: Avoid pointless __setscheduler()
sched/deadline: Fix stale yield state
sched/deadline: Fix hrtick for a non-leftmost task
sched/deadline: Modify cpudl::free_cpus to reflect rd->online
sched/idle: Add missing checks to the exit condition of cpu_idle_poll()
sched: Fix missing preemption opportunity
sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
sched/debug: Print rq->clock_task
sched/core: Rework rq->clock update skips
sched/core: Validate rq_clock*() serialization
sched/core: Remove check of p->sched_class
sched/fair: Fix sched_entity::avg::decay_count initialization
sched/debug: Fix potential call to __ffs(0) in sched_show_task()
sched/debug: Check for stack overflow in ___might_sleep()
sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()
We explicitly mark the task running after returning from
a __rt_mutex_slowlock() call, which does the actual sleeping
via wait-wake-trylocking. As such, this patch does two things:
(1) refactors the code so that setting current to TASK_RUNNING
is done by __rt_mutex_slowlock(), and not by the callers. The
downside to this is that it becomes a bit unclear when at what
point we block. As such I've added a comment that the task
blocks when calling __rt_mutex_slowlock() so readers can figure
out when it is running again.
(2) relaxes setting current's state through __set_current_state(),
instead of it's more expensive barrier alternative. There was no
need for the implied barrier as we're obviously not planning on
blocking.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422857784.18096.1.camel@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the kernel is compiled with function tracer support the -pg compile option
is passed to gcc to generate extra code into the prologue of each function.
This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
makefile variable which architectures can override if a different option
should be used for code generation.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
Both mutexes and rwsems took a performance hit when we switched
over from the original mcs code to the cancelable variant (osq).
The reason being the use of smp_load_acquire() when polling for
node->locked. This is not needed as reordering is not an issue,
as such, relax the barrier semantics. Paul describes the scenario
nicely: https://lkml.org/lkml/2013/11/19/405
- If we start polling before the insertion is complete, all that
happens is that the first few polls have no chance of seeing a lock
grant.
- Ordering the polling against the initialization -- the above
xchg() is already doing that for us.
The smp_load_acquire() when unqueuing make sense. In addition,
we don't need to worry about leaking the critical region as
osq is only used internally.
This impacts both regular and large levels of concurrency,
ie on a 40 core system with a disk intensive workload:
disk-1 804.83 ( 0.00%) 828.16 ( 2.90%)
disk-61 8063.45 ( 0.00%) 18181.82 (125.48%)
disk-121 7187.41 ( 0.00%) 20119.17 (179.92%)
disk-181 6933.32 ( 0.00%) 20509.91 (195.82%)
disk-241 6850.81 ( 0.00%) 20397.80 (197.74%)
disk-301 6815.22 ( 0.00%) 20287.58 (197.68%)
disk-361 7080.40 ( 0.00%) 20205.22 (185.37%)
disk-421 7076.13 ( 0.00%) 19957.33 (182.04%)
disk-481 7083.25 ( 0.00%) 19784.06 (179.31%)
disk-541 7038.39 ( 0.00%) 19610.92 (178.63%)
disk-601 7072.04 ( 0.00%) 19464.53 (175.23%)
disk-661 7010.97 ( 0.00%) 19348.23 (175.97%)
disk-721 7069.44 ( 0.00%) 19255.33 (172.37%)
disk-781 7007.58 ( 0.00%) 19103.14 (172.61%)
disk-841 6981.18 ( 0.00%) 18964.22 (171.65%)
disk-901 6968.47 ( 0.00%) 18826.72 (170.17%)
disk-961 6964.61 ( 0.00%) 18708.02 (168.62%)
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1420573509-24774-7-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have two flavors of the MCS spinlock: standard and cancelable (OSQ).
While each one is independent of the other, we currently mix and match
them. This patch:
- Moves the OSQ code out of mcs_spinlock.h (which only deals with the traditional
version) into include/linux/osq_lock.h. No unnecessary code is added to the
more global header file, anything locks that make use of OSQ must include
it anyway.
- Renames mcs_spinlock.c to osq_lock.c. This file only contains osq code.
- Introduces a CONFIG_LOCK_SPIN_ON_OWNER in order to only build osq_lock
if there is support for it.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Waiman Long <Waiman.Long@hp.com>
Link: http://lkml.kernel.org/r/1420573509-24774-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.
There are a few occasions where its 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING,
typically such cases are right before we set TASK_RUNNING anyhow.
Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull core locking updates from Ingo Molnar:
"The main updates in this cycle were:
- mutex MCS refactoring finishing touches: improve comments, refactor
and clean up code, reduce debug data structure footprint, etc.
- qrwlock finishing touches: remove old code, self-test updates.
- small rwsem optimization
- various smaller fixes/cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Revert qrwlock recusive stuff
locking/rwsem: Avoid double checking before try acquiring write lock
locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
locking/semaphore: Resolve some shadow warnings
locking/selftest: Support queued rwlock
locking/lockdep: Restrict the use of recursive read_lock() with qrwlock
locking/spinlocks: Always evaluate the second argument of spin_lock_nested()
locking/Documentation: Update locking/mutex-design.txt disadvantages
locking/Documentation: Move locking related docs into Documentation/locking/
locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate
locking/mutexes: Refactor optimistic spinning code
locking/mcs: Remove obsolete comment
locking/mutexes: Document quick lock release when unlocking
locking/mutexes: Standardize arguments in lock/unlock slowpaths
locking: Remove deprecated smp_mb__() barriers
Commit f0bab73cb5 ("locking/lockdep: Restrict the use of recursive
read_lock() with qrwlock") changed lockdep to try and conform to the
qrwlock semantics which differ from the traditional rwlock semantics.
In particular qrwlock is fair outside of interrupt context, but in
interrupt context readers will ignore all fairness.
The problem modeling this is that read and write side have different
lock state (interrupts) semantics but we only have a single
representation of these. Therefore lockdep will get confused, thinking
the lock can cause interrupt lock inversions.
So revert it for now; the old rwlock semantics were already imperfectly
modeled and the qrwlock extra won't fit either.
If we want to properly fix this, I think we need to resurrect the work
by Gautham did a few years ago that split the read and write state of
locks:
http://lwn.net/Articles/332801/
FWIW the locking selftest that would've failed (and was reported by
Borislav earlier) is something like:
RL(X1); /* IRQ-ON */
LOCK(A);
UNLOCK(A);
RU(X1);
IRQ_ENTER();
RL(X1); /* IN-IRQ */
RU(X1);
IRQ_EXIT();
At which point it would report that because A is an IRQ-unsafe lock we
can suffer the following inversion:
CPU0 CPU1
lock(A)
lock(X1)
lock(A)
<IRQ>
lock(X1)
And this is 'wrong' because X1 can recurse (assuming the above lock are
in fact read-lock) but lockdep doesn't know about this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: ego@linux.vnet.ibm.com
Cc: bp@alien8.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20140930132600.GA7444@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 9b0fc9c09f ("rwsem: skip initial trylock in rwsem_down_write_failed")
checks for if there are known active lockers in order to avoid write trylocking
using expensive cmpxchg() when it likely wouldn't get the lock.
However, a subsequent patch was added such that we directly
check for sem->count == RWSEM_WAITING_BIAS right before trying
that cmpxchg().
Thus, commit 9b0fc9c09f now just adds overhead.
This patch modifies it so that we only do a check for if
count == RWSEM_WAITING_BIAS.
Also, add a comment on why we do an "extra check" of count
before the cmpxchg().
Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1410913017.2447.22.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove some unnecessary ones and explicitly include rwsem.h
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Its quite easy to get mixed up with the names -- 'torture_spinlock_irq'
is not actually a valid spinlock name.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add a "rw_lock" torture test to stress kernel rwlocks and their irq
variant. Reader critical regions are 5x longer than writers. As such
a similar ratio of lock acquisitions is seen in the statistics. In the
case of massive contention, both hold the lock for 1/10 of a second.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The amount of global variables is getting pretty ugly. Group variables
related to the execution (ie: not parameters) in a new context structure.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
We can easily do so with our new reader lock support. Just an arbitrary
design default: readers have higher (5x) critical region latencies than
writers: 50 ms and 10 ms, respectively.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Most of it is based on what we already have for writers. This allows
readers to be very independent (and thus configurable), enabling
future module parameters to control things such as rw distribution.
Furthermore, readers have their own delaying function, allowing us
to test different rw critical region latencies, and stress locking
internals. Similarly, statistics, for now will only serve for the
number of lock acquisitions -- as opposed to writers, readers have
no failure detection.
In addition, introduce a new nreaders_stress module parameter. The
default number of readers will be the same number of writers threads.
Writer threads are interleaved with readers. Documentation is updated,
respectively.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When performing module cleanups by calling torture_cleanup() the
'torture_type' string in nullified However, callers are not necessarily
done, and might still need to reference the variable. This impacts
both rcutorture and locktorture, causing printing things like:
[ 94.226618] (null)-torture: Stopping lock_torture_writer task
[ 94.226624] (null)-torture: Stopping lock_torture_stats task
Thus delay this operation until the very end of the cleanup process.
The consequence (which shouldn't matter for this kid of program) is,
of course, that we delay the window between rmmod and modprobing,
for instance in module_torture_begin().
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The statistics structure can serve well for both reader and writer
locks, thus simply rename some fields that mention 'write' and leave
the declaration of lwsa.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Regular locks are very different than locks with debugging. For instance
for mutexes, debugging forces to only take the slowpaths. As such, the
locktorture module should take this into account when printing related
information -- specifically when printing user passed parameters, it seems
the right place for such info.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add a "mutex_lock" torture test. The main difference with the already
existing spinlock tests is that the latency of the critical region
is much larger. We randomly delay for (arbitrarily) either 500 ms or,
otherwise, 25 ms. While this can considerably reduce the amount of
writes compared to non blocking locks, if run long enough it can have
the same torturous effect. Furthermore it is more representative of
mutex hold times and can stress better things like thrashing.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
... to just 'torture_runnable'. It follows other variable naming
and is shorter.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Unlike the original unfair rwlock implementation, queued rwlock
will grant lock according to the chronological sequence of the lock
requests except when the lock requester is in the interrupt context.
Consequently, recursive read_lock calls will now hang the process if
there is a write_lock call somewhere in between the read_lock calls.
This patch updates the lockdep implementation to look for recursive
read_lock calls. A new read state (3) is used to mark those read_lock
call that cannot be recursively called except in the interrupt
context. The new read state does exhaust the 2 bits available in
held_lock:read bit field. The addition of any new read state in the
future may require a redesign of how all those bits are squeezed
together in the held_lock structure.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1407345722-61615-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When unlocking, we always want to reach the slowpath with the lock's counter
indicating it is unlocked. -- as returned by the asm fastpath call or by
explicitly setting it. While doing so, at least in theory, we can optimize
and allow faster lock stealing.
When unlocking, we always want to reach the slowpath with the lock's counter
indicating it is unlocked. -- as returned by the asm fastpath call or by
explicitly setting it. While doing so, at least in theory, we can optimize
and allow faster lock stealing.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jason.low2@hp.com
Cc: aswin@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1406752916-3341-2-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two definitions of struct rw_semaphore, one in linux/rwsem.h
and one in linux/rwsem-spinlock.h.
For some reason they have different names for the initial field. This
makes it impossible to use C99 named initialization for
__RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
along with the structure definitions.
The simpler patch is renaming the rwsem-spinlock variant to match the
regular rwsem.
This allows us to switch to C99 named initialization.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4fc828e24c ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.
Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when readers(s) obtain the rwsem.
In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.
This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.
Retesting of AIM7, which were some of the workloads used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a performance hit of ~-14% to throughput for >300 users with this
additional patch, the overall gain with optimistic spinning is
still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.
Tested-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>