Commit Graph

60631 Commits

Author SHA1 Message Date
Dominik Brodowski
a87d35d87a net: socket: add __sys_bind() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_bind() allows us to avoid the
internal calls to the sys_bind() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:07 +02:00
Dominik Brodowski
9d6a15c3f2 net: socket: add __sys_socket() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_socket() allows us to avoid the
internal calls to the sys_socket() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:06 +02:00
Dominik Brodowski
4541e80560 net: socket: add __sys_accept4() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_accept4() allows us to avoid the
internal calls to the sys_accept4() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:06 +02:00
Dominik Brodowski
211b634b7f net: socket: add __sys_sendto() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_sendto() allows us to avoid the
internal calls to the sys_sendto() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:05 +02:00
Dominik Brodowski
7a09e1eb9c net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_recvfrom() allows us to avoid the
internal calls to the sys_recvfrom() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:04 +02:00
Dominik Brodowski
2de0db992d mm: use do_futex() instead of sys_futex() in mm_release()
sys_futex() is a wrapper to do_futex() which does not modify any
values here:

- uaddr, val and val3 are kept the same

- op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE.
  Therefore, val2 is always 0.

- as utime is set to NULL, *timeout is NULL

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:02 +02:00
Dominik Brodowski
d53238cd51 kernel: open-code sys_rt_sigpending() in sys_sigpending()
A similar but not fully equivalent code path is already open-coded
three times (in sys_rt_sigpending and in the two compat stubs), so
do it a fourth time here.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:00 +02:00
Linus Torvalds
486adcea4a Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main kernel side changes were:

   - Modernize the kprobe and uprobe creation/destruction tooling ABIs:

     The existing text based APIs (kprobe_events and uprobe_events in
     tracefs), are naive, limited ABIs in that they require user-space
     to clean up after themselves, which is both difficult and fragile
     if the tool is buggy or exits unexpectedly. In other words they are
     not really suited for modern, robust tooling.

     So introduce a modern, file descriptor based ABI that does not have
     these limitations: introduce the 'perf_kprobe' and 'perf_uprobe'
     PMUs and extend the perf_event_open() syscall to create events with
     a kprobe/uprobe attached to them. These [k,u]probe are associated
     with this file descriptor, so they are not available in tracefs.

     (Song Liu)

   - Intel Cannon Lake CPU support (Harry Pan)

   - Intel PT cleanups (Alexander Shishkin)

   - Improve the performance of pinned/flexible event groups by using RB
     trees (Alexey Budankov)

   - Add PERF_EVENT_IOC_MODIFY_ATTRIBUTES which allows the modification
     of hardware breakpoints, which new ABI variant massively speeds up
     existing tooling that uses hardware breakpoints to instrument (and
     debug) memory usage.

     (Milind Chabbi, Jiri Olsa)

   - Various Intel PEBS handling fixes and improvements, and other Intel
     PMU improvements (Kan Liang)

   - Various perf core improvements and optimizations (Peter Zijlstra)

   - ... misc cleanups, fixes and updates.

  There's over 200 tooling commits, here's an (imperfect) list of
  highlights:

   - 'perf annotate' improvements:

      * Recognize and handle jumps to other functions as calls, which
        improves the navigation along jumps and back. (Arnaldo Carvalho
        de Melo)

      * Add the 'P' hotkey in TUI annotation to dump annotation output
        into a file, to ease e-mail reporting of annotation details.
        (Arnaldo Carvalho de Melo)

      * Add an IPC/cycles column to the TUI (Jin Yao)

      * Improve s390 assembly annotation (Thomas Richter)

      * Refactor the output formatting logic to better separate it into
        interactive and non-interactive features and add the --stdio2
        output variant to demonstrate this. (Arnaldo Carvalho de Melo)

   - 'perf script' improvements:

      * Add Python 3 support (Jaroslav Škarvada)

      * Add --show-round-event (Jiri Olsa)

   - 'perf c2c' improvements:

      * Add NUMA analysis support (Jiri Olsa)

   - 'perf trace' improvements:

      * Improve PowerPC support (Ravi Bangoria)

   - 'perf inject' improvements:

      * Integrate ARM CoreSight traces (Robert Walker)

   - 'perf stat' improvements:

      * Add the --interval-count option (yuzhoujian)

      * Add the --timeout option (yuzhoujian)

   - 'perf sched' improvements (Changbin Du)

   - Vendor events improvements :

      * Add IBM s390 vendor events (Thomas Richter)

      * Add and improve arm64 vendor events (John Garry, Ganapatrao
        Kulkarni)

      * Update POWER9 vendor events (Sukadev Bhattiprolu)

   - Intel PT tooling improvements (Adrian Hunter)

   - PMU handling improvements (Agustin Vega-Frias)

   - Record machine topology in perf.data (Jiri Olsa)

   - Various overwrite related cleanups (Kan Liang)

   - Add arm64 dwarf post unwind support (Kim Phillips, Jean Pihet)

   - ... and lots of other changes, cleanups and fixes, see the shortlog
     and Git history for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (262 commits)
  perf/x86/intel: Enable C-state residency events for Cannon Lake
  perf/x86/intel: Add Cannon Lake support for RAPL profiling
  perf/x86/pt, coresight: Clean up address filter structure
  perf vendor events s390: Add JSON files for IBM z14
  perf vendor events s390: Add JSON files for IBM z13
  perf vendor events s390: Add JSON files for IBM zEC12 zBC12
  perf vendor events s390: Add JSON files for IBM z196
  perf vendor events s390: Add JSON files for IBM z10EC z10BC
  perf mmap: Be consistent when checking for an unmaped ring buffer
  perf mmap: Fix accessing unmapped mmap in perf_mmap__read_done()
  perf build: Fix check-headers.sh opts assignment
  perf/x86: Update rdpmc_always_available static key to the modern API
  perf annotate: Use absolute addresses to calculate jump target offsets
  perf annotate: Defer searching for comma in raw line till it is needed
  perf annotate: Support jumping from one function to another
  perf annotate: Add "_local" to jump/offset validation routines
  perf python: Reference Py_None before returning it
  perf annotate: Mark jumps to outher functions with the call arrow
  perf annotate: Pass function descriptor to its instruction parsing routines
  perf annotate: No need to calculate notes->start twice
  ...
2018-04-02 11:06:34 -07:00
Takashi Iwai
903d271a3f Merge tag 'asoc-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Updates for v4.17

This is a *very* big release for ASoC.  Not much change in the core but
there s the transition of all the individual drivers over to components
which is intended to support further core work.  The goal is to make it
easier to do further core work by removing the need to special case all
the different driver classes in the core, many of the devices end up
being used in multiple roles in modern systems.

We also have quite a lot of new drivers added this month of all kinds,
quite a few for simple devices but also some more advanced ones with
more substantial code.

 - The biggest thing is the huge series from Morimoto-san which
   converted everything over to components.  This is a huge change by
   code volume but was fairly mechanical
 - Many fixes for some of the Realtek based Baytrail systems covering
   both the CODECs and the CPUs, contributed by Hans de Goode.
 - Lots of cleanups for Samsung based Odroid systems from Sylwester
   Nawrocki.
 - The Freescale SSI driver also got a lot of cleanups from Nicolin
   Chen.
 - The Blackfin drivers have been removed as part of the removal of the
   architecture.
 - New drivers for AKM AK4458 and AK5558, several AMD based machines,
   several Intel based machines, Maxim MAX9759, Motorola CPCAP,
   Socionext Uniphier SoCs, and TI PCM1789 and TDA7419
2018-04-02 19:51:39 +02:00
Linus Torvalds
701f3b3149 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in the locking subsystem in this cycle were:

   - Add the Linux Kernel Memory Consistency Model (LKMM) subsystem,
     which is an an array of tools in tools/memory-model/ that formally
     describe the Linux memory coherency model (a.k.a.
     Documentation/memory-barriers.txt), and also produce 'litmus tests'
     in form of kernel code which can be directly executed and tested.

     Here's a high level background article about an earlier version of
     this work on LWN.net:

        https://lwn.net/Articles/718628/

     The design principles:

      "There is reason to believe that Documentation/memory-barriers.txt
       could use some help, and a major purpose of this patch is to
       provide that help in the form of a design-time tool that can
       produce all valid executions of a small fragment of concurrent
       Linux-kernel code, which is called a "litmus test". This tool's
       functionality is roughly similar to a full state-space search.
       Please note that this is a design-time tool, not useful for
       regression testing. However, we hope that the underlying
       Linux-kernel memory model will be incorporated into other tools
       capable of analyzing large bodies of code for regression-testing
       purposes."

     [...]

      "A second tool is klitmus7, which converts litmus tests to
       loadable kernel modules for direct testing. As with herd7, the
       klitmus7 code is freely available from

         http://diy.inria.fr/sources/index.html

       (and via "git" at https://github.com/herd/herdtools7)"

     [...]

     Credits go to:

      "This patch was the result of a most excellent collaboration
       founded by Jade Alglave and also including Alan Stern, Andrea
       Parri, and Luc Maranget."

     ... and to the gents listed in the MAINTAINERS entry:

        LINUX KERNEL MEMORY CONSISTENCY MODEL (LKMM)
        M:      Alan Stern <stern@rowland.harvard.edu>
        M:      Andrea Parri <parri.andrea@gmail.com>
        M:      Will Deacon <will.deacon@arm.com>
        M:      Peter Zijlstra <peterz@infradead.org>
        M:      Boqun Feng <boqun.feng@gmail.com>
        M:      Nicholas Piggin <npiggin@gmail.com>
        M:      David Howells <dhowells@redhat.com>
        M:      Jade Alglave <j.alglave@ucl.ac.uk>
        M:      Luc Maranget <luc.maranget@inria.fr>
        M:      "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

     The LKMM project already found several bugs in Linux locking
     primitives and improved the understanding and the documentation of
     the Linux memory model all around.

   - Add KASAN instrumentation to atomic APIs (Dmitry Vyukov)

   - Add RWSEM API debugging and reorganize the lock debugging Kconfig
     (Waiman Long)

   - ... misc cleanups and other smaller changes"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  locking/Kconfig: Restructure the lock debugging menu
  locking/Kconfig: Add LOCK_DEBUGGING_SUPPORT to make it more readable
  locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches
  lockdep: Make the lock debug output more useful
  locking/rtmutex: Handle non enqueued waiters gracefully in remove_waiter()
  locking/atomic, asm-generic, x86: Add comments for atomic instrumentation
  locking/atomic, asm-generic: Add KASAN instrumentation to atomic operations
  locking/atomic/x86: Switch atomic.h to use atomic-instrumented.h
  locking/atomic, asm-generic: Add asm-generic/atomic-instrumented.h
  locking/xchg/alpha: Remove superfluous memory barriers from the _local() variants
  tools/memory-model: Finish the removal of rb-dep, smp_read_barrier_depends(), and lockless_dereference()
  tools/memory-model: Add documentation of new litmus test
  tools/memory-model: Remove mention of docker/gentoo image
  locking/memory-barriers: De-emphasize smp_read_barrier_depends() some more
  locking/lockdep: Show unadorned pointers
  mutex: Drop linkage.h from mutex.h
  tools/memory-model: Remove rb-dep, smp_read_barrier_depends, and lockless_dereference
  tools/memory-model: Convert underscores to hyphens
  tools/memory-model: Add a S lock-based external-view litmus test
  tools/memory-model: Add required herd7 version to README file
  ...
2018-04-02 10:27:16 -07:00
Linus Torvalds
8747a29173 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU subsystem changes in this cycle were:

  - Miscellaneous fixes, perhaps most notably removing obsolete code
    whose only purpose in life was to gather information for the
    now-removed RCU debugfs facility. Other notable changes include
    removing NO_HZ_FULL_ALL in favor of the nohz_full kernel boot
    parameter, minor optimizations for expedited grace periods, some
    added tracing, creating an RCU-specific workqueue using Tejun's new
    WQ_MEM_RECLAIM flag, and several cleanups to code and comments.

  - SRCU cleanups and optimizations.

  - Torture-test updates, perhaps most notably the adding of ARMv8
    support, but also including numerous cleanups and usability fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  rcu: Create RCU-specific workqueues with rescuers
  torture: Provide more sensible nreader/nwriter defaults for rcuperf
  torture: Grace periods do not piggyback off of themselves
  torture: Adjust rcuperf trace processing to allow for workqueues
  torture: Default jitter off when running rcuperf
  torture: Specify qemu memory size with --memory argument
  rcutorture: Add basic ARM64 support to run scripts
  rcutorture: Update kvm.sh header comment
  rcutorture: Record which grace-period primitives are tested
  rcutorture: Re-enable testing of dynamic expediting
  rcutorture: Avoid fake-writer use of undefined primitives
  rcutorture: Abstract function and module names
  rcutorture: Replace multi-instance kzalloc() with kcalloc()
  rcu: Remove SRCU throttling
  srcu: Remove dead code in srcu_gp_end()
  srcu: Reduce scans of srcu_data in counter wrap check
  srcu: Prevent sdp->srcu_gp_seq_needed_exp counter wrap
  srcu: Abstract function name
  rcu: Make expedited RCU CPU selection avoid unnecessary stores
  rcu: Trace expedited GP delays due to transitioning CPUs
  ...
2018-04-02 09:59:09 -07:00
Linus Torvalds
cc67ccecd3 Merge branch 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull header file cleanup from Ingo Molnar:
 "Reduce <linux/interrupt.h> dependencies: a single change that drops
  two #includes from this frequently used kernel header"

* 'core-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  headers: Drop two #included headers from <linux/interrupt.h>
2018-04-02 09:27:29 -07:00
Linus Torvalds
320b164abb Merge tag 'drm-for-v4.17' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "Cannonlake and Vega12 support are probably the two major things. This
  pull lacks nouveau, Ben had some unforseen leave and a few other
  blockers so we'll see how things look or maybe leave it for this merge
  window.

  core:
   - Device links to handle sound/gpu pm dependency
   - Color encoding/range properties
   - Plane clipping into plane check helper
   - Backlight helpers
   - DP TP4 + HBR3 helper support

  amdgpu:
   - Vega12 support
   - Enable DC by default on all supported GPUs
   - Powerplay restructuring and cleanup
   - DC bandwidth calc updates
   - DC backlight on pre-DCE11
   - TTM backing store dropping support
   - SR-IOV fixes
   - Adding "wattman" like functionality
   - DC crc support
   - Improved DC dual-link handling

  amdkfd:
   - GPUVM support for dGPU
   - KFD events for dGPU
   - Enable PCIe atomics for dGPUs
   - HSA process eviction support
   - Live-lock fixes for process eviction
   - VM page table allocation fix for large-bar systems

  panel:
   - Raydium RM68200
   - AUO G104SN02 V2
   - KEO TX31D200VM0BAA
   - ARM Versatile panels

  i915:
   - Cannonlake support enabled
   - AUX-F port support added
   - Icelake base enabling until internal milestone of forcewake support
   - Query uAPI interface (used for GPU topology information currently)
   - Compressed framebuffer support for sprites
   - kmem cache shrinking when GPU is idle
   - Avoid boosting GPU when waited item is being processed already
   - Avoid retraining LSPCON link unnecessarily
   - Decrease request signaling latency
   - Deprecation of I915_SET_COLORKEY_NONE
   - Kerneldoc and compiler warning cleanup for upcoming CI enforcements
   - Full range ycbcr toggling
   - HDCP support

  i915/gvt:
   - Big refactor for shadow ppgtt
   - KBL context save/restore via LRI cmd (Weinan)
   - Properly unmap dma for guest page (Changbin)

  vmwgfx:
   - Lots of various improvements

  etnaviv:
   - Use the drm gpu scheduler
   - prep work for GC7000L support

  vc4:
   - fix alpha blending
   - Expose perf counters to userspace

  pl111:
   - Bandwidth checking/limiting
   - Versatile panel support

  sun4i:
   - A83T HDMI support
   - A80 support
   - YUV plane support
   - H3/H5 HDMI support

  omapdrm:
   - HPD support for DVI connector
   - remove lots of static variables

  msm:
   - DSI updates from 10nm / SDM845
   - fix for race condition with a3xx/a4xx fence completion irq
   - some refactoring/prep work for eventual a6xx support (ie. when we
     have a userspace)
   - a5xx debugfs enhancements
   - some mdp5 fixes/cleanups to prepare for eventually merging
     writeback
   - support (ie. when we have a userspace)

  tegra:
   - mmap() fixes for fbdev devices
   - Overlay plane for hw cursor fix
   - dma-buf cache maintenance support

  mali-dp:
   - YUV->RGB conversion support

  rockchip:
   - rk3399/chromebook fixes and improvements

  rcar-du:
   - LVDS support move to drm bridge
   - DT bindings for R8A77995
   - Driver/DT support for R8A77970

  tilcdc:
   - DRM panel support"

* tag 'drm-for-v4.17' of git://people.freedesktop.org/~airlied/linux: (1646 commits)
  drm/i915: Fix hibernation with ACPI S0 target state
  drm/i915/execlists: Use a locked clear_bit() for synchronisation with interrupt
  drm/i915: Specify which engines to reset following semaphore/event lockups
  drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
  drm/amdkfd: Use ordered workqueue to restore processes
  drm/amdgpu: Fix acquiring VM on large-BAR systems
  drm/amd/pp: clean header file hwmgr.h
  drm/amd/pp: use mlck_table.count for array loop index limit
  drm: Fix uabi regression by allowing garbage mode->type from userspace
  drm/amdgpu: Add an ATPX quirk for hybrid laptop
  drm/amdgpu: fix spelling mistake: "asssert" -> "assert"
  drm/amd/pp: Add new asic support in pp_psm.c
  drm/amd/pp: Clean up powerplay code on Vega12
  drm/amd/pp: Add smu irq handlers for legacy asics
  drm/amd/pp: Fix set wrong temperature range on smu7
  drm/amdgpu: Don't change preferred domian when fallback GTT v5
  drm/vmwgfx: Bump version patchlevel and date
  drm/vmwgfx: use monotonic event timestamps
  drm/vmwgfx: Unpin the screen object backup buffer when not used
  drm/vmwgfx: Stricter count of legacy surface device resources
  ...
2018-04-02 07:59:23 -07:00
Mark Brown
70a3550b8c Merge remote-tracking branches 'spi/topic/atmel', 'spi/topic/bcm-qspi', 'spi/topic/bcm2835aux', 'spi/topic/dw' and 'spi/topic/gpio' into spi-next 2018-04-02 15:56:34 +01:00
Viresh Kumar
8ea229511e thermal: Add cooling device's statistics in sysfs
This extends the sysfs interface for thermal cooling devices and exposes
some pretty useful statistics. These statistics have proven to be quite
useful specially while doing benchmarks related to the task scheduler,
where we want to make sure that nothing has disrupted the test,
specially the cooling device which may have put constraints on the CPUs.
The information exposed here tells us to what extent the CPUs were
constrained by the thermal framework.

The write-only "reset" file is used to reset the statistics.

The read-only "time_in_state_ms" file shows the time (in msec) spent by the
device in the respective cooling states, and it prints one line per
cooling state.

The read-only "total_trans" file shows single positive integer value
showing the total number of cooling state transitions the device has
gone through since the time the cooling device is registered or the time
when statistics were reset last.

The read-only "trans_table" file shows a two dimensional matrix, where
an entry <i,j> (row i, column j) represents the number of transitions
from State_i to State_j.

This is how the directory structure looks like for a single cooling
device:

$ ls -R /sys/class/thermal/cooling_device0/
/sys/class/thermal/cooling_device0/:
cur_state  max_state  power  stats  subsystem  type  uevent

/sys/class/thermal/cooling_device0/power:
autosuspend_delay_ms  runtime_active_time  runtime_suspended_time
control               runtime_status

/sys/class/thermal/cooling_device0/stats:
reset  time_in_state_ms  total_trans  trans_table

This is tested on ARM 64-bit Hisilicon hikey620 board running Ubuntu and
ARM 64-bit Hisilicon hikey960 board running Android.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
2018-04-02 21:49:01 +08:00
Luis Henriques
fb18a57568 ceph: quota: add initial infrastructure to support cephfs quotas
This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client.  Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.

Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.

Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 11:17:51 +02:00
Rafael J. Wysocki
103cf0e579 Merge branches 'pm-cpuidle' and 'pm-tools'
* pm-cpuidle:
  cpuidle: poll_state: Avoid invoking local_clock() too often
  PM: cpuidle/suspend: Add s2idle usage and time state attributes
  cpuidle: Enable coupled cpuidle support on Exynos3250 platform
  cpuidle: poll_state: Add time limit to poll_idle()
  ARM: cpuidle: Drop memory allocation error message from arm_idle_init_cpu()

* pm-tools:
  pm-graph: AnalyzeSuspend v5.0
  pm-graph: AnalyzeBoot v2.2
  pm-graph: config files and installer
2018-04-02 11:01:02 +02:00
Rafael J. Wysocki
a9f36db7ec Merge branch 'pm-cpufreq'
* pm-cpufreq: (38 commits)
  cpufreq: CPPC: Use transition_delay_us depending transition_latency
  cpufreq: tegra186: Don't validate the frequency table twice
  cpufreq: speedstep: Don't validate the frequency table twice
  cpufreq: sparc: Don't validate the frequency table twice
  cpufreq: sh: Don't validate the frequency table twice
  cpufreq: sfi: Don't validate the frequency table twice
  cpufreq: scpi: Don't validate the frequency table twice
  cpufreq: sc520: Don't validate the frequency table twice
  cpufreq: s3c24xx: Don't validate the frequency table twice
  cpufreq: qoirq: Don't validate the frequency table twice
  cpufreq: pxa: Don't validate the frequency table twice
  cpufreq: ppc_cbe: Don't validate the frequency table twice
  cpufreq: powernow: Don't validate the frequency table twice
  cpufreq: p4-clockmod: Don't validate the frequency table twice
  cpufreq: mediatek: Don't validate the frequency table twice
  cpufreq: longhaul: Don't validate the frequency table twice
  cpufreq: ia64-acpi: Don't validate the frequency table twice
  cpufreq: elanfreq: Don't validate the frequency table twice
  cpufreq: e_powersaver: Don't validate the frequency table twice
  cpufreq: cpufreq-dt: Don't validate the frequency table twice
  ...
2018-04-02 11:00:46 +02:00
Rafael J. Wysocki
e3a495c4ee Merge branches 'pm-core', 'pm-sleep' and 'acpi-pm'
* pm-core:
  driver core: Introduce device links reference counting
  PM / wakeirq: Add wakeup name to dedicated wake irqs

* pm-sleep:
  PM / hibernate: Change message when writing to /sys/power/resume
  PM / hibernate: Make passing hibernate offsets more friendly
  PCMCIA / PM: Avoid noirq suspend aborts during suspend-to-idle

* acpi-pm:
  ACPI / PM: Fix keyboard wakeup from suspend-to-idle on ASUS UX331UA
  ACPI / PM: Allow deeper wakeup power states with no _SxD nor _SxW
  ACPI / PM: Reduce LPI constraints logging noise
  ACPI / PM: Do not reconfigure GPEs for suspend-to-idle
2018-04-02 11:00:28 +02:00
Rafael J. Wysocki
0c9ed61bdd Merge branches 'acpi-battery', 'acpi-doc' and 'acpi-pmic'
* acpi-battery:
  Revert "ACPI: battery: Add the ThinkPad "Not Charging" quirk"
  ACPI: battery: do not export degraded capacity values over 100
  ACPI: battery: make function __battery_hook_unregister() static
  ACPI: battery: Add the ThinkPad "Not Charging" quirk
  thinkpad_acpi: Add support for battery thresholds
  power: add to_power_supply macro to the API
  battery: Add the battery hooking API

* acpi-doc:
  ACPI: sysfs: Update device object sysfs documentation

* acpi-pmic:
  ACPI / PMIC: Replace license boilerplate with SPDX license identifier
2018-04-02 10:58:13 +02:00
Chengguang Xu
bb48bd4dc4 ceph: optimize memory usage
In current code, regular file and directory use same struct
ceph_file_info to store fs specific data so the struct has to
include some fields which are only used for directory
(e.g., readdir related info), when having plenty of regular files,
it will lead to memory waste.

This patch introduces dedicated ceph_dir_file_info cache for
readdir related thins. So that regular file does not include those
unused fields anymore.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:49 +02:00
Ilya Dryomov
08c1ac508b libceph, ceph: move ceph_calc_file_object_mapping() to striper.c
ceph_calc_file_object_mapping() has nothing to do with osdmaps.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:43 +02:00
Ilya Dryomov
ed0811d2d2 libceph: striping framework implementation
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:42 +02:00
Ilya Dryomov
b9e281c2b3 libceph: introduce BVECS data type
In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for
working with bio_vec array data buffers.  The wrappers are trivial, but
make it look similar to ceph_bio_iter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:39 +02:00
Ilya Dryomov
5359a17d27 libceph, rbd: new bio handling code (aka don't clone bios)
The reason we clone bios is to be able to give each object request
(and consequently each ceph_osd_data/ceph_msg_data item) its own
pointer to a (list of) bio(s).  The messenger then initializes its
cursor with cloned bio's ->bi_iter, so it knows where to start reading
from/writing to.  That's all the cloned bios are used for: to determine
each object request's starting position in the provided data buffer.

Introduce ceph_bio_iter to do exactly that -- store position within bio
list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-02 10:12:38 +02:00
Ilya Dryomov
dccbf08005 libceph, ceph: change ceph_calc_file_object_mapping() signature
- make it void
- xlen (object extent length) out parameter should be u32 because only
  a single stripe unit is mapped at a time

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2018-04-02 10:12:38 +02:00
David S. Miller
c0b458a946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:

1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
   MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE

2) In 'net-next' params->log_rq_size is renamed to be
   params->log_rq_mtu_frames.

3) In 'net-next' params->hard_mtu is added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 19:49:34 -04:00
Atul Gupta
e0be6bea25 ethtool: enable Inline TLS in HW
Ethtool option enables TLS record offload on HW, user
configures the feature for netdev capable of Inline TLS.
This allows user to define custom sk_prot for Inline TLS sock

Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:37:32 -04:00
David S. Miller
d4069fe6fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-03-31

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add raw BPF tracepoint API in order to have a BPF program type that
   can access kernel internal arguments of the tracepoints in their
   raw form similar to kprobes based BPF programs. This infrastructure
   also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
   returns an anon-inode backed fd for the tracepoint object that allows
   for automatic detach of the BPF program resp. unregistering of the
   tracepoint probe on fd release, from Alexei.

2) Add new BPF cgroup hooks at bind() and connect() entry in order to
   allow BPF programs to reject, inspect or modify user space passed
   struct sockaddr, and as well a hook at post bind time once the port
   has been allocated. They are used in FB's container management engine
   for implementing policy, replacing fragile LD_PRELOAD wrapper
   intercepting bind() and connect() calls that only works in limited
   scenarios like glibc based apps but not for other runtimes in
   containerized applications, from Andrey.

3) BPF_F_INGRESS flag support has been added to sockmap programs for
   their redirect helper call bringing it in line with cls_bpf based
   programs. Support is added for both variants of sockmap programs,
   meaning for tx ULP hooks as well as recv skb hooks, from John.

4) Various improvements on BPF side for the nfp driver, besides others
   this work adds BPF map update and delete helper call support from
   the datapath, JITing of 32 and 64 bit XADD instructions as well as
   offload support of bpf_get_prandom_u32() call. Initial implementation
   of nfp packet cache has been tackled that optimizes memory access
   (see merge commit for further details), from Jakub and Jiong.

5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
   API has been done in order to prepare to use print_bpf_insn() soon
   out of perf tool directly. This makes the print_bpf_insn() API more
   generic and pushes the env into private data. bpftool is adjusted
   as well with the print_bpf_insn() argument removal, from Jiri.

6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
   Format). The latter will reuse the current BPF verifier log as
   well, thus bpf_verifier_log() is further generalized, from Martin.

7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
   and write support has been added in similar fashion to existing
   IPv6 IPV6_TCLASS socket option we already have, from Nikita.

8) Fixes in recent sockmap scatterlist API usage, which did not use
   sg_init_table() for initialization thus triggering a BUG_ON() in
   scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
   uses a small helper sg_init_marker() to properly handle the affected
   cases, from Prashant.

9) Let the BPF core follow IDR code convention and therefore use the
   idr_preload() and idr_preload_end() helpers, which would also help
   idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
   pressure, from Shaohua.

10) Last but not least, a spelling fix in an error message for the
    BPF cookie UID helper under BPF sample code, from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:33:04 -04:00
Eric Dumazet
bf66337140 inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c ("fragment: add fast path for in-order fragments")
this change wont help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:40 -04:00
Eric Dumazet
e5d672a078 rhashtable: reorganize struct rhashtable layout
While under frags DDOS I noticed unfortunate false sharing between
@nelems and @params.automatic_shrinking

Move @nelems at the end of struct rhashtable so that first cache line
is shared between all cpus, because almost never dirtied.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:25:39 -04:00
Mathieu Malaterre
619e6f340c PCI/IOV: Add missing prototypes for powerpc pcibios interfaces
Add missing prototypes for:

  resource_size_t pcibios_default_alignment(void);
  int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
  int pcibios_sriov_disable(struct pci_dev *pdev);

This fixes the following warnings treated as errors when using W=1:

  arch/powerpc/kernel/pci-common.c:236:17: error: no previous prototype for ‘pcibios_default_alignment’ [-Werror=missing-prototypes]
  arch/powerpc/kernel/pci-common.c:253:5: error: no previous prototype for ‘pcibios_sriov_enable’ [-Werror=missing-prototypes]
  arch/powerpc/kernel/pci-common.c:261:5: error: no previous prototype for ‘pcibios_sriov_disable’ [-Werror=missing-prototypes]

Also, commit 978d2d6831 ("PCI: Add pcibios_iov_resource_alignment()
interface") added a new function but the prototype was located in the main
header instead of the CONFIG_PCI_IOV specific section.  Move this function
next to the newly added ones.

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
2018-03-31 15:32:46 -05:00
Ingo Molnar
169310f71f Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31 07:30:17 +02:00
Linus Torvalds
a44406ec3d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix RCU locking in xfrm_local_error(), from Taehee Yoo.

 2) Fix return value assignments and thus error checking in
    iwl_mvm_start_ap_ibss(), from Johannes Berg.

 3) Don't count header length twice in vti4, from Stefano Brivio.

 4) Fix deadlock in rt6_age_examine_exception, from Eric Dumazet.

 5) Fix out-of-bounds access in nf_sk_lookup_slow{v4,v6}() from Subash
    Abhinov.

 6) Check nladdr size in netlink_connect(), from Alexander Potapenko.

 7) VF representor SQ numbers are 32 not 16 bits, in mlx5 driver, from
    Or Gerlitz.

 8) Out of bounds read in skb_network_protocol(), from Eric Dumazet.

 9) r8169 driver sets driver data pointer after register_netdev() which
    is too late. Fix from Heiner Kallweit.

10) Fix memory leak in mlx4 driver, from Moshe Shemesh.

11) The multi-VLAN decap fix added a regression when dealing with device
    that lack a MAC header, such as tun. Fix from Toshiaki Makita.

12) Fix integer overflow in dynamic interrupt coalescing code. From Tal
    Gilboa.

13) Use after free in vrf code, from David Ahern.

14) IPV6 route leak between VRFs fix, also from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
  net: mvneta: fix enable of all initialized RXQs
  net/ipv6: Fix route leaking between VRFs
  vrf: Fix use after free and double free in vrf_finish_output
  ipv6: sr: fix seg6 encap performances with TSO enabled
  net/dim: Fix int overflow
  vlan: Fix vlan insertion for packets without ethernet header
  net: Fix untag for vlan packets without ethernet header
  atm: iphase: fix spelling mistake: "Receiverd" -> "Received"
  vhost: validate log when IOTLB is enabled
  qede: Do not drop rx-checksum invalidated packets.
  hv_netvsc: enable multicast if necessary
  ip_tunnel: Resolve ipsec merge conflict properly.
  lan78xx: Crash in lan78xx_writ_reg (Workqueue: events lan78xx_deferred_multicast_write)
  qede: Fix barrier usage after tx doorbell write.
  vhost: correctly remove wait queue during poll failure
  net/mlx4_core: Fix memory leak while delete slave's resources
  net/mlx4_en: Fix mixed PFC and Global pause user control requests
  net/smc: use announced length in sock_recvmsg()
  llc: properly handle dev_queue_xmit() return value
  strparser: Fix sign of err codes
  ...
2018-03-30 18:47:28 -10:00
Masahiro Yamada
a95b37e20d kbuild: get <linux/compiler_types.h> out of <linux/kconfig.h>
Since commit 28128c61e0 ("kconfig.h: Include compiler types to avoid
missed struct attributes"), <linux/kconfig.h> pulls in kernel-space
headers to unrelated places.

Commit 0f9da844d8 ("MIPS: boot: Define __ASSEMBLY__ for its.S build")
suppress the build error by defining __ASSEMBLY__, but ITS (i.e. DTS)
is not assembly, and should not include <linux/compiler_types.h> in the
first place.

Looking at arch/s390/tools/Makefile, host programs gen_facilities and
gen_opcode_table now pull in <linux/compiler_types.h> as well.

The motivation for that commit was to define necessary attributes
before any struct is defined.  Obviously, this happens only in C.

It is enough to include <linux/compiler_types.h> only when compiling
C files, and only when compiling kernel space.  Move the include to
c_flags.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-03-31 12:22:38 +09:00
Sargun Dhillon
df0ce17331 security: convert security hooks to use hlist
This changes security_hook_heads to use hlist_heads instead of
the circular doubly-linked list heads. This should cut down
the size of the struct by about half.

In addition, it allows mutation of the hooks at the tail of the
callback list without having to modify the head. The longer-term
purpose of this is to enable making the heads read only.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
2018-03-31 13:18:27 +11:00
Andrey Ignatov
aac3fc320d bpf: Post-hooks for sys_bind
"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.

Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to fail and
return EPERM to user.

Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port_n, it can be recorded to a map. And
later when some TCP client inside same cgroup is trying to connect to
127.0.0.1:port_n, BPF hook for sys_connect can override the destination
and connect application to IP:port_n instead of 127.0.0.1:port_n. That
helps forcing all applications inside a cgroup to use desired IP and not
break those applications if they e.g. use localhost to communicate
between each other.

== Implementation details ==

Post-hooks are implemented as two new attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.

Separate attach types for IPv4 and IPv6 are introduced to avoid access
to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:16:26 +02:00
Andrey Ignatov
d74bad4e74 bpf: Hooks for sys_connect
== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.

It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.

Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.

As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.

== Implementation notes ==

The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.

bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.

bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:54 +02:00
Andrey Ignatov
4fbac77d2d bpf: Hooks for sys_bind
== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured.  Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).

Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).

It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.

The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.

== Implementation notes ==

[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:18 +02:00
Andrey Ignatov
5e43f899b0 bpf: Check attach type at prog load time
== The problem ==

There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access context or to call helpers.

E.g. context structure may have fields for both IPv4 and IPv6 but it
doesn't make sense to read from / write to IPv6 field when attach point
is somewhere in IPv4 stack.

Same applies to BPF-helpers: it may make sense to call some helper from
some attach point, but not from other for same prog type.

== The solution ==

Introduce `expected_attach_type` field in in `struct bpf_attr` for
`BPF_PROG_LOAD` command. If scenario described in "The problem" section
is the case for some prog type, the field will be checked twice:

1) At load time prog type is checked to see if attach type for it must
   be known to validate program permissions correctly. Prog will be
   rejected with EINVAL if it's the case and `expected_attach_type` is
   not specified or has invalid value.

2) At attach time `attach_type` is compared with `expected_attach_type`,
   if prog type requires to have one, and, if they differ, attach will
   be rejected with EINVAL.

The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops->is_valid_access()` and
`bpf_verifier_ops->get_func_proto()` () and can be used to check context
accesses and calls to helpers correspondingly.

Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
Daniel Borkmann <daniel@iogearbox.net> here:
https://marc.info/?l=linux-netdev&m=152107378717201&w=2

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:14:44 +02:00
Tariq Toukan
619a8f2a42 net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.

In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.

To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.

We can satisfy 1 and 2 by configuring:
	stride size = MTU + headroom + tailoom.

This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.

Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
  small amount (linear part) and large amount (fragment).
  This saves datapath cycles in driver and improves utilization
  of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
  assumption of equal-size fragments.

Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.

This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.

Performance testing:
ConnectX-5, single core, single RX ring, default MTU.

UDP packet rate, early drop in TC layer:

--------------------------------------------
| pkt size | before    | after     | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
|  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
|   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------

TCP streams: ~20% gain

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:54:49 -07:00
Saeed Mahameed
b2d3907c23 net/mlx5: Eliminate query xsrq dead code
1. This function is not used anywhere in mlx5 driver
2. It has a memcpy statement that makes no sense and produces build
warning with gcc8

drivers/net/ethernet/mellanox/mlx5/core/transobj.c: In function 'mlx5_core_query_xsrq':
drivers/net/ethernet/mellanox/mlx5/core/transobj.c:347:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]

Fixes: 01949d0109 ("net/mlx5_core: Enable XRCs and SRQs when using ISSI > 0")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:16:17 -07:00
Bjørn Mork
ad32eb2df8 PCI: Always define the of_node helpers
Simply move these inline functions outside the ifdef instead of duplicating
them as stubs in the !OF case.  The struct device of_node field does not
depend on OF.

This also fixes the missing stubbed pci_bus_to_OF_node().

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
2018-03-30 17:45:57 -05:00
Bjorn Helgaas
842b447f00 PCI/portdrv: Encapsulate pcie_ports_auto inside the port driver
"pcie_ports_auto" is only used inside the PCIe port driver itself, so
move it from include/linux/pci.h to portdrv.h so it's not visible to the
whole kernel.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-03-30 17:26:58 -05:00
Bjorn Helgaas
02bfeb4842 PCI/portdrv: Simplify PCIe feature permission checking
Some PCIe features (AER, DPC, hotplug, PME) can be managed by either the
platform firmware or the OS, so the host bridge driver may have to request
permission from the platform before using them.  On ACPI systems, this is
done by negotiate_os_control() in acpi_pci_root_add().

The PCIe port driver later uses pcie_port_platform_notify() and
pcie_port_acpi_setup() to figure out whether it can use these features.
But all we need is a single bit for each service, so these interfaces are
needlessly complicated.

Simplify this by adding bits in the struct pci_host_bridge to show when the
OS has permission to use each feature:

  + unsigned int native_aer:1;       /* OS may use PCIe AER */
  + unsigned int native_hotplug:1;   /* OS may use PCIe hotplug */
  + unsigned int native_pme:1;       /* OS may use PCIe PME */

These are set when we create a host bridge, and the host bridge driver can
clear the bits corresponding to any feature the platform doesn't want us to
use.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-03-30 17:26:54 -05:00
Michael Ellerman
f437c51748 Merge branch 'topic/paca' into next
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.

This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becomming an array of pointers.
2018-03-31 09:09:36 +11:00
Prashant Bhole
f385178679 lib/scatterlist: add sg_init_marker() helper
sg_init_marker initializes sg_magic in the sg table and calls
sg_mark_end() on the last entry of the table. This can be useful to
avoid memset in sg_init_table() when scatterlist is already zeroed out

For example: when scatterlist is embedded inside other struct and that
container struct is zeroed out

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-30 22:50:15 +02:00
Dan Williams
f44c77630d fs, dax: prepare for dax-specific address_space_operations
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions, dax_writeback_mapping_range() may not be referenced in the
FS_DAX=n case.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-30 11:34:55 -07:00
Andy Shevchenko
9def051018 crypto: Deduplicate le32_to_cpu_array() and cpu_to_le32_array()
Deduplicate le32_to_cpu_array() and cpu_to_le32_array() by moving them
to the generic header.

No functional change implied.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-31 01:33:09 +08:00
Tal Gilboa
f97c3dc3c0 net/dim: Fix int overflow
When calculating difference between samples, the values
are multiplied by 100. Large values may cause int overflow
when multiplied (usually on first iteration).
Fixed by forcing 100 to be of type unsigned long.

Fixes: 4c4dbb4a73 ("net/mlx5e: Move dynamic interrupt coalescing code to include/linux")
Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 12:56:22 -04:00